
Chapter 11: Driver Architecture and Isolation

Three-tier protection model, isolation mechanisms, KABI, driver model, device registry, zero-copy I/O, IPC, crash recovery, driver subsystems, D-Bus bridge


Drivers run in one of three isolation tiers: Tier 0 (in-kernel, fully trusted), Tier 1 (Ring 0 with hardware memory domain isolation — MPK/POE/DACR), or Tier 2 (Ring 3 with full process + IOMMU isolation). A fourth class, Tier M, covers intelligent devices (DPUs, SmartNICs) that run their own kernel and communicate via the peer protocol.

Tier assignment is an operational decision: every KABI driver ships with all three transports in the same binary. Administrators set the tier at runtime via echo 0|1|2 > /ukfs/kernel/drivers/<name>/tier — no recompilation needed. Automatic demotion moves crash-prone Tier 1 drivers to Tier 2 after 3 failures in 60 seconds.
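
The automatic demotion policy can be modeled as a sliding-window crash counter. The sketch below is illustrative only — the type and method names are hypothetical, not the kernel's API — but it captures the "3 failures in 60 seconds" rule:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sliding-window crash counter behind automatic Tier 1 → Tier 2 demotion.
/// Names are illustrative; the in-kernel policy object may differ.
struct DemotionPolicy {
    window: Duration,
    max_crashes: usize,
    crashes: VecDeque<Instant>,
}

impl DemotionPolicy {
    /// Defaults from the text: 3 failures within 60 seconds.
    fn new() -> Self {
        Self {
            window: Duration::from_secs(60),
            max_crashes: 3,
            crashes: VecDeque::new(),
        }
    }

    /// Record a crash at `now`; returns true when the driver should be
    /// demoted from Tier 1 to Tier 2.
    fn record_crash(&mut self, now: Instant) -> bool {
        // Expire crashes that fell outside the sliding window.
        while let Some(&oldest) = self.crashes.front() {
            if now.duration_since(oldest) > self.window {
                self.crashes.pop_front();
            } else {
                break;
            }
        }
        self.crashes.push_back(now);
        self.crashes.len() >= self.max_crashes
    }
}
```

Crashes spaced more than a minute apart never trigger demotion; three within the window do.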

Crash recovery is automatic for Tier 1 and Tier 2: driver state is checkpointed and restored on reload without disrupting userspace.

11.1 Three-Tier Protection Model

UmkaOS organizes code into three driver tiers — the UmkaOS Core microkernel (Tier 0), Tier 1 kernel-adjacent drivers, and Tier 2 user-space drivers — plus standard user space. The "three tiers" refer to the three levels at which kernel/driver code executes (Core, Tier 1, Tier 2); user space is not counted as a tier because it uses the standard Linux process model unchanged.

A fourth class, Tier M (Multikernel Peer), emerges when attached hardware runs its own UmkaOS kernel instance or an UmkaOS-compatible shim. Tier M is not a tier within a single UmkaOS instance — it is a physically separate execution environment with isolation stronger than Tier 2 and near-zero host-side driver complexity. See Section 11.1.3.

+======================================================================+
|                         UmkaOS CORE  (Ring 0)                          |
|  Microkernel: Rust + C/asm for arch boot                            |
|                                                                      |
|  - Capability manager         - Physical memory allocator            |
|  - Thread/process management  - Scheduler (CFS/EEVDF + RT + DL)     |
|  - IPC primitives             - MMU / IOMMU programming              |
|  - Interrupt routing          - vDSO maintenance                     |
|  - Virtual memory manager     - Page cache                           |
|  - Timer management           - Linux syscall interface              |
+======================================================================+
         |  MPK switch (~23 cycles)     |  Shared memory (0 copies)
         v                              v
+======================================================================+
|                    TIER 1: Kernel-Adjacent Drivers                    |
|  Ring 0, hardware memory-domain isolated (MPK/POE/DACR per arch)     |
|                                                                      |
|  - NVMe, AHCI/SATA            - High-perf NICs (Intel, Mellanox)    |
|  - TCP/IP + UDP stack          - GPU compute drivers                 |
|  - Block I/O layer             - Filesystem impls (ext4, XFS, btrfs) |
|  - VirtIO drivers              - Crypto subsystem                    |
|  - KVM hypervisor (*)          - Netfilter/nftables engine           |
+======================================================================+
(*) KVM runs as a Tier 1 driver with extended hardware privileges (KvmHardwareCapability),
    which authorizes umka-core to execute VMX/VHE/H-extension operations on KVM's behalf
    via a validated VMX/VHE trampoline. KVM retains full Tier 1 crash-recovery semantics.
    See Section 19.1.4.6 for the full classification rationale.
         |  Address-space switch           |  IOMMU-isolated
         |  (~200-500 cycles, PCID/ASID)  |  DMA fencing
         v                                v
+======================================================================+
|                    TIER 2: User-Space Drivers                         |
|  Ring 3, separate address space, IOMMU-protected DMA                 |
|                                                                      |
|  - USB drivers                 - Audio (HDA, USB Audio)              |
|  - Input devices               - Bluetooth, WiFi control plane       |
|  - Printers, scanners          - Third-party / vendor drivers        |
|  - Display server drivers      - Non-performance-critical devices    |
+======================================================================+
         |  Standard Linux syscall interface (100% compatible)
         v
+======================================================================+
|                    USER SPACE  (Ring 3)                               |
|  Unmodified Linux binaries: glibc, musl, systemd, Docker, K8s, etc. |
+======================================================================+

════════════ Hardware Fabric Boundary (PCIe / CXL / coherent on-chip) ════════════

+======================================================================+
|          TIER M: Multikernel Peer  (separate kernel instance)        |
|  Own CPU complex · own memory · UmkaOS kernel or UmkaOS-compatible shim  |
|                                                                      |
|  - SmartNIC / DPU (BlueField, Pensando, Marvell OCTEON)             |
|  - Computational storage SoC (Arm Cortex-R, Zynq UltraScale+)      |
|  - On-chip hardware partition (ARM CCA Realm, RISC-V WorldGuard)    |
|  - GPU / NIC with UmkaOS-compatible firmware shim                      |
+======================================================================+
(*) Tier M: deployment-time property — 0 to N peers per host.
    Host representation: umka-peer-transport (~2,000 lines, device-agnostic).

Complexity management — The core-of-core (scheduler + memory + caps + IPC) should be as small as feasible. For reference: seL4's verified microkernel is ~10K SLOC (but provides far fewer services), QNX's microkernel is ~100K, and the Zircon kernel (Fuchsia) is ~200K. Any subsystem that grows beyond the minimum necessary for its function should be re-evaluated for extraction to Tier 1.

Kernel image structure: The three-tier model maps onto a four-level loading architecture: Nucleus (verified, ~25-35 KB), Evolvable (boot monolith, swappable), on-demand KABI services, and device drivers. See Section 2.21 for the complete structural layout, including CPU-feature-variant modules.

11.1.1 Tier Selection Guide

Use this table to determine the correct tier for a new driver or subsystem:

| Driver Origin | Recommended Tier | Rationale |
|---|---|---|
| In-tree, audited, performance-critical (NVMe, NIC, filesystem) | Tier 1 | Crash containment sufficient; shared-memory zero-copy for throughput |
| In-tree, non-performance-critical (USB, input, Bluetooth) | Tier 2 | Full process isolation; microsecond overhead irrelevant for HID |
| In-tree audio (HDA, USB-Audio) | Tier 1 (default); optional Tier 2 demotion at ≥10ms buffer periods | Tier 1 required for professional audio (<5ms latency budgets). Consumer configs may demote to Tier 2 for crash resilience (Section 13.4) |
| Out-of-tree vendor (GPU, proprietary NIC firmware shim) | Tier 2 | Full security boundary; unaudited code must not run in Ring 0 |
| Firmware shim on DPU/SmartNIC | Tier M | Physical isolation; runs on separate CPU complex behind IOMMU |
| Untrusted third-party / community | Tier 2 | Strongest software isolation; driver cannot access kernel memory |
| Performance-critical vendor (e.g., ML accelerator with audited KABI shim) | Tier 1 | Only after code audit + ML-DSA-65 signing; see Section 12.7 |

Key distinction: Tier 1 protects against driver bugs (use-after-free, buffer overflow), not driver exploitation (malicious code). A compromised Tier 1 driver has Ring 0 access minus MPK-protected regions — sufficient for containment of accidental faults, not for defense against a determined attacker. Use Tier 2 for any driver where malicious intent is part of the threat model.

11.1.2 How the Tiers Interact

UmkaOS Core to Tier 1: Both run in Ring 0, sharing the same address space, but MPK keys (set via WRPKRU, ~23 cycles) prevent a Tier 1 driver from reading or writing memory belonging to the core or to other Tier 1 domains. Communication uses shared-memory ring buffers (kabi_call! abstraction resolves to Transport T1). The driver runs its own IDL-generated consumer loop within the isolated domain; the core submits requests to the ring without entering the driver domain. Zero copies, zero privilege transitions.

UmkaOS Core to Tier 2: Standard process-based isolation. Tier 2 drivers run in Ring 3 with their own address space. Communication uses mapped shared-memory rings for data (zero copy) and lightweight syscall-based notifications. IOMMU restricts DMA to driver-allocated regions.

How Tier 2 zero-copy works — dual physical page mapping: The shared ring buffers between UmkaOS Core and a Tier 2 driver are backed by a single set of physical pages mapped into two virtual address spaces simultaneously. The kernel side holds a VmArea covering these pages (with VM_SHARED | VM_IO flags) and accesses them through its own virtual address. The Tier 2 driver side calls mmap(UMKA_RING_FD, ...) on a special file descriptor issued at driver registration; the kernel maps the same physical frames into the Tier 2 process address space as a read-write shared mapping. No copy occurs on either side: the kernel writes to its virtual address and the Tier 2 driver reads from its virtual address, both resolving to the same physical frames. The shared region is bounded to the ring buffer size; the Tier 2 driver cannot access kernel memory outside the mapped ring (enforced by VMA boundaries and IOMMU). Cache coherency is guaranteed by the CPU coherency protocol on x86 and ARM; non-coherent platforms add a memory fence before and after ring accesses.

Tier 1/2 to User Space: No direct interaction for control paths — all user-space requests go through UmkaOS Core's syscall layer, which dispatches to the appropriate tier. However, the data path does allow direct shared memory: UmkaOS Core sets up shared ring buffers (Section 11.7) that are mapped into both the driver and user-space address spaces. Once established, data flows through these rings without UmkaOS Core mediation (zero-copy). UmkaOS Core mediates only the ring setup, teardown, and error paths.

UmkaOS Core to Tier M (Peer Kernel): Communication uses typed capability channels over the hardware fabric — PCIe P2P ring buffers, CXL shared memory, or coherent on-chip SRAM. No UmkaOS Core data paths cross the hardware boundary. The host-side umka-peer-transport module (~2,000 lines) manages cluster membership, capability negotiation, and crash recovery. The peer kernel runs its own scheduler, memory manager, and capability space independently; the host CPU is not in the device's data path.


11.1.3 Tier M: Multikernel Peer Isolation

Tier M describes the isolation class of devices running their own UmkaOS kernel instance (or UmkaOS-compatible shim) as cluster peers. The three-tier model (Tiers 0–2) describes isolation within a single UmkaOS instance. Tier M is a between-kernel isolation class.

Key framing: Tier M is a protocol specification, not an OS port requirement. The UmkaOS peer protocol is a single protocol shared by all multikernel communication — Tier M peers on PCIe, distributed nodes over RDMA, and firmware shims all speak the same Layer 1 (membership, capabilities, crash recovery, ring transport). The only differences are Layer 0 (transport binding: PCIe vs RDMA vs CXL) and whether Layer 3 (DSM page coherence) is enabled. A device vendor implements Layers 0-1 as a firmware shim (~10-18K lines of C on an existing RTOS, excluding cryptographic primitives already present in the firmware stack; a reference implementation will be published with measured line counts) without replacing their firmware stack. See Section 24.1 for the standalone protocol spec and the full protocol stack diagram. The complete wire specification (message types, payload structs, PCIe BAR layout, negotiation state machine, transport bindings) is in Section 5.1.

Isolation properties:

  • No shared kernel address space. Tiers 0–2 all execute within the host UmkaOS kernel (Ring 0 or Ring 3) and share kernel data structures at various depths. A Tier M peer has an entirely separate address space, CPU state, and capability namespace. The host UmkaOS Core never maps peer memory.
  • Hardware boundary, not software policy. Tier 1 isolation (MPK/POE/DACR) and Tier 2 isolation (IOMMU) are policies enforced by software-programmable hardware registers — a sufficiently privileged exploit can alter them. The Tier M boundary is a hardware fabric (PCIe, CXL, on-chip partition fence); crossing it requires physical access or firmware compromise of the device, a categorically different threat model.
  • Isolation stronger than Tier 2. Tier 2 is Ring 3 + IOMMU on the host — the driver still shares the host kernel for syscall dispatch, signal delivery, and page table management. A Tier M peer shares none of these. The only communication surface is the typed capability channel, a significantly smaller attack surface.
  • Ordered crash recovery. On peer kernel crash: IOMMU lockout and PCIe bus master disable within 2ms, then distributed state cleanup, then optional FLR and reboot. The host kernel never panics; applications see a brief I/O stall. See Section 5.3.
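
The ordered recovery sequence can be sketched as an explicit stage list. Stage and function names below are hypothetical — the real logic lives in umka-peer-transport (Section 5.3) — but the invariant is the one stated above: DMA fencing always comes first:

```rust
/// Ordered recovery stages for a crashed Tier M peer.
/// Names are illustrative, not the transport's real API.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum PeerRecoveryStage {
    /// IOMMU lockout + PCIe bus-master disable (within 2ms).
    FenceDma,
    /// Revoke capabilities, clean up distributed state.
    CleanupState,
    /// Optional function-level reset and peer reboot.
    ResetPeer,
}

/// The host executes stages strictly in order; fencing DMA first
/// guarantees a crashed or rebooting peer can never write stale data
/// into host memory during cleanup.
fn recovery_plan(attempt_reset: bool) -> Vec<PeerRecoveryStage> {
    let mut plan = vec![
        PeerRecoveryStage::FenceDma,
        PeerRecoveryStage::CleanupState,
    ];
    if attempt_reset {
        plan.push(PeerRecoveryStage::ResetPeer);
    }
    plan
}
```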

Performance properties:

  • Host-side complexity: ~2,000 lines of device-agnostic umka-peer-transport regardless of device class, vs. 100K–700K lines of device-specific Ring 0 driver code for equivalent traditional devices.
  • Host CPU out of data path: the peer kernel manages its own scheduler and I/O independently. Host CPU overhead is proportional to control-path events (peer joins, leaves, crashes, capability renegotiation), not to data throughput.
  • Communication latency by fabric:
| Fabric | Example hardware | Round-trip latency |
|---|---|---|
| PCIe P2P | SmartNIC, DPU, SAS HBA, NVMe SSD, GPU | ~1–2 μs |
| CXL coherent | Memory/compute expander | ~100–300 ns |
| s390x Channel I/O (QDIO) | OSA-Express NIC, ECKD DASD, FCP storage | ~5–50 μs |
| virtio (virtqueue) | QEMU virtio-blk, virtio-net, virtio-ccw | ~2–10 μs |
| USB | Microcontroller-based peripherals | ~1–10 ms |
| On-chip partition | ARM CCA Realm, RISC-V WorldGuard | ~10–50 ns |
| RDMA (cross-host) | Remote kernel node over RoCEv2/InfiniBand | ~2–10 μs |
| Ethernet+TCP | Software transport (development, IoT) | ~50–500 μs |

The peer protocol is transport-agnostic. Any fabric that can carry ring buffer messages and deliver doorbell interrupts (or polling) is a valid Layer 0. USB and Ethernet are slower but functionally equivalent — the protocol adapts to the latency, not the other way around.

Hardware forms — any device with a processor is a potential peer:

| Form | Examples | Notes |
|---|---|---|
| SmartNIC / DPU | BlueField-3, Pensando Elba, Marvell OCTEON 10 | Full UmkaOS kernel port |
| Storage controller | SAS HBA (Broadcom/Marvell), RAID controller, NVMe SSD with embedded ARM | Shim on existing controller RTOS (~10-18K lines of C, excluding crypto primitives already in firmware). The controller already IS the storage service — Tier M makes that explicit. Host driver drops from ~150K lines to ~2K lines of transport. |
| Computational storage | NVMe SSD with compute (Zynq UltraScale+, Samsung SmartSSD) | Shim or full port. Offloads filtering, compression, encryption to the drive. |
| GPU / accelerator | GPU with firmware (NVIDIA, AMD, Intel), NPU, inference ASIC | Shim or full port. Exposes compute service via capability channel instead of monolithic 700K-line host driver. |
| FPGA | Xilinx Kintex-7 (Phase 3 validation), Alveo, Intel Stratix (PCIe EP) | Shim in soft-core. Cleanest demo of the protocol — the device has never heard of UmkaOS, it just speaks the spec. Kintex-7 is the primary Tier M validation platform: implements both network and storage services simultaneously using onboard DDR for ephemeral block storage. |
| Network processor | Barefoot Tofino, memory-mapped NIC ASIC | Shim. Offloads packet processing, exposes forwarding service. |
| USB peripheral | STM32/ESP32 microcontroller, USB-attached sensor hub, USB crypto token | Shim over USB bulk transport. Slower but valid — demonstrates protocol universality. |
| On-chip partition | ARM CCA Realm (Neoverse V3+), RISC-V WorldGuard | Same physical package; coherent fabric |
| Remote host | x86-64 or AArch64 server over RDMA/Ethernet | Full UmkaOS kernel; adds DSM (Layer 3) for shared-memory workloads |

The on-chip partition form — ARM CCA Realms (shipping in Neoverse V3+, Cortex-X4+) and RISC-V WorldGuard (specification in progress) — is conceptually identical to the discrete device form. The architectural pattern is the same: a separate UmkaOS instance, a hardware-enforced boundary, typed capability channels as the sole communication surface. The difference is physical proximity, which determines communication latency but not the isolation model.

Availability: Tier M is a deployment-time property. A host may have zero, one, or many Tier M peers depending on attached hardware. The host kernel supports the traditional driver model (Tiers 0–2) and the peer model simultaneously with no configuration distinction — umka-peer-transport loads on demand when a peer device is detected.


11.2 Isolation Mechanisms and Performance Modes

Hardware-assisted memory isolation enables UmkaOS's three-tier driver model (Section 11.1) at near-zero overhead on platforms with hardware isolation support (x86 MPK ~23 cycles, ARMv7 DACR ~30-40 cycles, AArch64 page-table ~150-300 cycles). On RISC-V, s390x, and LoongArch64, Tier 1 isolation is not available — drivers choose Tier 0 (in-kernel) or Tier 2 (Ring 3 + IOMMU) based on licensing, driver preference, and admin policy. This section covers the mechanisms, their costs, threat model, and the adaptive policy that allows UmkaOS to run on hardware ranging from x86_64 with MPK (~23 cycles) to RISC-V where Tier 1 is absent entirely. Isolation is one of eight core capabilities — see Section 1.1 for the full list.

11.2.1 Isolation Philosophy: Best Effort Within Performance Budget

Key principle: Driver isolation in UmkaOS is not a single fixed design point. It is a spectrum that varies across hardware architectures, and the approach is deliberately "best effort within the performance budget" rather than "maximum isolation everywhere."

Why this matters:

  1. Hardware capability varies widely: x86_64 has MPK (16 domains, ~23 cycles). AArch64 uses page-table + ASID isolation (~150-300 cycles) as the standard mechanism on all current deployed hardware (Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920). POE (ARMv8.9+/ARMv9.4+, ~40-80 cycles) is an optional hardware acceleration available on newer silicon (Cortex-X4+) that provides 2-4x speedup when present. ARMv7 has DACR (16 domains, ~30-40 cycles). RISC-V has no suitable isolation mechanism — Tier 1 is not available on RISC-V. A design that mandates uniform isolation would either (a) impose unacceptable overhead on some architectures, or (b) fail to leverage better isolation on architectures that support it.

  2. Performance is a requirement, not a nice-to-have: The 5% overhead target is non-negotiable. UmkaOS must be a drop-in replacement for Linux — if I/O latency increases by 20%, users will not adopt it regardless of how strong the isolation is.

  3. The escape hatch always exists: Any Tier 1 driver can be demoted to Tier 2 (full process isolation) at any time — via per-driver manifest, sysfs knob, or automatic crash-count policy. If an administrator values isolation over performance for a specific workload or hardware configuration, that choice is always available. The tradeoff is explicit and user-controlled.

  4. This is not a bug, it's a feature: Some reviewers may see varying isolation strength across architectures as a "flaw" or "inconsistency." It is neither. It is an honest acknowledgment of hardware reality. The alternative — pretending all architectures have identical isolation capabilities, or mandating full process isolation everywhere (and accepting 20-50% overhead) — would make UmkaOS impractical for its intended use case as a Linux replacement.

The design contract:

| Hardware | Tier 1 Isolation | Overhead | Alternative |
|---|---|---|---|
| x86_64 with MPK | Strong (MPK domains) | ~1-2% | Demote to Tier 2 for stronger isolation |
| AArch64 (mainstream: page-table + ASID) | Moderate (page-table domains) | ~6-12% | Demote to Tier 2, or promote to Tier 0 for performance |
| AArch64 with POE (ARMv8.9+/ARMv9.4+) | Strong per-group (3 POE indices = max 3 concurrent Tier 1 groups; drivers sharing an index are co-isolated) | ~2-4% | Demote to Tier 2 for stronger isolation |
| ARMv7 with DACR | Strong (DACR domains) | ~1-2% | Demote to Tier 2 for stronger isolation |
| RISC-V | Tier 1 unavailable — drivers use Tier 0 or Tier 2 | 0% (Tier 0) or Tier 2 overhead | Tier 2 for isolation, Tier 0 for performance |
| PPC32/PPC64LE | Strong-Moderate | ~1-5% | Demote to Tier 2 for stronger isolation |
| s390x | Tier 1 unavailable — drivers use Tier 0 or Tier 2 (Storage Keys too coarse for domain isolation) | 0% (Tier 0) or Tier 2 overhead | Tier 2 (subchannel protection) for isolation |
| LoongArch64 | Tier 1 unavailable — drivers use Tier 0 or Tier 2 | 0% (Tier 0) or Tier 2 overhead | Tier 2 (IOMMU) for isolation |

Summary: UmkaOS provides the best isolation the hardware can deliver within the performance budget, with a user-controlled escape hatch to stronger isolation (Tier 2) when security requirements exceed what the hardware can efficiently provide. This is a pragmatic engineering tradeoff, not a design flaw.

AArch64 deployment note: The global UmkaOS performance budget (≤5% overhead vs Linux) requires POE (ARMv8.9+/ARMv9.4-A, FEAT_S1POE) to be met with Tier 1 on AArch64. Without POE, page-table + ASID isolation costs 6-12% per domain switch, which exceeds the budget for high-throughput workloads (NVMe, network). On current mainstream AArch64 servers (Graviton 2/3/4, Neoverse V1/V2, Ampere Altra) that lack POE, operators have two options:

  1. Use Tier 1 and accept the higher overhead (appropriate when crash containment is the priority and workloads have low I/O frequency — e.g., compute-heavy, GPU inference).
  2. Prefer Tier 2 for I/O-intensive drivers (USB, SATA, fast storage) and promote only low-frequency drivers to Tier 1. This keeps per-request overhead within budget.

POE support detection is automatic at boot (ID_AA64MMFR3_EL1.S1POE). Operators can also force Tier 2 globally on AArch64 without POE via umka.tier1_aarch64=0.
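
For illustration, the boot-time check reduces to decoding one ID-register field. The helper below is hypothetical and assumes the raw register value has already been read (via MRS); the S1POE field position (bits [19:16] of ID_AA64MMFR3_EL1, nonzero meaning stage-1 permission overlays are implemented) follows the Arm ARM:

```rust
/// Decode FEAT_S1POE support from a raw ID_AA64MMFR3_EL1 value.
/// Field offset per the Arm ARM: S1POE at bits [19:16]; a nonzero
/// field value means stage-1 permission overlays are implemented.
/// Helper name is illustrative, not the kernel's API.
fn s1poe_supported(id_aa64mmfr3_el1: u64) -> bool {
    ((id_aa64mmfr3_el1 >> 16) & 0xF) != 0
}
```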

11.2.2 How MPK Works

Each page table entry contains a 4-bit protection key (PKEY), assigning the page to one of 16 domains (0-15). The PKRU register holds per-domain read/write permission bits. The WRPKRU instruction updates these permissions in approximately 23 cycles (measured: ~23 cycles on Skylake [libmpk, USENIX ATC '19], ~28 cycles on Skylake-SP [EPK, USENIX ATC '22]) -- no TLB flush, no privilege transition, no system call.
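
Concretely, PKRU holds two bits per key — AD (access-disable, bit 2k) and WD (write-disable, bit 2k+1), where a set bit denies. A sketch of composing a PKRU value from a domain set (`pkru_allow` is a hypothetical helper, not the kernel's API):

```rust
/// Build a PKRU value that grants read/write access only to the given
/// protection keys. PKRU holds two bits per key: AD (access-disable,
/// bit 2k) and WD (write-disable, bit 2k+1); a set bit denies.
/// Helper name is illustrative.
fn pkru_allow(keys: &[u8]) -> u32 {
    let mut pkru = u32::MAX; // start from deny-all across all 16 domains
    for &k in keys {
        assert!(k < 16, "PKEY out of range");
        pkru &= !(0b11u32 << (2 * u32::from(k))); // clear AD+WD => full access
    }
    pkru
}
```

A Tier 1 switch would grant, say, the driver's own PKEY plus the shared read-only and DMA-pool keys, and deny everything else.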

11.2.3 Cost Comparison

| Mechanism | Cost per transition | Isolation strength | Used for |
|---|---|---|---|
| Function call | ~1-5 cycles | None | Linux monolithic |
| Intel MPK WRPKRU | ~23 cycles | Memory domain | Tier 1 drivers |
| Full IPC (seL4-style) | ~600-1000 cycles | Full address space | Too expensive |
| Address-space switch | ~200-600 cycles | Full process | Tier 2 drivers |

MPK gives meaningful isolation -- a Tier 1 driver cannot read or write kernel private data, other driver data, or memory in other MPK domains -- at only approximately 23 cycles per boundary crossing. Combined with IOMMU for DMA fencing, this is the foundation of our performance story.

11.2.4 MPK Domain Allocation

With 16 available domains (PKEY 0-15), the allocation strategy is:

| PKEY | Assignment |
|---|---|
| 0 | UmkaOS Core (kernel private data) |
| 1 | Shared read-only (ring buffer descriptors) |
| 2-13 | Tier 1 driver domains (12 available) |
| 14 | Shared DMA buffer pool |
| 15 | Guard / unmapped |

When more than 12 Tier 1 domains are needed, related drivers are grouped into the same domain (for example, all block drivers share one domain, all network drivers share another). This grouping is configurable via policy.
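
A minimal sketch of the key pool (illustrative only — the kernel uses a per-CPU bitmap and policy-driven grouping, and these names are hypothetical):

```rust
/// PKEYs 2-13 form the Tier 1 pool; on exhaustion the caller assigns
/// the driver to a shared per-class group key instead.
struct PkeyAllocator {
    free: u16, // bit i set => PKEY i is free; only bits 2..=13 are ever set
}

impl PkeyAllocator {
    fn new() -> Self {
        Self { free: 0b0011_1111_1111_1100 } // PKEYs 2-13 (12 keys)
    }

    /// Allocate the lowest free dedicated PKEY, or None when all 12
    /// are in use (fall back to a shared group domain).
    fn alloc(&mut self) -> Option<u8> {
        let key = self.free.trailing_zeros(); // 16 when the pool is empty
        if key > 13 {
            return None;
        }
        self.free &= !(1u16 << key);
        Some(key as u8)
    }

    /// Return a PKEY to the pool on driver unload.
    fn release(&mut self, key: u8) {
        assert!((2..=13).contains(&key), "not a Tier 1 PKEY");
        self.free |= 1u16 << key;
    }
}
```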

11.2.4.1 Domain Lifecycle: Allocation and Release

Isolation domain keys (PKEY on x86, POE index on AArch64, DACR domain on ARMv7, segment register on PPC32, Radix PID on PPC64) are a finite hardware resource managed via a per-CPU allocation bitmap. Keys are allocated at driver load time and must be explicitly released when a driver is unloaded or its isolation context is torn down.

/// Unique domain identifier. Architecture-specific meaning:
/// x86: PKEY (2-13), AArch64 POE: overlay index (3-5),
/// ARMv7: DACR domain (2-13), PPC32: segment register (2-13),
/// PPC64: Radix PID. Opaque to callers outside the isolation subsystem.
pub type DomainId = u64;

/// Release a previously allocated isolation domain key, returning it
/// to the free pool for reuse by future Tier 1 drivers.
///
/// # When called
///
/// - **VFIO unbind**: Step 5 of the VFIO unbind sequence
///   ([Section 18.5](18-virtualization.md#vfio-and-iommufd-device-passthrough-framework--vfio-unbind)).
///   The device's pre-VFIO Tier 1 domain key is released after device
///   quiesce and IOAS teardown.
/// - **Driver unload**: When a Tier 1 driver is unloaded via the KABI
///   module unload path, its domain key is released after the driver's
///   `remove()` callback completes and all ring buffer endpoints are
///   deregistered.
/// - **Tier 1 → Tier 0 promotion**: When an operator promotes a driver
///   to Tier 0 (e.g., via `/etc/umka/driver-policy.d/<name>.toml` or
///   the `performance` isolation mode), the Tier 1 domain is released
///   because Tier 0 runs in PKEY 0 (kernel domain).
/// - **Driver crash recovery**: On Tier 1 driver crash, the domain key
///   is held during recovery (the new driver instance reuses the same
///   domain). Released only if recovery fails and the driver is
///   permanently unloaded.
///
/// # Algorithm
///
/// 1. Clear all page table entries tagged with this domain key:
///    walk the driver's private VA range and zero the PTE protection
///    key bits (x86: bits 62:59, AArch64 POE: bits 62:60).
/// 2. Return the key to the per-architecture free bitmap:
///    `DOMAIN_BITMAP.fetch_and(!(1 << domain_id), Release)`.
/// 3. Flush TLB entries for the affected VA range (architecture-specific
///    invalidation — `INVLPG` on x86, `TLBI` on ARM).
///
/// # Concurrency
///
/// Must be called with the driver's isolation context lock held
/// (preventing concurrent domain switches into the released domain).
/// The TLB flush in step 3 is a cross-CPU shootdown on SMP systems.
pub fn release_isolation_domain(domain_id: DomainId) {
    // Clear PTEs for the domain's VA range.
    arch::current::isolation::clear_domain_ptes(domain_id);
    // Flush TLB BEFORE recycling the key: stale TLB entries referencing the old
    // domain key must be gone before another driver can be assigned the same key.
    // Without this ordering, a new driver could map pages with a recycled key
    // while stale TLB entries from the OLD driver remain cached on other CPUs.
    arch::current::isolation::flush_domain_tlb(domain_id);
    // Return key to free pool (safe now that all stale TLB entries are flushed).
    arch::current::isolation::free_domain_key(domain_id);
}

11.2.5 WRPKRU Threat Model: Crash Containment, Not Exploitation Prevention

Critical design constraint: WRPKRU is an unprivileged instruction. Any code running in Ring 0 — including Tier 1 driver code — can execute WRPKRU to modify its own MPK permission register, granting access to any MPK domain including UmkaOS Core (PKEY 0). This means MPK isolation provides crash containment (preventing buggy drivers from corrupting kernel memory) but does not provide exploitation prevention (compromised Ring 0 code can execute WRPKRU to escape).

Security model — UmkaOS's Tier 1 isolation is designed to survive driver bugs, not driver exploitation. The rationale: the vast majority of kernel crashes are caused by bugs (null dereference, use-after-free, buffer overrun), not by attackers with arbitrary code execution inside a specific driver. For environments requiring defense against compromised Ring 0 code, Tier 2 (full process isolation) provides the strong boundary — at higher latency cost.

What MPK actually protects against:

  • Accidental memory corruption: Null pointer dereferences, buffer overruns, and similar bugs that write to wrong addresses are contained — the hardware fault triggers before the driver can corrupt kernel memory.
  • Crash recovery: When a driver faults, UmkaOS Core can safely restart it without system panic because driver memory is isolated from core state.
  • Fault propagation containment: A bug in one Tier 1 driver cannot corrupt data belonging to UmkaOS Core or drivers in different domain groups. Drivers sharing the same hardware isolation domain can corrupt each other's data; Rust memory safety is the primary defense within shared domains (see Section 24.5).

What MPK does NOT protect against:

  • Deliberate exploitation: An attacker who achieves arbitrary code execution within a Tier 1 driver can execute WRPKRU to escape isolation. The instruction is unprivileged by design and the sanctioned switch_domain() trampoline uses it legitimately — it cannot be detected or blocked.
  • Runtime code injection: JIT code or ROP gadgets that contain WRPKRU can execute the instruction directly.

Driver signing — All Tier 1 drivers must be signed (Section 9.3). An attacker cannot load a malicious driver binary without a valid signature. The attack surface is limited to exploiting bugs in legitimately signed driver code. Combined with Rust's memory safety guarantees and standard Linux hardening (CFI, CET), this raises the bar for achieving arbitrary code execution, but does not eliminate the WRPKRU escape vector.

11.2.5.1 Domain Crossing Protocol Summary

All Tier 0 → Tier 1 communication uses KABI ring buffers (Section 11.8). Direct function calls across isolation boundaries are NOT permitted — the hardware memory domain (MPK/POE/DACR) would fault. The canonical crossing pattern:

  1. Caller (Tier 0) writes a request entry to the target's KABI ring (SPSC or MPSC).
  2. Caller optionally signals the target via doorbell (IPI or ring-buffer-level notification).
  3. Target (Tier 1) reads the ring entry, processes the request.
  4. Target writes a response entry to the response ring (or sets a completion flag).
  5. Caller reads the response.

Subsystems that follow this pattern:

  • Network RX: Nucleus → umka-net via NetBufRingEntry (Section 16.5)
  • Network TX: TxDispatch::KabiRing (Section 16.13)
  • Writeback: Nucleus → VFS via WritebackRequest (Section 4.6)
  • Fault handling: Nucleus → KVM via FaultRequest (Section 4.15)
  • NAPI polling: NapiPollDispatch::KabiRing (Section 16.14)
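
The five-step crossing pattern can be illustrated with a toy single-threaded ring pair. Everything here is hypothetical scaffolding for exposition — the real KABI rings (Section 11.8) are shared-memory SPSC/MPSC structures with atomics and doorbells:

```rust
/// Toy bounded ring; one slot per outstanding entry.
struct Ring {
    entries: Vec<Option<u64>>,
    head: usize, // producer cursor
    tail: usize, // consumer cursor
}

impl Ring {
    fn new(slots: usize) -> Self {
        Ring { entries: vec![None; slots], head: 0, tail: 0 }
    }

    /// Write an entry into the next free slot (steps 1 and 4).
    fn push(&mut self, entry: u64) -> bool {
        let n = self.entries.len();
        if self.entries[self.head % n].is_some() {
            return false; // ring full — caller backs off or retries
        }
        self.entries[self.head % n] = Some(entry);
        self.head += 1;
        true
    }

    /// Consume the next entry, if any (steps 3 and 5).
    fn pop(&mut self) -> Option<u64> {
        let n = self.entries.len();
        let entry = self.entries[self.tail % n].take();
        if entry.is_some() {
            self.tail += 1;
        }
        entry
    }
}

/// Steps 1-5 inline: Tier 0 submits a request, the Tier 1 target
/// consumes it, does toy work, and posts a response.
fn cross_domain_call(req: &mut Ring, resp: &mut Ring, request: u64) -> Option<u64> {
    req.push(request);                   // 1. caller writes request entry
                                         // 2. doorbell elided (IPI/notify)
    let consumed = req.pop()?;           // 3. target reads the entry
    resp.push(consumed.wrapping_mul(2)); // 4. target writes response (toy work)
    resp.pop()                           // 5. caller reads response
}
```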

Tier 2 for exploitation-sensitive workloads — For environments where defense against compromised Ring 0 code is required, drivers should run at Tier 2 (full process isolation). The auto-demotion mechanism (Section 11.6) allows administrators to pin specific drivers to Tier 2 via policy, trading higher I/O latency for stronger isolation.

11.2.5.2 PKRU Write Elision (Mandatory)

The ~23-cycle WRPKRU cost is per instruction, not per domain crossing. When an I/O path traverses multiple domains in sequence (e.g., NIC driver → TCP stack → socket layer), a naive implementation issues a WRPKRU at every boundary — 6 writes for a 3-boundary round-trip. UnderBridge (Gu et al., USENIX ATC '20) demonstrated that many of these writes are redundant and can be elided.

WRPKRU elision is a mandatory core design decision, not a deferred optimization. Every WRPKRU instruction in UmkaOS goes through the switch_domain() trampoline (defined below), which enforces shadow comparison before any hardware write. There is no code path in the kernel that issues a raw WRPKRU without shadow checking — this invariant is enforced at the API level (the x86::wrpkru() function is unsafe and only called from switch_domain()).

The three elision techniques (all implemented from day one):

  1. Same-permission transition: if domain A and domain B both need read access to a shared buffer, and the only permission change is adding write access to B's private region, the WRPKRU write may be unnecessary if A's private region is already read-disabled. The key insight: WRPKRU sets all 16 domain permissions simultaneously — if the new permission bitmap happens to be identical to the current one, the write is redundant.

  2. Batched transitions: when crossing A → B → C in rapid succession (e.g., NIC driver → TCP → socket), instead of writing PKRU three times (disable A/enable B, disable B/enable C), compute the final PKRU state and write once. The intermediate states are unnecessary if no untrusted code executes between transitions.

  3. Cached PKRU shadow: a per-CPU shadow of the current PKRU value (stored in CpuLocalBlock, see Section 3.2). Before issuing WRPKRU, switch_domain() compares the desired value against the shadow. If identical, the instruction is skipped entirely. This is a single register comparison (~1 cycle) versus the ~23-cycle WRPKRU.

UmkaOS implementation — every domain switch goes through this trampoline. The pkru_shadow is stored in CpuLocalBlock for single-instruction access. No code path in the kernel issues WRPKRU outside this function. The switch_domain() inline function returns the previous shadow value so the caller can restore it after the cross-domain call completes:

#[inline(always)]
fn switch_domain(target_pkru: u32) -> u32 {
    let shadow = per_cpu::pkru_shadow();
    if shadow != target_pkru {
        // SAFETY: WRPKRU updates permission bits for all 16 MPK domains.
        // target_pkru is computed from the domain allocation table and
        // validated at driver load time — only valid permission sets are
        // reachable. Preemption is disabled (see above).
        unsafe { x86::wrpkru(target_pkru) };
        per_cpu::set_pkru_shadow(target_pkru);
    }
    shadow // Return previous domain value for caller to restore.
}

Per-architecture switch_domain() trampolines — each architecture implements the same shadow-comparison pattern. The arch::current::isolation::switch_domain() function is a compile-time alias to the active architecture's implementation:

// AArch64 with POE (FEAT_S1POE)
#[inline(always)]
fn switch_domain(target_por: u64) -> u64 {
    let shadow = per_cpu::por_shadow();
    if shadow != target_por {
        // SAFETY: MSR POR_EL0 updates overlay permissions for indices 0-7.
        // ISB is mandatory after POR_EL0 write (ARM ARM §D19.2.133).
        unsafe {
            core::arch::asm!("msr POR_EL0, {}", "isb", in(reg) target_por);
        }
        per_cpu::set_por_shadow(target_por);
    }
    shadow // Return previous domain value for caller to restore.
}

// AArch64 without POE (page-table + ASID path)
#[inline(always)]
fn switch_domain(target_ttbr: u64) -> u64 {
    let shadow = per_cpu::ttbr_shadow();
    if shadow != target_ttbr {
        // SAFETY: TTBR0_EL1 write switches the stage-1 translation table.
        // ISB ensures the new table is used by subsequent instructions.
        // TLBI flushes stale ASID-tagged TLB entries.
        // TLBI ASIDE1IS expects the ASID in bits [63:48] of Xt.
        // target_ttbr already has the ASID in bits [63:48] (TTBR0_EL1
        // format: ASID[63:48] | BADDR[47:1] | CnP[0]).
        // Do NOT shift right — that would put the ASID in bits [15:0],
        // causing TLBI to invalidate ASID 0 (kernel) instead of the target.
        unsafe {
            core::arch::asm!(
                "msr TTBR0_EL1, {ttbr}",
                "isb",
                "tlbi aside1is, {ttbr}",
                "dsb ish",
                "isb",
                ttbr = in(reg) target_ttbr,
            );
        }
        per_cpu::set_ttbr_shadow(target_ttbr);
    }
    shadow // Return previous domain value for caller to restore.
}

// ARMv7 (DACR)
#[inline(always)]
fn switch_domain(target_dacr: u32) -> u32 {
    let shadow = per_cpu::dacr_shadow();
    if shadow != target_dacr {
        // SAFETY: MCR p15 c3 c0 0 writes the Domain Access Control Register.
        // ISB is mandatory after DACR write (ARM ARM B3.7.2).
        unsafe {
            core::arch::asm!(
                "mcr p15, 0, {}, c3, c0, 0",
                "isb",
                in(reg) target_dacr
            );
        }
        per_cpu::set_dacr_shadow(target_dacr);
    }
    shadow // Return previous domain value for caller to restore.
}

// PPC32 (segment registers) — unified single-argument signature.
// The DomainDescriptor is looked up from a per-CPU table keyed by domain_id.
// It contains: sr_values: [u32; 16] and active_segments: u16 (bitmask).
// Returns the previous DomainId for save/restore.
#[inline(always)]
fn switch_domain(domain_id: DomainId) -> DomainId {
    let prev = per_cpu::current_domain_id();
    let desc = per_cpu::domain_descriptor(domain_id);
    let mut seg_mask = desc.active_segments;
    while seg_mask != 0 {
        let seg_index = seg_mask.trailing_zeros() as u32;
        seg_mask &= seg_mask - 1; // clear lowest set bit
        let target_sr = desc.sr_values[seg_index as usize];
        let shadow = per_cpu::sr_shadow(seg_index);
        if shadow != target_sr {
            // SAFETY: mtsrin loads a segment register from rB[0:3] (SR index)
            // and rS (segment descriptor value). The SR field in mtsrin is taken
            // from bits 0-3 of rB (PPC big-endian numbering = bits [31:28] of
            // a 32-bit register). mtsr cannot be used here because its SR field
            // is an immediate, not a register operand.
            // isync is required after segment register write to ensure subsequent
            // memory accesses use the new segment descriptor (Power ISA v2.07, §5.4.4.2).
            unsafe {
                core::arch::asm!(
                    "mtsrin {}, {}",
                    in(reg) target_sr,
                    in(reg) (seg_index << 28),
                );
            }
            per_cpu::set_sr_shadow(seg_index, target_sr);
        }
    }
    // Single isync after all segment register writes (batched).
    unsafe { core::arch::asm!("isync"); }
    per_cpu::set_current_domain_id(domain_id);
    prev // Return previous domain ID for caller to restore.
}

// PPC64LE (Radix PID)
#[inline(always)]
fn switch_domain(target_pid: u32) -> u32 {
    let shadow = per_cpu::rpid_shadow();
    if shadow != target_pid {
        // SAFETY: mtspr PIDR switches the process ID for Radix translation.
        // isync is required after PIDR write (Power ISA v3.1, §5.10.1.6).
        unsafe {
            core::arch::asm!("mtspr 48, {}", "isync", in(reg) target_pid);
        }
        per_cpu::set_rpid_shadow(target_pid);
    }
    shadow // Return previous domain value for caller to restore.
}

Context switch coherence: On every context switch, the scheduler calls arch::x86_64::isolation::save_pkru(prev_task) to save the outgoing task's PKRU register value into prev_task.saved_pkru, then calls arch::x86_64::isolation::restore_pkru(next_task) to load next_task.saved_pkru via WRPKRU. The per-CPU CpuLocalBlock.isolation_shadow field (Section 3.2) is updated to next_task.saved_pkru atomically with the WRPKRU execution — the shadow always reflects the actual hardware PKRU register value on this CPU. This invariant is required for the validate_current_domain() fast path which reads the shadow without executing RDPKRU. Any code path that issues WRPKRU outside switch_domain() or the context switch save/restore functions is a bug: it would desync the shadow from the hardware register, causing switch_domain() to skip necessary WRPKRU writes on subsequent domain transitions.
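The save/restore invariant can be modeled with mock register state. The `Cpu` and `Task` structs and field names here are illustrative stand-ins for the kernel's CpuLocalBlock and task struct, not the real types:

```rust
// Miniature model of PKRU save/restore keeping the per-CPU shadow coherent.
// `hw_pkru` stands in for the hardware register; all names are illustrative.

struct Cpu {
    hw_pkru: u32,     // stands in for the actual PKRU register on this CPU
    pkru_shadow: u32, // stands in for CpuLocalBlock.isolation_shadow
}

struct Task {
    saved_pkru: u32,
}

impl Cpu {
    /// save_pkru(prev): capture the outgoing task's register value.
    fn save_pkru(&self, prev: &mut Task) {
        prev.saved_pkru = self.hw_pkru;
    }
    /// restore_pkru(next): WRPKRU plus shadow update, done together so the
    /// shadow always equals the hardware register afterwards.
    fn restore_pkru(&mut self, next: &Task) {
        self.hw_pkru = next.saved_pkru;     // models the WRPKRU
        self.pkru_shadow = next.saved_pkru; // shadow updated with it
    }
}

fn main() {
    let mut cpu = Cpu { hw_pkru: 0x5555_5550, pkru_shadow: 0x5555_5550 };
    let mut prev = Task { saved_pkru: 0 };
    let next = Task { saved_pkru: 0xAAAA_AAA0 };

    cpu.save_pkru(&mut prev);
    cpu.restore_pkru(&next);

    // The invariant the validate_current_domain() fast path relies on:
    // shadow == hardware register, on every CPU, at all times.
    assert_eq!(cpu.pkru_shadow, cpu.hw_pkru);
    assert_eq!(prev.saved_pkru, 0x5555_5550);
    println!("shadow coherent after context switch");
}
```

A hypothetical bug that skipped the `pkru_shadow` update in `restore_pkru` would break the final assertion — which is exactly the desync failure mode the paragraph above describes.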

Guaranteed savings — on a typical TCP receive path (4 WRPKRU instructions in the naive case: 2 boundary crossings × 2 switches each), shadow comparison eliminates 1-2 redundant writes (the intermediate transitions where permissions don't actually change). At ~23 cycles per elided write, this saves ~23-46 cycles per packet — reducing TCP path overhead from ~2% to ~1-1.5%. On NVMe paths, back-to-back domain transitions (submit→complete with no intervening domain change) hit the shadow cache and skip the second WRPKRU pair entirely, saving ~46 cycles.
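The shadow-cache component of these savings (technique 3 above; batching is not modeled) can be checked with a small counter model. The domain values are arbitrary — only the shape of the transition sequence matters:

```rust
/// Count how many hardware WRPKRU writes a transition sequence needs when a
/// shadow comparison elides redundant ones. A toy model of technique 3 only.
fn writes_with_shadow(start: u32, transitions: &[u32]) -> usize {
    let mut shadow = start;
    let mut writes = 0;
    for &target in transitions {
        if shadow != target {
            writes += 1; // the one WRPKRU actually issued
            shadow = target;
        }
    }
    writes
}

fn main() {
    const CORE: u32 = 0x1; // illustrative PKRU values
    const DRV: u32 = 0x2;

    // Every hop changes domains: no elision possible from the shadow alone.
    let tcp_path = [DRV, CORE, DRV, CORE];
    assert_eq!(writes_with_shadow(CORE, &tcp_path), 4);

    // Back-to-back NVMe submit/complete in the same domain: the second
    // "switch" targets the value already in the shadow and is skipped,
    // saving ~23 cycles per elided write.
    let nvme_path = [DRV, DRV, CORE];
    assert_eq!(writes_with_shadow(CORE, &nvme_path), 2);

    println!("elision model ok");
}
```

Batched transitions (technique 2) would shrink the first sequence further by computing the final PKRU state once; modeling that requires knowledge of which intermediate states are observable, which this sketch deliberately omits.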

Generalization to other architectures: The shadow-comparison pattern applies to every architecture's isolation register, not just x86 PKRU:

| Architecture | Register | Shadow location | Skip cost | Hardware write cost |
|---|---|---|---|---|
| x86-64 | PKRU (WRPKRU) | CpuLocalBlock.pkru_shadow | ~1 cycle (compare) | ~23 cycles |
| AArch64 | POR_EL0 (MSR) | CpuLocalBlock.por_shadow | ~1 cycle | ~40-80 cycles |
| ARMv7 | DACR (MCR p15 + ISB) | CpuLocalBlock.dacr_shadow | ~1 cycle | ~30-40 cycles |
| PPC64 | Radix PID (mtspr) | CpuLocalBlock.rpid_shadow | ~1 cycle | ~30-60 cycles |
| PPC32 | Segment regs (mtsrin + isync) | CpuLocalBlock.sr_shadow[16] | ~1 cycle | ~20-40 cycles |

RISC-V has no isolation register (Tier 1 is not available on RISC-V) — shadow elision is not applicable. AArch64 uses POR_EL0 when POE hardware is present; on mainstream AArch64 without POE, the shadow tracks ASID/TTBR0 to elide redundant page-table switches. The shadow pattern provides the largest benefit on x86-64 and AArch64 POE, where the hardware write cost is highest relative to the comparison cost.

Cross-architecture memory ordering invariant: All writes to isolation_shadow (and its per-architecture variants: pkru_shadow, por_shadow, dacr_shadow, rpid_shadow, sr_shadow) use Release ordering. All reads use Acquire ordering. This ensures that the shadow value is visible to all CPUs before the next domain switch decision. On x86-64, where stores have implicit release semantics, the Release is a no-op; on weakly-ordered architectures (AArch64, ARMv7, RISC-V, PPC), the ordering fence ensures that a context switch on one CPU publishes the new shadow value before any subsequent switch_domain() call on the same CPU reads it. Without this invariant, a stale shadow read could cause switch_domain() to skip a necessary hardware register write, leaving the CPU in the wrong isolation domain.
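A userspace sketch of the ordering rule using std atomics (illustrative — the kernel stores the shadow in CpuLocalBlock, not in a wrapper struct like this):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Sketch of the shadow ordering rule: every store uses Release, every load
/// uses Acquire. The type and method names are illustrative.
struct IsolationShadow(AtomicU32);

impl IsolationShadow {
    /// Context-switch / switch_domain path: Release store, so the shadow
    /// value is published only after the hardware register write it mirrors.
    fn publish(&self, value: u32) {
        self.0.store(value, Ordering::Release);
    }
    /// switch_domain fast path: Acquire load pairs with the Release store.
    fn read(&self) -> u32 {
        self.0.load(Ordering::Acquire)
    }
    /// True when the hardware register write can be skipped.
    fn can_elide(&self, target: u32) -> bool {
        self.read() == target
    }
}

fn main() {
    let shadow = IsolationShadow(AtomicU32::new(0x11));
    assert!(shadow.can_elide(0x11)); // same value: skip the register write
    shadow.publish(0x22);
    assert!(!shadow.can_elide(0x11)); // stale target must not be elided
    assert_eq!(shadow.read(), 0x22);
    println!("ordering pattern ok");
}
```

On x86-64 the Release store compiles to a plain `mov` (stores already have release semantics); on AArch64/PPC it emits the fence that the invariant above requires.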

11.2.6 Isolation on Other Architectures

Each supported architecture uses its best available fast isolation mechanism:

| Architecture | Mechanism | Switch cost | Domains |
|---|---|---|---|
| x86_64 | MPK (WRPKRU) | ~23 cycles | 12 for drivers |
| AArch64 (mainstream) | Page-table + ASID | ~150-300 cycles | Unlimited |
| AArch64 + POE (ARMv8.9+/ARMv9.4+) | POE (MSR POR_EL0) | ~40-80 cycles | 7 usable (3 for drivers after infra deductions; see Section 24.5) |
| ARMv7 | DACR (MCR p15 + ISB) | ~30-40 cycles | 12 usable |
| PPC32 | Segment registers (mtsrin + isync) | ~20-40 cycles | 12 usable |
| PPC64LE | Radix PID (mtspr PIDR) | ~30-60 cycles | Per-process |
| RISC-V 64 | None — Tier 1 unavailable | N/A | N/A |
| s390x | Storage Keys (ISK/SSK) — too coarse; Tier 1 unavailable | N/A | N/A |
| LoongArch64 | None — Tier 1 unavailable | N/A | N/A |

Page-table + ASID isolation is the standard AArch64 mechanism and runs on all current ARM datacenter deployments: Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920. POE (ARM FEAT_S1POE) is a hardware acceleration available on ARMv8.9+/ARMv9.4+ silicon (Cortex-X4+) that reduces switch cost to ~40-80 cycles; it is an optional optimization, not the primary mechanism. When domain counts are exhausted, architectures with register-based isolation fall back to page-table switches. ARMv7 DACR is universally available on all Cortex-A cores and matches MPK in both cost and domain count.

11.2.6.1 Per-Architecture Mechanism Details

  • aarch64: ARM Memory Domains (up to 16 domains via DACR on ARMv7) are not available on ARMv8/AArch64 in the same form. The standard AArch64 isolation mechanism is page-table-based domain isolation with ASID-preserving switches (~150-300 cycles per TTBR0_EL1 write + ISB + TLBI ASIDE1IS). This is what runs on all current deployed ARM servers: Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920. On hardware with ARM FEAT_S1POE (optional from ARMv8.9/ARMv9.4, available on Cortex-X4+), UmkaOS activates the Permission Overlay Extension as an acceleration: POE provides 8 overlay indices (3 bits from PTE bits [62:60]), with index 0 = no overlay (base permissions only, POE disabled for those pages), giving 7 usable domains — fewer than x86 MPK's 12 driver domains. After infrastructure deductions (index 1: shared read-only, index 2: shared DMA, index 6: userspace, index 7: temporary/debug), only 3 indices remain for Tier 1 driver domains (indices 3-5); see Section 24.5 for the full AArch64 grouping scheme. Domain grouping is therefore much more aggressive on AArch64 when POE is active. POE is an optimization that reduces switch cost to ~40-80 cycles (~2-4x improvement); the system operates correctly without it using the page-table path.
  • armv7: ARMv7 provides hardware Domain Access Control via the DACR register, supporting 16 memory domains (12 usable for drivers — domain 0 reserved for kernel core, domain 1 for shared descriptors, domain 14 for shared DMA, domain 15 as guard; matching x86 MPK infrastructure deductions). Each domain can be set to No Access, Client (checked against page permissions), or Manager (unchecked access) via a single MCR instruction to update DACR. This is the closest hardware analogue to x86 MPK on 32-bit ARM — a single privileged (MCR p15) register write switches domain permissions without TLB flushes. Unlike x86 WRPKRU (which is unprivileged and executable from Ring 3), DACR writes require PL1 — this is a security advantage: user-space code cannot forge domain switches.
  • riscv64: RISC-V currently has no hardware isolation primitive suitable for Tier 1. SPMP (S-mode Physical Memory Protection) is only active when paging is disabled (satp.mode == Bare) and cannot be used in a kernel with virtual memory enabled. Smmtt (Supervisor Domain Access Protection) targets confidential computing, not MPK-style fast domain switching. Pointer Masking (Smnpm/Ssnpm, ratified Oct 2024) is not a domain isolation mechanism. Tier 1 isolation is not available on RISC-V. Drivers that would request Tier 1 on RISC-V choose Tier 0 (in-kernel, fully trusted) or Tier 2 (Ring 3 + IOMMU) based on licensing, driver preference, and admin policy. This is an accepted hardware constraint, not a design flaw. Proprietary drivers must use Tier 2; open-source drivers default per their fallback_bias manifest setting. When RISC-V ISA extensions provide suitable isolation primitives (e.g., future Smpmp or custom domain extensions), UmkaOS will support Tier 1 on RISC-V without requiring architectural changes — the driver model is designed for this upgrade path.
  • ppc32: PPC32 uses segment registers for memory domain isolation. The 32-bit PowerPC architecture provides 16 segment registers (SR0–SR15), each controlling access to a 256 MB virtual address region. Updating a segment register via mtsrin is a single supervisor-mode instruction; with the mandatory isync barrier (Power ISA §5.4.4.2), total cost is ~20-40 cycles. When segments are insufficient, UmkaOS falls back to page-table-based isolation.
  • ppc64le: PPC64LE on POWER9+ uses the Radix MMU with partition table entries (process table / PID) for isolation. On POWER8, the Hashed Page Table (HPT) with LPAR (Logical Partitioning) provides hardware-assisted isolation. The Radix MMU's PID-based isolation switches via mtspr PIDR (~30-60 cycles). HPT fallback uses full page table switches (~200-400 cycles).
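As a small aside on the PPC32 path: each segment register covers a fixed 256 MB window, so selecting the SR index for a virtual address is just the top four address bits. A hypothetical helper (not kernel code):

```rust
/// Illustrative helper: which of the 16 PPC32 segment registers (SR0-SR15)
/// covers a 32-bit virtual address. Each segment spans 256 MB (2^28 bytes),
/// so the index is simply the top four address bits.
fn segment_index(va: u32) -> u32 {
    va >> 28
}

fn main() {
    assert_eq!(segment_index(0x0000_0000), 0);  // SR0: first 256 MB
    assert_eq!(segment_index(0x1000_0000), 1);  // SR1 begins at 256 MB
    assert_eq!(segment_index(0xF000_0000), 15); // SR15: last 256 MB
    // The inverse operation — placing the index in bits [31:28] — is the
    // `seg_index << 28` operand passed to mtsrin in the PPC32 trampoline above.
    println!("segment mapping ok");
}
```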

11.2.6.2 Per-Architecture Isolation Cost Analysis

The x86_64 MPK WRPKRU instruction provides ~23-cycle domain switches (measured on Skylake-class server cores; varies by microarchitecture — see Section 24.4 for full range: 11 cycles on Alder Lake, up to 260 cycles on Atom). Other architectures use different mechanisms with different cost profiles:

| Architecture | Mechanism | Domain switch cost | Domains | Notes |
|---|---|---|---|---|
| x86_64 | MPK (WRPKRU) | ~23 cycles | 12 for drivers | 16 total keys. PKEY 0 (core), 1 (shared descriptors), 14 (shared DMA), 15 (guard) reserved for infrastructure. |
| x86_64 (no MPK) | Page table switch + ASID | ~200-400 cycles | Unlimited | Used when MPK unavailable (pre-Skylake). Full CR3 write + TLB management. |
| aarch64 (mainstream) | Page table switch + ASID | ~150-300 cycles | Unlimited | Standard mechanism on all current ARM servers: Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920. TTBR0_EL1 write + ISB + TLBI ASIDE1IS. |
| aarch64 + POE (ARMv8.9+/ARMv9.4+) | MSR POR_EL0 + ISB | ~40-80 cycles | 7 usable | Optional acceleration: ARM FEAT_S1POE available on Cortex-X4+. ISB barrier required (~20-40 cycles). Provides ~2-4x improvement over page-table path. |
| aarch64 + MTE | Not viable for domain isolation | N/A | N/A | MTE assigns 4-bit tags per 16-byte granule, but tags are compared per-pointer — no single-register switch exists. Valuable for memory safety, not domain isolation. |
| armv7 | DACR (MCR p15 + ISB) | ~30-40 cycles | 12 usable | 16 total domains. Domain 0 (core), 1 (shared descriptors), 14 (shared DMA), 15 (guard) reserved for infrastructure — matching x86 MPK deductions. Single MCR p15, 0, Rd, c3, c0, 0 writes all 16 domain permissions. ARM ARM (B3.7.2) requires ISB after DACR write to guarantee the new domain permissions are observed by subsequent memory accesses (confirmed by Linux set_domain() in arch/arm/include/asm/domain.h). Total: MCR (~10-15 cycles) + ISB (~15-25 cycles). |
| armv7 (fallback) | Page table switch + CONTEXTIDR | ~150-300 cycles | Unlimited | MCR to TTBR0 + ISB + TLBI. Similar cost profile to aarch64 page-table path. |
| riscv64 | Tier 1 not available | N/A — drivers use Tier 0 or Tier 2 | N/A | No suitable hardware isolation exists with paging enabled. SPMP requires paging disabled; Smmtt targets confidential computing. Drivers choose Tier 0 (no isolation overhead) or Tier 2 (full process isolation). |
| ppc32 | Segment registers (mtsrin + isync) | ~20-40 cycles | 12 usable | 16 total segments. Segment 0 (core), 1 (shared descriptors), 14 (shared DMA), 15 (guard) reserved for infrastructure — matching x86 MPK deductions. Single mtsrin instruction per 256 MB segment + isync barrier (required by Power ISA §5.4.4.2 to ensure subsequent accesses use the new segment context). Comparable to armv7 DACR + ISB cost. |
| ppc32 (fallback) | Page table switch | ~200-400 cycles | Unlimited | Full TLB invalidation + page table base update. |
| ppc64le (Radix) | PID switch (mtspr PIDR) | ~30-60 cycles | Process-table scoped | POWER9+ Radix MMU. mtspr PIDR + isync. ~2-3x MPK cost. |
| ppc64le (HPT) | HPT + LPAR switch | ~200-400 cycles | Unlimited | POWER8 Hashed Page Table (pre-POWER9 fallback when Radix mode is unavailable). Tier 1 isolation uses LPAR segment table manipulation + tlbie invalidation. |
| s390x | Tier 1 not available | N/A — drivers use Tier 0 or Tier 2 | N/A | Storage Keys (4-bit per page, ISK/SSK) are page-granularity — too coarse for fast domain switching. Tier 2 uses subchannel protection for I/O device isolation. |
| loongarch64 | Tier 1 not available | N/A — drivers use Tier 0 or Tier 2 | N/A | No hardware isolation mechanism exists. Tier 2 uses IOMMU for device isolation. |

Impact on performance budget — The Section 1.3 overhead analysis uses x86_64 MPK (~23 cycles per switch, ~92 cycles per I/O round-trip):

| Architecture | Overhead per NVMe 4KB read | Overhead per TCP RX |
|---|---|---|
| x86_64 MPK | +1% (92 cycles / 10μs) | +2% (~92 cycles / 5μs, with NAPI batching; naive per-packet is ~17-26%, see Section 16.12) |
| aarch64 page-table (mainstream) | +6-12% (600-1200 cycles / 10μs) | +12-24% (600-1200 cycles / 5μs) |
| aarch64 + POE (ARMv8.9+/ARMv9.4+) | +2-3% (160-320 cycles / 10μs) | +3-6% (160-320 cycles / 5μs) |
| armv7 DACR | +1-2% (120-160 cycles / 10μs) | +2-3% (120-160 cycles / 5μs) |
| riscv64 | N/A — Tier 1 not available; Tier 0 drivers have zero isolation overhead (same as Linux), Tier 2 has full process isolation overhead | N/A |
| ppc32 segments | +1-2% (80-160 cycles / 10μs) | +2-3% (80-160 cycles / 5μs) |
| ppc64le Radix | +1-2% (120-240 cycles / 10μs) | +2-5% (120-240 cycles / 5μs) |
| s390x | N/A — Tier 1 not available; Tier 0 drivers have zero isolation overhead, Tier 2 has full process isolation overhead | N/A |
| loongarch64 | N/A — Tier 1 not available; Tier 0 drivers have zero isolation overhead, Tier 2 has full process isolation overhead | N/A |

For armv7 with DACR (including mandatory ISB) and ppc32 with segment registers (including mandatory isync), the overhead remains within the 5% budget, comparable to x86 MPK. For aarch64 with POE and ppc64le with Radix PID, the overhead remains within the 5% budget for storage workloads. On mainstream AArch64 (page-table path), Tier 1 overhead reaches 6-12%, which exceeds the 5% budget for I/O-heavy workloads; administrators can promote performance-critical drivers to Tier 0 or demote to Tier 2 as appropriate.

ARM server reality — Page-table + ASID isolation (~150-300 cycles) is the mechanism that runs on nearly all currently deployed ARM servers. FEAT_S1POE is optional from ARMv8.9/ARMv9.4. Current mainstream datacenter cores — Neoverse V2 (ARMv9.0, AWS Graviton 4, Google Axion), Neoverse V3 (ARMv9.2, AWS Graviton 5, Azure Cobalt 200), Ampere Altra (ARMv8.2), and Kunpeng 920 (ARMv8.2) — do not implement POE. The page-table path is not a fallback; it is the standard operating mode for AArch64. POE is a hardware acceleration that becomes available on ARMv8.9+/ARMv9.4+ silicon (Cortex-X4+) and reduces per-switch cost by ~2-4x when present.

RISC-V reality — RISC-V currently has no hardware isolation mechanism suitable for Tier 1. Tier 1 isolation is not available on RISC-V; drivers choose Tier 0 (in-kernel, fully trusted, zero isolation overhead) or Tier 2 (Ring 3 + IOMMU). The 5% overhead budget applies to operations that do run — without the Tier 1 isolation layer, there is no overhead to measure on that path. Tier 2 remains available for drivers where isolation is required. When RISC-V ISA extensions provide suitable isolation primitives, UmkaOS will support Tier 1 on RISC-V without architectural changes to the driver model.

11.2.7 Adaptive Isolation Policy (Graceful Degradation)

UmkaOS targets eight architectures with fundamentally different isolation capabilities. The design philosophy: use the best isolation the hardware provides; when the hardware provides nothing, degrade gracefully — don't refuse to run. This mirrors Linux's approach to every hardware feature.

Three boot-time modes, selectable via umka.isolation= kernel parameter or runtime sysfs:

  • strict (default when fast isolation available): All Tier 1 drivers run in hardware-isolated domains. Full isolation at ~23-80 cycle cost per switch (register-based) or ~150-300 cycles (page-table, AArch64 mainstream).
  • degraded (default on AArch64 mainstream): the page-table path runs the full three-tier model with ~150-300 cycle overhead per crossing. Despite the mode name, this is the normal operating mode for current ARM server deployments, not an exceptional state.
  • performance: Tier 1 drivers placed in Tier 0 — zero boundary-crossing overhead, matching Linux exactly. IOMMU DMA fencing and capability checks remain active. Appropriate for I/O-heavy workloads where the page-table path overhead is unacceptable.

On RISC-V, the adaptive policy always selects Tier 0 for all Tier 1 drivers — Tier 1 isolation is not available on RISC-V due to hardware capability limitations. This is not a performance mode selection; it is a platform capability constraint that will be resolved when RISC-V hardware provides suitable isolation primitives.

IsolationCaps-to-tier decision function: At driver load time, the KABI loader calls arch::current::isolation::max_supported_tier() to determine the highest isolation tier the hardware can enforce. The function inspects boot-detected capabilities (MPK via CPUID, POE via ID_AA64MMFR3_EL1.S1POE, DACR availability, segment register count) and returns: Tier 1 if register-based or page-table domain isolation is available, Tier 0 if no isolation mechanism exists (RISC-V, s390x, LoongArch64). Tier 2 is always available (Ring 3 + IOMMU). The function is called once during driver registration to clamp the requested tier. The driver's preferred_tier from its manifest is clamped to this hardware ceiling before applying signing-cert and license policy checks (Section 11.3).
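The load-time clamp plus fallback resolution can be sketched as follows. These are hypothetical types — the real max_supported_tier() probes boot-detected capabilities (CPUID, ID_AA64MMFR3_EL1, etc.); here detection is reduced to a single boolean:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Tier { T0, T1, T2 }

#[derive(Clone, Copy)]
struct IsolationCaps {
    /// True when any domain isolation exists: MPK, POE, DACR, segments,
    /// Radix PID, or the page-table fallback path.
    domain_isolation: bool,
}

#[derive(Clone, Copy, PartialEq, Debug)]
enum FallbackBias { Isolation, Performance }

/// Mirror of max_supported_tier(): T1 when any domain isolation exists,
/// otherwise T0. Tier 2 (Ring 3 + IOMMU) is always available and is
/// handled by the clamp below, not by this ceiling.
fn max_supported_tier(caps: IsolationCaps) -> Tier {
    if caps.domain_isolation { Tier::T1 } else { Tier::T0 }
}

/// Clamp a driver's preferred tier to what the hardware can enforce.
/// A T1 request on hardware without domain isolation resolves via the
/// driver's fallback_bias: Performance -> T0, Isolation -> T2.
fn clamp_tier(preferred: Tier, caps: IsolationCaps, bias: FallbackBias) -> Tier {
    match (preferred, max_supported_tier(caps)) {
        (Tier::T1, Tier::T0) => match bias {
            FallbackBias::Performance => Tier::T0,
            FallbackBias::Isolation => Tier::T2,
        },
        (t, _) => t, // T0 and T2 always satisfiable; T1 ok when supported
    }
}

fn main() {
    let mpk = IsolationCaps { domain_isolation: true };   // e.g. x86-64 with MPK
    let riscv = IsolationCaps { domain_isolation: false }; // e.g. riscv64
    assert_eq!(clamp_tier(Tier::T1, mpk, FallbackBias::Isolation), Tier::T1);
    assert_eq!(clamp_tier(Tier::T1, riscv, FallbackBias::Performance), Tier::T0);
    assert_eq!(clamp_tier(Tier::T1, riscv, FallbackBias::Isolation), Tier::T2);
    println!("clamp ok");
}
```

In the real loader, signing-cert and license policy checks run after this clamp, so the clamped tier is a ceiling, not the final placement.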

Per-driver fallback behavior is determined by the driver's fallback_bias field in its KabiDriverManifest (see Section 12.6): FallbackBias::Isolation (default — fall back toward Tier 2) or FallbackBias::Performance (fall back toward Tier 0). Operator overrides via /etc/umka/driver-policy.d/<name>.toml take precedence over the manifest hint.

11.2.7.1 Performance Mode Details

When isolation=performance is set or no fast isolation exists, Tier 0 drivers run in the same protection domain as umka-core with zero boundary-crossing overhead. Performance matches Linux exactly. The system logs a prominent warning:

umka: isolation=performance: Tier 0 drivers have no memory isolation
umka: Tier 0 driver crashes may cause kernel panic (same as Linux monolithic behavior)
umka: Tier 2 drivers retain full crash recovery via Ring 3 + IOMMU process isolation
umka: IOMMU DMA fencing is still active — DMA isolation preserved

Key properties of performance mode:

  • IOMMU DMA fencing remains active — even without MPK memory isolation, DMA operations are still restricted to driver-allocated regions.
  • Crash recovery is best-effort — without memory isolation, a crashing driver may corrupt umka-core state, making recovery impossible.
  • Capability system still enforced — the software-level capability model remains active. Only the memory enforcement is relaxed.
  • No privilege escalation — a driver placed in Tier 0 (from a Tier 1 request) retains its original DeviceCapGrant scope. It does not receive any additional SystemCap entries: same MMIO ranges, same interrupt lines, same DMA allocation limits, same KABI vtable interface. CapabilityGuard validation remains active on every operation. The security invariant is promoted_tier0_caps ⊆ tier1_caps. See Section 11.3 for the full specification.
  • Security model partially degraded — a malicious driver could exploit the shared address space. This mode is appropriate for trusted environments with known drivers.
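The promoted_tier0_caps ⊆ tier1_caps invariant reduces to a subset check. A minimal sketch with illustrative bitmask capabilities — the real DeviceCapGrant carries MMIO ranges, IRQ lines, and DMA limits rather than flag bits, but the subset rule is the same:

```rust
// Illustrative capability bits; names are hypothetical, not the real cap set.
const CAP_MMIO_BAR0: u32 = 1 << 0;
const CAP_IRQ_MSIX: u32 = 1 << 1;
const CAP_DMA_ALLOC: u32 = 1 << 2;
const CAP_KERNEL_WILDCARD: u32 = 1 << 31; // never grantable via promotion

/// The promotion invariant: no bit set in `promoted` that is absent from
/// the original Tier 1 grant.
fn is_subset(promoted: u32, tier1: u32) -> bool {
    promoted & !tier1 == 0
}

fn main() {
    let tier1 = CAP_MMIO_BAR0 | CAP_IRQ_MSIX | CAP_DMA_ALLOC;

    // Promotion to Tier 0 keeps exactly the original scope: allowed.
    assert!(is_subset(tier1, tier1));

    // Any widened grant — here a kernel wildcard — violates the invariant.
    assert!(!is_subset(tier1 | CAP_KERNEL_WILDCARD, tier1));

    println!("promotion invariant holds");
}
```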

Per-driver fallback has two layers:

  1. Driver manifest (.kabi IDL, compiled into binary): the fallback_bias field expresses the driver author's domain knowledge. This is the default when no operator policy exists.
// In umka_nvme.kabi:
module umka_nvme {
    provides nvme_driver >= 1.0;
    requires block_core >= 1.0;
    preferred_tier: 1;
    fallback_bias: performance;   // NVMe is latency-sensitive → prefer Tier 0
}
  2. Operator policy (deployed config file, overrides manifest):
# /etc/umka/driver-policy.d/umka-nvme.toml
[driver]
name = "umka-nvme"
# Override driver's fallback_bias: force Tier 2 isolation on this deployment
no_fast_isolation = "demote_tier2"   # "promote_tier0" | "page_table" | "demote_tier2"

Operator policy options:

  • promote_tier0: run in Tier 0 (fast, no isolation) — for performance-critical drivers
  • page_table: use page-table fallback (slow, but isolated)
  • demote_tier2: move to Tier 2 userspace (full process isolation) — for untrusted or crash-prone drivers

When no operator policy file exists, fallback_bias from the manifest is used directly: Performance maps to promote_tier0, Isolation maps to demote_tier2.
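The two-layer resolution — operator policy overriding the manifest's fallback_bias — can be sketched as follows (hypothetical enum names mirroring the policy strings above):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum FallbackBias { Isolation, Performance }

#[derive(Clone, Copy, PartialEq, Debug)]
enum FallbackAction { PromoteTier0, PageTable, DemoteTier2 }

/// Resolve the fallback action for a driver when fast isolation is missing.
/// Operator policy (the `no_fast_isolation` key) wins over the manifest's
/// fallback_bias; with no policy file, the manifest mapping applies:
/// Performance -> promote_tier0, Isolation -> demote_tier2.
fn resolve_fallback(
    manifest_bias: FallbackBias,
    operator_policy: Option<FallbackAction>,
) -> FallbackAction {
    if let Some(action) = operator_policy {
        return action; // /etc/umka/driver-policy.d/<name>.toml takes precedence
    }
    match manifest_bias {
        FallbackBias::Performance => FallbackAction::PromoteTier0,
        FallbackBias::Isolation => FallbackAction::DemoteTier2,
    }
}

fn main() {
    // umka-nvme's manifest says Performance, but the deployed policy file
    // forces Tier 2 — policy wins.
    assert_eq!(
        resolve_fallback(FallbackBias::Performance, Some(FallbackAction::DemoteTier2)),
        FallbackAction::DemoteTier2
    );
    // No policy file: manifest bias applies directly.
    assert_eq!(
        resolve_fallback(FallbackBias::Performance, None),
        FallbackAction::PromoteTier0
    );
    assert_eq!(
        resolve_fallback(FallbackBias::Isolation, None),
        FallbackAction::DemoteTier2
    );
    println!("policy resolution ok");
}
```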

Historical context — Apple's transition from kexts (Ring 0, no isolation) to DriverKit (userspace, full isolation) took 5 years. UmkaOS's approach is more nuanced: rather than a binary choice between "fast and dangerous" and "safe and slow," hardware-assisted isolation (MPK, POE, DACR, segments, Radix PID) provides a third option — "fast and safe" — on modern hardware. The adaptive isolation policy ensures UmkaOS remains viable on older hardware by honestly trading off isolation for performance when the hardware cannot support both simultaneously.

Tier 1 availability matrix (canonical reference):

| Architecture | Tier 1 mechanism | Available? | Tier 1 driver behavior | Crash containment |
|---|---|---|---|---|
| x86-64 | MPK (WRPKRU) | Yes (all) | Isolated, crash → reload | Yes |
| AArch64 (POE) | POR_EL0 | Yes (ARMv8.9+) | Isolated, crash → reload | Yes |
| AArch64 (no POE) | Page table + ASID | Yes (fallback) | Isolated, higher overhead | Yes |
| ARMv7 | DACR | Yes (all) | Isolated, crash → reload | Yes |
| RISC-V 64 | None | No | Promoted to Tier 0 | No — promoted driver crash = panic (Tier 2 retains full recovery) |
| PPC32 | Segment registers | Yes (all) | Isolated, crash → reload | Yes |
| PPC64LE | Radix PID | Yes (POWER9+) | Isolated, crash → reload | Yes |
| s390x | Storage Keys (page-gran) | No (too coarse) | Promoted to Tier 0 | No — promoted driver crash = panic (Tier 2 retains full recovery) |
| LoongArch64 | None | No | Promoted to Tier 0 | No — promoted driver crash = panic (Tier 2 retains full recovery) |

Consequences for promoted-Tier 0 architectures: (1) no crash containment for promoted drivers — a promoted-Tier-1 driver crash = kernel panic (Section 11.9); Tier 2 drivers on these architectures retain full crash recovery via Ring 3 + IOMMU process isolation, (2) capability validation is sole enforcement for promoted drivers (Section 9.1), (3) performance overhead is zero (no domain switching). References: per-arch mechanisms Section 11.2.

11.2.7.2 Capability Validation for Promoted-Tier 0 Drivers

On RISC-V, s390x, and LoongArch64, drivers that request Tier 1 are placed in Tier 0 (or Tier 2) at load time because no hardware memory domain isolation exists. For those placed in Tier 0, this eliminates the domain switch boundary but does NOT eliminate capability validation. The capability check path for promoted-T0 drivers differs from native Tier 0 (static/loadable) code in one critical respect: every KABI vtable dispatch still validates capabilities.

Why this matters: Without hardware isolation, the only enforcement layer preventing a promoted driver from accessing resources outside its grant scope is the software capability check. If promoted-T0 drivers bypassed capability validation (as native Tier 0 code does for performance), a single bug could escalate into unrestricted kernel access — turning a crash-containment loss into a privilege-escalation loss.

Dispatch path for promoted-T0 drivers (T0-validated transport):

/// Per-driver runtime context for promoted-T0 dispatch. One instance exists per
/// loaded driver domain. Fields are read on every `t0_validated_dispatch` call
/// (hot path) so the struct is kept small and cache-line-friendly.
///
/// `liveness` uses `AtomicU8` encoding: `LIVENESS_ACTIVE = 0`, `LIVENESS_CRASHED = 1`.
/// The crash handler (`domain_mark_crashed()`) stores `LIVENESS_CRASHED` with
/// `Release` ordering; dispatch checks it with `Acquire`.
pub struct DriverContext {
    /// Identity of the isolation domain this driver runs in.
    pub domain_id: DomainId,
    /// Capability grant describing which device resources this driver may access.
    pub granted_caps: DeviceCapGrant,
    /// Driver liveness state. Checked before every dispatch to prevent calling
    /// into a crashed domain. See `t0_validated_dispatch` step 1.
    pub liveness: AtomicU8,
    /// Monotonic generation counter. Incremented on each driver reload so that
    /// stale references (e.g., cached vtable pointers) are detected.
    /// Initialized on the new DriverContext before RCU-publish.  Readers see
    /// either the old or new context via RCU, so no torn read occurs.
    /// AtomicU64 for defense-in-depth (eliminates theoretical race under
    /// aggressive compiler reordering).  Load with `Relaxed` in handle-
    /// validation paths — the `liveness` Acquire provides ordering.
    /// u64: at 1 billion reloads/sec, wraps in ~584 years — safe for 50-year uptime.
    ///
    /// # Safety invariant
    /// Every code path that loads `generation` with `Relaxed` ordering MUST
    /// first load `liveness` with `Acquire` on the same thread. The `liveness`
    /// Acquire synchronizes with the crash handler's `Release` store to
    /// `liveness`, establishing a happens-before edge that makes the
    /// `generation` value visible. A `Relaxed` load of `generation` WITHOUT
    /// a preceding `liveness.load(Acquire)` is unsound on weakly-ordered
    /// architectures (ARM, RISC-V, PPC) — it may see a stale generation
    /// from a recycled `DriverContext`.
    pub generation: AtomicU64,
    /// Amortized capability validation token (cached from CapTable).
    pub cap_token: CapValidationToken,
    /// Pointer to the global capability table generation counter for this domain.
    pub cap_table_gen: &'static AtomicU64,
    /// Cached validated capability (populated on first successful validation).
    pub validated_cap: ValidatedCap,
    /// RCU-protected system-wide capability set granted to this driver.
    pub granted_syscaps: RcuCell<SystemCaps>,
}

/// Liveness state: the driver domain is active and accepting dispatches.
pub const LIVENESS_ACTIVE: u8 = 0;
/// Liveness state: the driver domain has crashed and must not be dispatched into.
pub const LIVENESS_CRASHED: u8 = 1;

/// T0-validated transport: direct vtable call with inline capability check.
/// Used for drivers placed in Tier 0 that originally requested Tier 1.
///
/// Cost: ~8-12 cycles per dispatch (capability check + vtable call).
/// Compare: native T0 (no check, ~2-5 cycles), T1 ring buffer (~23+ cycles).
#[inline(always)]
pub fn t0_validated_dispatch<R>(
    driver: &DriverContext,
    required_cap: CapType,
    required_perms: PermissionBits,
    vtable_fn: fn(&DriverContext) -> R,
) -> Result<R, CapError> {
    // 1. Verify the driver domain is still alive. If the domain has been
    //    marked crashed (e.g., after a fault in a previous dispatch),
    //    reject immediately — dispatching into a crashed domain is UB.
    if driver.liveness.load(Acquire) != LIVENESS_ACTIVE {
        return Err(CapError::DriverCrashed);
    }
    // 2. Check the driver's CapValidationToken generation against the
    //    capability table's current generation. O(1), ~3-5 cycles.
    //    This catches revoked capabilities without a full table lookup.
    if driver.cap_token.generation() != driver.cap_table_gen.load(Acquire) {
        return Err(CapError::Revoked);
    }
    // 3. Verify the required permission bits are present in the driver's
    //    cached ValidatedCap. O(1), ~2-3 cycles (bitfield AND + branch).
    if !driver.validated_cap.permissions().contains(required_perms) {
        return Err(CapError::InsufficientPermissions);
    }
    // 4. SystemCaps dual-check (matches Tier 1 kabi_dispatch_with_vcap Step 4).
    //    Required because promoted-T0 drivers on RISC-V/s390x/LoongArch64
    //    bypass the KABI ring protocol and call vtable methods directly.
    //    Without this check, a promoted-T0 driver could escalate from
    //    PermissionBits (object-level) to SystemCaps (system-wide) by
    //    calling a method that requires both.
    let required_syscap = driver.vtable_required_syscaps(required_cap);
    if !required_syscap.is_empty()
        && !driver.granted_syscaps.rcu_read().contains(required_syscap)
    {
        return Err(CapError::InsufficientPermissions);
    }
    // 5. Direct vtable call — no ring buffer, no domain switch.
    Ok(vtable_fn(driver))
}
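The crash-handler side of this handshake — the `Release` store that step 1's `Acquire` load synchronizes with — can be sketched as follows. This is an illustrative fragment, not the kernel implementation: the liveness constants are repeated for self-containment, and `domain_mark_crashed` here takes the atomic directly rather than a domain object.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

pub const LIVENESS_ACTIVE: u8 = 0;
pub const LIVENESS_CRASHED: u8 = 1;

/// Crash handler: publish the crashed state with `Release` so that any
/// dispatcher that subsequently loads `liveness` with `Acquire` observes
/// both the flag and everything the crash handler wrote before it.
pub fn domain_mark_crashed(liveness: &AtomicU8) {
    liveness.store(LIVENESS_CRASHED, Ordering::Release);
}

/// Dispatch-side gate mirroring step 1 of `t0_validated_dispatch`.
pub fn may_dispatch(liveness: &AtomicU8) -> bool {
    liveness.load(Ordering::Acquire) == LIVENESS_ACTIVE
}
```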

Transport selection at load time:

| Effective Tier | Transport | Capability Check | Isolation |
|---|---|---|---|
| Tier 0 static | T0 (direct, unchecked) | None (trusted core code) | None |
| Tier 0 loadable | T0 (direct, unchecked) | None (trusted, signed, load_once) | None |
| Tier 1 → Tier 0 promoted | T0-validated (direct + inline cap check) | Every dispatch (~8-12 cycles) | None (hardware unavailable) |
| Tier 1 (hardware isolation) | T1 (ring buffer) | At ring entry (~23+ cycles, includes domain switch) | MPK/POE/DACR |
| Tier 2 | T2 (IPC) | At syscall boundary | Full process isolation |

The KabiTransport enum distinguishes Direct (native T0, no check) from DirectValidated (promoted T0, inline check):

pub enum KabiTransport {
    /// Native Tier 0: direct vtable call, no capability check.
    /// Only for static core code and signed Tier 0 loadable modules.
    ///
    /// **RCU invariant**: T0 callers MUST hold `rcu_read_lock()` for the
    /// entire duration of a vtable dereference and method call. The vtable
    /// pointer is read via `AtomicPtr::load(Acquire)` and may be swapped
    /// by live evolution at any time. RCU read-side protection guarantees
    /// that the old vtable memory is not freed until the grace period
    /// completes, which cannot happen while any CPU holds an RCU read lock.
    /// Without `rcu_read_lock()`, a T0 caller could dereference a vtable
    /// pointer that has been freed by the evolution framework after the
    /// `AtomicPtr` was swapped but before the method call completes.
    /// This invariant is enforced at compile time: all T0 dispatch macros
    /// (`kabi_call_t0!`) expand to `rcu_read_lock(); dispatch; rcu_read_unlock()`.
    Direct,
    /// Promoted Tier 0: direct vtable call with inline capability validation.
    /// For Tier 1 drivers running as Tier 0 on architectures without
    /// hardware memory domain isolation (RISC-V, s390x, LoongArch64).
    /// Same RCU read-lock requirement as `Direct`.
    DirectValidated,
    /// Tier 1: SPSC/MPSC ring buffer with domain switch.
    Ring,
    /// Tier 2: IPC message passing across address spaces.
    Ipc,
}
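The RCU read-lock invariant on `Direct` can be illustrated with a toy expansion of `kabi_call_t0!`. This is a sketch under stubbed assumptions — a global depth counter stands in for the real per-CPU RCU read-lock state, and the macro body only mirrors the documented `rcu_read_lock(); dispatch; rcu_read_unlock()` shape:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stub RCU read-side primitives: a global nesting counter models the
/// per-CPU read-lock state (illustrative only, not the kernel RCU API).
pub static RCU_DEPTH: AtomicUsize = AtomicUsize::new(0);
pub fn rcu_read_lock() { RCU_DEPTH.fetch_add(1, Ordering::SeqCst); }
pub fn rcu_read_unlock() { RCU_DEPTH.fetch_sub(1, Ordering::SeqCst); }

/// Toy expansion matching the documented shape: the vtable dereference
/// and method call stay inside the read-side critical section, so the
/// old vtable cannot be freed mid-call by live evolution.
macro_rules! kabi_call_t0 {
    ($dispatch:expr) => {{
        rcu_read_lock();
        let result = $dispatch;
        rcu_read_unlock();
        result
    }};
}
```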

Invariants:

  • DirectValidated is selected if and only if the driver's manifest declares preferred_tier: 1 (or higher) AND the architecture reports arch::current::isolation::supports_fast_isolation() == false.
  • A driver that declares preferred_tier: 0 is loaded as native Tier 0 (transport Direct, no capability check) — it is fully trusted code that has been reviewed and signed to the Tier 0 standard.
  • The CapValidationToken generation check in t0_validated_dispatch uses the same mechanism as the Tier 1 ring buffer entry check (Section 12.3) — revocation propagates identically regardless of transport.
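The selection rule can be sketched as a small load-time function. This is a minimal illustration, assuming a simplified manifest (only `preferred_tier`) and a boolean `fast_isolation` probe standing in for `arch::current::isolation::supports_fast_isolation()`; a driver preferring Tier 2 is collapsed to IPC:

```rust
/// Simplified mirror of the document's KabiTransport enum, for illustration.
#[derive(Debug, PartialEq)]
pub enum KabiTransport { Direct, DirectValidated, Ring, Ipc }

/// Hypothetical reduced manifest: only the field the rule depends on.
pub struct Manifest { pub preferred_tier: u8 }

/// Load-time transport selection sketch implementing the invariants above.
pub fn select_transport(m: &Manifest, fast_isolation: bool) -> KabiTransport {
    match (m.preferred_tier, fast_isolation) {
        (0, _)     => KabiTransport::Direct,          // trusted, signed Tier 0
        (1, true)  => KabiTransport::Ring,            // hardware domains available
        (1, false) => KabiTransport::DirectValidated, // promoted T0, inline cap check
        _          => KabiTransport::Ipc,             // Tier 2 process isolation
    }
}
```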

Isolation is one of eight core capabilities, not the only one. Even on hardware without fast isolation (RISC-V where Tier 1 is unavailable, older x86 without MPK), UmkaOS still provides: driver crash recovery (best-effort in Tier 0, full in Tier 2), distributed kernel primitives, heterogeneous compute management, structured observability, power budgeting, post-quantum security, live kernel evolution, and a stable driver ABI. A RISC-V server operating without Tier 1 isolation retains every one of these other capabilities. This is a hardware-imposed constraint, not a design failure — and it is resolved when suitable RISC-V hardware becomes available.

11.2.8 Isolation Tiers vs. Replaceability: Orthogonal Axes

The Tier 0/1/2 isolation model described in this section is orthogonal to the Nucleus/Evolvable replaceability model defined in Section 1.1. These two classification axes are independent and must not be conflated:

  • Isolation tier (this section) determines the hardware memory boundary: does the component run in the kernel's shared address space (Tier 0), in a hardware-isolated Ring 0 domain (Tier 1), or in a separate Ring 3 process (Tier 2)?
  • Replaceability (Section 1.1) determines whether a component can be live-replaced at runtime without reboot: Nucleus (non-replaceable, formally verified) or Evolvable (swappable via the evolution framework).

Both Nucleus and Evolvable code runs in Tier 0. Nucleus is not "more isolated" than Evolvable — they share the same Ring 0 privilege level and the same address space. The difference is that Nucleus code is formally verified and cannot be swapped, while Evolvable code can be live-replaced. Similarly, an Evolvable component can run at any tier: Tier 0 (EEVDF scheduler, NAPI poll), Tier 1 (NVMe driver, TCP stack), or Tier 2 (USB driver).

| Axis | Values | What it determines |
|---|---|---|
| Replaceability | Nucleus / Evolvable | Can it be live-replaced at runtime? |
| Isolation | Tier 0 / 1 / 2 | Hardware memory isolation boundary |

| Component | Replaceability | Tier | Rationale |
|---|---|---|---|
| Evolution primitive | Nucleus | 0 | Formally verified; evolution depends on its correctness |
| Capability table lookup | Nucleus | 0 | Security-critical; formally verified with Verus |
| EEVDF scheduler | Evolvable | 0 | Live-replaceable; Tier 0 for hot-path performance |
| NAPI poll loop | Evolvable | 0 | Live-replaceable; Tier 0 for packet processing speed |
| NVMe driver | Evolvable | 1 | MPK-isolated; crash-recoverable in ~50-150 ms |
| USB driver | Evolvable | 2 | Ring 3 process isolation; restart in ~10 ms |

See Section 1.1 for the full specification of the replaceability model, including the Nucleus minimization principle and the evolution framework interaction.


11.3 Driver Isolation Tiers

11.3.1 Tier Classification

Tier 0 has two sub-forms: static (compiled into the kernel binary) and loadable (dynamically loaded but running in the Core domain with no isolation). Both are Tier 0 in the trust and crash-consequence sense; they differ in deployment. See Section 11.3.2 for details.

| Property | Tier 0 Static | Tier 0 Loadable | Tier 1 | Tier 2 |
|---|---|---|---|---|
| Location | Compiled into kernel binary | Ring 0, Core domain, dynamically loaded | Ring 0, dynamically loaded, domain-isolated | Ring 3, separate process |
| KABI transport | Direct vtable call (T0) | Direct vtable call (T0) | Ring buffer (T1) | Ring buffer (T2) |
| Isolation | None | None (same address space) | Hardware memory domains + IOMMU | Full address space + IOMMU |
| Crash behavior | Kernel panic | Kernel panic | Reload module (~50-150ms, design target) | Restart process (~10ms) |
| DMA access | Unrestricted | Unrestricted | IOMMU-fenced | IOMMU-fenced |
| Performance | Zero overhead | ~2–5 cycles (vtable dispatch) | ~23 cycles domain switch + marshaling (x86 MPK) | ~200-500 cycles per crossing |
| Trust level | Maximum (core kernel) | Maximum (signed, sealed index) | High (verified, signed) | Low (untrusted acceptable) |
| Unloadable | No (static) | No (load_once: true) | Yes (domain revocation) | Yes (process exit) |
| Examples | APIC, timer, early console, Core allocator | SCSI mid-layer, MDIO bus, SPI bus core, cfg80211 framework, V4L2 core | NVMe, NIC, TCP/IP, FS, GPU, KVM, audio (default), WiFi driver | USB, input, BT, audio (optional demotion), HID |

Tier 1 isolation mechanism per architecture:

The "hardware memory domains" used for Tier 1 isolation are architecture-specific. Not all architectures have a fast isolation mechanism; RISC-V has none at all and runs Tier 1 drivers as Tier 0. See Section 11.2 for per-architecture cycle costs and the adaptive policy.

| Architecture | Tier 1 Mechanism | Switch Cost | Domains | Availability |
|---|---|---|---|---|
| x86-64 | MPK (WRPKRU) | ~23 cycles | 12 usable | Intel Skylake+ / AMD Zen 3+ |
| x86-64 (no MPK) | Page table + ASID | ~200-400 cycles | Unlimited | All x86-64 |
| AArch64 (mainstream) | Page table + ASID | ~150-300 cycles | Unlimited | All AArch64 — standard mechanism on Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920 |
| AArch64 + POE (ARMv8.9+/ARMv9.4+) | POE (MSR POR_EL0 + ISB) | ~40-80 cycles | 7 usable (3 for drivers after infra deductions; see Section 24.5) | Optional acceleration: FEAT_S1POE on Cortex-X4+ |
| ARMv7 | DACR (MCR p15 + ISB) | ~30-40 cycles | 12 usable | All ARMv7 (universal). 16 total domains; 4 reserved for infrastructure (domain 0=core, 1=shared descriptors, 14=shared DMA, 15=guard). |
| RISC-V 64 | Tier 1 not available — drivers use Tier 0 or Tier 2 | N/A (no Tier 1 boundary) | N/A | Hardware capability not yet available on any RISC-V silicon |
| PPC32 | Segment registers (mtsr + isync) | ~20-40 cycles | 12 usable | All PPC32. 16 total segments; 4 reserved for infrastructure (segment 0=core, 1=shared descriptors, 14=shared DMA, 15=guard). |
| PPC64LE (POWER9+) | Radix PID (mtspr PIDR) | ~30-60 cycles | Process-scoped | POWER9+ with Radix MMU |
| PPC64LE (POWER8) | HPT + LPAR | ~200-400 cycles | Unlimited | POWER8 |
| s390x | Tier 1 not available — drivers use Tier 0 or Tier 2 | N/A (no Tier 1 boundary) | N/A | Storage Keys (4-bit per page, ISK/SSK) are page-granularity — too coarse for fast domain isolation |
| LoongArch64 | Tier 1 not available — drivers use Tier 0 or Tier 2 | N/A (no Tier 1 boundary) | N/A | No hardware isolation mechanism exists on LoongArch64 |

On RISC-V 64, Tier 1 isolation is not available. As of early 2026, no ratified RISC-V extension provides a suitable intra-address-space isolation mechanism with paging enabled (SPMP requires paging disabled; Smmtt targets confidential computing; Pointer Masking Smnpm/Ssnpm, ratified Oct 2024, is not a domain isolation mechanism). Drivers that would request Tier 1 on other architectures must choose Tier 0 (in-kernel, no hardware isolation boundary, identical to the Linux monolithic driver model) or Tier 2 (Ring 3 + IOMMU) on RISC-V, based on licensing, driver preference, and admin policy. When RISC-V hardware provides suitable isolation primitives, UmkaOS will activate Tier 1 on RISC-V without requiring changes to the driver model or driver manifests.

Capability lifecycle during tier transitions:

  • Tier 0 → Tier 1 (initial registration): Driver receives capabilities through the DeviceCapGrant bundle during device_init() (Section 11.4). Capabilities are scoped to the driver's isolation domain.

  • Tier 1 → Tier 2 (demotion / crash-triggered restart): All existing CapHandle entries for this driver are revoked (generation increment in the capability table). The kernel creates a new capability space for the Tier 2 driver process and grants fresh capabilities with the same permissions through the Tier 2 registration handshake (umka_driver_register). Cap IDs change — the driver re-acquires them through standard registration. Ring 3 drivers cannot hold kernel-internal capabilities directly; they hold file descriptors that proxy to kernel CapHandle entries.

Runtime PM coordination during demotion: Before revoking capabilities and tearing down the isolation domain, the kernel freezes the device's runtime PM state to prevent concurrent power transitions during the tier change:

Tier 1 → Tier 2 demotion sequence:
  1. rtpm_disable(dev)       — freeze runtime PM; device stays in current
                                power state (Active or Suspended). No suspend/
                                resume callbacks will fire during the transition.
  2. If RtpmState == Suspended: rtpm_get(dev) — resume device to Active (D0).
     The driver needs to be Active for capability revocation and domain teardown
     to complete cleanly (some drivers flush state during shutdown).
  3. Revoke all CapHandle entries (generation increment).
  4. Tear down Tier 1 isolation domain (reclaim MPK/POE/DACR domain slot).
  5. Spawn Tier 2 driver process, perform registration handshake.
  6. Transfer RuntimePm state to new driver instance: the kernel re-attaches
     the DeviceNode's RuntimePm struct to the new Tier 2 driver's ops table.
  7. rtpm_enable(dev)        — thaw runtime PM. The device is now Active in
                                the new tier and will autosuspend normally.

If the demotion is crash-triggered (driver faulted), step 2 uses PCI FLR or device-specific reset instead of rtpm_get(), because the driver's suspend/resume callbacks may not be functional.

  • Tier 2 → Tier 1 (promotion): Reverse of demotion — Tier 2 proxy capabilities are revoked, new Tier 1 domain capabilities are granted.

Runtime PM coordination during promotion: Same freeze/thaw pattern:

Tier 2 → Tier 1 promotion sequence:
  1. rtpm_disable(dev)       — freeze runtime PM.
  2. If RtpmState == Suspended: rtpm_get(dev) — resume to Active.
  3. Send DRIVER_SHUTDOWN to Tier 2 process (graceful exit).
  4. Revoke Tier 2 proxy capabilities.
  5. Create Tier 1 isolation domain (allocate MPK/POE/DACR domain slot).
  6. Load driver into Tier 1 domain, perform KABI registration.
  7. Transfer RuntimePm state to new driver instance.
  8. rtpm_enable(dev)        — thaw runtime PM.
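Both sequences follow the same freeze → revoke/teardown → rebuild → thaw pattern. The demotion side can be sketched with stubbed step names recorded in order, so the key invariants (freeze first, thaw last, device reset instead of rtpm_get when crash-triggered) are visible; these names mirror the sequences above but are not real kernel APIs:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum RtpmState { Active, Suspended }

/// Illustrative sketch of the Tier 1 → Tier 2 demotion ordering. Each step
/// is recorded as a label; a real implementation would call kernel services.
pub fn demote_tier1_to_tier2(state: RtpmState, crash_triggered: bool) -> Vec<&'static str> {
    let mut log = vec!["rtpm_disable"]; // step 1: freeze runtime PM first
    if state == RtpmState::Suspended {
        // Step 2: resume to Active. A crashed driver's callbacks cannot be
        // trusted, so a crash-triggered demotion uses FLR/device reset.
        log.push(if crash_triggered { "device_reset" } else { "rtpm_get" });
    }
    log.extend([
        "revoke_caps",      // step 3: generation increment
        "teardown_domain",  // step 4: reclaim MPK/POE/DACR slot
        "spawn_tier2",      // step 5: new driver process + handshake
        "transfer_rtpm",    // step 6: re-attach RuntimePm to new ops table
        "rtpm_enable",      // step 7: thaw runtime PM last
    ]);
    log
}
```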
  • Tier 1 unavailable (architectures without fast isolation): On RISC-V, s390x, and LoongArch64 (where Tier 1 hardware isolation is unavailable), drivers that request Tier 1 are placed in Tier 0 or Tier 2 at load time based on fallback_bias, licensing constraints (proprietary drivers must use Tier 2), and admin overrides. A driver placed in Tier 0 has no isolation boundary (same address space, no domain switch overhead) but does not receive additional capabilities. It retains exactly the same DeviceCapGrant scope it would have received as a Tier 1 driver — same device MMIO ranges, same interrupt lines, same DMA buffer allocation limits. The CapabilityGuard still validates every operation against the driver's capability table on every access.

What Tier 0 promotion does NOT grant:

  • No SystemCap escalation (no CAP_SYS_ADMIN, no raw port I/O beyond the granted MMIO ranges, no ability to access other drivers' memory regions).
  • No bypass of IOMMU DMA fencing — DMA is still restricted to the driver's IOMMU domain, same as Tier 1.
  • No ability to call kernel-internal functions outside the KABI vtable interface. The driver's compiled binary only links against the KABI SDK; no additional symbols are resolved at Tier 0.
  • No write access to kernel data structures outside the driver's own allocation pool.

What Tier 0 promotion does change:

  • Memory isolation enforcement is removed: a bug in the driver can corrupt kernel memory (same risk as Linux monolithic drivers).
  • Crash recovery is best-effort: a fault may propagate to kernel panic rather than being contained to the driver's domain.
  • Domain switch overhead is eliminated: KABI calls use direct vtable dispatch (T0 transport, ~2-5 cycles) instead of ring buffer dispatch (~23+ cycles).

Security invariant: promoted_tier0_caps ⊆ tier1_caps. A promoted Tier 0 driver can never hold capabilities that a Tier 1 configuration would not have granted. This is enforced at load time: device_init() issues the same DeviceCapGrant regardless of the effective isolation tier. The tier selection happens in the transport layer (KabiTransport::Direct vs KabiTransport::Ring), not in the capability grant path. The driver must complete re-registration to receive the new handles.

  • Within a tier (driver reload after crash): Capabilities are revoked before the old driver instance is torn down and re-granted to the new instance during device_init(). State preserved by crash recovery (Section 11.9) does not include capability handles — they are always re-established.
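The fallback placement on architectures without Tier 1 can be sketched as a small policy function. This is an illustration under stated assumptions: the manifest is reduced to a `bias` field (standing in for fallback_bias) plus a `proprietary` flag, the licensing rule ("proprietary drivers must use Tier 2") is taken as absolute, and the precedence of admin overrides over the driver's bias is an assumption of this sketch, not a specified ordering:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Tier { T0, T2 }

/// Hypothetical reduced manifest view for fallback placement.
pub struct FallbackPolicy {
    pub bias: Tier,        // stands in for the manifest's fallback_bias
    pub proprietary: bool, // licensing constraint: proprietary => Tier 2
}

/// Effective-tier placement when Tier 1 hardware isolation is unavailable
/// (RISC-V, s390x, LoongArch64). Sketch only.
pub fn place_without_tier1(p: &FallbackPolicy, admin_override: Option<Tier>) -> Tier {
    if p.proprietary {
        return Tier::T2;              // licensing: must use Ring 3 isolation
    }
    if let Some(t) = admin_override {
        return t;                     // assumed: admin policy wins for open drivers
    }
    p.bias                            // otherwise the driver's declared preference
}
```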

11.3.2 Tier 0: Boot-Critical and Core Framework Code

Tier 0 encompasses all kernel code that runs in Ring 0 inside the Core memory domain, with no hardware isolation boundary between it and the static kernel binary. A crash in any Tier 0 code — static or loadable — causes a kernel panic. Tier 0 is split into two deployment forms.

11.3.2.1 Tier 0 Static

Compiled directly into the kernel binary. Required before any dynamic loading infrastructure is available:

  • Local APIC and I/O APIC
  • PIT/HPET/TSC timer
  • Early serial/VGA console
  • ACPI table parsing (early boot only). Security trade-off: ACPI tables are firmware-provided data that the kernel must trust at boot. A malicious or buggy BIOS can supply corrupt ACPI tables (malformed AML, overlapping MMIO regions, impossible NUMA topologies). UmkaOS's Tier 0 ACPI parser performs defensive parsing: all table lengths are bounds-checked, AML interpretation uses a sandboxed evaluator with a cycle limit (no infinite loops), and MMIO regions claimed by ACPI are validated against the e820/UEFI memory map before being mapped. Despite these defenses, ACPI parsing remains the largest attack surface in Tier 0. The firmware quirk framework (Section 11.6) provides per-platform overrides for known-buggy tables.

Tier 0 static code is held to the highest review standard and kept minimal. Only code that is genuinely required before the module loader and isolation infrastructure are operational belongs here.

11.3.2.2 Tier 0 Loadable Modules

Dynamically loaded into the Core domain after the module loader initialises, but before or during device enumeration. Tier 0 loadable modules:

  • Run in Ring 0 in the same memory domain as static Core
  • Communicate with static Core and other Tier 0 modules via direct vtable calls (Transport T0, Section 12.6) — not ring buffers
  • Are loaded by the kernel-internal module loader (Section 12.7) without requiring userspace
  • Are never unloaded (load_once: true) — safe Tier 0 unloading would require auditing every execution context in the kernel for stale function pointers, which is not tractable
  • Declare their requires and provides dependencies in their .kabi file (Section 12.7)
  • Are transparent to Tier 1 callers — from Tier 1's perspective, calling a Tier 0 loadable service is identical to calling static Core

When to use Tier 0 loadable vs Tier 1: Use Tier 0 loadable for kernel framework modules that provide services to many other drivers and whose correctness can be established by inspection — bus frameworks (SCSI mid-layer, MDIO, SPI bus core), protocol framework layers (cfg80211, V4L2 core, media controller), and subsystem registries (backlight, NVMEM, rfkill). These are trusted, signed, reviewed code that is too large to statically link but too fundamental to pay Tier 1 ring buffer overhead on every call. Use Tier 1 for hardware drivers and any code that benefits from crash containment — if a bug is plausible, it should be Tier 1.

11.3.3 Tier 1: Kernel-Adjacent Drivers (Hardware Memory Domain Isolated)

Tier 1 is crash containment, not a security boundary. Hardware memory domain isolation (MPK, POE, DACR) prevents a buggy driver from corrupting kernel memory by accident. It does not prevent a compromised driver from escaping deliberately — on x86-64, WRPKRU is unprivileged and any Ring 0 code can execute it. This is a known architectural property of Intel MPK, not an UmkaOS design flaw; it is documented here as an intentional tradeoff, so a Tier 1 MPK escape falls outside the stated threat model rather than constituting a vulnerability. Tier 2 (Ring 3 + IOMMU) is the security boundary for untrusted or third-party drivers. See Section 11.3 for Tier 2.

Performance-critical drivers run in Ring 0 but are isolated via hardware memory domains (MPK on x86-64, DACR on ARMv7, segment registers on PPC32, Radix PID on PPC64LE, page-table + ASID on AArch64 mainstream, POE on AArch64 ARMv8.9+/ARMv9.4+ when available — see "Tier 1 isolation mechanism per architecture" table above). Note: Tier 1 isolation is not available on RISC-V; drivers use Tier 0 or Tier 2 instead. Each driver (or driver group) that does have Tier 1 isolation is assigned a protection domain. The driver can only access:

  • Its own private memory (tagged with its domain key)
  • Shared ring buffers (tagged with the shared domain, read-write)
  • Shared DMA buffers (tagged with DMA domain, read-write)
  • Its MMIO regions (mapped with its domain key)

It cannot access:

  • UmkaOS Core private memory
  • Other Tier 1 drivers' private memory
  • Page tables, capability tables, or scheduler state
  • Arbitrary physical memory

Memory allocation policy by execution context — applies to all driver tiers:

| Context | Slab/Buddy Allocation | Rationale |
|---|---|---|
| Hard IRQ handler | Forbidden | IRQ context cannot sleep; allocators may need to sleep for reclaim. Use pre-allocated per-device buffers. |
| Softirq / NAPI poll | GFP_ATOMIC only (non-blocking, may fail) | Softirq context cannot sleep. Pre-allocating is preferred; GFP_ATOMIC is the fallback for rare dynamic needs. |
| Workqueue / kthread | GFP_KERNEL (blocking, may reclaim) | Process context, may sleep. Standard allocation path. |
| Driver init / probe | GFP_KERNEL | Called from process context during device registration. |
| Completion callback | Forbidden | Completion runs in interrupt or softirq context. Schedule deferred work via workqueue if allocation is needed. |
| Tier 2 driver (Ring 3) | Via umka_driver_dma_alloc syscall only | Tier 2 drivers cannot call slab/buddy directly; the kernel allocates on their behalf. |
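The policy table can be encoded as a total function over execution contexts. A minimal sketch with illustrative names (the real kernel derives this from GFP flags at each call site rather than a central table):

```rust
#[derive(Clone, Copy)]
pub enum ExecContext { HardIrq, Softirq, Workqueue, Probe, Completion, Tier2 }

#[derive(Debug, PartialEq)]
pub enum AllocPolicy {
    Forbidden,  // use pre-allocated buffers or defer to a workqueue
    AtomicOnly, // GFP_ATOMIC: non-blocking, may fail
    Kernel,     // GFP_KERNEL: may sleep for reclaim
    Syscall,    // Tier 2: kernel allocates via umka_driver_dma_alloc
}

/// Map each execution context to the allocation policy from the table above.
pub fn alloc_policy(ctx: ExecContext) -> AllocPolicy {
    match ctx {
        ExecContext::HardIrq | ExecContext::Completion => AllocPolicy::Forbidden,
        ExecContext::Softirq => AllocPolicy::AtomicOnly,
        ExecContext::Workqueue | ExecContext::Probe => AllocPolicy::Kernel,
        ExecContext::Tier2 => AllocPolicy::Syscall,
    }
}
```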

GFP flags are defined in Section 4.2. Tier 1 driver allocations are tagged with the driver's memory domain key so the allocator returns pages in the correct protection domain.

11.3.3.1 Tier 1 Driver Memory Allocation Rights

Tier 1 drivers allocate memory via the KABI KernelServicesVTable (Section 12.1). Internally, the kernel fulfills these allocations from slab caches whose backing pages are mapped into the driver's MPK domain. The driver never calls the slab allocator directly — all allocations are kernel-mediated through KABI. Tier 2 drivers (Ring 3 processes) use standard libc malloc from their own address space.

The table above defines when each GFP class may be used. This subsection defines what allocation APIs Tier 1 drivers may access and what invariants govern their allocations. Tier 1 drivers run in Ring 0 but are confined to their hardware memory domain — the allocation policy enforces this confinement at the API level.

Allowed allocation APIs:

| API | Context | Description |
|---|---|---|
| Slab allocation via KernelServices.alloc | All sleepable contexts | Kernel-provided allocator handle passed to the driver at device_init(). Returns memory tagged with the driver's domain key. The driver never constructs an allocator — it receives one from the KABI init handshake. |
| GfpFlags::KERNEL | Process context (workqueue, kthread, driver init/probe) | Standard allocation that may sleep for reclaim. Most driver allocations use this path. |
| GfpFlags::ATOMIC | Softirq, NAPI poll, or any context that cannot sleep | Draws from a pre-allocated per-CPU reserve pool. May fail under memory pressure — drivers must handle None return. |
| GfpFlags::NOIO | I/O completion paths, block layer callbacks | Prevents the allocator from issuing I/O to reclaim pages, avoiding recursion when the allocating driver is itself part of the I/O path (e.g., filesystem or block driver). |
| DMA buffer allocation via DmaDevice trait | Process context or GfpFlags::ATOMIC for small buffers | Allocates physically contiguous, IOMMU-mapped memory for device DMA. The driver receives a &dyn DmaDevice reference at init time; all DMA allocations go through this trait. See Section 4.14 for the full DMA API (CoherentDmaBuf, StreamingDmaMap, DmaSgl). |

Forbidden operations:

  • Direct buddy allocator access. The buddy allocator is a Core-internal subsystem. Tier 1 drivers allocate through the KABI allocator interface, which internally calls the buddy allocator on the driver's behalf with the correct domain tag. Bypassing this interface would produce pages outside the driver's memory domain, violating isolation.
  • Allocations from another driver's memory domain. Each Tier 1 driver's allocator is scoped to its own domain key. The KABI interface does not expose any mechanism to request memory in a foreign domain. Cross-driver data sharing uses shared ring buffers (Section 11.7) allocated and managed by Core.
  • Unbounded allocation. Every Tier 1 driver operates under a memory limit enforced by a kernel-internal memory accounting structure associated with the driver's isolation domain. The limit is declared in the driver manifest (memory_limit_pages field) and enforced by the KABI allocator — allocation requests that would exceed the limit return None (for ATOMIC) or block until memory is freed (for KERNEL, subject to a timeout). This prevents a single buggy driver from exhausting system memory. The accounting structure tracks both slab objects and page-level allocations attributed to the driver's domain.
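The memory-limit enforcement described above can be sketched with a simplified accounting structure. This is a toy model: field and method names are illustrative, and the real KERNEL path blocks with a timeout rather than failing immediately, which this sketch collapses to an immediate failure:

```rust
/// Hypothetical per-domain allocation accounting (illustrative names).
pub struct DomainAccounting {
    pub used_pages: u64,
    pub limit_pages: u64, // from the manifest's memory_limit_pages field
}

pub enum GfpFlags { Kernel, Atomic }

impl DomainAccounting {
    /// Try to charge `pages` against the domain limit. ATOMIC requests over
    /// the limit fail with None; the real KERNEL path would block until
    /// memory is freed (subject to a timeout), modeled here as failure too.
    pub fn try_charge(&mut self, pages: u64, flags: GfpFlags) -> Option<()> {
        if self.used_pages + pages > self.limit_pages {
            return match flags {
                GfpFlags::Atomic => None,
                GfpFlags::Kernel => None, // real kernel: block with timeout
            };
        }
        self.used_pages += pages;
        Some(())
    }
}
```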

Memory domain tagging:

All memory allocated through the KABI allocator interface is tagged with the requesting driver's protection key (PKEY on x86, domain ID on ARMv7 DACR, segment key on PPC32, Radix PID on PPC64LE, or page-table domain on AArch64). This tagging serves two purposes:

  1. Access enforcement. The hardware memory domain mechanism prevents other Tier 1 drivers and Core code (when running in a driver's context) from accessing pages belonging to a different domain. The domain tag is set in the page table entry at allocation time and remains for the page's lifetime.
  2. Bulk cleanup on driver crash or unload. When a driver faults or is unloaded, Core walks the domain's allocation tracking list and frees all pages tagged with that domain key in a single pass. This is step 6 ("UNLOAD DRIVER — Free all driver-private memory") in the crash recovery sequence (Section 11.9). No per-object destructor callbacks are needed — the domain tag is the sole ownership signal, and bulk freeing is O(n) in the number of allocated pages, not O(n) in the number of individual allocations.
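Bulk cleanup can be sketched as a single pass over the domain's allocation tracking list. A minimal illustration with simplified types (a real implementation would return pages to the buddy allocator rather than drop them):

```rust
/// Simplified domain key and page descriptor for illustration.
#[derive(Clone, Copy, PartialEq)]
pub struct DomainKey(pub u8);

pub struct Page {
    pub domain: DomainKey, // the sole ownership signal for bulk cleanup
}

/// One-pass bulk free: drop every page tagged with the crashed domain's key
/// and keep the rest. Returns the number of pages reclaimed. O(n) in pages,
/// with no per-object destructor callbacks.
pub fn bulk_free_domain(pages: &mut Vec<Page>, crashed: DomainKey) -> usize {
    let before = pages.len();
    pages.retain(|p| p.domain != crashed);
    before - pages.len()
}
```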

Allocator integration and memory pressure:

Tier 1 drivers allocate from the global buddy/slab allocators — there is no per-domain physical memory pool. MPK/POE/DACR protection is applied post-allocation to the virtual mapping (the protection key is set in the PTE, not in the physical page metadata). This means Tier 1 drivers share the same physical memory pool as Core and other Tier 1 drivers; memory pressure is reported via the global watermarks (WMARK_MIN, WMARK_LOW, WMARK_HIGH in Section 4.2). Per-domain accounting is performed by the cgroup memory controller: each Tier 1 driver's isolation domain is associated with a kernel-internal cgroup (Section 17.2) that tracks and limits the domain's memory consumption. When a domain exceeds its cgroup limit, allocations from that domain are throttled or denied (depending on the GFP flags), but the global pool remains available to other domains — a single driver cannot starve the system.

DMA buffers allocated via the DmaDevice trait are also domain-tagged, but their IOMMU mappings are additionally scoped to the device's IOMMU domain. On driver crash, the IOMMU mappings are torn down first (step 2: ISOLATE), then the underlying physical pages are freed during bulk domain cleanup (step 6).

Allocation lifetime semantics:

A Tier 1 driver allocation persists until one of three events:

  1. Explicit free by the driver. The driver calls the KABI deallocation function, which verifies the page belongs to the calling driver's domain (domain tag check) before returning it to the buddy allocator. Freeing a page belonging to a different domain is a fatal error that triggers driver crash recovery.
  2. Driver unload. On orderly unload, Core invokes the driver's device_remove() callback, giving the driver an opportunity to release resources. After the callback returns, Core performs bulk domain cleanup for any remaining allocations as a safety net — well-behaved drivers should have freed everything in device_remove().
  3. Driver crash. On fault, all allocations tagged with the crashed driver's domain key are bulk-freed during the recovery sequence. No driver cooperation is required.

DMA buffers follow the same three-event lifetime but with an additional constraint: the IOMMU mapping must be invalidated before the physical page is freed. The crash recovery sequence handles this ordering (IOMMU teardown in step 2 precedes page release in step 6).

Security limitation: Tier 1 isolation protects against bugs, not exploitation. On x86-64, MPK isolation uses the WRPKRU instruction, which is unprivileged — any Ring 0 code (including Tier 1 driver code) can execute it to modify its own domain permissions and access any MPK-protected memory, including UmkaOS Core (PKEY 0). This means a compromised Tier 1 driver with arbitrary code execution can trivially bypass MPK isolation. On ARMv7, MCR to DACR is privileged (PL1), which is stronger — user-space cannot forge domain switches, but kernel-mode drivers still can. On PPC32 and PPC64LE, segment register and AMR updates are similarly supervisor-mode.

Tier 1 threat model: MPK (and its architectural equivalents) provides defense against accidental corruption — buffer overflows, use-after-free, null dereferences that happen to write to the wrong address. It does not defend against deliberate exploitation where an attacker achieves arbitrary code execution within a Tier 1 driver and intentionally escapes the domain. For the exploitation case, Tier 2 (full process isolation in Ring 3) is the appropriate boundary.

Tier 1 trust requirement: Tier 1 drivers run in Ring 0 with only domain isolation (not address space isolation). They must be treated as trusted code: cryptographically signed, manifest-verified (Section 1.3), and subject to the same security review standard as Core kernel code. Tier 1 is not appropriate for third-party, untrusted, or unaudited drivers. Untrusted drivers must use Tier 2 (Ring 3 process isolation) where a compromised driver cannot escalate to kernel privilege regardless of the exploit technique. See Section 11.3 (Signal Delivery Across Isolation Boundaries) for the complete domain crossing specification during signal handling.

Mitigations that raise the bar for exploitation are detailed in Section 11.2 ("WRPKRU Threat Model: Unprivileged Domain Escape"): binary scanning for unauthorized WRPKRU/XRSTOR instructions at load time, W^X enforcement on driver code pages, forward-edge CFI (Clang -fsanitize=cfi-icall), and the NMI watchdog for detecting PKRU state mismatches.

IBT (Indirect Branch Tracking) ENDBR validation (x86-64, CET-IBT): When the CPU supports CET-IBT (CR4.CET enabled, IA32_S_CET.ENDBR_EN set for supervisor-mode driver code), every indirect branch target must begin with an ENDBR64 instruction (opcode F3 0F 1E FA). The KABI driver loader validates this at load time:

  1. For every entry point function pointer in the driver's KabiDriverManifest (init, probe, vtable method pointers), verify that the target address begins with ENDBR64. Reject the driver if any entry point lacks ENDBR64.
  2. Scan all executable pages of the driver binary for indirect branch targets (as identified by relocation entries) and verify ENDBR64 presence.
  3. This check is skipped when CET-IBT is not available (pre-Tiger Lake CPUs, non-x86 architectures). The X86Errata::CET_IBT_COMPAT flag gates the check.
  4. CR4 security bits (CET, SMEP, SMAP, UMIP) are pinned after boot (X86Errata::CR4_PIN) — no kernel code may clear them post-initialization.
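The entry-point check reduces to a 4-byte opcode compare; a minimal sketch (function names are illustrative, not the loader's actual API):

```rust
/// ENDBR64 opcode bytes (F3 0F 1E FA) — the required landing pad for
/// indirect branch targets under CET-IBT.
const ENDBR64: [u8; 4] = [0xF3, 0x0F, 0x1E, 0xFA];

/// `text` is the driver's executable section; `entry_off` is a manifest
/// entry point expressed as an offset into it. Out-of-bounds offsets
/// fail the check (and would reject the driver at load time).
fn entry_has_endbr64(text: &[u8], entry_off: usize) -> bool {
    text.get(entry_off..entry_off + 4)
        .map_or(false, |window| window == &ENDBR64[..])
}

fn main() {
    let mut text = vec![0x90u8; 64];          // NOP-filled code section
    text[..4].copy_from_slice(&ENDBR64);      // entry 0 properly landing-padded
    assert!(entry_has_endbr64(&text, 0));     // accepted
    assert!(!entry_has_endbr64(&text, 8));    // rejected: no ENDBR64 at target
}
```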

Tier 0 fast path: On RISC-V (where Tier 1 is not available), POWER8, or when isolation=performance promotes all drivers to Tier 0, MPK-specific mitigations are automatically skipped:

  • Binary scanning for WRPKRU/XRSTOR: skipped (no MPK → no WRPKRU exploit).
  • NMI PKRU watchdog: disabled (no PKRU state to verify).
  • W^X enforcement and forward-edge CFI remain active — these defend against code injection and control-flow hijacking regardless of isolation tier and are standard hardening measures, not isolation-specific overhead.
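The capability-driven selection can be sketched as follows (struct and field names are illustrative, not the kernel's actual types):

```rust
/// Per-driver mitigation set, selected from isolation capabilities.
struct Mitigations {
    scan_wrpkru: bool,       // load-time binary scan for WRPKRU/XRSTOR
    nmi_pkru_watchdog: bool, // PKRU state-mismatch detection via NMI
    wx_enforce: bool,        // W^X on driver code pages
    cfi_icall: bool,         // forward-edge CFI (-fsanitize=cfi-icall)
}

fn mitigations_for(mpk_available: bool, tier: u8) -> Mitigations {
    let mpk_active = mpk_available && tier == 1;
    Mitigations {
        scan_wrpkru: mpk_active,       // no MPK → no WRPKRU escape to scan for
        nmi_pkru_watchdog: mpk_active, // no PKRU state to verify
        wx_enforce: true,              // standard hardening, tier-independent
        cfi_icall: true,               // likewise always on
    }
}

fn main() {
    let riscv_tier0 = mitigations_for(false, 0); // RISC-V: no Tier 1, no MPK
    assert!(!riscv_tier0.scan_wrpkru && !riscv_tier0.nmi_pkru_watchdog);
    assert!(riscv_tier0.wx_enforce && riscv_tier0.cfi_icall);
    let x86_tier1 = mitigations_for(true, 1);    // x86-64 MPK Tier 1 driver
    assert!(x86_tier1.scan_wrpkru && x86_tier1.nmi_pkru_watchdog);
}
```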

Future: PKS (Protection Keys for Supervisor) -- Intel's PKS extension provides supervisor-mode protection keys that are controlled via MSR writes (privileged operations: WRMSR requires CPL 0). Unlike WRPKRU (which any Ring 0 code can execute), PKS key modifications go through WRMSR to the IA32_PKRS MSR, which can be trapped by a hypervisor or controlled by umka-core. When PKS-capable hardware is available, UmkaOS will use PKS for Tier 1 isolation, closing the unprivileged-WRPKRU escape path. PKS is available on Intel Sapphire Rapids and later server CPUs.

11.3.3.2 Domain Crossing Protocol (Tier 0 / Tier 1 Ring Buffer Mediation)

Tier 0 (Core) and Tier 1 drivers occupy different hardware memory domains. A Tier 0 function cannot call a Tier 1 function directly (the callee's code and data are in a foreign domain), and vice versa. All cross-domain invocations are mediated by a shared ring buffer allocated and owned by Core. The general protocol is:

  1. Request enqueue. The caller (Tier 0 or Tier 1) writes a serialised request message into the shared ring buffer. The message carries an opcode identifying the KABI operation, a sequence number, and the marshalled arguments. Ring entry format is defined in Section 11.8.
  2. Domain switch. Core performs the hardware domain switch (e.g., WRPKRU on x86-64, DACR write on ARMv7) to enter the target domain.
  3. Request dispatch. The target side polls or is notified of the new entry, deserialises the request, and dispatches it to the corresponding KABI vtable function.
  4. Response enqueue. The target writes the result (return value or error) into the response slot of the same ring entry (or a paired completion ring).
  5. Reverse domain switch. Core switches back to the caller's domain.
  6. Response read. The caller reads the response and resumes execution.

For the three transport classes (T0 direct, T1 ring, T2 IPC) and call-direction rules at the Tier 0 boundary, see Section 12.6.
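The six-step pattern can be sketched as a single-slot request/response exchange. This is a minimal illustration only — the real entry format, producer/consumer indices, and memory-ordering rules are specified in Section 11.8, and the enum and field names here are hypothetical:

```rust
/// One ring slot: empty, a pending request, or a completed response.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Slot {
    Empty,
    Request { opcode: u16, seq: u64, arg: u64 },
    Response { seq: u64, result: i64 },
}

struct Ring {
    slots: Vec<Slot>,
    head: usize, // producer index
}

impl Ring {
    fn new(n: usize) -> Self {
        Ring { slots: vec![Slot::Empty; n], head: 0 }
    }
    /// Step 1: caller serialises the request into the shared ring.
    fn enqueue_request(&mut self, opcode: u16, seq: u64, arg: u64) -> usize {
        let idx = self.head % self.slots.len();
        self.slots[idx] = Slot::Request { opcode, seq, arg };
        self.head += 1;
        idx
    }
    /// Steps 3-4: target deserialises, calls the KABI vtable function,
    /// and writes the response into the same entry.
    fn dispatch(&mut self, idx: usize, vtable: impl Fn(u16, u64) -> i64) {
        if let Slot::Request { opcode, seq, arg } = self.slots[idx] {
            let result = vtable(opcode, arg);
            self.slots[idx] = Slot::Response { seq, result };
        }
    }
    /// Step 6: caller reads the response after Core switches domains back.
    fn read_response(&self, idx: usize) -> Option<i64> {
        match self.slots[idx] {
            Slot::Response { result, .. } => Some(result),
            _ => None,
        }
    }
}

fn main() {
    let mut ring = Ring::new(8);
    let idx = ring.enqueue_request(0x01, 42, 4096); // hypothetical opcode
    // Step 2 (domain switch) happens here: WRPKRU / DACR write by Core.
    ring.dispatch(idx, |_op, arg| arg as i64 * 2);
    // Step 5 (reverse switch), then step 6:
    assert_eq!(ring.read_response(idx), Some(8192));
}
```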

This protocol applies to every Tier 0 / Tier 1 interaction. Specific subsystems define their own KABI message types but follow the same six-step pattern:

  • Writeback — Core enqueues writeback requests to the filesystem driver's inbound ring.
  • NAPI / network poll — The NIC driver enqueues completed packet batches; Core's network stack reads them from the completion ring.
  • VFS — Filesystem operations (lookup, read, write) are marshalled as KABI ring messages from Core to the Tier 1 filesystem driver.
  • Block I/O — Block requests are enqueued to the Tier 1 block driver's command ring; completions flow back on the completion ring.

The ring buffer design (slot layout, producer/consumer indices, memory ordering) is specified in Section 11.8.

11.3.3.3 VirtIO Device Hosting

VirtIO devices in UmkaOS run as Tier 1 drivers (Ring 0, hardware memory domain isolated). Rationale: VirtIO devices are almost always used in virtualized environments where high-throughput I/O is required; Tier 1 provides ring-based access to the network and block stacks with implicit batching that amortizes the domain crossing cost (~23-80 cycles per operation at N≥12), while the MPK/POE/DACR isolation boundary contains crashes without kernel state corruption.

  • The VirtIO transport layer (PCI or MMIO config space, virtqueue management) is implemented inside the Tier 1 driver domain as a shared library used by all VirtIO device drivers (virtio-blk, virtio-net, virtio-gpu, virtio-console, etc.).
  • The Linux VirtIO userspace API (vhost-user, vDPA) is surfaced through UmkaOS's compat layer unchanged — guest VMs and containers see standard VirtIO PCI/MMIO devices.
  • Tier 2 option: a VirtIO device MAY be hosted as Tier 2 (full userspace process) via vhost-user if the operator prioritizes fault isolation over latency; this adds approximately 5–15 μs of ring-crossing overhead per batch.

Reference specification: VirtIO 1.2 (OASIS, July 2022).

VirtIO Transport Abstraction

VirtIO devices are discovered via two transport mechanisms:

  • VirtIO over PCI: Standard PCI device with vendor ID 0x1AF4. Uses PCI capability structures for device config, common config, notify, ISR, and device-specific regions.
  • VirtIO over MMIO: Memory-mapped registers at a platform-specific base address. Discovered via device tree or ACPI. Used on AArch64, ARMv7, RISC-V, and PPC where PCI may not be available.
/// VirtIO transport abstraction — PCI or MMIO. Shared by all VirtIO device drivers.
pub trait VirtioTransport: Send + Sync {
    /// Read a device configuration field at the given byte offset.
    fn read_config(&self, offset: u32, size: u32) -> u64;
    /// Write a device configuration field.
    fn write_config(&self, offset: u32, size: u32, value: u64);
    /// Read device status register.
    fn read_status(&self) -> u8;
    /// Write device status register.
    fn write_status(&self, status: u8);
    /// Read device feature bits (64-bit: page 0 bits 0-31, page 1 bits 32-63).
    fn read_features(&self) -> u64;
    /// Write driver-accepted feature bits.
    fn write_features(&self, features: u64);
    /// Notify the device that a virtqueue has new descriptors (doorbell kick).
    fn notify(&self, queue_index: u16);
    /// Set up a virtqueue: provide physical addresses of desc, avail, used regions.
    fn setup_queue(&self, index: u16, desc_addr: u64, avail_addr: u64, used_addr: u64);
    /// Query the maximum queue size for a given queue index (power of 2, max 32768).
    fn max_queue_size(&self, index: u16) -> u16;
    /// Read and clear ISR status register.
    fn read_isr(&self) -> u8;
    /// Configure interrupt (MSI-X vector for PCI, IRQ number for MMIO).
    fn configure_interrupt(&self, queue_index: u16) -> Result<u32>;
}

Device initialization protocol (common to all VirtIO device types):

  1. Reset: Write 0 to device status register.
  2. Acknowledge: Set ACKNOWLEDGE (bit 0) in status.
  3. Driver: Set DRIVER (bit 1) in status.
  4. Feature negotiation: Read device features, select driver features, write back.
  5. FEATURES_OK: Set FEATURES_OK (bit 3). Re-read — if cleared, device rejected.
  6. Queue setup: For each virtqueue: select, read max size, allocate DMA, write addrs.
  7. DRIVER_OK: Set DRIVER_OK (bit 2). Device is now live.
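The seven steps can be sketched against a trimmed-down transport. Only the subset of the VirtioTransport trait used here is modelled, and the mock device, DMA addresses, and error handling are illustrative; status bit values (ACKNOWLEDGE=1, DRIVER=2, DRIVER_OK=4, FEATURES_OK=8) follow VirtIO 1.2:

```rust
const ACKNOWLEDGE: u8 = 1 << 0;
const DRIVER: u8      = 1 << 1;
const DRIVER_OK: u8   = 1 << 2;
const FEATURES_OK: u8 = 1 << 3;
const VIRTIO_F_VERSION_1: u64 = 1 << 32;

/// Reduced transport surface — the full trait is defined above.
trait Transport {
    fn read_status(&self) -> u8;
    fn write_status(&mut self, s: u8);
    fn read_features(&self) -> u64;
    fn write_features(&mut self, f: u64);
    fn max_queue_size(&self, index: u16) -> u16;
    fn setup_queue(&mut self, index: u16, desc: u64, avail: u64, used: u64);
}

fn init_device<T: Transport>(t: &mut T, wanted: u64) -> Result<u64, &'static str> {
    t.write_status(0);                                   // 1. Reset
    t.write_status(ACKNOWLEDGE);                         // 2. Acknowledge
    t.write_status(ACKNOWLEDGE | DRIVER);                // 3. Driver
    let negotiated = t.read_features() & wanted;         // 4. Feature negotiation
    t.write_features(negotiated);
    t.write_status(ACKNOWLEDGE | DRIVER | FEATURES_OK);  // 5. FEATURES_OK
    if t.read_status() & FEATURES_OK == 0 {
        return Err("device rejected features");
    }
    let qs = t.max_queue_size(0) as u64;                 // 6. Queue setup (queue 0 only)
    t.setup_queue(0, 0x1000, 0x1000 + 16 * qs, 0x2000);  // hypothetical DMA addresses
    t.write_status(ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK); // 7. DRIVER_OK
    Ok(negotiated)
}

/// Mock device: accepts whatever features the driver writes back.
struct Mock { status: u8, dev_features: u64, drv_features: u64 }
impl Transport for Mock {
    fn read_status(&self) -> u8 { self.status }
    fn write_status(&mut self, s: u8) { self.status = s; }
    fn read_features(&self) -> u64 { self.dev_features }
    fn write_features(&mut self, f: u64) { self.drv_features = f; }
    fn max_queue_size(&self, _i: u16) -> u16 { 256 }
    fn setup_queue(&mut self, _i: u16, _d: u64, _a: u64, _u: u64) {}
}

fn main() {
    let mut m = Mock { status: 0xFF, dev_features: VIRTIO_F_VERSION_1 | 1, drv_features: 0 };
    let negotiated = init_device(&mut m, VIRTIO_F_VERSION_1).unwrap();
    assert_eq!(negotiated, VIRTIO_F_VERSION_1);          // device offered it, driver took it
    assert_eq!(m.status & DRIVER_OK, DRIVER_OK);         // device is live
}
```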

Common VirtIO feature bits (transport-level, used by all device types):

/// VirtIO common feature bits (VirtIO 1.2 §6).
pub mod virtio_features {
    /// Indirect descriptors support.
    pub const VIRTIO_F_INDIRECT_DESC: u64      = 1 << 28;
    /// Used buffer notification suppression (event index).
    pub const VIRTIO_F_EVENT_IDX: u64          = 1 << 29;
    /// VirtIO version 1 (modern device).
    pub const VIRTIO_F_VERSION_1: u64          = 1 << 32;
    /// Device can be used on platforms where device access is translated (IOMMU).
    pub const VIRTIO_F_ACCESS_PLATFORM: u64    = 1 << 33;
    /// Device supports packed virtqueue layout.
    pub const VIRTIO_F_RING_PACKED: u64        = 1 << 34;
    /// Device supports in-order buffer use.
    pub const VIRTIO_F_IN_ORDER: u64           = 1 << 35;
    /// Memory accesses by driver and device follow platform ordering rules
    /// (stronger barriers than the default VirtIO assumptions are required).
    pub const VIRTIO_F_ORDER_PLATFORM: u64     = 1 << 36;
    /// Device supports SR-IOV.
    pub const VIRTIO_F_SR_IOV: u64             = 1 << 37;
    /// Device supports notification data.
    pub const VIRTIO_F_NOTIFICATION_DATA: u64  = 1 << 38;
}

Virtqueue Layout — Split Ring (VirtIO 1.2 §2.7)

The split ring format is the default. Each virtqueue consists of three contiguous DMA regions: descriptor table, available ring, and used ring.

/// Split virtqueue descriptor — 16 bytes. One per slot in the descriptor table.
/// All multi-byte fields are little-endian per VirtIO 1.2 §2.7 (when
/// `VIRTIO_F_VERSION_1` is negotiated). Le* types
/// ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce correct
/// byte order on all eight supported architectures including big-endian
/// PPC32 and s390x. Matches Linux's `__virtio_le16`/`__virtio_le32`/`__virtio_le64`.
#[repr(C)]
pub struct VirtqDesc {
    /// Physical address of the data buffer.
    pub addr: Le64,
    /// Length of the buffer in bytes.
    pub len: Le32,
    /// Flags: NEXT (0) = chained, WRITE (1) = device-writable,
    /// INDIRECT (2) = buffer contains indirect descriptor table.
    pub flags: Le16,
    /// Index of the next descriptor in the chain (valid only if NEXT flag set).
    pub next: Le16,
}
/// 8(addr) + 4(len) + 2(flags) + 2(next) = 16 bytes
const_assert!(size_of::<VirtqDesc>() == 16);

pub const VIRTQ_DESC_F_NEXT: u16     = 1;
pub const VIRTQ_DESC_F_WRITE: u16    = 2;
pub const VIRTQ_DESC_F_INDIRECT: u16 = 4;

/// Split virtqueue available ring — driver writes, device reads.
/// All multi-byte fields are little-endian per VirtIO 1.2 §2.7.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
#[repr(C)]
pub struct VirtqAvail {
    /// Flags: NO_INTERRUPT (0) = suppress used buffer notifications from device.
    pub flags: Le16,
    /// Index of the next entry the driver will write.
    pub idx: Le16,
    /// Zero-length marker for the ring entries. Actual ring data follows at
    /// offset `size_of::<VirtqAvail>()`. Access via `VirtqRingView<Le16>`.
    pub ring: [Le16; 0],
    // Followed by: ring[queue_size] entries, then used_event: Le16 (if VIRTIO_F_EVENT_IDX).
}
/// Header-only size: 2(flags) + 2(idx) = 4 bytes (ring entries follow at runtime)
const_assert!(size_of::<VirtqAvail>() == 4);

/// Bounds-checked accessor for variable-length virtqueue ring entries.
/// Created once at VirtIO device init with the validated `queue_size`.
pub struct VirtqRingView<T: Copy> {
    /// Base pointer to the first ring entry. Must be `*mut` because `set()`
    /// writes through this pointer — writing through a `*const`-derived
    /// pointer is UB under both Stacked Borrows and Tree Borrows models.
    base: *mut T,
    /// Number of valid entries (from VirtIO negotiation).
    queue_size: u16,
}

impl<T: Copy> VirtqRingView<T> {
    /// # Safety
    /// `base` must point to `queue_size` valid, writable `T` entries.
    /// The entries must remain valid for the lifetime of this `VirtqRingView`.
    pub unsafe fn new(base: *mut T, queue_size: u16) -> Self {
        Self { base, queue_size }
    }
    pub fn get(&self, idx: u16) -> T {
        assert!(idx < self.queue_size, "ring index out of bounds");
        // SAFETY: idx < queue_size, validated by assert above.
        unsafe { self.base.add(idx as usize).read() }
    }
    pub fn set(&self, idx: u16, val: T) {
        assert!(idx < self.queue_size, "ring index out of bounds");
        // SAFETY: Writes to external device-shared memory via a stored raw
        // pointer (`self.base`). This is NOT interior mutability (no UnsafeCell) —
        // the write target is the VirtIO ring region, external to the struct's
        // own memory. Loading `self.base` (a `*mut T`) through `&self` produces a
        // copy of the pointer with its original provenance. The write through that
        // pointer is sound under both Stacked Borrows and Tree Borrows because the
        // write target is not derived from the `&self` borrow.
        // idx < queue_size is validated by the assert above.
        // Only the driver calls set() on the avail ring (device reads only).
        unsafe { self.base.add(idx as usize).write(val) }
    }
}

/// Split virtqueue used ring — device writes, driver reads.
/// All multi-byte fields are little-endian per VirtIO 1.2 §2.7.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
#[repr(C)]
pub struct VirtqUsed {
    /// Flags: NO_NOTIFY (0) = suppress available buffer notifications from driver.
    pub flags: Le16,
    /// Index of the next entry the device will write.
    pub idx: Le16,
    /// Zero-length marker for the ring entries. Access via `VirtqRingView<VirtqUsedElem>`.
    pub ring: [VirtqUsedElem; 0],
    // Followed by: avail_event: Le16 (only if VIRTIO_F_EVENT_IDX).
}
/// Header-only size: 2(flags) + 2(idx) = 4 bytes (ring entries follow at runtime)
const_assert!(size_of::<VirtqUsed>() == 4);

/// Used ring element — 8 bytes.
/// All multi-byte fields are little-endian per VirtIO 1.2 §2.7.
#[repr(C)]
pub struct VirtqUsedElem {
    /// Index of the head descriptor of the completed chain.
    pub id: Le32,
    /// Total bytes written by the device into device-writable buffers.
    pub len: Le32,
}
/// 4(id) + 4(len) = 8 bytes
const_assert!(size_of::<VirtqUsedElem>() == 8);

Alignment: descriptor table at 16-byte boundary, available ring at 2-byte boundary, used ring at 4-byte boundary (VirtIO 1.2 §2.7.10).
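These sizes and alignment rules determine a virtqueue's total DMA footprint. A hypothetical layout helper, assuming VIRTIO_F_EVENT_IDX is negotiated (which appends one extra Le16 — used_event / avail_event — to each ring) and that the allocation base itself is at least 16-byte aligned:

```rust
/// Round `x` up to the next multiple of power-of-two `a`.
fn align_up(x: usize, a: usize) -> usize { (x + a - 1) & !(a - 1) }

/// Returns (desc_off, avail_off, used_off, total_bytes) for one split
/// virtqueue with EVENT_IDX negotiated.
fn split_ring_layout(queue_size: usize) -> (usize, usize, usize, usize) {
    let desc_off = 0;                              // 16-byte aligned by placement
    let desc_bytes = 16 * queue_size;              // VirtqDesc is 16 bytes
    let avail_off = align_up(desc_off + desc_bytes, 2);
    let avail_bytes = 2 + 2 + 2 * queue_size + 2;  // flags + idx + ring + used_event
    let used_off = align_up(avail_off + avail_bytes, 4);
    let used_bytes = 2 + 2 + 8 * queue_size + 2;   // flags + idx + ring + avail_event
    (desc_off, avail_off, used_off, used_off + used_bytes)
}

fn main() {
    let (d, a, u, total) = split_ring_layout(256);
    assert_eq!(d, 0);
    assert_eq!(a, 4096);   // 256 descriptors * 16 bytes
    assert_eq!(u, 4616);   // 4096 + 518 avail bytes, rounded up to 4
    assert_eq!(total, 6670);
    println!("desc@{} avail@{} used@{} total={}", d, a, u, total);
}
```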

Virtqueue Layout — Packed Ring (VirtIO 1.2 §2.8)

If VIRTIO_F_RING_PACKED is negotiated, the packed ring eliminates separate avail/used rings, reducing cache contention:

/// Packed virtqueue descriptor — 16 bytes. Single unified ring.
/// All multi-byte fields are little-endian per VirtIO 1.2 §2.8.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
#[repr(C)]
pub struct VirtqPackedDesc {
    /// Physical address of the buffer.
    pub addr: Le64,
    /// Length of the buffer in bytes.
    pub len: Le32,
    /// Buffer ID — opaque to device, returned in used descriptors.
    pub id: Le16,
    /// Flags: NEXT (0), WRITE (1), INDIRECT (2),
    /// AVAIL (7) = driver marks available, USED (15) = device marks used.
    /// AVAIL/USED bits form a wrap counter — alternates each ring wrap.
    pub flags: Le16,
}
/// 8(addr) + 4(len) + 2(id) + 2(flags) = 16 bytes
const_assert!(size_of::<VirtqPackedDesc>() == 16);

The packed ring is preferred when offered — it reduces descriptor access from 3 cache lines (desc + avail + used) to 1 cache line per descriptor on completion.
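The AVAIL/USED wrap-counter handshake can be illustrated with two small helpers. This is a sketch of the §2.8 rules (the driver sets AVAIL to its wrap counter and USED to the inverse; the device marks completion by setting both to its own wrap counter), not the driver's actual ring code:

```rust
const VIRTQ_PACKED_DESC_F_AVAIL: u16 = 1 << 7;
const VIRTQ_PACKED_DESC_F_USED: u16  = 1 << 15;

/// Driver side: flags word for a freshly posted descriptor.
/// AVAIL = wrap counter, USED = inverse of wrap counter.
fn mark_avail(wrap: bool) -> u16 {
    (if wrap { VIRTQ_PACKED_DESC_F_AVAIL } else { 0 })
        | (if wrap { 0 } else { VIRTQ_PACKED_DESC_F_USED })
}

/// Driver side: a descriptor is used (completed) when the device has set
/// both AVAIL and USED to the driver's current used-ring wrap counter.
fn is_used(flags: u16, wrap: bool) -> bool {
    let avail = flags & VIRTQ_PACKED_DESC_F_AVAIL != 0;
    let used = flags & VIRTQ_PACKED_DESC_F_USED != 0;
    avail == wrap && used == wrap
}

fn main() {
    let posted = mark_avail(true);                       // driver posts on wrap=1
    assert!(!is_used(posted, true));                     // device hasn't completed yet
    let done = posted | VIRTQ_PACKED_DESC_F_USED;        // device marks completion
    assert!(is_used(done, true));
}
```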

VirtIO device types used by UmkaOS drivers:

| Type ID | Device | Driver Section |
|---------|--------|----------------|
| 1 | Network (virtio-net) | Section 16.1 |
| 2 | Block (virtio-blk) | Section 15.5 |
| 3 | Console (virtio-console) | Section 21.1 |
| 4 | Entropy (virtio-rng) | Section 10.1 |
| 5 | Balloon (virtio-balloon) | Section 18.1 |
| 16 | GPU (virtio-gpu) | Section 21.5 |
| 26 | Filesystem (virtio-fs) | Section 14.11 |

11.3.4 Protection Key Exhaustion (Hardware Domain Limit)

Intel MPK provides only 16 protection keys (PKEY 0-15). With PKEY 0 reserved for UmkaOS Core, PKEY 1 for shared read-only descriptors, PKEY 14 for shared DMA, and PKEY 15 as guard, only 12 keys (PKEY 2-13) are available for Tier 1 driver domains (see Section 11.2, "MPK Domain Allocation"). This limits the number of independently isolated Tier 1 drivers to 12 on x86-64 with MPK. Architectures with equivalent mechanisms (AArch64 POE: 3 usable driver domains after infrastructure reservations, ARMv7 DACR: 12 usable domains, PPC32 segments: 12 usable) face the same constraint. This is a hard hardware limit that cannot be worked around without changing the isolation granularity. PPC64LE (Radix PID) uses process-scoped isolation without a fixed small domain budget, so domain exhaustion does not apply there -- but it pays higher per-switch costs (see Section 11.2 cost table). RISC-V has no Tier 1 isolation at all; domain exhaustion does not apply.

When domains are exhausted (more concurrent Tier 1 drivers than available hardware driver domains — 12 on x86 MPK, 3 on AArch64 POE, 12 on ARMv7 DACR, 12 on PPC32 segments), UmkaOS applies three strategies in priority order:

  1. Domain grouping (default): Related drivers share a protection key. For example, all block storage drivers (NVMe, AHCI, virtio-blk) share one key, all network drivers (NIC, TCP/IP stack) share another. Grouping reduces isolation granularity -- a bug in one block driver can corrupt another block driver's memory within the same group -- but preserves isolation between groups (network cannot corrupt storage). Grouping policy is configurable via the driver manifest:
    [driver.isolation]
    isolation_group = "block"  # Share isolation domain with other "block" group drivers
    

Default domain grouping table (x86-64 MPK, 12 available domains):

| PKEY | Group | Occupants |
|------|-------|-----------|
| 2 | block | NVMe, AHCI, virtio-blk, SCSI |
| 3 | network | NIC drivers (Intel, Mellanox, virtio-net) |
| 4 | netstack | TCP/IP + UDP stack, netfilter |
| 5 | filesystem | ext4, XFS, Btrfs, ZFS |
| 6 | gpu | GPU compute driver (or Tier 2 if untrusted) |
| 7 | kvm | KVM hypervisor |
| 8 | crypto | Kernel crypto + IPsec offload |
| 9 | usb | USB host controller (xHCI) |
| 10-13 | overflow | Additional drivers; auto-assigned by priority |

AArch64 POE (3 available domains): block + network + netstack share PKEY 3; filesystem + crypto share PKEY 4; gpu/kvm share PKEY 5. Remaining drivers demote to Tier 2. Operators override grouping via /sys/kernel/umka/isolation/domains.

  2. Automatic Tier 2 demotion: Drivers below a configurable priority threshold are demoted to Tier 2 (process isolation) when all hardware isolation domains are consumed. Only the most performance-critical drivers retain Tier 1 placement. The priority is determined by match_priority in the driver manifest -- higher priority retains Tier 1.

  3. Domain virtualization (future): On context switch, the scheduler can save and restore the isolation domain register (PKRU on x86, POR_EL0 on AArch64, DACR on ARMv7, segment registers on PPC32) along with a remapped domain assignment table, allowing more logical domains than hardware provides by time-multiplexing physical domains. Domain virtualization adds ~50-100 cycles to each context switch for the register save/restore and domain table lookup (warm-cache fast path: WRPKRU ~20 cycles plus an L1-resident domain table lookup; cold-cache misses add ~100-200 cycles to the domain table access), and is used only when strategies 1 and 2 are insufficient. This is a future optimization -- domain grouping and Tier 2 demotion handle all current deployment scenarios.
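Strategies 1 and 2 can be sketched as a key allocator that shares keys within an isolation_group and signals Tier 2 demotion on exhaustion (all names illustrative, not the kernel's actual allocator):

```rust
use std::collections::HashMap;

/// Domain assignment with grouping (strategy 1) and a None result that
/// triggers Tier 2 demotion (strategy 2) on exhaustion.
struct DomainAllocator {
    free_keys: Vec<u8>,                  // e.g. PKEYs 2..=13 on x86-64 MPK
    groups: HashMap<&'static str, u8>,   // isolation_group → shared PKEY
}

impl DomainAllocator {
    fn assign(&mut self, isolation_group: Option<&'static str>) -> Option<u8> {
        if let Some(g) = isolation_group {
            if let Some(&k) = self.groups.get(g) {
                return Some(k);              // strategy 1: share the group's key
            }
            let k = self.free_keys.pop()?;   // first member claims a fresh key
            self.groups.insert(g, k);
            return Some(k);
        }
        self.free_keys.pop()                 // exclusive domain; None → demote to Tier 2
    }
}

fn main() {
    let mut alloc = DomainAllocator {
        free_keys: (2..=13).rev().collect(), // pop() hands out PKEY 2 first
        groups: HashMap::new(),
    };
    let nvme = alloc.assign(Some("block")).unwrap();
    let ahci = alloc.assign(Some("block")).unwrap();
    assert_eq!(nvme, ahci);                  // grouped drivers share one PKEY
    let nic = alloc.assign(Some("network")).unwrap();
    assert_ne!(nvme, nic);                   // groups stay isolated from each other
}
```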

POE + ASID Domains (ARMv8.9+ systems with POE support)

On AArch64 systems with Permission Overlay Extensions (ARMv8.9+ / FEAT_S1POE):

  • Each Tier 1 driver domain is assigned a POE domain (POR_EL0 register field, up to 8 domains).
  • Domain switch: MSR POR_EL0, x0 (single instruction, ~40-80 cycles, no TLB flush).
  • Domain assignment: PKEY 0 = UmkaOS Core private, PKEY 1 = shared read-only ring descriptors, PKEY 2 = shared DMA pool, PKEYs 3-5 = Tier 1 driver domains (3 domains), PKEY 6 = userspace, PKEY 7 = temporary/debug. After infrastructure reservations, only 3 POE indices are available for Tier 1 drivers — significantly fewer than x86 MPK's 12 driver domains.
  • Fallback: If the hardware supports POE but a driver requires exclusive ASID isolation (e.g., cryptographic device handling key material), the ASID-table strategy (Strategy 2) is used for that driver even on POE-capable hardware. The driver registers require_asid_isolation: true in its .kabi manifest.
  • Combined POE+ASID: For the highest isolation guarantee on ARMv8.9+, use both: POE for fast memory-domain switching + a dedicated ASID for the driver domain. This prevents both memory domain escapes (POE) and TLB side-channel attacks (ASID). Cost: ~80-150 cycles per domain switch (ASID flush + POE switch); used for Tier 1 drivers handling sensitive key material.
  • Detection: POE availability is checked at boot via ID_AA64MMFR3_EL1.S1POE[7:4] != 0. Exposed via IsolationCapabilities::poe_available: bool to the driver subsystem.
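The detection check reduces to a field extract. A sketch only — in the kernel the register value would come from an MRS read of ID_AA64MMFR3_EL1 at boot:

```rust
/// Decode ID_AA64MMFR3_EL1.S1POE (bits [7:4]); nonzero → FEAT_S1POE present.
fn s1poe_available(id_aa64mmfr3_el1: u64) -> bool {
    (id_aa64mmfr3_el1 >> 4) & 0xF != 0
}

fn main() {
    assert!(!s1poe_available(0));     // S1POE field 0b0000: no POE
    assert!(s1poe_available(0x10));   // S1POE field 0b0001: POE implemented
}
```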

When domain grouping is applied, the kernel logs a warning (umka: isolation domain exhausted, grouping {driver_a} with {driver_b}) and exposes the current domain allocation in /sys/kernel/umka/isolation/domains for admin visibility.

Practical impact: A typical server has 5-8 performance-critical driver types (NVMe, NIC, TCP/IP, filesystem, GPU, KVM, virtio, crypto). With grouping, these fit within the hardware domain budget on x86 (12 domains), ARMv7 (12), and PPC32 (12) with room to spare. On AArch64 with POE (7 total usable indices, of which only 3 are available for Tier 1 drivers after infrastructure reservations — see Section 24.5 in 23-roadmap.md for the full index allocation), a typical 5-8 driver configuration requires at least one grouping (e.g., NVMe + filesystem share a domain). Systems with unusually many distinct Tier 1 drivers (e.g., multi-vendor NIC + storage + GPU + FPGA configurations) trigger Tier 2 demotion for the lowest-priority drivers.

Long-term trajectory: the domain budget pressure diminishes as devices become peers — but this is a multi-year ecosystem shift, not a near-term fix. The devices that consume the most Tier 1 domain slots today — GPU (~700K lines of handwritten driver code, excluding auto-generated headers), high-end NIC/DPU (~150K lines), and high-throughput storage controllers — are exactly the devices most suited to become UmkaOS multikernel peers (Section 5.2). When a device runs its own UmkaOS kernel and participates as a cluster peer, it is handled entirely by umka-peer-transport (~2K lines) and consumes zero MPK domains; it exits the Tier 1 population entirely and is contained by the IOMMU hard boundary instead.

However, UmkaOS cannot assume vendor adoption. Rewriting device firmware to implement UmkaOS message passing requires vendor investment, ecosystem tooling, and standardization effort that will take years to mature. For the foreseeable future, most devices will continue to use traditional Tier 1 and Tier 2 drivers, and the domain budget strategies above (grouping, Tier 2 demotion, domain virtualization) are the primary long-term solution — not a temporary workaround. Domain virtualization (strategy 3) and PKS (Section 11.3, future work) remain genuinely important during this extended transition window and must be implemented correctly. They cannot be dismissed as "probably never needed."

The peer kernel model is the correct direction — it reduces the Tier 1 population, eliminates device-specific Ring 0 code, and strengthens the isolation boundary — but UmkaOS must operate correctly and efficiently with today's hardware for years before that future materializes. Domain grouping and automatic Tier 2 demotion are therefore the primary and durable strategies. The ecosystem shift toward peer kernels is a beneficial long-term trend that will progressively ease the domain budget, not a solution that UmkaOS can depend on today.

11.3.5 Tier 2: User-Space Drivers (Process-Isolated)

Non-performance-critical drivers run as user-space processes with full address space isolation. Communication with UmkaOS Core uses:

  • Shared-memory ring buffers (mapped into both address spaces)
  • Lightweight notification via eventfd-like mechanism
  • IOMMU-restricted DMA (driver can only DMA to its allocated regions)

Tier 2 MMIO access model. Tier 2 drivers access device MMIO registers via umka_driver_mmio_map (Section 11.3, KABI syscall table), which maps a device BAR region into the driver process's address space. This mapping is direct -- the driver reads and writes device registers without kernel mediation on each access, avoiding per-access syscall overhead. However, the mapping is kernel-controlled and revocable:

  1. Setup-time validation. The kernel validates every umka_driver_mmio_map request: the BAR index must belong to the driver's assigned device, the offset and size must fall within the BAR's bounds, and the driver must hold the appropriate device capability. The kernel never maps BARs belonging to other devices or kernel-reserved MMIO regions.

  2. IOMMU containment. Even though the driver can program device registers via MMIO (including registers that initiate DMA), all DMA transactions from the device pass through the IOMMU. The device's IOMMU domain restricts DMA to regions explicitly allocated by the kernel on behalf of the driver (umka_driver_dma_alloc). A compromised Tier 2 driver that programs arbitrary DMA addresses into device registers will trigger IOMMU faults -- the DMA is blocked by hardware, not by software trust. This is the same IOMMU fencing applied to Tier 1 drivers, and it is the primary defense against DMA-based attacks from any driver tier.

  3. MMIO revocation on containment. When the kernel needs to contain a Tier 2 driver (crash, fault, admin action, or auto-demotion), it unmaps all MMIO regions from the driver process's address space as part of the containment sequence. This is a standard virtual memory operation (page table entry removal + TLB invalidation) that completes in microseconds. After MMIO revocation, any subsequent MMIO access by the driver process triggers a page fault and process termination -- the driver cannot issue further device commands. Combined with IOMMU fencing (which blocks DMA initiated before revocation from reaching non-driver memory), MMIO revocation provides a complete device access cutoff without requiring Function Level Reset.

PCIe peer-to-peer DMA and IOMMU group policy -- The "complete device access cutoff" guarantee above depends on all DMA traffic passing through the IOMMU. This holds when the device is in its own IOMMU group (ACS enabled on all upstream PCIe switches). However, devices behind a non-ACS PCIe switch can perform peer-to-peer DMA that bypasses the IOMMU entirely — a contained device could still DMA to a peer device's memory regions without IOMMU interception. UmkaOS addresses this by enforcing an IOMMU group co-isolation policy: when devices share an IOMMU group (no ACS), UmkaOS places all devices in that group under the same Tier 2 driver process (or co-isolates them in the same Tier 1 domain). IOMMU revocation during containment therefore affects the entire group atomically — there is no "partially contained" state where one device in the group is fenced but a peer is not. See Section 11.5 (IOMMU Groups) for the full ACS detection and group assignment policy.

Synchronous vs. asynchronous revocation -- For deliberate containment actions (admin-initiated revocation, auto-demotion, fault-triggered isolation), MMIO revocation is synchronous: the kernel performs the TLB shootdown and waits for acknowledgment from all CPUs before the containment call returns. This guarantees that no MMIO access from the driver process is possible after the containment operation completes. For the crash case (driver process dies due to SIGSEGV/SIGABRT), the dying process's threads are killed first, so the TLB shootdown is a cleanup operation -- the driver threads are no longer executing, making the timing of the shootdown a correctness concern only for the page allocator (which must not reuse the MMIO-mapped pages until the shootdown completes).

  4. FLR-free recovery (optimistic path). In the normal case, Tier 2 recovery does not require Function Level Reset. Tier 1 recovery requires FLR because the driver runs in Ring 0 and may have left the device in an arbitrary hardware state that only a full reset can clear. Tier 2 recovery can typically avoid FLR because: (a) IOMMU containment prevents DMA escapes regardless of device state, (b) MMIO revocation prevents further device manipulation, and (c) the device's hardware state can be re-initialized by the replacement driver instance during its init() call. However, devices with complex internal state machines (GPUs, SmartNICs, FPGAs) may not be safely re-initializable without a full reset. If the replacement driver's init() detects an unresponsive or inconsistent device (no response to MMIO reads, unexpected register state, completion timeout), the registry escalates to FLR. This fallback is not the common case for simple devices (NICs, HID, storage controllers), but should be expected for complex devices with substantial internal firmware state.
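The containment-plus-recovery ordering can be sketched as a step log (function and step names are illustrative; the real kernel operates on page tables, TLBs, and IOMMU domains):

```rust
/// Records the containment sequence for one Tier 2 driver process.
struct Containment { steps: Vec<&'static str> }

impl Containment {
    /// `sync` = deliberate containment (admin action, auto-demotion):
    /// wait for TLB shootdown ACKs before returning. Crash path is async
    /// because the driver's threads are already dead.
    fn run(sync: bool) -> Self {
        let mut steps = Vec::new();
        steps.push("unmap_mmio");                   // PTE removal for all BAR mappings
        steps.push(if sync {
            "tlb_shootdown_wait"                    // guaranteed cutoff on return
        } else {
            "tlb_shootdown_async"                   // cleanup; pages held until done
        });
        steps.push("iommu_fence");                  // in-flight DMA blocked in hardware
        steps.push("reinit_or_flr");                // optimistic re-init, FLR on escalation
        Containment { steps }
    }
}

fn main() {
    let admin = Containment::run(true);
    assert_eq!(admin.steps[1], "tlb_shootdown_wait");
    let crash = Containment::run(false);
    assert_eq!(crash.steps[1], "tlb_shootdown_async");
    assert_eq!(crash.steps.last(), Some(&"reinit_or_flr"));
}
```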

11.3.6 Tier Mobility and Auto-Demotion

Key principle: UmkaOS's isolation model is designed for flexibility, not dogma. Different hardware has different isolation capabilities (see Section 11.2 in README.md for the full architecture-specific analysis). The tier system allows administrators to make explicit tradeoffs between isolation and performance:

  • Tier 1 provides isolation using the best available hardware mechanism: register-based on x86-64/ARMv7/PPC32/PPC64LE (~1-4% overhead), or page-table-based on AArch64 mainstream (~6-12% overhead), or POE-accelerated on AArch64 ARMv8.9+/ARMv9.4+ (~2-4% overhead). On RISC-V, s390x, and LoongArch64, Tier 1 is unavailable — drivers use Tier 0 or Tier 2.

  • Tier 2 provides strong process-level isolation on all architectures, at the cost of higher latency (~200-600 cycles per domain crossing vs ~23-80 cycles for Tier 1).

  • The escape hatch is always available: Any Tier 1 driver can be manually demoted to Tier 2 by the administrator, or automatically demoted after repeated crashes. This allows environments that prioritize security over performance to opt into stronger isolation regardless of hardware capabilities.

Design intent: The system does not force a one-size-fits-all choice. A high-frequency trading system on x86_64 might run all drivers in Tier 1 for maximum performance. A secure enclave handling sensitive data on a RISC-V system might run all drivers in Tier 2 for maximum isolation. Both are valid deployments of the same kernel.

Drivers declare a preferred tier and a minimum tier in their manifest:

# drivers/tier1/nvme/manifest.toml
[driver]
name = "umka-nvme"
preferred_tier = 1
minimum_tier = 1          # NVMe cannot function well in Tier 2
fallback_bias = "performance"  # latency-sensitive → prefer Tier 0 if Tier 1 unavailable
# drivers/tier2/usb-hid/manifest.toml
[driver]
name = "umka-usb-hid"
preferred_tier = 2
minimum_tier = 2
fallback_bias = "isolation"    # HID handles untrusted USB input → prefer safety

The kernel's policy engine decides the actual tier based on the following rules, applied in order. The effective tier is the minimum across all applicable ceilings:

  1. Trust level: Unsigned drivers are forced to Tier 2.
  2. Signing certificate tier ceiling: The signing key's DriverCertEntry.max_tier caps the maximum tier. A driver signed by a vendor key with max_tier=2 cannot be assigned Tier 0 or Tier 1, regardless of other policy. MOK-enrolled keys (source == CERT_SOURCE_MOK) are always capped at Tier 2 by the kernel — the max_tier field in MOK entries is ignored. See Section 12.7 for the DriverCertEntry structure and the integration with the .kabi keyring.
  3. License-tier ceiling: The driver's KabiDriverManifest.license_id determines a license-derived maximum tier, enforced per the OKLF Approved Linking License Registry (Section 24.7):
     • ALLR Tier 1/2 licenses (GPL-2.0, MIT, BSD, Apache, etc.): no restriction
     • ALLR Tier 3 (CDDL-1.0, GPL-3.0-only, EUPL-1.2): max Tier 1 (no Tier 0 — static linking into Core creates a derivative work)
     • Process-isolation-only (LGPL-3.0, EPL-2.0-no-secondary): max Tier 2
     • Proprietary (license_id 0xF000-0xFFFF) or unspecified (0x0000): max Tier 2
  4. Crash history: After 3 crashes within 60 seconds, a Tier 1 driver is automatically demoted to Tier 2 (if minimum_tier allows).
  5. Admin overrides: System administrator can force any tier via configuration. Admin overrides can lower a tier but cannot raise it above the cert or license ceilings (rules 2-3). This prevents a compromised admin account from granting Ring 0 access to proprietary code.
  6. Signature verification: Cryptographically signed drivers can be granted Tier 1.

These rules implement three independent enforcement layers:

| Layer | What it enforces | Who controls | Fakeable? |
|---|---|---|---|
| OKLF license (legal) | Proprietary code cannot legally run in Ring 0 | Copyright law (courts) | No (legal liability) |
| Signing cert max_tier (trust policy) | Distro controls which signing keys vouch for which tiers | Distro builder | No (embedded at kernel build time) |
| license_id in manifest (audit/automation) | Automates license-to-tier policy for distro-built drivers | Driver author (self-declared) | Yes (self-declaration) |

The signing cert max_tier is the primary technical enforcement — it is set by the distro at kernel build time and cannot be faked by the vendor. The license_id field is useful for distro-built drivers (where the distro built from source and knows the actual license) but is a self-declaration for vendor-signed drivers. A vendor lying about license_id is legally actionable under the OKLF but not technically preventable — just like Linux's MODULE_LICENSE("GPL").

A driver must pass ALL layers to reach Tier 0 or Tier 1. The cert max_tier is the hard ceiling that cannot be bypassed. Example: a driver signed by a vendor key pinned at max_tier=2 is capped at Tier 2 regardless of its license_id claim. Conversely, a distro-built driver signed with the distro key (max_tier=0) but declaring a CDDL license is capped at Tier 1 (license ceiling applies because the distro's own build system set the license_id truthfully).
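The ceiling composition can be sketched as a small pure function. This is an illustrative model, not the actual policy-engine code: since a higher tier number means less privilege, taking the minimum privilege across all ceilings corresponds to the numeric maximum of the tier numbers.

```rust
// Illustrative model, not the policy engine. Tier numbers run from
// 0 (Core, most privileged) to 2 (user-space process, least privileged),
// so the most restrictive ceiling is the numerically largest value.
fn effective_tier(preferred: u8, ceilings: &[u8]) -> u8 {
    ceilings.iter().copied().fold(preferred, |a, b| a.max(b))
}

fn main() {
    // Vendor key pinned at max_tier=2 caps a Tier 1 request at Tier 2.
    assert_eq!(effective_tier(1, &[2]), 2);
    // Distro key (max_tier=0) but CDDL license ceiling (Tier 1): license wins.
    assert_eq!(effective_tier(0, &[0, 1]), 1);
    // No applicable ceiling restricts the preferred tier.
    assert_eq!(effective_tier(1, &[0, 0]), 1);
}
```

Both worked examples from the paragraph above reduce to this one fold over the ceiling list.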

On architectures without fast isolation hardware (RISC-V), drivers with minimum_tier = 1 in their manifest are loaded as Tier 0 (in-kernel, fully trusted). The KABI loader logs: driver {name}: Tier 1 unavailable on {arch}, loaded as Tier 0. The minimum_tier field expresses the minimum logical isolation level; architectures that lack the hardware mechanism for a given tier automatically map the driver to the next available lower tier. This is not an error — it reflects the architectural reality that RISC-V's security boundary for these drivers is the IOMMU (Tier 2), not memory domain isolation.
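The remapping can be sketched as follows (hypothetical function name; the real logic lives in the KABI loader):

```rust
// Illustrative sketch of the loader's tier remapping. On architectures
// without Tier 1 hardware, a Tier 1 request maps to Tier 0, the next
// available lower (more trusted) tier; Tier 0 and Tier 2 exist everywhere.
fn resolve_arch_tier(requested: u8, arch_has_tier1: bool) -> u8 {
    match (requested, arch_has_tier1) {
        (1, false) => 0, // e.g. RISC-V: no MPK/POE/DACR analogue, load in-kernel
        (t, _) => t,
    }
}

fn main() {
    assert_eq!(resolve_arch_tier(1, false), 0); // RISC-V: loaded as Tier 0
    assert_eq!(resolve_arch_tier(1, true), 1);  // x86-64: Tier 1 as requested
    assert_eq!(resolve_arch_tier(2, false), 2); // Tier 2 always available
}
```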

11.3.7 Graceful Tier Degradation

A core design principle of UmkaOS's driver model is that the driver always works. Unlike Linux, where a missing MODULE_LICENSE or unsigned module on a Secure Boot system means the driver does not load at all, UmkaOS demotes drivers to a lower isolation tier. The hardware still functions; only the isolation boundary changes.

| Trigger | Effect | Recovery path |
|---|---|---|
| Unsigned driver | Max Tier 2 | Sign the driver binary |
| Vendor cert (max_tier=2) | Max Tier 2 | Distro promotes cert (re-pins with higher max_tier), or vendor open-sources and distro builds from source |
| MOK-enrolled signing key | Max Tier 2 | Embed key in kernel build as CERT_SOURCE_BUILTIN |
| Proprietary license (license_id 0xF000+) | Max Tier 2 | Open-source the driver, or accept Tier 2 performance |
| CDDL/GPLv3/EUPL-1.2 license | Max Tier 1 (no Tier 0) | Relicense, or accept KABI IPC boundary |
| Crash count exceeded (3 within 60 seconds) | Demoted Tier 1 → Tier 2 | Reset crash counter, fix the bug, reload |
| Hardware lacks isolation (RISC-V) | Tier 1 → Tier 0 | Wait for hardware ISA extension |
| Admin override | Any tier (within cert/license ceiling) | Reconfigure |

The effective tier is always min(requested, cert_ceiling, license_ceiling, crash_policy, hardware_capability, admin_override). Demotion is logged with the specific reason via dmesg and recorded in the FMA event log (Section 20.1).

This graceful degradation is a key differentiator: vendor drivers are never "broken" on UmkaOS — they may run slower (Tier 2 vs Tier 1) but they always function. This removes the most common source of user frustration with Linux Secure Boot and module signing enforcement.

11.3.8 Debugging Across Isolation Domains (ptrace)

ptrace(PTRACE_PEEKDATA) on a Tier 1 driver thread must read memory tagged with the driver's PKEY, which the debugger process does not have access to. The kernel handles this by performing the read on behalf of the debugger:

ptrace access flow for MPK-isolated memory (high-level overview):
  1. Debugger calls ptrace(PTRACE_PEEKDATA/POKEDATA, target_tid, addr).
  2. Kernel checks: does `addr` belong to a MPK-protected region?
  3. If yes: kernel performs a TOCTOU-safe PKRU manipulation
     (see Security Note below) to grant temporary access,
     performs the copy, then restores PKRU. This happens in kernel mode,
     so the debugger process never gains direct access.
  4. If no: standard ptrace read/write path (no MPK involvement).

ptrace write flow:
  Same as read, but with write permission instead of read.
  PKRU manipulation is a single WRPKRU instruction (~23 cycles typical; see
  [Section 24.4](24-roadmap.md#formal-verification-readiness--performance-impact)
  for a detailed cycle-count analysis: 11–260 cycles depending on pipeline
  state and microarchitecture).

PTRACE_ATTACH to a Tier 1 driver thread:
  Requires CAP_SYS_PTRACE (same as Linux).
  The debugger can single-step, set breakpoints, and inspect registers.
  Memory access goes through the kernel-mediated PKRU path above.

### Security Note: TOCTOU Mitigation

The ptrace PKRU manipulation flow has a Time-Of-Check-Time-Of-Use (TOCTOU) concern:
the kernel checks access, changes PKRU, performs the copy, then restores PKRU.
Between the PKRU change and restore, if the traced driver could execute arbitrary code,
it could issue its own `WRPKRU` and escape isolation.

**Mitigation strategy:**
ptrace PKRU-protected access (TOCTOU-safe):
  1. Acquire pt_reg_lock(target_tid) — the traced thread cannot run.
  2. Verify the debugger holds CAP_SYS_PTRACE and the ptrace relationship
     is authorized. This check happens before any PKRU state change.
  3. Verify the address belongs to a valid MPK region owned by the target.
  4. With IRQs disabled and pt_reg_lock held:
     a) Save current PKRU
     b) Set PKRU to grant temporary access to the target's PKEY
     c) Perform the copy (read or write)
     d) Restore saved PKRU
  5. Release pt_reg_lock(target_tid)
This approach creates a **locked validation window**: the traced process cannot execute
between authorization and data copy, and cannot escape by issuing its own `WRPKRU`
because it is blocked by `pt_reg_lock`. The authorization check occurs before any
PKRU manipulation, ensuring that unauthorized debuggers cannot exploit the window.

**Alternative approaches considered:**

1. **Permanently grant debugger PKRU access**: Rejected — violates isolation principle.
2. **Copy through a bounce buffer with kernel mapping**: Adds overhead but would work;
   however, PKRU manipulation is fast (~23 cycles) and the lock-based approach is
   simpler when the debugger is already ptrace-attached.
3. **Disable PTRACE_PEEKDATA on Tier 1 drivers**: Would compromise debuggability;
  the lock-based approach provides security without removing functionality.

The key invariant is: *no user-space code from the traced process runs between PKRU
authorization and PKRU restoration*. `pt_reg_lock` enforces this invariant.
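The locked window can be modeled in miniature. In the sketch below, the PKRU is a plain u32, pt_reg_lock is a std Mutex, and the copy itself is elided; none of these are real kernel types, but the ordering of the steps mirrors the flow above:

```rust
use std::cell::Cell;
use std::sync::Mutex;

// Miniature model of the locked validation window. A set bit in `pkru`
// means "access denied" for that PKEY (like PKRU's AD bit). While
// pt_reg_lock is held, the traced thread cannot run.
struct TracedThread {
    pt_reg_lock: Mutex<()>,
    pkru: Cell<u32>,
}

fn ptrace_peek(t: &TracedThread, target_pkey: u32, authorized: bool) -> Result<(), &'static str> {
    let _guard = t.pt_reg_lock.lock().unwrap(); // 1. target frozen
    if !authorized {
        return Err("EPERM");                    // 2. check BEFORE any PKRU change
    }
    // 3. (address validation elided in this model)
    let saved = t.pkru.get();                   // 4a. save PKRU
    t.pkru.set(saved & !(1 << target_pkey));    // 4b. grant temporary access
    /* 4c. perform the copy on behalf of the debugger */
    t.pkru.set(saved);                          // 4d. restore PKRU
    Ok(())
}                                               // 5. lock released on drop

fn main() {
    let t = TracedThread { pt_reg_lock: Mutex::new(()), pkru: Cell::new(0xFFFF_FFFF) };
    assert!(ptrace_peek(&t, 3, true).is_ok());
    assert_eq!(t.pkru.get(), 0xFFFF_FFFF); // window closed: PKRU restored
    assert!(ptrace_peek(&t, 3, false).is_err());
    assert_eq!(t.pkru.get(), 0xFFFF_FFFF); // unauthorized: PKRU never touched
}
```

The two assertions after the calls are exactly the invariant: PKRU is back to its saved value whenever the lock is released, and an unauthorized debugger never causes a PKRU change at all.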

Weak-isolation fast path: On platforms without MPK (or equivalent domain registers), the entire PKRU manipulation flow is unnecessary. ptrace uses the standard kernel read/write path — the driver's memory is in the same address space with no domain protection, so no temporary access grant is needed. The pt_reg_lock and TOCTOU-safe window are only instantiated when the architecture reports hardware domain support.

11.3.9 Signal Delivery Across Isolation Boundaries

When a signal targets a thread running in a Tier 1 (domain-isolated) driver:

Signal delivery to Tier 1 driver thread:
  SIGKILL / SIGSTOP (non-catchable):
    Kernel handles these directly — no signal frame is pushed.
    For SIGKILL: the driver thread is terminated. The kernel runs
    the driver's cleanup handler (if registered via KABI) in a
    bounded context (timeout: 100ms). If cleanup doesn't complete,
    the driver's isolation domain is revoked and all its memory freed.

  Catchable signals (SIGSEGV, SIGUSR1, etc.):
    1. Kernel saves driver's PKRU state.
    2. Kernel sets PKRU to the process's default domain (no driver
       memory access) before pushing the signal frame to the user stack.
    3. Signal handler runs in the process's normal domain — it cannot
       access driver-private memory.
    4. On sigreturn: kernel restores the saved PKRU and resumes the
       driver code with its original domain permissions.

  This ensures a signal handler in application code cannot accidentally
  (or maliciously) access driver-private memory while handling a signal
  that interrupted driver execution.

> **Weak-isolation fast path**: Without hardware domain registers (no MPK/POE/DACR),
> the PKRU save/restore steps are elided. Signals are delivered using the standard
> kernel signal path without domain register manipulation. The signal handler runs
> with normal kernel permissions — on these platforms, the driver memory is not
> domain-protected anyway, so there is nothing to save or restore.

See also: Section 5.11 (SmartNIC and DPU Integration) adds an offload tier where driver data-plane operations are proxied to a DPU over PCIe or shared memory, using the same tier classification and IOMMU fencing model.

11.3.10 eBPF Interaction with Driver Isolation Domains

eBPF programs are a cross-cutting kernel extensibility mechanism used for tracing (kprobe, tracepoint), networking (XDP, tc), security (LSM, seccomp), and scheduling (struct_ops). Because eBPF programs execute in kernel mode with access to kernel data structures, their interaction with driver isolation domains requires explicit specification to prevent isolation domain circumvention.

Threat model: An eBPF program, if not properly constrained, could:

  1. Access Tier 1/Tier 2 driver memory directly without going through the isolation boundary
  2. Bypass MPK/POE protections by running in the same domain as umka-core
  3. Modify driver state without proper capability checks
  4. Exfiltrate data from isolated driver memory to user space via BPF maps

Isolation architecture: eBPF programs do not run in the same isolation domain as umka-core (PKEY 0). Each loaded eBPF program is assigned to a dedicated BPF isolation domain that is distinct from:

  • umka-core (PKEY 0)
  • All Tier 1 driver domains (PKEY 2-13 on x86-64)
  • The shared DMA domain (PKEY 14)
  • The guard domain (PKEY 15)

This means eBPF programs cannot directly access driver-private memory, umka-core internal state, or any isolation domain's memory without explicit kernel mediation.

Access rules for eBPF programs:

  1. No direct driver memory access: An eBPF program attached to a kprobe or tracepoint within a Tier 1 driver's code path executes in its own BPF domain, not the driver's domain. The BPF program cannot read or write the driver's private heap, stack, or MMIO-mapped device registers. Any access to driver state must go through BPF helper functions that perform cross-domain access on the program's behalf.

  2. BPF helper mediation: All BPF helpers that access kernel or driver state (e.g., bpf_probe_read_kernel(), bpf_sk_lookup(), bpf_ct_lookup()) are implemented as kernel-mediated cross-domain operations. The helper:
     • Validates that the target memory region belongs to a domain for which the BPF program's domain holds the appropriate capability (see rule 4)
     • Copies data between the target domain and the BPF program's stack or map memory using kernel-internal mappings that bypass domain restrictions
     • Returns an error if the capability check fails or the access is out of bounds

  3. Map isolation: BPF maps created by an eBPF program are owned by that program's BPF domain. Other isolation domains (including drivers) cannot access these maps without an explicit capability grant. Cross-domain map sharing follows the standard capability delegation mechanism (Section 9.1): the BPF domain must grant MAP_READ and/or MAP_WRITE capabilities to the target domain. This prevents a compromised driver from exfiltrating data through BPF maps it does not own.

  4. Capability requirements for driver access: BPF helpers that query or modify driver state require the BPF domain to hold the appropriate capability:
     • bpf_skb_adjust_room() (modify packet buffer in NIC driver): requires CAP_NET_RAW in the caller's network namespace
     • bpf_xdp_adjust_head() / bpf_xdp_adjust_tail(): requires CAP_NET_RAW
     • Helpers that read driver statistics or state: require CAP_SYS_ADMIN or a subsystem-specific read capability
     The verifier rejects at load time any program that calls a helper for which the loading context (the process calling bpf()) does not hold the required capabilities. The eBPF runtime re-checks capabilities at helper invocation time to handle capability revocation after program load.

  5. XDP and driver datapath: XDP programs attached to a NIC driver's receive path do not execute in the NIC driver's isolation domain. Instead:
     • The driver's receive handler (running in the driver's domain) copies the packet descriptor into a shared bounce buffer accessible to the BPF domain
     • The XDP program runs in the BPF domain, reading from and writing to the bounce buffer
     • Return values (XDP_PASS, XDP_DROP, XDP_TX, XDP_REDIRECT) are communicated back to the driver via a shared-memory return code
     • If the XDP program modifies the packet (XDP_TX or XDP_REDIRECT with modified data), the driver copies the modified packet back to its own domain before transmission or redirect
     This bounce-buffer design ensures the XDP program never directly accesses driver-private state (DMA rings, completion queues, device registers).

Performance note for 100Gbps+: At 100Gbps with 64-byte packets (~148 Mpps), per-packet bounce copies become a bottleneck (~10ns each = ~1.5 CPU cores just for memcpy). For high-speed NICs (≥25Gbps), UmkaOS supports a zero-copy XDP fast path: the NIC driver maps its receive ring into the BPF isolation domain as read-only (via the shared DMA buffer pool, PKEY 14 on x86 / domain 2 on AArch64), allowing XDP programs to inspect packets in-place without a copy. Modification still requires a copy-on-write to a BPF-writable buffer. This zero-copy path is opt-in per driver (xdp_features flag XDP_F_ZEROCOPY_RX) and requires IOMMU to fence the BPF domain's read-only mapping.
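The ~148 Mpps figure follows from minimum-size Ethernet framing: each 64-byte frame occupies 84 bytes on the wire once the 7-byte preamble, 1-byte start delimiter, and 12-byte inter-frame gap are counted. A quick arithmetic check:

```rust
fn main() {
    // 64-byte frame + 20 bytes on-wire overhead: preamble (7) + SFD (1)
    // + minimum inter-frame gap (12) = 84 bytes per frame.
    let line_rate_bps = 100e9_f64;
    let wire_bytes = 64.0 + 20.0;
    let pps = line_rate_bps / (wire_bytes * 8.0);
    assert!((pps / 1e6 - 148.8).abs() < 0.1); // ≈ 148.8 Mpps

    // At ~10ns per bounce copy, the copies alone consume ~1.5 CPU cores:
    let core_seconds_per_second = pps * 10e-9;
    assert!((core_seconds_per_second - 1.49).abs() < 0.05);
}
```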

Weak-isolation fast path: When running without hardware isolation domains (isolation=performance or architectures without fast isolation), the bounce buffer is bypassed. XDP/TC programs access the driver's packet buffer directly (true zero-copy, matching Linux's XDP model). The BPF verifier still enforces bounds checking and memory safety — only the domain separation between BPF and driver memory is lost. Since the driver code itself already has unrestricted kernel memory access on these platforms, the bounce buffer would be protecting the driver's memory from BPF while the driver can already read/write all of kernel memory. The per-packet memcpy savings are significant at high packet rates (100Gbps with 64-byte packets = ~148M copies/sec eliminated).

  6. TC (traffic control) BPF: Same model as XDP — TC programs execute in a BPF domain, not in the network driver's or umka-net's domain. Packet data is copied through a shared buffer; the program cannot access umka-net's socket buffers, routing tables, or connection tracking state except through verified BPF helpers (bpf_fib_lookup(), bpf_ct_lookup(), etc.) that perform capability-checked cross-domain access.

  7. Kprobe and tracepoint attachment to drivers: When a BPF program is attached to a kprobe within a Tier 1 driver's code:
     • The kprobe fires while the CPU is running in the driver's isolation domain
     • The BPF program is invoked after the kernel switches to the BPF domain
     • The program receives only the function arguments (copied to BPF stack) and cannot access the driver's heap, globals, or MMIO regions
     • Return probes (kretprobe) receive the return value copied to BPF stack
     The domain switch before BPF execution and the argument copy are performed by the kprobe infrastructure in umka-core, ensuring the BPF program is fully contained within its own domain.

  8. LSM BPF and security hooks: LSM BPF programs attached to security hooks (file open, socket create, etc.) run in a BPF domain. They cannot access the credentials, file descriptors, or socket state of the process that triggered the hook except through BPF helpers (bpf_get_current_pid_tgid(), bpf_get_current_cred(), etc.) that copy the relevant data into the BPF program's memory. Security decisions (allow/deny) are returned via an integer return code; the program cannot directly modify kernel security state.

Domain allocation for BPF: On x86-64, BPF domains are allocated from the same PKEY pool as Tier 1 drivers (PKEY 2-13). Typical systems run 5-8 Tier 1 driver domains, leaving 4-7 domains for BPF programs. When domain exhaustion occurs (drivers + BPF programs > 12 domains), BPF programs share a common BPF domain rather than each getting a dedicated domain. This reduces isolation granularity between BPF programs but preserves isolation between BPF and drivers and between BPF and umka-core. BPF-to-BPF isolation is a best-effort optimization, not a security guarantee — BPF programs are verified code with bounded execution, and their primary isolation boundary is BPF-to-driver and BPF-to-core, both of which are always maintained regardless of domain pressure. BPF domain allocations are visible in /sys/kernel/umka/isolation/domains with type: bpf to distinguish them from Tier 1 driver domains. The total domain count displayed includes both driver and BPF allocations. On architectures without a fixed domain limit (PPC64LE, AArch64 mainstream page-table path), each BPF program gets its own domain. On RISC-V (no Tier 1), BPF domains are not applicable — BPF programs run without isolation domains.
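The exhaustion fallback can be sketched as follows. This is an illustrative model only: holding back PKEY 13 as the shared BPF fallback, and the name BPF_SHARED_PKEY, are modeling choices, not part of the spec.

```rust
// Illustrative model of the x86-64 domain pool under exhaustion.
const BPF_SHARED_PKEY: u8 = 13; // modeling choice: one key held back for shared BPF

struct PkeyPool {
    free: Vec<u8>, // dedicated keys (2..=12 in this model)
}

impl PkeyPool {
    fn new() -> Self {
        PkeyPool { free: (2..=12).rev().collect() }
    }
    // Tier 1 drivers require a dedicated key or get no Tier 1 domain at all.
    fn alloc_driver(&mut self) -> Option<u8> {
        self.free.pop()
    }
    // BPF programs get a dedicated key while one is free; on exhaustion they
    // fold into the shared BPF domain. BPF-to-BPF isolation becomes
    // best-effort, but BPF-to-driver and BPF-to-core isolation is preserved.
    fn alloc_bpf(&mut self) -> u8 {
        self.free.pop().unwrap_or(BPF_SHARED_PKEY)
    }
}

fn main() {
    let mut pool = PkeyPool::new();
    for _ in 0..6 { pool.alloc_driver().unwrap(); } // 6 Tier 1 driver domains
    let a = pool.alloc_bpf(); // dedicated BPF domains while keys remain
    let b = pool.alloc_bpf();
    assert_ne!(a, b);
    for _ in 0..3 { pool.alloc_bpf(); } // drain the remaining keys
    assert_eq!(pool.alloc_bpf(), BPF_SHARED_PKEY); // exhausted: shared domain
    assert_eq!(pool.alloc_bpf(), BPF_SHARED_PKEY);
}
```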

Crash handling: A crash (verifier bug, JIT bug, or helper bug) within a BPF program triggers the same containment as a Tier 1 driver crash:

  • The BPF domain is revoked
  • All maps owned by that domain are invalidated (subsequent lookups return -ENOENT)
  • Attached hooks are automatically detached
  • The program is marked as faulted and cannot be re-attached without reload

Unlike Tier 1 drivers, BPF programs do not have a recovery path — they are considered stateless (persistent state lives in maps, which survive program reload). The administrator must reload the program manually or via orchestration.

Full specification: The complete BPF isolation model — domain confinement, map access control, capability-gated helpers, cross-domain packet redirect rules, and verifier enforcement — is specified in Section 19.2 (BPF Isolation Model), which provides the canonical specification; the rules above are a driver-centric summary. The networking-specific rules (conntrack kfuncs, XDP redirect rate limiting) are in Section 16.18. Although Section 16.18 is located in the Networking part, its isolation rules apply to all BPF program types, not just networking hooks.

11.3.11 Tier 2 Interface and SDK

Tier 2 drivers run in separate user-space processes. They communicate with umka-core via dedicated KABI syscalls — not the domain ring buffers used by Tier 1.

KABI syscalls for Tier 2 drivers:

These syscalls use a dedicated syscall range (__NR_umka_driver_base + offset, allocated from the UmkaOS-private syscall range defined in Section 19.1). They are not Linux-compatible syscalls — they are UmkaOS-specific and used only by the Tier 2 driver SDK. The SDK wraps them behind the same KernelServicesVTable interface that Tier 1 drivers use, so driver code is tier-agnostic.

| KABI Syscall | Syscall Offset | Arguments | Return | Purpose |
|---|---|---|---|---|
| umka_driver_register | 0 | manifest: *const DriverManifest, manifest_size: u64, out_services: *mut KernelServicesVTable, out_device: *mut DeviceDescriptor | IoResultCode | Register with device registry. Kernel validates manifest, assigns capabilities, returns kernel services vtable and device descriptor. Called once at driver process startup. |
| umka_driver_mmio_map | 1 | device_handle: DeviceHandle, bar_index: u32, offset: u64, size: u64, out_vaddr: *mut u64 | IoResultCode | Map a device BAR (or portion) into driver address space. Kernel validates BAR ownership, IOMMU group, and capability before creating the mapping. The mapping is revocable: the kernel can unmap it at any time during driver containment (see "Tier 2 MMIO access model" above). |
| umka_driver_dma_alloc | 2 | size: u64, align: u64, flags: AllocFlags, out_vaddr: *mut u64, out_dma_addr: *mut u64 | IoResultCode | Allocate DMA-capable memory. Kernel allocates physical pages, creates IOMMU mapping, maps into driver process. Returns both virtual and DMA (bus) addresses. |
| umka_driver_dma_free | 3 | vaddr: u64, size: u64 | IoResultCode | Release a DMA buffer. Kernel tears down IOMMU mapping, unmaps from process, frees physical pages. |
| umka_driver_irq_wait | 4 | irq_handle: u32, timeout_ns: u64 | IoResultCode | Block until the registered interrupt fires or timeout expires. Returns IO_SUCCESS on interrupt, IO_TIMEOUT on timeout. Uses eventfd internally for efficient wakeup. |
| umka_driver_complete | 5 | request_id: u64, status: IoResultCode, bytes_transferred: u64 | IoResultCode | Post an I/O completion to umka-core. The completion is forwarded to the originating io_uring CQ or waiting syscall. |

Error codes: All Tier 2 KABI syscalls return IoResultCode (defined in umka-driver-sdk/src/abi.rs). Common errors: IO_ERR_INVALID_HANDLE (bad device handle), IO_ERR_PERMISSION (missing capability), IO_ERR_NO_MEMORY (allocation failed), IO_ERR_BUSY (resource in use), IO_ERR_TIMEOUT, IO_ERR_INTERRUPTED (-EINTR, signal received).

Signal interruptibility: All 6 Tier 2 KABI syscalls are interruptible. Non-fatal signals (SIGTERM, SIGINT, etc.) cause the syscall to return IO_ERR_INTERRUPTED (-EINTR). The driver should retry or clean up as appropriate. SIGKILL causes immediate process exit; the kernel's Tier 2 crash recovery sequence (Section 11.9) activates on process exit detection, cleaning up IOMMU mappings, MMIO regions, ring state, and DMA buffers. umka_driver_irq_wait uses wait_event_interruptible internally.
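A Tier 2 driver therefore needs an explicit retry path around interruptible waits. Below is a hedged sketch against a mocked syscall; the IoResultCode variants and irq_wait_mock are stand-ins for the SDK wrapper, not the real syscall interface.

```rust
// Mocked sketch of the -EINTR retry pattern a Tier 2 driver's IRQ
// thread uses around umka_driver_irq_wait.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IoResultCode {
    IoSuccess,
    IoErrInterrupted, // -EINTR: a non-fatal signal arrived during the wait
}

// Mock: returns IO_ERR_INTERRUPTED while signals are pending, then IO_SUCCESS.
fn irq_wait_mock(pending_signals: &mut u32) -> IoResultCode {
    if *pending_signals > 0 {
        *pending_signals -= 1;
        IoResultCode::IoErrInterrupted
    } else {
        IoResultCode::IoSuccess
    }
}

// Retry on interruption; return on success (real error handling elided).
fn wait_for_irq(mut pending_signals: u32) -> IoResultCode {
    loop {
        match irq_wait_mock(&mut pending_signals) {
            IoResultCode::IoErrInterrupted => continue, // signal: retry the wait
            other => return other,
        }
    }
}

fn main() {
    assert_eq!(wait_for_irq(3), IoResultCode::IoSuccess);
}
```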

DMA subsystem call path — Tier 2 DMA syscalls map directly to the kernel-internal DmaDevice trait (Section 4.14):

| Tier 2 Syscall | Kernel-Internal Call |
|---|---|
| umka_driver_dma_alloc(size, align, flags, ...) | device.dma_alloc_coherent(size, gfp) → creates IOMMU mapping → maps result into driver process virtual address space |
| umka_driver_dma_free(vaddr, size) | Unmaps from process → device.dma_free_coherent(buf) → tears down IOMMU mapping → frees physical pages |

The kernel performs three additional operations around the DmaDevice trait call:

  1. Capability check: verifies the driver holds CAP_DMA for this device
  2. IOMMU scoping: the DmaDeviceHandle.iommu_domain ensures the allocation is mapped only into this device's IOMMU domain, not globally visible
  3. Process mapping: maps the allocated physical pages into the Tier 2 driver's Ring 3 address space with appropriate permissions (read-write, no-execute)

AccelMemHandle (Section 22.4) is a separate abstraction for device-local memory (GPU VRAM, NPU SRAM). It does NOT go through DmaDevice because the memory is allocated on the device, not in host RAM. The host-side DMA path is used only when migrating pages between host and device (Section 22.4).

Performance: Per-I/O overhead floor is ~200-400ns (two syscall transitions). For high-IOPS devices (NVMe, 100GbE), this is significant — those belong in Tier 1. Tier 2 suits devices where overhead is negligible: USB, printers, audio (~1-10ms periods), experimental drivers, and third-party binaries compiled against the stable SDK.

Security boundary: A Tier 2 driver crash is an ordinary process crash. It cannot corrupt kernel memory or issue DMA outside IOMMU-fenced regions. On containment, the kernel revokes all MMIO mappings (preventing further device register access) and tears down IOMMU entries (causing any residual in-flight DMA to fault). The kernel restarts the driver process if the restart policy permits (~10ms recovery).

Tier 2 mandatory syscall filter:

Every Tier 2 driver process runs under a mandatory seccomp-BPF filter (Section 10.3) installed by the kernel during umka_driver_register. The default allowlist:

| Category | Allowed syscalls |
|---|---|
| KABI IPC | umka_driver_call, umka_driver_complete, umka_driver_register |
| Memory | mmap, munmap, madvise (for shared ring buffer mapping) |
| Synchronization | futex, clock_gettime, clock_nanosleep |
| I/O | read, write, close, epoll_wait, epoll_ctl (for fd-based IPC) |
| Lifecycle | exit_group, rt_sigreturn |

All other syscalls return -EPERM. The filter CANNOT be relaxed by the driver. A driver manifest may request additional syscalls (e.g., socket for a network driver), which are granted only if the driver's signing certificate includes the corresponding capability claim in its [permissions] section.
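The allowlist-plus-manifest-extras policy can be modeled as a simple membership check. This is illustrative only: the real filter is compiled to seccomp-BPF over syscall numbers (not names), installed by the kernel at umka_driver_register, and not relaxable by the driver.

```rust
// Illustrative model of the Tier 2 default allowlist from the table above.
const TIER2_DEFAULT_ALLOWLIST: &[&str] = &[
    // KABI IPC
    "umka_driver_call", "umka_driver_complete", "umka_driver_register",
    // Memory (shared ring buffer mapping)
    "mmap", "munmap", "madvise",
    // Synchronization
    "futex", "clock_gettime", "clock_nanosleep",
    // fd-based IPC
    "read", "write", "close", "epoll_wait", "epoll_ctl",
    // Lifecycle
    "exit_group", "rt_sigreturn",
];

// A syscall passes if it is in the default allowlist or was granted via
// the driver manifest (backed by the signing cert's [permissions] claims).
fn tier2_syscall_allowed(name: &str, manifest_extras: &[&str]) -> bool {
    TIER2_DEFAULT_ALLOWLIST.contains(&name) || manifest_extras.contains(&name)
}

fn main() {
    assert!(tier2_syscall_allowed("mmap", &[]));
    assert!(!tier2_syscall_allowed("socket", &[]));        // default-deny: -EPERM
    assert!(tier2_syscall_allowed("socket", &["socket"])); // cert-granted extra
}
```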


11.4 Device Registry and Bus Management

Summary: This section specifies the kernel-internal device registry — a topology-aware tree that tracks all hardware devices, their parent/child relationships, driver bindings, power states, and capabilities. It covers: the registry data model (Section 11.4), bus enumeration and matching (Section 11.4), device lifecycle and hot-plug (Section 11.4, Section 11.4), and power management ordering (Section 11.4). The registry is the single source of truth for "what hardware exists" and is used by the scheduler (Section 7.1), fault manager (Section 20.1), DPU offload layer (Section 5.11), and unified compute topology (Section 22.8).

Companion sections (split for size): - Section 11.5 — IOMMU groups, DMA identity mapping, PCIe ASPM - Section 11.6 — service discovery, KABI integration, crash recovery integration, boot sequence, sysfs compatibility, firmware management

11.4.1 Motivation and Prior Art

11.4.1.1 The Problem

UmkaOS's KABI provides a clean bilateral vtable exchange between kernel and driver. But the current design has no answer for:

  • Device hierarchies: How does the kernel model that a USB keyboard is behind a hub, which is behind an XHCI controller, which sits on a PCI bus? The topology matters for power management ordering, hot-plug teardown, and fault propagation.
  • Driver-to-device matching: When the kernel discovers a PCI device with vendor 0x8086 and device 0x2723, how does it know which driver to load? Currently there is no matching mechanism.
  • Power management ordering: Suspending a PCI bridge before its child devices causes data loss. The kernel needs to know the topology to get the ordering right.
  • Cross-driver services: A NIC may need a PHY driver. A GPU display pipeline may need an I2C controller. There is no way for drivers to discover and use services provided by other drivers.
  • Hot-plug: When a USB device is yanked, the kernel must tear down the device, its driver, and all child devices in the correct order.

The key insight from macOS IOKit: the kernel should own the device relationship model. But IOKit's mistake was embedding the model in the driver's C++ class hierarchy, coupling it to the ABI. We build it as a kernel-internal service that drivers access through KABI methods.

11.4.1.2 What We Learn From Existing Systems

Linux (kobject / bus / device / driver / sysfs):
  • Device model is a graph of kobject structures exposed via sysfs.
  • Bus types (PCI, USB, platform) each implement their own match/probe/remove.
  • Strengths: sysfs gives userspace introspection; uevent mechanism for hotplug.
  • Weaknesses: driver matching is bus-specific with no unified property system; power management ordering is heuristic (dpm_list), not topology-derived; the kobject model is deeply entangled with kernel internals — drivers directly embed and manipulate kobjects.

macOS IOKit (IORegistry):
  • All devices modeled as a tree of C++ objects (IORegistryEntry → IOService → ...).
  • Matching uses property dictionaries ("matching dictionaries").
  • Power management tree mirrors the registry tree — IOPMPowerState arrays per driver.
  • Strengths: property-based matching is elegant; PM ordering derives from the tree; service publication/lookup via IOService matching.
  • Weaknesses: C++ class hierarchy is the ABI — changing a base class breaks all drivers (fragile base class problem). This is why Apple deprecated kexts and moved to DriverKit. The matching system is over-general (personality dictionaries are complex). Memory management is manual.

Windows PnP Manager:
  • Kernel-mode PnP manager maintains a device tree. Device nodes have properties.
  • INF files declare driver matching rules (declarative, external to the binary).
  • Power management uses IRP_MN_SET_POWER directed through the tree.
  • Strengths: INF-based declarative matching is clean; power IRPs propagate with correct ordering; robust hotplug.
  • Weaknesses: IRP-based model is complex; WDM/WDF driver model is notoriously difficult.

Fuchsia (Driver Framework v2):
  • "Bind rules" — a simple declarative language — match drivers to devices.
  • Driver manager runs as a userspace component. Device topology is a tree of nodes in a namespace.
  • Strengths: clean separation of concerns; bind rules are simple and composable; userspace driver manager can be restarted independently.
  • Weaknesses: everything going through IPC adds latency; the DFv1-to-DFv2 migration shows that evolving the framework is painful.

11.4.1.3 UmkaOS's Position

We take the best ideas from each:

| Concept | Borrowed From | Adaptation |
|---|---|---|
| Property-based matching | IOKit | Declarative match rules in driver manifest, not runtime OOP matching |
| Registry as a tree | IOKit, Linux | Kernel-internal tree, drivers get opaque handles only |
| PM ordering from topology | IOKit, Windows | Topological sort of device tree, timeouts at each level |
| Service publication/lookup | IOKit | Mediated by registry through KABI, not direct object references |
| Sysfs-compatible output | Linux | Registry is the single source of truth for /sys |
| Uevent hotplug notifications | Linux | Registry emits Linux-compatible uevents |
| Declarative bind rules | Fuchsia | Match rules embedded in driver ELF binary |

What we take from none of them: the registry is a kernel-internal data structure. Drivers never see it directly. They interact through opaque DeviceHandle values and KABI vtable methods. No OOP inheritance, no C++ objects, no kobject embedding, no global symbol tables. The flat, versioned, append-only KABI philosophy is fully preserved.


11.4.2 Design Principles

  1. Kernel owns the graph, drivers own the hardware logic. The registry manages topology, matching, lifecycle, and power ordering. Drivers manage hardware registers, DMA, and device-specific protocols. Clean separation.

  2. Drivers are leaves, not framework participants. A driver does not subclass a framework object. It fills in a vtable and receives callbacks. The registry decides when to call those callbacks based on topology and policy.

  3. No ABI coupling. The registry is kernel-internal. Drivers interact with it through KABI methods appended to KernelServicesVTable. If the registry's internal data structures change, no driver recompilation is needed.

  4. Topology drives policy. Power management ordering, hot-plug teardown, crash recovery cascading, and NUMA affinity are all derived from the device tree topology. No heuristics, no manually maintained ordering lists.

  5. Capability-mediated access. All cross-driver interactions go through the registry, which validates capabilities and handles tier transitions (isolation domain switches, user-kernel IPC). Drivers never communicate directly.


11.4.3 Registry Data Model

11.4.3.1 DeviceNode

The fundamental unit is a DeviceNode — a kernel-internal structure that drivers never see directly.

Heap allocation requirement: DeviceNode and its child structures (Vec, String, HashMap in PropertyTable, XArray in DeviceRegistry) require heap allocation. The device registry is initialized at Phase 4.2 (Section 11.6), which is after the physical memory allocator and virtual memory subsystem are running (steps 4b-4c). Tier 0 devices (APIC, timer, serial) that are needed before heap init do not use the registry — they are registered retroactively after registry init (Section 11.6). No registry data structures are used during early boot before the heap is available.

// Kernel-internal — NOT part of KABI

pub struct DeviceNodeId(pub u64);   // Unique, monotonically increasing, never reused

pub struct DeviceNode {
    // Identity
    id: DeviceNodeId,
    name: ArrayString<64>,          // e.g., "pci0000:00", "0000:00:1f.2", "usb1-1.3"

    // Tree structure
    parent: Option<DeviceNodeId>,
    /// Bounded by bus topology: USB hub max 127 ports, PCIe max ~256 downstream,
    /// platform typically <64.
    children: Vec<DeviceNodeId>,    // Ordered by discovery time

    // Service relationships (non-tree edges)
    /// Bounded by service types per device, typically <16.
    providers: Vec<ServiceLink>,    // Services this node consumes
    /// Bounded by consumer count, typically <32.
    clients: Vec<ServiceLink>,      // Nodes that consume services from this node

    // Device identity
    bus_type: BusType,              // Reuses existing BusType from abi.rs
    bus_identity: BusIdentity,      // Bus-specific ID (PCI IDs, USB descriptors, etc.)
    properties: PropertyTable,      // Key-value property store

    // Lifecycle
    state: DeviceState,
    driver_binding: Option<DriverBinding>,

    // Placement
    numa_node: i32,                 // -1 = unknown

    // Power
    power_state: PowerState,
    runtime_pm: RuntimePm,
    /// Monotonic timestamp of most recent KABI call through this device's vtable
    /// trampoline. Updated with Ordering::Relaxed on every dispatch; read by the
    /// autosuspend timer to detect idle devices (Section 11.4.6.4).
    activity_timestamp: AtomicU64,

    // Security
    device_cap: CapHandle,          // Capability for this device

    // Resources
    resources: DeviceResources,     // BAR mappings, IRQs, DMA state

    // Intrusive list link for deferred probe queue.
    // Used by `DeviceRegistry.deferred_list: IntrusiveList<DeviceNode>`.
    deferred_link: IntrusiveListNode,

    // IOMMU
    iommu_group: Option<IommuGroupId>,  // Shared IOMMU group (for passthrough)

    /// Pre-VFIO isolation tier. When a device is claimed by the VFIO/IOMMUFD
    /// passthrough framework, this field records the device's previous isolation
    /// tier (0 = Tier 0, 1 = Tier 1, 2 = Tier 2). On VFIO release, the device
    /// registry uses this to restore the original tier and re-bind the previous
    /// driver. `None` when the device is not currently claimed by VFIO.
    pre_vfio_tier: Option<u8>,

    // Reliability
    /// Sliding-window failure tracker. Records timestamps of recent failures
    /// in a circular buffer (capacity: 16 entries). The demotion policy checks
    /// how many failures occurred within the configured window (default: 60 seconds).
    /// See `FailureWindow` definition below.
    failure_window: FailureWindow,
    last_transition_ns: u64,

    // State buffer integrity
    /// HMAC-SHA256 key for state buffer integrity verification.
    /// Generated by umka-core on first driver load for this DeviceHandle.
    /// Persists across driver crash/reload cycles; discarded only on
    /// DeviceHandle removal (device unplugged or deregistered).
    /// See [Section 12.5](12-kabi.md#kabi-idl-language-specification--vtable-definition) for the canonical `DriverHmacKey` definition;
    /// see Section 11.4.3.1.1.1 below for the key lifecycle specification.
    state_hmac_key: Option<DriverHmacKey>,
}

11.4.3.1.1.1 DriverHmacKey: Key Storage and Lifecycle

The state_hmac_key field above is backed by DriverHmacKey, which controls every aspect of key material storage, protection, derivation, and rotation. The key must reside exclusively in UmkaOS Core private memory so that a compromised Tier 1 driver (which runs at Ring 0 and can execute WRPKRU) cannot extract it and forge state buffer integrity tags. See the threat model discussion in Section 11.6 (TOCTOU mitigation) for why Tier 2 isolation is required to prevent key extraction by an actively exploited driver.

See Section 12.5 for the canonical DriverHmacKey definition (including the created_at_ns timestamp field used for generation-reset disambiguation).

Key lifecycle:

  • Allocation — DriverHmacKey::new(slot, generation) is called under PKEY 0 protection during driver_load(). The call:
  • Reads 32 bytes from the platform TRNG (RDRAND on x86-64; SoC TRNG on ARM/RISC-V).
  • Reads TPM PCR[7] (32 bytes) via the TPM KABI call (Section 9.3).
  • Derives the key via HKDF-SHA256: key = HKDF(IKM=trng_bytes, salt=pcr7, info="umka-driver-hmac" || slot || gen).
  • Stores the result in DriverHmacKey.key within the PKEY 0 slab.

  • Access — Only UmkaOS Core code executing with PKRU granting PKEY 0 read/write can dereference DriverHmacKey.key. Driver code (PKEY 2-13 domains) receives a page fault if it attempts to read the key's memory. The HMAC computation itself is performed by a dedicated Core function (driver_state_hmac_compute) that briefly acquires PKEY 0 access, performs the computation into a stack-local output buffer, then restores the caller's PKRU before returning. The key material is never copied to driver-accessible memory.

  • Rotation — On every driver reload (crash recovery or explicit operator unload), generation increments, a new TRNG sample is drawn, and a fresh key is derived. The old DriverHmacKey is dropped, triggering Zeroize::drop which overwrites the key material with zeros via volatile writes before the slab page is returned to the allocator.

  • Storage location — umka_core::driver_registry::SLOT_KEYS is a slab-allocated array in PKEY 0-protected memory. At boot (after device enumeration, Phase 4.x), the driver registry allocates max_driver_slots entries from the slab allocator and maps the backing pages into the .pkey0 virtual address range — exclusively assigned to PKEY 0 in the Core page tables, inaccessible to all Tier 1 drivers. The array is indexed by DriverSlot (the same integer used in DeviceNode); max_driver_slots is discovered at boot from the device count and the configured tier limits (no compile-time cap). On architectures without PKEY support (RISC-V, s390x, LoongArch64), the array is protected by page-level permissions (kernel RW, no Tier 1 access) since Tier 1 drivers run as Tier 0 on those architectures.

  • Discarding — When a DeviceNode is removed from the registry (device unplugged or driver_deregister() called by the operator), state_hmac_key is set to None, dropping the DriverHmacKey value and zeroizing the key. Subsequent crash recovery for this slot (if a new device is hotplugged to the same slot) generates an entirely fresh key.
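
The rotation and zeroize-on-drop behavior described above can be sketched in plain Rust. This is a hypothetical, simplified model (real key derivation via TRNG + HKDF is omitted; the struct and `rotate` names here are illustrative, not the Core API):

```rust
/// Simplified model of the HMAC key slot: key material plus generation.
struct DriverHmacKey {
    key: [u8; 32],
    generation: u64,
}

impl Drop for DriverHmacKey {
    fn drop(&mut self) {
        // Volatile writes keep the compiler from eliding the wipe,
        // mirroring the Zeroize::drop behavior described in the text.
        for b in self.key.iter_mut() {
            unsafe { core::ptr::write_volatile(b, 0) };
        }
    }
}

/// Rotation on driver reload: the old key is dropped (and zeroized),
/// the generation increments, and fresh key material takes its place.
fn rotate(old: DriverHmacKey, fresh_key: [u8; 32]) -> DriverHmacKey {
    let next_gen = old.generation + 1; // `old` is dropped at end of scope
    DriverHmacKey { key: fresh_key, generation: next_gen }
}
```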

11.4.3.2 PropertyTable

Properties are the lingua franca of matching and introspection. They serve the same role as IOKit's property dictionaries and Linux's sysfs attributes.

// PropertyValue variants String, Bytes, and StringArray use heap-allocated
// containers. These are only constructed after heap init (boot step 4b+).
// For pre-heap device identification, Tier 0 devices use fixed-size
// ArrayString<64> in BusIdentity (Section 11.4.3.3) which is stack-allocated.
pub enum PropertyValue {
    U64(u64),
    I64(i64),
    String(String),
    Bytes(Vec<u8>),
    Bool(bool),
    StringArray(Vec<String>),
}

/// Fixed-capacity string key for device properties. 64 bytes covers all
/// standard property names (e.g., "compatible", "reg", "interrupt-parent").
pub type PropertyKey = KernelString<64>;

/// Maximum properties per device. Firmware-provided properties beyond this
/// limit are logged and dropped.  Prevents pathological memory consumption
/// from malicious ACPI tables or DTB blobs.
pub const MAX_PROPERTIES_PER_DEVICE: usize = 128;

/// Source annotation for device properties. Identifies who set the property.
/// Used for diagnostics and `/sys/devices/` property attribution.
/// Enum avoids per-property heap allocation (most annotations are one of
/// ~5 fixed strings).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PropertySource {
    /// Property set by ACPI namespace walk.
    Acpi,
    /// Property set by device tree (DTB) parser.
    DeviceTree,
    /// Property set by bus enumeration code.
    BusEnumerator,
    /// Property set during driver probe.
    DriverProbe,
    /// Property set by the device registry itself.
    Registry,
    /// Property set by a driver at runtime. Includes the driver name
    /// (fixed-capacity, no heap allocation).
    Driver(ArrayString<32>),
}

impl PropertySource {
    /// Return a human-readable string for diagnostics and sysfs display.
    pub fn as_str(&self) -> &str {
        match self {
            Self::Acpi => "ACPI",
            Self::DeviceTree => "DT",
            Self::BusEnumerator => "bus-enumerator",
            Self::DriverProbe => "driver-probe",
            Self::Registry => "registry",
            Self::Driver(name) => name.as_str(),
        }
    }
}

/// Stored as a sorted Vec for cache-friendly iteration and binary search.
/// Device nodes rarely have more than ~30 properties.  Pre-allocated with
/// capacity 32 for common-case efficiency.
pub struct PropertyTable {
    /// Tuple: (key, source_annotation, value).
    /// `source_annotation` identifies who set the property.
    /// Bounded to `MAX_PROPERTIES_PER_DEVICE` entries.
    entries: Vec<(PropertyKey, PropertySource, PropertyValue)>,
}
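
The sorted-Vec layout implies binary-search lookup and sorted insertion. A minimal sketch, with simplified types (String keys, u64 values, no source annotation; names here are illustrative, not the kernel API):

```rust
/// Simplified PropertyTable: entries kept sorted by key for binary search.
pub struct SimplePropertyTable {
    entries: Vec<(String, u64)>,
}

impl SimplePropertyTable {
    pub fn new() -> Self {
        // Pre-allocate for the common case, as the real table does.
        Self { entries: Vec::with_capacity(32) }
    }

    /// Insert or overwrite, preserving sort order.
    pub fn set(&mut self, key: &str, value: u64) {
        match self.entries.binary_search_by(|(k, _)| k.as_str().cmp(key)) {
            Ok(i) => self.entries[i].1 = value,
            Err(i) => self.entries.insert(i, (key.to_string(), value)),
        }
    }

    /// O(log n) lookup via binary search on the sorted entries.
    pub fn get(&self, key: &str) -> Option<u64> {
        self.entries
            .binary_search_by(|(k, _)| k.as_str().cmp(key))
            .ok()
            .map(|i| self.entries[i].1)
    }
}
```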

Standard property keys (well-known constants):

Key Type Description Set By
"bus-type" String "pci", "usb", "platform", "virtio", "channel-io" Bus enumerator
"vendor-id" U64 PCI/USB vendor ID Bus enumerator
"device-id" U64 PCI/USB device ID Bus enumerator
"subsystem-vendor-id" U64 PCI subsystem vendor Bus enumerator
"subsystem-device-id" U64 PCI subsystem device Bus enumerator
"class-code" U64 PCI class code / USB class Bus enumerator
"revision-id" U64 Hardware revision Bus enumerator
"compatible" StringArray DT/ACPI compatible strings Firmware parser
"device-name" String Human-readable name Bus enumerator
"driver-name" String Name of bound driver Registry
"driver-tier" U64 Current isolation tier Registry
"numa-node" I64 NUMA node ID Topology scanner
"location" String Physical topology path (e.g., PCI BDF) Bus enumerator
"serial-number" String Device serial if available Bus enumerator

Properties set by "Bus enumerator" are populated during device discovery by whatever code enumerates the bus (PCI config space scan, USB hub status, ACPI namespace walk). Properties set by "Registry" are managed by the kernel. Drivers can set custom properties on their own device node via KABI. Custom driver properties must use the "drv:<driver_name>:<key>" namespace prefix (e.g., "drv:ixgbe:fw-version") to prevent collisions between unrelated drivers setting properties on the same device node. The registry rejects property keys that do not match the calling driver's name prefix.
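
The "drv:<driver_name>:<key>" namespace rule reduces to a simple prefix check. A sketch of what the registry-side validation could look like (the helper name is hypothetical):

```rust
/// Accept a custom property key only if it carries the calling driver's
/// own "drv:<driver_name>:" prefix and a non-empty key part after it.
fn custom_key_allowed(driver_name: &str, key: &str) -> bool {
    let prefix = format!("drv:{}:", driver_name);
    key.starts_with(&prefix) && key.len() > prefix.len()
}
```

For example, "ixgbe" may set "drv:ixgbe:fw-version" but not "drv:e1000:fw-version" or a bare "fw-version".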

11.4.3.3 BusIdentity

A union-like enum holding bus-specific identification. Derives from the existing PciDeviceId in the driver SDK.

pub enum BusIdentity {
    Pci {
        segment: u16,
        bus: u8,
        device: u8,
        function: u8,
        id: PciDeviceId,        // Existing type from abi.rs
    },
    Usb {
        bus_num: u16,
        port_path: [u8; 8],    // Hub topology chain
        port_depth: u8,
        vendor_id: u16,
        product_id: u16,
        device_class: u8,
        device_subclass: u8,
        device_protocol: u8,
        interface_class: u8,
        interface_subclass: u8,
        interface_protocol: u8,
    },
    Platform {
        compatible: ArrayString<64>,    // ACPI _HID or DT compatible
        unit_id: u64,                   // ACPI _UID or DT unit address
    },
    VirtIo {
        device_type: u32,
        vendor_id: u32,
        device_id: u32,
    },
    /// s390x Channel I/O subsystem device. See [Section 11.10](#channel-io-subsystem).
    ChannelIo {
        css_id: u8,             // Channel Subsystem Image ID (0 for most configs)
        ssid: u8,               // Subchannel Set ID (0-3)
        sch_no: u16,            // Subchannel number (used in STSCH/SSCH instructions)
        devno: u16,             // Device number (from PMCW.dev, distinct from sch_no)
        cu_type: u16,           // Control Unit type (e.g., 0x3832 = virtio)
        cu_model: u8,           // Control Unit model
        dev_type: u16,          // Device type (from SenseID)
        dev_model: u8,          // Device model
    },
}

/// Bus type discriminant for the two-level `bus_index` XArray.
/// First-level key: bus type. Second-level key: bus-specific u64 device ID.
/// This avoids HashMap in the warm-path device lookup (per collection policy,
/// integer-keyed data uses XArray, not HashMap).
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum BusType {
    Pci       = 0,
    Usb       = 1,
    Platform  = 2,
    Acpi      = 3,
    VirtIo    = 4,
    ChannelIo = 5,
}

/// Pack a bus-specific device identity into a u64 XArray key.
///
/// Each bus type has a deterministic packing that fits in 64 bits:
/// - **PCI**: `bdf: u32` (segment:16 | bus:8 | devfn:8) → zero-extend to u64.
/// - **USB**: `FNV-1a(vid, pid, port_path)` → u64 hash. Collision-free for
///   practical device counts (<10^4). Hash collision → linear probe in the
///   XArray entry's collision list (bounded by MAX_USB_DEVICES_PER_BUS).
/// - **Platform**: `FNV-1a(compatible_string) XOR unit_id` → u64.
/// - **ACPI**: `FNV-1a(_HID string)` → u64.
/// - **VirtIO**: `(device_type as u64) << 32 | device_id as u64` → u64.
/// - **ChannelIo**: `(css_id as u64) << 24 | (ssid as u64) << 16 | sch_no as u64` → u64.
pub fn bus_identity_to_key(id: &BusIdentity) -> (BusType, u64) {
    match id {
        BusIdentity::Pci { segment, bus, device, function, .. } => {
            let bdf = ((*segment as u32) << 16) | ((*bus as u32) << 8)
                | ((*device as u32) << 3) | (*function as u32);
            (BusType::Pci, bdf as u64)
        }
        BusIdentity::Usb { bus_num, port_path, port_depth, vendor_id, product_id, .. } => {
            // FNV-1a hash of (vid, pid, port_path[..port_depth]).
            let mut h: u64 = 0xcbf2_9ce4_8422_2325;
            for &b in &[*vendor_id as u8, (*vendor_id >> 8) as u8,
                        *product_id as u8, (*product_id >> 8) as u8] {
                h ^= b as u64; h = h.wrapping_mul(0x0100_0000_01b3);
            }
            for &b in &port_path[..*port_depth as usize] {
                h ^= b as u64; h = h.wrapping_mul(0x0100_0000_01b3);
            }
            (BusType::Usb, h)
        }
        BusIdentity::Platform { compatible, unit_id } => {
            let name_hash = fnv1a_64(compatible.as_bytes());
            (BusType::Platform, name_hash ^ *unit_id)
        }
        BusIdentity::VirtIo { device_type, device_id, .. } => {
            (BusType::VirtIo, (*device_type as u64) << 32 | *device_id as u64)
        }
        BusIdentity::ChannelIo { css_id, ssid, sch_no, .. } => {
            (BusType::ChannelIo, (*css_id as u64) << 24 | (*ssid as u64) << 16 | *sch_no as u64)
        }
    }
}

// Legacy BusLookupKey retained only as documentation of the packing rationale.
// The actual registry uses an array of six `XArray<u64, DeviceNodeId>` values
// indexed by the `BusType` discriminant (the `bus_index` field of
// `DeviceRegistryInner`) — a two-level structure with O(1) lookup per level.
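
The fnv1a_64 helper called by bus_identity_to_key is not shown above. A minimal definition of the standard 64-bit FNV-1a (same offset basis and prime as the inlined USB hashing loop):

```rust
/// Standard 64-bit FNV-1a over a byte slice.
pub fn fnv1a_64(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0100_0000_01b3); // FNV 64-bit prime
    }
    h
}
```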

Non-tree edges representing provider-client relationships between devices:

pub struct ServiceLink {
    service_name: ArrayString<64>,  // e.g., "phy", "i2c", "gpio", "block"
    node_id: DeviceNodeId,
    cap_handle: CapHandle,          // Capability for mediated access
}

/// Binding between a device node and its driver. Stored in
/// `DeviceNode.driver_binding` after a successful probe.
pub struct DriverBinding {
    /// Bus type this driver handles (PCI, USB, Platform, etc.).
    pub bus_type: BusType,
    /// Driver manifest name (e.g., "nvme", "i915", "xhci-hcd").
    pub driver_name: ArrayString<64>,
    /// Match function: returns true if this driver can drive the given device.
    /// Called during device enumeration with the device's BusIdentity and
    /// PropertyTable. Must be pure (no side effects, no allocation).
    ///
    /// NOTE: This is a kernel-generated function pointer (compiled from the
    /// driver's declarative MatchRule entries at load time), NOT a driver-
    /// provided callback. It uses kernel-internal types (PropertyTable)
    /// that do not cross the KABI boundary.
    pub match_fn: fn(&BusIdentity, &PropertyTable) -> bool,
    /// Probe function: initialize the driver for the device. Called outside
    /// the registry lock with a cloned Arc<DeviceNode> reference.
    /// Returns Ok(()) on success; Err triggers fallback to next matching driver.
    pub probe_fn: fn(DeviceHandle) -> Result<(), KernelError>,
    /// Remove function: clean up driver state. Called during orderly device
    /// removal or driver unload. Must be idempotent.
    pub remove_fn: fn(DeviceHandle),
    /// Isolation tier at which this driver operates (0, 1, or 2).
    pub tier: u8,
}

/// A registered match rule from a driver manifest. The registry stores these
/// in `match_rules` and evaluates them when a new device arrives.
pub struct MatchRegistration {
    /// Driver that registered this match rule.
    pub driver_name: ArrayString<64>,
    /// Bus type filter (only match devices on this bus).
    pub bus_type: BusType,
    /// Match criteria: bus-specific IDs to accept. For PCI: vendor/device ID
    /// pairs. For USB: interface class/subclass/protocol tuples.
    pub match_ids: MatchIdTable,
    /// Priority: when multiple drivers match, higher priority wins.
    /// Default = 0. Bus-specific override drivers use higher values.
    pub priority: i32,
}

/// Match ID table: bus-specific device identification criteria.
// sizeof(MatchIdTable) is dominated by the largest variant (Pci: 64 *
// sizeof(PciDeviceId)).  This is acceptable because MatchRegistration is
// cold-path only (stored in the device registry, accessed during probe
// and hotplug).
pub enum MatchIdTable {
    /// PCI: list of (vendor_id, device_id) pairs. 0xFFFF = wildcard.
    Pci(ArrayVec<PciDeviceId, 64>),
    /// USB: list of (interface_class, interface_subclass, interface_protocol).
    Usb(ArrayVec<UsbMatchId, 64>),
    /// Platform: list of compatible strings (ACPI _HID / DT compatible).
    Platform(ArrayVec<ArrayString<64>, 16>),
    /// VirtIO: list of device_type values.
    VirtIo(ArrayVec<u32, 16>),
    /// s390x Channel I/O: list of (cu_type, dev_type) pairs.
    ChannelIo(ArrayVec<(u16, u16), 16>),
}

/// Device power management coordinator. One per `DeviceRegistryInner`.
/// Manages system-wide device power state transitions, runtime PM policy
/// enforcement, and suspend/resume ordering.
pub struct PowerManager {
    /// Global power state (running, suspending, suspended, resuming).
    pub state: AtomicU32,
    /// Ordered list of devices for suspend/resume sequencing.
    /// Suspend walks the list tail-to-head (leaves first, root last).
    /// Resume walks head-to-tail (root first, leaves last).
    pub device_order: RwLock<Vec<DeviceNodeId>>,
    /// Runtime PM autosuspend default timeout (ms). Devices inherit this
    /// unless they override via their `RuntimePm` configuration. Default: 2000ms.
    pub autosuspend_default_ms: u32,
    /// Count of devices currently in a non-suspended power state.
    /// When this reaches 0 during system suspend, the platform can
    /// enter the target sleep state (S3/S4/S5).
    ///
    /// **Memory ordering**: Incremented with `Release` after device resume
    /// completes (pairs with the suspend path's `Acquire` load).
    /// Decremented with `AcqRel` in the device suspend completion path;
    /// the `Acquire` component ensures all device quiesce operations are
    /// visible before the count reaches 0, and the `Release` component
    /// ensures the wakeup write to `suspend_complete` sees the final
    /// count. The system suspend path loads with `Acquire` before entering
    /// the target sleep state, guaranteeing all device suspend side-effects
    /// are globally visible on weakly-ordered architectures.
    pub active_device_count: AtomicU32,
    /// Wait queue signaled when `active_device_count` reaches 0.
    pub suspend_complete: WaitQueueHead,
}
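
Since topology drives policy (design principle 4), `device_order` can be derived directly from parent links: parents before children, so suspend (tail-to-head) quiesces leaves first and resume (head-to-tail) powers the root first. A sketch of that ordering, assuming an acyclic tree (a simple O(n²) form, not the kernel's actual implementation):

```rust
use std::collections::HashSet;

/// Emit device IDs so that every parent precedes its children.
/// Input: (device_id, optional parent_id) pairs in any order.
fn pm_order(nodes: &[(u64, Option<u64>)]) -> Vec<u64> {
    let mut order = Vec::with_capacity(nodes.len());
    let mut emitted = HashSet::new();
    while order.len() < nodes.len() {
        let before = order.len();
        for &(id, parent) in nodes {
            // A node is ready once its parent (if any) has been emitted.
            let ready = parent.map_or(true, |p| emitted.contains(&p));
            if ready && emitted.insert(id) {
                order.push(id);
            }
        }
        // No progress means a cycle or dangling parent in the input.
        assert!(order.len() > before, "malformed device topology");
    }
    order
}
```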

11.4.3.5 Tree Structure Example

Root
 +-- acpi0 (ACPI namespace root)
 |    +-- pci0000:00 (PCI host bridge, segment 0, bus 0)
 |    |    +-- 0000:00:1f.0 (ISA bridge / LPC)
 |    |    +-- 0000:00:1f.2 (SATA controller)
 |    |    |    +-- ata0 (ATA port 0)
 |    |    |    |    +-- sda (disk)
 |    |    |    +-- ata1 (ATA port 1)
 |    |    +-- 0000:00:14.0 (USB XHCI controller)
 |    |    |    +-- usb1 (USB bus)
 |    |    |    |    +-- usb1-1 (hub)
 |    |    |    |    |    +-- usb1-1.1 (keyboard)
 |    |    |    |    |    +-- usb1-1.2 (mouse)
 |    |    +-- 0000:03:00.0 (NVMe controller)
 |    |    |    +-- nvme0n1 (NVMe namespace 1)
 |    |    +-- 0000:04:00.0 (NIC - Intel i225)
 |    |    |    ...provider-client link: "phy" --> phy0 (not a child)
 +-- platform0 (Platform device root)
      +-- serial0 (Platform UART)
      +-- phy0 (Platform PHY device)

Two types of edges:

  1. Parent-Child (structural containment): A PCI device is a child of a PCI bridge. A USB device is a child of a USB hub. This is the primary tree structure.

  2. Provider-Client (service dependency): Lateral edges. A NIC is a client of a PHY's "phy" service. A GPU display driver is a client of an I2C controller's "i2c" service. These edges do not form cycles (enforced by the registry).
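
The no-cycles guarantee for provider-client edges amounts to a reachability check before each edge insertion. A sketch of one way the registry could enforce it (the approach and names are assumptions, not specified by the text): adding a `client -> provider` link closes a cycle exactly when the provider already reaches the client through existing service links.

```rust
use std::collections::{HashMap, HashSet};

/// `consumes` maps a node to the providers it already depends on.
/// Returns true if adding `client -> provider` would close a cycle.
fn would_create_cycle(
    consumes: &HashMap<u64, Vec<u64>>,
    provider: u64,
    client: u64,
) -> bool {
    // DFS from the provider along existing dependency edges.
    let mut stack = vec![provider];
    let mut seen = HashSet::new();
    while let Some(n) = stack.pop() {
        if n == client {
            return true; // provider transitively depends on client
        }
        if seen.insert(n) {
            if let Some(next) = consumes.get(&n) {
                stack.extend(next.iter().copied());
            }
        }
    }
    false
}
```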

11.4.3.6 The Registry

// DeviceRegistry uses XArray, HashMap, Vec, and VecDeque — all heap-allocated.
// The registry is initialized at Phase 4.2 ([Section 11.6](#device-services-and-boot--boot-sequence-integration)), after the heap
// is available. It is never accessed before heap init.
pub struct DeviceRegistry {
    /// The inner data structure, protected by an RwLock.
    ///
    /// **Locking strategy**:
    /// - **Read lock** (hot path): device lookup by ID/bus identity — O(1) via
    ///   the two-level `bus_index` XArray. Per-bus subtree iteration also holds the read lock.
    ///   Multiple concurrent readers (e.g., driver probes, service lookups) proceed
    ///   without contention.
    /// - **Write lock** (cold path): device add/remove (boot-time enumeration and
    ///   hotplug). Write lock is never held across driver probe calls — the probe
    ///   runs outside the lock after cloning the DeviceNode Arc reference.
    ///
    /// This RwLock-based design ensures that the registry is never a contention
    /// bottleneck during normal operation (reads dominate; writes are rare and brief).
    pub registry: RwLock<DeviceRegistryInner>,
}

pub struct DeviceRegistryInner {
    /// All nodes, indexed by ID. XArray provides O(1) lookup by integer key
    /// (DeviceNodeId is a u64 newtype) with RCU-compatible reads.
    nodes: XArray<Arc<DeviceNode>>,

    /// Next node ID (monotonically increasing).
    next_id: AtomicU64,

    /// Index: bus identity --> node ID (fast device lookup).
    /// Two-level XArray: first level keyed by `BusType` (6 entries), second
    /// level keyed by the bus-specific u64 device key (see `bus_identity_to_key`).
    /// O(1) lookup per level. Replaces the previous HashMap<BusLookupKey, _>
    /// to comply with the integer-keyed collection policy (XArray, not HashMap).
    /// All bus identity variants pack deterministically into u64:
    ///   PCI: bdf(u32) → u64; USB: FNV-1a(vid,pid,path) → u64;
    ///   Platform: FNV-1a(compat)^uid → u64; VirtIO: type<<32|id → u64;
    ///   ChannelIo: css<<24|ssid<<16|sch_no → u64. See `bus_identity_to_key()`.
    /// Path temperature is warm (device lookup during open/probe).
    bus_index: [XArray<u64, DeviceNodeId>; 6],

    /// Index: property key+value --> set of node IDs (for matching).
    property_index: HashMap<PropertyKey, Vec<DeviceNodeId>>,

    /// Index: driver name --> set of node IDs (for crash recovery).
    driver_index: HashMap<ArrayString<64>, Vec<DeviceNodeId>>,

    /// Registered match rules from all known driver manifests.
    match_rules: Vec<MatchRegistration>,

    /// Pending hotplug events, drained by the named `hotplug` workqueue
    /// (2 threads, depth 128, SCHED_OTHER — registered in the standard
    /// workqueue table, [Section 3.11](03-concurrency.md#workqueue-deferred-work)).
    ///
    /// Bus drivers report device arrival/departure via KABI methods
    /// (`registry_report_device`, `registry_report_removal`). These
    /// methods enqueue a `HotplugEvent` and submit a work item to the
    /// `hotplug` workqueue. The worker thread dequeues events under the
    /// registry write lock and runs the match engine, driver load, and
    /// uevent emission sequence.
    ///
    /// The `VecDeque` is bounded at 128 entries (matching the workqueue
    /// depth). If the queue is full, `registry_report_device` returns
    /// `IO_ERR_BUSY` (-EBUSY) to the bus driver, which retries after
    /// a short delay (10 ms exponential backoff, capped at 500 ms,
    /// maximum 20 attempts). After 20 failed attempts, the bus driver
    /// logs an FMA warning and drops the event. This provides backpressure
    /// during device enumeration storms (e.g., USB hub with 127 devices).
    hotplug_queue: VecDeque<HotplugEvent>,

    /// Handle to the `hotplug` named workqueue. Created during Phase 4.4a
    /// (bus enumeration) per the workqueue boot phase table
    /// ([Section 3.11](03-concurrency.md#workqueue-deferred-work)). The `DeviceRegistryInner` is
    /// initialized at Phase 4.2, but `hotplug_wq` is populated lazily
    /// on first use (wrapped in `Option<Arc<WorkQueue>>` — `None` until
    /// Phase 4.4a creates it).
    hotplug_wq: Option<Arc<WorkQueue>>,

    /// Deferred probe list. Devices whose last probe returned
    /// `ProbeResult::Deferred` are held here for re-probe when a new
    /// driver or resource becomes available. Protected by its own
    /// SpinLock to avoid holding the registry RwLock during re-probe
    /// (which may trigger driver load → registry write → deadlock).
    ///
    /// **Lock ordering invariant**: Device removal MUST first unlink from
    /// `deferred_list` before removing from `nodes` XArray. Lock order:
    /// `deferred_list.lock()` (LEVEL 80) before `registry.write()` (LEVEL 90).
    /// The `Arc<DeviceNode>` prevents use-after-free, but the logical invariant
    /// (device not in deferred_list after removal) requires this ordering.
    deferred_list: SpinLock<IntrusiveList<DeviceNode>>,

    /// Power management state.
    power_manager: PowerManager,
}
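
The backpressure schedule described for `hotplug_queue` (10 ms base, doubling per attempt, 500 ms cap, 20 attempts) works out to a fixed delay table. A sketch of the bus-driver-side computation (the function name is illustrative):

```rust
/// Retry delays (ms) a bus driver would use after IO_ERR_BUSY:
/// 10, 20, 40, 80, 160, 320, then 500 for the remaining attempts.
fn hotplug_retry_delays_ms() -> Vec<u64> {
    let mut delays = Vec::with_capacity(20);
    let mut d: u64 = 10;
    for _ in 0..20 {
        delays.push(d);
        d = (d * 2).min(500); // exponential backoff, capped at 500 ms
    }
    delays
}
```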

/// A hotplug event reported by a bus driver and processed by the `hotplug`
/// workqueue.
pub enum HotplugEvent {
    /// Device arrival: a new device was discovered on the bus.
    DeviceArrival {
        /// Parent device (bus controller).
        parent: DeviceHandle,
        /// Bus type of the new device.
        bus_type: BusType,
        /// Bus-specific identity (PCI BDF, USB port path, etc.).
        bus_identity: BusIdentity,
        /// Initial properties reported by the bus driver.
        properties: PropertyTable,
    },
    /// Device removal: a device has departed (orderly or surprise).
    DeviceRemoval {
        /// Handle of the removed device.
        device: DeviceHandle,
        /// Whether the removal was orderly (driver notified before hardware
        /// disappeared) or surprise (hardware gone, driver finds out after).
        surprise: bool,
    },
}

Device probe outside the lock: when the match engine selects a driver for a device, the registry clones the Arc<DeviceNode> reference while holding the read lock, then drops the lock before calling the driver's probe function. This prevents a slow or failing probe from blocking device lookups or other probe attempts. The probe result is applied under a brief write lock after the probe returns.
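
The clone-then-drop-the-lock pattern can be sketched with std types (a simplified stand-in registry; `Node` and `probe_one` are illustrative names, not the kernel API):

```rust
use std::sync::{Arc, RwLock};

struct Node {
    name: String,
}

/// Probe a device without holding the registry lock across the probe.
fn probe_one(
    registry: &RwLock<Vec<Arc<Node>>>,
    idx: usize,
    probe: fn(&Node) -> Result<(), i32>,
) -> Result<(), i32> {
    // Clone the Arc under a brief read lock...
    let node = {
        let guard = registry.read().unwrap();
        Arc::clone(&guard[idx])
    }; // ...and release it before the (possibly slow) probe runs.
    probe(&node)
    // The caller applies the result under a brief write lock (not shown).
}
```

Because only an `Arc` clone escapes the critical section, a hung probe cannot block concurrent lookups, and the node cannot be freed out from under the probe.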

The registry lives entirely within UmkaOS Core. It is never exposed as a data structure to drivers.

11.4.3.7 DeviceResources

Each device node tracks its allocated hardware resources. This is the kernel-internal counterpart of what Linux spreads across struct resource, struct pci_dev fields, and struct msi_desc lists. DeviceResources provides hardware resource handles only (BAR mappings, IRQ vectors, DMA constraints). Capabilities to actually access these resources are granted separately via the DeviceCapGrant bundle (Section 11.4); a driver must hold the appropriate CapHandle before accessing any resource described here.

/// Hardware resources allocated to a device. Kernel-internal, NOT part of KABI.
pub struct DeviceResources {
    /// PCI Base Address Register mappings (up to 6 BARs per PCI function).
    pub bars: [Option<BarMapping>; 6],

    /// Interrupt allocations (legacy, MSI, or MSI-X vectors).
    /// Bounded by MSI-X table size (max 2048 per PCIe 3.0+).
    /// Populated during device enumeration (cold path).
    pub irqs: Vec<IrqAllocation>,

    /// Number of pages currently pinned for DMA by this device.
    /// Page reclaim (Section 4.10) checks this count before attempting to compress
    /// or swap a page — DMA-pinned pages are never eligible.
    pub dma_pin_count: AtomicU32,

    /// Maximum DMA-pinnable pages for this device (enforced by cgroup and
    /// per-device limits). 0 = unlimited.
    pub dma_pin_limit: u32,

    /// MMIO regions mapped for this device (non-BAR, e.g., firmware tables).
    pub mmio_regions: Vec<MmioRegion>,

    /// Legacy I/O port ranges (x86 only, rare in modern hardware).
    pub io_ports: Vec<IoPortRange>,

    /// s390x Channel I/O resources. `None` on non-s390x platforms.
    /// See [Section 11.10](#channel-io-subsystem) for the full Channel I/O subsystem spec.
    pub channel_io: Option<ChannelIoResources>,

    /// DMA address mask — how many bits of physical address the device can
    /// generate. Determines bounce buffer requirements.
    pub dma_mask: u64,             // e.g., 0xFFFFFFFF for 32-bit DMA
    pub coherent_dma_mask: u64,    // For coherent (non-streaming) DMA

    /// Hardware command queue depth (entries per queue). 0 = not applicable
    /// (e.g., legacy devices without command queuing).
    /// For NVMe: from CAP.MQES + 1. For AHCI: from CAP.NCS + 1.
    pub queue_depth: u32,

    /// Number of hardware I/O queues. 0 = not applicable.
    /// For NVMe: from IDENTIFY Controller. For AHCI: number of ports.
    pub num_queues: u16,
}

pub struct BarMapping {
    pub bar_index: u8,
    pub phys_addr: u64,
    pub size: u64,
    pub flags: BarFlags,
    /// Kernel virtual address if mapped. None = not yet mapped (lazy).
    pub mapped_vaddr: Option<u64>,
}

bitflags::bitflags! {
    #[repr(transparent)]
    pub struct BarFlags: u32 {
        const MEMORY_64     = 1 << 0;  // 64-bit MMIO (vs 32-bit)
        const IO_PORT       = 1 << 1;  // I/O port space (legacy x86)
        const PREFETCHABLE  = 1 << 2;  // Can be mapped write-combining
    }
}

pub struct IrqAllocation {
    pub irq_type: IrqType,
    pub vector: u32,          // Global IRQ vector number
    pub cpu_affinity: Option<u32>,  // Preferred CPU for this interrupt
}

#[repr(u32)]
pub enum IrqType {
    LegacyPin = 0,   // INTx (shared, level-triggered)
    Msi       = 1,   // Message Signaled Interrupt (single vector)
    MsiX      = 2,   // MSI-X (independent vectors, per-queue)
}

pub struct MmioRegion {
    pub phys_addr: u64,
    pub size: u64,
    pub cacheable: bool,
}

pub struct IoPortRange {
    pub base: u16,
    pub size: u16,
}

/// s390x Channel I/O resources for a subchannel device.
/// Populated by the Channel I/O enumerator; see [Section 11.10](#channel-io-subsystem).
pub struct ChannelIoResources {
    /// Subchannel Information Block — cached copy from last STSCH.
    pub schib: Schib,
    /// Channel paths available to this subchannel (up to 8, bitmap in PMCW.PIM).
    pub path_mask: u8,
    /// Whether this device supports QDIO (Queued Direct I/O).
    pub qdio_capable: bool,
    /// QDIO input queues (populated after QDIO activation; empty before).
    pub qdio_input_queues: Vec<QdioQueueDesc>,
    /// QDIO output queues.
    pub qdio_output_queues: Vec<QdioQueueDesc>,
}

/// Descriptor for a QDIO queue (input or output). The actual SBAL ring
/// memory is allocated separately; this struct holds the kernel-side metadata.
pub struct QdioQueueDesc {
    /// Queue index (0-based within input or output set).
    pub index: u8,
    /// Number of SBALs in this queue's ring (typically 128).
    pub sbal_count: u16,
    /// Physical address of the SBAL array (page-aligned).
    pub sbal_phys: u64,
}

DMA pin counting is a critical safety mechanism:

  • The dma_pin_count is incremented AFTER successful IOMMU programming, not before. This prevents phantom pins on failed mappings. The dma_map_*() sequence is:
    1. Validate address, alignment, size, direction.
    2. Check dma_pin_count.load(Relaxed) < dma_pin_limit.
    3. Allocate IOVA from IovaBitmapAllocator.
    4. Program IOMMU page table entry.
    5. Flush IOTLB (arch-specific).
    6. ON SUCCESS: dma_pin_count.fetch_add(1, Relaxed).
    7. ON FAILURE at any step: undo all prior steps, NO pin increment.
  • Every dma_unmap_*() call decrements it.
  • The page reclaim path (Section 4.12) checks whether a page's owning device has active DMA pins before attempting compression or swap-out. Pages with active DMA mappings are unconditionally skipped — moving a page while a device is DMAing to it would cause silent data corruption.
  • On driver crash recovery (Section 11.6), all DMA mappings for the crashed driver are forcibly invalidated (IOMMU entries torn down), and dma_pin_count is reset to zero. This is safe because the device has been reset.
  • The dma_pin_limit provides defense-in-depth: a buggy or malicious driver cannot pin all of physical memory for DMA. The limit is enforced by the kernel, not the driver.
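
The increment-last discipline can be sketched as follows — a minimal illustration, not the real UmkaOS code: `DmaDev`, `DmaErr`, and the `iommu_ok` flag that stands in for the fallible IOMMU-programming step are all hypothetical names.

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

/// Illustrative device state — not the real UmkaOS types.
struct DmaDev {
    dma_pin_count: AtomicU32,
    dma_pin_limit: u32, // 0 = unlimited
}

#[derive(Debug, PartialEq)]
enum DmaErr {
    PinLimit,
    IommuFault,
}

/// Sketch of the dma_map_*() ordering rule: the pin count is touched only
/// after every fallible step has succeeded.
fn dma_map_page(dev: &DmaDev, iommu_ok: bool) -> Result<u64, DmaErr> {
    // Steps 1-2: validate, then check the pin limit (0 = unlimited).
    if dev.dma_pin_limit != 0 && dev.dma_pin_count.load(Relaxed) >= dev.dma_pin_limit {
        return Err(DmaErr::PinLimit);
    }
    // Step 3: allocate an IOVA (stubbed to a constant here).
    let iova = 0x1000;
    // Steps 4-5: program the IOMMU entry and flush the IOTLB.
    // Failure is simulated by the `iommu_ok` flag.
    if !iommu_ok {
        // Undo step 3 (free the IOVA) — and crucially, NO pin increment.
        return Err(DmaErr::IommuFault);
    }
    // Step 6: ON SUCCESS only, record the pin.
    dev.dma_pin_count.fetch_add(1, Relaxed);
    Ok(iova)
}

fn dma_unmap_page(dev: &DmaDev) {
    dev.dma_pin_count.fetch_sub(1, Relaxed);
}
```

A failed mapping leaves the count unchanged, so reclaim never sees a phantom pin.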

Resource lifecycle:

Resources are allocated during device discovery (BARs, IRQs) and driver initialization (DMA mappings, additional MMIO). On device removal or driver crash, all resources are reclaimed by the registry in reverse order: DMA mappings first (IOMMU teardown), then IRQs (free vectors), then BAR unmappings, then MMIO unmappings.

11.4.3.8 Device Capability Grant Bundle

When the kernel calls a driver's tier entry point (entry_direct, entry_ring, or entry_ipc from the KabiDriverManifest), it delivers a DeviceCapGrant alongside the KernelServicesVTable. The DeviceCapGrant defines the initial set of capabilities that the driver domain receives. This is the single point where system policy decisions (which SystemCaps a driver may hold, which BAR regions it may access, how many IRQ vectors it may register) are applied.

/// Initial capability bundle delivered to a driver during device_init().
/// Kernel-internal — NOT part of KABI. The driver receives the effects of
/// this grant (CapHandle tokens, SystemCaps in the domain) but never sees
/// the DeviceCapGrant struct itself.
///
/// Constructed by the bus-specific match engine (PCI, platform, USB, etc.)
/// based on the device's DeviceResources and the system security policy.
/// Immutable after construction — the kernel retains a read-only copy for
/// audit logging and crash-recovery re-grant.
pub struct DeviceCapGrant {
    /// Target device handle. Links this grant to a specific DeviceNode.
    pub device: DeviceHandle,

    /// Isolation tier assigned to this driver instance.
    /// Determines the delivery transport for capability handles
    /// (Tier 1: KABI descriptor table, Tier 2: sealed IPC handles).
    pub tier: IsolationTier,

    /// SystemCaps granted to the driver domain for the KABI dispatch
    /// dual-check protocol ([Section 12.3](12-kabi.md#kabi-bilateral-capability-exchange--kabi-operation-permission-requirements)).
    /// Copied into `DriverDomain::granted_syscaps` at domain creation.
    ///
    /// Standard grants by device class:
    /// - Block/NVMe driver: `CAP_DMA | CAP_IRQ`
    /// - Network driver:    `CAP_DMA | CAP_IRQ | CAP_NET_ADMIN`
    /// - GPU driver:        `CAP_DMA | CAP_IRQ | CAP_DMA_IDENTITY` (if passthrough)
    /// - USB HCD:           `CAP_DMA | CAP_IRQ | CAP_SYS_RAWIO`
    /// - Platform sensor:   `CAP_IRQ` (no DMA needed)
    ///
    /// The system security policy may further restrict these grants
    /// (e.g., deny CAP_DMA_IDENTITY to all Tier 2 drivers).
    pub syscaps: SystemCaps,

    /// BAR access capabilities. One CapHandle per BAR that the driver
    /// is permitted to map. The CapHandle carries PermissionBits (READ,
    /// WRITE, or both) matching the BAR's intended use. BARs not listed
    /// here are inaccessible to the driver — the IOMMU will fault any
    /// attempt to access them.
    pub bar_caps: ArrayVec<BarCapGrant, 6>,

    /// IRQ vector registration limit. The driver may register at most
    /// this many interrupt handlers via `register_interrupt()`. Prevents
    /// a buggy or malicious driver from exhausting the system's IRQ
    /// vector space. Set from the device's MSI-X table size (for MSI-X
    /// devices) or 1 (for legacy/MSI devices).
    pub max_irq_vectors: u32,

    /// DMA capability parameters. Constrains the driver's DMA allocations.
    pub dma_grant: Option<DmaCapGrant>,

    /// PermissionBits on the device capability itself. Controls what
    /// operations the driver may perform on the device object. Typically
    /// `READ | WRITE` for data-plane drivers, `READ | WRITE | ADMIN` for
    /// drivers that may reconfigure the device.
    pub device_permissions: PermissionBits,

    /// Maximum isolation tier to which this driver may delegate its
    /// capabilities. `None` means unrestricted (driver may delegate to
    /// any tier). `Some(IsolationTier::Tier1)` prevents the driver from
    /// delegating capabilities to Tier 2 (Ring 3) components.
    ///
    /// This mirrors `CapConstraints::max_delegation_tier`
    /// ([Section 9.1](09-security.md#capability-based-foundation--cap-delegation-protocol)) but
    /// is set at the grant level by system policy, not per-capability.
    /// When the driver calls `cap_delegate()`, the effective
    /// `max_delegation_tier` is `min(cap.constraints.max_delegation_tier,
    /// grant.max_delegation_tier)`. This ensures system-wide policy
    /// (e.g., "Tier 2 drivers may not re-delegate to other Tier 2 domains")
    /// cannot be overridden by per-capability constraints.
    pub max_delegation_tier: Option<IsolationTier>,
}

/// Per-BAR capability grant. Associates a BAR index with the CapHandle
/// and permissions the driver receives for that BAR.
pub struct BarCapGrant {
    /// BAR index (0-5 for PCI, 0 for platform devices).
    pub bar_index: u8,
    /// CapHandle delivered to the driver for this BAR.
    pub cap_handle: CapHandle,
    /// Permissions on this BAR (READ, WRITE, or both).
    pub permissions: PermissionBits,
}

/// DMA capability constraints delivered as part of the DeviceCapGrant.
pub struct DmaCapGrant {
    /// Maximum number of pages the driver may pin for DMA simultaneously.
    /// Enforced by the kernel via `DeviceResources::dma_pin_limit`.
    pub max_pinned_pages: u32,
    /// DMA address mask (how many bits of bus address the device supports).
    /// Determines whether bounce buffers are needed for high memory.
    pub dma_mask: u64,
    /// Coherent DMA address mask. Limits the address range for coherent
    /// (non-streaming) DMA allocations. If narrower than `dma_mask`, the
    /// kernel uses a separate low-address pool for coherent buffers.
    pub coherent_dma_mask: u64,
    /// Whether identity-mapped DMA is permitted. Only `true` if the
    /// driver holds `CAP_DMA_IDENTITY` (set in `syscaps` above).
    pub identity_map_allowed: bool,
}

Grant construction sequence (performed by driver_bind() in the registry):

  1. The bus match engine selects a driver for the device.
  2. The registry reads the device's DeviceResources (BARs, IRQs, DMA constraints).
  3. The system security policy is consulted to determine which SystemCaps the driver class may receive. Policy is stored as a static table keyed by (bus_type, device_class) pairs, loaded at boot from the kernel command line or a signed policy blob.
  4. A DeviceCapGrant is constructed with the intersection of hardware capabilities and policy-permitted capabilities.
  5. The grant is passed to delegate_to_tier() (Section 9.1) which creates the individual CapEntry objects and delivers handles via the tier-appropriate transport. Precondition: debug_assert!(cap_table_initialized()) — the global CapTable must be initialized (Phase 2.2, see Section 2.3) before any device probing (Phase 4.4a+).
  6. The DeviceCapGrant.syscaps value is stored into DriverDomain::granted_syscaps (an RcuCell<SystemCaps>), where the KABI dispatch trampoline reads it via RCU on every call. Live policy updates can swap the bitmap without crash/reload (see Section 12.3).
  7. The kernel retains a read-only copy of the DeviceCapGrant in the DeviceNode for:
     • Audit logging: every capability grant is logged to the FMA telemetry ring.
     • Crash recovery re-grant: when a driver is reloaded after a crash, the kernel re-constructs an identical (or policy-updated) grant for the new instance.
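
Step 4 above — intersecting hardware capabilities with policy-permitted capabilities — reduces to a bitwise AND over the capability bitmaps. A minimal sketch with illustrative names (`SysCaps` and the CAP_* constants stand in for the real SystemCaps type):

```rust
/// Illustrative stand-in for the kernel's SystemCaps bitmap.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SysCaps(u32);

const CAP_DMA: u32 = 1 << 0;
const CAP_IRQ: u32 = 1 << 1;
const CAP_DMA_IDENTITY: u32 = 1 << 2;

/// Step 4 of the grant sequence: policy can only narrow the class
/// default, never widen it.
fn effective_grant(class_default: SysCaps, policy_allowed: SysCaps) -> SysCaps {
    SysCaps(class_default.0 & policy_allowed.0)
}
```

For example, if the GPU class default is `CAP_DMA | CAP_IRQ | CAP_DMA_IDENTITY` but the administrator's policy denies identity-mapped DMA, the driver receives only `CAP_DMA | CAP_IRQ`.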

Re-grant on driver reload:

When a driver crashes and is reloaded (Section 11.9), all capabilities from the previous grant are revoked (generation increment). The kernel then constructs a fresh DeviceCapGrant for the new driver instance. The re-grant may differ from the original if system policy has changed (e.g., an administrator revoked CAP_DMA_IDENTITY for this device class after the crash). The new grant is delivered through the same sequence as the initial grant. The driver receives new CapHandle tokens — it cannot reuse handles from the previous instance.

Native Tier 0 drivers: Native Tier 0 drivers (in-kernel, statically linked or Tier 0 loadable modules) implicitly hold all SystemCaps relevant to their function. They do not receive a DeviceCapGrant via the KABI path — their capabilities are part of the kernel's own capability set. The DeviceCapGrant mechanism is specific to Tier 1 and Tier 2 drivers that operate in isolated domains.

Promoted Tier 0 drivers (Tier 1 drivers running as Tier 0 on architectures without fast isolation — RISC-V, s390x, LoongArch64): These drivers do receive a DeviceCapGrant through the normal KABI path, identical to what they would receive as Tier 1. The promotion changes only the transport and isolation boundary, not the capability scope. See Section 11.3 (capability lifecycle during tier transitions) for the invariant promoted_tier0_caps <= tier1_caps.

11.4.4 Device Matching

11.4.4.1 Match Rules

Drivers declare what hardware they support through match rules embedded in the driver binary. Match rules are stored in a dedicated ELF section (.kabi_match) and read by the kernel loader before init() is called.

/// A single match rule. Drivers can declare multiple rules — any match
/// triggers binding.
#[repr(C)]
pub struct MatchRule {
    pub rule_size: u32,         // Forward compat
    pub match_type: MatchType,
    pub data: MatchData,        // 128-byte union, interpreted per match_type
}
// MatchRule: rule_size(u32=4) + match_type(MatchType repr(u32)=4) + data(MatchData=128) = 136 bytes.
const_assert!(size_of::<MatchRule>() == 136);

#[repr(u32)]
pub enum MatchType {
    PciId       = 0,    // Match by PCI vendor/device ID (with wildcards)
    PciClass    = 1,    // Match by PCI class code (with mask)
    UsbId       = 2,    // Match by USB vendor/product ID
    UsbClass    = 3,    // Match by USB class/subclass/protocol
    VirtIoType  = 4,    // Match by VirtIO device type
    Compatible  = 5,    // Match by "compatible" string (DT/ACPI)
    Property    = 6,    // Match by arbitrary property key/value
}

/// Match data union — interpreted per MatchType variant.
/// 128 bytes to accommodate the largest variant (Compatible: 128-byte string).
///
/// **Requirement**: `kabi-gen` MUST zero-initialize the full 128-byte union
/// at generation time. Smaller variants leave tail bytes uninitialized in the
/// ELF `.kabi_match` section; zeroing prevents information leaks from
/// toolchain artifacts in stripped production binaries.
#[repr(C)]
pub union MatchData {
    pub pci_id: PciMatchData,        // MatchType::PciId or PciClass
    pub usb_id: UsbMatchData,        // MatchType::UsbId
    pub usb_class: UsbClassMatch,    // MatchType::UsbClass
    pub virtio: VirtIoMatchData,     // MatchType::VirtIoType
    pub compatible: [u8; 128],       // MatchType::Compatible (NUL-terminated)
    pub property: PropertyMatch,     // MatchType::Property
    pub _raw: [u8; 128],             // Pad to 128 bytes
}
// MatchData: largest variant = compatible/[u8;128]/_raw = 128 bytes.
const_assert!(size_of::<MatchData>() == 128);

#[repr(C)]
pub struct UsbMatchData {
    pub vendor_id: u16,      // 0xFFFF = wildcard
    pub product_id: u16,     // 0xFFFF = wildcard
}
const_assert!(size_of::<UsbMatchData>() == 4);

#[repr(C)]
pub struct UsbClassMatch {
    pub class: u8,           // USB class code
    pub subclass: u8,        // 0xFF = wildcard
    pub protocol: u8,        // 0xFF = wildcard
}
const_assert!(size_of::<UsbClassMatch>() == 3);

#[repr(C)]
pub struct VirtIoMatchData {
    pub device_type: u32,    // VirtIO device type ID
}
const_assert!(size_of::<VirtIoMatchData>() == 4);

#[repr(C)]
pub struct PropertyMatch {
    pub key: [u8; 64],       // Property key (NUL-terminated)
    pub value: [u8; 64],     // Property value (NUL-terminated)
}
const_assert!(size_of::<PropertyMatch>() == 128);

PCI match data, used by MatchType::PciId and MatchType::PciClass:

#[repr(C)]
pub struct PciMatchData {
    pub vendor_id: u16,         // 0xFFFF = wildcard
    pub device_id: u16,         // 0xFFFF = wildcard
    pub subsystem_vendor: u16,  // 0xFFFF = wildcard
    pub subsystem_device: u16,  // 0xFFFF = wildcard
    pub class_code: u32,        // Class code value
    pub class_mask: u32,        // Bits to compare (0 = ignore class)
}
// PciMatchData: 4×u16(8) + 2×u32(8) = 16 bytes.
const_assert!(size_of::<PciMatchData>() == 16);

The match table header at the start of the ELF .kabi_match section:

#[repr(C)]
pub struct MatchTableHeader {
    pub magic: u32,             // 0x4D415443 ("MATC")
    pub header_size: u32,
    pub rule_count: u32,
    pub rule_size: u32,         // sizeof(MatchRule)
    // Followed by `rule_count` MatchRule structs
}
// MatchTableHeader: 4 × u32 = 16 bytes.
const_assert!(size_of::<MatchTableHeader>() == 16);
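
A loader-side sketch of walking the `.kabi_match` section, assuming little-endian layout. Note that a forward-compatible reader steps by the header's `rule_size` field rather than its own compiled `size_of::<MatchRule>()`. The function name and return shape are illustrative, not the real loader API:

```rust
/// Validate the MatchTableHeader and return the byte offset of each
/// MatchRule; a real loader would then decode each rule per its MatchType.
fn walk_match_table(section: &[u8]) -> Result<Vec<usize>, &'static str> {
    let u32_at = |off: usize| -> u32 {
        u32::from_le_bytes(section[off..off + 4].try_into().unwrap())
    };
    // Header is 16 bytes; magic 0x4D415443 ("MATC").
    if section.len() < 16 || u32_at(0) != 0x4D41_5443 {
        return Err("bad magic");
    }
    let header_size = u32_at(4) as usize;
    let rule_count = u32_at(8) as usize;
    let rule_size = u32_at(12) as usize; // trust the header, not sizeof()
    if section.len() < header_size + rule_count * rule_size {
        return Err("truncated table");
    }
    Ok((0..rule_count).map(|i| header_size + i * rule_size).collect())
}
```

Stepping by the recorded `rule_size` lets an old kernel skip trailing fields appended to MatchRule by a newer toolchain.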

11.4.4.2 Match Engine

The kernel runs a simple priority-ordered match algorithm:

For each DeviceNode in Discovered state:
  1. Collect the node's properties and bus identity
  2. For each registered driver (sorted by priority):
     a. For each MatchRule in that driver's match table:
        - Evaluate the rule against the node's properties
        - If match: record (driver, node, specificity) as a candidate
  3. Select the candidate with highest specificity
  4. If found: begin driver loading for this node
  5. If no match: node stays in Discovered state (deferred probe)

Match specificity ranking (highest first):

| Rank | Match Type | Score | Example |
|------|------------|-------|---------|
| 1 | Exact vendor + device + subsystem | 100 | This exact card from this exact OEM |
| 2 | Exact vendor + device ID | 80 | Any board with this chip |
| 3 | Full class code match | 60 | Any NVMe controller (class 01:08:02) |
| 4 | Partial class code (masked) | 40 | Any mass storage controller (class 01:xx:xx) |
| 5 | Compatible string (position-weighted) | 20+ | DT/ACPI compatible, first entry scores higher |
| 6 | Generic property match | 10 | Fallback / catch-all |

Combination rule: When a single driver has multiple match rules and more than one matches a device, the driver's effective specificity is the maximum of all matching rule scores (not a sum). This ensures an exact vendor/device ID match (score 100) always dominates a class-code match (score 60) from the same driver, reflecting "most specific match wins" semantics.

When two drivers match with equal specificity, the driver with higher match_priority (declared in its manifest) wins. If still tied, first-registered wins.

Algorithmic complexity: the match engine performs a linear scan of registered MatchRule entries, bounded by the total number of registered drivers (typically <200 in a full desktop system, <50 on embedded). For each device, the engine evaluates at most N_drivers * max_rules_per_driver match rules. With <200 drivers and <=8 rules per driver, this is <1600 comparisons — well under 1 microsecond on any modern CPU. No hash table or trie is needed because the match runs once per device enumeration event (not per I/O operation). During boot enumeration, all devices are matched in a single pass; hotplug events match one device at a time.
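
The selection, combination, and tie-break rules above can be sketched as follows (illustrative `Candidate` type; the real engine evaluates MatchRule tables against DeviceNode properties):

```rust
/// One driver that matched the device, with the scores of all its
/// matching rules. Illustrative — not the real registry types.
struct Candidate {
    name: &'static str,
    rule_scores: Vec<u32>, // one score per matching rule
    match_priority: u32,   // from the driver manifest
}

/// "Most specific match wins": per driver, effective specificity is the
/// MAX of its matching rule scores (not a sum); ties break on
/// match_priority, then on registration order (earliest wins).
fn select(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .enumerate()
        .filter_map(|(reg_order, c)| {
            c.rule_scores
                .iter()
                .max() // combination rule: max, not sum
                .map(|&s| (s, c.match_priority, reg_order, c))
        })
        // Reverse(reg_order) makes the earliest-registered driver win
        // the final tie-break.
        .max_by_key(|&(s, p, reg, _)| (s, p, std::cmp::Reverse(reg)))
        .map(|(_, _, _, c)| c)
}
```

With this rule, a driver whose exact vendor/device rule scores 100 dominates another driver's class-code match at 60, even if the second driver also has several lower-scoring rules.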

11.4.4.3 Deferred Matching

Some devices cannot be matched immediately — their driver may not yet be loaded (e.g., initramfs not yet mounted, or driver installed later by package manager).

  • Devices with no match stay in Discovered state indefinitely.
  • When a new driver is registered (loaded from initramfs, installed at runtime), all Discovered devices are re-evaluated against the new match rules.
  • A KABI method registry_rescan() triggers manual re-evaluation.

This is analogous to Linux's deferred probe mechanism, but simpler because the matching is centralized rather than spread across per-bus probe functions.

11.4.4.4 Deferred Probe for Resource Dependencies

A matched driver may fail to initialize because a resource it depends on (clock, regulator, GPIO, PHY, or another driver's service) is not yet available. Rather than failing permanently, the driver's init() callback returns ProbeResult::Deferred to signal that the probe should be retried later.

/// Result of a driver's `init()` callback. Replaces the former binary
/// `Result<(), KernelError>` return, adding a deferral variant.
pub enum ProbeResult {
    /// Driver initialized successfully. Transition to Active state.
    Ok,
    /// A required resource is not yet available (clock provider not
    /// registered, regulator not probed, PHY not bound). The registry
    /// re-queues this device for a later probe attempt.
    Deferred {
        /// Human-readable reason for deferral (e.g., "clock 'bus' not found").
        /// Logged via FMA for boot-time debugging. Truncated to 64 bytes.
        reason: ArrayString<64>,
    },
    /// Permanent failure. Transition to Error state.
    Err(KernelError),
}

Deferred probe queue: The device registry maintains a deferred probe list (deferred_list: SpinLock<IntrusiveList<DeviceNode>>) containing devices whose last probe returned ProbeResult::Deferred. The queue is processed as follows:

  1. After each successful probe: When any driver's init() returns ProbeResult::Ok, the registry triggers a deferred probe pass — all devices on deferred_list are re-probed in FIFO order. A newly available resource (clock, regulator, PHY service) may unblock one or more deferred devices.

  2. After bus enumeration completes: When all bus enumerators (PCI, USB, platform, DT) finish their initial scan (end of boot Phase 4.4a), the registry performs a final deferred probe sweep. Devices still deferred after this sweep are logged with a warning via FMA.

  3. Timeout: A device may remain in Deferred state for at most 30 seconds (wall clock time from first deferral). If the 30-second deadline expires and the device is still deferred, it transitions to Error state with KernelError::ProbeTimeout. The timeout prevents a misconfigured device tree from stalling boot indefinitely. The timeout is configurable via kernel parameter umka.deferred_probe_timeout_ms (default: 30000).

  4. Maximum retries: Each device is re-probed at most 10 times. If init() returns Deferred on the 10th attempt, the device transitions to Error. This prevents an oscillating probe from consuming unbounded CPU.

  5. Aggregate boot timeout: The total boot-phase deferred probe window is bounded at 120 seconds (configurable via umka.deferred_probe_total_timeout_ms, default 120000). If the aggregate deferred probe phase exceeds this limit, all remaining deferred devices are transitioned to Error and boot proceeds. This prevents pathological configurations (many devices all deferring to 30s) from stalling boot.
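
The per-device retry and timeout accounting from rules 3 and 4 can be sketched as follows — illustrative types only; the real registry tracks wall-clock deadlines on the intrusive deferred list and raises KernelError::ProbeTimeout:

```rust
const MAX_RETRIES: u32 = 10;      // rule 4
const TIMEOUT_MS: u64 = 30_000;   // rule 3: umka.deferred_probe_timeout_ms default

#[derive(Debug, PartialEq)]
enum DeviceState {
    Discovered, // still queued for another probe pass
    Error,      // KernelError::ProbeTimeout in the real path
}

struct DeferredDevice {
    retries: u32,          // deferrals so far
    first_deferral_ms: u64, // wall-clock time of the first deferral
}

/// Called when a probe attempt returns ProbeResult::Deferred.
fn on_probe_deferred(d: &mut DeferredDevice, now_ms: u64) -> DeviceState {
    d.retries += 1;
    if d.retries >= MAX_RETRIES || now_ms - d.first_deferral_ms >= TIMEOUT_MS {
        DeviceState::Error
    } else {
        DeviceState::Discovered
    }
}
```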

Clock dependency integration: The clock framework (Section 2.24) freezes the clock tree at the end of Phase 4.4b. Clock-provider drivers (platform clock controllers, PLL drivers) register their clocks during deferred probe retry in Phase 4.4b, before the tree is frozen. Phase 5.x device probe references clocks as consumers (via clk_get()), not as providers. If a consumer calls clk_get() for a clock not yet registered, it receives Err(EPROBE_DEFER) and the driver's init() translates this to ProbeResult::Deferred { reason: "clock 'ref' not found" }. When the platform clock driver probes successfully and registers its clocks (still within Phase 4.4b), the deferred probe pass re-tries the waiting driver, which now finds the clock via clk_get(). After the freeze, any attempt to register a new clock provider returns Err(EBUSY) — all providers must be registered before Phase 5.x.
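
The consumer-side translation described above might look like this inside a driver's init(); all types here are simplified stand-ins for the real KABI and clock-framework definitions (e.g., the real ProbeResult carries an ArrayString<64> reason):

```rust
#[derive(Debug, PartialEq)]
enum ClkErr {
    EprobeDefer, // clock not registered yet — retry later
    NotFound,    // permanent lookup failure
}

#[derive(Debug, PartialEq)]
enum ProbeResult {
    Ok,
    Deferred(String), // reason string, logged via FMA
    Err(&'static str),
}

/// Stubbed clk_get(): succeeds only if the named clock is registered.
fn clk_get(registered: &[&str], name: &str) -> Result<(), ClkErr> {
    if registered.contains(&name) {
        Ok(())
    } else {
        Err(ClkErr::EprobeDefer)
    }
}

/// Driver init(): translate EPROBE_DEFER into ProbeResult::Deferred so the
/// registry re-queues the device for the next deferred probe pass.
fn driver_init(registered_clocks: &[&str]) -> ProbeResult {
    match clk_get(registered_clocks, "ref") {
        Ok(()) => ProbeResult::Ok,
        Err(ClkErr::EprobeDefer) => {
            ProbeResult::Deferred("clock 'ref' not found".to_string())
        }
        Err(_) => ProbeResult::Err("clock lookup failed"),
    }
}
```

Once the clock provider probes and registers "ref", the deferred probe pass re-runs init(), which now succeeds.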

11.4.4.5 DriverManifest Extensions

The DriverManifest (defined in umka-driver-sdk/src/capability.rs) gains match-related fields (appended per ABI rules):

// Appended to DriverManifest
pub match_rule_count: u32,      // Number of match rules in .kabi_match section
pub is_bus_driver: u32,         // 1 = this driver discovers child devices
pub match_priority: u32,        // Higher = preferred when specificity ties
pub _pad: u32,

11.4.4.6 Module Loader Queue

When the match engine selects a driver for a device (step 4 in Section 11.4), it submits a DriverLoadRequest to the module loader work queue. The module loader runs as a set of kernel worker threads and serializes concurrent loading, signature verification, and domain allocation.

LoadReason is the shared type defined in Section 12.7 (12-kabi.md). Variants used by the device driver loader: HotPlug (device enumeration), Boot (initramfs/cmdline), Dependency, CrashRecovery, UserRequest.

/// A device-driver-specific load request. More detailed than the KABI-level
/// `ModuleLoadRequest` (Section 12.1.9.6): includes the trigger device, result
/// type `DriverHandle`, priority override, and timeout. Uses the shared
/// `LoadReason` enum from Section 12.1.9.6.
pub struct DriverLoadRequest {
    /// Absolute path to the `.kabi` manifest file in the umkafs namespace,
    /// e.g., `/ukfs/kernel/drivers/nvme/nvme.kabi`.
    /// Bounded by PATH_MAX (4096); typically ~80 bytes. Warm-path allocation
    /// (device hot-plug) — acceptable per collection policy.
    pub manifest_path: Box<str>,
    /// Reason for this load request (determines scheduling priority).
    pub reason: LoadReason,
    /// Device that triggered this load when `reason == LoadReason::HotPlug`.
    /// `None` for dependency loads, user requests, and boot-time loads.
    pub trigger_device: Option<DeviceHandle>,
    /// Completion channel: the loader sends `Ok(DriverHandle)` on success
    /// or `Err(KernelError)` on failure (bad signature, manifest error,
    /// domain allocation failure, driver `init()` returning an error, etc.).
    pub result_tx: oneshot::Sender<Result<DriverHandle, KernelError>>,
    /// Priority override. `0` = derive from `LoadReason` (default).
    /// `1`–`255` = explicit override (higher = higher priority).
    pub priority_override: u8,
    /// Load timeout in milliseconds. `0` = system default (30 000 ms).
    /// The loader cancels and returns `Err(KernelError::Timeout)` if the
    /// driver does not complete `init()` within this window.
    pub timeout_ms: u32,
}

/// Module loader subsystem. Uses the named `mod-loader` workqueue
/// (4 threads, depth 256, SCHED_OTHER — registered in the standard
/// workqueue table, [Section 3.11](03-concurrency.md#workqueue-deferred-work)).
///
/// **Why a named workqueue instead of a private SpinLock<BinaryHeap>:**
/// UmkaOS requires all deferred work to flow through named workqueues
/// ([Section 3.11](03-concurrency.md#workqueue-deferred-work)) — anonymous work submission is not
/// permitted. The `mod-loader` workqueue provides bounded depth (256),
/// named threads (`umkad-mod-loader-N` visible in `ps`), and
/// backpressure via `WouldBlock` when the queue is full.
///
/// **Priority ordering**: The `mod-loader` workqueue dispatches items in
/// FIFO order (standard `BoundedMpmcRing` semantics). Priority is
/// implemented by the `ModuleLoaderQueue` wrapper, which maintains a
/// `SpinLock<BinaryHeap<PrioritizedLoadRequest>>` as a **staging buffer**.
/// When a load request is submitted, it enters the staging heap. A
/// dedicated drain loop (running on each `mod-loader` worker thread)
/// pops the highest-priority request from the heap and executes it.
/// This preserves priority ordering while routing all execution through
/// the named workqueue infrastructure.
///
/// Bounded capacity prevents memory exhaustion from a flood of hotplug events
/// (e.g., enumerating a USB hub with 127 devices simultaneously).
pub struct ModuleLoaderQueue {
    /// Named workqueue handle. The backing worker thread pool is created
    /// at Phase 2.7 (general workqueue init, see
    /// [Section 3.11](03-concurrency.md#workqueue-deferred-work)). The `ModuleLoaderQueue` struct
    /// itself is created at Phase 4.3 (`kabi_runtime_init`) and wraps
    /// the workqueue with priority staging. The queue is activated at
    /// Phase 4.4a (bus enumeration begins, device probing starts
    /// submitting load requests).
    /// Worker threads appear as `umkad-mod-loader-0` .. `umkad-mod-loader-3`.
    workqueue: Arc<WorkQueue>,
    /// Priority staging buffer. Incoming requests are inserted here;
    /// worker threads pop the highest-priority request on each iteration.
    /// The heap capacity is bounded by `workqueue.queue_depth` (256):
    /// `submit_load` returns `WouldBlock` when staging + in-flight
    /// requests reach this limit.
    staging: SpinLock<BinaryHeap<PrioritizedLoadRequest>>,
    /// Limits the number of concurrently executing module loads.
    /// Default: 4 concurrent loads (one per loader worker thread).
    concurrency: Semaphore,
    /// Total requests enqueued since boot.
    pub total_enqueued: AtomicU64,
    /// Total requests that completed successfully.
    pub total_loaded: AtomicU64,
    /// Total requests that failed (signature rejection, manifest error,
    /// driver init failure, timeout, or domain allocation failure).
    pub total_failed: AtomicU64,
}

/// Internal wrapper that adds an effective priority to a `DriverLoadRequest`
/// for ordering in the priority staging `BinaryHeap` inside `ModuleLoaderQueue`.
struct PrioritizedLoadRequest {
    /// Effective priority: `priority_override` if non-zero, else derived from
    /// `reason` (HotPlug/Boot = 200, CrashRecovery = 180, Dependency = 150,
    /// UserRequest = 100). No aging: hotplug storms are bounded by physical
    /// device count (finite), so UserRequest starvation is theoretical.
    /// The bounded device count and 4 concurrent loader threads ensure all
    /// queued requests drain within seconds under realistic workloads.
    pub priority: u8,
    pub request: DriverLoadRequest,
}

impl PartialOrd for PrioritizedLoadRequest {
    fn partial_cmp(&self, other: &Self) -> Option<core::cmp::Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for PrioritizedLoadRequest {
    fn cmp(&self, other: &Self) -> core::cmp::Ordering {
        self.priority.cmp(&other.priority)
    }
}
// PartialEq must agree with Ord (`a == b` iff `a.cmp(&b) == Equal`), so it
// compares only the priority. Two distinct requests with equal priority
// compare equal for heap-ordering purposes, which is all BinaryHeap needs;
// also comparing manifest_path here would violate the Ord contract.
impl PartialEq for PrioritizedLoadRequest {
    fn eq(&self, other: &Self) -> bool {
        self.priority == other.priority
    }
}
impl Eq for PrioritizedLoadRequest {}
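
The priority derivation and staging-heap drain order can be sketched as follows (simplified types; the real drain loop also takes the concurrency semaphore and executes each load on a `mod-loader` worker thread):

```rust
use std::collections::BinaryHeap;

/// Simplified stand-in for PrioritizedLoadRequest.
struct Prioritized {
    priority: u8,
    manifest: String,
}

// Ord/PartialEq both key on priority only, per the Ord contract.
impl Ord for Prioritized {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        self.priority.cmp(&other.priority) // max-heap: highest pops first
    }
}
impl PartialOrd for Prioritized {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Prioritized {
    fn eq(&self, other: &Self) -> bool {
        self.priority == other.priority
    }
}
impl Eq for Prioritized {}

/// Effective priority: the override if non-zero, else the LoadReason default
/// (HotPlug/Boot = 200, CrashRecovery = 180, Dependency = 150, UserRequest = 100).
fn effective_priority(priority_override: u8, reason_default: u8) -> u8 {
    if priority_override != 0 { priority_override } else { reason_default }
}

/// Worker-side drain: pop the highest-priority request each iteration.
fn drain_order(mut heap: BinaryHeap<Prioritized>) -> Vec<String> {
    let mut out = Vec::new();
    while let Some(req) = heap.pop() {
        out.push(req.manifest);
    }
    out
}
```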

Module loading sequence (executed by a loader worker thread after dequeuing):

1. Verify driver binary signature (ML-DSA-65 or SLH-DSA-SHAKE-128f per [Section 9.3](09-security.md#verified-boot-chain)).
   Reject if signature is absent or invalid.
2. Parse .kabi manifest: validate fields, check KabiVersion compatibility.
3. Allocate an isolation domain (MPK PKEY, POE overlay, or equivalent per arch).
   If no domains are available: reject with KernelError::ResourceExhausted.
4. Map driver binary into the new domain (read+execute, no write).
4a. For DMA-capable devices, construct the DmaDeviceHandle:
    let iommu_group = iommu_find_group(dev.bus_id());
    let domain = iommu_group.get_or_create_domain();
    domain.attach_device(dev.bus_id())?;
    let dma_handle = DmaDeviceHandle {
        iommu_domain: RcuCell::new(Some(Arc::clone(&domain))),
        dma_mask: dev.read_dma_mask(),
        coherent_dma_mask: dev.read_coherent_dma_mask(),
    };
    // dma_handle passed to driver's probe() via the `services` parameter.
    // See DmaDeviceHandle definition in [Section 4.14](04-memory.md#dma-subsystem).
5. Construct `DriverProbeConfig` from registry properties + kernel parameters + admin overrides.
6. Call driver_entry.init(services, descriptor, config). Apply timeout_ms watchdog.
7. On success: transition device state to Active, send Ok(handle) to result_tx.
8. On failure: free domain, unmap binary, send Err(...) to result_tx.
   Registry transitions device state to Error.

11.4.4.7 Driver Probe Configuration

Linux's module_param mechanism was dropped because it is tied to the .ko binary model (Section 19.7). UmkaOS replaces it with a structured, typed probe configuration that drivers receive at init() time.

Configuration sources (in precedence order, highest first):

  1. Admin overrides — written to /ukfs/kernel/drivers/<name>/config/<key> via umkafs. These persist across driver reload and crash recovery. Stored in the DeviceNode.admin_config field.
  2. Boot parametersumka.driver.<name>.<key>=<value> on the kernel command line. Parsed at boot and stored in the BOOT_DRIVER_CONFIG table.
  3. Registry properties — populated by the bus driver at device discovery time (DT compatible, PCI config space, ACPI _DSD, etc.).
  4. Driver defaults — declared in the .kabi IDL file.

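
The four-source merge can be sketched with plain string maps — illustrative only; the real kernel merges typed DriverConfigEntry values and the driver sees only the merged result:

```rust
use std::collections::HashMap;

/// Merge config sources, ordered lowest precedence first:
/// [driver defaults, bus/registry properties, boot params, admin overrides].
/// Later sources overwrite earlier ones per key.
fn merge_config(sources: &[HashMap<&str, &str>]) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    for src in sources {
        for (k, v) in src {
            merged.insert(k.to_string(), v.to_string());
        }
    }
    merged
}
```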
/// Per-driver probe configuration. Passed to `init()` alongside the
/// `KernelServicesVTable` and `DeviceDescriptor`.
///
/// Drivers read configuration values by key name. The kernel merges values
/// from all four sources (admin overrides > boot params > bus properties >
/// driver defaults) before calling init(). Drivers never see the source —
/// only the final merged value.
#[repr(C)]
pub struct DriverProbeConfig {
    /// Number of key-value pairs.
    pub count: u32,
    pub _pad: u32,
    /// Pointer to `count` entries. Kernel-owned, valid for the duration
    /// of `init()` only. Driver must copy values it needs to retain.
    pub entries: *const DriverConfigEntry,
}
// DriverProbeConfig: count(u32=4) + _pad(u32=4) + entries(ptr).
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<DriverProbeConfig>() == 16);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<DriverProbeConfig>() == 12);

/// **Per-tier delivery of DriverProbeConfig:**
///
/// - **Tier 0 (in-kernel)**: Direct pointer access. The `entries` pointer
///   points to kernel memory in the same address space. Zero-copy.
///
/// - **Tier 1 (hardware-isolated)**: The config array is allocated in the
///   shared read-only domain (PKEY 1 on x86, equivalent on other arches).
///   The driver's domain has read access to this memory during `init()`.
///   After `init()` returns, the shared mapping is revoked — the driver
///   must copy any values it needs to retain (per the doc comment above).
///
/// - **Tier 2 (process-isolated)**: The config array is serialized into
///   the driver process's address space via the `umka_driver_register`
///   syscall. The kernel copies the `DriverConfigEntry` array into a
///   page mapped into the driver process, and passes the userspace VA
///   as the `entries` pointer. The page is unmapped after `init()` returns.
///   See [Section 12.6](12-kabi.md#kabi-transport-classes) for the T2 driver lifecycle.

/// A single configuration key-value pair.
#[repr(C)]
pub struct DriverConfigEntry {
    /// Null-terminated key name (e.g., "ring_size", "max_queues").
    /// Maximum key length is 63 bytes; the 64-byte buffer includes the
    /// null terminator.
    pub key: [u8; 64],
    /// Value type tag (same encoding as `ParamValue` in
    /// [Section 20.9](20-observability.md#kernel-parameter-store)).
    pub value_type: u8,
    pub _pad: [u8; 7],
    /// Value data. Interpretation depends on `value_type`.
    pub value: DriverConfigValue,
}
// DriverConfigEntry: key([u8;64]=64) + value_type(u8=1) + _pad([u8;7]=7) +
//   value(DriverConfigValue=256) = 328 bytes.
const_assert!(size_of::<DriverConfigEntry>() == 328);

/// Configuration value union. Matches the `ParamValue` wire format.
#[repr(C)]
pub union DriverConfigValue {
    pub u32_val: u32,
    pub u64_val: u64,
    pub i32_val: i32,
    pub bool_val: u8,
    pub str_val: [u8; 256],
}
// DriverConfigValue: largest variant = str_val([u8;256]) = 256 bytes, align 8 (u64_val).
const_assert!(size_of::<DriverConfigValue>() == 256);
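For illustration, a driver-side lookup over the merged entries might look like the following safe-Rust sketch. The `ConfigEntry`/`ConfigValue` mirror types and the `config_get_u32` helper are hypothetical — the real driver walks the raw `*const DriverConfigEntry` array, which is valid only during `init()`:

```rust
/// Simplified, safe mirror of the KABI structures, for illustration only.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum ConfigValue {
    U32(u32),
    Bool(bool),
}

struct ConfigEntry {
    key: &'static str,
    value: ConfigValue,
}

/// Linear scan by key, as a driver would do over the merged entry array.
/// Entry counts are small (tens of keys), so O(n) is fine here.
fn config_get_u32(entries: &[ConfigEntry], key: &str, default: u32) -> u32 {
    for e in entries {
        if e.key == key {
            if let ConfigValue::U32(v) = e.value {
                return v;
            }
        }
    }
    default
}
```

Any value the driver wants after `init()` returns must be copied out, since the entry array is revoked (Tier 1) or unmapped (Tier 2) afterward.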

IDL declaration: Drivers declare their configuration schema in the .kabi file. The kabi-gen compiler validates boot parameters and admin overrides against this schema at load time — writes that violate the schema are rejected with EINVAL.

// ixgbe.kabi — NIC driver configuration parameters
module ixgbe_driver {
    // ... provides / requires ...

    config {
        /// Maximum number of TX/RX queue pairs.
        max_queues: u32 { min: 1, max: 128, default: 16 };
        /// Interrupt coalescing delay in microseconds.
        rx_usecs: u32 { min: 0, max: 1000, default: 50 };
        /// Enable XDP mode.
        enable_xdp: bool { default: false };
    }
}

umkafs exposure: Each loaded driver's configuration is readable (and writable for admin overrides) at /ukfs/kernel/drivers/<name>/config/<key>. Writing to a key while the driver is active stores the value as an admin override; the new value takes effect on the next driver reload.

Boot parameter parsing: The kernel command line scanner recognizes umka.driver.<name>.<key>=<value> patterns at boot. These are stored in a static BOOT_DRIVER_CONFIG: XArray<ArrayVec<DriverConfigEntry, 32>> keyed by driver name hash. The module loader merges boot config entries with bus properties and admin overrides before constructing the DriverProbeConfig.
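The command-line scan can be sketched as follows (a minimal illustration; `scan_driver_params` is a hypothetical name, and the real scanner stores results in BOOT_DRIVER_CONFIG rather than returning them):

```rust
/// Sketch of the cmdline scan for `umka.driver.<name>.<key>=<value>`
/// tokens. Returns (driver, key, value) triples; all other tokens are
/// ignored.
fn scan_driver_params(cmdline: &str) -> Vec<(String, String, String)> {
    let mut out = Vec::new();
    for token in cmdline.split_whitespace() {
        // Only tokens with the driver-config prefix are of interest.
        let Some(rest) = token.strip_prefix("umka.driver.") else { continue };
        // Split off the value first, then the driver name from the key.
        let Some((path, value)) = rest.split_once('=') else { continue };
        let Some((driver, key)) = path.split_once('.') else { continue };
        out.push((driver.to_string(), key.to_string(), value.to_string()));
    }
    out
}
```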

11.4.4.8 Driver Blacklist and Device Blocking

UmkaOS provides three mechanisms for preventing driver loading, ordered by scope:

1. Boot-time driver blacklist (umka.blacklist=<name>[,<name>...]):

Prevents named drivers from ever being loaded. The kernel command line scanner populates a static DRIVER_BLACKLIST: SpinLock<ArrayVec<ArrayString<64>, 64>> at boot. The module loader checks this list before signature verification — a blacklisted driver is rejected with KernelError::Blacklisted without reading the binary.

# Kernel command line example:
umka.blacklist=nouveau,snd_hda_intel

2. Runtime driver disable (umkactl driver disable <name>):

Adds the driver to the runtime blacklist. Any devices currently using the driver are stopped (state transition to Stopping → Removed). New devices matching the driver's rules will remain in Discovered state until the driver is re-enabled.

# Disable a driver at runtime:
umkactl driver disable nouveau
# Or via umkafs:
echo "disable" > /ukfs/kernel/drivers/nouveau/control

# Re-enable:
umkactl driver enable nouveau
echo "enable" > /ukfs/kernel/drivers/nouveau/control

3. Per-device blocking (umkactl device block <BDF|path>):

Prevents a specific device from being probed. The device remains in Discovered state with the blocked flag set in its DeviceNode. Blocked devices do not participate in driver matching.

/// Device blocking state in DeviceNode.
pub struct DeviceBlockState {
    /// True if the device is blocked from probing.
    pub blocked: bool,
    /// Source of the block (for admin diagnostics).
    pub source: BlockSource,
}

#[repr(u8)]
pub enum BlockSource {
    /// Blocked via `umka.block_device=` boot parameter.
    BootParam = 0,
    /// Blocked via `umkactl device block` or umkafs write.
    AdminAction = 1,
    /// Blocked by FMA after repeated failures ([Section 20.1](20-observability.md#fault-management-architecture)).
    FmaQuarantine = 2,
}

Boot-time device blocking uses umka.block_device=<BDF>[,<BDF>...] (PCI Bus:Device.Function format). This is useful for hardware that causes system instability or for reserving devices for VFIO passthrough.


11.4.5 Device Lifecycle

11.4.5.1 State Machine

The registry manages each device through a well-defined state machine. Only the kernel initiates transitions — drivers cannot set their own state.

                                    +-> [Error] ------+----> [Quarantined]
                                    |                  |          |
[Discovered] -> [Matching] -> [Loading] -> [Initializing] -> [Active]
      ^              ^                          |                 |
      |              |                          |                 v
      |              +--- (no match) -----------+            [Suspending]
      |              |                                            |
      |              +-- (admin re-enable) -- [Quarantined]      v
      +-- (hotplug rescan) ---- [Removed]                  [Suspended]
      |                            ^                            |
      |                            |                            v
      +-- (driver reload) ----- [Stopping] <-------------- [Resuming]
                                   ^                            |
                                   |                            v
                                [Recovering] <------------- [Active]
#[repr(u32)]
/// FMA ([Section 20.1](20-observability.md#fault-management-architecture--fma-diagnosis-engine)) may trigger `DeviceState` transitions
/// via `fma_demote_device()` and `fma_disable_device()` when diagnosis rules fire.
pub enum DeviceState {
    Discovered    = 0,  // Node exists, no driver bound
    Matching      = 1,  // Match engine evaluating
    Loading       = 2,  // Driver binary being loaded
    Initializing  = 3,  // driver init() called, waiting for result
    Active        = 4,  // Driver running normally
    Suspending    = 5,  // Suspend requested, waiting for driver ack
    Suspended     = 6,  // Driver has acknowledged suspend
    Resuming      = 7,  // Resume requested, waiting for driver ack
    Stopping      = 8,  // Driver being stopped (unload, removal, admin)
    Recovering    = 9,  // Driver crashed, recovery in progress
    Removed       = 10, // Device physically removed (hotplug)
    Degraded      = 11, // Device operational but with reduced capability (FMA drain,
                        // correctable error threshold exceeded); new I/O may be
                        // rejected or throttled. Set by `fma_demote_device()`.
    Error         = 12, // Fatal error, non-functional
    Quarantined   = 13, // Driver permanently disabled (crash threshold exceeded);
                        // requires manual re-enable via sysfs
    FaultedUnrecoverable = 14, // Hardware fault that cannot be recovered by driver
                               // reload (e.g., uncorrectable ECC, PCIe link failure,
                               // device firmware hang). Set by FMA when diagnosis
                               // determines the fault is in the hardware, not the
                               // driver. Device remains in this state until physical
                               // replacement or manual override via sysfs.
}

11.4.5.2 Transition Table

| From | To | Trigger | Driver Callback |
|------|----|---------|-----------------|
| Discovered | Matching | New device or new driver registered | None |
| Matching | Loading | Match found | None |
| Matching | Discovered | No match | None |
| Loading | Initializing | Binary loaded, vtable exchange begins | init() |
| Initializing | Active | init() returns ProbeResult::Ok | None |
| Initializing | Discovered | init() returns ProbeResult::Deferred | None (re-queued for deferred probe) |
| Initializing | Error | init() returns ProbeResult::Err or timeout | None |
| Active | Suspending | PM suspend request | suspend() |
| Suspending | Suspended | suspend() returns success | None |
| Suspending | Error | suspend() timeout or failure | shutdown() (force) |
| Suspended | Resuming | PM resume request | resume() |
| Resuming | Active | resume() returns success | None |
| Resuming | Recovering | resume() failure | None |
| Active | Stopping | Admin request, unload, or hotplug removal | shutdown() |
| Active | Recovering | Fault detected (domain violation, watchdog, crash) | None |
| Recovering | Loading | Recovery initiated, fresh binary load | (fresh init()) |
| Error | Quarantined | Crash threshold exceeded (5+ failures in window) | None |
| Quarantined | Matching | Manual administrator re-enable via sysfs | None |
| Error | FaultedUnrecoverable | FMA diagnosis determines hardware fault (not driver) | None |
| FaultedUnrecoverable | Matching | Admin override via sysfs after hardware repair/replacement | None |
| Any | Removed | Physical device gone + teardown complete | shutdown() if possible |
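Because only the kernel initiates transitions, the table is mechanically checkable. A minimal sketch of kernel-side transition validation, covering a representative subset of rows (`transition_allowed` is illustrative, not the registry's actual API):

```rust
/// Subset of DeviceState sufficient for the sketch.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum DeviceState {
    Discovered, Matching, Loading, Initializing, Active,
    Stopping, Recovering, Error, Quarantined, Removed,
}

/// Returns true if the transition appears in the table above.
/// "Any -> Removed" is handled as a wildcard row.
fn transition_allowed(from: DeviceState, to: DeviceState) -> bool {
    use DeviceState::*;
    if to == Removed {
        return true; // physical removal is legal from any state
    }
    matches!(
        (from, to),
        (Discovered, Matching)
            | (Matching, Loading)
            | (Matching, Discovered)
            | (Loading, Initializing)
            | (Initializing, Active)
            | (Initializing, Discovered)
            | (Initializing, Error)
            | (Active, Stopping)
            | (Active, Recovering)
            | (Recovering, Loading)
            | (Error, Quarantined)
            | (Quarantined, Matching)
    )
}
```

A request from a driver to jump, say, from Active back to Matching would be rejected before any state is touched.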

11.4.5.3 Timeouts

Every callback has a timeout. If the driver does not respond within the timeout, the kernel force-stops it (same mechanism as crash recovery: revoke isolation domain / kill process).

| Callback | Tier 1 Timeout | Tier 2 Timeout |
|----------|----------------|----------------|
| init() | 5 seconds | 10 seconds |
| shutdown() | 3 seconds | 5 seconds |
| suspend() | 2 seconds | 5 seconds |
| resume() | 2 seconds | 5 seconds |

All timeouts are configurable via kernel parameters.


11.4.6 Power Management

11.4.6.1 Power States

#[repr(u32)]
pub enum PowerState {
    D0Active    = 0,    // Fully operational
    D1LowPower  = 1,    // Low-power idle (quick resume)
    D2DeepSleep = 2,    // Deeper sleep (longer resume, less power)
    D3Hot       = 3,    // Low-power, PME-capable. Device retains soft state.
    D3Cold      = 4,    // Full power off. Device loses all state, requires full re-init.
}

11.4.6.2 Topology-Driven Ordering

This is the primary advantage of having a kernel-owned device tree. Suspend/resume ordering is derived from topology, not maintained as a separate list.

Suspend order (depth-first, leaves first):

For each subtree rooted at device D:
  1. Suspend all clients of D (provider-client links)
  2. Recursively suspend all children of D (bottom-up)
  3. Suspend D itself

Resume order (exact reverse):

For each subtree rooted at device D:
  1. Resume D itself
  2. Recursively resume all children of D (top-down)
  3. Resume all clients of D

This is computed once by topological sort when a system PM transition begins. Provider-client edges are treated as additional dependency edges in the sort. The result is cached and invalidated when the tree topology changes.

Why this is better than Linux: Linux maintains a dpm_list that approximates topological order but can get it wrong. The ordering is based on registration order and heuristic adjustments, not the actual device tree. UmkaOS computes the correct order directly from the tree.
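The leaf-first rule reduces to a post-order walk of the device tree, with resume as the exact reverse. A toy sketch over tree edges only (the real sort also folds in provider-client edges as extra dependencies):

```rust
/// Minimal device-tree node for illustration.
struct Node {
    name: &'static str,
    children: Vec<Node>,
}

/// Post-order DFS: children (subtrees) are emitted before their parent,
/// which yields the suspend order. Resume order is the reverse.
fn suspend_order<'a>(node: &'a Node, out: &mut Vec<&'a str>) {
    for child in &node.children {
        suspend_order(child, out); // suspend subtrees first
    }
    out.push(node.name); // then the device itself
}
```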

11.4.6.3 PM Failure Handling

When a driver fails to suspend within its timeout:

  1. Registry marks the node as Error.
  2. Driver is force-stopped (revoke isolation domain / kill process).
  3. Suspend continues for remaining devices — one broken driver does not block the entire system.
  4. On resume, the failed device's driver is reloaded fresh (leveraging crash recovery from Section 11.9).
  5. Failure is logged with context for admin diagnosis.

This directly implements the principle from Section 18.4: "Tier 1 and Tier 2 drivers that fail to suspend within a timeout are forcibly stopped and restarted on resume."

11.4.6.4 Runtime Power Management

The canonical runtime PM state machine — two-counter model with RtpmState enum, active_count counter, autosuspend timer, parent-child sequencing, and driver callbacks — is specified in Section 7.5.

Driver registry integration: the device registry is responsible for:

  • Per-device policy storage: each DeviceNode stores the runtime PM fields defined in Section 7.5 (RuntimePm). The rtpm_* API functions operate on these fields.
  • KABI trampoline activity tracking: every KABI call that passes through the service vtable trampoline (Section 11.6) calls rtpm_mark_active() on the target device, which resets the autosuspend timer. The per-call overhead is one Ordering::Relaxed atomic store to a per-device cache line — no contention with other devices, no cross-core invalidation traffic.
  • Tier-aware behavior: Tier 1 drivers that fail to complete a runtime_suspend() callback within RTPM_SUSPEND_TIMEOUT_MS are forcibly domain-isolated and restarted. Tier 2 drivers are process-killed and restarted. Tier 0 drivers log warnings only.
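The trampoline's activity mark can be sketched as a single relaxed store to a per-device timestamp (the `DeviceRtpm` type and method names here are illustrative, not the actual kernel structures):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-device runtime-PM activity record.
struct DeviceRtpm {
    last_active_ns: AtomicU64,
}

impl DeviceRtpm {
    /// Hot path: one relaxed store per KABI call. Relaxed is enough
    /// because the autosuspend timer only needs an eventually-visible
    /// "recently used" hint, not any memory ordering.
    fn mark_active(&self, now_ns: u64) {
        self.last_active_ns.store(now_ns, Ordering::Relaxed);
    }

    /// Cold path: the autosuspend timer checks how long the device
    /// has been idle.
    fn idle_for(&self, now_ns: u64) -> u64 {
        now_ns.saturating_sub(self.last_active_ns.load(Ordering::Relaxed))
    }
}
```

Because each device owns its own cache line, the hot-path store causes no cross-core invalidation traffic between unrelated devices.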

Runtime PM is independent of system PM. A device can be in D1 (runtime idle) while the system is fully running.


11.4.7 Hot-Plug

11.4.7.1 Bus Drivers as Event Sources

Bus drivers (PCI host bridge, USB XHCI, USB hub) are the source of hotplug events. They detect device arrival/departure and report to the registry through KABI methods.

A bus driver is identified by is_bus_driver = 1 in its DriverManifest. It has the HOTPLUG_NOTIFY capability (already defined in capability.rs).

11.4.7.2 Device Arrival

1. Bus driver detects new device
   (PCIe hot-add interrupt, USB port status change, ACPI _STA change)
2. Bus driver calls registry_report_device() via KABI
   - Passes: parent handle, bus type, bus-specific identity, initial properties
3. Registry creates a new DeviceNode in Discovered state
4. Registry populates properties from the bus driver's report
5. Registry runs the match engine on the new node
6. If match found: load driver, init, transition to Active
7. Registry emits uevent for Linux compatibility (udev/systemd)
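Steps 2-6 can be condensed into a toy sketch that collapses the intermediate Matching/Loading/Initializing states (`Registry::report_device` here is illustrative, not the KABI signature):

```rust
#[derive(Debug, PartialEq)]
enum State { Discovered, Active }

struct DeviceNode { ident: String, state: State }

struct Registry { nodes: Vec<DeviceNode> }

impl Registry {
    /// Called by a bus driver (via KABI in the real system) on hot-add.
    /// `has_match` stands in for the match-engine outcome (steps 5-6).
    fn report_device(&mut self, ident: &str, has_match: bool) -> &DeviceNode {
        // Step 3: the node is created in Discovered state...
        let mut node = DeviceNode { ident: ident.to_string(), state: State::Discovered };
        // ...and a successful match (and init) drives it to Active.
        if has_match {
            node.state = State::Active;
        }
        self.nodes.push(node);
        self.nodes.last().unwrap()
    }
}
```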

11.4.7.3 Device Removal (Orderly)

1. Bus driver detects device departure (link down, port status change)
2. Bus driver calls registry_report_removal() via KABI
3. Registry processes the subtree bottom-up:
   a. For each child (deepest first):
      - Stop the child's driver (shutdown callback)
      - Release capabilities
      - Remove child node
   b. Stop the target device's driver
   c. Release all capabilities
   d. Remove the DeviceNode
4. Registry emits uevent (removal)

11.4.7.4 Surprise Removal

When a device is physically yanked without warning (e.g., USB unplug during I/O):

  1. Bus driver detects absence (failed transaction, link down).
  2. Registry receives the removal report.
  3. All pending I/O for the device and its children is completed with -EIO.
  4. shutdown() is called on the driver — it may fail quickly because the hardware is gone. This is expected and handled gracefully (timeout → force-stop).
  5. The node subtree is torn down.

This mirrors crash recovery but is initiated by the bus driver rather than by a fault.

11.4.7.5 Uevent Compatibility

For Linux userspace compatibility (udev, systemd-udevd), the registry emits uevent notifications matching the Linux format:

ACTION=add
DEVPATH=/devices/pci0000:00/0000:03:00.0
SUBSYSTEM=pci
PCI_ID=8086:2723
PCI_CLASS=028000
DRIVER=umka-iwlwifi

This feeds into umka-sysapi/src/sys/ for sysfs and umka-sysapi/src/dev/ for devtmpfs, as outlined in Section 19.1.
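The uevent payload is plain newline-separated KEY=VALUE text, which is what udev/systemd-udevd parse. A minimal formatter sketch (`format_uevent` is a hypothetical helper, not the registry's actual emitter):

```rust
/// Build a Linux-format uevent payload: ACTION first, then the remaining
/// KEY=VALUE pairs, one per line.
fn format_uevent(action: &str, pairs: &[(&str, &str)]) -> String {
    let mut out = format!("ACTION={}\n", action);
    for (k, v) in pairs {
        out.push_str(&format!("{}={}\n", k, v));
    }
    out
}
```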


11.4.8 Concurrency and Performance

11.4.8.1 Locking Strategy

  • Read path (hot): Property queries, service lookups, sysfs reads. Reader-writer lock allows concurrent reads.
  • Write path (cold): Node creation, state transitions, driver binding, hotplug. Takes exclusive write lock.
  • Per-node state: Atomic field for lock-free state checks ("is this device active?" does not need the tree lock).
  • PM ordering cache: Computed once per PM transition. Invalidated when tree topology changes (hotplug).

11.4.8.2 Scalability

  • Device enumeration: O(n*m) where n = match rules, m = unmatched devices. With <1000 drivers and <200 devices on a typical system, this completes in microseconds. Runs once at boot + on hotplug.
  • Service lookup: Hash-indexed by service name. O(1) amortized.
  • Property query: Binary search on sorted PropertyTable. O(log n), n < 30.
  • PM ordering: Topological sort is O(V+E) where V = nodes, E = edges. Computed once, cached.

11.4.8.3 Memory Budget

| Component | Per Node | Notes |
|-----------|----------|-------|
| DeviceNode struct | ~512 bytes | Fixed-size fields |
| PropertyTable (avg 15 props) | ~1 KB | Key strings + values |
| Children/providers/clients | ~128 bytes | Vec overhead |
| Total per node | ~1.7 KB | |

A typical desktop with ~200 devices: ~340 KB. A busy server with ~1000 devices: ~1.7 MB. Well within kernel memory budget.


11.4.9 Resolved Design Decisions

The following design questions have been resolved:

1. USB topology depth: full topology. The registry represents the full USB hub topology (up to 7 levels). Hub nodes carry a UsbHub property struct with port count and per-port power control. This is required for correct power-management ordering (suspend leaf-first, resume root-first) and surprise-removal cascading (removing a hub invalidates all downstream devices). The node overhead is trivial — one DeviceNode per hub.

2. GPU sub-device modeling: child nodes. Each GPU sub-function (display controller, compute engine, video encoder, copy engine) is a child DeviceNode with its own BusIdentity::PciFunction and capability flags. The parent GPU node holds shared state (VRAM, power domain). Each child binds its own extension vtable (AccelComputeVTable, AccelDisplayVTable per Section 22.1) while sharing the parent's AccelBaseVTable. This enables independent driver binding per sub-function (e.g., a display driver and a compute driver on the same GPU).

3. Firmware enumerators: pluggable Tier 0 backends. A FirmwareEnumerator trait defines two methods: enumerate(registry: &mut DeviceRegistry) and match_device(node: &DeviceNode) -> Option<DeviceProperties>. Two implementations:

  • AcpiEnumerator — walks the ACPI namespace (_STA, _HID, _CRS), creates platform device nodes.
  • DtEnumerator — walks the flattened device tree compatible strings, creates platform device nodes.

Architecture selection is compile-time via arch::current::firmware_enumerator(): x86 → ACPI, ARM/RISC-V → DT, ARM server → both. Both enumerators are kernel-internal (Tier 0), never exposed through KABI.

4. Multi-function PCI devices: one node per function. The topology is: PciBridge → PciSlot → PciFunction(0..N). The PciSlot node is a lightweight grouping node (no driver binding) that carries the slot's physical identity (segment/bus/device). Each PciFunction child has its own BAR resources, MSI vectors, and IOMMU group assignment. This matches Linux's sysfs model and makes SR-IOV VF creation (Decision 8) natural — VFs are additional function children. Recovery ordering for multi-function devices follows the device tree: if function 0 crashes, sibling functions (1, 2, ...) are notified via the registry's DeviceEvent::SiblingReset event. Each sibling driver independently decides whether to re-probe its function or wait for the parent slot to stabilize. The parent PciSlot node coordinates FLR (Function Level Reset) if the failing function requests it.

5. Service versioning: yes, using InterfaceVersion. registry_publish_service requires the service vtable to start with the standard vtable_size: u64, version: u32 header, same as all KABI vtables (Section 12.2). Lookup performs major-version matching; minor-version differences are handled by vtable_size-based field presence detection. No new mechanism — reuses the existing KABI version negotiation protocol.

6. Multi-provider services: topology-aware lookup + enumeration variant. registry_lookup_service(name) returns the closest provider by walking: same device → sibling nodes → parent subtree → global. registry_lookup_all_services(name) returns an iterator over all providers, ordered by topological distance. The "closest" heuristic covers the common case (e.g., an I2C client finding its controller); the enumeration variant handles multi-path cases (RAID member discovery, network bonding).

7. Persistent device naming: yes, bus-identity + serial derived. The registry generates a stable device path from bus-specific identity:

| Bus | Stable Path Source |
|-----|--------------------|
| PCI | segment:bus:device.function (stable if ACPI/DT provides _BBN/_SEG) |
| USB | Hub chain + port number (stable as long as physical topology unchanged) |
| NVMe | PCI path + namespace ID |
| SCSI | WWID / VPD page 83 |

The stable path is stored as a stable_path: ArrayString<128> property on each DeviceNode. The SysAPI layer creates /dev/disk/by-id/, /dev/disk/by-path/ etc. as symlinks. The kernel itself never uses these names — they are purely for userspace convenience.

8. IOMMU group granularity for SR-IOV: PF driver creates VF nodes via KABI. The PF driver calls registry_create_vf_nodes(pf_handle: DeviceHandle, count: u32) which:

  1. Validates the PF has ACS on its upstream port (required for per-VF IOMMU groups).
  2. Creates count child DeviceNodes with BusIdentity::Pci entries for each VF BDF.
  3. Assigns each VF its own IOMMU group (if ACS permits) or groups them with the PF.
  4. Triggers driver matching on each new VF node (the same driver or the VFIO passthrough driver are both valid matches).

Destruction: registry_destroy_vf_nodes(pf_handle) tears down all VFs, unmapping their IOMMU entries and revoking any VFIO leases. Fails with IO_RESULT_BUSY if any VF is actively in use by a guest VM.

9. AML interpreter scope: minimal production subset, growth-on-demand. The initial interpreter supports the following ACPI methods (the minimum for real x86 server/desktop boot): _STA, _CRS, _HID, _UID, _BBN (base bus number), _SEG (PCI segment), _PRT (PCI routing table), _OSI (OS identification — most DSDTs gate behavior on this), _DSM (device-specific method — used by PCIe, NVMe, USB controllers), _PS0/_PS3 (power state transitions), _INI (device initialization), _REG (operation region handler registration), and _CBA (ECAM base for PCIe config space on modern systems).

Required AML bytecode opcodes: Store, If/Else, Return, Buffer, Package, Integer/String/Buffer operations, Method invocation, OperationRegion, Field. Without _OSI and _DSM, most x86 laptops and many servers fail to enumerate devices correctly. Extend only when real hardware fails to enumerate — do not speculatively implement unused methods.

10. Resource reservation for hot-plug: configurable per-slot defaults, ACPI-guided. Default reservation per hot-plug capable slot: 256MB MMIO, 256MB prefetchable MMIO, and 8 bus numbers (matching Linux's heuristic). Configurable via kernel command-line parameters (pci_hp_mmio=128M, pci_hp_prefetch=256M, pci_hp_buses=4). The PCI allocator reads ACPI _HPP (Hot Plug Parameters) and _HPX (Hot Plug Extensions) methods if present — these override the defaults with firmware-provided values. Reserved regions are tracked as "allocated but unoccupied" to prevent other devices from claiming them.

11. KABI long-term evolution: 5 releases default, LTS KABI opt-in. The support window is 5 major releases. A KABI version may be designated LTS at release time (not retroactively), extending its support to 7 releases. LTS designation requires that at least one major driver ecosystem (storage, network, or accelerator) has certified against that KABI version.

Lifecycle (see Section 12.2 Rule 2a):

  • At KABI_vN+3 (or +5 for LTS): deprecated methods gain #[deprecated(since = "KABI_vN")] and emit a kernel log warning when called.
  • At KABI_vN+5 (or +7 for LTS): deprecated methods are replaced with kabi_deprecated_stub (returns -ENOSYS). The vtable slot is preserved — vtable_size does not shrink. Drivers still within the support window receive a clean error; drivers outside the window fail version negotiation before reaching the tombstone.
  • The kabi_version field (not vtable_size) is the primary version discriminant, preventing size-based collisions between pre-deprecation and post-deprecation vtables.

12. IOMMU nested translation performance: proactive large page promotion. The IOMMU mapper always selects the largest page size that fits the DMA mapping alignment and size:

| Condition | IOMMU Page Size |
|-----------|-----------------|
| Mapping ≥ 1GB and 1GB-aligned | 1GB (rare; occurs for GPU BAR mappings) |
| Mapping ≥ 2MB and 2MB-aligned | 2MB |
| All other cases | 4KB |

This is a policy in the IOMMU mapping path, not a reactive monitor. Per-device IOMMU stats (IOTLB miss rate via performance counters, if available) are exposed through the FMA health telemetry path (Section 20.1) for observability, but the promotion decision itself is always proactive.
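The selection policy in the table reduces to a size-and-alignment check. A sketch, checking IOVA alignment only (the real mapper would also check physical-address alignment and the page sizes the IOMMU hardware actually supports):

```rust
const SZ_4K: u64 = 4 << 10;
const SZ_2M: u64 = 2 << 20;
const SZ_1G: u64 = 1 << 30;

/// Proactively pick the largest page size permitted by both the mapping
/// size and the IOVA alignment, per the policy table above.
fn iommu_page_size(iova: u64, size: u64) -> u64 {
    if size >= SZ_1G && iova % SZ_1G == 0 {
        SZ_1G
    } else if size >= SZ_2M && iova % SZ_2M == 0 {
        SZ_2M
    } else {
        SZ_4K
    }
}
```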

11.5 IOMMU and DMA Mapping

Summary: This section specifies IOMMU group management, IOMMU domain types, per-device DMA identity mapping, and PCIe Active State Power Management (ASPM). These are hardware-level isolation and power mechanisms that the device registry (Section 11.4) uses to enforce driver DMA containment and link-level power savings. See Section 11.4 for the registry data model and Section 11.6 for service discovery and boot integration.

11.5.1 IOMMU Groups

IOMMU groups model hardware isolation boundaries. An IOMMU group is the smallest unit of device isolation that the hardware can enforce — all devices in a group share the same IOMMU domain (page table).

pub struct IommuGroupId(pub u64);

pub enum IommuDomainType {
    /// Kernel DMA domain — device DMA goes through kernel-managed IOMMU
    /// page tables. Default for all devices.
    Kernel,

    /// Identity-mapped DMA domain — IOMMU programs 1:1 physical-to-bus
    /// mapping. Device DMA addresses equal physical addresses. Requires
    /// explicit admin opt-in per device. See "Per-Device DMA Identity Mapping"
    /// below for constraints and security implications.
    Identity {
        /// Upper bound of the 1:1 mapping (typically max_phys_addr).
        phys_range_end: u64,
    },

    /// VM passthrough domain — entire group assigned to a VM. The VM's
    /// IOMMU page tables control device DMA. Used for VFIO passthrough.
    VmPassthrough {
        vm_id: u64,
        /// Second-level page table root (EPT/NPT base).
        /// **Lifetime invariant**: valid for the lifetime of the owning VM.
        /// The VM destruction path tears down the IOMMU domain (releasing this
        /// root) before freeing the EPT/NPT pages.
        page_table_root: u64,
    },

    /// Userspace DMA domain — for Tier 2 drivers that need direct DMA
    /// (e.g., DPDK-style networking). IOMMU restricts DMA to the
    /// driver process's permitted regions.
    UserspaceDma {
        owning_pid: u64,
    },
}

Why IOMMU groups matter:

  • VFIO passthrough: When assigning a device to a VM (GPU, NIC, NVMe controller, FPGA, etc.), the kernel must assign the entire IOMMU group. If two devices share a group (e.g., GPU and its audio function on the same PCI slot, or NIC and a co-located function), both must be assigned together. The registry validates this constraint before permitting passthrough. See Section 22.7 for GPU-specific passthrough details.

  • ACS (Access Control Services): PCIe ACS capabilities determine group boundaries. With ACS, each PCI function can be its own group. Without ACS, all devices behind a non-ACS bridge form a single group (because they could DMA to each other without going through the IOMMU).

  • Isolation guarantee: The IOMMU group is the hardware's isolation primitive. The registry enforces that no device in a passthrough group remains in the kernel domain — this would allow the VM to DMA to the kernel device's memory.

Group discovery:

During PCI enumeration (Section 11.6), the registry determines IOMMU groups by walking the PCI topology and checking ACS capability bits:

For each PCI device:
  1. Walk upstream to the root port, checking ACS at each bridge.
  2. If all bridges have ACS: device is in its own group.
  3. If a bridge lacks ACS: all devices below that bridge share a group.
  4. Peer-to-peer devices behind the same non-ACS switch: same group.
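The walk above can be sketched as follows. This is a toy model: `iommu_group_key` and the `Bridge` struct are illustrative, and real group IDs come from the IOMMU driver.

```rust
struct Bridge {
    id: u32,
    has_acs: bool,
}

/// `upstream_path` is ordered from the device's nearest bridge up to the
/// root port. If every bridge has ACS, the device isolates into its own
/// group (keyed by its own id). Otherwise the group is keyed by the
/// topmost non-ACS bridge — every device below it shares that group.
fn iommu_group_key(device_id: u32, upstream_path: &[Bridge]) -> u32 {
    let mut key = device_id;
    for bridge in upstream_path {
        if !bridge.has_acs {
            key = bridge.id; // everything below this bridge shares a group
        }
    }
    key
}
```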

Passthrough assignment flow:

1. Admin requests device passthrough for VM (via /dev/vfio/N or umka-kvm API)
2. Registry looks up device's DeviceNode → iommu_group
3. Registry checks: all devices in group unbound or assignable?
4. If yes: unbind kernel drivers, switch group to VmPassthrough domain
5. Program IOMMU with VM's second-level page tables
6. VM's guest OS sees the device and loads its own driver
7. On VM teardown: switch back to Kernel domain, rebind kernel drivers

The registry prevents partial group assignment: if device A and device B share IOMMU group 7, and only A is requested for passthrough, the request is rejected with -EBUSY unless B is also unbound. This prevents a safety violation where the VM could DMA to B's kernel-managed memory.

VmPassthrough vs IoAddrSpace — Canonical Path Selection

Two IOMMU programming paths exist for assigning a device to a VM. They are mutually exclusive per device — a device must use exactly one path at any given time:

| Path | Entry point | IOMMU domain type | Mapping authority | Use case |
|------|-------------|-------------------|-------------------|----------|
| KVM direct | KVM_CREATE_VM + KVM_ASSIGN_DEV_IRQ or internal umka-kvm API | IommuDomainType::VmPassthrough | KVM programs the IOMMU page table directly from the VM's SLAT root. No software shadow map — the hardware page table is the EPT/NPT/Stage-2 table. | Legacy KVM device assignment (pre-VFIO) or internal umka-kvm fast path for statically assigned devices where the VMM does not need per-mapping control. |
| iommufd/VFIO | /dev/iommu + IOMMU_IOAS_MAP / /dev/vfio/N | IommuDomainType::Kernel (the IoAddrSpace wraps a kernel-type domain) | IoAddrSpace.mappings BTreeMap is the software authority; IoAddrSpace.pgd is the hardware authority. | Standard QEMU/Firecracker/Cloud Hypervisor path. VMM explicitly maps guest memory regions via ioctls, gaining per-range control (map, unmap, migrate). |

Invariant: At device-attach time, the registry checks the device's current IommuDomainType. If the device is already in a VmPassthrough domain, an IOMMU_DEVICE_ATTACH (iommufd path) is rejected with EBUSY. Conversely, if the device is already attached to an IoAddrSpace, a KVM direct-assign request is rejected. Switching between paths requires full detach (IOTLB drain + IOMMU context entry removal) followed by reattach via the other path.

Recommendation: The iommufd/VFIO path is the preferred production path because it supports live migration (IOMMU_IOAS_COPY), per-range DMA protection, and VFIO interrupt remapping. The VmPassthrough fast path exists for minimal-overhead scenarios (embedded hypervisors, static device assignment) where the VMM does not need dynamic IOVA management.

IOMMU Group Assignment Algorithm (device discovery):

The following algorithm runs during device enumeration (Section 2.1, boot hardware discovery) to assign each device to an isolation domain:

For each PCIe device discovered during enumeration:
  a. Query the IOMMU group ID for the device from the IOMMU driver.
     (IOMMU groups are defined by hardware — devices sharing a stream ID
     or lacking ACS isolation are in the same group.)
  b. If the group ID is new (first device in this group):
     - Allocate a new isolation domain for this group.
     - Register: iommu_group_domains[group_id] = new_domain.
  c. If the group ID already has a domain assignment:
     - Assign this device to the existing domain.
     - Log: "Device [bus:dev.fn] shares IOMMU group [id] with [other devices]
       — assigned to same isolation domain [domain_id]."

ACS (Access Control Services) check:
  PCIe ACS must be enabled on root ports and upstream ports/switches to allow
  per-function IOMMU groups. If ACS is absent on an upstream bridge:
  - All devices downstream of that bridge share one IOMMU group.
  - They are assigned to a shared isolation domain.
  - Log a warning: "PCIe switch at [bus:dev.fn] lacks ACS — [N] devices share
    one IOMMU group. Per-device isolation not possible."
  - UmkaOS does NOT disable the device — it runs with reduced isolation (shared
    domain) and logs the degraded state to FMA (Section 20.1).

Singleton groups (preferred):
  When ACS is present and hardware supports per-function translation,
  each device gets its own IOMMU group and its own isolation domain.
  This is the default and preferred configuration for Tier 1 drivers.

Driver cgroup co-isolation enforcement:
  UmkaOS enforces that all devices in an IOMMU group belong to the same driver
  cgroup. If a user attempts to assign two devices from the same IOMMU group
  to different drivers, the second assignment fails with -EACCES and the error
  message: "Device [bus:dev.fn] shares IOMMU group [id] with device [other
  bus:dev.fn] — both must be assigned to the same driver."
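
The co-isolation rule amounts to first-claimer-wins group ownership. A minimal sketch, assuming a hypothetical GroupOwnership map (the real registry tracks driver cgroups, not strings):

```rust
use std::collections::HashMap;

/// First driver to claim a device in an IOMMU group becomes the group's
/// owner; a later claim by a different driver fails (-EACCES in the
/// real registry). `GroupOwnership` is an illustrative stand-in.
pub struct GroupOwnership {
    owners: HashMap<u32, String>, // group_id -> owning driver
}

impl GroupOwnership {
    pub fn new() -> Self { GroupOwnership { owners: HashMap::new() } }

    pub fn assign(&mut self, group_id: u32, driver: &str) -> Result<(), String> {
        match self.owners.get(&group_id) {
            // Same driver re-claiming its own group is always fine.
            Some(owner) if owner != driver => Err(format!(
                "device shares IOMMU group {group_id} with a device owned by \
                 {owner} — both must be assigned to the same driver"
            )),
            _ => {
                self.owners.insert(group_id, driver.to_string());
                Ok(())
            }
        }
    }
}
```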

11.5.1.1 IOMMU Group Formation: pci_device_group Algorithm

The pseudo-code above describes the consumer side of IOMMU group assignment — how the device registry attaches devices to existing domains. The following specifies the formation side: how pci_device_group determines which IOMMU group a newly enumerated PCI device belongs to. This algorithm matches the Linux implementation (drivers/iommu/iommu.c, drivers/iommu/intel/iommu.c, drivers/iommu/amd/iommu.c, drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c) and is the authoritative procedure for all three major IOMMU hardware families.

/// ACS flags that together guarantee DMA request isolation between PCIe peers.
/// Without all four bits set on an upstream bridge, devices behind that bridge
/// can issue peer-to-peer DMA that bypasses IOMMU translation entirely.
///
/// - SV  (Source Validation): bridge verifies the requester ID is valid
/// - RR  (Request Redirection): DMA requests are redirected through the IOMMU
/// - CR  (Completion Redirection): completions return through the IOMMU
/// - UF  (Upstream Forwarding): upstream traffic is forwarded to the RC
const REQ_ACS_FLAGS: AcsFlags =
    AcsFlags::SV | AcsFlags::RR | AcsFlags::CR | AcsFlags::UF;

/// Determine the IOMMU group for a PCI device.
///
/// Called during driver registration and device hotplug. Returns the
/// `IommuGroup` to which this device must belong. A device can only
/// be assigned an `IommuDomain` that covers its entire group — partial
/// assignment is a hardware violation and is rejected by the registry.
///
/// # Algorithm
///
/// The four steps below are executed in order. The first step that
/// produces an existing group terminates the search and returns that group.
/// If no group is found, a fresh group is allocated in step 4.
pub fn pci_device_group(
    dev: &PciDevice,
    iommu: &IommuInstance,
) -> Arc<IommuGroup> {
    // Step 1: DMA alias resolution.
    //
    // Conventional PCI devices behind a PCIe-to-PCI bridge have their
    // requester ID rewritten to the bridge's BDF by the bridge — the IOMMU
    // sees the bridge's requester ID, not the device's own BDF. Such devices
    // are called "DMA aliases" of the bridge. All devices sharing the same
    // alias must be in the same IOMMU group because the IOMMU cannot
    // distinguish their DMA transactions.
    //
    // `resolve_dma_aliases` walks the alias set (via the PCIe alias capability
    // and conventional PCI bridge topology) and returns the canonical anchor
    // device — the one whose requester ID the IOMMU actually sees.
    let anchor = resolve_dma_aliases(dev);
    if let Some(existing) = iommu.group_for_bdf(anchor.bdf()) {
        return existing;
    }

    // Step 2: ACS boundary walk.
    //
    // Walk upstream bridges from the device toward the root complex. At each
    // bridge, check whether all four REQ_ACS_FLAGS bits are set in the bridge's
    // ACS capability register. The first bridge that lacks full ACS is the
    // isolation failure point: it cannot prevent peer devices from issuing
    // DMA to each other without going through the IOMMU. Move the group
    // anchor up to that bridge — all devices below it must share one group.
    //
    // Stop walking when we reach a bridge that has all four ACS bits set;
    // that bridge IS the isolation boundary. Devices on opposite sides of a
    // fully ACS-capable bridge can have separate IOMMU groups.
    let anchor = walk_acs_boundary(anchor, REQ_ACS_FLAGS);
    if let Some(existing) = iommu.group_for_bdf(anchor.bdf()) {
        return existing;
    }

    // Step 3: Multifunction slot grouping.
    //
    // PCI multifunction devices (multiple functions on the same device number,
    // e.g., device 0, functions 0..7) can DMA-alias each other when ACS is
    // absent. If the anchor is a multifunction device without full ACS, all
    // sibling functions on the same slot must share an IOMMU group.
    if anchor.is_multifunction() && !anchor.has_acs(REQ_ACS_FLAGS) {
        if let Some(existing) = find_sibling_function_group(&anchor, iommu) {
            return existing;
        }
    }

    // Step 4: Allocate a new group.
    //
    // No existing group was found via aliases, ACS failures, or multifunction
    // sharing. This device is hardware-isolated from all others and gets its
    // own IOMMU group (the preferred configuration for Tier 1 isolation).
    IommuGroup::new(iommu)
}

IommuGroup struct (canonical definition):

/// Maximum PCIe devices in one IOMMU group. ACS-disabled PCIe switches can group
/// entire bus fabrics; 128 is a safe upper bound for realistic PCIe topologies.
/// Typical observed groups: 1-4 devices (ACS enabled), up to ~32 (ACS disabled
/// on a 32-device bus segment). 128 provides 4x headroom over the worst observed
/// case.
///
/// This is a **stack-allocation capacity hint** for `ArrayVec`, not a hardware
/// limit. The error path handles overflow gracefully: `iommu_group_add_device()`
/// returns `Err(KernelError::ResourceExhausted)` and the device is left
/// unmanaged (no DMA remapping). An FMA warning is emitted for the administrator.
///
/// **Upgrade path**: If this bound is reached on exotic PCIe topologies, increase
/// the constant and rebuild, or convert to `Box<[Arc<PciDevice>]>` with
/// runtime-discovered capacity. The `iommu.group_max` boot parameter can
/// override this value when the boot-param infrastructure is available.
pub const IOMMU_GROUP_MAX_DEVICES: usize = 128;

pub struct IommuGroup {
    /// Unique group ID assigned at creation. Never reused after group destruction.
    /// u64 per project policy for monotonically increasing kernel-internal identifiers.
    pub id:      u64,
    /// IOMMU hardware instance that manages this group.
    pub iommu:   Arc<IommuInstance>,
    /// PCIe devices sharing this IOMMU domain.
    /// Fixed capacity: avoids heap allocation during bus enumeration and device hotplug.
    pub devices: ArrayVec<Arc<PciDevice>, IOMMU_GROUP_MAX_DEVICES>,
    /// The currently active IOMMU domain. One domain covers the entire group —
    /// it is not possible to assign different domains to devices in the same group.
    pub domain:  RwLock<Option<Arc<IommuDomain>>>,
}
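
The overflow path documented on IOMMU_GROUP_MAX_DEVICES (ResourceExhausted, device left unmanaged) can be modeled in a few lines. GroupDevices and add_device here are userspace stand-ins for the ArrayVec field and iommu_group_add_device(), not the kernel API:

```rust
pub const IOMMU_GROUP_MAX_DEVICES: usize = 128;

#[derive(Debug, PartialEq)]
pub enum KernelError { ResourceExhausted }

/// Userspace model of the group's capacity bound: a Vec plus an explicit
/// length check standing in for `ArrayVec`. On overflow the device is left
/// unmanaged and the caller emits an FMA warning (warning path elided).
pub struct GroupDevices { bdfs: Vec<u32> }

impl GroupDevices {
    pub fn new() -> Self { GroupDevices { bdfs: Vec::new() } }

    pub fn add_device(&mut self, bdf: u32) -> Result<(), KernelError> {
        if self.bdfs.len() >= IOMMU_GROUP_MAX_DEVICES {
            // Graceful degradation: no panic, no heap fallback.
            return Err(KernelError::ResourceExhausted);
        }
        self.bdfs.push(bdf);
        Ok(())
    }
}
```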

IommuPageTable type — opaque, architecture-specific IOMMU page table root:

/// Opaque arch-specific IOMMU page table root pointer. The concrete layout
/// differs per IOMMU engine: VT-d root/context table entry on x86-64,
/// SMMU Stream Table / stage-1 PGD on AArch64, TCE table on POWER IODA,
/// RISC-V IOMMU DDT entry, etc. The IOMMU driver casts this to the
/// appropriate arch-specific type internally.
///
/// Note: the VFIO/iommufd subsystem ([Section 18.5](18-virtualization.md#vfio-and-iommufd-device-passthrough-framework))
/// uses the alias `IommuPgd` for this same concept. Both names refer to the
/// same underlying opaque pointer type.
pub type IommuPageTable = *mut u8;

/// Convert a physical address (HPA) to an `IommuPageTable` pointer.
///
/// This is a thin wrapper around `phys_to_virt()` that casts the resulting
/// kernel virtual address to the opaque `IommuPageTable` type. Used when
/// creating `IommuDomain` structs from IOMMU hardware registers that store
/// page table roots as physical addresses (e.g., VT-d context entry base,
/// SMMU Stream Table Entry S1CtxPtr).
///
/// The reverse path (`IommuPageTable` → `PhysAddr`) uses `virt_to_phys()`
/// and is needed during VM creation when the IOMMU domain's page table root
/// is programmed into SLAT/nested translation hardware.
///
/// # Safety
///
/// `hpa` must be a valid physical address of a page table root that has been
/// allocated via the IOMMU driver's page table allocator. The caller must
/// ensure the page table remains valid (not freed) for the lifetime of the
/// returned pointer.
#[inline(always)]
pub unsafe fn iommu_pt_from_phys(hpa: PhysAddr) -> IommuPageTable {
    // SAFETY: Caller guarantees hpa is a valid physical address of an
    // allocated IOMMU page table root.
    phys_to_virt(hpa) as *mut u8
}

IommuDomain struct (canonical definition):

/// A single IOMMU address space. Contains the hardware page table that translates
/// device-visible I/O virtual addresses (IOVAs) to physical addresses. One domain
/// covers an entire IOMMU group — all devices in the group share this translation.
pub struct IommuDomain {
    /// Unique domain ID, assigned by the IOMMU driver at allocation time.
    /// u64 per project policy. The hardware domain ID width varies (VT-d: 8-16
    /// bits, SMMU: 16 bits), but this is the software domain ID. Hardware IDs
    /// are derived via masking.
    pub domain_id: u64,
    /// Domain type: Kernel (driver DMA), VmPassthrough (KVM), or Identity (1:1).
    pub domain_type: IommuDomainType,

    /// Arch-specific IOMMU page table root. This is the hardware data structure
    /// programmed into the IOMMU engine (context entry for VT-d, Stream Table
    /// entry for SMMU, TCE table for POWER IODA, etc.).
    pub page_table: IommuPageTable,
    /// Active IOVA→physical mappings for **kernel-initiated DMA only** (Tier 1/2
    /// drivers using `umka_driver_dma_alloc`). Uses an RCU-protected XArray
    /// (integer-keyed by IOVA start address) for lock-free reader access on the
    /// DMA fault handler and diagnostic hot paths. Writers hold
    /// `mappings_writer_lock` and publish updates via RCU.
    ///
    /// **Concurrency model**:
    /// - **Readers** (DMA fault handler, IOTLB miss handler, diagnostics):
    ///   `rcu_read_lock()` + `mappings.load(iova_start)`. Lock-free, zero
    ///   contention. O(1) lookup via XArray radix tree.
    /// - **Writers** (DMA map/unmap, typically from process context or softirq):
    ///   Acquire `mappings_writer_lock` (Mutex), insert/remove entry in the
    ///   XArray, call `synchronize_rcu()` on unmap to ensure no reader holds
    ///   a stale reference. Map operations do not require `synchronize_rcu()`
    ///   — new entries are immediately visible and harmless to readers.
    ///
    /// **Why XArray instead of SpinLock<BTreeMap>**: IOMMU mapping lookups
    /// occur on DMA fault paths and diagnostic paths that may run concurrently
    /// with map/unmap operations from other CPUs. A SpinLock<BTreeMap> would
    /// serialize all readers behind writers, creating contention under heavy
    /// DMA mapping churn (e.g., 100K+ streaming DMA map/unmap per second on
    /// a busy NIC). The RCU-protected XArray provides O(1) lock-free reads
    /// with writer serialization only against other writers.
    ///
    /// **Ownership rule — VFIO/iommufd path**: When this domain is owned by an
    /// `IoAddrSpace` ([Section 18.5](18-virtualization.md#vfio-and-iommufd-device-passthrough-framework)), this
    /// `mappings` XArray is **not populated**. The `IoAddrSpace.mappings`
    /// BTreeMap is the authoritative software mapping record for all IOVA→HPA
    /// entries created via `IOMMU_IOAS_MAP`. The hardware page table
    /// (`page_table`) is the source of truth for DMA translation in both paths;
    /// the software map exists only for userspace API bookkeeping
    /// (range validation, unmap tracking). Having two software shadow maps for
    /// the same hardware table would be a consistency hazard.
    ///
    /// **Selection rule**: Kernel DMA domains (`IommuDomainType::Kernel`) use
    /// `IommuDomain.mappings`. VFIO/iommufd domains use `IoAddrSpace.mappings`.
    /// `IommuDomainType::VmPassthrough` domains use `IoAddrSpace.mappings` when
    /// going through the iommufd path, or are programmed directly by KVM when
    /// using the fast `VmPassthrough` path (see below).
    pub mappings: XArray<DmaMapping>,
    /// Writer-side lock for `mappings`. Held during DMA map/unmap operations
    /// to serialize mutations. Readers do not acquire this lock (RCU read-side).
    pub mappings_writer_lock: Mutex<()>,
    /// IOVA address space allocator. Bitmap-based for O(1) allocation of
    /// power-of-two-aligned regions.
    pub iova_allocator: IovaBitmapAllocator,
}
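
The map/unmap writer protocol can be sketched in userspace. In this sketch, Mutex<BTreeMap> stands in for the RCU-protected XArray plus mappings_writer_lock, and synchronize_rcu() is a no-op stub, so only the ordering rule is illustrated (unmap must wait out readers; map need not):

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

// No-op stub: in the kernel this waits for all RCU read-side critical
// sections that might still hold a reference to a removed entry.
fn synchronize_rcu() {}

pub struct DmaMapping { pub phys_addr: u64, pub size: u64 }

pub struct Domain {
    // Stands in for the XArray + writer lock pair of the real struct.
    mappings: Mutex<BTreeMap<u64, DmaMapping>>,
}

impl Domain {
    pub fn new() -> Self { Domain { mappings: Mutex::new(BTreeMap::new()) } }

    /// Map: insert and publish. New entries are immediately visible and
    /// harmless to readers, so no grace period is needed.
    pub fn dma_map(&self, iova: u64, phys_addr: u64, size: u64) {
        self.mappings.lock().unwrap().insert(iova, DmaMapping { phys_addr, size });
    }

    /// Unmap: remove the entry, then wait out readers before the backing
    /// page may be freed or reused — otherwise a fault-path reader could
    /// still hold a stale reference.
    pub fn dma_unmap(&self, iova: u64) -> Option<DmaMapping> {
        let removed = self.mappings.lock().unwrap().remove(&iova);
        if removed.is_some() {
            synchronize_rcu();
        }
        removed
    }
}
```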

/// Bitmap-based IOVA allocator for `IommuDomain`.
///
/// Manages a contiguous IOVA address range (typically 0..`iova_limit`) using a
/// hierarchical bitmap that supports O(1) allocation and O(1) deallocation of
/// power-of-two-aligned regions. This is analogous to a buddy allocator but
/// optimized for IOVA space (no physical page backing needed).
///
/// The allocator is initialized at domain creation time with the IOVA range
/// derived from the IOMMU hardware capabilities (`iova_base`..`iova_limit`).
///
/// **Thread safety**: All operations require the domain's `mappings_writer_lock`
/// to be held (same lock that serializes map/unmap). No internal locking.
pub struct IovaBitmapAllocator {
    /// Base of the allocatable IOVA range.
    pub iova_base: u64,
    /// End of the allocatable IOVA range (exclusive).
    pub iova_limit: u64,
    /// Bitmap pages for each order level. `levels[0]` tracks 4 KiB granules;
    /// `levels[k]` tracks 2^(12+k)-byte regions. Maximum order is 30
    /// (2^42-byte regions), supporting up to 4 TiB of IOVA space.
    ///
    /// # Safety
    /// Raw pointers to page-allocated bitmap memory. Owned exclusively by this
    /// allocator instance. Valid from `IovaBitmapAllocator::new()` until `drop()`.
    /// All access is serialized by the domain's `mappings_writer_lock` — no
    /// concurrent reads or writes to the bitmap pages can occur.
    /// The backing memory is freed in `Drop::drop()`.
    pub levels: ArrayVec<*mut u64, 31>,
    /// Total number of 4 KiB granules in the IOVA range.
    pub total_granules: u64,
    /// Number of currently allocated granules.
    pub allocated_granules: u64,
}

/// `Drop` impl: iterates `levels` and frees each bitmap page via `free_pages()`.
/// Leaking `IovaBitmapAllocator` without calling `drop()` leaks O(31) pages per domain.
impl Drop for IovaBitmapAllocator {
    fn drop(&mut self) {
        for &page_ptr in &self.levels {
            if !page_ptr.is_null() {
                // SAFETY: page_ptr was allocated by alloc_pages() in new().
                unsafe { free_pages(page_ptr as *mut u8, 0); }
            }
        }
    }
}

impl IovaBitmapAllocator {
    /// Allocate an IOVA region of `size` bytes (rounded up to page granularity).
    /// Returns the allocated IOVA base address, guaranteed to be naturally aligned
    /// to the allocation size (power-of-two rounding).
    ///
    /// Returns `None` if the IOVA space is exhausted.
    pub fn alloc(&mut self, size: u64) -> Option<u64> { /* bitmap search */ }

    /// Free a previously allocated IOVA region.
    /// `iova` must be a value returned by `alloc()`. `size` must match the
    /// original allocation size. Double-free is detected via bitmap state
    /// and triggers a kernel warning (FMA event, not panic).
    pub fn free(&mut self, iova: u64, size: u64) { /* bitmap clear */ }
}
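
The alignment contract of alloc() can be illustrated without the bitmap machinery. This toy bump allocator (hypothetical TinyIovaAlloc) demonstrates only the power-of-two round-up and natural-alignment guarantee, not the O(1) hierarchical bitmap search:

```rust
/// Toy IOVA allocator illustrating the alloc() contract: the requested size
/// is rounded up to the next power of two (minimum 4 KiB) and the returned
/// base is naturally aligned to that size. Bump allocation replaces the
/// real bitmap search; `free()` is omitted.
pub struct TinyIovaAlloc {
    pub next: u64,  // next unallocated IOVA
    pub limit: u64, // end of the IOVA range (exclusive)
}

impl TinyIovaAlloc {
    pub fn alloc(&mut self, size: u64) -> Option<u64> {
        let size = size.max(4096).next_power_of_two();
        // Round the cursor up to a multiple of `size` (natural alignment).
        let base = (self.next + size - 1) & !(size - 1);
        if base + size > self.limit {
            return None; // IOVA space exhausted
        }
        self.next = base + size;
        Some(base)
    }
}
```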

impl IommuDomain {
    /// Returns the translation mode for this domain. Used by `device_needs_swiotlb()`
    /// ([Section 4.14](04-memory.md#dma-subsystem)) to determine if SWIOTLB bounce buffering is required.
    pub fn translation_mode(&self) -> IommuTranslationMode {
        match self.domain_type {
            IommuDomainType::Kernel | IommuDomainType::UserspaceDma
                => IommuTranslationMode::Full,
            IommuDomainType::Identity { .. }
                => IommuTranslationMode::Identity,
            IommuDomainType::VmPassthrough { .. }
                => IommuTranslationMode::Full,
        }
    }
}

/// IOVA range key for the mappings XArray.
/// The XArray is keyed by `start` (u64). Range lookups (containment checks
/// for IOVA fault handling) use `XArray::range()` with `start..=last`.
pub struct IovaRange {
    pub start: u64,
    pub last: u64,  // inclusive end — matches Linux iova_domain convention
}

/// DMA/IOMMU access permission flags. **Canonical definition** — all
/// subsystems that need device-side access permissions (VFIO, IOMMUFD,
/// driver DMA setup, DMA subsystem) use this single type.
///
/// Maps to `DmaDirection` ([Section 4.14](04-memory.md#dma-subsystem)) as follows:
/// `ToDevice` → `READ`, `FromDevice` → `WRITE`, `Bidirectional` → `READ | WRITE`.
bitflags! {
    pub struct DmaProt: u32 {
        const READ    = 0x1;
        const WRITE   = 0x2;
        const NOEXEC  = 0x4;  // where supported by IOMMU hardware
    }
}

/// A single DMA mapping record.
pub struct DmaMapping {
    pub phys_addr: u64,
    pub size: u64,
    pub prot: DmaProt,
}

Interface:

  • map(iova: u64, phys: u64, size: u64, prot: DmaProt) -> Result<(), IommuError> — creates an IOMMU page table entry mapping the IOVA range to the physical range. Returns Err(IommuError::AlreadyMapped) if the IOVA range overlaps an existing mapping.
  • unmap(iova: u64, size: u64) -> Result<u64, IommuError> — removes the mapping and returns the original physical address. Returns Err(IommuError::NotMapped) if no mapping exists at the given IOVA.
  • flush_iotlb(iova: u64, size: u64) — invalidates IOMMU TLB entries for the given range. This is arch-specific hardware invalidation (see table below). Flush is always required after unmap — no lazy invalidation is permitted because the IOMMU domain is a security boundary (a stale TLB entry would allow a driver to DMA to memory it no longer owns).
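
A toy model of the three-call interface, with the hardware programming elided and a flushes counter standing in for the arch-specific invalidation command (ToyDomain is illustrative, not the kernel type):

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
pub enum IommuError { AlreadyMapped, NotMapped }

/// Toy model of map/unmap/flush_iotlb. `flushes` counts IOTLB invalidations
/// so the mandatory unmap-then-flush ordering is visible.
pub struct ToyDomain {
    table: BTreeMap<u64, (u64, u64)>, // iova -> (phys, size)
    pub flushes: u32,
}

impl ToyDomain {
    pub fn new() -> Self { ToyDomain { table: BTreeMap::new(), flushes: 0 } }

    pub fn map(&mut self, iova: u64, phys: u64, size: u64) -> Result<(), IommuError> {
        // Reject any overlap with an existing mapping.
        let overlaps = self
            .table
            .iter()
            .any(|(&s, &(_, len))| s < iova + size && iova < s + len);
        if overlaps { return Err(IommuError::AlreadyMapped); }
        self.table.insert(iova, (phys, size));
        Ok(())
    }

    pub fn unmap(&mut self, iova: u64, size: u64) -> Result<u64, IommuError> {
        let (phys, _) = self.table.remove(&iova).ok_or(IommuError::NotMapped)?;
        // Flush is unconditional: a stale IOTLB entry would let the device
        // DMA to memory it no longer owns.
        self.flush_iotlb(iova, size);
        Ok(phys)
    }

    pub fn flush_iotlb(&mut self, _iova: u64, _size: u64) {
        self.flushes += 1; // arch-specific invalidation command elided
    }
}
```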

Per-arch IOMMU programming:

| Arch | IOMMU Hardware | Page Table Format | Flush Mechanism |
|------|----------------|-------------------|-----------------|
| x86-64 | Intel VT-d | 4-level (compatible with CPU page tables) | IOTLB invalidate via invalidation queue |
| x86-64 | AMD-Vi | 4-level (AMD I/O page table) | INVALIDATE_IOMMU_PAGES command |
| AArch64 | ARM SMMU v3 | Stream Table + CD + stage-1/2 translation | TLBI via command queue |
| RISC-V | RISC-V IOMMU (ratified 2023) | Sv48-based, similar to CPU page tables | IOTINVAL.VMA command |
| PPC64LE | POWER IODA | TCE (Translation Control Entry) table | TCE invalidate via OPAL call |
| ARMv7 | ARM SMMU v1/v2 | 2-level page table | TLBI via SMMU registers |
| PPC32 | Platform-specific | Not standardized; Tier 2 only if IOMMU present | Platform-specific |
| s390x | z/Architecture (subchannel isolation) | CCW-based I/O protection per subchannel | SIGP-based; no IOTLB concept |
| LoongArch64 | Loongson IOMMU (7A1000/7A2000 bridge) | 4-level translation table | IOMMU invalidation register |

Firmware table parsing (determines IommuInstance → device scope during early boot, before pci_device_group is called per-device):

  • Intel VT-d (ACPI DMAR table): Each DRHD (DMA Remapping Hardware Definition) record describes one VT-d engine and its device scope entries (BDF ranges it manages). Devices not covered by any explicit DRHD scope fall under the catch-all DRHD with the INCLUDE_PCI_ALL flag. RMRR (Reserved Memory Region Reporting) records list physical address ranges that must be identity-mapped in every domain — typically BIOS-owned USB buffers and legacy VGA regions. UmkaOS programs RMRR regions as immutable identity entries in every new IOMMU domain before handing it to a driver.

  • AMD-Vi (ACPI IVRS table): IVHD (I/O Virtualization Hardware Definition) records list each AMD IOMMU and the BDF ranges it controls. UmkaOS builds a flat lookup table amd_iommu_dev_table[BDF] during IVRS parsing, giving O(1) device-to-IOMMU resolution at pci_device_group() call time. IVMD (I/O Virtualization Memory Definition) records specify unity-mapped regions (analogous to Intel RMRR).

  • ARM SMMU v3 (ACPI IORT table or Device Tree): Stream IDs (SIDs) are assigned by firmware and recorded in IORT iommu-map table entries or DT iommus / iommu-map properties. Each non-PCI device (platform device, ACPI device) gets its own IOMMU group unconditionally — non-PCI devices cannot alias each other. PCI devices behind an SMMU use pci_device_group() as above, with the SMMU providing the IommuInstance.

  • POWER IODA (DT-based on PowerNV, hcall-based under pseries/KVM): PPC64 uses IODA (I/O Device Architecture) with TCE (Translation Control Entry) tables for DMA translation. Under QEMU pseries with KVM, the hypervisor manages IOMMU via H_PUT_TCE / H_PUT_TCE_INDIRECT hcalls — the guest kernel submits TCE table entries and the hypervisor programs the hardware. Under PowerNV (bare-metal), OPAL firmware calls (opal_pci_map_pe_dma_window, opal_pci_tce_kill) manage IODA tables directly. IOMMU groups follow the PHB (PCI Host Bridge) PE (Partitionable Endpoint) model: each PE is an IOMMU group. The TCE table provides 4KB-granularity address translation.

  • Loongson IOMMU (on 7A1000/7A2000 bridge chip, DT-based): Loongson 3A5000+ with the 7A bridge chip provides an IOMMU for DMA address translation. DMA is cache-coherent on Loongson 3A5000+ when using the coherent DMA zone (no cache maintenance needed). The Loongson IOMMU driver is required for Tier 2 isolation on LoongArch platforms. Linux support has been progressively added. Phase 3+ item.

UmkaOS driver isolation requirement at driver_register():

A Tier 2 driver receives an IommuDomain that covers its entire IOMMU group. The registration sequence is:

  1. Parse the firmware table (DMAR/IVRS/IORT) to find which IommuInstance owns the device, then call pci_device_group() to determine the device's IommuGroup.
  2. If the group already has an active IommuDomain: attach this device to that domain (the entire group is now under the driver's control — the registry verifies that all other group members are either unbound or owned by the same driver process).
  3. If the group has no active domain: allocate a new IommuDomain, program the IOMMU hardware page tables (initially empty — no DMA permitted), then attach all devices in the group to the new domain.
  4. Grant the driver process DMA access via umka_driver_dma_alloc (Section 11.5); each allocation adds an IOVA→PA entry to the domain's page tables and the IOMMU issues an IOTLB invalidation.
  5. Any device in the group that issues a DMA transaction to an address outside its domain's IOVA space triggers an IOMMU fault → driver crash recovery path (Section 11.7).
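
Steps 2 and 3 of the sequence reduce to "reuse the group's active domain, or allocate a fresh empty one, then attach every group member." A sketch with hypothetical GroupState and attach_group standing in for registry internals:

```rust
/// Minimal model of registration steps 2-3. Domain IDs are plain u64s;
/// hardware page table programming and ownership checks are elided.
pub struct GroupState {
    pub domain: Option<u64>, // active IommuDomain id, if any
    pub devices: Vec<u32>,   // BDFs in this IOMMU group
}

pub fn attach_group(
    group: &mut GroupState,
    next_domain_id: &mut u64,
    attachments: &mut Vec<(u32, u64)>, // (bdf, domain_id) log
) -> u64 {
    let dom = match group.domain {
        // Step 2: group already has an active domain — join it.
        Some(d) => d,
        // Step 3: allocate a new, initially empty domain (no DMA permitted
        // until umka_driver_dma_alloc populates it).
        None => {
            let d = *next_domain_id;
            *next_domain_id += 1;
            group.domain = Some(d);
            d
        }
    };
    // Every device in the group attaches to the same domain.
    for &bdf in &group.devices {
        attachments.push((bdf, dom));
    }
    dom
}
```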

11.5.2 IOMMU Implementation Complexity

IOMMU management is one of the most complex subsystems in any OS kernel, and this complexity should not be understated. The following areas are known to be difficult and are called out explicitly as high-effort implementation items:

Nested/two-level translation (SR-IOV + VFIO) — when a VM uses VFIO passthrough with SR-IOV virtual functions, the IOMMU must perform two-level address translation: guest virtual → guest physical (first level, programmed by the guest's IOMMU driver) then guest physical → host physical (second level, programmed by the host). Intel VT-d calls this "scalable mode with first-level and second-level page tables"; AMD-Vi calls it "guest page tables with nested paging." The two-level walk doubles TLB pressure and introduces a multiplicative page table depth: each access in the 4-level guest walk itself requires a host walk, so a single translation miss can take up to 24 memory accesses. IOTLB sizing and invalidation granularity are critical performance levers.

Performance bottlenecks — known IOMMU performance traps:

  • Map/unmap storm: high-throughput I/O paths (NVMe at millions of IOPS, 100GbE line-rate) can generate millions of IOMMU map/unmap operations per second. Each map/unmap involves IOTLB invalidation. UmkaOS mitigates this with: (1) persistent DMA mappings for ring buffers (map once at driver init, never unmap), (2) batched invalidation (accumulate invalidations, flush once per batch), (3) per-CPU IOMMU invalidation queues to avoid contention.

  • IOTLB capacity: hardware IOTLB entries are scarce (~128-512 entries on typical Intel VT-d). Under heavy I/O with many DMA mappings, IOTLB misses add ~100-500ns per translation. Large pages (2MB, 1GB) in IOMMU page tables dramatically reduce IOTLB pressure — UmkaOS's DMA mapping interface prefers large-page-aligned allocations when possible.

  • Invalidation latency: IOTLB invalidation on Intel VT-d is not instantaneous. Drain-all invalidation can take ~1-10μs. Page-selective invalidation is faster but not supported on all hardware. UmkaOS checks hardware capability registers and uses the finest granularity available.
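
Batched invalidation can be sketched as an accumulate-then-flush queue (hypothetical InvalBatch; note this pattern is only legal for trusted-tier domains, since Tier 2 requires strict per-unmap invalidation per the security invariants in this section):

```rust
/// Accumulates IOVA ranges and issues one hardware invalidation per batch
/// instead of one per unmap. `hw_flushes` counts the commands that would
/// hit the invalidation queue; the actual hardware command is elided.
pub struct InvalBatch {
    pending: Vec<(u64, u64)>, // (iova, size)
    pub hw_flushes: u32,
}

impl InvalBatch {
    pub fn new() -> Self { InvalBatch { pending: Vec::new(), hw_flushes: 0 } }

    /// Record a range to invalidate; no hardware command yet.
    pub fn queue(&mut self, iova: u64, size: u64) {
        self.pending.push((iova, size));
    }

    /// One invalidation command covering the whole batch.
    pub fn flush(&mut self) {
        if self.pending.is_empty() { return; }
        self.hw_flushes += 1;
        self.pending.clear();
    }
}
```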

ACS (Access Control Services) — PCIe ACS is required for proper IOMMU group isolation. Without ACS on a PCIe switch, all devices behind that switch land in the same IOMMU group (defeating per-device isolation). Many consumer motherboards lack ACS on the root port or PCIe switch, causing all devices to share one IOMMU group. UmkaOS detects this at boot and logs a warning. The pcie_acs_override kernel parameter (Linux compatibility) allows overriding this for testing, but with an explicit security warning.

Errata — IOMMU hardware has errata. Intel VT-d errata include broken interrupt remapping on certain steppings, incorrect IOTLB invalidation scope, and non-compliant default domain behavior. UmkaOS's errata framework (Section 2.18) includes IOMMU errata alongside CPU errata — detected at boot, with workarounds applied automatically.

Intel QAT (QuickAssist) DTLB flush errata: Intel QAT devices (C62x, DH895xCC) have a device-side TLB (DTLB) that caches IOMMU translations independently of the IOMMU's IOTLB. Standard IOTLB invalidation does NOT flush the device-side DTLB — the QAT device may continue using stale translations after an IOMMU page table update. This causes DMA-to-freed-page corruption that is extremely difficult to diagnose (appears as random memory corruption, not an IOMMU fault). Workaround: after any IOMMU mapping change for a QAT device, the driver must issue an additional device-specific DTLB flush via the QAT configuration register (ADF_DEV_RESET_RING) or perform a full device quiesce→remap→restart sequence. Matched via DeviceMatch::Pci { vendor: 0x8086, device: 0x37C8.. } (IommuVendor::IntelVtd).

AMD-Vi DTE (Device Table Entry) update ordering: When modifying AMD-Vi device table entries (e.g., changing a device's IOMMU domain for tier reassignment), the update must follow a strict protocol to prevent stale DTE reads by the IOMMU hardware:

  1. Write the new DTE to the device table in memory.
  2. Issue the INVALIDATE_DEVTAB_ENTRY command via the AMD-Vi command buffer.
  3. Wait for command completion (poll the AMD-Vi completion semaphore).
  4. Issue INVALIDATE_IOMMU_PAGES for the affected device's domain.
  5. Wait for invalidation completion.

Steps 2-3 must complete before step 4 — if the IOTLB invalidation executes while the IOMMU hardware still has the old DTE cached, the invalidation targets the wrong domain and the stale translations persist. This ordering bug was the root cause of several VFIO passthrough corruption reports in Linux (fixed in amd_iommu_update_and_flush_device_table()). UmkaOS enforces this ordering in the IommuDomain::attach_device() path with explicit command buffer serialization.
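
The five-step protocol can be expressed as an ordered command trace. The stub functions below only record the sequence and are not a real AMD-Vi command buffer API:

```rust
/// Records the DTE update protocol as an ordered trace. In a real driver
/// each push would be a command-buffer submission or a completion poll;
/// here the point is only the required ordering.
pub fn update_and_flush_dte(trace: &mut Vec<&'static str>) {
    trace.push("write_dte");               // 1. new DTE in the device table
    trace.push("invalidate_devtab_entry"); // 2. INVALIDATE_DEVTAB_ENTRY
    trace.push("wait_devtab_completion");  // 3. completion semaphore
    // Only now is it safe to invalidate the IOTLB: the hardware is
    // guaranteed to have dropped its cached copy of the old DTE.
    trace.push("invalidate_iommu_pages");  // 4. domain IOTLB invalidation
    trace.push("wait_invalidation");       // 5. invalidation completion
}
```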

Intel VT-d Write Buffer Flush (RWBF) quirk: Older VT-d implementations (pre-Haswell era) lack DMA write coherency between the CPU and IOMMU hardware — the IOMMU may cache a stale copy of a page table entry after the CPU writes a new one. These IOMMUs report ecap.C = 0 (no page-walk coherency) in the DMAR ECAP register. On non-coherent VT-d units, the kernel must issue a Write Buffer Flush (IommuCmd::WRITE_BUFFER_FLUSH) after every page table modification and before issuing an IOTLB invalidation. Without this flush, the IOMMU may walk a stale page table and either fault on a valid mapping or (worse) use an old mapping to route DMA to a freed page. Modern VT-d (ecap.C = 1) guarantees page-walk coherency and does not require the flush.

Caching Mode (virtualized IOMMU): When running under a hypervisor that virtualizes VT-d (QEMU, Hyper-V, Xen), the virtual IOMMU may set ecap.CM = 1 (Caching Mode). In caching mode, the IOMMU hardware does NOT perform implicit IOTLB invalidation when page table entries are modified — ALL invalidations must be explicit. This is more restrictive than bare-metal behavior where the IOMMU may proactively drop stale cache entries. UmkaOS detects ecap.CM = 1 at IOMMU init and forces strict invalidation mode for ALL domains (not just Tier 2), because missed invalidations under caching mode cause silent mapping corruption rather than performance-only issues.

AMD-Vi ATS Write Permission Bypass (Kaveri/Steamroller): On AMD Family 15h Models 30h-3Fh (Kaveri, Godavari), AMD-Vi ATS (Address Translation Services) does not correctly check write permissions on translated requests. A device using ATS can issue DMA writes to pages mapped read-only in the IOMMU page tables. Workaround: set the L2 debug register bit 0 (AtsIgnoreIWDis) via the AMD-Vi MMIO configuration space during IOMMU initialization on affected steppings. Without this fix, Tier 2 device isolation is ineffective for write protection — a compromised device could overwrite kernel pages despite correct IOMMU page table permissions.

ARM SMMU-500 prefetch errata: ARM MMU-500 IP block (used in Qualcomm SDM845, Marvell Armada, and other SoCs) has a prefetcher that can speculatively fill the TLB with stale translations during concurrent page table updates. Workaround: disable the prefetcher by setting ARM_MMU500_ACTLR_CPRE (bit 1) in the SMMU ACTLR register for each context bank. This reduces SMMU performance but prevents stale-TLB-induced DMA corruption. Detected via DeviceMatch::DeviceTree { "arm,mmu-500" }.

ARM MMU-600 dual-stage TLB stale entry: ARM MMU-600 IP block (used in server platforms with nested/dual-stage translation for KVM) can retain stale Stage-1 TLB entries when Stage-2 translations are invalidated. This affects KVM guests using IOMMU passthrough (VFIO) with nested page tables. Workaround: after Stage-2 invalidation, issue an additional Stage-1 invalidation (CMD_TLBI_EL2_ALL) to force a full TLB refresh. Detected via DeviceMatch::DeviceTree { "arm,mmu-600" }.

Kdump/kexec IOMMU state preservation: During warm reboot (kdump after panic, or kexec into a new kernel), the IOMMU hardware retains its pre-existing translation state. If the new kernel does not properly tear down or fence the old IOMMU configuration before reprogramming, in-flight DMA from devices still using the old mappings can corrupt the new kernel's memory. UmkaOS handles this in two ways:

  • Normal kexec: Before jumping to the new kernel, the old kernel disables DMA bus-mastering on all devices, waits for outstanding DMA completion, then disables IOMMU translation. This is clean but requires a cooperative shutdown.

  • Panic kdump: The crashdump kernel cannot trust any prior state. It re-initializes the IOMMU from scratch, fences all DMA by programming identity mappings for only the crashdump memory regions, and relies on the IOMMU's global invalidation to flush all prior state. Devices that are still DMA-active are fenced by the identity map (their DMA targets the old physical addresses, which are now either identity-mapped to safe regions or faulted by the IOMMU).

Tier 2 security invariants — The following are hard requirements for Tier 2 driver assignment. Violating any one allows a compromised Tier 2 driver to escalate to kernel compromise:

  1. Interrupt remapping mandatory. UmkaOS refuses Tier 2 assignment for any device whose IOMMU does not support interrupt remapping (IommuCaps::INTR_REMAP). Without interrupt remapping, a Tier 2 driver can inject arbitrary MSI interrupts (specifying arbitrary vector + destination CPU), achieving kernel code execution. If interrupt remapping is unavailable (old hardware, BIOS disabled), the device can only be assigned as Tier 0 or Tier 1.

  2. Strict IOTLB invalidation mandatory. All Tier 2 IOMMU domains use strict (synchronous) IOTLB invalidation. Lazy/batched invalidation is a performance optimization that creates a window where stale IOMMU mappings remain active after dma_unmap — acceptable for Tier 0/Tier 1 (trusted code), but a security hole for Tier 2 (untrusted code could exploit the stale-mapping window for use-after-free DMA). Lazy invalidation is therefore permitted only for Tier 0 and Tier 1 (trusted) drivers.

  3. SWIOTLB bounce buffering for sub-page protection. Tier 2 DMA mappings that do not span full IOMMU pages (4KB or 2MB) must use SWIOTLB bounce buffering. Without this, a Tier 2 driver's DMA mapping exposes all data within the IOMMU page (Thunderclap-class sub-page data exfiltration). The bounce buffer ensures only the driver's intended buffer is DMA-accessible.

  4. IOMMU hardware state verification at boot. The kernel independently verifies that the IOMMU is actually enabled and translating by performing a test transaction after IOMMU initialization — it does not trust firmware's claim of IOMMU enablement. A firmware bug that leaves the IOMMU in passthrough mode while reporting it as active would silently defeat all Tier 2 isolation (CVE-2025-11901 class). The verification sequence: (a) program a known-bad IOVA mapping, (b) trigger a test DMA read via a safe device, (c) confirm the IOMMU fault is raised. If no fault occurs, the kernel logs a critical warning and refuses Tier 2 assignment for all devices.
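The verification logic in invariant 4 can be sketched as follows. This is a minimal model, not the actual UmkaOS implementation: the `IommuVerify` trait and its method names are hypothetical stand-ins for the per-architecture IOMMU driver interface, and `FakeIommu` stands in for real hardware.

```rust
/// Hypothetical interface to the platform IOMMU driver — the trait and
/// method names here are illustrative, not the actual UmkaOS KABI.
trait IommuVerify {
    /// Program a deliberately unmapped/known-bad IOVA in a scratch domain.
    fn program_known_bad_iova(&mut self) -> u64;
    /// Trigger a test DMA read via a safe device; returns true if the
    /// IOMMU raised a translation fault for the access.
    fn test_dma_read_faults(&self, iova: u64) -> bool;
}

/// Boot-time check: trust the observed fault, not firmware's claim that
/// the IOMMU is enabled. No fault means passthrough — refuse Tier 2.
fn verify_iommu_translating(iommu: &mut dyn IommuVerify) -> Result<(), &'static str> {
    let bad_iova = iommu.program_known_bad_iova(); // step (a)
    if iommu.test_dma_read_faults(bad_iova) {      // steps (b) + (c)
        Ok(())
    } else {
        Err("IOMMU not translating: refusing Tier 2 assignment for all devices")
    }
}

/// Mock used to exercise the check (stands in for real hardware).
struct FakeIommu {
    translating: bool,
}

impl IommuVerify for FakeIommu {
    fn program_known_bad_iova(&mut self) -> u64 { 0xdead_f000 }
    fn test_dma_read_faults(&self, _iova: u64) -> bool { self.translating }
}
```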

11.5.3 Per-Device DMA Identity Mapping (Opt-In Escape Hatch)

UmkaOS's default IOMMU policy is translated DMA for all devices — every DMA transaction passes through IOMMU page tables. This is non-negotiable for the driver isolation model: crash recovery, DMA fencing, and containment all depend on the kernel's ability to revoke DMA access by reprogramming IOMMU entries.

However, certain scenarios require identity-mapped DMA (device DMA addresses = physical addresses, IOMMU programmed as 1:1 pass-through for that device's domain).

dma_mask constraint on identity mapping: Even in identity mode, the kernel MUST respect the device's dma_mask. The 1:1 mapping covers only physical addresses in the range [0, min(max_phys_addr, device.dma_mask)]. Physical addresses above the device's dma_mask are NOT identity-mapped — the device cannot address them. If a DMA operation targets memory above dma_mask on an identity-mapped device, the DMA subsystem falls back to SWIOTLB bounce buffering (same as the non-IOMMU path). This prevents silent failures on 32-bit DMA devices in identity mode on systems with DRAM above 4 GB.
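The window-clamping and SWIOTLB-fallback rules above reduce to a few lines of arithmetic. The sketch below assumes the helper names (`identity_window_end`, `needs_swiotlb_bounce`) — they are illustrative, not actual UmkaOS functions:

```rust
/// Upper bound (inclusive) of the 1:1 identity window for a device:
/// physical addresses above the device's dma_mask are never identity-mapped.
fn identity_window_end(max_phys_addr: u64, dma_mask: u64) -> u64 {
    max_phys_addr.min(dma_mask)
}

/// Whether a DMA targeting [paddr, paddr + len) on an identity-mapped
/// device must fall back to SWIOTLB bounce buffering because it lies
/// (partly) above the device's dma_mask.
fn needs_swiotlb_bounce(paddr: u64, len: u64, dma_mask: u64) -> bool {
    debug_assert!(len > 0);
    // Last byte of the transfer; address overflow is certainly out of range.
    match paddr.checked_add(len - 1) {
        Some(last) => last > dma_mask,
        None => true,
    }
}
```

For a 32-bit DMA device (`dma_mask = 0xFFFF_FFFF`) on a machine with DRAM above 4 GB, a buffer at 5 GiB bounces while a buffer at 1 GiB does not.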

Scenarios requiring identity-mapped DMA:

  • Latency-critical bare-metal I/O. High-frequency trading NICs, ultra-low-latency NVMe, and RDMA HCAs where the ~100-500ns IOTLB miss penalty on unmapped addresses is unacceptable. Persistent DMA mappings (Section 11.4) mitigate this for ring buffers, but scatter-gather DMA with dynamic buffer addresses still pays the IOTLB miss cost.
  • Broken IOMMU interactions. Devices with firmware or silicon bugs that produce incorrect DMA addresses under translation (e.g., devices that hardcode physical addresses in firmware descriptors, or devices that ignore bus addresses returned by the OS).
  • Debug and development. Tracing raw DMA transactions with hardware analyzers is simpler when bus addresses equal physical addresses.

/// Per-device DMA translation policy. Set via admin sysfs or boot parameter.
/// Default is Translated for all devices.
#[repr(u32)]
pub enum DeviceDmaPolicy {
    /// All DMA goes through IOMMU page tables (default). Full isolation.
    Translated = 0,

    /// IOMMU programmed with 1:1 identity mapping for this device's domain.
    /// Device DMA addresses equal physical addresses. IOMMU is still active
    /// (interrupt remapping, fault reporting) but provides no DMA containment.
    Identity = 1,
}

Constraints and trade-offs:

| Property | Translated (default) | Identity |
|---|---|---|
| DMA containment | Full — device can only reach explicitly mapped regions | None — device can DMA to any physical address in its identity window |
| Crash recovery | IOMMU entries revoked → in-flight DMA faults | Identity mapping cannot be selectively revoked without full device reset |
| Driver tier | Any tier | Tier 1 only (kernel-space drivers with CAP_DMA_IDENTITY) |
| IOTLB miss cost | ~100–500 ns per miss | Zero (1:1 mapping fits in a single large-page IOTLB entry) |
| Interrupt remapping | Active | Active (identity mapping does not affect interrupt remapping) |
| IOMMU group rule | Per-device | Entire IOMMU group must use Identity if any member does |

Identity mapping scope: The kernel programs a 1:1 IOMMU mapping covering the physical address range [0, min(max_phys_addr, device.dma_mask)], per the dma_mask constraint above. Additional safeguards:

  3. **Audit logging.** Every Identity grant is logged to the audit subsystem ([Section 20.2](20-observability.md#stable-tracepoint-abi--audit-subsystem)) with the device BDF, requesting process, and admin credential.
  4. **IOMMU group enforcement.** If device A is set to Identity and shares an IOMMU group with device B, device B is also switched to Identity (since devices in the same IOMMU group can peer-to-peer DMA without IOMMU translation). The kernel logs a warning identifying all affected devices.
  5. **No crash recovery guarantee.** The kernel marks devices in Identity mode with a `NO_DMA_FENCE` flag. On driver crash, the kernel performs a Function Level Reset (FLR) or secondary bus reset instead of relying on IOMMU revocation — this is slower (10–100 ms vs. microseconds) but is the only safe option without DMA fencing.

Implementation: The IommuDomainType enum (defined above) already includes all four domain types needed for DMA fencing: Kernel, Identity, VmPassthrough, and UserspaceDma. The fencing behavior described above applies to all domain types.

Global identity mode on weak-isolation architectures:

On most architectures, per-device identity mapping (above) is the correct granularity: even if one device needs passthrough, the rest should remain IOMMU-translated. However, on architectures where Tier 1 CPU-side isolation is already absent or equivalent to Tier 0, the IOMMU is the only remaining isolation boundary — and if the admin has already accepted that Tier 1 drivers share the kernel address space without hardware memory protection, the IOMMU overhead protects only against rogue DMA from device firmware, not from the driver code itself.

For these cases, UmkaOS provides a global identity mode restricted to platforms where CPU-side Tier 1 isolation is weak or absent:

/// System-wide DMA translation policy. Boot parameter only —
/// cannot be changed at runtime.
///
/// Boot parameter: umka.dma_default_policy={translated,identity}
/// Default: translated (always)
#[repr(u32)]
pub enum SystemDmaPolicy {
    /// All devices use IOMMU translation (default on all architectures).
    Translated = 0,

    /// All Tier 1 devices default to identity-mapped DMA. Tier 2
    /// (userspace) devices always remain Translated regardless of this
    /// setting. Individual devices can still be overridden to Translated
    /// via sysfs. Requires umka.isolation=performance or equivalent
    /// weak-isolation architecture.
    IdentityDefault = 1,
}

Preconditions for umka.dma_default_policy=identity:

The kernel refuses this boot parameter unless at least one of the following holds:

  1. umka.isolation=performance is also set (the admin has explicitly opted out of Tier 1 CPU-side isolation — all drivers run as Tier 0 or Tier 2).
  2. The architecture has no fast isolation mechanism and Tier 1 uses page-table switching with overhead equivalent to Tier 2 (currently: PPC64LE POWER8, AArch64 mainstream with I/O-heavy workloads). On RISC-V, Tier 1 is unavailable and drivers already run as Tier 0 or Tier 2, so this condition does not apply.

If neither condition is met, the kernel prints a boot warning and ignores the parameter:

umka: dma_default_policy=identity rejected: CPU-side Tier 1 isolation is active.
      Use umka.isolation=performance or per-device umka.dma_identity=<BDF> instead.

What global identity mode does NOT affect:

  • Tier 2 (userspace) drivers — always IOMMU-translated, regardless of policy. A compromised userspace process with identity-mapped DMA would be a full kernel compromise.
  • VM passthrough (VmPassthrough) — VM IOMMU domains are unaffected; the hypervisor's second-level page tables remain in control.
  • Interrupt remapping — remains active on all devices. Identity mode disables DMA address translation only, not interrupt remapping.
  • Per-device overrides — individual devices can be set to Translated via sysfs even when the global default is Identity. This allows an admin to protect specific devices (e.g., an untrusted USB controller) while running most devices in identity mode.

Rationale: On RISC-V 64, where Tier 1 isolation is not available and Tier 0 drivers share the kernel address space in Ring 0 with full memory access, the IOMMU is protecting against a strictly weaker threat (device firmware DMA) than the one already accepted (driver code CPU access). Paying ~100-500ns per IOTLB miss on every DMA operation to defend against device firmware — while the driver itself has unrestricted access to all of kernel memory — is a questionable trade-off for performance-sensitive workloads. The same logic applies when isolation=performance explicitly promotes all drivers to Tier 0 on any architecture.

Why this is not Linux's iommu.passthrough=1: Linux's global passthrough exists for legacy compatibility — many Linux drivers assume physical addresses equal bus addresses, and passthrough preserved that assumption. UmkaOS's global identity mode exists for a different reason: to avoid paying IOMMU overhead on platforms where the security benefit is already negated by the absence of CPU-side isolation. The precondition check ensures it cannot be enabled on platforms where IOMMU translation is the critical isolation boundary (x86-64 with MPK, AArch64 with POE, etc.).


11.5.4 IOMMU Fault Routing for VM-Assigned Devices

When a device assigned to a VM via VFIO (Section 18.5) generates an IOMMU fault (DMA to an address not mapped in the VM's IOAS), the kernel must route the fault notification to the VMM (typically QEMU) so it can take corrective action (inject a machine check into the guest, reset the device, or terminate the VM). Without explicit fault routing, IOMMU faults on passthrough devices are handled only by the kernel's FMA subsystem — the VMM is unaware that its guest's device has faulted.

11.5.4.1 Fault Detection

IOMMU hardware reports faults through architecture-specific mechanisms:

| Architecture | Fault Reporting Mechanism | Interrupt |
|---|---|---|
| x86-64 (VT-d) | Fault Recording Register + Fault Event MSI | DMAR fault IRQ |
| AArch64 (SMMU) | Event Queue (EVTQ) + MSI | SMMU event IRQ |
| RISC-V (IOMMU) | Fault/Event Queue + MSI | IOMMU fault IRQ |
| PPC64LE (TCE) | TCE Error Register | Platform-specific |

The per-architecture IOMMU driver's fault interrupt handler reads the faulted device's BDF (bus/device/function), the faulted IOVA, the fault type (read/write/no-translation), and constructs an IommuFaultRecord:

/// IOMMU fault record — architecture-neutral representation of an IOMMU
/// translation fault. Produced by the per-arch IOMMU fault interrupt handler.
pub struct IommuFaultRecord {
    /// PCI BDF (bus/device/function) of the device that caused the fault.
    pub bus_dev_fn: u32,
    /// IOVA that was attempted and faulted.
    pub faulted_iova: u64,
    /// Fault type.
    pub fault_type: IommuFaultType,
    /// Direction of the faulted access.
    pub direction: IommuFaultDirection,
    /// PASID (Process Address Space ID) if the fault is PASID-tagged.
    /// 0 if no PASID context.
    pub pasid: u32,
}

#[repr(u8)]
pub enum IommuFaultType {
    /// No translation found for the IOVA (unmapped address).
    NoTranslation = 0,
    /// Translation found but permissions deny the access (read-only page, no-execute).
    PermissionDenied = 1,
    /// Device attempted DMA outside its assigned IOMMU domain.
    DomainViolation = 2,
}

/// Direction of the faulted DMA access. Richer than a simple bool to capture
/// execute and atomic faults reported by modern IOMMUs (AMD IOMMU v2,
/// Intel Scalable IOT, ARM SMMU v3.2+).
#[repr(u8)]
pub enum IommuFaultDirection {
    /// Device read from host memory.
    Read = 0,
    /// Device write to host memory.
    Write = 1,
    /// Instruction fetch (GPU shader fetch, NIC inline crypto instruction load).
    Execute = 2,
    /// Atomic operation (PCIe AtomicOps — fetch-add, compare-swap, etc.).
    AtomicOp = 3,
}

11.5.4.2 Fault Routing Path

IOMMU hardware fault
Per-arch IOMMU fault IRQ handler (NMI-safe, runs in IRQ context)
    │ Constructs IommuFaultRecord
iommu_handle_fault(record: &IommuFaultRecord)
    ├─── Device is NOT VM-assigned (kernel driver domain):
    │    │ Standard FMA path:
    │    │ fma_emit(FaultEvent::Generic { ... })
    │    │ If Tier 1 driver: trigger crash recovery
    │    │ ([Section 11.9](#crash-recovery-and-state-preservation))
    │    └─── Return
    └─── Device IS VM-assigned (VFIO passthrough):
         │ VM fault routing path:
         ├─ 1. Look up the VfioDevice by BDF in the VFIO device registry.
         ├─ 2. Signal the VMM via the device's fault eventfd:
         │     vfio_dev.fault_eventfd.signal(1)
         │     The VMM has registered this eventfd via
         │     VFIO_DEVICE_FEATURE(VFIO_DEVICE_FEATURE_DMA_FAULT).
         ├─ 3. Queue the IommuFaultRecord in the device's fault ring:
         │     vfio_dev.fault_ring.push(record)
         │     The VMM reads queued faults via read() on the VFIO device fd
         │     (or via VFIO_DEVICE_GET_FAULT ioctl).
         ├─ 4. Emit FMA event (parallel to VMM notification):
         │     fma_emit(FaultEvent::Generic {
         │         device_id: vfio_dev.device_node_id(),
         │         event_code: FMA_IOMMU_FAULT_VM,
         │         payload: [vm_id, faulted_iova, fault_type, ...],
         │     })
         └─ 5. If fault_type is DomainViolation or repeated NoTranslation
              (3+ faults within 100 ms from same device):
              Escalate — disable bus mastering on the device and signal
              the VMM's error eventfd (KVM_IRQFD error path).
              The VMM is expected to inject a machine check or terminate
              the guest.
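The step-5 escalation threshold (3+ faults within 100 ms from the same device) can be tracked with a simple sliding-window counter. A minimal sketch — `FaultRateTracker` and its method names are hypothetical, not the actual UmkaOS implementation:

```rust
use std::collections::VecDeque;

/// Tracks recent IOMMU faults for one device and decides when to escalate:
/// 3 or more faults within a 100 ms window. Illustrative sketch only.
struct FaultRateTracker {
    window_ns: u64,
    threshold: usize,
    recent: VecDeque<u64>, // timestamps (ns) of faults inside the window
}

impl FaultRateTracker {
    fn new() -> Self {
        Self { window_ns: 100_000_000, threshold: 3, recent: VecDeque::new() }
    }

    /// Record a fault observed at `now_ns`; returns true when the caller
    /// should escalate (disable bus mastering, signal the error eventfd).
    fn record_fault(&mut self, now_ns: u64) -> bool {
        self.recent.push_back(now_ns);
        // Expire faults that fell out of the 100 ms window.
        while let Some(&t) = self.recent.front() {
            if now_ns - t > self.window_ns {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        self.recent.len() >= self.threshold
    }
}
```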

11.5.4.3 VMM-Side Fault Eventfd Registration

The VMM registers a fault notification eventfd during VFIO device setup. This is a per-device eventfd (separate from the MSI-X irqbypass eventfds used for normal interrupt injection).

/// VFIO device feature for DMA fault notification.
/// Set via VFIO_DEVICE_FEATURE ioctl with feature=VFIO_DEVICE_FEATURE_DMA_FAULT.
pub struct VfioDmaFaultFeature {
    /// Eventfd file descriptor. The kernel signals this fd (writes 1) whenever
    /// an IOMMU fault is recorded for this device. The VMM polls or epoll-waits
    /// on this fd.
    pub fault_eventfd: RawFd,
}

/// Per-device IOMMU fault ring, readable by the VMM.
/// Fixed-size ring buffer (64 entries) in kernel memory, mapped read-only
/// into the VMM's address space via mmap on the VFIO device fd at a
/// dedicated region offset.
pub struct VfioFaultRing {
    /// Ring entries. Each entry is a serialized IommuFaultRecord (32 bytes).
    pub entries: [IommuFaultRecord; 64],
    /// Producer index (kernel writes). AtomicU32 for lock-free producer.
    pub head: AtomicU32,
    /// Consumer index (VMM reads). AtomicU32 for lock-free consumer.
    pub tail: AtomicU32,
}
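The head/tail fields of `VfioFaultRing` are free-running counters, so the kernel producer and VMM consumer never need to reset them; the slot is the counter modulo the ring size. A sketch of the index arithmetic, under the assumption (not stated above) that the counters are free-running rather than pre-wrapped:

```rust
/// Fault ring capacity, matching the 64-entry `VfioFaultRing`.
const RING_SIZE: u32 = 64;

/// Ring is full when the producer is a full ring ahead of the consumer.
/// wrapping_sub keeps this correct across u32 counter wraparound.
fn ring_is_full(head: u32, tail: u32) -> bool {
    head.wrapping_sub(tail) >= RING_SIZE
}

/// Ring is empty when producer and consumer indices coincide.
fn ring_is_empty(head: u32, tail: u32) -> bool {
    head == tail
}

/// Slot in the `entries` array for a given free-running index.
fn ring_slot(index: u32) -> usize {
    (index % RING_SIZE) as usize
}
```

The kernel writes `entries[ring_slot(head)]`, then advances `head`; the VMM reads `entries[ring_slot(tail)]` while the ring is non-empty, then advances `tail`. If the ring is full, the oldest unread fault is overwritten or the new fault is dropped (a policy choice not specified here).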

Integration with KVM: When the kernel escalates a fault (step 5 above), it signals the VM's KVM error eventfd. KVM translates this into a guest machine check exception (MCE on x86, SError on AArch64). The guest OS then handles the fault through its own error recovery path — on a well-behaved guest, this triggers the guest's PCIe AER handler, which may reset the device or offline it.

No-IOMMU passthrough: In the iommu_off=dangerous development mode, there is no IOMMU fault detection. DMA faults manifest as silent data corruption or host crashes. This is explicitly documented as outside the security envelope.


11.5.5 PCIe ASPM (Active State Power Management)

Active State Power Management (ASPM) is a PCIe link-level power optimization defined in the PCIe Base Specification §5.4. When a PCIe link has no pending transactions, both the upstream port and downstream device can agree to enter a lower-power link state (L0s or L1), turning off link transmitters. Without ASPM, idle PCIe links consume 1–4 W of unnecessary link-training and signal-conditioning power.

UmkaOS manages ASPM as part of the PCIe port driver, integrated with the device registry (Section 11.4). ASPM state is negotiated per-link at device enumeration time and coordinated with per-device runtime PM (Section 7.2.12).


11.5.5.1 ASPM Link States

| State | Exit latency | Power savings | Mechanism |
|---|---|---|---|
| L0 | None | None | Normal operating state |
| L0s | < 64 ns | ~100–300 mW/link | Tx powered down; fast restart |
| L1 | < 70 µs | ~300–600 mW/link | Both Tx and Rx powered down |
| L1.1 | < 70 µs | Higher than L1 | L1 + CLKREQ# deassertion |
| L1.2 | < 32 ms (typ) | Maximum | L1.1 + reference clock off |

L1.2 provides the largest power savings but requires reference clock re-lock time on resume. It is appropriate for devices with latency-tolerant access patterns (USB host controllers, audio controllers, GPIO expanders). It is inappropriate for NVMe SSDs on latency-sensitive I/O paths.


11.5.5.2 Types

/// Power state management for one PCIe link (upstream port ↔ downstream device).
pub struct PcieAspmLink {
    /// Root port or PCIe switch downstream port (the link's upstream end).
    pub upstream_port:       Arc<PciDevice>,
    /// PCIe endpoint or switch upstream port (the link's downstream end).
    pub downstream_dev:      Arc<PciDevice>,
    /// ASPM states supported by both endpoints (intersection of LNKCAP bits [11:10]).
    pub supported:           AspmStates,
    /// Currently enabled ASPM states.
    pub enabled:             AspmStates,
    /// L0s exit latency in nanoseconds (from upstream port LNKCAP register).
    pub l0s_exit_latency_ns: u32,
    /// L1 exit latency in microseconds (from upstream port LNKCAP register).
    pub l1_exit_latency_us:  u32,
}

bitflags::bitflags! {
    /// ASPM link states. Corresponds to LNKCTL register bits [1:0] and ASPM L1 SS.
    pub struct AspmStates: u8 {
        /// L0s: fast-exit power saving (Tx powered down).
        const L0S  = 0x01;
        /// L1: full link power saving (both Tx and Rx powered down).
        const L1   = 0x02;
        /// L1.1: L1 + CLKREQ# deassertion (saves reference clock power).
        const L1_1 = 0x04;
        /// L1.2: deepest link state (reference clock off; max savings).
        const L1_2 = 0x08;
    }
}

/// System-wide ASPM policy. Configurable via `umka.pcie_aspm=` kernel parameter.
#[repr(u8)]
pub enum AspmPolicy {
    /// Disable all ASPM. Use for benchmarking or when ASPM causes hardware issues.
    Disabled    = 0,
    /// Enable only L0s (fast exit, minimal latency impact).
    L0sOnly     = 1,
    /// Enable only L1 (slower exit, higher power savings).
    L1Only      = 2,
    /// Enable all supported states. Recommended for battery-powered systems.
    AllStates   = 3,
    /// Default: enable ASPM on low-bandwidth links; disable on high-throughput links.
    /// Balances power savings with I/O latency for mixed workloads.
    Performance = 4,
}

11.5.5.3 ASPM Enablement Algorithm

ASPM is configured at device enumeration time, after BAR assignment and before driver probe:

Procedure AspmConfigure(upstream_port, downstream_dev):

1. Read LNKCAP register (offset 0x0C in PCI Express Capability) from both endpoints.
   upstream_aspm_caps   = (upstream_port.lnkcap >> 10) & 0x3   // bits [11:10]
   downstream_aspm_caps = (downstream_dev.lnkcap >> 10) & 0x3  // bits [11:10]
   link_aspm = upstream_aspm_caps & downstream_aspm_caps  // intersection

2. Check BIOS ASPM lock bit (ACPI FADT IAPC_BOOT_ARCH, offset 0x6D bit 4):
   If locked AND (umka.pcie_aspm_override != "1"):
     Skip ASPM configuration for this link. BIOS retains control.
     Log KERN_DEBUG "pcie %s: ASPM controlled by BIOS"

3. Apply policy from `umka.pcie_aspm=disabled|l0s|l1|all|performance` (default: performance):
   policy = read_kernel_param("umka.pcie_aspm", default: AspmPolicy::Performance)

4. Compute desired ASPM states based on policy and link_aspm:
   - Disabled:     desired = AspmStates::empty()
   - L0sOnly:      desired = link_aspm & AspmStates::L0S
   - L1Only:       desired = link_aspm & (AspmStates::L1 | AspmStates::L1_1 | AspmStates::L1_2)
   - AllStates:    desired = link_aspm
   - Performance:  desired = AspmPolicy::performance_heuristic(link_aspm, downstream_dev)

5. Performance heuristic:
   Disable ASPM on this link if downstream device matches any of:
   a. Link speed >= PCIe Gen4 AND link width >= x4 (high-bandwidth: NVMe, 25GbE+)
   b. Device class is Mass Storage Controller (class_code 0x01xxxx) and
      link speed >= Gen3 (latency-sensitive NVMe/AHCI)
   c. Device is registered with required_max_latency_ns < L1 exit latency

   If none match: desired = link_aspm (enable all supported states).

6. L0s latency check:
   For each driver bound to downstream_dev:
     if driver.required_max_latency_ns < link.l0s_exit_latency_ns:
       desired &= !AspmStates::L0S  // Disable L0s for this link
       Log KERN_INFO "pcie %s: L0s disabled (driver needs < %u ns, link exit = %u ns)"

7. Program the upstream port LNKCTL register (PCIe config space, offset 0x10):
   ASPM is ALWAYS controlled from the upstream port, not the downstream device.
   lnkctl = read_pci_config_word(upstream_port.bdf, PCI_EXP_LNKCTL)
   lnkctl = (lnkctl & !ASPM_CTL_MASK) | desired.bits()
   write_pci_config_word(upstream_port.bdf, PCI_EXP_LNKCTL, lnkctl)
   link.enabled = desired

8. Log:
   KERN_INFO "pcie %04x:%02x:%02x.%d: ASPM enabled=%s (supported=%s, policy=%s)"
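Steps 4 and 6 of the procedure are pure bit arithmetic and can be sketched directly. Plain `u8` constants stand in for the `AspmStates` bitflags so the sketch is dependency-free; the `Performance` heuristic (step 5) is omitted because it needs device metadata:

```rust
// ASPM state bits, mirroring the AspmStates bitflags above.
const L0S: u8 = 0x01;
const L1: u8 = 0x02;
const L1_1: u8 = 0x04;
const L1_2: u8 = 0x08;

#[derive(Clone, Copy)]
enum Policy {
    Disabled,
    L0sOnly,
    L1Only,
    AllStates,
}

/// Step 4: mask the link-supported states (`link_aspm`, the LNKCAP
/// intersection from step 1) by the system-wide policy.
fn desired_states(policy: Policy, link_aspm: u8) -> u8 {
    match policy {
        Policy::Disabled => 0,
        Policy::L0sOnly => link_aspm & L0S,
        Policy::L1Only => link_aspm & (L1 | L1_1 | L1_2),
        Policy::AllStates => link_aspm,
    }
}

/// Step 6: drop L0s when a bound driver's latency budget is tighter than
/// the link's L0s exit latency.
fn apply_l0s_latency_check(desired: u8, driver_max_latency_ns: u32, l0s_exit_ns: u32) -> u8 {
    if driver_max_latency_ns < l0s_exit_ns {
        desired & !L0S
    } else {
        desired
    }
}
```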

ACPI _DSM interaction (for platforms with PCIe port firmware hooks):

On ACPI platforms: before step 7, check for a _DSM method on the upstream port's
ACPI node with PCIe port _DSM UUID (E5C937D0-3553-4D7A-9117-EA4D19C3434D).
If function 3 (ASPM supported): execute _DSM to let firmware validate ASPM.
If _DSM returns a non-zero value indicating an error: disable ASPM for this link.
If no _DSM: proceed with step 7 using LNKCAP2-derived capabilities only.


11.5.5.4 Linux External ABI

/sys/bus/pci/devices/<bdf>/link/l0s_aspm
  Values: "enabled" | "disabled" | "not supported"
  Read-write: write "enabled"/"disabled" to change L0s ASPM state.

/sys/bus/pci/devices/<bdf>/link/l1_aspm
  Values: "enabled" | "disabled" | "not supported"
  Read-write: includes L1, L1.1, and L1.2 collectively.

/sys/bus/pci/devices/<bdf>/link/aspm_ctrl
  Bitmap of enabled AspmStates (hex). Read-write.
  Allows fine-grained control of L1 sub-states.

Kernel parameter: umka.pcie_aspm=disabled|l0s|l1|all|performance
  Default: performance. Applied to all links unless overridden per-device via sysfs.

umka.pcie_aspm_override=1
  When set: kernel ignores BIOS ASPM lock bit and programs ASPM regardless.
  Use only when BIOS incorrectly locks ASPM on hardware that supports it.

11.5.6 Tier 2 Streaming DMA Syscalls

Tier 2 drivers run in userspace (Ring 3) and cannot call kernel DMA functions directly. UmkaOS provides three syscalls for Tier 2 drivers to perform streaming DMA operations through the kernel's DMA subsystem. These syscalls are the Tier 2 equivalent of DmaDevice::dma_map_single, DmaDevice::dma_unmap_single, and DmaDevice::dma_sync_for_cpu / dma_sync_for_device.

All three syscalls require CAP_DMA_MAP in the caller's capability set (granted to Tier 2 driver processes at device assignment time). The device is identified by its DeviceHandle (obtained via the VFIO or iommufd device assignment protocol).

/// Map a userspace buffer for streaming DMA by a Tier 2 device.
///
/// The kernel pins the userspace pages, creates an IOMMU mapping in the
/// device's domain, and returns a device-visible DMA address. For sub-page
/// mappings (offset + size does not cover a full page), the kernel uses
/// SWIOTLB bounce buffering to prevent the device from accessing adjacent
/// data within the same page ([Section 4.14](04-memory.md#dma-subsystem--swiotlb-software-iommu-bounce-buffering)).
///
/// # Arguments
///
/// * `dev_handle` — Device handle from VFIO/iommufd assignment.
/// * `user_addr` — Userspace virtual address of the buffer to map.
/// * `size` — Size of the mapping in bytes.
/// * `direction` — DMA transfer direction (ToDevice, FromDevice, Bidirectional).
///
/// # Returns
///
/// On success: `DmaMapResult { dma_addr, mapping_id }`.
/// `dma_addr` is the device-visible address to program into device registers.
/// `mapping_id` is an opaque handle for subsequent sync/unmap calls.
///
/// On error: errno (EFAULT if user_addr invalid, ENOMEM if IOMMU IOVA space
/// exhausted, EPERM if caller lacks CAP_DMA_MAP, EINVAL if direction is invalid).
///
/// # Syscall number
///
/// `__NR_umka_dma_map_streaming` (UmkaOS-native negative syscall number).
pub fn sys_umka_dma_map_streaming(
    dev_handle: DeviceHandle,
    user_addr: usize,
    size: usize,
    direction: DmaDirection,
) -> Result<DmaMapResult, Errno>;

/// Unmap a previously created streaming DMA mapping.
///
/// Unpins userspace pages, tears down the IOMMU mapping, and (for
/// FromDevice/Bidirectional mappings) copies bounce-buffered data back
/// to the userspace buffer if SWIOTLB was used.
///
/// After this call, the `dma_addr` returned by the corresponding
/// `umka_dma_map_streaming` is invalid — the device MUST NOT access it.
///
/// # Arguments
///
/// * `dev_handle` — Device handle.
/// * `mapping_id` — Opaque handle from `umka_dma_map_streaming`.
///
/// # Syscall number
///
/// `__NR_umka_dma_unmap_streaming` (UmkaOS-native negative syscall number).
pub fn sys_umka_dma_unmap_streaming(
    dev_handle: DeviceHandle,
    mapping_id: DmaMappingId,
) -> Result<(), Errno>;

/// Synchronize a streaming DMA mapping for CPU or device access.
///
/// Must be called before the CPU reads from a FromDevice mapping
/// (`SyncForCpu`) or before the device reads after a CPU write
/// (`SyncForDevice`). On cache-coherent architectures these are
/// no-ops; on non-coherent architectures they perform the appropriate
/// cache maintenance operations.
///
/// # Arguments
///
/// * `dev_handle` — Device handle.
/// * `mapping_id` — Opaque handle from `umka_dma_map_streaming`.
/// * `sync_type` — `SyncForCpu` or `SyncForDevice`.
///
/// # Syscall number
///
/// `__NR_umka_dma_sync` (UmkaOS-native negative syscall number).
pub fn sys_umka_dma_sync(
    dev_handle: DeviceHandle,
    mapping_id: DmaMappingId,
    sync_type: DmaSyncType,
) -> Result<(), Errno>;

/// Opaque identifier for a streaming DMA mapping. Returned by
/// `umka_dma_map_streaming`, consumed by `umka_dma_unmap_streaming`
/// and `umka_dma_sync`. The kernel uses this to look up the mapping
/// in the per-device mapping table (XArray keyed by mapping_id).
#[repr(transparent)]
pub struct DmaMappingId(pub u64);

/// Result of a successful streaming DMA map.
#[repr(C)]
pub struct DmaMapResult {
    /// Device-visible DMA address for hardware register programming.
    pub dma_addr: DmaAddr,
    /// Opaque mapping identifier for sync/unmap.
    pub mapping_id: DmaMappingId,
}
// DmaMapResult: dma_addr(DmaAddr=u64=8) + mapping_id(DmaMappingId=u64=8) = 16 bytes.
const_assert!(size_of::<DmaMapResult>() == 16);

/// Synchronization direction for `umka_dma_sync`.
#[repr(u32)]
pub enum DmaSyncType {
    /// Synchronize for CPU access (cache invalidate on non-coherent archs).
    SyncForCpu    = 0,
    /// Synchronize for device access (cache writeback on non-coherent archs).
    SyncForDevice = 1,
}

Per-device mapping table: Each Tier 2 device maintains an XArray of active streaming DMA mappings, keyed by DmaMappingId. The mapping table is allocated in kernel memory (outside the Tier 2 driver's address space) and is used to validate unmap/sync syscall arguments — a Tier 2 driver cannot forge a DmaMappingId to access another device's DMA mappings. The XArray entry stores the IOVA, physical address, size, direction, and SWIOTLB slot (if bounce-buffered).
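The validation property — a Tier 2 driver cannot forge a `DmaMappingId` — follows from the table living in kernel memory and every sync/unmap doing a lookup. A toy model, with a `HashMap` standing in for the kernel XArray and simplified entry fields:

```rust
use std::collections::HashMap;

/// Simplified stand-in for an XArray entry (the real entry also records
/// direction and SWIOTLB slot).
struct MappingEntry {
    iova: u64,
    size: usize,
}

/// Per-device mapping table sketch. Lives in kernel memory, so a Tier 2
/// driver can only present ids; it cannot fabricate entries.
struct DeviceMappings {
    next_id: u64,
    table: HashMap<u64, MappingEntry>,
}

impl DeviceMappings {
    fn new() -> Self {
        Self { next_id: 1, table: HashMap::new() }
    }

    /// Called from the map syscall: allocate an opaque mapping id.
    fn insert(&mut self, iova: u64, size: usize) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.table.insert(id, MappingEntry { iova, size });
        id
    }

    /// Called from the unmap syscall: an unknown (or already-removed,
    /// or forged) id fails with an EINVAL-style error.
    fn remove(&mut self, id: u64) -> Result<MappingEntry, &'static str> {
        self.table.remove(&id).ok_or("EINVAL: unknown mapping_id")
    }
}
```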

Cross-references: - DMA subsystem core types: Section 4.14 - SWIOTLB sub-page bounce buffering: Section 4.14 - Tier 2 device assignment: Section 18.5


11.6 Device Services and Boot Integration

Summary: This section specifies cross-driver service discovery and mediation, KABI integration (new methods appended to KernelServicesVTable), crash recovery integration with the device registry, boot sequence integration (Tier 0 retroactive registration, console handoff, PCI enumeration, ACPI/DT enumerators, firmware quirks), sysfs compatibility, and firmware management. See Section 11.4 for the registry data model and Section 11.5 for IOMMU group management.

11.6.1 Service Discovery

11.6.1.1 The Problem

Drivers sometimes need services from other drivers — not through direct communication, but through mediated access. Examples:

  • NIC needs a PHY driver (MII bus)
  • GPU display pipeline needs I2C controller for DDC/EDID
  • RAID controller needs to discover member disks
  • Filesystem driver needs its underlying block device

In Linux, each of these has a subsystem-specific mechanism (phylib, i2c_adapter, md_personality, etc.) with its own registration/lookup API. In IOKit, it is done through IOService matching. UmkaOS unifies service discovery through the registry.

11.6.1.2 Service Publication

A driver can publish a named service on its device node:

Driver A (e.g., PHY driver):
  1. Completes init, device node is Active
  2. Calls registry_publish_service("phy", &phy_vtable)
  3. Registry records: node A provides service "phy" with given vtable

The phy_vtable is a service-specific C-ABI vtable (same flat, versioned approach as all other KABI vtables). The registry stores a reference to it.

11.6.1.3 Service Lookup

A driver can look up a named service:

Driver B (e.g., NIC driver):
  1. Needs PHY service
  2. Calls registry_lookup_service("phy", scope=ParentSubtree)
  3. Registry searches for a node in scope that publishes "phy"
  4. Registry validates Driver B has PEER_DRIVER_IPC capability
  5. Registry creates a provider-client link (B consumes A's "phy")
  6. Registry returns a wrapped service vtable and a ServiceHandle

Lookup scope options:

#[repr(u32)]
pub enum ServiceLookupScope {
    Siblings       = 0,    // Same parent only
    ParentSubtree  = 1,    // Parent and all its descendants
    Global         = 2,    // Entire registry (expensive, rare)
    Specific       = 3,    // A specific node (by DeviceHandle)
}
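The publish/lookup flow above can be modeled in a few lines. This is a toy sketch only — scope filtering, capability checks, and vtable wrapping are omitted, and the `Registry` type and its fields are illustrative, not the UmkaOS registry implementation:

```rust
use std::collections::HashMap;

/// Toy model of registry service publication and lookup. Node ids stand
/// in for device nodes; real lookups also filter by ServiceLookupScope
/// and validate PEER_DRIVER_IPC before linking.
struct Registry {
    services: HashMap<&'static str, u64>, // service name -> provider node id
    links: Vec<(u64, u64)>,               // (client node, provider node)
}

impl Registry {
    fn new() -> Self {
        Self { services: HashMap::new(), links: Vec::new() }
    }

    /// Step 2 of publication: record that `provider` offers `name`.
    fn publish(&mut self, name: &'static str, provider: u64) {
        self.services.insert(name, provider);
    }

    /// Lookup records a provider-client link so the registry can later
    /// revoke it and enforce PM ordering (clients suspend before providers).
    fn lookup(&mut self, name: &str, client: u64) -> Option<u64> {
        let provider = *self.services.get(name)?;
        self.links.push((client, provider));
        Some(provider)
    }
}
```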

11.6.1.4 Mediated Access

The registry mediates all cross-driver service access. This is critical:

  1. The registry validates capabilities before returning a service handle.
  2. The returned vtable is wrapped by the registry — calls go through a trampoline that:
     • validates the service handle is still valid,
     • performs the isolation domain switch if provider and client are in different Tier 1 domains,
     • handles the user-kernel transition if one side is Tier 2.
  3. The registry can revoke a service link at any time (e.g., when the provider crashes).
  4. The registry tracks all active links for PM ordering (clients must suspend before providers).
  5. Drivers never hold direct pointers to each other's memory.

11.6.1.5 Service Recovery

When a provider driver crashes and is reloaded:

  1. The registry invalidates all service handles pointing to the crashed provider.
  2. Client drivers that call the service vtable receive -ENODEV from the trampoline.
  3. After the provider is reloaded and republishes its service, client drivers receive a service_recovered callback (optional, new addition to DriverEntry):
// Appended to DriverEntry (optional)
/// # Safety
///
/// - `ctx` must be the same opaque pointer passed at driver init.
/// - `service_name` must point to a valid UTF-8 byte sequence of exactly
///   `service_name_len` bytes, accessible for the duration of the call.
/// - The callback runs in process context (not IRQ/NMI).  The driver may
///   re-acquire the service handle, re-validate cached state, or return
///   an error to decline recovery.
pub service_recovered: Option<unsafe extern "C" fn(
    ctx: *mut c_void,
    service_name: *const u8,
    service_name_len: u32,
) -> InitResultCode>,

The client driver can then re-acquire the service handle and resume operations.
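A minimal client-side implementation of the optional callback might look as follows. The DriverCtx struct and its phy_needs_reacquire flag are hypothetical; the sketch only shows the recommended pattern of recording staleness in process context and deferring the actual re-open to the driver's normal work path:

```rust
use core::ffi::c_void;

// Illustrative result code: 0 = success (the document's InitResultCode is
// assumed here to follow the usual 0 / negative-errno convention).
type InitResultCode = i32;

/// Per-driver context — hypothetical fields for this sketch.
struct DriverCtx {
    phy_needs_reacquire: bool,
}

/// Client-side `service_recovered` callback. Runs in process context; it only
/// records that the cached handle is stale — the re-open happens later.
unsafe extern "C" fn service_recovered(
    ctx: *mut c_void,
    service_name: *const u8,
    service_name_len: u32,
) -> InitResultCode {
    let name = unsafe { core::slice::from_raw_parts(service_name, service_name_len as usize) };
    let ctx = unsafe { &mut *(ctx as *mut DriverCtx) };
    if name == b"phy" {
        // The old ServiceHandle is permanently stale (generation mismatch);
        // mark it so the next I/O path re-opens the service.
        ctx.phy_needs_reacquire = true;
    }
    0 // accept recovery
}

fn main() {
    let mut ctx = DriverCtx { phy_needs_reacquire: false };
    let name = b"phy";
    let rc = unsafe {
        service_recovered(&mut ctx as *mut _ as *mut c_void, name.as_ptr(), name.len() as u32)
    };
    assert_eq!(rc, 0);
    assert!(ctx.phy_needs_reacquire);
}
```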

11.6.1.6 Service Handle Liveness Protocol

After a Tier 1 driver crashes, any ServiceHandle held by Tier 2 or user processes points to a stale vtable. Calling through a stale vtable is a use-after-free (UAF) vulnerability. UmkaOS prevents this via generation counters:

/// Kernel-internal service reference. NOT exposed at KABI boundary.
/// The KABI-stable token is `ServiceHandle` (a newtype over `u64`).
/// Mapping: `ServiceHandle::id` → kernel looks up `InternalServiceRef` via service registry.
///
/// Contains a generation counter that is checked on every dispatch
/// to detect stale handles pointing to crashed providers.
pub struct InternalServiceRef {
    /// Provider descriptor pointer (points into umka-core memory, not driver memory).
    provider: *const ProviderDescriptor,
    /// Generation of the provider at handle creation time.
    /// Must match provider.state_generation on dispatch or the call fails.
    generation: u64,
    /// Rights granted to the holder of this handle.
    rights: Rights,
}

/// Per-provider state generation counter. Incremented when:
/// 1. The provider crashes and is reloaded.
/// 2. The provider explicitly invalidates all handles (e.g., after a
///    security-relevant config change).
/// Stored in umka-core memory (not in the driver's memory domain) so it
/// remains valid even after the driver domain is destroyed.
pub struct ProviderDescriptor {
    /// Monotonically increasing. Odd = active; even = inactive/crashed.
    /// Initial value: 1 (odd = active) when the provider first registers.
    /// Updated atomically by umka-core on crash detection.
    pub state_generation: AtomicU64,
    // ... vtable pointer and other registry fields follow
}

Dispatch check (in the trampoline layer, before every cross-domain call):

fn trampoline_dispatch(handle: &InternalServiceRef, request: &Request) -> Result<Response, Error> {
    // Check liveness: read the provider's current generation.
    // Ordering::Acquire: ensures we see any writes made by the crash handler
    // that incremented state_generation.
    let current_gen = unsafe {
        (*handle.provider).state_generation.load(Ordering::Acquire)
    };
    if current_gen != handle.generation {
        return Err(Error::ProviderDead);
    }
    // Generation matched: safe to call through vtable.
    // (Note: generation can still change between the check and the call.
    // The domain fault handler catches this and returns ProviderDead to
    // the caller via the normal crash-recovery path.)
    dispatch_to_tier1(handle, request)
}

Handle invalidation on crash: When a Tier 1 driver panics:

  1. The domain fault handler (already specified in Section 11.9) catches the fault.
  2. It atomically increments provider.state_generation via CAS (compare_exchange(old, old+1, AcqRel, Acquire)). If the CAS fails (concurrent crash handling — should not happen with a single-threaded fault handler), it retries. The CAS ensures the transition is exactly +1, not a larger skip that could confuse ABA detection.
  3. All subsequent dispatch attempts to this provider return Err(ProviderDead).
  4. After successful driver reload (init() returns Ok), the kernel atomically increments provider.state_generation from even (inactive) to odd (active) via compare_exchange(even, even+1, Release, Relaxed). This explicit activation step makes the provider available for new ServiceHandle creation. Handles created before the crash remain permanently stale (their generation will never match the new odd value).
  5. Callers that receive Err(ProviderDead) must re-open the service to get a new ServiceHandle with the current generation (the kernel creates a new InternalServiceRef with the updated generation and maps it to a fresh ServiceHandle::id).

Crash during reload: If the driver crashes during init() (before returning Ok), state_generation is still at the even value from the prior crash (step 4 gates the even-to-odd transition on init() success), and the fault handler must ensure the state remains even (inactive). It therefore uses a parity-aware CAS that sets the next even value: let new = (old | 1) + 1; — this rounds up to the next even number regardless of the current parity, maintaining the odd=active/even=inactive invariant even under unexpected state. In practice, since the state is already even (init never completed), the CAS advances from one even value to the next (e.g., 2 -> 4), providing a distinct generation that invalidates any stale handles from a concurrent open attempt. The crash recovery system attempts reload up to max_crash_count (3) times before giving up and leaving the provider permanently inactive.
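The generation parity rules can be exercised in isolation. This sketch uses plain std atomics (it is not the kernel's fault handler) to show the three transitions: crash (+1 to even), crash-during-reload (round up to the next even), and activation (even to odd):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Odd = active, even = inactive/crashed. Initial value 1 (active).
struct ProviderGen(AtomicU64);

impl ProviderGen {
    fn new() -> Self { ProviderGen(AtomicU64::new(1)) }

    /// Crash handler path: exactly +1 (odd -> even = inactive).
    fn mark_crashed(&self) {
        let mut old = self.0.load(Ordering::Acquire);
        loop {
            match self.0.compare_exchange(old, old + 1, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => return,
                Err(cur) => old = cur, // retry on concurrent update
            }
        }
    }

    /// Reload activation: even -> odd, only after init() succeeded.
    fn mark_active(&self) -> bool {
        let old = self.0.load(Ordering::Acquire);
        if old % 2 != 0 { return false; } // already active
        self.0.compare_exchange(old, old + 1, Ordering::Release, Ordering::Relaxed).is_ok()
    }

    /// Crash during init(): round up to the NEXT even value, whatever the parity.
    fn mark_crashed_during_reload(&self) {
        let mut old = self.0.load(Ordering::Acquire);
        loop {
            let new = (old | 1) + 1; // next even number strictly greater than old
            match self.0.compare_exchange(old, new, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => return,
                Err(cur) => old = cur,
            }
        }
    }

    fn get(&self) -> u64 { self.0.load(Ordering::Acquire) }
}

fn main() {
    let g = ProviderGen::new();
    assert_eq!(g.get(), 1); // active; handles created now carry generation 1
    g.mark_crashed();
    assert_eq!(g.get(), 2); // inactive; generation-1 handles are now stale
    g.mark_crashed_during_reload();
    assert_eq!(g.get(), 4); // still even: 2 -> 4, distinct generation
    assert!(g.mark_active());
    assert_eq!(g.get(), 5); // odd again: new handles use generation 5
}
```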

Invariant: ProviderDescriptor is always allocated in umka-core memory, never in the driver's isolation domain. This ensures the descriptor (including state_generation) remains accessible and uncorrupted after the driver domain is torn down during crash recovery.

Design intent: InternalServiceRef cannot be "refreshed" — a crashed provider's internal reference cannot be upgraded to point at the new instance. This is intentional: the crash may indicate a security event, and forcing callers to explicitly re-open (obtaining a new ServiceHandle) ensures they notice the crash and can apply any required policy (e.g., re-authenticate, validate new driver version). The generation counter is the minimal mechanism; it adds one Acquire load (~3-5 cycles, L1-resident) per cross-domain call.

11.6.1.7 Service Vtable Trampoline Mechanism

The trampoline referenced in Section 11.6 is a thin wrapper function generated per vtable method that mediates isolation domain transitions. Every cross-domain vtable call passes through this trampoline — there is no direct function pointer invocation across isolation boundaries.

Trampoline steps (for a Tier 1 → kernel call):

  1. Validate liveness — check state_generation (described in the Service Handle Liveness Protocol above). If stale, return Err(ProviderDead) without touching any hardware state.
  2. Switch to the callee's isolation domain — call arch::current::isolation::switch_domain() which executes the arch-specific domain switch instruction:

     | Arch | Instruction | Mechanism |
     |------|-------------|-----------|
     | x86-64 | WRPKRU | MPK protection key switch (with shadow elision) |
     | AArch64 | MSR POR_EL0 | POE permission overlay register |
     | AArch64 fallback | Page table + ASID switch | For hardware without ARMv8.9-A POE |
     | ARMv7 | MCR p15, 0, Rd, c3, c0, 0 (DACR) | Domain Access Control Register |
     | PPC32 | mtsr | Segment register switch |
     | PPC64LE | mtspr PIDR | Radix PID switch (POWER9+) |
     | RISC-V | Page table switch | No fast isolation mechanism available |

  3. Call the actual vtable function pointer — the callee executes in its own domain.
  4. Switch back to the caller's isolation domain — a second switch_domain() call restores the original isolation domain.

Tier 1 → Tier 1 calls (cross-driver service access) require a double domain switch: domain-A → kernel (Core domain) → domain-B. The first switch enters the Core domain to validate the service handle, and the second switch enters the target driver's domain. The return path reverses: domain-B → Core → domain-A.

Implementation: the trampoline is a generic Rust function pointer wrapper, not inline assembly. It calls arch::current::isolation::switch_domain(target_domain_id) which is an #[inline(always)] function containing the arch-specific instruction (on x86-64 this goes through the PKRU shadow elision path described in Section 11.2.2). Per-arch cycle costs are documented in the isolation mechanism table in Section 11.2.

/// Generic trampoline for cross-domain vtable method calls.
/// `F` is the vtable function pointer type; `Args` are the method arguments.
///
/// This function is instantiated per vtable method by the kabi-gen code generator.
/// It is never called directly by driver code — the generated caller stubs
/// invoke it transparently.
///
/// **Panic handling**: Tier 1 drivers MUST be compiled with `panic = abort`.
/// If `vtable_fn` panics, the abort triggers the crash recovery NMI mechanism,
/// which restores the domain register (WRPKRU(0) on x86-64, equivalent on
/// other architectures) as part of the forced context save. Unwinding is NOT
/// supported across isolation domain boundaries. RAII cleanup (DMA unmapping,
/// ring buffer release) is handled by the crash recovery procedure, not by
/// drop impls.
fn trampoline_call<F, Args, Ret>(
    handle: &InternalServiceRef,
    callee_domain: DomainId,
    vtable_fn: F,
    args: Args,
) -> Result<Ret, Error>
where
    F: FnOnce(Args) -> Ret,
{
    // Step 1: liveness check (generation counter).
    let current_gen = unsafe {
        (*handle.provider).state_generation.load(Ordering::Acquire)
    };
    if current_gen != handle.generation {
        return Err(Error::ProviderDead);
    }

    // Step 2: TOCTOU mitigation — check per-CPU domain_valid flag BEFORE
    // switching domains. This ordering is critical on AArch64 mainstream
    // (page-table + ASID isolation): switch_domain writes TTBR0_EL1, which
    // references a physical page table. If domain revocation freed or
    // invalidated that page table, the TTBR write would point to freed
    // memory, causing a kernel panic from Core code (the trampoline).
    // On x86-64 MPK, switch_domain writes PKRU (no memory reference), so
    // the order is less critical — but this ordering is safe on ALL
    // architectures and avoids an unnecessary domain switch on the
    // already-crashed fast path.
    //
    // The crash handler clears domain_valid on the faulting CPU before
    // NMI delivery; remote CPUs' domain_valid is cleared by the NMI
    // handler (up to 10us delay). During this window, remote CPUs are
    // protected by hardware domain revocation — any switch_domain into
    // the revoked domain will fault at the first callee dereference.
    // The domain_valid check is an optimization to avoid the hardware
    // fault path, not a safety gate.
    if !per_cpu::domain_valid.load(Ordering::Acquire) {
        return Err(Error::ProviderDead);
    }

    // Step 3: switch to callee's isolation domain.
    let saved_domain = arch::current::isolation::switch_domain(callee_domain);

    // Step 4: call the actual vtable method.
    let result = vtable_fn(args);

    // Step 5: restore caller's isolation domain.
    arch::current::isolation::switch_domain(saved_domain);

    Ok(result)
}

// **TOCTOU window analysis**: Between Step 1 (generation check) and Step 2
// (domain_valid check), the provider can crash on another CPU. The crash handler:
//   1. Clears `per_cpu::domain_valid` on the faulting CPU (~1 cycle)
//   2. Increments `state_generation` (makes future Step 1 checks fail)
//   3. Revokes domain hardware permissions (WRPKRU/POR_EL0/DACR deny-all)
//   4. Sends NMI to all other CPUs (10us worst-case delivery)
//   5. NMI handler clears `domain_valid` on each remote CPU
//      and restores the domain register to Core (PKEY 0 / POR default)
// The domain_valid check (Step 2) is performed BEFORE switch_domain (Step 3)
// to avoid a race on AArch64 mainstream (page-table + ASID isolation) where
// switch_domain writes TTBR0_EL1 referencing a physical page table that may
// have been freed by domain revocation. On x86-64 MPK, switch_domain writes
// PKRU (no memory dereference), so the race is benign — but checking first
// is safe and correct on ALL architectures.
// During the ~10us NMI delivery window, remote CPUs that pass the
// domain_valid check are protected by hardware domain revocation: the
// domain's hardware permissions are already set to deny-all, so any callee
// access after switch_domain will fault. The domain_valid flag is an
// optimization to avoid the hardware fault path, not a safety gate.
// Step 2 catches crashes that occur after the generation check passes but
// before the domain switch. The remaining window (domain_valid check to
// first callee dereference, ~5 cycles + switch cost) is bounded by NMI
// delivery — the NMI handler restores the domain register, causing any
// callee access to fault.

11.6.1.8 Registry Event Notifications

Beyond driver-to-driver service recovery, kernel subsystems need to react to device lifecycle events. The registry provides an internal notification mechanism (not exposed through KABI — this is kernel-to-kernel only).

/// Registry event types that kernel subsystems can subscribe to.
#[repr(u32)]
pub enum RegistryEvent {
    /// A new device node was created (after bus enumeration).
    DeviceDiscovered  = 0,
    /// A device transitioned to Active (driver bound and initialized).
    DeviceActive      = 1,
    /// A device is being removed (before teardown begins).
    DeviceRemoving    = 2,
    /// A device's driver crashed and recovery is starting.
    DeviceRecovering  = 3,
    /// A device's power state changed.
    PowerStateChanged = 4,
    /// IOMMU group assignment changed (passthrough ↔ kernel domain).
    IommuGroupChanged = 5,
    /// A service was published or unpublished.
    ServiceChanged    = 6,
    /// Device driver has completed recovery — re-initialized successfully.
    /// Subscribers (VFS, networking, accelerator scheduler) can rebind
    /// to the recovered device. Emitted after the post-swap watchdog
    /// confirms stable operation for 5 seconds.
    DeviceRecovered   = 7,
}

/// Callback type for registry event notifications.
pub type RegistryNotifyFn = fn(
    event: RegistryEvent,
    node_id: DeviceNodeId,
    context: *mut c_void,
);

Subscribers:

| Kernel Subsystem | Events | Purpose |
|------------------|--------|---------|
| Memory manager (Section 4.1) | DeviceDiscovered, DeviceRemoving | Update NUMA topology when devices with local memory appear/disappear |
| Scheduler (Section 7.1) | DeviceActive, DeviceRemoving | Update IRQ affinity recommendations |
| FMA engine (Section 20.1) | DeviceRecovering | Log fault management events, track failure patterns |
| AccelScheduler (Section 22.1) | DeviceActive, DeviceRecovering, PowerStateChanged | Manage accelerator context lifecycle |
| Sysfs compat (Section 11.6) | All events | Update /sys filesystem in real-time |

Notifications are dispatched synchronously during registry state transitions. Subscribers must not block — they record the event and defer heavy work to a workqueue. This prevents a slow subscriber from delaying device bring-up.
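The required subscriber pattern — record synchronously, defer heavy work — can be sketched as below. DeferredSubscriber and its queue are illustrative, not the kernel's actual subscriber API; a real subscriber would also kick a workqueue item from notify():

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Stand-ins for RegistryEvent / DeviceNodeId.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Event { DeviceActive, DeviceRemoving }

/// The notify callback only enqueues; a workqueue worker drains the queue
/// later and does the heavy lifting (allocation, I/O, sleeping).
struct DeferredSubscriber {
    pending: Mutex<VecDeque<(Event, u64)>>,
}

impl DeferredSubscriber {
    fn new() -> Self { Self { pending: Mutex::new(VecDeque::new()) } }

    /// Called synchronously from the registry — must not block.
    /// (A real kernel subscriber would use a lock-free ring here.)
    fn notify(&self, event: Event, node_id: u64) {
        self.pending.lock().unwrap().push_back((event, node_id));
        // Real code would also schedule a workqueue item here.
    }

    /// Runs later in workqueue context; free to sleep and allocate.
    fn drain(&self) -> Vec<(Event, u64)> {
        self.pending.lock().unwrap().drain(..).collect()
    }
}

fn main() {
    let sub = DeferredSubscriber::new();
    sub.notify(Event::DeviceActive, 42);   // cheap: just a queue push
    sub.notify(Event::DeviceRemoving, 42);
    let work = sub.drain();                // heavy work happens here, off the hot path
    assert_eq!(work.len(), 2);
    assert_eq!(work[0], (Event::DeviceActive, 42));
}
```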


11.6.2 KABI Integration

11.6.2.1 New Methods Appended to KernelServicesVTable

All new methods are Option<...> for backward compatibility. Older kernels that do not have the registry will have these as None. Drivers must check for None before calling.

// === Device Registry (appended to KernelServicesVTable) ===

/// Report a newly discovered device to the registry.
/// Called by bus drivers (PCI enumeration, USB hub, etc.).
pub registry_report_device: Option<unsafe extern "C" fn(
    parent_handle: DeviceHandle,
    bus_type: BusType,
    bus_identity: *const u8,
    bus_identity_len: u32,
    properties: *const PropertyEntry,
    property_count: u32,
    out_handle: *mut DeviceHandle,
) -> IoResultCode>,

/// Report that a device has been physically removed.
pub registry_report_removal: Option<unsafe extern "C" fn(
    device_handle: DeviceHandle,
) -> IoResultCode>,

/// Get a property value from a device node.
pub registry_get_property: Option<unsafe extern "C" fn(
    device_handle: DeviceHandle,
    key: *const u8,
    key_len: u32,
    out_value: *mut PropertyValueC,
    out_value_size: *mut u32,
) -> IoResultCode>,

/// Set a property on a device node.
pub registry_set_property: Option<unsafe extern "C" fn(
    device_handle: DeviceHandle,
    key: *const u8,
    key_len: u32,
    value: *const PropertyValueC,
    value_size: u32,
) -> IoResultCode>,

/// Publish a named service on this device node.
pub registry_publish_service: Option<unsafe extern "C" fn(
    device_handle: DeviceHandle,
    service_name: *const u8,
    service_name_len: u32,
    service_vtable: *const c_void,
    service_vtable_size: u64,
) -> IoResultCode>,

/// Look up a named service.
pub registry_lookup_service: Option<unsafe extern "C" fn(
    device_handle: DeviceHandle,
    service_name: *const u8,
    service_name_len: u32,
    scope: u32,
    out_service_vtable: *mut *const c_void,
    out_service_handle: *mut ServiceHandle,
) -> IoResultCode>,

/// Release a previously acquired service handle.
pub registry_release_service: Option<unsafe extern "C" fn(
    service_handle: ServiceHandle,
) -> IoResultCode>,

/// Get the device handle for the current driver instance.
pub registry_get_device_handle: Option<unsafe extern "C" fn(
    out_handle: *mut DeviceHandle,
) -> IoResultCode>,

/// Enumerate children of a device node.
pub registry_enumerate_children: Option<unsafe extern "C" fn(
    device_handle: DeviceHandle,
    out_handles: *mut DeviceHandle,
    max_count: u32,
    out_count: *mut u32,
) -> IoResultCode>,

// === Clock Framework (appended to KernelServicesVTable, clock_v1) ===

/// Look up a clock by consumer name and return an opaque clock handle.
///
/// Drivers hold `DeviceHandle` (KABI-level token), not `DeviceNode` (kernel-
/// internal). This method bridges the gap: the kernel resolves
/// `device_handle` → `DeviceNode` internally, then delegates to
/// `clk_get(&DeviceNode, name)` ([Section 2.24](02-boot-hardware.md#clock-framework)).
///
/// The returned `ClkHandleId` is an opaque u64 token representing a
/// `ClkHandle` stored in the kernel's per-driver clock handle table.
/// The driver uses this token with `clk_enable`, `clk_disable`,
/// `clk_get_rate`, and `clk_set_rate` below.
///
/// # Arguments
/// - `device_handle`: The calling driver's device handle (from
///   `DeviceDescriptor.device_handle` or `registry_get_device_handle()`).
/// - `name`: Clock consumer name as declared in the device tree
///   `clock-names` property (e.g., `"bus"`, `"ref"`, `"pclk"`).
/// - `name_len`: Length of `name` in bytes (not NUL-terminated).
/// - `out_clk_handle`: On success, receives the opaque clock handle token.
///
/// # Returns
/// - `IO_SUCCESS` (0): Clock found and handle stored in `out_clk_handle`.
/// - `IO_ERR_NODEV` (-ENODEV, -19): `device_handle` is invalid or does
///   not correspond to a registered device.
/// - `IO_ERR_NOENT` (-ENOENT, -2): No clock with the given `name` is
///   associated with this device in the device tree / ACPI tables.
/// - `IO_ERR_PERM` (-EPERM, -1): Caller lacks permission for clock access.
///
/// # Kernel-internal implementation
/// 1. Resolve `device_handle` → `Arc<DeviceNode>` via `XArray` lookup.
/// 2. Call `clk_get(&device_node, name)` to obtain a `ClkHandle`.
/// 3. Store the `ClkHandle` in a per-driver `XArray<ClkHandle>` keyed by
///    a monotonically increasing `ClkHandleId`.
/// 4. Return the `ClkHandleId` to the driver.
/// On driver unload or crash, all entries in the per-driver clock handle
/// table are dropped, which decrements enable refcounts via `ClkHandle::drop`.
pub clk_get: Option<unsafe extern "C" fn(
    device_handle: DeviceHandle,
    name: *const u8,
    name_len: u32,
    out_clk_handle: *mut u64,
) -> IoResultCode>,

/// Enable a clock previously obtained via `clk_get`.
///
/// Increments the clock's enable refcount. Idempotent: calling enable on
/// an already-enabled handle is a no-op.
///
/// # Returns
/// - `IO_SUCCESS`: Clock enabled.
/// - `IO_ERR_INVAL` (-EINVAL, -22): `clk_handle` is not a valid token
///   for the calling driver.
pub clk_enable: Option<unsafe extern "C" fn(
    clk_handle: u64,
) -> IoResultCode>,

/// Disable a clock previously obtained via `clk_get`.
///
/// Decrements the clock's enable refcount. If count reaches 0, the
/// hardware gate is closed. Idempotent on already-disabled handles.
///
/// # Returns
/// - `IO_SUCCESS`: Clock disabled (or was already disabled).
/// - `IO_ERR_INVAL` (-EINVAL, -22): Invalid `clk_handle`.
pub clk_disable: Option<unsafe extern "C" fn(
    clk_handle: u64,
) -> IoResultCode>,

/// Get the current output frequency of a clock in Hz.
///
/// # Arguments
/// - `clk_handle`: Handle from `clk_get`.
/// - `out_rate_hz`: On success, receives the current rate. 0 = gated.
///
/// # Returns
/// - `IO_SUCCESS`: Rate written to `out_rate_hz`.
/// - `IO_ERR_INVAL` (-EINVAL, -22): Invalid `clk_handle`.
pub clk_get_rate: Option<unsafe extern "C" fn(
    clk_handle: u64,
    out_rate_hz: *mut u64,
) -> IoResultCode>,

/// Request a rate change on a clock.
///
/// The framework adjusts dividers and PLLs to achieve the nearest
/// achievable rate. Returns the actually-set rate in `out_actual_hz`.
///
/// # Arguments
/// - `clk_handle`: Handle from `clk_get`.
/// - `rate_hz`: Requested frequency in Hz.
/// - `out_actual_hz`: On success, receives the actually-set rate
///   (may differ from `rate_hz` if hardware cannot match exactly).
///
/// # Returns
/// - `IO_SUCCESS`: Rate changed; actual rate written to `out_actual_hz`.
/// - `IO_ERR_INVAL` (-EINVAL, -22): Invalid `clk_handle`.
/// - `IO_ERR_DEVICE` (-EIO, -5): No achievable rate within hardware limits.
pub clk_set_rate: Option<unsafe extern "C" fn(
    clk_handle: u64,
    rate_hz: u64,
    out_actual_hz: *mut u64,
) -> IoResultCode>,

/// Release a clock handle. Decrements enable refcount if the handle was
/// enabled. Removes the handle from the per-driver clock handle table.
///
/// Drivers SHOULD call this when they no longer need a clock. However,
/// handles are also released automatically on driver unload/crash via the
/// per-driver handle table cleanup, so explicit release is not mandatory
/// for correct operation.
///
/// # Returns
/// - `IO_SUCCESS`: Handle released.
/// - `IO_ERR_INVAL` (-EINVAL, -22): Invalid `clk_handle` (already released
///   or never allocated).
pub clk_release: Option<unsafe extern "C" fn(
    clk_handle: u64,
) -> IoResultCode>,
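The refcount and idempotence semantics of clk_enable/clk_disable described above can be modeled in a few lines. MockClockTable is a toy stand-in for the kernel's per-driver clock handle table, not the actual implementation:

```rust
use std::collections::HashMap;

/// Toy model: enable is idempotent per handle, and the hardware gate
/// closes only when the shared enable refcount reaches zero.
struct MockClockTable {
    enable_count: u32,           // shared clock's enable refcount
    rate_hz: u64,
    handles: HashMap<u64, bool>, // handle id -> enabled via this handle?
    next_id: u64,
}

impl MockClockTable {
    fn new(rate_hz: u64) -> Self {
        Self { enable_count: 0, rate_hz, handles: HashMap::new(), next_id: 1 }
    }
    fn clk_get(&mut self) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.handles.insert(id, false);
        id
    }
    fn clk_enable(&mut self, h: u64) -> Result<(), i32> {
        match self.handles.get_mut(&h) {
            Some(en) if !*en => { *en = true; self.enable_count += 1; Ok(()) }
            Some(_) => Ok(()), // idempotent: already enabled via this handle
            None => Err(-22),  // IO_ERR_INVAL
        }
    }
    fn clk_disable(&mut self, h: u64) -> Result<(), i32> {
        match self.handles.get_mut(&h) {
            Some(en) if *en => { *en = false; self.enable_count -= 1; Ok(()) }
            Some(_) => Ok(()), // idempotent on already-disabled handles
            None => Err(-22),
        }
    }
    fn clk_get_rate(&self) -> u64 {
        if self.enable_count > 0 { self.rate_hz } else { 0 } // 0 = gated
    }
}

fn main() {
    let mut clk = MockClockTable::new(100_000_000);
    let a = clk.clk_get();
    let b = clk.clk_get();
    clk.clk_enable(a).unwrap();
    clk.clk_enable(a).unwrap();                  // idempotent: count stays 1
    clk.clk_enable(b).unwrap();                  // count = 2
    clk.clk_disable(a).unwrap();
    assert_eq!(clk.clk_get_rate(), 100_000_000); // b still holds the clock open
    clk.clk_disable(b).unwrap();
    assert_eq!(clk.clk_get_rate(), 0);           // gate closed
    assert_eq!(clk.clk_enable(999), Err(-22));   // invalid handle
}
```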

11.6.2.2 New ABI Types

/// Opaque handle to a device node in the registry.
#[repr(C)]
pub struct DeviceHandle {
    pub id: u64,
}
const_assert!(size_of::<DeviceHandle>() == 8);

impl DeviceHandle {
    pub const INVALID: Self = Self { id: 0 };
}

/// Stable C-ABI service token. Passed across isolation domain boundaries.
/// Kernel resolves this id to an `InternalServiceRef` at each call site.
/// Liveness: the module providing this service cannot be unloaded while any
/// active `ServiceHandle` referring to it is held by a capability.
#[repr(C)]
pub struct ServiceHandle {
    pub id: u64,
}
const_assert!(size_of::<ServiceHandle>() == 8);

/// A property entry for C ABI transport.
#[repr(C)]
pub struct PropertyEntry {
    pub key: *const u8,
    pub key_len: u32,
    pub value_type: PropertyType,
    pub value_data: *const u8,
    pub value_len: u32,
    pub _pad: u32,
}
// PropertyEntry: key(ptr) + key_len(u32) + value_type(u32) + value_data(ptr) +
//   value_len(u32) + _pad(u32).
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<PropertyEntry>() == 32);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<PropertyEntry>() == 24);

#[repr(u32)]
pub enum PropertyType {
    U64         = 0,
    I64         = 1,
    String      = 2,
    Bytes       = 3,
    Bool        = 4,
    StringArray = 5,
}

/// C-ABI-safe property value output buffer.
#[repr(C)]
pub struct PropertyValueC {
    pub value_type: PropertyType,
    pub _pad: u32,
    pub data: [u8; 256],
}
// PropertyValueC: value_type(u32=4) + _pad(u32=4) + data([u8;256]=256) = 264 bytes.
const_assert!(size_of::<PropertyValueC>() == 264);

// `KabiVersion` — defined in Section 12.1.9.3 (11-kabi.md).
// Layout: { major: u16, minor: u16, patch: u16, _pad: u16 } — repr(C), 8 bytes.
// Key methods: new(major,minor,patch), is_compatible_with(kernel), as_u64(), from_u64(v).
// Constant: KABI_CURRENT = 1.0.0.
// The vtable wire format stores KabiVersion::as_u64() in the first 8 bytes of each vtable.

11.6.2.3 DeviceDescriptor Extension

The existing DeviceDescriptor gains new fields (appended):

// Appended to DeviceDescriptor
pub device_handle: DeviceHandle,    // Registry handle for this device
pub numa_node: i32,                 // NUMA node (-1 = unknown)
pub _pad: u32,

The DeviceDescriptor passed to driver_entry.init() is now populated from the registry node's properties, ensuring consistency between what the registry knows and what the driver sees.

11.6.2.4 Memory Management KABI (memory_v1)

The memory_v1 KABI table provides driver-callable memory management functions appended to KernelServicesVTable starting at KABI version 2 (the initial KernelServicesVTable layout is version 1). Per Section 12.2 versioning rules, these four Option<fn> fields are tail-appended; drivers compiled against KABI v1 see a shorter vtable_size and never access these offsets. Drivers compiled against v2+ check vtable_size >= offset_of!(memory_v1 fields) before calling, and fall back to non-NUMA allocation if the kernel does not expose memory_v1.
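The tail-append gating pattern can be sketched with a stub vtable. The field layout here is illustrative (only one v2+ field is shown); the point is the vtable_size check before touching a memory_v1 offset, plus the None fallback:

```rust
use core::mem::{offset_of, size_of};

// Minimal stand-in for the tail-appended vtable layout.
#[repr(C)]
struct KernelServicesVTable {
    vtable_size: u64,                                // set by the kernel
    // ... v1 fields would sit here ...
    numa_node_count: Option<extern "C" fn() -> i32>, // memory_v1 (KABI v2+) field
}

extern "C" fn fake_numa_node_count() -> i32 { 2 }

/// Driver-side pattern: verify the kernel's vtable is long enough to contain
/// the memory_v1 field before reading it, then check for None.
fn node_count(vt: &KernelServicesVTable) -> i32 {
    let needed = offset_of!(KernelServicesVTable, numa_node_count)
        + size_of::<Option<extern "C" fn() -> i32>>();
    if (vt.vtable_size as usize) < needed {
        return 1; // old kernel: fall back to non-NUMA behavior
    }
    match vt.numa_node_count {
        Some(f) => f(),
        None => 1, // kernel does not expose memory_v1
    }
}

fn main() {
    // v2 kernel: full table, field populated.
    let v2 = KernelServicesVTable {
        vtable_size: size_of::<KernelServicesVTable>() as u64,
        numa_node_count: Some(fake_numa_node_count),
    };
    assert_eq!(node_count(&v2), 2);

    // v1 kernel: reports a shorter vtable_size; the driver must not read the field.
    let v1 = KernelServicesVTable { vtable_size: 8, numa_node_count: None };
    assert_eq!(node_count(&v1), 1);
}
```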

These extend the existing DMA allocation functions (Section 11.3, Tier 2 syscall table) with NUMA-aware operations for Tier 1 drivers.

// === Memory Management (appended to KernelServicesVTable, memory_v1) ===

/// Request explicit NUMA page migration for driver-private pages.
///
/// Moves the specified physical pages to the target NUMA node. Only callable
/// on pages within the calling driver's isolation domain (Tier 1 protection
/// key match required). The kernel validates ownership before migration.
///
/// Migration is **synchronous**: this function blocks until all pages have
/// been physically moved to the target node (or an error occurs). The
/// driver's virtual mappings are updated transparently — existing virtual
/// addresses remain valid after migration, only the underlying physical
/// frames change.
///
/// # Arguments
/// - `pages`: Pointer to a caller-allocated array of physical page addresses
///   (page-aligned, 4 KiB granularity). Each address must be within the
///   caller's isolation domain.
/// - `page_count`: Number of entries in the `pages` array. Maximum 512 pages
///   per call (2 MiB). For larger migrations, issue multiple calls.
/// - `target_node`: NUMA node ID to migrate to. Must be a valid node with
///   available memory. Use `numa_node_count()` to discover topology.
///
/// # Returns
/// - `IO_SUCCESS` (0): All pages successfully migrated.
/// - `IO_ERR_INVALID_ADDR` (-EFAULT, -14): One or more page addresses are
///   outside the caller's isolation domain or not page-aligned. No pages
///   are migrated (atomic failure).
/// - `IO_ERR_INVALID_NODE` (-EINVAL, -22): `target_node` does not exist in
///   the NUMA topology or has no allocatable memory.
/// - `IO_ERR_DMA_PINNED` (-EBUSY, -16): One or more pages have an active
///   DMA mapping (`PG_dma_pinned` flag set, [Section 11.4](#device-registry-and-bus-management--deviceresources)). DMA-pinned pages cannot
///   be migrated because a device holds their physical address. The driver
///   must unpin DMA buffers (`free_dma_buffer`) before migrating. No pages
///   are migrated (atomic failure).
/// - `IO_ERR_NOMEM` (-ENOMEM, -12): Target node has insufficient free memory
///   to accept the migrated pages. No pages are migrated (atomic failure).
/// - `IO_ERR_PERM` (-EPERM, -1): Caller does not hold `CAP_NUMA_MIGRATE`
///   capability (required for explicit NUMA migration).
///
/// # Atomicity
/// Migration is all-or-nothing: either all pages in the request are migrated,
/// or none are. The kernel pre-validates all pages and pre-allocates target
/// frames before beginning the move. If any page fails validation, the entire
/// request is rejected before any migration occurs.
///
/// # Concurrency
/// The kernel holds the per-page migration lock during the move, serializing
/// with concurrent NUMA balancer scans and other migration requests for the
/// same pages. Other pages in the driver's domain remain accessible during
/// migration. The migrating pages are briefly unmapped (~1-5 µs per page);
/// concurrent access from the driver's other threads will fault and block
/// until migration completes.
///
/// # Safety
/// - `pages` must point to a valid array of at least `page_count` elements.
/// - All addresses in the array must be page-aligned (4096-byte boundary).
/// - Caller must ensure no device DMA is in flight to the specified pages
///   (the `PG_dma_pinned` check catches registered DMA buffers, but the
///   driver is responsible for not issuing new DMA to these addresses
///   concurrently with migration).
pub driver_request_numa_migration: Option<unsafe extern "C" fn(
    pages: *const u64,
    page_count: u32,
    target_node: i32,
) -> IoResultCode>,

/// Query the NUMA node for a set of physical pages.
///
/// Returns the NUMA node ID for each page in the input array.
/// Useful for drivers that want to check data locality before deciding
/// whether to migrate.
///
/// # Arguments
/// - `pages`: Pointer to array of physical page addresses (page-aligned).
/// - `page_count`: Number of entries in `pages`.
/// - `out_nodes`: Pointer to caller-allocated array of `page_count` `i32`
///   values. On success, `out_nodes[i]` contains the NUMA node ID for
///   `pages[i]`.
///
/// # Returns
/// - `IO_SUCCESS`: All node IDs written to `out_nodes`.
/// - `IO_ERR_INVALID_ADDR` (-EFAULT): One or more pages outside caller's domain.
pub driver_query_numa_node: Option<unsafe extern "C" fn(
    pages: *const u64,
    page_count: u32,
    out_nodes: *mut i32,
) -> IoResultCode>,

/// Query NUMA topology: number of NUMA nodes in the system.
///
/// # Returns
/// - Positive value: number of NUMA nodes (1 on non-NUMA systems).
/// - Negative value: error (should not occur; returns 1 as fallback).
pub numa_node_count: Option<unsafe extern "C" fn() -> i32>,

/// Query available memory on a NUMA node.
///
/// # Arguments
/// - `node`: NUMA node ID.
/// - `out_total_bytes`: Total physical memory on this node.
/// - `out_free_bytes`: Currently free memory on this node.
///
/// # Returns
/// - `IO_SUCCESS`: Values written to output pointers.
/// - `IO_ERR_INVALID_NODE` (-EINVAL): Node does not exist.
pub numa_node_memory: Option<unsafe extern "C" fn(
    node: i32,
    out_total_bytes: *mut u64,
    out_free_bytes: *mut u64,
) -> IoResultCode>,

Usage pattern — A NUMA-aware NIC driver migrates receive buffer pages to the NUMA node closest to the NIC's PCIe attachment point:

1. Driver probes device, reads `DeviceDescriptor.numa_node` ([Section 11.6](#device-services-and-boot--devicedescriptor-extension)).
2. Driver allocates receive ring buffers (via `alloc_dma_buffer`).
3. On each received packet, driver calls `driver_query_numa_node` to check
   if the destination process's pages are local.
4. If remote, driver calls `driver_request_numa_migration` to pull hot pages
   to the NIC's node, reducing memory access latency for subsequent packets.
5. Migration frequency is rate-limited by the driver to avoid migration storms
   (recommended: at most once per page per 100ms).
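The step-5 rate limit can be sketched as a per-page timestamp map. This is an illustrative userspace sketch (the name `MigrationLimiter` and the `std` collections are assumptions; an in-kernel version would be `no_std` with a bounded structure), not part of the KABI:

```rust
use std::collections::HashMap;

/// Illustrative per-page migration rate limiter: at most one migration
/// request per physical page per `min_interval_ns` (default 100ms).
pub struct MigrationLimiter {
    /// Physical page address -> monotonic timestamp (ns) of last migration.
    last_migration: HashMap<u64, u64>,
    min_interval_ns: u64,
}

impl MigrationLimiter {
    pub fn new(min_interval_ns: u64) -> Self {
        Self { last_migration: HashMap::new(), min_interval_ns }
    }

    /// Returns true if `page` may be migrated at `now_ns` and records the
    /// attempt; false if the page was migrated too recently.
    pub fn try_migrate(&mut self, page: u64, now_ns: u64) -> bool {
        match self.last_migration.get(&page) {
            Some(&last) if now_ns.saturating_sub(last) < self.min_interval_ns => false,
            _ => {
                self.last_migration.insert(page, now_ns);
                true
            }
        }
    }
}
```

A denied attempt does not refresh the timestamp, so a page becomes migratable exactly `min_interval_ns` after its last successful migration.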

11.6.3 Crash Recovery Integration

The registry participates in the crash recovery sequence defined in Section 11.9.

11.6.3.1 When a Driver Crashes

  1. Detection: UmkaOS Core detects the fault (hardware exception in isolation domain, watchdog timeout, Tier 2 process crash).

  2. Registry notification: UmkaOS Core identifies the faulting driver's device node. Registry transitions it to Recovering.

  3. Service invalidation: All service handles pointing to the crashed driver are invalidated. Client drivers receive -ENODEV on subsequent service calls.

  4. Child cascade: If the crashed driver is a bus driver with children, the registry processes children bottom-up:

    • For each child: stop driver, release capabilities, transition to Stopping.
    • Children are re-probed after the bus driver recovers.

  5. I/O drain + DMA fence: All pending I/O is completed with -EIO. Critically, before freeing any driver memory, UmkaOS must ensure no in-flight DMA operations can write to those pages. The IOMMU mapping for the driver's DMA regions is revoked (set to fault-on-access) immediately at step 2 (ISOLATE). Any in-flight DMA that completes after this point will hit an IOMMU fault (harmless — the write is dropped by the IOMMU).

DMA teardown sequence (before IOTLB unmap):

  1. Assert device-class DMA stop:

    • PCIe devices with FLR (Function Level Reset) support: issue FLR via the PCIe Device Control register (capability offset + 0x08, bit 15). FLR resets the device state and stops all outstanding DMA.
    • NVMe: issue Admin Command ABORT for all outstanding I/Os, then CC.EN=0 (Controller Enable clear) to halt the NVMe controller.
    • AHCI: issue PORT_CMD_FIS_RX clear + PORT_CMD_ST clear per port.
    • USB devices: send USB port reset to the host controller.
    • Devices without a DMA-stop mechanism: skip to step 2 (fallback only).
  2. Wait for DMA quiescence:

    • Poll the device's DMA-active indicator (device-class specific) until it reports no outstanding DMA, OR until 100ms has elapsed.
    • For FLR: the PCIe spec requires FLR completion within 100ms. After FLR, DMA is guaranteed stopped by hardware.
  3. If step 2 does not complete within 100ms:

    • Increment driver.dma_timeout_count (exposed via /sys/devices/.../dma_timeouts)
    • The FMA subsystem (Section 20.1) receives a FaultEvent::DmaTimeout event.
    • Issue PCIe Function Level Reset (FLR) via the device's FLR capability register (Device Control register bit 15). FLR is a hard device reset that stops all outstanding DMA by definition.
    • Wait up to 500ms for FLR completion (poll config space; device returns 0xFFFF during reset; the PCIe Base Spec requires FLR to complete within 100ms, so 500ms provides a conservative margin).
    • If FLR is unsupported by the device, or if FLR also times out:
      • Do not free memory. Mark the IOMMU group as quarantined: the existing IOMMU mappings are left in place (fault-on-access) but no new mappings are granted. Memory backing those mappings is pinned and excluded from the allocator until the quarantine is lifted.
      • Return Err(DmaQuiescenceTimeout) to the crash recovery path.
      • The quarantined IOMMU group is reset on the next system suspend/resume cycle (which performs a full bus reset), at which point the pinned memory is released.
      • Log: "DMA quiescence failed on [bus:dev.fn] after FLR — IOMMU group quarantined; memory pinned until suspend/resume reset"
  4. IOTLB invalidate: Only after confirmed device quiescence (step 1 DMA stop + step 2 poll, or step 3 FLR), invalidate IOMMU TLB entries for the unmapped region. On Intel VT-d, this uses the Invalidation Wait Descriptor with IWD=1 to wait for invalidation completion. On AMD, the COMPLETION_WAIT command provides equivalent functionality. Only after IOTLB invalidation completes is it safe to free physical pages.
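The quiescence wait and its FLR fallback (steps 2–3 above) form a small decision procedure. A sketch with the device's DMA-active indicator mocked as a closure and the 100ms window modeled as a poll budget (types and names are hypothetical, not the real UmkaOS API):

```rust
/// Illustrative outcome of the DMA quiescence procedure.
#[derive(Debug, PartialEq)]
pub enum QuiescenceOutcome {
    /// Safe to invalidate the IOTLB and free driver memory.
    Quiesced,
    /// FLR unsupported or failed: pin memory, quarantine the IOMMU group.
    QuarantineGroup,
}

pub fn quiesce(
    mut poll_dma_active: impl FnMut() -> bool, // device-class-specific poll
    poll_budget: u32,                          // models the 100ms window
    flr_available: bool,
    flr_succeeds: bool,
) -> QuiescenceOutcome {
    // Step 2: wait for DMA quiescence within the budget.
    for _ in 0..poll_budget {
        if !poll_dma_active() {
            return QuiescenceOutcome::Quiesced;
        }
    }
    // Step 3: fall back to FLR — a hard device reset stops all DMA.
    if flr_available && flr_succeeds {
        return QuiescenceOutcome::Quiesced;
    }
    // No confirmed stop: never free memory. Quarantine instead.
    QuiescenceOutcome::QuarantineGroup
}
```

The key property is that no path returns `Quiesced` without hardware confirmation; the timeout path can only quarantine, never free.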

Design note: Linux's default driver teardown does not always issue FLR, relying on IOMMU timeouts and trusting drivers to drain their own DMA. UmkaOS enforces the explicit stop sequence — it is the kernel's responsibility to ensure hardware is quiesced, not the driver's.

  • Driver private memory is freed only after confirmed device quiescence and completed IOTLB invalidation. If quiescence cannot be confirmed (FLR also fails), the memory is quarantined rather than freed — no use-after-free path is permitted.
  • Why this matters: without confirmed DMA quiescence, a device still mid-DMA could write to pages that have been freed and reallocated to another driver or to userspace — a use-after-free via hardware. Proceeding past a timeout without hardware confirmation of DMA stop is not an acceptable fallback for a production kernel; quarantine is the safe alternative when quiescence cannot be established.

  • Device reset: FLR for PCIe, port reset for USB, etc.

  • Driver reload: Fresh binary loaded, new vtable exchange. The DeviceDescriptor retains the same DeviceHandle — the device's identity in the registry is preserved across crashes.

  • Service re-publication: Reloaded driver publishes its services again. Registry notifies clients via service_recovered callback.

  • Child re-probe: If this was a bus driver, the registry re-enumerates and re-probes child devices.

11.6.3.2 Failure Counter Integration

/// Sliding-window failure tracker. Records timestamps of recent failures
/// in a circular buffer. Used by the auto-demotion policy to count failures
/// within a configurable time window.
pub struct FailureWindow {
    /// Circular buffer of failure timestamps (monotonic nanoseconds).
    timestamps: [u64; 16],
    /// Index of the next write position (wraps at 16 via `head % 16`).
    head: u8,
    /// Total number of failures recorded (may exceed 16; only the last 16
    /// timestamps are retained).
    /// u64 per project policy for monotonically increasing counters.
    total_count: u64,
}

impl FailureWindow {
    /// Count failures within the last `window_ns` nanoseconds.
    pub fn count_within(&self, window_ns: u64) -> u32 { /* ... */ }
    /// Record a failure at the current time.
    pub fn record(&mut self, now_ns: u64) { /* ... */ }
}

The registry's per-node failure_window (a FailureWindow sliding-window counter) feeds into the existing auto-demotion policy. The counter records timestamps in a 16-entry circular buffer; the policy query asks "how many entries fall within the last N seconds?" (default window: 60 seconds):

failure_window.count_within(60 seconds):
  0-2: Reload at same tier
  3+:  Demote to next lower tier (if minimum_tier allows)
  5+:  Transition to Quarantined state (driver permanently disabled, device
       unbound); requires manual administrator re-enable via umkafs. Log critical alert.

This is the same policy described in Section 11.9, now with the registry as the tracking mechanism.
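A self-contained sketch of the window query and the threshold policy (an illustrative re-implementation for clarity; the real `FailureWindow` lives in the registry, and the kernel reads a monotonic clock internally rather than taking `now_ns` as a parameter):

```rust
/// Illustrative 16-entry sliding window (mirrors FailureWindow).
pub struct Window {
    timestamps: [u64; 16],
    head: usize,
    total: u64,
}

impl Window {
    pub fn new() -> Self { Self { timestamps: [0; 16], head: 0, total: 0 } }

    /// Record a failure timestamp, overwriting the oldest entry.
    pub fn record(&mut self, now_ns: u64) {
        self.timestamps[self.head] = now_ns;
        self.head = (self.head + 1) % 16;
        self.total += 1;
    }

    /// Count retained failures newer than `now_ns - window_ns`.
    pub fn count_within(&self, now_ns: u64, window_ns: u64) -> u32 {
        let retained = self.total.min(16) as usize;
        let cutoff = now_ns.saturating_sub(window_ns);
        (0..retained)
            .map(|i| self.timestamps[(self.head + 16 - 1 - i) % 16])
            .filter(|&t| t >= cutoff)
            .count() as u32
    }
}

#[derive(Debug, PartialEq)]
pub enum Action { Reload, Demote, Quarantine }

/// The 0-2 / 3+ / 5+ thresholds from the policy above.
pub fn policy(failures_in_window: u32) -> Action {
    match failures_in_window {
        0..=2 => Action::Reload,
        3..=4 => Action::Demote,
        _ => Action::Quarantine,
    }
}
```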

How auto-demotion works without recompilation — A driver that can run in both Tier 1 (isolation domain, Ring 0) and Tier 2 (process, Ring 3) does not need two separate binaries. The KABI vtable abstraction (Section 12.1) provides identical function signatures regardless of tier. The difference is in the hosting environment: Tier 1 drivers are loaded as shared objects into a kernel isolation domain; Tier 2 drivers are loaded as processes. The same .umka binary is valid in both contexts because KABI syscalls (ring buffer operations, capability invocations) are designed to work from either Ring 0 or Ring 3 — both Tier 1 and Tier 2 use ring buffer dispatch (the driver's IDL-generated consumer loop dequeues requests and dispatches through the vtable). The difference is the isolation boundary: Tier 1 rings cross a hardware memory domain (MPK/POE/DACR), Tier 2 rings cross a process boundary (Ring 3 + IOMMU). Auto-demotion simply means "restart this driver binary in a Tier 2 process instead of a Tier 1 isolation domain." The driver code is unaware of the change — the consumer loop, vtable, and kabi_call! abstraction work identically in both contexts.


11.6.4 Boot Sequence Integration

The registry integrates into the canonical boot sequence (Section 2.3). Key registry-related steps:

  • Phase 4.2 dev_registry_init(): Initialize the device registry, NUMA topology map, and device match database.
  • Phase 4.3 kabi_runtime_init(): Initialize the KABI runtime (service registry, version negotiation tables). The mod-loader workqueue is created later at Phase 4.4a (see Section 3.11) — module loading requires bus enumeration first.
  • Phase 4.4a bus_enumerate(): PCI/platform bus enumeration populates the registry.
  • Phase 5.1 tier0_drivers_init(): Register Tier 0 devices (APIC, timer, serial) retroactively — they were initialized before the registry existed.
  • Phase 5.3 tier1_driver_load(): Registry match engine loads Tier 1 storage/NIC drivers via the module loader queue. During ELF validation, the loader enforces the panic = abort requirement (see below).
  • Phase 5.4 storage_probe(): NVMe/SCSI/AHCI device probe.

Tier 1 panic = abort enforcement: During Tier 1 driver ELF validation, the loader MUST verify that the binary contains no unwind tables. If .eh_frame or .gcc_except_table ELF sections are present, the driver is rejected with ENOEXEC and an FMA warning is emitted: "driver <name> compiled with panic=unwind; Tier 1 requires panic=abort". A panic inside a Tier 1 driver must trigger abort, which invokes the crash recovery NMI mechanism (Section 11.9). Unwinding across extern "C" isolation domain boundaries is undefined behavior per the Rust language reference (the "C" ABI is not unwind-safe). This check lands in implementation Phase 2, together with Tier 1 driver loading.

Do not define an alternative boot sequence here — the canonical table in §2.3 (Section 2.3) is the single source of truth for initialization ordering.

11.6.4.1 Tier 0 Devices

Tier 0 drivers (APIC, timer, serial) are statically linked and initialized before the registry exists. After registry init, they are registered retroactively:

registry.register_tier0_device("apic", ...);
registry.register_tier0_device("timer", ...);
registry.register_tier0_device("serial0", ...);

These nodes are created directly in Active state with no match/load cycle.

11.6.4.2 Console Handoff

The display and input stack transitions through multiple phases during boot. The handoff protocol ensures zero message loss and graceful degradation.

Phase 1 — Tier 0 (early boot):

  • Serial console (COM1/PL011/16550) is active from the first instruction.
  • VGA text mode (80×25) initialized by BIOS/UEFI firmware on x86-64.
  • All kernel output goes to the ring buffer (klog), serial, and VGA text mode simultaneously. The ring buffer captures every message from the first printk.

Phase 2 — Tier 1 loaded (DRM/KMS driver):

  • The DRM/KMS display driver initializes, performs modeset, and allocates a framebuffer.
  • A framebuffer console renderer (fbcon) is initialized with the target resolution.

Handoff protocol:

1. DRM driver completes modeset, signals "console ready" via KABI callback:
     driver_event(CONSOLE_READY, framebuffer_info)

2. Kernel console subsystem:
   a. Locks the console output path (brief pause, <1ms)
   b. Replays the full ring buffer contents onto the framebuffer console
      — no boot messages are lost, the user sees the complete boot log
   c. Registers fbcon as the primary console output
   d. Unlocks the console output path

3. Serial console remains active — never disabled. All output goes to BOTH
   serial and framebuffer. This ensures remote management always works.

4. VGA text mode driver is deregistered as the *primary* console backend.
   The VGA text mode memory region (0xB8000) is NOT released to the physical
   memory allocator — it is reserved as a panic-only fallback (see below).
   The region is small (4000 bytes) and the cost of keeping it reserved is
   negligible compared to the benefit of having a guaranteed crash output path.
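The replay step (2b–2c) can be sketched as follows. The `Klog`/`ConsoleSink` names are hypothetical; the real implementation uses a bounded ring and the console lock described above rather than `std` collections:

```rust
/// Illustrative console backend: anything that can render a log line.
pub trait ConsoleSink {
    fn write_line(&mut self, line: &str);
}

/// Illustrative klog: captures every message and fans out to all sinks.
pub struct Klog {
    ring: Vec<String>,               // bounded ring buffer in the real kernel
    sinks: Vec<Box<dyn ConsoleSink>>,
}

impl Klog {
    pub fn new() -> Self { Self { ring: Vec::new(), sinks: Vec::new() } }

    pub fn log(&mut self, line: &str) {
        self.ring.push(line.to_string()); // capture from the first printk
        for s in &mut self.sinks { s.write_line(line); }
    }

    /// CONSOLE_READY handoff: replay the full ring onto the new backend,
    /// then register it — no boot messages are lost.
    pub fn register_sink(&mut self, mut sink: Box<dyn ConsoleSink>) {
        for line in &self.ring { sink.write_line(line); }
        self.sinks.push(sink);
    }
}
```

Because replay happens before registration (under the console lock in the real kernel), the new backend sees every message exactly once and in order.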

Keyboard handoff:

  • Early boot: PS/2 scan code handler (Tier 0) captures keystrokes into a buffer. This allows emergency interaction (e.g., boot parameter editing) before USB is up.
  • Tier 1 loaded: USB HID driver initializes, registers as input device. The input subsystem drains the PS/2 keystroke buffer — no keystrokes are lost.
  • PS/2 handler remains active for keyboards physically connected via PS/2.

Virtual terminals:

  • VT switching (Ctrl+Alt+F1–F6) is implemented in umka-core's input multiplexer, NOT in the display driver. The display driver is a passive renderer.
  • On VT switch, the input multiplexer sends a SWITCH_VT(n) command to the display driver via KABI. The driver switches which virtual framebuffer is scanned out.
  • This design means a crashing display driver doesn't break VT switching logic — on driver recovery, the multiplexer re-sends the current VT state.

Crash fallback:

  • If the DRM driver faults, the core reverts to VGA text mode (x86-64) or serial-only (AArch64/RISC-V/PPC) for panic output. Tier 0 console backends are always available.
  • The panic handler bypasses the normal console locking path and writes directly to the Tier 0 backends (serial + VGA text if available).

11.6.4.3 PCI Enumeration

PCI enumeration is part of UmkaOS Core (Tier 0 functionality in early boot). It walks PCI configuration space and creates device nodes:

For each PCI bus (starting from bus 0):
  For each device 0-31, function 0-7:
    If device present:
      1. Create DeviceNode with PCI bus identity
      2. Populate properties: vendor-id, device-id, class-code, BARs, IRQs
      3. If this is a bridge: create a bus node, recurse into secondary bus
      4. Set numa_node from ACPI SRAT proximity domain
      5. Registry runs match engine for this node
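The walk above, with PCI config space mocked as a map, might look like the following (illustrative only; real enumeration reads vendor ID, header type, and secondary bus number from ECAM):

```rust
use std::collections::HashMap;

/// Mocked config space: (bus, device, function) -> (vendor_id, is_bridge,
/// secondary_bus). Absent keys model empty slots.
pub type ConfigSpace = HashMap<(u8, u8, u8), (u16, bool, u8)>;

/// Recursive bus walk: push each present function, recurse through bridges.
pub fn scan_bus(cfg: &ConfigSpace, bus: u8, found: &mut Vec<(u8, u8, u8)>) {
    for dev in 0..32u8 {
        for func in 0..8u8 {
            let Some(&(vendor, is_bridge, secondary)) = cfg.get(&(bus, dev, func)) else {
                continue;
            };
            if vendor == 0xFFFF {
                continue; // all-ones read: no device present
            }
            found.push((bus, dev, func)); // create DeviceNode, populate props
            if is_bridge {
                scan_bus(cfg, secondary, found); // recurse into secondary bus
            }
        }
    }
}
```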

11.6.4.4 NUMA Awareness

ACPI SRAT (System Resource Affinity Table) provides NUMA topology. The registry uses this to set numa_node on each device node based on the device's proximity domain (PCI devices inherit from their root port's NUMA node).

This information is available for:

  • Driver memory allocation: Prefer the device's NUMA node.
  • DMA buffer allocation: Prefer the device's NUMA node.
  • IRQ affinity: Suggest CPU affinity matching the device's NUMA node.
  • Tier 1 domain assignment: Prefer grouping NUMA-local devices when isolation domains are shared.

Fallback when SRAT is absent (ARM/DT platforms, minimal VMs): If no SRAT/PPTT table provides NUMA information, numa_node remains -1 (unknown). All NUMA-aware allocation functions treat -1 as "use local node" — the allocator uses the requesting CPU's NUMA node as a best-effort default. This is correct: on UMA systems there is only one NUMA node, and on DT-based NUMA systems the DT numa-node-id property is parsed during device tree enumeration (same as Linux's of_node_to_nid()).
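The -1 convention reduces to a one-line helper (the name `effective_node` is illustrative):

```rust
/// Illustrative: numa_node == -1 (unknown) means "use local node".
/// NUMA-aware allocators fall back to the requesting CPU's node.
pub fn effective_node(device_numa_node: i32, requesting_cpu_node: u32) -> u32 {
    if device_numa_node < 0 {
        // No SRAT / no DT numa-node-id: best-effort local allocation.
        requesting_cpu_node
    } else {
        device_numa_node as u32
    }
}
```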

11.6.4.5 ACPI Enumerator

The ACPI enumerator is Tier 0 kernel-internal code that walks the ACPI namespace and creates platform device nodes in the registry. It handles the tables that define hardware topology:

| ACPI Table | Registry Impact |
|---|---|
| MCFG (PCI Express Memory Mapped Config) | Defines PCI segment groups and ECAM base addresses. The PCI enumerator uses these to access PCI config space. |
| SRAT (System Resource Affinity) | Maps PCI bus ranges and memory ranges to NUMA proximity domains. Sets numa_node on device nodes. |
| DMAR / IVRS (DMA Remapping) | Defines IOMMU hardware. Creates IOMMU group assignments (Section 11.5). Intel DMAR for VT-d, AMD IVRS for AMD-Vi. |
| DSDT / SSDT (Differentiated System Description) | Defines platform devices (embedded controllers, power buttons, battery, thermal zones). Each ACPI device object becomes a platform device node. |
| HPET / MADT | Timer and interrupt controller topology. Creates Tier 0 device nodes for APIC, I/O APIC, HPET. |

AML evaluation: The ACPI enumerator includes an AML (ACPI Machine Language) interpreter for evaluating _STA (device status), _CRS (current resources), and _HID (hardware ID) methods. This is a significant subsystem but is required for correct hardware enumeration on any x86 system. The AML interpreter runs in Tier 0 with full kernel privileges because it accesses hardware registers directly.

Device Tree enumerator (AArch64/RISC-V/PPC/LoongArch64): Parses the flattened device tree (FDT) passed by the bootloader. Each DT node with a compatible property becomes a platform device node. The reg property populates DeviceResources.bars (as MMIO regions), and the interrupts property populates DeviceResources.irqs. DT phandle references become provider-client service links.

Channel I/O enumerator (s390x only): Scans the channel subsystem by iterating all possible subchannel addresses via STSCH (Store Subchannel). Each valid subchannel becomes a device node with BusIdentity::ChannelIo. The PMCW (Path Management Control Word) from the SCHIB provides device type identification; SenseID CCW commands provide CU type/model and device type/model. QDIO-capable devices (OSA-Express, FCP, zFCP) get ChannelIoResources populated with queue descriptors. Hotplug events arrive as channel path (CHP) vary-on/vary-off machine check interrupts, which trigger re-enumeration of affected subchannels. See Section 11.10 for the full Channel I/O spec.

11.6.4.6 Firmware Quirk Framework

ACPI tables and Device Trees are authored by firmware engineers and are notoriously buggy. Linux has accumulated thousands of firmware workarounds scattered across subsystem-specific code (drivers/acpi/, arch/x86/kernel/, DMI match tables, ACPI override tables). UmkaOS centralizes firmware workarounds into a structured quirk framework, similar to the CPU errata framework (Section 2.18).

The problem is real — common firmware bugs observed in the wild:

  • ACPI _CRS (Current Resources) reports incorrect MMIO ranges for PCI bridges, causing resource conflicts
  • SRAT (NUMA affinity) tables claim all memory belongs to NUMA node 0 on multi-socket systems (broken BIOS update)
  • DMAR (IOMMU) tables omit devices or report wrong scope, causing IOMMU group misassignment
  • Device Tree interrupt-map entries with wrong parent phandle references (ARM SoC vendor bugs)
  • DSDT/SSDT AML code with infinite loops, incorrect register addresses, or methods that return wrong types
  • MADT reports non-existent APIC IDs (causes boot failure if kernel trusts them)
  • ECAM (PCI config space) base address wrong in MCFG table

UmkaOS's firmware quirk table:

/// Firmware quirk entry — matches a system to its required workarounds.
struct FirmwareQuirk {
    /// System identification (DMI vendor + product + BIOS version).
    match_id: DmiMatch,
    /// ACPI table match (optional — match specific table revision).
    table_match: Option<AcpiTableMatch>,
    /// Human-readable quirk identifier.
    quirk_id: &'static str,
    /// Workaround: override, ignore, or patch firmware data.
    action: QuirkAction,
}

enum QuirkAction {
    /// Override a specific ACPI table with a corrected version (ACPI override).
    OverrideTable { table_signature: [u8; 4], replacement: &'static [u8] },
    /// Ignore a specific device entry in DMAR/IVRS (broken IOMMU scope).
    IgnoreIommuDevice { segment: u16, bus: u8, device: u8, function: u8 },
    /// Override NUMA affinity for a memory range (broken SRAT).
    OverrideNumaAffinity { phys_start: u64, phys_end: u64, node: u32 },
    /// Ignore an APIC ID in MADT (non-existent CPU).
    IgnoreApicId { apic_id: u32 },
    /// Patch a specific AML method (replace bytecode).
    PatchAml { path: &'static str, replacement: &'static [u8] },
    /// Skip enumeration for a device matching this HID (broken _CRS).
    SkipDevice { hid: &'static str },
    /// Custom workaround function.
    Custom(fn() -> Result<()>),
}
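Matching at boot reduces to a filter over the quirk table. A simplified sketch (DMI identity reduced to vendor+product strings; the real DmiMatch also covers BIOS version, and the example match strings are illustrative):

```rust
/// Simplified quirk entry: DMI identity as plain strings.
pub struct Quirk {
    pub match_vendor: &'static str,
    pub match_product: &'static str,
    pub quirk_id: &'static str,
}

/// Return the quirk IDs applicable to this system; each one is
/// applied and logged at boot.
pub fn applicable_quirks<'a>(
    table: &'a [Quirk],
    vendor: &str,
    product: &str,
) -> Vec<&'a str> {
    table
        .iter()
        .filter(|q| q.match_vendor == vendor && q.match_product == product)
        .map(|q| q.quirk_id)
        .collect()
}
```

Adding a workaround is then a single table entry, which is the point of the framework.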

Quirk database population — the initial quirk database is seeded from:

  1. Linux's existing DMI quirk tables (drivers/acpi/, arch/x86/pci/) — these document decades of firmware workarounds with specific DMI match strings
  2. Community-reported firmware bugs (same mechanism as Linux's bugzilla)
  3. Vendor-provided errata sheets (when available)

ACPI table override — Linux supports loading replacement ACPI tables from initramfs (CONFIG_ACPI_TABLE_UPGRADE). UmkaOS supports the same mechanism: if a corrected DSDT is placed in the initramfs at /lib/firmware/acpi/, it replaces the firmware-provided table at boot. This allows users to fix firmware bugs without waiting for a BIOS update.

Boot-time quirk logging — all applied quirks are logged at boot:

umka: Firmware quirk applied: DELL-POWEREDGE-R740-BIOS-2.12 — DMAR ignore device 0000:00:14.0 (broken IOMMU scope)
umka: Firmware quirk applied: LENOVO-T14S-BIOS-1.38 — SRAT override node 0→1 for range 0x100000000-0x200000000

Why UmkaOS is more sensitive to firmware bugs than Linux — UmkaOS's topology-aware device registry derives NUMA affinity, IOMMU groups, power management ordering, and driver isolation domains from firmware-reported topology. A firmware bug that reports wrong NUMA affinity causes UmkaOS to place a driver on the wrong NUMA node (performance degradation). In Linux, the same bug might cause a suboptimal numactl suggestion but doesn't affect driver placement (Linux doesn't have topology-aware driver isolation).

This means UmkaOS must invest more heavily in firmware workarounds than Linux for the same set of hardware. The structured quirk framework makes this manageable — adding a new workaround is a single table entry, not scattered if (dmi_match(...)) checks across the codebase.

Defensive parsing — beyond per-system quirks, all firmware table parsers are defensively coded:

  • ACPI table lengths are validated against the RSDP/XSDT-reported size
  • AML interpreter has an instruction count limit (prevents infinite loops in AML code)
  • Device Tree parser validates all phandle references before dereferencing
  • PCI config space reads are bounds-checked against MCFG-reported ECAM regions
  • Any parse failure is logged as an FMA event (Section 20.1) and the offending entry is skipped rather than causing a boot failure

11.6.4.7 Resource Assignment

During PCI enumeration, the registry assigns hardware resources to each device:

For each PCI device:
  1. Read BAR registers to determine resource requirements (size, type).
  2. Assign physical address ranges from the PCI memory/IO space allocator.
     - MMIO BARs: allocate from PCI MMIO window (defined by ACPI `_CRS`
       method on the PCI host bridge device; MCFG defines only the ECAM base
       address for PCIe configuration space access).
     - I/O BARs: allocate from PCI I/O window (legacy x86, rare).
  3. Write assigned addresses back to BAR registers.
  4. Populate DeviceResources.bars with the assigned mappings.
  5. Allocate MSI/MSI-X vectors:
     - If device supports MSI-X: allocate up to min(device_max, driver_requested) vectors.
     - If MSI only: allocate power-of-2 vectors up to device limit.
     - Fallback: assign legacy INTx pin.
  6. Populate DeviceResources.irqs.

Resource conflicts (overlapping BAR assignments, IRQ vector exhaustion) are detected during enumeration and logged as FMA events (Section 20.1). Conflicting devices remain in Discovered state with no driver bound.
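Step 1 (BAR sizing) uses the classic write-all-ones probe: the device hardwires its size-alignment bits to zero, so the value read back after writing 0xFFFFFFFF encodes the region size. A sketch for 32-bit memory BARs, where the low 4 bits are type flags:

```rust
/// Decode a 32-bit memory BAR's size from the value read back after
/// writing all-ones. The low 4 bits are type flags (I/O vs memory,
/// 32/64-bit, prefetchable) and are masked off before inversion.
pub fn mem_bar_size(readback_after_all_ones: u32) -> u32 {
    let mask = readback_after_all_ones & !0xFu32; // strip type flag bits
    // Hardwired-zero low bits => two's-complement trick yields the size.
    (!mask).wrapping_add(1)
}
```

For example, a device that reads back 0xFFFF0000 is requesting a 64 KiB MMIO region; 0xFFFFF004 (flags set, 12 low size bits hardwired to 0) is a 4 KiB region.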


11.6.5 Sysfs Compatibility

The registry is the single source of truth for the /sys filesystem required by Linux compatibility (Section 19.1).

11.6.5.1 Mapping

| Sysfs Path | Registry Source |
|---|---|
| /sys/devices/ | Device tree traversal (parent-child edges) |
| /sys/bus/pci/devices/ | All nodes with bus_type == Pci |
| /sys/bus/usb/devices/ | All nodes with bus_type == Usb |
| /sys/class/block/ | Nodes publishing "block" service |
| /sys/class/net/ | Nodes publishing "net" service |
| /sys/devices/.../driver | driver_binding.driver_name |
| /sys/devices/.../power/ | Power state and runtime PM policy |
| /sys/devices/.../uevent | Generated from node properties |

11.6.5.2 Attribute Files

Each standard property maps to the expected sysfs attribute format:

  • vendor → property "vendor-id" formatted as 0x%04x
  • device → property "device-id" formatted as 0x%04x
  • class → property "class-code" formatted as 0x%06x

Custom driver-set properties appear under a properties/ subdirectory.

11.6.5.3 Device Class via Service Names

Linux's /sys/class/ directories are derived from service publication:

  • A driver that publishes a "net" service → device appears under /sys/class/net/
  • A driver that publishes a "block" service → device appears under /sys/class/block/
  • A driver that publishes an "input" service → device appears under /sys/class/input/

This is more principled than Linux's explicit class_create() calls because the classification falls naturally out of what the driver actually does.


11.6.6 Firmware Management

Devices need firmware updates. The kernel provides infrastructure for loading and updating device firmware without requiring device-specific userspace tools.

11.6.6.1 Firmware Loading

Firmware loading flow (boot and runtime):
  1. Driver calls kabi_request_firmware(name, device_id).
  2. Kernel searches firmware paths in order:
     a. /lib/firmware/updates/<name>  (admin overrides)
     b. /lib/firmware/<name>          (distro-provided)
     c. Initramfs embedded firmware   (for boot-critical devices)
  3. If found: kernel maps the firmware blob read-only into the
     driver's isolation domain. Driver receives a FirmwareBlob handle
     with .data() and .size() accessors.
  4. Driver loads firmware to device via its own mechanism
     (MMIO, DMA upload, vendor mailbox).
  5. Driver releases the handle; kernel unmaps the blob.

  Same semantics as Linux request_firmware() / request_firmware_nowait().
  The async variant (kabi_request_firmware_async) does not block the
  driver's probe path — useful for large firmware blobs (>10MB).
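The search order in step 2 is a first-match lookup. A sketch with the filesystem mocked as a list of present paths (the `initramfs:` prefix is an illustrative stand-in for embedded firmware, not a real path scheme):

```rust
/// Resolve a firmware blob name to the first present candidate path,
/// in priority order: admin overrides, distro-provided, initramfs.
pub fn resolve_firmware(name: &str, present: &[String]) -> Option<String> {
    let candidates = [
        format!("/lib/firmware/updates/{name}"), // admin overrides
        format!("/lib/firmware/{name}"),         // distro-provided
        format!("initramfs:{name}"),             // boot-critical fallback
    ];
    candidates.into_iter().find(|c| present.contains(c))
}
```

The ordering guarantees that an administrator-placed blob under /lib/firmware/updates/ always wins over the distro copy.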

11.6.6.2 Firmware Update (Runtime)

Runtime firmware update (fwupd / vendor tools):
  1. Userspace signals the start of an update by writing to
     /sys/class/firmware/<device>/loading, then uploads the firmware
     capsule via /sys/class/firmware/<device>/data (Section 11.6.6.3).
  2. Kernel validates:
     a. Signature (mandatory: Ed25519 or PQC if enabled).
        The signing key must match the device's firmware trust anchor
        (embedded in device or provided by vendor via UEFI db).
     b. Version (must be >= current version, prevents downgrade attacks
        unless admin explicitly overrides via firmware.allow_downgrade=1).
  3. Kernel notifies driver via KABI callback:
     update_firmware(blob, blob_size) -> FirmwareUpdateResult.
  4. Driver performs the device-specific update procedure:
     - NVMe: Firmware Download + Firmware Commit (NVMe admin commands).
     - GPU: vendor-specific update mechanism.
     - NIC: flash update via vendor mailbox.
  5. Driver returns result: Success, NeedsReset, Failed(error_code).
  6. If NeedsReset: kernel marks device for reset. Reset can be
     triggered immediately (if no active I/O) or deferred to next
     maintenance window (admin-configurable).
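The downgrade check in step 2b is deliberately simple; a sketch, with `allow_downgrade` modeling the firmware.allow_downgrade=1 override:

```rust
/// Accept a candidate firmware version only if it is not a downgrade,
/// unless the administrator has explicitly allowed downgrades.
pub fn version_ok(current: u64, candidate: u64, allow_downgrade: bool) -> bool {
    candidate >= current || allow_downgrade
}
```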

UEFI capsule updates (system firmware):
  Kernel writes capsule to EFI System Resource Table (ESRT) via
  efi_capsule_update(). Actual update happens on next reboot.
  Same mechanism as Linux (CONFIG_EFI_CAPSULE_LOADER).
  Exposes /dev/efi_capsule_loader for userspace tools (fwupd).

11.6.6.3 Linux Compatibility

/sys/class/firmware/<device>/loading    — firmware loading trigger
/sys/class/firmware/<device>/data       — firmware blob upload
/sys/class/firmware/<device>/status     — update status
/sys/bus/*/devices/*/firmware_node/     — ACPI firmware node link
/dev/efi_capsule_loader                — UEFI capsule interface

fwupd works unmodified — it uses the standard sysfs firmware update interface and UEFI capsule loader, both of which are provided.


11.6.7 Appendix: Comparison with Prior Art

| Aspect | Linux | IOKit | Windows PnP | Fuchsia DF | UmkaOS |
|---|---|---|---|---|---|
| Tree owner | Kernel (kobject) | Kernel (IORegistry) | Kernel (devnode) | Userspace (devmgr) | Kernel (DeviceRegistry) |
| Matching | Per-bus (module_alias) | Property dict match | INF file rules | Bind rules | MatchRule in ELF .kabi_match |
| PM ordering | Heuristic (dpm_list) | IOPMPowerState tree | IRP tree walk | Component PM | Topological sort of device tree |
| Service discovery | Per-subsystem APIs | IOService matching | WDF target objects | Protocol/service | Unified registry_publish/lookup |
| Hot-plug | Per-bus callbacks | IOService terminate | PnP IRP dispatch | devmgr events | Registry-mediated events |
| Crash recovery | Kernel panic | IOService terminate | Bugcheck | Component restart | Registry-orchestrated reload |
| ABI coupling | Tight (kobject in driver) | Tight (C++ inheritance) | Tight (WDM/WDF) | Protocol-only | None (KABI vtable only) |
| Isolation | None | None | None | Process boundary | Domain isolation + process + capability |

11.7 Zero-Copy I/O Path

The entire I/O path from user space to device and back avoids all data copies. This is essential for matching Linux performance.

11.7.1 NVMe Read Example (io_uring SQPOLL + Registered Buffers)

Step 1: User writes SQE to io_uring submission ring
        [User space, shared memory, 0 transitions]

Step 2: SQPOLL kernel thread reads SQE from ring
        [UmkaOS Core, shared memory read, 0 copies]

Step 3: Domain switch to NVMe driver domain (~23 cycles on x86 MPK)
        [Single WRPKRU on x86; MSR POR_EL0+ISB on AArch64 POE; MCR DACR on ARMv7]

Step 4: NVMe driver writes command to hardware submission queue
        [Pre-computed DMA address from registered buffer]

Step 5: Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)
        [Submit path complete, return to core domain]

Step 6: NVMe device DMAs data directly to user buffer
        [IOMMU-validated, zero-copy, device -> user memory]

Step 7: NVMe device writes completion to hardware CQ, raises interrupt

Step 8: Interrupt routes to NVMe driver (domain switch, ~23 cycles on x86 MPK)
        Driver reads hardware CQE

Step 9: Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)

Step 10: UmkaOS Core writes CQE to io_uring completion ring
         [Shared memory write, 0 copies]

Step 11: User reads CQE from completion ring
         [User space, shared memory, 0 transitions]

Summary:

  • Total data copies: 0
  • Total domain switches: 4 (steps 3+5 on submit path, steps 8+9 on completion path)
  • Total domain switch overhead: ~92 cycles on x86 MPK (4 x ~23 cycles per Section 11.2; see the Section 11.2 table for other architectures)
  • Device latency: ~3-10 us
  • Overhead percentage: < 1%
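The overhead figure can be sanity-checked directly. Assuming a 3 GHz core (the chapter does not fix a clock rate), 4 switches at ~23 cycles cost ~31 ns, which is on the order of 1% of a 3 us device access and ~0.3% of a 10 us access:

```rust
/// Back-of-envelope check of the domain-switch overhead claim.
pub fn overhead_percent(switch_cycles: u64, device_latency_ns: u64, ghz: u64) -> f64 {
    // cycles / (cycles per nanosecond) = nanoseconds of switch overhead
    let switch_ns = switch_cycles as f64 / ghz as f64;
    100.0 * switch_ns / device_latency_ns as f64
}
```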

11.7.1.1 NVMe Doorbell Coalescing (Mandatory)

NVMe hardware uses doorbell registers (MMIO writes) to notify the controller that new commands are available in the submission queue. Each doorbell write is an uncacheable MMIO store — ~100-200 cycles on x86-64 (PCIe posted write), ~150-300 cycles on ARM (device memory type). In the naive case, every submit_io() call writes the doorbell immediately, which means one MMIO write per I/O command.

UmkaOS coalesces doorbell writes as a core design decision. When multiple I/O commands are submitted in a batch (common with io_uring SQPOLL, which drains multiple SQEs per poll cycle), the NVMe driver writes all commands to the submission queue first, then issues a single doorbell write for the entire batch. The NVMe specification explicitly supports this: the doorbell value is the new SQ tail index, and the controller processes all entries between the previous tail and the new tail.

/// NVMe submission batch context. Accumulates commands and defers the
/// doorbell write until `flush()` is called. Created by the KABI dispatch
/// trampoline when it detects multiple pending SQEs in the domain ring buffer.
///
/// # Invariants
///
/// - `pending_count` tracks commands written to the hardware SQ since the
///   last doorbell write.
/// - `flush()` must be called before returning from the KABI dispatch
///   to ensure all commands are visible to the controller. The KABI
///   trampoline enforces this via Drop (flush on drop as safety net).
pub struct NvmeSubmitBatch<'sq> {
    /// Reference to the submission queue (hardware memory).
    sq: &'sq mut NvmeSubmissionQueue,
    /// Number of commands written since last doorbell.
    // Bounded by max_batch, which is clamped to sq.depth (max 65536 per
    // NVMe spec CAP.MQES).  u32 cannot overflow.
    pending_count: u32,
    /// Maximum batch size before auto-flush (tunable, default: 32).
    /// Prevents unbounded batching that could increase per-command latency.
    /// **Clamped**: `max_batch` must not exceed the SQ's queue depth
    /// (`sq.depth`); the constructor enforces `max_batch = min(requested, sq.depth)`.
    /// Exceeding queue depth would overwrite un-consumed SQ entries.
    max_batch: u32,
}

impl<'sq> NvmeSubmitBatch<'sq> {
    /// Write a command to the SQ without ringing the doorbell.
    /// If `pending_count` reaches `max_batch`, auto-flushes.
    pub fn submit(&mut self, cmd: &NvmeCommand) {
        self.sq.write_entry(cmd);
        self.pending_count += 1;
        if self.pending_count >= self.max_batch {
            self.flush();
        }
    }

    /// Ring the doorbell once for all pending commands.
    /// Cost: one MMIO write (~100-200 cycles) regardless of batch size.
    pub fn flush(&mut self) {
        if self.pending_count > 0 {
            // SAFETY: doorbell is an MMIO register in the driver's private
            // domain. Writes the new SQ tail index.
            unsafe { self.sq.ring_doorbell() };
            self.pending_count = 0;
        }
    }
}

impl Drop for NvmeSubmitBatch<'_> {
    fn drop(&mut self) {
        // Safety net: ensure all commands are submitted even if the caller
        // forgets to call flush(). This is a correctness guarantee, not a
        // performance path — callers should flush() explicitly.
        //
        // Device liveness check: skip the doorbell write if the device is
        // no longer alive (e.g., hot-removed, or controller disabled during
        // error recovery). NVMe spec section 3.1.5 guarantees doorbell
        // writes are ignored when CC.EN=0, but other device types with
        // doorbell-style notification (e.g., legacy PCI devices with
        // interrupt assertion on doorbell write) may not tolerate writes
        // in error state. The liveness check makes the coalescing pattern
        // safe for all device types using this abstraction.
        if self.sq.device_alive() {
            self.flush();
        }
    }
}

Batch size selection: The default max_batch of 32 balances throughput and latency. With io_uring SQPOLL draining at ~32-64 SQEs per poll cycle, this typically results in 1-2 doorbell writes per poll cycle instead of 32-64. The value is tunable per-device to accommodate different workload patterns.

Cost savings:

Scenario | Without coalescing | With coalescing | Savings
io_uring SQPOLL, 32 SQEs/batch | 32 × ~150 cycles = ~4800 cycles | 1 × ~150 cycles = ~150 cycles | ~4650 cycles (~97%)
io_uring SQPOLL, 1 SQE (fsync) | 1 × ~150 cycles | 1 × ~150 cycles | 0 (no batching opportunity)
Direct submit (non-SQPOLL) | 1 × ~150 cycles | 1 × ~150 cycles | 0 (single command)

Per-I/O amortized doorbell cost with batch-32: ~150/32 = ~5 cycles/command, down from ~150 cycles/command. On a 10μs NVMe read (~25,000 cycles), this reduces doorbell overhead from ~0.6% to ~0.02%.

Applicability beyond NVMe: The same coalescing pattern applies to any device with doorbell-style notification: virtio (virtqueue kick), network TX (NIC doorbell/tail pointer write), and accelerator command queues. The KABI dispatch trampoline detects batch opportunities for any device type and uses the same flush-on-last-command pattern.

11.7.2 NVMe Write Example (Buffered write() → Page Cache → Writeback → NVMe)

The write path is longer than the read path because it traverses the page cache and writeback machinery before reaching the NVMe driver. This example shows a standard buffered write() — the most common write path for applications that do not use O_DIRECT.

Step 1: User calls write(fd, buf, len)
        [User space → syscall → UmkaOS Core]

Step 2: VFS write path copies user data into page cache pages
        [Domain switch to VFS Tier 1 (~23 cycles on x86 MPK)]
        VFS calls address_space_ops.write_begin() to allocate/find page cache
        pages for the file offset range, then copy_from_user() into those pages.
        Pages are marked PG_DIRTY. write() returns to userspace.
        [Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)]

Step 3: Writeback thread wakes (periodic or balance_dirty_pages() threshold)
        [UmkaOS Core, writeback kthread context]
        The per-BDI writeback thread selects dirty inodes and calls
        address_space_ops.writepages() to flush dirty page cache pages.

Step 4: Filesystem (ext4/XFS/Btrfs) maps dirty pages to disk blocks
        [VFS Tier 1 domain — already entered by writeback dispatch]
        The filesystem calls iomap_begin() to resolve file offsets to block
        device LBAs. For delayed-allocation filesystems, this is where
        physical blocks are allocated.

Step 5: Bio construction from dirty pages
        [VFS Tier 1 domain]
        The filesystem builds bio structs: each bio references page cache
        pages directly (no copy — bio.bi_io_vec points to the dirty pages).
        The bio target is the NVMe block device. submit_bio() hands the bio
        to the block layer.

Step 6: Block layer dispatches bio to NVMe driver
        [Domain switch from VFS Tier 1 to NVMe Tier 1 (~23 cycles on x86 MPK)]
        The block layer's make_request_fn routes the bio to the NVMe driver's
        submission queue. The NVMe driver translates the bio into an NVMe Write
        command, referencing the DMA addresses of the page cache pages.

Step 7: NVMe driver writes command to hardware submission queue
        [NVMe Tier 1 domain]
        Pre-computed DMA mapping: page cache pages were DMA-mapped when the bio
        was constructed (streaming DMA map, direction = TO_DEVICE). The NVMe
        command PRP/SGL entries point to these IOMMU-validated DMA addresses.
        Doorbell coalescing applies if multiple bios are batched.
        [Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)]

Step 8: NVMe device DMAs data FROM page cache pages to device storage
        [IOMMU-validated, zero-copy from page cache — no intermediate buffer]

Step 9: NVMe device writes completion to hardware CQ, raises interrupt

Step 10: Interrupt routes to NVMe driver (domain switch, ~23 cycles on x86 MPK)
         Driver reads hardware CQE, calls bio completion callback.

Step 11: Bio completion callback marks pages PG_CLEAN
         [Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)]
         DMA mappings are unmapped (streaming DMA unmap). Page cache pages
         transition from dirty to clean. Writeback accounting is updated.

Summary:
- Total data copies: 1 (user buffer → page cache in step 2; this copy is inherent to buffered I/O and identical to Linux)
- Page cache → device: 0 copies (NVMe DMAs directly from page cache pages)
- Total domain switches: 7 (2 for the VFS write, 3 for writeback dispatch Core→VFS→NVMe→Core, 2 for the completion interrupt). Row 3 (Core→VFS for writeback) may be elided by shadow if the writeback kthread runs persistently in the VFS domain, reducing the count to 6.
- Total domain switch overhead: ~161 cycles on x86 MPK (7 × ~23 cycles per Section 11.2), or ~138 cycles (6 × ~23) with shadow elision
- Device latency: ~10-30 μs (NVMe write, varies with device and write size)
- Overhead percentage: < 1% (~138-161 cycles on a ~40,000-120,000 cycle operation)

Domain crossings:

Crossing | Direction | Trigger
Core → VFS Tier 1 | Forward | write() syscall dispatch
VFS Tier 1 → Core | Return | write() completion (user returns)
Core → VFS Tier 1 | Forward | Writeback thread dispatches writepages()
VFS Tier 1 → NVMe Tier 1 | Forward | submit_bio() to NVMe block device
NVMe Tier 1 → Core | Return | Doorbell write complete, return to writeback
Core → NVMe Tier 1 | Forward | NVMe completion interrupt
NVMe Tier 1 → Core | Return | Bio completion callback returns

Note: The VFS Tier 1 → NVMe Tier 1 crossing (row 4) is a cross-domain ring dispatch (VFS "filesystem" domain → NVMe "block" domain). These are different hardware isolation domains, so communication uses a ring buffer — same as any cross-domain crossing. The ring is set up at bind time when the VFS module resolves its block device dependency. The table lists 7 crossings total. If the writeback kthread runs persistently in the VFS domain (shadow elision), row 3 (Core→VFS for writeback) is elided, reducing the effective count to 6. The overhead claim (<1%) is valid either way.

DMA mapping lifecycle: The streaming DMA mapping for page cache pages is created at bio construction time (step 5) with dma_map_page(page, offset, len, DMA_TO_DEVICE) and unmapped at bio completion (step 11) with dma_unmap_page(). The IOMMU ensures the NVMe device can only access the mapped pages for the duration of the I/O. No bounce buffers are needed when the page cache pages are within the device's DMA-addressable range (which is always true for 64-bit DMA capable NVMe devices).

11.7.3 TCP Receive Path

Step 1: NIC DMAs packet to pre-posted receive buffer
        [IOMMU-validated, zero-copy]

Step 2: NIC raises interrupt -> domain switch to NIC driver (~23 cycles on x86 MPK)

Step 3: NIC driver processes descriptor, identifies packet
        Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)

Step 4: UmkaOS Core dispatches to umka-net -> domain switch to umka-net (~23 cycles on x86 MPK)

Step 5: umka-net processes TCP headers, copies payload to socket buffer
        (This is the one "copy" -- same as Linux. Technically a move
         of ownership, not a memcpy, when using page-flipping.)

Step 6: Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)
        UmkaOS Core signals epoll/io_uring waiters

Step 7: User reads from socket via read()/recvmsg()/io_uring
        Data delivered from socket buffer (zero-copy with MSG_ZEROCOPY)

Total domain switches: 4 (2 domain entries × 2 switches each: enter NIC driver + exit, enter umka-net + exit)
Total domain switch overhead: ~92 cycles on x86 MPK (~20 ns at 4.6 GHz) on a ~5 μs path = ~0.4% (see Section 11.2 for other architectures)

The ~92 cycles (~0.4%) is the worst-case cost without PKRU shadow elision. With elision (Section 11.2), 1-2 redundant writes are skipped, reducing overhead to ~46-69 cycles (~0.2-0.3%).


11.8 IPC Architecture and Message Passing

Section 11.7 describes the data plane -- how bytes flow from user space through Tier 1 drivers to devices and back with zero copies. This section describes the control plane that Section 11.7's data plane relies on: the IPC primitives that carry commands, completions, capability transfers, and event notifications between isolation domains.

11.8.1 IPC Primitives

UmkaOS's IPC model has four distinct layers, each serving a different boundary:

1. Intra-kernel IPC (between isolation domains within Ring 0): domain ring buffers. Shared memory regions with per-domain access controlled by the isolation domain register (WRPKRU on x86, POR_EL0 on AArch64, DACR on ARMv7, etc.). Zero-copy, zero-syscall. This is the transport for all umka-core to Tier 1 driver communication — the command/completion flow shown in Section 11.7's NVMe and TCP examples. The "domain switch" at each step in those diagrams crosses a domain ring buffer boundary.

2. Kernel-user IPC (between kernel and user space): io_uring submission/completion rings. Standard Linux ABI (Section 19.3). Applications submit SQEs to the io_uring submission ring and receive CQEs from the completion ring. This is the only I/O interface that user space sees. UmkaOS's io_uring implementation is fully compatible with Linux 6.x semantics -- unmodified applications work without changes.

3. Inter-process IPC (between user processes): POSIX and System V IPC. Pipes, Unix domain sockets, POSIX message queues, and POSIX shared memory — implemented via the syscall interface (Section 19.1). These are performance-critical hot paths for many production workloads: Unix domain sockets are the backbone of high-throughput service meshes (nginx↔php-fpm, gRPC over UDS, container networking via socket activation), pipes sustain multi-Gbps throughput in data processing pipelines, and POSIX shared memory backs zero-copy IPC in databases (PostgreSQL), browsers (Chrome), and multimedia frameworks. UmkaOS optimizes all three: Unix sockets use user-space-mapped SPSC ring buffers with 2-copy transfer (vs Linux's 3-copy), pipes use lock-free algorithms with vmsplice zero-copy page gifting, and shared memory maps physical frames directly into multiple address spaces via capability grants. See Section 17.3 for the full specifications.

System V IPC (shmget, msgget, semget) is fully supported and performance-optimized. Major production databases depend on it: PostgreSQL uses SysV shared memory for the shared buffer pool and SysV semaphores for lightweight locking; Oracle Database uses SysV shared memory segments for the SGA (System Global Area) and SysV semaphores extensively for process coordination. UmkaOS pre-allocates SysV semaphore arrays at semget() time with AtomicU16 per semaphore — no heap allocation during semop(). SysV shared memory segments are direct page-table mappings (same physical frames in multiple address spaces), identical in performance to POSIX shm_open. See Section 17.1 for the full SysV IPC namespace implementation.

11.8.1.1 IPC Replaceability and Live Evolution

IPC spans all four layers of the live evolution model (Section 13.18):

IPC Layer | Replaceability | Rationale
Domain ring buffers (Layer 1, intra-kernel) | Data: No. Dispatch: Yes. Entry format: Evolvable. | The DomainRingBuffer header layout is the wire protocol between isolation domains — non-replaceable data. But the dispatch logic is in the replaceable KABI service layer, and entry formats evolve via versioned headers (RingEntryHeader) with append-only fields and version negotiation (see Section 11.8 below and Section 13.18).
io_uring (Layer 2, kernel-user) | Protocol: No. Implementation: Yes. | The SQ/CQ ring layout is Linux ABI — non-replaceable. The io_uring implementation (SQE processing, CQE posting, registered buffers) lives in the replaceable SysAPI layer (Layer 2) or KABI services (Layer 3).
Pipes, Unix sockets, POSIX IPC (Layer 3, inter-process) | Yes (via KABI service replacement). | Pipe buffers live inside VFS; Unix sockets live inside the networking stack. Both VFS and networking implement ServiceEvolvable (Section 13.18) with incremental state export. The pipe buffer contents, socket state, and IPC namespace data are serialized during service replacement.
SysV IPC (Layer 3, inter-process) | Yes (via cgroup/namespace service replacement). | SysV shared memory segments are page-table mappings (physical frames persist across service replacement). SysV semaphore and message queue state is serialized as part of the IPC namespace.

Key insight: Unlike the memory allocator (where the hot path is magazine pop/push and policy runs on warm path), IPC hot paths ARE the data structure algorithms themselves — the lock-free ring protocol, the pipe write CAS sequence, the AF_UNIX SPSC ring. There is no separate "IPC policy" to factor out. Instead, IPC replaceability is achieved at the service layer: the entire VFS or networking stack can be live-replaced, carrying all IPC state with it. Individual IPC data structures (PipeBuffer, UserSpscRing, SysvSemaphore) are migrated during the service replacement's state serialization phase.

4. Hardware peer IPC (between the host kernel and a device running UmkaOS firmware): domain ring buffers over PCIe P2P. A device that participates as a first-class cluster member (Section 5.2) communicates with the host kernel via the same domain ring buffer protocol used for intra-kernel IPC (Layer 1), transported over PCIe peer-to-peer MMIO and MSI-X interrupts instead of in-process memory. From the host kernel's perspective, the device firmware endpoint is just another ring buffer pair — the same DomainRingBuffer structure, the same ClusterMessageHeader wire format, the same message-passing discipline. The transport medium changes (PCIe instead of cache-coherent RAM); the abstraction does not. This is not a compatibility shim. It is the intended model for first-class hardware participation: a SmartNIC, DPU, computational storage device, or RISC-V accelerator running UmkaOS presents an IPC endpoint identical in structure to an in-kernel Tier 1 driver, while owning its own scheduler, memory manager, and capability space. See Section 5.2 for the wire protocol, implementation paths (A/B/C), and near-term hardware targets.

The terms are not interchangeable. When this document says "io_uring", it means the userspace-facing async I/O interface. When it says "domain ring buffer", it means the internal kernel transport between isolation domains. An io_uring SQE from userspace triggers an isolation domain switch to a Tier 1 driver via a domain ring buffer — the two mechanisms are connected but architecturally distinct.

User space                        Kernel (Ring 0)
+-----------+                     +------------------------------------------+
| App       |                     |  umka-core         Tier 1 driver         |
|           |   io_uring SQE      |                                          |
|  SQ ring -|-------------------->|-> dispatch -----> domain cmd ring --------->|
|           |                     |                   (WRPKRU)               |
|           |   io_uring CQE      |                                          |
|  CQ ring <|--------------------<|<- collect  <----- domain cpl ring <---------|
|           |                     |                   (WRPKRU)               |
+-----------+                     +------------------------------------------+
     Layer 2                           Layer 1 (internal)
  (Linux ABI)                      (domain ring buffers)

11.8.2 Domain Ring Buffer Design

Each Tier 1 driver has a pair of ring buffers shared with umka-core: a command ring (umka-core produces, driver consumes) and a completion ring (driver produces, umka-core consumes). Both use the same underlying structure:

Weak-isolation fast path (isolation=performance or no fast isolation mechanism): When drivers run as Tier 0 (no CPU-side isolation), domain ring buffers remain the IPC mechanism — the data structure and lock-free protocol are unchanged — but the domain register switches are elided. On architectures with hardware domains (MPK, POE, DACR), each ring buffer access requires toggling the domain register to grant access to the shared region (~23-80 cycles per switch, 4 switches per I/O round-trip = ~92-320 cycles). Without hardware domains, the ring buffer memory is mapped with normal kernel permissions and no domain switch is needed: the producer writes directly, the consumer reads directly, and the only synchronization is the existing atomic head/published/tail protocol. This eliminates the dominant per-I/O isolation overhead on RISC-V (~800-2000 cycles saved per I/O) and on any platform running isolation=performance. The ring buffer structure itself is unchanged — only the access-control wrapper is bypassed.

/// A lock-free single-producer single-consumer ring buffer that lives in
/// a shared memory region accessible to exactly two isolation domains.
///
/// The header occupies two cache lines (one producer-owned, one
/// consumer-owned). Ring data follows immediately after the header,
/// aligned to `entry_size`.
// kernel-internal, not KABI
#[repr(C, align(64))]
pub struct DomainRingBuffer {
    /// Write claim position. Producers CAS this to claim slots (MPSC mode).
    /// In SPSC mode, only the single producer increments this.
    ///
    /// `AtomicU64`: u32 would wrap in ~29 seconds at 148 Mpps (100 Gbps with
    /// 64-byte packets); u64 wraps after ~4 billion years at the same rate.
    /// u64 counters eliminate the need for modular wrap-around logic in the hot path.
    pub head: AtomicU64,
    /// Published position. In MPSC mode, a producer increments this (in order)
    /// AFTER writing data to the claimed slot. The consumer reads `published`
    /// (not `head`) to determine how many entries are ready. In SPSC mode,
    /// `published` always equals `head` (the single producer updates both).
    /// In broadcast mode, this field is NOT the source of truth —
    /// `last_enqueued_seq` (u64) is the authoritative write position. The
    /// `published` field is derived (`write_seq / 2`) for diagnostic
    /// compatibility only. Implementations MUST NOT increment `published`
    /// independently in broadcast mode.
    pub published: AtomicU64,
    /// Number of entries. Must be a power of two.
    pub size: u32,
    /// Bytes per entry. Fixed at ring creation time.
    pub entry_size: u32,
    /// Number of entries dropped due to ring-full condition.
    /// Monotonically increasing. Exposed via umkafs diagnostics (unified-object-namespace).
    pub dropped_count: AtomicU64,
    /// Sequence number of the last successfully enqueued entry.
    /// Consumers use this to detect gaps: if the consumer's last-seen
    /// sequence is less than `last_enqueued_seq - ring_size`, entries
    /// were lost.
    /// In broadcast mode, this field serves as `write_seq` for torn-read
    /// prevention (incremented by 2 per entry; odd = write-in-progress,
    /// even = stable). See "Broadcast channels" below.
    pub last_enqueued_seq: AtomicU64,
    /// Ring lifecycle state. Written by crash recovery or graceful shutdown;
    /// read by producers in spin loops to detect partner death.
    ///   0 = Active (normal operation)
    ///   1 = Disconnected (producer died or ring being torn down)
    /// Producers check this in every spin iteration and bail with
    /// `Err(Disconnected)` if set. The crash recovery path (Section 11.7)
    /// sets this AFTER publishing poison markers for any in-flight
    /// slots (see "Producer death recovery" below).
    pub state: AtomicU8,
    /// Padding to fill the producer cache line to exactly 64 bytes.
    /// Layout: head(8) + published(8) + size(4) + entry_size(4)
    ///       + dropped_count(8) + last_enqueued_seq(8) + state(1)
    ///       + _pad(23) = 64.
    _pad_producer: [u8; 23],
    /// Read position. Only the consumer increments this.
    /// On a separate cache line from head/published to avoid false sharing.
    ///
    /// `AtomicU64`: same rationale as `head` — no wrap-around at any realistic rate.
    pub tail: AtomicU64,
    /// Optional: duplicated ring parameters for consumer-only cache line access.
    /// Implementations SHOULD initialize these from the producer's `size` and
    /// `entry_size` at ring creation time. See the false-sharing note below.
    pub consumer_size: u32,
    pub consumer_entry_size: u32,
    /// Padding to fill the consumer cache line to exactly 64 bytes.
    /// tail(8) + consumer_size(4) + consumer_entry_size(4) + pad(48) = 64.
    _pad_consumer: [u8; 48],
    // Ring data follows: `size * entry_size` bytes.
}
/// Header size is load-bearing: ring data starts at `self as *const u8 + 128`.
const_assert!(size_of::<DomainRingBuffer>() == 128);
/// Errors returned by ring buffer produce operations.
pub enum RingError {
    /// Ring is full — no free slots available.
    Full,
    /// Ring partner has died (crash recovery set `state = Disconnected`).
    /// Caller must not retry; propagate the error.
    Disconnected,
    /// System severely overloaded — entry was discarded (poison marker written).
    /// The entry was lost but the ring remains operational.
    Overloaded,
    /// Completion did not arrive within the caller's timeout. Used by
    /// `CrossDomainRing::wait_completion()` ([Section 12.8](12-kabi.md#kabi-domain-runtime)) when
    /// the ring partner is alive but slow. The `kabi_call!` macro maps this
    /// to `KabiError::Timeout`. The caller may retry or escalate.
    Timeout,
}

Note on false sharing: size and entry_size are read-only after initialization and are read by both producer and consumer. They are placed on the producer's cache line for layout simplicity, but implementations SHOULD duplicate these values on the consumer's cache line (as consumer_size and consumer_entry_size) to avoid false sharing. The consumer reads only from its own cache line.

Lock-free SPSC protocol. The producer writes an entry at data[head % size], then increments head and published together (in SPSC mode they are always equal). The consumer reads the entry at data[tail % size] when published > tail, then increments tail. If the first byte of an entry is 0xFF (poison marker), the consumer skips the entry and increments tail without processing — this occurs only when a producer hit the Err(Overloaded) path and had to force-publish a discarded slot. No locks, no CAS, no contention. The head/published fields are on one cache line (producer-owned); tail is on a separate cache line (consumer-owned). This eliminates false sharing on hot paths.

Memory ordering. The producer uses Release ordering on the published store. The consumer uses Acquire ordering on the published load. This pair ensures that the entry data written by the producer is visible to the consumer before the consumer sees the updated published counter. On x86-64 this compiles to plain MOV instructions (TSO provides the required ordering for free). On AArch64, RISC-V, PowerPC, and LoongArch64, the compiler emits the appropriate barriers (stlr/ldar on ARM, fence-qualified atomics on RISC-V, lwsync/isync on PPC, DBAR on LoongArch). On s390x, the TSO (total store ordering) memory model provides Release/Acquire semantics for free — plain loads and stores suffice.

Architecture | Producer (Release store) | Consumer (Acquire load) | Notes
x86-64 | MOV (TSO) | MOV (TSO) | No explicit barriers needed
AArch64 | STLR | LDAR | ARM's acquire/release instructions
RISC-V 64 | amoswap.d.rl or fence rw,w + sd | ld + fence r,rw | RVWMO requires explicit fencing
PPC32 | sync + stw | lwz + isync | Weak ordering; e500 lacks lwsync — must use sync/msync
PPC64LE | lwsync + std | ld + isync | Same model as PPC32; lwsync preferred over sync
s390x | ST (plain store) | L (plain load) | Strongly ordered (TSO) — Release/Acquire free. CSG for atomic CAS. BCR 14,0 for full fence
LoongArch64 | DBAR 0x12 + ST.D | LD.D + DBAR 0x14 | Weakly ordered; DBAR 0x12 = store-release, DBAR 0x14 = load-acquire. DBAR 0 for full fence

Backpressure. When the ring is full (head - tail == size), the producer cannot write. For SPSC rings (command and completion channels), umka-core handles this in two stages: (1) spin for up to 64 iterations checking whether the consumer has advanced tail — this covers the common case where the driver is actively draining; (2) if the ring is still full after spinning, yield to the scheduler via sched_yield_current() and retry on the next scheduling quantum. Both stages check state on each iteration — if the ring is Disconnected (partner driver died), the producer returns Err(Disconnected) immediately rather than waiting for a dead consumer to drain. This avoids wasting CPU on a stalled driver while keeping the fast path lock-free. For MPSC rings (event channels), backpressure behavior depends on the calling context — see the MPSC producer API contract in Section 11.8 for the distinction between blocking (mpsc_produce_blocking(), thread context only) and non-blocking (mpsc_try_produce(), safe in any context) variants.

11.8.3 Channel Types and Capability Passing

The ring buffer primitive from Section 11.8 is instantiated in four channel configurations:

Command channels (SPSC): umka-core -> driver. One per driver instance. Carries I/O requests (read, write, discard), configuration commands (set queue depth, enable feature), and health queries (heartbeat, statistics request). Umka-core is the sole producer; the driver is the sole consumer.

Completion channels (SPSC): driver -> umka-core. One per driver instance. Carries I/O completions (success, error, partial), interrupt notifications (forwarded from the hardware interrupt handler), and error reports (device errors, internal driver faults). The driver is the sole producer; umka-core is the sole consumer.

Event channels (MPSC): multiple drivers -> umka-core event loop. Used for asynchronous events that do not belong to a specific I/O flow: device hotplug notifications, link state changes (NIC up/down), thermal throttle alerts, error notifications requiring global coordination. Multiple drivers may need to signal the same event loop, so the MPSC variant uses a compare-and-swap on head to coordinate multiple producers:

MPSC scaling limits: For event channels with >10 concurrent producers (unusual but possible in systems with many independent drivers signaling a single event loop), CAS contention on the ring head can degrade performance. In this regime, hierarchical fanout is recommended: drivers signal per-device intermediate rings, and an aggregator thread (or softirq batch) forwards events to the central ring. This reduces contention from O(producers) to O(1) at the cost of one additional indirection. The default single-ring design is optimized for the common case of 2-5 active producers per channel.

Per-CPU deferred publish buffer — When Phase 2 publication would require spinning for too long (>64 iterations, meaning an earlier producer is slow), the producer defers its publication by storing the ring pointer and slot into a small per-CPU buffer, then re-enables interrupts. This ensures interrupt-disabled windows remain bounded to ~1-2μs.

/// Per-CPU buffer for deferred MPSC ring publications.
///
/// When Phase 2 cannot complete within 64 spin iterations (because an earlier
/// producer has not yet written its data), the producer stores its pending
/// publication here and re-enables interrupts immediately. The drain function
/// is called at the start of every subsequent `send()` and at idle entry,
/// so deferred publications are completed within bounded time.
///
/// Capacity 16: supports up to 16 simultaneously stalled producers across
/// different rings. Under normal load, 0-2 entries are pending; 16 is
/// reached only under extreme contention or scheduling stalls.
pub struct DeferredPublishBuf {
    /// Ring of (published_counter_ptr, slot_index) pairs awaiting Phase 2.
    /// `published_ptr` is a pointer into the ring's AtomicU64 `published` field.
    /// `slot` is the index this producer claimed in Phase 1.
    /// Uses null `published_ptr` as empty sentinel (avoids the 8-byte
    /// `Option` discriminant overhead per entry). Empty = `published_ptr.is_null()`.
    pub entries: [DeferredEntry; 16],
    /// Head index (next slot to fill).
    pub head: u8,
    /// Tail index (next slot to drain).
    pub tail: u8,
}

pub struct DeferredEntry {
    /// Pointer to the ring's `published` counter (the one this entry must advance).
    ///
    /// **Dangling pointer safety**: This raw pointer references the ring's
    /// `published` field, which is valid only as long as the ring exists.
    /// If a ring is torn down while deferred entries reference it, these
    /// pointers would dangle. To prevent this, each entry also stores
    /// `ring_generation` — the ring's monotonic generation counter at entry
    /// creation time. The drain function compares `ring_generation` against
    /// the ring's current generation before dereferencing `published_ptr`;
    /// mismatched entries are discarded without dereferencing the pointer.
    /// Ring teardown increments the ring's generation counter, invalidating
    /// all outstanding deferred entries for that ring.
    // *mut because Phase 2 drain writes through this pointer via
    // AtomicU64::store(&self).  Sound with *const (Atomic uses interior
    // mutability), but *mut documents the write intent.
    pub published_ptr: *mut AtomicU64,
    /// Generation of the ring at the time this entry was created.
    /// Stale entries (ring_generation != ring.generation.load(Acquire))
    /// are discarded without dereferencing `published_ptr`.
    pub ring_generation: u64,
    /// Slot index claimed by Phase 1 CAS.
    pub slot: u64,
}

DeferredPublishBuf is stored in the per-CPU data structure alongside CpuLocal fields. deferred_publish_drain() iterates tail..head, and for each entry first validates ring_generation against the ring's current generation counter (an AtomicU64 incremented on ring teardown). If the generations match, it attempts Phase 2 publication: if published == slot - 1, advance published to slot (success, remove from buffer); otherwise leave in place for the next drain pass. If the generations do not match, the entry is silently discarded (the ring was torn down and the pointer is stale).

Overflow behavior: When DeferredPublishBuf reaches capacity (16 entries), the producer performs an eager flush: all 16 pending entries are written to the domain ring buffer before adding the new entry. If the ring buffer is full (consumer is behind), the flush blocks until sufficient space is available — this provides natural backpressure. A stalled Tier 1 consumer will stall its producer, preventing unbounded deferred entry accumulation. The 16-entry buffer is a coalescing optimization, not a queue; it is never intended to hold more than a few entries in steady state.

impl DomainRingBuffer {
    /// MPSC non-blocking produce: multiple producers coordinate via CAS on head.
    /// Returns Err(RingError::Full) immediately if the ring is full, or
    /// Err(RingError::Disconnected) if the ring partner has died.
    /// Safe to call from any context (thread, IRQ, softirq).
    /// See "MPSC producer API contract" below for the blocking variant.
    ///
    /// Two-phase commit protocol:
    ///   Phase 1 (claim): CAS on `head` to reserve a slot. After CAS success,
    ///     the slot is exclusively ours but NOT yet visible to the consumer.
    ///   Phase 2 (publish): After writing data, wait until `published` catches
    ///     up to our slot (ensuring in-order publication), then advance `published`.
    ///
    /// The consumer reads `published` (not `head`) to determine ready entries.
    /// This eliminates the data race where a consumer sees an incremented `head`
    /// but reads a slot whose data has not yet been written.
    pub fn mpsc_try_produce(&self, entry: &[u8]) -> Result<(), RingError> {
        // NMI handlers MUST NOT call mpsc_try_produce() — see NMI/MCE safety
        // section above. NMIs cannot be masked by local_irq_save(), so an NMI
        // between Phase 1 CAS and Phase 2 publication causes a deadlock.
        debug_assert!(
            !arch::current::cpu::in_nmi(),
            "mpsc_try_produce() called from NMI context — use per-CPU NMI buffer instead"
        );
        // --- BEGIN interrupt-disabled section ---
        // Disable interrupts BEFORE the Phase 1 CAS to prevent a deadlock:
        // if an interrupt fires between a successful CAS (slot claimed) and
        // Phase 2 (published advanced), an interrupt handler calling
        // mpsc_try_produce on the same ring would spin forever in Phase 2
        // waiting for the interrupted thread's slot to be published. Moving
        // local_irq_save() here eliminates that race window entirely.
        // The CAS loop is bounded (succeeds or returns RingError::Full), so the
        // additional interrupt-disabled time is minimal.
        let irq_state = arch::current::interrupts::local_irq_save();

        // Phase 1: Claim a slot by advancing head (interrupts already disabled).
        let my_slot;
        loop {
            let current_head = self.head.load(Ordering::Relaxed);
            let current_tail = self.tail.load(Ordering::Acquire);

            // Ring disconnected?
            if self.state.load(Ordering::Acquire) != 0 {
                arch::current::interrupts::local_irq_restore(irq_state);
                return Err(RingError::Disconnected);
            }
            // Ring full?
            if current_head.wrapping_sub(current_tail) >= self.size as u64 {
                arch::current::interrupts::local_irq_restore(irq_state);
                return Err(RingError::Full);
            }

            // Strong CAS required: on AArch64 LL/SC architectures, compare_exchange_weak
            // permits spurious failures. In an interrupt-disabled window, spurious failures
            // cause unbounded spinning — use compare_exchange (strong) to prevent this.
            // Attempt to claim the slot.
            if self
                .head
                .compare_exchange(
                    current_head,
                    current_head.wrapping_add(1),
                    Ordering::AcqRel,
                    Ordering::Relaxed,
                )
                .is_ok()
            {
                my_slot = current_head;
                break;
            }
            core::hint::spin_loop();
        }

        // Write entry data to the claimed slot.
        let offset = (my_slot % self.size as u64) as usize * self.entry_size as usize;
        // SAFETY: offset is within bounds (power-of-two size, fixed entry_size).
        // The slot is exclusively ours because we won the CAS race.
        unsafe {
            core::ptr::copy_nonoverlapping(
                entry.as_ptr(),
                self.data_ptr().add(offset),
                self.entry_size as usize,
            );
        }

        // Phase 2: Publish. Wait until all prior slots are published, then
        // advance `published` to make our slot visible to the consumer.
        // This spin is brief: it only waits for producers that claimed earlier
        // slots to finish their writes. Under normal operation, this completes
        // in 1-2 iterations.
        //
        // Drain deferred publications from previous calls. Before attempting
        // our own Phase 2, drain ALL entries from the per-CPU deferred publish
        // ring buffer. This ensures that deferrals from prior send() calls are
        // re-attempted (and completed) before new entries are published, preventing
        // silent loss if multiple producers defer in succession.
        //
        // The drain takes no arguments — each deferred entry stores a pointer to
        // the ring's `published` counter alongside the slot index, so the drain
        // correctly targets the ring that each slot belongs to (a producer may
        // have deferred on ring A and now be calling send() on ring B).
        arch::current::cpu::deferred_publish_drain();

        // **IRQ-disabled window**: Interrupts are disabled only during Phase 1
        // CAS + Phase 2 publication attempt (bounded at 64 iterations). If Phase 2
        // exceeds 64 spins, the entry is deferred and interrupts are restored
        // immediately. The 256-iteration fallback spin (if the defer buffer is full)
        // runs with interrupts **re-enabled**. Worst-case IRQ-disabled duration:
        // ~64 CAS operations ≈ 1-2μs.
        //
        // **Phase 2 uses compare_exchange (strong), not compare_exchange_weak.**
        // On AArch64 LL/SC architectures, compare_exchange_weak can fail spuriously
        // (no actual contention — just LL/SC interference from an unrelated store).
        // In Phase 2, spurious failures increment spin_count, potentially exhausting
        // the 64-iteration budget and triggering unnecessary deferred-publish overhead.
        // Strong CAS ensures the spin count only advances on genuine contention (another
        // producer with an earlier slot has not yet published), keeping the common-case
        // IRQ-disabled window at the expected 1-3 iterations.
        //
        // Bounded publish wait: To prevent unbounded interrupt-disabled spinning,
        // Phase 2 uses a bounded spin of 64 iterations. If `self.published` has
        // not advanced to `my_slot` within 64 iterations, the producer stores the
        // ring's `published` pointer, the ring's generation, and its slot index
        // (a `DeferredEntry`) into a per-CPU deferred publish ring buffer, then
        // re-enables interrupts. The drain path (at the start of the next
        // `send()` call and on the consumer side) re-attempts publication on
        // behalf of the stalled producer, using the stored ring pointer to
        // target the correct ring. The per-CPU deferred buffer is a ring
        // (`[Option<DeferredEntry>; 16]` with head/tail indices) rather than a
        // single `Option<u64>`, so multiple consecutive deferrals (potentially
        // targeting different rings) can queue
        // entries (increased from an earlier 4-entry design to ensure bounded-time
        // behavior under heavy contention). If the deferred buffer itself is full
        // (16 outstanding deferrals — an extreme edge case indicating severe system
        // overload), the producer re-enables interrupts before falling back to a
        // bounded spin (up to 256 iterations with `core::hint::spin_loop()`). If the
        // bounded spin also fails, the producer returns `Err(Overloaded)` to the
        // caller, which applies backpressure (increment `dropped_count` for IRQ
        // producers, or yield and retry for thread-context producers). This ensures
        // the interrupt-disabled window is always bounded. The common-case bound is:
        // Phase 1 CAS (~5ns, usually 1 attempt) + data write +
        // drain (up to 16 entries * CAS each = ~80ns) + Phase 2 spin
        // (up to 64 * ~5ns = ~320ns) = ~410ns in the common case.
        let mut spin_count = 0u32;
        loop {
            if self
                .published
                .compare_exchange(
                    my_slot,
                    my_slot.wrapping_add(1),
                    Ordering::Release,
                    Ordering::Relaxed,
                )
                .is_ok()
            {
                break;
            }
            spin_count += 1;
            if spin_count >= 64 {
                // Exceeded bounded spin — defer completion to the consumer drain
                // path and re-enable interrupts to avoid unbounded IRQ latency.
                // The deferred buffer holds up to 16 entries; if it is full,
                // re-enable IRQs and fall through to bounded spin (system overloaded).
                // Fence ensures entry data written at the slot is visible to
                // all CPUs before the slot can be published by a deferred drain
                // on any CPU. Without this, on weakly-ordered architectures
                // (AArch64, RISC-V, PPC), a different CPU draining and publishing
                // via CAS(Release) would only order its own stores, not the
                // original writer's stores.
                core::sync::atomic::fence(Ordering::Release);
                if arch::current::cpu::deferred_publish_enqueue(
                    &self.published,
                    self.generation.load(Ordering::Acquire),
                    my_slot,
                ) {
                    arch::current::interrupts::local_irq_restore(irq_state);
                    return Ok(());
                }
                // Deferred buffer full — re-enable IRQs to preserve RT
                // guarantees, then bounded spin outside the IRQ-disabled window.
                arch::current::interrupts::local_irq_restore(irq_state);
                let mut fallback_spin = 0u32;
                loop {
                    if self.state.load(Ordering::Acquire) != 0 {
                        return Err(RingError::Disconnected);
                    }
                    if self.published.compare_exchange(
                        my_slot, my_slot.wrapping_add(1),
                        Ordering::Release, Ordering::Relaxed,
                    ).is_ok() {
                        return Ok(());
                    }
                    fallback_spin += 1;
                    if fallback_spin >= 256 {
                        // System severely overloaded. We must still advance `published`
                        // past our slot to prevent permanently wedging the ring.
                        // Write a poison marker (entry_type = 0xFF) into the slot so
                        // the consumer knows to skip it, then spin until we can
                        // advance `published`. This spin waits for earlier producers
                        // to publish. If an earlier producer has died (Tier 1/2 crash
                        // between Phase 1 and Phase 2), the crash recovery path will
                        // have set `state = Disconnected` and published poison markers
                        // for the dead producer's slots, unblocking this spin. We
                        // check `state` on every iteration to detect this case.
                        let offset = (my_slot % self.size as u64) as usize * self.entry_size as usize;
                        // SAFETY: slot is ours (won the Phase 1 CAS); offset in bounds.
                        unsafe { *self.data_ptr().add(offset) = 0xFF; } // poison marker
                        let mut publish_spin = 0u32;
                        while self.published.compare_exchange(
                            my_slot, my_slot.wrapping_add(1),
                            Ordering::Release, Ordering::Relaxed,
                        ).is_err() {
                            if self.state.load(Ordering::Acquire) != 0 {
                                return Err(RingError::Disconnected);
                            }
                            publish_spin += 1;
                            if publish_spin >= 4096 {
                                // Earlier producer is alive but severely delayed.
                                // Yield the CPU to allow it to make progress.
                                // This prevents livelock on the same core.
                                arch::current::cpu::yield_cpu();
                                publish_spin = 0;
                            }
                            core::hint::spin_loop();
                        }
                        return Err(RingError::Overloaded);
                    }
                    core::hint::spin_loop();
                }
            }
            core::hint::spin_loop();
        }

        // --- END interrupt-disabled section ---
        arch::current::interrupts::local_irq_restore(irq_state);

        Ok(())
    }
}

To prevent data loss when no future send() occurs, the per-CPU idle entry hook (cpu_idle_enter(), Section 8.4) drains the deferred publish buffer for all MPSC rings registered on that CPU. Additionally, when a thread that performed a deferred publish is migrated to a different CPU, the migration path drains the source CPU's deferred buffer. These hooks ensure deferred entries are published within a bounded window (at most one scheduler tick, ~4ms).

MPSC Phase 2 preemption hazard and mitigation. The Phase 2 publish spin in mpsc_try_produce() can stall if a producer is preempted (by an interrupt or scheduler) between Phase 1 (CAS on head) and Phase 2 (advancing published). While preempted, the published counter is stuck at the preempted producer's slot, blocking all subsequent producers from making their entries visible to the consumer -- even though their data is already written. This is not a deadlock (the preempted producer will eventually resume and complete Phase 2), but it can cause unbounded latency spikes on the consumer side.

Mitigation: UmkaOS addresses this in three ways:

  1. Interrupts disabled from before Phase 1 through Phase 2. The MPSC produce path disables interrupts (not just preemption) BEFORE the Phase 1 CAS, keeping them disabled through the Phase 2 published counter advancement. This prevents the following deadlock scenario: on a uniprocessor (or any CPU), thread T1 claims slot N via CAS, then an IRQ fires and the IRQ handler claims slot N+1 via CAS. The IRQ handler's Phase 2 spin waits for published to reach N, but T1 cannot advance published because it is interrupted — deadlock. Disabling interrupts before Phase 1 eliminates this window entirely (there is no gap between CAS success and interrupt disabling). The interrupt-disabled region covers: Phase 1 CAS (bounded — succeeds or returns Full/Disconnected, typically 1 attempt = ~5ns), data write, deferred drain (up to 16 entries × ~5ns CAS = ~80ns), and Phase 2 publish CAS (up to 64 iterations = ~320ns), totaling ~410ns in the common case. On multiprocessor systems, disabling preemption alone would suffice (another CPU could run the interrupted producer), but disabling interrupts is correct on all configurations and the cost is negligible.

  2. Consumer-side stuck detection and recovery (defense in depth). The consumer (umka-core event loop) maintains a watchdog: if head > published for more than 1000 consecutive poll iterations (~10μs), the consumer treats the gap as a stalled producer. If the gap persists beyond 10ms (configurable), the consumer initiates forced slot recovery: for each unpublished slot from published to head, write a poison marker (0xFF) and advance published. This unblocks any live producers spinning on Phase 2 while discarding the dead producer's incomplete entries. The consumer logs a diagnostic event with the number of force-published slots and the ring identity.

This consumer-side recovery is a safety net for the case where the crash recovery path (Section 11.9, step 5a) has not yet run — e.g., the driver faulted but the FMA detection latency exceeds 10ms, or the fault was a silent hang rather than a trap. Under normal operation, the crash recovery path (step 5a below) handles slot recovery before the consumer watchdog fires.

  3. Interrupt handlers use bounded produce. Interrupt handlers that produce to MPSC rings use mpsc_try_produce(), which fails with Err(Full) if the ring is full rather than spinning. This prevents interrupt handlers from spinning on a full ring while the consumer (which runs in thread context) cannot drain it — if the consumer needs to be scheduled to make progress, a spinning IRQ handler creates an unbounded spin or deadlock.
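The forced slot recovery performed by the consumer watchdog (mitigation 2) can be sketched as follows. This is a simplified model under stated assumptions: the ring data region is a plain byte slice rather than a raw pointer, and `force_recover_slots` and `POISON` are illustrative names, not the real kernel API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const POISON: u8 = 0xFF; // entry_type marker the consumer skips

/// Watchdog recovery sketch: poison every claimed-but-unpublished slot
/// (published..head), then advance `published` to `head` so any live
/// producers spinning in Phase 2 can proceed. Returns the number of
/// force-published slots (logged as a diagnostic in the real system).
fn force_recover_slots(
    data: &mut [u8],   // ring data region
    entry_size: usize, // bytes per slot
    size: u64,         // ring capacity (power of two)
    published: &AtomicU64,
    head: u64,
) -> u64 {
    let start = published.load(Ordering::Acquire);
    for slot in start..head {
        let offset = (slot % size) as usize * entry_size;
        data[offset] = POISON; // consumer will skip this entry
    }
    // Single Release store: all poison writes become visible before
    // any CPU observes the advanced counter.
    published.store(head, Ordering::Release);
    head - start
}

fn main() {
    // 4-slot ring, 16-byte entries; slots 1 and 2 claimed but unpublished.
    let published = AtomicU64::new(1);
    let mut data = vec![0u8; 16 * 4];
    let n = force_recover_slots(&mut data, 16, 4, &published, 3);
    assert_eq!(n, 2);
    assert_eq!(published.load(Ordering::Relaxed), 3);
    assert_eq!(data[16], 0xFF); // slot 1 poisoned
    assert_eq!(data[32], 0xFF); // slot 2 poisoned
}
```

The same routine is what crash recovery step 5a performs; the watchdog merely invokes it later, after the 10ms stall threshold, when no crash handler has run.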

MPSC producer API contract. The MPSC ring exposes two producer entry points with distinct calling context requirements:

impl DomainRingBuffer {
    /// Non-blocking produce. Returns immediately if the ring is full or
    /// disconnected. Safe to call from ANY context (thread, IRQ, softirq).
    ///
    /// On success: entry is enqueued and will be visible to the consumer
    /// after Phase 2 publish completes.
    /// On Err(Full): the ring has no free slots. The caller is responsible
    /// for handling the overflow (see overflow accounting below).
    /// On Err(Disconnected): the ring partner has died. Caller must not retry.
    pub fn mpsc_try_produce(&self, entry: &[u8]) -> Result<(), RingError>;

    /// Blocking produce. Spins (with bounded spin + yield) until a slot
    /// becomes available, then enqueues the entry.
    ///
    /// MUST NOT be called with interrupts disabled. If the ring is full,
    /// this function spins waiting for the consumer to drain entries. If
    /// interrupts are disabled, the consumer (which runs in thread context)
    /// may never be scheduled, causing an unbounded spin.
    ///
    /// In debug builds: panics immediately if called with interrupts
    /// disabled (detected via arch::current::interrupts::are_enabled()).
    /// In release builds: falls back to mpsc_try_produce() with overflow
    /// accounting if interrupts are disabled (defense in depth — the debug
    /// panic should catch all such call sites during development).
    pub fn mpsc_produce_blocking(&self, entry: &[u8]);

    /// Return a raw pointer to the start of the ring data region.
    ///
    /// The data region follows immediately after the 128-byte header.
    /// Each entry occupies `entry_size` bytes. The total data region
    /// is `size * entry_size` bytes.
    ///
    /// # Safety
    ///
    /// The returned pointer is valid for `size * entry_size` bytes.
    /// The caller must ensure exclusive access to the slot before writing.
    #[inline]
    pub unsafe fn data_ptr(&self) -> *mut u8 {
        (self as *const Self as *mut u8).add(core::mem::size_of::<Self>())
    }

    /// Block until at least one new entry is available to consume.
    ///
    /// Compares `published` against `tail`. If no new entries are available,
    /// sleeps on the doorbell mechanism (architecture-specific: monitor/mwait
    /// on x86, WFE/SEV on AArch64, or futex-like wait on the `published`
    /// counter). Woken by the producer calling `doorbell_signal()` on the
    /// owning `CrossDomainRing` after advancing `published`.
    ///
    /// Returns when `published > tail` (at least one entry ready). Does not
    /// consume entries — the caller reads them via `read_entry_as()` and
    /// advances `tail` after processing.
    ///
    /// Used by `kabi_consumer_loop()` Phase 1 and `irq_consumer_loop()`.
    pub fn wait_for_entries(&self) {
        loop {
            let published = self.published.load(Ordering::Acquire);
            let tail = self.tail.load(Ordering::Relaxed);
            if published > tail {
                return;
            }
            // No entries available — sleep until the producer signals.
            // Architecture-specific wait: x86 monitor/mwait on `published`,
            // AArch64 WFE, or generic futex-like wait. The doorbell signal
            // from the producer breaks this wait.
            arch::current::cpu::wait_on_address(
                &self.published as *const AtomicU64 as *const u64,
                tail,
            );
        }
    }

    /// Read a ring entry at the given slot index, casting to `&T`.
    ///
    /// Computes the byte offset as `sizeof(DomainRingBuffer) + idx * entry_size`
    /// and returns a reference to the entry at that offset.
    ///
    /// # Safety
    ///
    /// - `idx` must be less than `self.size` (the ring capacity).
    /// - The entry at `idx` must have been fully written by the producer
    ///   (i.e., `tail <= slot_seq < published` for consumer reads).
    /// - `T` must match the actual entry type and have size <= `entry_size`.
    /// - The caller must not hold a mutable reference to the same slot.
    #[inline]
    pub unsafe fn read_entry_as<T>(&self, idx: usize) -> &T {
        debug_assert!(idx < self.size as usize, "ring index out of bounds");
        debug_assert!(
            core::mem::size_of::<T>() <= self.entry_size as usize,
            "entry type exceeds ring entry_size"
        );
        let offset = core::mem::size_of::<Self>() + idx * self.entry_size as usize;
        let ptr = (self as *const Self as *const u8).add(offset) as *const T;
        &*ptr
    }

    /// Write an entry to the ring at the given slot index.
    ///
    /// Copies `core::mem::size_of::<T>()` bytes from `entry` into the slot
    /// at offset `sizeof(DomainRingBuffer) + idx * entry_size`.
    ///
    /// # Safety
    ///
    /// - `idx` must be less than `self.size` (the ring capacity).
    /// - The caller must have exclusive ownership of the slot (guaranteed
    ///   by the CAS in `submit()` for MPSC, or by the single-producer
    ///   invariant for SPSC).
    /// - `T` must have size <= `entry_size`.
    #[inline]
    pub unsafe fn write_entry_as<T>(&self, idx: usize, entry: &T) {
        debug_assert!(idx < self.size as usize, "ring index out of bounds");
        debug_assert!(
            core::mem::size_of::<T>() <= self.entry_size as usize,
            "entry type exceeds ring entry_size"
        );
        let offset = core::mem::size_of::<Self>() + idx * self.entry_size as usize;
        let dst = (self as *const Self as *mut u8).add(offset) as *mut T;
        core::ptr::write(dst, core::ptr::read(entry));
    }
}

Calling mpsc_produce_blocking() with interrupts disabled is a BUG. Debug builds panic to catch the error during development; release builds fall back to mpsc_try_produce() with overflow accounting to avoid a hard hang in production. The release fallback is a safety net, not a license to call the blocking variant from IRQ context — all such call sites must be fixed.

Overflow accounting. When mpsc_try_produce() returns Err(Full) or Err(Overloaded) (whether called directly from IRQ context or as the release-build fallback), the caller increments a per-ring atomic overflow counter. Err(Disconnected) is not counted as overflow — it indicates the ring is being torn down and the caller should propagate the error to its own caller rather than retrying. The overflow statistics are stored directly in the DomainRingBuffer producer cache line as dropped_count and last_enqueued_seq (see struct definition in Section 11.8). Inlining these fields into the ring header avoids an extra pointer dereference on the drop path and keeps both fields on the same cache line as head and published, which are already hot during produce operations.

Each MPSC entry includes a monotonic sequence number in its header. The consumer detects dropped entries by checking for gaps in the sequence: if the sequence jumps from N to N+K (where K > 1), then K-1 entries were dropped due to overflow. The consumer logs a diagnostic event on gap detection, including the ring identity and gap size, so operators can identify rings that need larger depth configuration (Section 11.8 channel depths).
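The gap check is simple enough to sketch directly. `GapTracker` is an illustrative helper, not a real kernel type; it holds the consumer's last-seen sequence and reports how many entries were lost before each newly observed one.

```rust
/// Per-ring consumer-side gap tracker (illustrative): feed each entry's
/// header sequence number; returns the number of entries dropped to
/// overflow since the previous entry.
struct GapTracker {
    last_seq: Option<u64>,
}

impl GapTracker {
    fn new() -> Self {
        GapTracker { last_seq: None }
    }

    /// A jump from N to N+K means K-1 entries were dropped.
    fn observe(&mut self, seq: u64) -> u64 {
        let dropped = match self.last_seq {
            Some(prev) => seq.wrapping_sub(prev).saturating_sub(1),
            None => 0, // first entry: no baseline to compare against
        };
        self.last_seq = Some(seq);
        dropped
    }
}

fn main() {
    let mut t = GapTracker::new();
    assert_eq!(t.observe(10), 0); // baseline
    assert_eq!(t.observe(11), 0); // contiguous
    assert_eq!(t.observe(15), 3); // entries 12, 13, 14 were dropped
}
```

In the real consumer, a nonzero return triggers the diagnostic log event with the ring identity and gap size.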

Summary of context rules:

| Producer context | Permitted API | On ring full | Notes |
|---|---|---|---|
| Thread context (IRQs enabled) | mpsc_produce_blocking() or mpsc_try_produce() | Blocking: spin + yield until space (or Err(Disconnected) on partner death). Try: return Err(Full) or Err(Disconnected). | Blocking variant is the normal path for thread-context producers. |
| IRQ handler / softirq | mpsc_try_produce() ONLY | Return Err(Full) or Err(Disconnected), increment dropped_count, drop message. | Calling the blocking variant is a BUG (debug panic / release fallback). |
| NMI / MCE handler | NEITHER — use per-CPU buffer | N/A | See NMI/MCE safety below. |

NMI/MCE safety: NMI handlers and Machine Check Exception (MCE) handlers MUST NOT produce to MPSC rings. Mitigation 1 (disabling interrupts) does NOT protect against NMIs or MCEs — both are non-maskable architectural exceptions that fire regardless of the interrupt flag state. If an NMI or MCE handler needs to log data, it must use a dedicated per-CPU single-producer buffer (not shared with normal interrupt context) that is drained by the main kernel after the exception returns. On x86, MCE handlers additionally run on a dedicated IST (Interrupt Stack Table) stack, so they must not access per-CPU data structures that assume the normal kernel stack.
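A per-CPU single-producer NMI buffer of the kind described above can be sketched as follows. Names, the 64-slot capacity, and the u64 payload are illustrative assumptions; the key property is that the NMI handler is the only writer on its CPU, so recording needs no CAS and never blocks.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const NMI_BUF_SLOTS: usize = 64; // sketch-only capacity

/// Sketch of a per-CPU single-producer NMI log buffer. The NMI handler
/// is the sole writer of `head`; the post-exception drain path is the
/// sole writer of `tail`. No locking or IRQ masking is required.
struct NmiBuf {
    entries: [u64; NMI_BUF_SLOTS],
    head: AtomicUsize, // written only from NMI context on this CPU
    tail: AtomicUsize, // written only from the drain path
}

impl NmiBuf {
    /// Called from NMI context. Drops the record if full — an NMI
    /// handler can never wait for the drain path.
    fn record(&mut self, v: u64) -> bool {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head.wrapping_sub(tail) >= NMI_BUF_SLOTS {
            return false; // full: drop
        }
        self.entries[head % NMI_BUF_SLOTS] = v;
        self.head.store(head.wrapping_add(1), Ordering::Release);
        true
    }

    /// Called by the main kernel after the exception returns.
    fn drain(&mut self, out: &mut Vec<u64>) {
        let head = self.head.load(Ordering::Acquire);
        let mut tail = self.tail.load(Ordering::Relaxed);
        while tail != head {
            out.push(self.entries[tail % NMI_BUF_SLOTS]);
            tail = tail.wrapping_add(1);
        }
        self.tail.store(tail, Ordering::Release);
    }
}

fn main() {
    let mut buf = NmiBuf {
        entries: [0; NMI_BUF_SLOTS],
        head: AtomicUsize::new(0),
        tail: AtomicUsize::new(0),
    };
    assert!(buf.record(0xdead)); // as if from NMI context
    assert!(buf.record(0xbeef));
    let mut out = Vec::new();
    buf.drain(&mut out); // as if after the exception returns
    assert_eq!(out, vec![0xdead, 0xbeef]);
}
```

This is the same SPSC discipline as the domain rings but with no Phase 2: a single writer needs no publication protocol, which is exactly why it is safe in non-maskable context.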

Producer death recovery. If a producer (any tier) dies between MPSC Phase 1 (CAS claim) and Phase 2 (advancing published), the published counter is stuck at the dead producer's slot, blocking all subsequent producers. Three mechanisms ensure recovery:

  1. Crash recovery step 5a (Tier 1, Section 11.9) / step 4 (Tier 2, Section 11.9): The crash handler identifies all MPSC rings where the dead driver was a producer. For each ring with head > published, it writes poison markers (0xFF) into all slots from published to head, then advances published to head. Finally, it sets state = Disconnected. Any live producer currently spinning in Phase 2 observes the state change and returns Err(Disconnected). This is the primary recovery mechanism and handles the vast majority of cases.

  2. Consumer-side watchdog (mitigation 2 above): If head > published persists beyond 10ms (the crash handler hasn't run yet — e.g., silent hang, FMA detection latency), the consumer force-publishes poison markers and advances published. Safety net only.

  3. Spin loop state checks: Every spin loop in mpsc_try_produce() (the 256-iteration fallback and the final unbounded spin) checks state on each iteration. On Disconnected, the spin exits immediately with Err(Disconnected) rather than waiting for published to advance.

These mechanisms are tier-independent for Tier 1 and Tier 2: the ring protocol handles producer death the same way regardless of whether the producer was Tier 1 (MPK fault) or Tier 2 (process death). The tier determines detection latency (Tier 1: <1ms via fault handler; Tier 2: immediate via process exit), but the ring recovery sequence is identical.

Tier 0 (in-kernel) drivers: The recovery mechanisms above do not apply to Tier 0. A Tier 0 driver runs without isolation — if it crashes between Phase 1 and Phase 2 (or anywhere), the kernel is already in a panic state. Corrupted kernel memory makes ring recovery meaningless; the system is going down. The MPSC produce path mitigates the window by disabling interrupts before Phase 1 (preventing preemption between CAS and publication), but no software mechanism can recover from a Tier 0 fault — only hardware isolation provides that.

This is the explicit trade-off of Tier 0 promotion: zero isolation overhead in exchange for accepting that any driver bug is a kernel panic. On platforms that lack hardware domain isolation (e.g., RISC-V without a fast isolation mechanism, or when isolation=performance is set), all Tier 1 drivers are effectively promoted to Tier 0. Operators choosing this configuration accept the reduced fault containment. The ring buffer's state/poison-marker recovery remains compiled in (zero cost when not triggered) but cannot fire because no crash recovery path exists to set state = Disconnected — the kernel has already panicked.

Broadcast channels (SPMC): umka-core -> all drivers. Used for system-wide notifications (suspend imminent, memory pressure, clock change). Umka-core writes once; each driver reads independently. The broadcast channel uses a sequence-numbered ring with a single sequencing mechanism: the last_enqueued_seq field (hereafter write_seq in broadcast mode), a u64 in the ring header. write_seq increments by 2 for each published entry (odd values indicate a write in progress; even values indicate a stable, readable entry — see torn-read prevention below). The logical entry count is write_seq / 2. The DomainRingBuffer's published field is not used independently in broadcast mode; if read, it is derived as write_seq / 2 for compatibility with diagnostic code that inspects published. Implementations must not increment published separately from write_seq in broadcast mode — write_seq is the sole source of truth.

Each consumer tracks its own read position (a u64 sequence number stored in the consumer's private memory, not in the shared ring header). To read, a consumer scans from its last-seen sequence to the ring's current write_seq (even values only). The ring's tail field is unused in broadcast mode — the producer never needs to know individual consumer positions. Instead, the producer overwrites the oldest entry when the ring is full (broadcast semantics: slow consumers miss entries rather than blocking the producer). Consumers detect missed entries by checking for sequence gaps.

Torn-read prevention: Each broadcast ring entry is bracketed by a u64 sequence stamp. Layout: [seq_start: u64 | payload: [u8; entry_size - 16] | seq_end: u64]. The producer writes seq_start = write_seq | 1 (odd = write in progress), then the payload, then seq_end = write_seq (even = complete), then advances write_seq by 2. The consumer reads seq_start, copies the payload, reads seq_end. If seq_end != (seq_start ^ 1), the read is torn — seq_start and seq_end are not a matched pair from the same write (a concurrent write changes seq_start to a different odd value, causing this check to fail). Additionally, if seq_start < consumer.last_seq, the entry is stale. In either case, the consumer detects the gap, increments gap_count, and advances to the next entry. All sequence accesses use Ordering::Acquire (reads) and Ordering::Release (writes).

/// Per-consumer broadcast state (stored in consumer's private memory).
pub struct BroadcastConsumer {
    /// Last sequence number consumed by this consumer.
    pub last_seq: u64,
}
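The stamp protocol can be sketched with a single-slot model. The names, the 32-byte entry layout (16-byte payload), and the use of std atomics are illustrative assumptions; the real consumer additionally folds in the last_seq staleness check against its private BroadcastConsumer state.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One broadcast slot: [seq_start | payload | seq_end], as described above.
/// Single-slot model for illustration; the real ring has `size` of them.
struct BroadcastSlot {
    seq_start: AtomicU64,
    payload: [u8; 16], // entry_size 32 => 16 payload bytes
    seq_end: AtomicU64,
}

/// Producer side: stamp odd (write in progress), write the payload,
/// then stamp even (complete). `write_seq` is even before the write.
fn broadcast_write(slot: &mut BroadcastSlot, write_seq: u64, data: &[u8; 16]) {
    slot.seq_start.store(write_seq | 1, Ordering::Release);
    slot.payload = *data;
    slot.seq_end.store(write_seq, Ordering::Release);
}

/// Consumer side: copy the payload, then validate the stamp pair.
/// A complete entry has seq_end == seq_start ^ 1 (matched odd/even pair).
fn broadcast_read(slot: &BroadcastSlot) -> Option<(u64, [u8; 16])> {
    let start = slot.seq_start.load(Ordering::Acquire);
    let copy = slot.payload;
    let end = slot.seq_end.load(Ordering::Acquire);
    if end != (start ^ 1) {
        return None; // torn read: stamps are not a matched pair
    }
    Some((end, copy)) // logical sequence is the (even) end stamp
}

fn main() {
    let mut slot = BroadcastSlot {
        seq_start: AtomicU64::new(0),
        payload: [0; 16],
        seq_end: AtomicU64::new(0),
    };
    broadcast_write(&mut slot, 4, &[7u8; 16]);
    let (seq, payload) = broadcast_read(&slot).expect("complete entry");
    assert_eq!(seq, 4);
    assert_eq!(payload, [7u8; 16]);
    // Simulate a write in progress: start stamp updated, end stamp stale.
    slot.seq_start.store(6 | 1, Ordering::Release);
    assert!(broadcast_read(&slot).is_none()); // torn read detected
}
```

On a torn read the real consumer does not retry in place — per broadcast semantics it counts the gap and moves on, since the producer may lap a slow consumer at any time.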

Capability passing. Capabilities (Section 9.1) can be transferred over any IPC channel. The sending domain writes a CapabilityHandle (an opaque 64-bit token) into a ring buffer entry. Umka-core intercepts the transfer at the domain boundary and validates the capability: does the sender actually hold this capability? Is the capability transferable? Is the receiver permitted to hold capabilities of this type? If validation passes, umka-core translates the handle into the receiving domain's capability space -- the receiver gets a new handle that maps to the same underlying resource but exists in its own namespace. Raw capability data (kernel pointers, permission bitmasks) never crosses domain boundaries; only validated, translated handles do.

11.8.4 Flow Control and Ordering

Ordering within a channel. Ring buffer entries are processed in strict FIFO order within a single channel. If umka-core submits commands A, B, C to a driver's command ring, the driver sees them in A, B, C order. Completions flow back in the order the driver produces them (which may differ from submission order -- a driver may complete a fast read before a slow write).

No ordering across channels. There is no ordering guarantee between different channels. Driver A's completion may arrive at umka-core before driver B's completion, regardless of which command was submitted first. Applications that need cross-device ordering must enforce it at the io_uring level (using IOSQE_IO_LINK or IOSQE_IO_DRAIN), which umka-core translates into sequencing constraints on the domain command rings.

Channel depths. Each channel has a configurable entry count, set at ring creation time via the device registry (Section 11.4):

| Channel type | Default depth | Typical entry size | Notes |
|---|---|---|---|
| Command (SPSC) | 256 | 64 bytes | Matches NVMe SQ depth default |
| Completion (SPSC) | 1024 | 16 bytes | 4x command depth for batched completions |
| Event (MPSC) | 512 | 32 bytes | Shared across all drivers on this event loop |
| Broadcast (SPMC) | 64 | 32 bytes | Low-frequency system events |

The minimum useful broadcast entry size is 24 bytes (8 bytes of payload plus 16 bytes of sequence stamps for torn-read prevention). The default of 32 bytes provides 16 bytes of payload, suitable for most event notifications. Umka-core rejects broadcast ring creation requests with entry_size < 24.

Depths are tunable per-driver via the device registry's ring_config property. Drivers that handle high-throughput workloads (NVMe, high-speed NIC) typically increase command depth to 1024 or 4096 to match hardware queue depths.

Priority channels. Real-time I/O (Section 8.4) uses a separate high-priority command ring per driver. The driver polls the priority ring before the normal ring on every iteration. This ensures RT I/O is not head-of-line blocked behind bulk I/O. Priority rings use the same SPSC structure but are typically shallow (32-64 entries) since RT workloads are low-volume, latency-sensitive flows.

umka-core dispatch logic (per driver, per poll iteration):

  1. Check priority command ring  -> process all pending entries
  2. Check normal command ring    -> process up to batch_limit entries
  3. Check event ring (MPSC)      -> process system events
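
The dispatch order above can be sketched with plain `Vec`s standing in for the rings. The function name and `batch_limit` parameter are illustrative assumptions; real code would poll the SPSC/MPSC structures described in this section.

```rust
/// One poll iteration for one driver: drain the priority ring fully,
/// take at most `batch_limit` entries from the normal command ring,
/// then process pending system events.
pub fn poll_driver_once(
    priority: &mut Vec<u64>,
    normal: &mut Vec<u64>,
    events: &mut Vec<u64>,
    batch_limit: usize,
) -> Vec<u64> {
    let mut processed = Vec::new();
    processed.extend(priority.drain(..)); // step 1: all priority entries
    let n = normal.len().min(batch_limit);
    processed.extend(normal.drain(..n)); // step 2: bounded batch of normal entries
    processed.extend(events.drain(..)); // step 3: MPSC event ring
    processed
}
```

Draining the priority ring unconditionally before taking a bounded batch from the normal ring is what prevents RT I/O from being head-of-line blocked behind bulk traffic.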

Comparison with Linux. Linux has no equivalent to the intra-kernel domain ring buffer. Subsystem communication within the Linux kernel uses direct function calls with no isolation boundary. The closest analogy is Linux's io_uring internal implementation (the SQ/CQ ring structure), but that serves a different purpose (kernel-to-userspace communication). UmkaOS effectively uses an io_uring-inspired ring structure inside the kernel to connect isolated subsystems that Linux connects via unprotected function calls.

11.8.5 Versioned Ring Entry Format

Over a 50-year kernel lifetime, the entry formats used in DomainRingBuffer channels will evolve: new command fields, extended completion status, additional event metadata. Producer and consumer may temporarily run at different versions during rolling KABI service replacement or host↔DPU version skew. The ring entry versioning protocol handles this without breaking existing consumers or requiring synchronized upgrades.

This implements Pattern 3 (Versioned Wire Protocol) from the Data Format Evolution Framework (Section 13.18).

11.8.5.1 Ring Entry Header

Every entry in a DomainRingBuffer begins with an 8-byte header:

/// Versioned message envelope for DomainRingBuffer entries.
///
/// Part of the entry's `entry_size` allocation — does NOT add overhead
/// beyond the existing per-entry budget. For a ring with `entry_size = 64`,
/// 56 bytes remain for payload after the header.
///
/// **Version negotiation**: When a ring is created, producer and consumer
/// exchange their maximum supported `format_version` via the ring setup
/// handshake (KABI vtable `create_ring()` call). The ring operates at
/// `min(producer_version, consumer_version)`. If a live evolution upgrades
/// the producer to a newer version, the ring continues at the old version
/// until the consumer is also upgraded. The producer checks the negotiated
/// version before each produce and formats the entry accordingly.
///
/// **Backward compatibility rule**: Version N+1 messages are a strict
/// superset of version N. Fields are append-only — existing field offsets
/// never change. A version N consumer safely reads a version N+1 message
/// by ignoring bytes beyond `payload_len`. A version N+1 consumer reading
/// a version N message uses documented defaults for missing trailing fields.
#[repr(C)]
pub struct RingEntryHeader {
    /// Format version of this entry's payload layout. Starts at 1.
    /// Incremented when new payload fields are appended.
    pub format_version: u16,
    /// Total bytes of payload following this header. Must satisfy:
    /// `payload_len + size_of::<RingEntryHeader>() <= ring.entry_size`.
    /// Consumers use this to determine how many bytes to read,
    /// preventing out-of-bounds access on shorter entries.
    pub payload_len: u16,
    /// Entry type discriminant (command, completion, event, etc.).
    /// Values are ring-specific, defined by the KABI service that
    /// owns the ring. 0x0000 = NOP (ignored by consumer).
    /// 0xFFFF = poison marker (existing ring protocol).
    pub entry_type: u16,
    /// Reserved. Must be zero. Available for future header extensions
    /// without incrementing the per-ring format_version (the header
    /// format itself is stable).
    pub _reserved: u16,
    // Payload bytes follow immediately at offset 8.
}
const_assert!(core::mem::size_of::<RingEntryHeader>() == 8);


11.8.5.2 Version Negotiation Protocol

Ring version negotiation occurs during create_ring(), before any entries are produced:

Producer                              Consumer
   │                                     │
   │  create_ring(max_version=3, ...)    │
   │ ──────────────────────────────────> │
   │                                     │
   │  ring_accept(max_version=2)         │
   │ <────────────────────────────────── │
   │                                     │
   │  negotiated_version = min(3, 2) = 2 │
   │  All entries formatted as v2.       │
   │                                     │
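
The handshake reduces to taking the minimum of the two maxima and publishing the result atomically. A minimal sketch (function name is an assumption; the `AtomicU16` publication mirrors the `negotiated_version` field on the ring metadata page):

```rust
use std::sync::atomic::{AtomicU16, Ordering};

/// Negotiate the ring's operating version: min of producer and consumer
/// maxima, published with Release so readers using Acquire observe it.
pub fn negotiate_ring_version(
    producer_max: u16,
    consumer_max: u16,
    negotiated: &AtomicU16,
) -> u16 {
    let version = producer_max.min(consumer_max);
    negotiated.store(version, Ordering::Release);
    version
}
```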

The negotiated version is stored in a new field on the ring's metadata page (not in the DomainRingBuffer header — that is frozen as non-replaceable data). The metadata page is a separate 4 KB page allocated per ring pair, accessible to both producer and consumer:

/// Per-ring metadata page (4 KB, shared between producer and consumer).
/// Allocated during create_ring() and freed during ring teardown.
/// Contains negotiation state and diagnostic counters.
/// This struct is shared between separately-compiled isolation domains
/// (producer and consumer may be different KABI components), so it MUST
/// have `#[repr(C)]` and explicit padding.
#[repr(C)]
pub struct RingMetadata {
    /// Negotiated entry format version. Set once during create_ring()
    /// handshake; updated atomically during live evolution when both
    /// sides have been upgraded.
    pub negotiated_version: AtomicU16,
    /// Producer's maximum supported version. Read-only after creation.
    pub producer_max_version: u16,
    /// Consumer's maximum supported version. Read-only after creation.
    pub consumer_max_version: u16,
    /// Explicit padding to align `total_produced` to 8-byte boundary.
    /// `#[repr(C)]` would insert 2 bytes of implicit padding here;
    /// making it explicit prevents information disclosure.
    pub _pad0: [u8; 2],
    /// Total entries produced since ring creation (diagnostic).
    pub total_produced: AtomicU64,
    /// Total entries consumed (diagnostic).
    pub total_consumed: AtomicU64,
    /// Trailing padding to fill a cache line (64 bytes total).
    /// 2 + 2 + 2 + 2 + 8 + 8 = 24 bytes of fields; 64 - 24 = 40 bytes pad.
    pub _pad1: [u8; 40],
}
/// offset 0: negotiated_version (2B)
/// offset 2: producer_max_version (2B)
/// offset 4: consumer_max_version (2B)
/// offset 6: _pad0 (2B)
/// offset 8: total_produced (8B)
/// offset 16: total_consumed (8B)
/// offset 24: _pad1 (40B)
/// Total: 64 bytes (one cache line)
const_assert!(size_of::<RingMetadata>() == 64);

11.8.5.3 Version Upgrade During Live Evolution

When a KABI service is live-replaced (Section 13.18) and the new version supports a higher entry format:

  1. New producer loads: Phase A. producer_max_version updated in RingMetadata. Ring still operates at old negotiated_version.
  2. Consumer upgraded (same or later evolution): Consumer writes new consumer_max_version to RingMetadata.
  3. Version bump: When both producer_max_version and consumer_max_version are ≥ target version, the evolution framework atomically updates negotiated_version during Phase B. Subsequent entries use the new format.

No entries are lost or reformatted during the upgrade. The transition is seamless: entries before the version bump are at the old version; entries after are at the new version. Consumers handle both versions in the same processing loop (a single `match` on `format_version`).
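
A consumer-side sketch of that single-branch handling, assuming a hypothetical command payload where version 2 appends a `flags: u32` field after the version-1 fields (all field names here are illustrative, not from the real KABI schemas):

```rust
use std::convert::TryInto;

#[derive(Debug, PartialEq)]
pub struct Command {
    pub lba: u64,
    pub len: u32,
    pub flags: u32, // appended in v2; defaults to 0 when reading a v1 entry
}

/// Decode a payload at the entry's format_version. Fields are append-only,
/// so v1 offsets remain valid in v2; missing trailing fields take
/// documented defaults.
pub fn decode_command(format_version: u16, payload: &[u8]) -> Option<Command> {
    let lba = u64::from_le_bytes(payload.get(0..8)?.try_into().ok()?);
    let len = u32::from_le_bytes(payload.get(8..12)?.try_into().ok()?);
    let flags = if format_version >= 2 {
        u32::from_le_bytes(payload.get(12..16)?.try_into().ok()?)
    } else {
        0 // v1: field absent, documented default
    };
    Some(Command { lba, len, flags })
}
```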

11.8.5.4 Compatibility Matrix

| Producer version | Consumer version | Negotiated | Behavior |
|---|---|---|---|
| N | N | N | Normal. All fields understood. |
| N+1 | N | N | Producer formats at v(N). New fields not sent. |
| N | N+1 | N | Consumer reads v(N). Missing fields use defaults. |
| N+1 | N+1 | N+1 | Both upgraded. New fields available. |
| N+2 | N | N | Producer formats at v(N). Skips two versions of new fields. |

11.8.5.5 Performance Impact

Steady state (versions match): The 8-byte header is part of the existing entry allocation. Consumers already read the first bytes to determine command type — entry_type replaces the ad-hoc discriminant byte at offset 0. The format_version check is a single u16 comparison, branch-predicted as "taken" (versions match >99.99% of the time). Total overhead: zero additional cache misses, ~1 cycle for the version branch.

During version skew: The producer formats entries at the negotiated (lower) version, omitting new fields. No per-entry cost beyond the steady-state path. The only overhead is the version check in the producer's format path (one branch, predicted).

Cross-references: - DomainRingBuffer design: Section 11.8 (above) - Data format evolution framework: Section 13.18 - KABI service live replacement: Section 13.18 - Distributed IPC RDMA header: Section 5.5

11.8.6 Terminology Reference

The following terms are used precisely throughout this document. This reference resolves ambiguity that arises from the word "ring" appearing in multiple contexts:

| Term | Meaning | Where used |
|---|---|---|
| io_uring | Linux-compatible userspace async I/O interface. SQ/CQ rings mapped into user space. | Section 19.3, user-facing I/O API |
| domain ring buffer | Internal kernel IPC mechanism between isolation domains. SPSC or MPSC lock-free rings in shared memory. | Section 11.8, driver architecture |
| MPSC ring | A domain ring buffer variant with CAS-based multi-producer support. Used for event aggregation. | Section 11.8, event channels |
| Hardware queue | Device-specific command/completion queues (e.g., NVMe SQ/CQ, virtio virtqueue). Mapped via MMIO. | Section 11.7, device I/O paths |
| SPSC | Single-Producer Single-Consumer. The default domain ring buffer mode. | Section 11.8 |
| SPMC | Single-Producer Multi-Consumer. Used for broadcast channels (umka-core -> all drivers). | Section 11.8 |

Any unqualified reference to "ring buffer" in the driver architecture sections (Sections 11.5-11.9) means a domain ring buffer. Any reference to "io_uring" means the userspace interface. Hardware queues are always qualified by device type (e.g., "NVMe submission queue", "virtio virtqueue").


11.9 Crash Recovery and State Preservation

This is UmkaOS's killer feature -- the primary reason to choose it over Linux.

Scope: This section covers Tier 1 and Tier 2 driver crash recovery where the host kernel acts as supervisor. For peer kernel crash recovery (devices running UmkaOS as a first-class multikernel peer), see Section 5.3, which uses a different isolation model (IOMMU hard boundary + PCIe unilateral controls rather than software domain supervision).

Tier 0 driver crash behavior (no Tier 1 available): On architectures where Tier 1 is unavailable and drivers run as Tier 0 (RISC-V, s390x, LoongArch64 — see Section 11.2), a driver crash is equivalent to a kernel panic. There is no isolation boundary to contain the fault. The crash recovery protocol (steps 1-9 below) does NOT apply to Tier 0-promoted drivers. Instead, the panic handler runs: fma_panic_event() is emitted, pstore captures the crash log, and the system reboots.

11.9.1 The Linux Problem

In Linux, all drivers run in the same address space with no isolation. A single bug in any driver -- null pointer dereference, buffer overflow, use-after-free -- triggers a kernel panic. Recovery requires a full system reboot: 30-60 seconds of downtime, loss of all in-flight state, and potential filesystem corruption if writes were in progress.

11.9.1.1 Shared-Domain Crash Detection Latency

When multiple drivers share an isolation domain (Section 24.5), corruption within the shared domain does not trigger a hardware exception. A buffer overrun in driver A that corrupts driver B's data produces no immediate fault -- driver B may return wrong results (silent data corruption) until the corruption eventually causes a detectable crash (null pointer dereference, invalid opcode, ring buffer integrity failure, or watchdog timeout). In contrast, a solo-domain driver's first errant cross-domain memory access triggers an immediate hardware fault.

This delayed detection is an inherent consequence of domain grouping and is mitigated by:

  1. Rust memory safety -- prevents buffer overruns, use-after-free, and data races at compile time, eliminating the dominant bug classes that would exploit co-tenancy.
  2. Software integrity checks -- ring buffer validation (step 1 fault detection below), watchdog timers, and KABI return value validation provide software-level fault detection independent of hardware domain boundaries.
  3. Operator control -- administrators can promote high-value drivers to solo domains or Tier 2 (Section 11.2).

11.9.2 UmkaOS Tier 1 Recovery Sequence

When a Tier 1 (domain-isolated) driver faults:

1. FAULT DETECTED
   - Hardware exception (page fault, GPF) within a Tier 1 isolation domain
   - OR watchdog timer expires (driver stalled for >Nms)
   - OR driver returns invalid result / corrupts its ring buffer

   DOMAIN IDENTIFICATION — determine which domain faulted:
   - Extract the domain ID from the faulting CPU's per-CPU state:
     `let domain_id = arch::current::cpu::cpulocal()
         .active_domain.load(Ordering::Relaxed);`
     CpuLocalBlock.active_domain is set by the consumer loop on domain
     entry and cleared on exit ([Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path)).
     If nonzero, the fault occurred within that isolation domain.
   - If `domain_id == CORE_DOMAIN_ID` (value 0):
       The fault is in Core domain — no isolation boundary exists.
       This is an unrecoverable kernel fault. Branch to panic handler:
       `panic!("Fault in Core domain: no isolation boundary");`
   - Otherwise: `domain_id` identifies the crashed domain.
     Proceed to step 1a with the identified domain.

1a. TIER CHECK — determine if recovery is possible:
   - Look up the domain descriptor from the domain registry:
     `let domain = domain_registry.get(domain_id).expect("valid domain");`
     where `domain_registry` is a global `XArray<DomainId, DomainDescriptor>`
     ([Section 11.3](#driver-isolation-tiers)).
   - Read the faulting driver's effective isolation tier:
     `let tier = domain.effective_tier();`
     where `effective_tier()` is defined on `DomainDescriptor`:
     ```rust
     impl DomainDescriptor {
         /// Returns the effective isolation tier for this domain, accounting
         /// for architecture-level fallback (e.g., RISC-V Tier 1 -> Tier 0),
         /// operator override, and KABI manifest constraints.
         pub fn effective_tier(&self) -> IsolationTier {
             self.tier  // IsolationTier field set at domain creation / promotion
         }
     }
     ```
   - If `tier == Tier::Zero`:
       The fault is in Tier 0 — no isolation boundary exists.
       Recovery is NOT possible. Branch to kernel panic handler:
       `fma_panic_event("Tier 0 driver fault: no isolation boundary");`
       `panic!("Tier 0 driver fault: no isolation boundary");`
        A driver may run at Tier 0 for any of several reasons: operator decision
       (`echo 0 > /ukfs/kernel/drivers/<name>/tier`), KABI manifest
       constraint (`minimum_tier = 0`), architecture-level promotion
       on platforms without Tier 1 hardware, or any future reason.
       The crash handler does not care WHY the driver is Tier 0 —
       only that `effective_tier() == Tier::Zero` means no boundary.
   - If `tier == Tier::One`:
       Hardware memory domain isolation (MPK/POE/DACR/segments) contains
       the fault. Proceed to step 2 (ISOLATE). Recovery via reload.
   - If `tier == Tier::Two`:
       Full Ring 3 + IOMMU process isolation. Available on ALL
       architectures — Tier 2 does not depend on Tier 1 hardware.
       Proceed to step 2 (ISOLATE). Recovery via process restart.
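
The step 1a branch structure condenses to a three-way dispatch. The sketch below uses assumed enum and function names, and returns an action value for clarity — the real handler panics in-line on Tier 0 rather than returning:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum IsolationTier {
    Zero,
    One,
    Two,
}

#[derive(Debug, PartialEq)]
pub enum FaultAction {
    /// Tier 0: no isolation boundary — kernel panic.
    Panic,
    /// Tier 1: domain revoked, driver reloaded in place.
    ReloadDriver,
    /// Tier 2: driver process restarted under full process + IOMMU isolation.
    RestartProcess,
}

/// Map a faulting driver's effective tier to the recovery action.
/// The handler does not care WHY the driver is at a given tier —
/// only the effective tier at fault time matters.
pub fn fault_action(tier: IsolationTier) -> FaultAction {
    match tier {
        IsolationTier::Zero => FaultAction::Panic,
        IsolationTier::One => FaultAction::ReloadDriver,
        IsolationTier::Two => FaultAction::RestartProcess,
    }
}
```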

2. ISOLATE
   - UmkaOS Core revokes the faulting driver's isolation domain by calling
     `arch::current::isolation::revoke_domain_permissions(domain.isolation_key)`.
     This function permanently sets the domain's permissions to deny-all in
     the hardware isolation register, preventing any CPU from accessing the
     domain's memory. Unlike `switch_domain()` (which changes the ACTIVE
     domain), `revoke_domain_permissions()` permanently locks out a crashed
     domain so that no future `switch_domain()` call can re-enable access.

     Per-architecture implementation of `revoke_domain_permissions(key)`:

     | Architecture | Operation | Effect |
     |---|---|---|
     | x86-64 | Set both AD (access-disable) and WD (write-disable) bits for the domain's PKEY in the domain allocation table. Update PKRU on the faulting CPU via `WRPKRU`. | Any CPU that calls `switch_domain(domain_id)` will load PKRU with AD+WD set for this key — deny-all. |
     | AArch64 POE | Clear the Permission Overlay Index entry for the domain's POI slot. Write `0b00` (no access) to the corresponding `POR_EL0` field. | Any CPU entering this domain gets no-access overlay. |
     | AArch64 mainstream (page-table + ASID) | Mark the domain's ASID as invalid in the ASID allocator. Set a `revoked` flag on the domain's page table root. Issue `TLBI ASIDE1IS` to flush the ASID across the inner-shareable domain. Page table backing memory is NOT freed here — deferred until NMI ejection completes and all CPUs acknowledge they are no longer using the ASID (tracked by `ejected_count`). | Any CPU switching to this domain's ASID triggers a translation fault (ASID invalid). The trampoline checks `domain_valid` before `switch_domain` to avoid writing a freed page table address to TTBR0_EL1. Page tables are freed after all ejections complete. |
     | ARMv7 | Set the domain field in the global DACR template to `0b00` (No Access) for the domain's DACR slot. | Subsequent DACR writes include no-access for this domain. |
     | PPC32 | Invalidate the segment register mapping for the domain. Mark the segment as invalid in the domain allocation table. | Segment translation faults on any access to the domain's pages. |
     | PPC64LE | Unmap the Radix PID's pages for the domain. Clear the PID entry in the partition table. | ERAT miss → walk finds no valid mapping. |
     | RISC-V | Unmap driver pages from all page tables (Tier 0 fallback — no fast isolation). | Page fault on any access to driver pages. |
     | s390x | Revoke storage key assignments for the domain's pages (Tier 0 fallback). | Storage key protection fault on access. |
     | LoongArch64 | Unmap driver pages from all page tables (Tier 0 fallback — no fast isolation). | Page fault on any access to driver pages. |

     The function is idempotent: calling it multiple times for the same key
     is safe (the NMI handler on ejected CPUs may call it redundantly).
     The domain allocation table update is protected by the per-domain
     `crash_lock` (held by the faulting CPU during Steps 1-2a only —
     exception context, non-sleeping). The `crash_lock` is released before
     the `CrashRecoveryRequest` is pushed to `CRASH_RECOVERY_RING`. Steps
     3-9 (process context, sleeping) are serialized by `recovery_mutex`
     instead. See [Section 12.8](12-kabi.md#kabi-domain-runtime) for the full lock protocol.

   - Driver can no longer access any memory in its domain
   - Interrupt lines for this driver are masked
   - Transition domain state to `Crashed`:
     ```rust
     domain.state.store(DomainState::Crashed as u8, Ordering::Release);
     ```
     This MUST happen BEFORE Step 2' (SET RING STATE) so that
     `domain_crashed()` returns `true` for any workqueue check that
     races with the recovery. The `Release` ordering ensures the state
     is visible to any subsequent `Acquire` load in `domain_crashed()`
     on other CPUs.
   - Clear `domain_valid` on the faulting CPU:
     `cpulocal.domain_valid.store(0, Ordering::Relaxed)`.
     This invalidates trampoline TOCTOU checks immediately — any concurrent
     trampoline on this CPU that passed the generation check but has not yet
     entered the domain will observe `domain_valid == 0` and return
     `Error::ProviderDead` instead of entering a revoked domain. Relaxed
     ordering suffices because the faulting CPU is the one executing this
     code and the hardware domain revocation (above) already prevents access.

2'. SET RING STATE
   - Set `state = Disconnected` (AtomicU8::store, Release, value 1) on all
     rings owned by the dead driver. Any producer currently spinning in a
     Phase 2 loop will observe this on its next `state.load(Acquire)` and
     return `Err(Disconnected)`. This field is reset to `Active` (0) when
     the replacement driver re-initializes the ring.
   - Step 2' must precede step 2a to ensure all ring accessors observe
     `Disconnected` before any CPU is ejected from the driver domain. The
     `Release` store to ring `state` guarantees ordering with respect to
     subsequent `Acquire` loads of the same field by other CPUs. However,
     the NMI delivery mechanism (IPI) is a separate store to a device
     register, not an `Acquire` load of ring state. To ensure the ring
     `state` store is visible on remote CPUs before their NMI handlers
     execute, `send_nmi_ipi()` must include an architecture-appropriate
     barrier before the IPI trigger write:

     | Architecture | Required barrier before IPI send |
      |---|---|
     | x86-64 | None (APIC MMIO write is serializing) |
     | AArch64 | `DSB ISH` before `MSR ICC_SGI1R_EL1` |
     | ARMv7 | `DSB` before GIC GICD_SGIR MMIO write |
     | RISC-V | `fence rw, w` before ACLINT MMIO write |
     | PPC32 | `sync` before MPIC IPI write |
     | PPC64LE | `sync` before XIVE IPI trigger |
     | s390x | `BCR 15,0` (serialization) before SIGP |
     | LoongArch64 | `dbar 0` before EIOINTC IPI write |

     Each architecture's `send_nmi_ipi()` and `send_ipi()` implementations
     include the appropriate barrier, ensuring all prior stores to normal
     memory (including the ring `state` Release stores) are globally visible
     before the IPI is received by remote CPUs.

2a. NMI PREEMPTION OF IN-DOMAIN CPUs
   - After revoking the domain's memory permissions, send an NMI IPI to
     all CPUs (excluding the faulting CPU, which is already in the fault
     handler). The NMI handler checks whether the interrupted context
     was executing within the revoked domain:

     NMI handler pseudocode (reads from global `NMI_CRASH_CTX`,
     defined in [Section 12.8](12-kabi.md#kabi-domain-runtime)):

       let revoked_id = NMI_CRASH_CTX.revoked_domain_id.load(Acquire);
       let domain_id = current_cpu.active_domain.load(Relaxed);
       if domain_id == revoked_id {
           // CPU was executing driver code in the crashed domain.
           // Clear domain_valid BEFORE redirect — this prevents the
           // TOCTOU window where a trampoline on this CPU could still
           // see domain_valid == 1 for the crashed domain.
           current_cpu.domain_valid.store(0, Ordering::Relaxed);
           // Modify the NMI return frame to redirect execution to
           // domain_crash_trampoline. The driver's in-progress
           // operation is abandoned (ring buffer cancellation
           // handles cleanup in Step 3).
           redirect_to_crash_recovery(
               &mut nmi_frame,
               &NMI_CRASH_CTX.ejected_count,
           );
       }
       // Otherwise: CPU was in kernel core or a different domain.
       // No action — return from NMI normally.

   - This ensures CPUs still executing inside the crashed driver's
     domain are immediately and safely ejected, preventing use of
     revoked memory mappings. Without this step, a CPU could continue
     executing stale driver code for up to one scheduler tick (~1-4ms)
     after domain revocation, potentially triggering secondary faults
     on other CPUs (GPF from accessing revoked domain pages).

   - The NMI is synchronous with respect to the crash recovery
     sequence: the faulting CPU waits (spin-polls an atomic counter)
     until all targeted CPUs have acknowledged the NMI (either by
     executing the trampoline redirect or by confirming they were not
     in the revoked domain). Maximum wait: configurable timeout, after
     which the recovery proceeds anyway (a CPU that has not responded
     within the timeout is assumed to be in a nested NMI or hardware
     stall and will fault on its next domain access attempt).

     **Timeout tiers** (configured via `umka.nmi_ack_timeout_us=N`):

     | Environment | Default timeout | Rationale |
     |---|---|---|
     | Bare metal (optimized firmware) | 10 us | NMI IPI delivery + handler entry is ~50-100 ns on modern x86-64; 10 us provides 100x headroom for cache misses and interrupt coalescing. |
     | Bare metal (SMI-heavy firmware) | 100 us | SMI handlers can run for 50-200 us and are invisible to the OS. Boot parameter `umka.nmi_ack_timeout_us=100` avoids spurious warnings. |
     | Virtualized (VM guest) | 10000 us (10 ms) | vCPU scheduling delays on the host can prevent the target vCPU from running for milliseconds. The host scheduler may not run the target vCPU immediately after the NMI IPI — the vCPU may be descheduled, preempted by a higher-priority host process, or waiting for a VMCS/VMCB reload. 10 ms accommodates worst-case host scheduling latency under moderate load. |

     The kernel auto-detects virtualization at boot (CPUID hypervisor
     bit on x86, device tree `hypervisor` node on ARM/RISC-V,
     `/sys/hypervisor` on s390x) and selects the VM timeout tier
     automatically unless overridden by the boot parameter. If
     acknowledgment still fails after the extended timeout, the
     recovery path assumes the CPU is stuck and proceeds — the stuck
     CPU will fault when it eventually runs (accessing the revoked
     domain triggers a hardware exception), triggering a secondary
     recovery or panic if in Core domain.
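
The acknowledgment wait can be sketched in hosted Rust, with `std::time` standing in for the kernel's clock source (the function name and signature are assumptions):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Spin until all targeted CPUs acknowledge the NMI, or the configured
/// timeout (the umka.nmi_ack_timeout_us tier) expires. Returns false on
/// timeout; the caller proceeds anyway, since a stuck CPU will fault on
/// its next access to the revoked domain.
pub fn wait_for_nmi_acks(acked: &AtomicU64, targets: u64, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while acked.load(Ordering::Acquire) < targets {
        if Instant::now() >= deadline {
            return false; // proceed without the straggler
        }
        std::hint::spin_loop();
    }
    true
}
```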

   - Per-architecture NMI delivery mechanism:
     - x86-64: APIC NMI IPI (ICR delivery mode = 0b100 NMI,
       destination shorthand = all-excluding-self)
     - AArch64: GICv3 pseudo-NMI (priority-based masking via
       `ICC_PMR_EL1`; send SGI with highest priority in group 0
       via `ICC_SGI1R_EL1` with IRM=1 for all-excluding-self)
     - ARMv7: FIQ via GICv2 group 0 (highest priority interrupt;
       target all CPUs via `GICD_ITARGETSR` broadcast)
     - RISC-V: No hardware NMI; use highest-priority IPI via
       ACLINT MSWI / PLIC with supervisor software interrupt.
       The IPI handler runs at highest interrupt priority with
       preemption disabled, approximating NMI semantics. Latency
       is implementation-dependent and not bounded like a true NMI —
       RISC-V software IPIs can be delayed if the target hart is
       servicing a higher-priority interrupt or has interrupts
       masked. Typical observed latency: ~200ns-2us depending on
       platform and interrupt load (vs ~50-100ns for x86-64 NMI)
     - PPC32: Inter-processor doorbell (msgsnd) with critical
       interrupt priority (IVOR1)
     - PPC64LE: System Reset Interrupt (SRI, vector 0x100)
       delivered via OPAL `opal_signal_system_reset()` on powernv,
       or H_PROD + external interrupt on pseries. SRI is
       non-maskable and preempts all other interrupt classes
     - s390x: External interrupt (SIGP signal-processor order
       `SIGP_EXTERNAL_CALL` with emergency signal subclass).
       Non-maskable via PSW bit manipulation
     - LoongArch64: IPI mailbox interrupt via IOCSR with highest
       priority. Configure via `LOONGARCH_IOCSR_IPI_SEND` with
       target CPU mask

   - **Lock safety**: The NMI handler must not acquire any locks.
     It reads `current_cpu.active_domain` (an AtomicU64 in the
     per-CPU block) and either redirects execution or returns.
     The trampoline target is a fixed function pointer stored in
     the per-CPU block at boot time (not dynamically dispatched).

   - **Ejected CPU tracking**: The `NMI_CRASH_CTX.ejected_count`
     field (an `AtomicU64` in the global `NmiCrashContext` struct,
     defined in [Section 12.8](12-kabi.md#kabi-domain-runtime)) is incremented by each NMI
     handler that performs a redirect. This counter is used by
     Step 6 for diagnostic logging ("N CPUs ejected from crashed
     domain") and by the NMI synchronization spin-poll to determine
     when all targeted CPUs have acknowledged. The faulting CPU resets
     `ejected_count` to 0 (Release) before sending the NMI IPI, and
     writes `revoked_domain_id` (Release) so that NMI handlers on
     remote CPUs can read it (Acquire).

     Serialization across concurrent domain crashes accessing the
     global `NMI_CRASH_CTX` is provided by the `NMI_CRASH_CTX.active`
     field (AtomicU8, CAS 0→1 before writing fields, store 0 after
     NMI cycle completes). The per-domain `crash_lock` serializes
     same-domain re-crashes but does NOT serialize cross-domain access
     to the global NMI context — `active` does. See
     [Section 12.8](12-kabi.md#kabi-domain-runtime) for the full CAS protocol.

3. RECOVER RING BUFFER IN-FLIGHT SLOTS
   - Ring integrity must be restored before draining, because the drain
     step reads from rings that may have corrupted pointers or unpublished
     slots from the crashed driver.
   - For each MPSC ring where the dead driver was a producer: if
     `head > published` (indicating the driver may have claimed a slot
     via Phase 1 CAS but died before Phase 2 publication), write poison
     markers (0xFF) into all unpublished slots from `published` to `head`
     and advance `published` to `head`. This unblocks any live producers
     spinning in Phase 2 waiting for the dead driver's slot to be published.
   - For SPSC completion rings (driver -> core): the ring is drained of all
     valid entries up to `published`, then the ring is reset (`head = tail
     = published = 0`) for the replacement driver instance.
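
The MPSC slot-poisoning step can be sketched with plain integers standing in for the real atomic ring indices (the structure and names are illustrative):

```rust
/// Simplified MPSC ring view for recovery: `head` is the highest claimed
/// slot sequence, `published` the highest published one.
pub struct MpscRingView {
    pub slots: Vec<[u8; 8]>,
    pub head: u64,
    pub published: u64,
}

pub const POISON: u8 = 0xFF;

/// Poison every slot the dead producer claimed (Phase 1 CAS) but never
/// published (Phase 2), then advance `published` so live producers
/// spinning in Phase 2 can make progress. Poisoned entries are skipped
/// by the consumer.
pub fn recover_dead_producer(ring: &mut MpscRingView) {
    let capacity = ring.slots.len() as u64; // real rings use power-of-two masking
    for seq in ring.published..ring.head {
        ring.slots[(seq % capacity) as usize] = [POISON; 8];
    }
    ring.published = ring.head;
}
```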

4. DRAIN PENDING I/O
   - All pending requests from user space are completed with -EIO
   - Applications receive error codes, not crashes
   - io_uring CQEs are posted with error status

4a. INCREMENT CRASH COUNT AND EMIT FMA EVENT
   - Increment the device's crash count BEFORE emitting the FMA event, so that
     the event payload reflects the current (not previous) crash count. This
     increment happens on every crash, not only on reload failure. The reload
     failure handler (Reload Failure Handling, below) does NOT separately
     increment crash_count — it is already counted here.

      `crash_count` is a per-device field on `DeviceNode` (not per-domain):
      in a domain-group crash, every device in the group records the same
      crash event, but each device keeps its own crash history for FMA
      escalation:
     ```rust
     /// Per-device crash counter for FMA escalation. Incremented by the
     /// crash recovery manager at Step 4a. Per-device (not per-domain)
     /// because devices may have different crash histories — a device
     /// may have crashed solo before being grouped into a shared domain.
     /// AtomicU64 for 50-year longevity (even at 1 crash/second, u64
     /// does not wrap within 584 billion years).
     // Field on DeviceNode ([Section 11.6](#device-services-and-boot)):
     pub crash_count: AtomicU64,
     ```
     For each device in the crashed domain:
     `dev.crash_count.fetch_add(1, Ordering::Relaxed);`
   - The crash recovery manager emits an FMA fault event so the FMA subsystem
     ([Section 20.1](20-observability.md#fault-management-architecture)) can track crash frequency and
     escalate if the driver is repeatedly crashing:
      ```rust
      fma_emit(FaultEvent::DriverCrash {
          device:      dev.device_node_id(),
          driver_id:   dev.driver_id(),
          tier:        domain.effective_tier() as u8,
          crash_count: dev.crash_count.load(Ordering::Relaxed),
      });
      ```
   - This emission is non-blocking and lock-free (the FMA ring buffer is a
     BoundedMpmcRing with atomic tail increment, safe from IRQ context).
   - The FMA diagnosis engine processes the DriverCrash event asynchronously
     and may trigger escalation actions (demote tier, disable device) based
     on crash frequency rules. See "FMA-Driven Crash Escalation" below.

   CQE DELIVERY DURING DOMAIN-GROUP CRASH RECOVERY

   When a domain-group crash affects multiple drivers sharing an isolation
   domain, the crash recovery thread iterates their rings and posts CQEs
   sequentially (one ring at a time), not simultaneously across rings. The
   crash recovery thread holds no lock across ring iterations — each ring's
   CQE posting is independent.

   If the io_uring CQ ring overflows during crash-recovery CQE posting
   (application is not draining completions), overflow entries are queued in
   the io_uring overflow list (`ctx.cq_overflow_list`), identical to Linux
   behavior. The overflow list is a linked list of `CqEntry` structs
   allocated from the per-ring pre-reserved overflow pool:

   ```rust
   /// CQ overflow pool size per io_uring instance.
   /// Sized for worst-case domain-group crash: all in-flight requests
   /// for all drivers in a single isolation domain completing with -EIO
   /// simultaneously while the application is not draining CQ.
   ///
   /// Default: max(CQ ring size, 1024). Configurable via
   /// IORING_REGISTER_CQ_OVERFLOW_SIZE.
   pub const CQ_OVERFLOW_POOL_DEFAULT: u32 = 1024;
   ```

   Ordering guarantee: FMA fault events (`FaultEvent::DriverCrash`) are
   emitted after domain revocation (step 2) but before CQE posting
   (step 4). This gives monitoring agents (e.g., Kubernetes liveness
   probes watching FMA eventfd) a head start before the application
   observes -EIO errors.

5. DEVICE RESET VERIFICATION (post-FLR checkpoint)
   - Confirm device is in known-good state after the FLR/reset initiated in
     DMA-1 and waited on in DMA-3. This is a VERIFICATION step, not a
     reset-initiation step — the actual FLR was initiated in DMA-1 (see
     unified interleaving above).
   - Read device status register to verify: no pending DMA, configuration
     space returns valid Vendor ID, no error bits set in PCIe AER status.
     For devices reporting lingering errors after FLR (broken firmware),
     log an FMA warning and proceed — the driver reload (Step 8) will
     re-initialize the device.
   - For non-FLR devices: DMA-1 already issued the vendor-specific reset.
     This step simply verifies the reset completed successfully.
   - Device is now in known-good state, ready for driver reload.

6. RELEASE KABI LOCKS
   - The KABI lock registry tracks all Core kernel locks currently held on
     behalf of this driver. Every lock-acquiring KABI call (e.g., mutex_lock,
     rw_lock_read) pushes a (lock_ptr, lock_type) entry onto a per-driver,
     per-CPU lock stack (max depth 8, statically allocated in the driver
     descriptor). On normal unlock, the entry is popped.

     Lock stack entry types (diagnostic-only with ring dispatch):

     ```rust
     /// Per-driver, per-CPU lock stack entry for crash recovery diagnostics.
     /// With ring-based Tier 1 dispatch, Core locks are rarely held on behalf
     /// of a driver (only during Tier 0 direct calls). This struct is used
     /// primarily for diagnostic logging ("driver held N locks at crash time")
     /// and for the Tier 0 direct-call path where Core lock cleanup is needed.
     pub struct LockStackEntry {
         /// Address of the lock object (for identification and force-release).
         pub lock_ptr: usize,
         /// Type of lock held (determines the force-release protocol).
         pub lock_type: LockType,
     }

     /// Lock type discriminant for crash recovery force-release.
     #[repr(u8)]
     pub enum LockType {
         /// Mutex: force-release sets owner to NONE, wakes waiters.
         Mutex = 0,
         /// SpinLock: force-release clears the lock word.
         SpinLock = 1,
         /// RwLock held for read: force-release decrements reader count.
         RwLockRead = 2,
         /// RwLock held for write: force-release clears writer flag, wakes waiters.
         RwLockWrite = 3,
     }

     /// Maximum lock stack depth per driver per CPU. Bounded by the KABI
     /// non-reentrancy invariant (at most one Core lock per KABI call).
     pub const LOCK_STACK_MAX_DEPTH: usize = 8;
     ```
   - On crash recovery, the registry is walked in reverse order (LIFO):
     each held lock is force-released (mutex: set owner to NONE and wake
     waiters; rwlock: decrement reader count or clear writer; spinlock:
     release). This prevents deadlock when a driver panics mid-critical-section.
   - After lock release, per-CPU borrow states held by the driver are reset
     to 0 (free), matching the PerCpu borrow-state tracking in Section 3.1.1.
   - **Invariant**: KABI calls that acquire Core locks MUST be non-reentrant
     and hold at most one Core lock at a time (enforced by the KABI vtable
     wrappers). This bounds the lock stack depth and ensures reverse-order
     release is always safe.
   - **Cross-CPU scope**: The registry walks the per-driver lock stack for the
     CURRENT CPU only (the CPU where the fault occurred). Cross-CPU locks are
     not held — KABI calls are non-reentrant and execute on one CPU at a time.
     The per-CPU lock stack (max depth 8) captures all locks acquired by KABI
     calls on behalf of the crashed driver.
   - **Scope with ring dispatch**: With ring-based Tier 1 dispatch
     ([Section 12.8](12-kabi.md#kabi-domain-runtime)), the kernel never enters a driver domain
     for KABI calls — all cross-domain calls go through rings. Consumer
     threads run entirely within their domain and hold only intra-domain
     locks (driver-private mutexes, per-device spinlocks). When a domain
     is torn down (Step 7), all intra-domain state is discarded along
     with the consumer threads. The KABI lock stack therefore tracks
     only the locks acquired by the FAULTING CPU's current KABI call
     (applicable when the fault occurred during a Tier 0 direct call
     within the domain), not cross-CPU locks.
   - **NMI-ejected CPUs and Core locks**: CPUs ejected by Step 2a were
     executing consumer loop code in the crashed domain, using only
     intra-domain locks. They do not hold Core locks — the consumer
     loop sets `active_domain != 0` before entering the domain and
     only acquires intra-domain locks thereafter. These CPUs are
     redirected to `domain_crash_trampoline` (see Step 2a definition),
     which runs in Core domain. No Core lock cleanup is needed for
     ejected CPUs. Step 6 logs the ejected CPU count (from the
     `ejected_cpus` counter maintained in Step 2a) for diagnostics.
   - **Exception — Tier 0 code in the same domain**: If a Tier 0
     module colocated in the same domain as the crashed driver was
     mid-call to a Core service (holding a Core lock) when the NMI
     hit, that lock WOULD be on the Tier 0 module's lock stack.
     However, Tier 0 code runs with `active_domain == 0` (Core
     domain), so it is NOT ejected by the Step 2a NMI handler (the
     NMI handler only ejects CPUs where `active_domain ==
     revoked_domain_id`). The Tier 0 module's call completes normally
     and releases its locks through the standard unlock path.
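
   A minimal sketch of the LIFO force-release walk, assuming the
   `LockStackEntry`/`LockType` definitions above (the `LockStack` container
   and the stubbed release bodies are illustrative, not the kernel's
   registry implementation):

    ```rust
    #[derive(Clone, Copy, Debug, PartialEq)]
    enum LockType { Mutex, SpinLock, RwLockRead, RwLockWrite }

    #[derive(Clone, Copy)]
    struct LockStackEntry { lock_ptr: usize, lock_type: LockType }

    const LOCK_STACK_MAX_DEPTH: usize = 8;

    struct LockStack {
        entries: [Option<LockStackEntry>; LOCK_STACK_MAX_DEPTH],
        depth: usize,
    }

    impl LockStack {
        /// Pushed by every lock-acquiring KABI call wrapper.
        fn push(&mut self, e: LockStackEntry) {
            assert!(self.depth < LOCK_STACK_MAX_DEPTH, "KABI non-reentrancy violated");
            self.entries[self.depth] = Some(e);
            self.depth += 1;
        }

        /// Walk in reverse (LIFO) and force-release each held lock.
        /// Returns the release order for diagnostic logging.
        fn force_release_all(&mut self) -> Vec<usize> {
            let mut released = Vec::new();
            while self.depth > 0 {
                self.depth -= 1;
                let e = self.entries[self.depth].take().unwrap();
                match e.lock_type {
                    LockType::Mutex => { /* set owner to NONE, wake waiters */ }
                    LockType::SpinLock => { /* clear the lock word */ }
                    LockType::RwLockRead => { /* decrement reader count */ }
                    LockType::RwLockWrite => { /* clear writer flag, wake waiters */ }
                }
                released.push(e.lock_ptr);
            }
            released
        }
    }

    fn main() {
        let mut stack = LockStack { entries: [None; LOCK_STACK_MAX_DEPTH], depth: 0 };
        stack.push(LockStackEntry { lock_ptr: 0x1000, lock_type: LockType::Mutex });
        stack.push(LockStackEntry { lock_ptr: 0x2000, lock_type: LockType::SpinLock });
        // Released in reverse acquisition order: the spinlock first.
        assert_eq!(stack.force_release_all(), vec![0x2000, 0x1000]);
        println!("ok");
    }
    ```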

7. UNLOAD DRIVER
   - **Workqueue cancellation**: All pending workqueue items owned by the
     crashed domain are dropped (not executed). The workqueue subsystem
     ([Section 3.11](03-concurrency.md#workqueue-deferred-work)) checks the domain state between
     work items: `if domain_crashed(work.owner_domain_id) { drop(work); continue; }`.
     `domain_crashed()` queries the domain registry for the domain's crash state:
     ```rust
     /// Returns true if the specified domain has crashed and is undergoing
     /// recovery or has been permanently disabled.
     /// `DomainState` is defined in [Section 12.8](12-kabi.md#kabi-domain-runtime).
     fn domain_crashed(id: DomainId) -> bool {
         match domain_registry.get(id) {
             Some(desc) => {
                 // Single load to avoid TOCTOU between the two comparisons.
                 let s: DomainState = desc.state.load(Acquire);
                 matches!(s, DomainState::Crashed | DomainState::Recovering | DomainState::Stopped)
             }
             None => true, // Domain was already fully torn down
         }
     }
     ```
     Work items already in-progress on a worker thread complete normally
     (they run in Core domain context, not in the crashed domain). Work
     items queued but not yet started are discarded. This prevents stale
     driver callbacks from executing after domain teardown.
   - Clear `DomainDescriptor.consumer_threads` (the old consumer threads
     were terminated during Step 2a NMI ejection or exited upon observing
     `RING_STATE_DISCONNECTED`). This must happen BEFORE Step 8 (RELOAD)
     so that the Hello protocol can repopulate it with new thread TaskIds.
   - Free all driver-private memory
   - Release all driver capabilities
   - Unmap driver MMIO regions

   **Domain-group crash: multi-driver reload ordering**

   When a domain contains multiple drivers and crashes, all drivers in the
   domain must be unloaded (step 7) and reloaded (step 8). The reload order
   is determined by the driver dependency graph registered in the KABI
   dependency registry ([Section 12.7](12-kabi.md#kabi-service-dependency-resolution)):

   - **Unload order**: reverse dependency order (dependents first, then
     the drivers they depend on). This ensures dependent drivers release
     their references to provider drivers before provider state is freed.
   - **Reload order**: dependency order (leaf/provider drivers first, then
     their dependents). This ensures each driver's dependencies are
     available when it calls `init()` and binds to services.
   - If no dependency relationship exists between two drivers in the same
     domain, their relative reload order is arbitrary.
   - **Partial reload failure**: If a driver's reload fails (step 8 timeout
     or init crash), the failure does NOT block other drivers' reloads.
     The failed driver is marked `DeviceState::Error` and its dependents
     receive `Err(ServiceUnavailable)` when they attempt to bind. Dependents
     that can operate in degraded mode (e.g., a filesystem without its
     crypto provider) proceed with reduced functionality. Dependents that
     require the failed driver abort their own init and are also marked
     `DeviceState::Error`.
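
   Under these rules, the reload order is a topological sort of the
   dependency graph. A hedged sketch using Kahn's algorithm (the
   `reload_order` helper and the driver names in the example are
   illustrative, not the KABI dependency registry's API):

    ```rust
    use std::collections::HashMap;

    /// Compute a dependency-order reload sequence: providers before
    /// dependents. Unload order is this sequence reversed.
    /// `deps[d]` lists the drivers that `d` depends on.
    fn reload_order(deps: &HashMap<&str, Vec<&str>>) -> Vec<String> {
        // In-degree = number of unmet dependencies.
        let mut indeg: HashMap<&str, usize> =
            deps.keys().map(|&d| (d, deps[d].len())).collect();
        let mut ready: Vec<&str> = indeg
            .iter()
            .filter(|&(_, &n)| n == 0)
            .map(|(&d, _)| d)
            .collect();
        ready.sort(); // deterministic order among independent drivers
        let mut order = Vec::new();
        while let Some(d) = ready.pop() {
            order.push(d.to_string());
            // Loading `d` satisfies one dependency of each dependent.
            let mut unblocked: Vec<&str> = Vec::new();
            for (&dep, list) in deps {
                if list.contains(&d) {
                    let n = indeg.get_mut(dep).unwrap();
                    *n -= 1;
                    if *n == 0 { unblocked.push(dep); }
                }
            }
            unblocked.sort();
            for u in unblocked { ready.push(u); }
        }
        order
    }

    fn main() {
        let mut deps = HashMap::new();
        deps.insert("fs", vec!["crypto", "block"]);
        deps.insert("crypto", vec![]);
        deps.insert("block", vec![]);
        let order = reload_order(&deps);
        // Providers first, dependent last; unload order is the reverse.
        assert_eq!(order.len(), 3);
        assert_eq!(order.last().map(String::as_str), Some("fs"));
        println!("reload order: {:?}", order);
    }
    ```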

8. RELOAD DRIVER

   **Crash Recovery Memory Pool**: Module decompression, ELF parsing, and page
   table setup during reload require memory allocation. Under OOM conditions,
   normal allocation may stall indefinitely. The kernel reserves a dedicated
   memory pool at boot for crash recovery allocations:

   ```rust
   /// Pre-allocated memory pool for crash recovery. Reserved at boot
   /// Phase 1.1 (buddy allocator init). Only the crash recovery path
   /// (step 8 RELOAD) may allocate from this pool.
   pub struct CrashRecoveryPool {
       /// Contiguous physical pages, pinned and excluded from normal allocation.
       /// The pool lives within the kernel's direct/linear map, so callers convert
       /// the returned `PhysAddr` to a writable virtual address via
       /// `phys_to_virt(phys_addr)` before use (ELF loading, zstd decompression,
       /// page table construction all operate on virtual addresses).
       base: PhysAddr,
       /// Pool size in bytes. Configurable via kernel command line
       /// `umka.recovery_pool_mb=N` (default: 4 MiB, range: 0-64).
       ///
       /// When N=0: no pool reserved. Step 8 allocates from the normal
       /// allocator with GFP_KERNEL semantics (may sleep, waits for OOM
       /// killer to free pages). Recovery stalls until memory is available.
       /// Device is unavailable during the wait. This is a valid deployment
       /// choice for memory-constrained systems: trades recovery speed for
       /// memory savings.
       size: usize,
       /// Bump allocator offset. Reset to 0 after recovery completes
       /// (pool is reusable for the next crash recovery).
       offset: AtomicUsize,
   }
   ```

   Allocation during step 8:
   1. If `recovery_pool.size > 0`: allocate from the pool (instant, guaranteed).
      If the pool is exhausted (module binary larger than pool): fall through
      to the normal allocator.
   2. If `recovery_pool.size == 0` or pool exhausted: allocate from the normal
      allocator with `GFP_KERNEL` (sleepable). May wait for OOM killer to free
      pages. No panic — recovery is delayed, not failed.
   3. After step 9 (RESUME): reset `offset` to 0. Pool is reusable.

   **Concurrent crash safety**: If multiple domains crash simultaneously,
   each domain's recovery thread bump-allocates from the pool via a CAS loop
   with bounds checking:
   ```rust
   // SAFETY: Relaxed ordering on the CAS is sufficient because:
   //   1. CAS atomicity prevents overlapping allocations — each successful
   //      CAS claims a disjoint [current, current+alloc_size) region.
   //   2. Each recovery thread writes only to its own allocated region;
   //      no inter-thread data sharing occurs through the pool itself.
   //   3. The pool reset path uses fetch_sub(1, AcqRel) on active_recoveries
   //      which provides the necessary release/acquire ordering to ensure
   //      all writes to allocated regions are visible before the pool is
   //      reused. The reset (offset → 0) happens only when the last
   //      recovery completes (previous active_recoveries value was 1).
   loop {
       let current = self.offset.load(Relaxed);
       let new_offset = current + alloc_size;
       if new_offset > self.size {
           return None; // Pool exhausted — fall through to normal allocator
       }
       if self.offset.compare_exchange(
           current, new_offset, Relaxed, Relaxed
       ).is_ok() {
           // Return the virtual address via the kernel's direct/linear map.
           // The pool is allocated from the buddy allocator at Phase 1.1,
           // so the physical pages are always present in the kernel's direct
           // map (phys_to_virt). Module loading (zstd decompression, ELF
           // parsing, page table construction) operates on virtual addresses.
           return Some(phys_to_virt(self.base + current));
       }
   }
   ```
   The CAS loop prevents out-of-bounds allocation: unlike a bare `fetch_add`,
   the new offset is validated against `self.size` before committing.
   The pool reset (`offset` → 0) only occurs when ALL concurrent recoveries
   have completed (tracked by `active_recoveries: AtomicU32`, incremented at
   step 8 entry, decremented at step 9 completion; reset happens when
   `active_recoveries.fetch_sub(1, AcqRel) == 1` — i.e., the previous
   value was 1, meaning this decrement brought the count to zero and this
   thread is the last active recovery). This prevents a race where one
   recovery thread resets the pool while another is still using its allocation.

   **Memory ordering**: `active_recoveries.fetch_sub(1, AcqRel)` ensures
   visibility of all writes made during crash recovery. An `atomic::fence(Acquire)`
   before resetting `offset` to 0 ensures no allocations are in-flight when
   the pool is recycled.  The `AcqRel` on `fetch_sub` + `Acquire` fence
   before reset provides a full barrier chain: the resetting thread observes
   all writes from all concurrent recovery threads.  The `Relaxed` load at
   the top of the CAS loop is safe because no new allocation attempt occurs
   until a subsequent crash triggers a new `active_recoveries.fetch_add(1,
   AcqRel)`, which pairs with this reset.
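
   A user-space sketch of this last-recovery-resets protocol (the
   `RecoveryPool::begin`/`finish` method names are illustrative
   assumptions, not the kernel's API):

    ```rust
    use std::sync::atomic::{fence, AtomicU32, AtomicUsize, Ordering};

    struct RecoveryPool {
        /// Bump allocator offset into the pool.
        offset: AtomicUsize,
        /// Number of crash recoveries currently in step 8-9.
        active_recoveries: AtomicU32,
    }

    impl RecoveryPool {
        /// Called at step 8 entry.
        fn begin(&self) {
            self.active_recoveries.fetch_add(1, Ordering::AcqRel);
        }

        /// Called at step 9 completion. Returns true if this caller was
        /// the last active recovery and therefore reset the pool.
        fn finish(&self) -> bool {
            if self.active_recoveries.fetch_sub(1, Ordering::AcqRel) == 1 {
                // Previous value was 1: all concurrent recoveries are done,
                // so no allocation is in flight when the pool is recycled.
                fence(Ordering::Acquire);
                self.offset.store(0, Ordering::Relaxed);
                true
            } else {
                false
            }
        }
    }

    fn main() {
        let pool = RecoveryPool {
            offset: AtomicUsize::new(0),
            active_recoveries: AtomicU32::new(0),
        };
        pool.begin();
        pool.begin();
        pool.offset.store(4096, Ordering::Relaxed); // two recoveries allocated
        assert!(!pool.finish()); // first finisher must not reset the pool
        assert!(pool.finish());  // last finisher resets it
        assert_eq!(pool.offset.load(Ordering::Relaxed), 0);
        println!("ok");
    }
    ```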

   - Load driver binary from the **Module Binary Store** (MBS):

     The MBS is a kernel-resident cache of compressed (.zstd) module ELF
     binaries, populated at module load time and retained for crash recovery.
     It eliminates the circular dependency where the crashed driver IS the
     filesystem driver that holds the module binary on disk.

     ```rust
      /// Module identity for MBS keying. Derived from the KABI manifest's
      /// `driver_name` (64 bytes) and `driver_version` (u32) via SipHash-2-4.
      /// Stable across boots (same binary = same ModuleId). Two modules with
      /// different names or versions produce different ModuleIds.
      /// SipHash key: fixed kernel constant (not secret — collision resistance
      /// is sufficient, not pre-image resistance).
      pub type ModuleId = u64;

      pub fn module_id(manifest: &KabiDriverManifest) -> ModuleId {
          siphash_2_4(&KERNEL_HASH_KEY, &manifest.driver_name, manifest.driver_version)
      }

      /// Kernel-resident store of module binaries for crash recovery.
      /// Populated at module load time; consulted at step 8 of crash
      /// recovery instead of reading from the filesystem.
      ///
      /// Placement: Tier 0 (Core). The MBS itself must not crash — it is
      /// a simple XArray of compressed byte buffers.
      pub struct ModuleBinaryStore {
         /// Compressed .uko ELF binaries, keyed by ModuleId (u64).
         /// Pinned kernel memory — not pageable, not evictable.
         binaries: XArray<CompressedModuleBinary>,
         /// Total compressed bytes currently stored.
         total_bytes: AtomicU64,
         /// Maximum bytes allowed. Default: 16 MiB. Configurable
         /// via kernel command line `mbs_max_mb=N`.
         max_bytes: u64,
     }

     pub struct CompressedModuleBinary {
         /// zstd-compressed raw .uko ELF bytes.
         data: &'static [u8],
         /// Original (uncompressed) size for pre-allocation at reload.
         original_size: u32,
         /// Compressed size for memory accounting.
         compressed_size: u32,
         module_id: ModuleId,
     }
     ```

     **Population**: when the module loader loads any Tier 1 or Tier 2 module,
     it checks `KabiDriverManifest.mbs_exclude`
     ([Section 12.6](12-kabi.md#kabi-transport-classes--kabidrivermanifest-transport-capability-advertisement)).
     If `mbs_exclude == false` (default), the raw .uko ELF bytes are zstd-
     compressed (~3-4x reduction) and stored in the MBS. Tier 0 modules are
     NOT cached (Tier 0 crash = kernel panic, no recovery).

     **Exclusion**: drivers that set `mbs_exclude = true` in their KABI
     manifest are not cached. This is appropriate for media/display/bluetooth
     drivers where a brief service interruption on crash is acceptable and
     memory savings on constrained devices matter. Filesystem, block,
     network, and crypto drivers should NEVER set this flag.

     **Root filesystem protection**: The module loader enforces a hard
     constraint: if a driver serves the root filesystem (its `DeviceNode`
     is in the root mount's device path), `mbs_exclude` MUST be `false`.
     If a driver binary has `mbs_exclude = true` in its manifest but is
     assigned to the root filesystem's device, the module loader overrides
     the manifest and caches the binary anyway, logging a warning:
     `"mbs_exclude overridden for root filesystem driver <name>"`.
     This prevents an unrecoverable deadlock: if the root filesystem
     driver crashes and its binary is not in the MBS, the recovery
     path cannot load the replacement binary (it must read from the
     filesystem served by the crashed driver). The override is applied
     at module load time, not at crash time — by the time a crash
     occurs, the binary is already cached.

     **Reload path**: crash recovery step 8 calls `mbs.load(module_id)` —
     an O(1) XArray lookup, no filesystem I/O. The compressed binary is
     decompressed (~0.5-1ms for a 200KB module), parsed, mapped, and
     initialized. If the module is not in MBS (mbs_exclude was set),
     fallback to filesystem read (best-effort).

     **Memory budget**: typical server ~1-2 MB compressed; RPi-class device
     ~1 MB compressed (excluding audio/video/bluetooth/GPU modules).
     Configurable cap prevents unbounded growth.

     **Overflow policy**: When `total_bytes + new_compressed_size > max_bytes`,
     the MBS rejects the store with `Err(KernelError::ResourceExhausted)`.
     The module loader logs an FMA warning (`"MBS capacity exceeded for
     driver {name}, crash recovery will use filesystem fallback"`) and
     proceeds with the driver load. The driver operates normally but falls
     back to filesystem-based reload on crash recovery. No existing MBS
     entries are evicted — FIFO eviction would remove critical filesystem/
     network drivers to make room for low-priority media drivers.
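
      The overflow policy can be sketched as follows (a user-space
      stand-in: `HashMap` replaces the XArray, and the `Mbs`/`StoreError`
      names are illustrative, not the kernel's types):

      ```rust
      use std::collections::HashMap;

      #[derive(Debug, PartialEq)]
      enum StoreError { ResourceExhausted }

      struct Mbs {
          binaries: HashMap<u64, Vec<u8>>,
          total_bytes: u64,
          max_bytes: u64,
      }

      impl Mbs {
          /// Reject stores past `max_bytes` — never evict existing entries.
          fn store(&mut self, module_id: u64, compressed: Vec<u8>) -> Result<(), StoreError> {
              let new_size = compressed.len() as u64;
              if self.total_bytes + new_size > self.max_bytes {
                  // Caller logs an FMA warning; this driver falls back to
                  // filesystem-based reload on crash recovery.
                  return Err(StoreError::ResourceExhausted);
              }
              self.total_bytes += new_size;
              self.binaries.insert(module_id, compressed);
              Ok(())
          }

          /// rmmod path: remove the entry and its accounting.
          fn remove(&mut self, module_id: u64) {
              if let Some(data) = self.binaries.remove(&module_id) {
                  self.total_bytes -= data.len() as u64;
              }
          }
      }

      fn main() {
          let mut mbs = Mbs { binaries: HashMap::new(), total_bytes: 0, max_bytes: 100 };
          assert!(mbs.store(1, vec![0u8; 80]).is_ok());
          // Over capacity: rejected, and the existing entry is untouched.
          assert_eq!(mbs.store(2, vec![0u8; 40]), Err(StoreError::ResourceExhausted));
          assert!(mbs.binaries.contains_key(&1));
          mbs.remove(1);
          assert_eq!(mbs.total_bytes, 0);
          println!("ok");
      }
      ```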

     **Runtime tier promotion/demotion interaction**: Administrators can
     change a driver's tier at runtime via
     `echo 0|1|2 > /ukfs/kernel/drivers/<name>/tier`
     ([Section 11.3](#driver-isolation-tiers)). The kernel reloads the driver at
     the new tier. The MBS must track tier changes:

     | Transition | MBS Action |
     |---|---|
     | Tier 0 → Tier 1 (promotion to isolated) | Binary was NOT in MBS (Tier 0 skips caching). The reload at Tier 1 goes through the normal module loader which reads the .uko from disk and caches it in MBS. After the tier change, the driver is crash-recoverable. |
     | Tier 0 → Tier 2 (promotion to process) | Same as above — binary loaded from disk and cached in MBS. |
     | Tier 1 → Tier 0 (demotion to in-kernel) | Binary IS in MBS. Retained (not evicted) — admin may promote back to Tier 1 later. Eviction would force a disk read on re-promotion, recreating the circular dependency if the driver serves the root filesystem. MBS entry is kept until the module is fully unloaded (`rmmod`). |
     | Tier 1 → Tier 2 (auto-demotion after crashes) | Binary stays in MBS. Both Tier 1 and Tier 2 use MBS for crash recovery. No MBS action needed. |
     | Tier 2 → Tier 1 (promotion after stability) | Binary already in MBS. No action needed. |
     | Tier 2 → Tier 0 (promotion to in-kernel) | Binary retained in MBS (same rationale as Tier 1 → Tier 0). |

     **Rule**: once a binary enters MBS, it stays until `rmmod`. Tier
     changes never evict MBS entries. The cost of keeping a ~50-200 KB
     compressed entry is trivial compared to the risk of needing it
     after a future re-promotion.

     **Deallocation**: when the module loader processes `rmmod`, it calls
     `mbs.remove(module_id)`. This removes the XArray entry, decrements
     `total_bytes` by `compressed_size`, and frees the compressed data
     backing memory (allocated from `MBS_DATA_SLAB`, a dedicated slab
     cache for MBS entries).  The slab cache uses `PAGE_KERNEL` pages
     (not pinned to a specific node) and is created at MBS init.

   - New bilateral vtable exchange
   - Device re-initialization
   - Re-register interrupt handlers

9. RESUME
   - New driver begins accepting I/O requests
   - Applications retry failed operations (standard I/O error handling)

   **Subsystem-specific recovery extensions**: Subsystems interleave additional
   steps between the general steps above. See
   [Section 14.3](14-vfs.md#vfs-per-cpu-ring-extension--unified-vfs-driver-crash-recovery-sequence)
   for the canonical VFS crash recovery sequence (Steps U1-U18) that merges
   VFS-specific steps (ring quiescence, orphaned page unlock, page cache
   integrity check, dirty page detection) with the general steps.

   NIC RX RING OVERFLOW DURING GRACEFUL QUIESCENCE:
   Between steps 2 (ISOLATE) and 8 (RELOAD), the NIC's hardware RX ring
   continues receiving packets from the network (the NIC hardware is still
   active and connected to the link). If the driver is down for ~50-150ms
   during recovery, the hardware RX ring may overflow and drop packets.
   Mitigation:
   a. NAPI poll continues in Tier 0 context during recovery. The crash
      recovery manager calls `napi_schedule()` on the NIC's NAPI instance
      to drain hardware RX descriptors into a temporary ring buffer
      (pre-allocated per-NIC at driver probe time, capacity
      `max(256, 2 * nic.rx_ring_size)`). This single ring serves both
      crash recovery and graceful evolution quiescence
      ([Section 13.18](13-device-classes.md#live-kernel-evolution)). These packets are held until
      the replacement driver is loaded and can process them.
   b. If the temporary ring overflows (sustained high packet rate during
      the ~50-150ms recovery window), excess packets are dropped with
      a per-NIC `rx_recovery_drops` counter incremented. This is
      equivalent to tail-drop under link congestion — TCP retransmits
      handle recovery; UDP applications must tolerate loss.
   c. After step 8 (RELOAD), the replacement driver's `init()` calls
      `napi_recovery_drain()` to process all buffered packets from the
      temporary ring before accepting new hardware interrupts.
   d. Userspace tasks blocked in `recv()`/`recvmsg()` during NIC driver
      **crash** reload: the socket's wait queue is woken with `POLLERR`
      when the NIC's NAPI instance is disabled (step 2, ISOLATE). The
      woken task sees an empty receive buffer and the socket error flag
      set to `EIO`. The task returns `-EIO` from `recv()`. After driver
      reload completes and NAPI resumes, the socket error flag is cleared.
      Subsequent `recv()` calls succeed normally.
      **Note**: This `-EIO` behavior applies only to NIC driver **crash**
      recovery, where the NAPI instance is abruptly disabled. During
      **graceful** NIC driver evolution ([Section 13.18](13-device-classes.md#live-kernel-evolution)),
      socket-layer operations are NOT disrupted — see "Socket-layer impact
      during NIC reload" below for the distinction.
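
   The temporary-ring mitigation (points a-c) can be sketched as follows
   (a user-space model; the `RecoveryRxRing` type and method names are
   illustrative assumptions, not the kernel's NAPI API):

    ```rust
    use std::collections::VecDeque;

    struct RecoveryRxRing {
        buf: VecDeque<Vec<u8>>,
        capacity: usize,
        /// Per-NIC drop counter for packets lost during the recovery window.
        rx_recovery_drops: u64,
    }

    impl RecoveryRxRing {
        /// Pre-allocated at driver probe: capacity = max(256, 2 * rx_ring_size).
        fn new(rx_ring_size: usize) -> Self {
            Self {
                buf: VecDeque::new(),
                capacity: usize::max(256, 2 * rx_ring_size),
                rx_recovery_drops: 0,
            }
        }

        /// Called from the Tier 0 NAPI poll while the driver is down.
        fn buffer_packet(&mut self, pkt: Vec<u8>) {
            if self.buf.len() == self.capacity {
                // Tail-drop, equivalent to link congestion: TCP retransmits
                // handle recovery; UDP applications must tolerate loss.
                self.rx_recovery_drops += 1;
            } else {
                self.buf.push_back(pkt);
            }
        }

        /// Called by the replacement driver's init() before enabling
        /// hardware interrupts. Returns the number of packets delivered.
        fn recovery_drain(&mut self, mut deliver: impl FnMut(Vec<u8>)) -> usize {
            let n = self.buf.len();
            for pkt in self.buf.drain(..) {
                deliver(pkt);
            }
            n
        }
    }

    fn main() {
        let mut ring = RecoveryRxRing::new(1); // tiny HW ring still gets 256 slots
        assert_eq!(ring.capacity, 256);
        for i in 0..300u32 {
            ring.buffer_packet(i.to_le_bytes().to_vec());
        }
        assert_eq!(ring.rx_recovery_drops, 44); // 300 - 256 tail-dropped
        assert_eq!(ring.recovery_drain(|_| {}), 256);
        println!("ok");
    }
    ```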

TOTAL RECOVERY TIME: ~50ms typical (soft-reset path) to ~150ms (FLR path)
  (design target; validation requires hardware prototype — actual timing depends
   on driver state snapshot complexity and memory domain reset cost)
  Note: PCIe FLR spec allows up to 100ms for function-level reset completion.
  NVMe controller reset (CC.EN=0→1 + CSTS.RDY transition) may add 100-500ms
  depending on controller firmware and queue depth at crash time.

redirect_to_crash_recovery() Definition

The NMI handler at Step 2a calls redirect_to_crash_recovery() to eject a CPU from a crashed domain. This function modifies the NMI exception return frame so that when the NMI returns, the CPU resumes at domain_crash_trampoline instead of the original faulting code within the crashed domain.

/// Architecture-neutral NMI exception frame. Each architecture defines
/// its own concrete type in `arch::current::interrupts::NmiFrame`;
/// this trait provides the common interface used by crash recovery.
///
/// The NMI frame is the saved register state pushed by the CPU (or
/// software exception handler) when the NMI/pseudo-NMI/FIQ is taken.
/// Modifying the return address in this frame redirects execution when
/// the NMI handler returns.
///
/// | Architecture | Frame location                  | Return address field |
/// |---|---|---|
/// | x86-64       | NMI IRET frame on IST stack     | RIP in IRET frame   |
/// | AArch64      | ELR_EL1 (exception link register)| ELR_EL1             |
/// | ARMv7        | LR_fiq on exception stack       | LR in banked FIQ mode|
/// | RISC-V       | SEPC (supervisor exception PC)  | SEPC CSR            |
/// | PPC32        | CSRR0 (critical interrupt return addr) | CSRR0         |
/// | PPC64LE      | SRR0                            | SRR0                |
/// | s390x        | PSW in lowcore prefix area      | PSW instruction addr |
/// | LoongArch64  | ERA (exception return address)  | ERA CSR             |
pub trait NmiFrameOps {
    /// Overwrite the return address so the NMI returns to `addr`
    /// instead of the interrupted instruction.
    fn set_return_address(&mut self, addr: usize);

    /// Sanitize the exception return frame for safe trampoline entry.
    /// Architecture-specific adjustments:
    /// - x86-64: set CS to KERNEL_CS, clear IF in RFLAGS (prevent
    ///   nested interrupts during trampoline), set SS to KERNEL_DS.
    /// - AArch64: set SPSR_EL1 to EL1h with DAIF masked.
    /// - ARMv7: set CPSR to SVC mode with IRQ/FIQ disabled.
    /// - RISC-V: set SSTATUS.SPP to supervisor, clear SIE.
    /// - PPC32: set CSRR1 MSR to supervisor with EE/CE cleared; return via `rfci`.
    /// - PPC64LE: set SRR1 MSR to supervisor with EE cleared; return via `rfid`.
    /// - s390x: set PSW to supervisor state with I/O and external
    ///   interrupts disabled.
    /// - LoongArch64: set PRMD to kernel privilege, clear PIE.
    fn sanitize_for_trampoline(&mut self);
}

/// Redirect an NMI-interrupted CPU out of a crashed domain.
///
/// Called from the NMI handler when `CpuLocal.active_domain` matches
/// the revoked domain ID. Modifies the NMI return frame so the CPU
/// resumes at `domain_crash_trampoline` in Core domain (Domain 0).
///
/// # Safety
///
/// Safe to call from NMI context because:
/// - The NMI handler has exclusive access to the interrupted frame
///   (it is on the IST/exception stack, not shared).
/// - `domain_crash_trampoline` is a Core domain function (always
///   mapped, never revoked).
/// - The interrupted context is in a crashed domain (permanently
///   revoked) — its execution cannot meaningfully continue.
/// - No locks are acquired. Only the NMI frame and an atomic counter
///   are modified.
fn redirect_to_crash_recovery(
    nmi_frame: &mut arch::current::interrupts::NmiFrame,
    ejected_count: &AtomicU64,
) {
    nmi_frame.set_return_address(domain_crash_trampoline as usize);
    nmi_frame.sanitize_for_trampoline();
    ejected_count.fetch_add(1, Ordering::Relaxed);
}

/// Trampoline function that runs after an NMI-ejected CPU returns
/// from the NMI handler. Executes in Core domain (Domain 0).
///
/// This function:
/// 1. Resets `CpuLocal.active_domain` to 0 (Core domain), reflecting
///    that the CPU is no longer executing in any isolation domain.
/// 2. Saves the ejected CPU's register state snapshot (from the NMI
///    frame) to the driver's `DriverCrashState` for diagnostics.
/// 3. Calls `schedule()` to yield the CPU to the scheduler, allowing
///    it to run other work while the crashed domain is being recovered.
///
/// The trampoline never returns to the crashed domain's code. The
/// consumer loop that was running on this CPU is effectively aborted;
/// any in-flight ring requests are handled by Step 3 (RECOVER RING
/// BUFFER IN-FLIGHT SLOTS) of the crash recovery sequence.
fn domain_crash_trampoline() -> ! {
    let cpu_local = arch::current::cpu::cpulocal();
    // Follow the same Phase 4 exit protocol as the consumer loop:
    // 1. Clear domain_valid (may already be 0 if NMI handler cleared it,
    //    but store is idempotent).
    cpu_local.domain_valid.store(0, Ordering::Release);
    // 2. Clear active_domain to Core.
    cpu_local.active_domain.store(CORE_DOMAIN_ID, Ordering::Relaxed);
    // 3. Switch the HARDWARE isolation register back to Core domain.
    //    Without this, the CPU's PKRU/POR_EL0/DACR still has the crashed
    //    domain's permissions enabled — an isolation violation.
    arch::current::isolation::switch_domain(CORE_DOMAIN_ID);

    // Register state was already captured in the NMI handler; the
    // crash recovery thread (running on the faulting CPU) will read
    // it from the DriverCrashState buffer.
    //
    // Loop into schedule() forever. This function was entered via NMI
    // frame rewrite (not a `call` instruction), so there is no valid
    // return address on the stack. The consumer loop it replaces is
    // also `-> !` (never returns). The kthread that ran the consumer
    // loop becomes an idle-like entity that yields the CPU to the
    // scheduler until the domain service terminates or rebinds it
    // after crash recovery completes.
    loop {
        schedule();
    }
}

11.9.2.1 napi_recovery_drain() Definition

/// Drain hardware RX ring into temporary buffer during driver recovery.
/// Called by the crash recovery manager after Step 2' (SET RING STATE)
/// and BEFORE DMA-1 (FLR initiation). This timing is critical:
/// `napi_recovery_drain()` reads hardware RX ring descriptors via MMIO,
/// which requires the NIC to be functional. After DMA-1 (FLR), the NIC
/// is being reset and MMIO reads return undefined data. After DMA-2
/// (IOTLB invalidation), the NIC's DMA is blocked.
///
/// **Step placement in interleaved ordering**:
/// ```text
/// Step 2:   ISOLATE (domain revocation)
/// Step 2':  SET RING STATE (Disconnected)
/// >>> napi_recovery_drain() runs HERE <<<
/// DMA-1:    initiate_flr()
/// Step 2a:  NMI EJECTION
/// DMA-2:    invalidate_iotlb()
/// ...
/// ```
///
/// Runs in Tier 0 context (the crashed driver's domain is revoked).
/// Reads raw DMA descriptors from the NIC's RX ring (MMIO-mapped),
/// copies packet data to pre-allocated recovery pages, and enqueues
/// descriptors into the temporary ring buffer.
///
/// **MMIO access pattern**: After Step 2 (ISOLATE), the driver's MMIO
/// pages are tagged with the driver's domain key (PKEY on x86, POE
/// domain on AArch64, etc.). Tier 0 code (running with Core domain
/// permissions) cannot access these pages without explicitly granting
/// temporary access. This function uses the architecture-neutral
/// `with_domain_access()` wrapper to temporarily enable access to the
/// crashed driver's domain for the duration of the MMIO reads:
///
/// ```rust
/// arch::current::isolation::with_domain_access(dev.domain_id, || {
///     // MMIO reads are safe here — the driver's domain key is
///     // temporarily enabled in the isolation register.
///     // ... read RX descriptors, copy packet data ...
/// });
/// ```
///
/// Per-architecture implementation of `with_domain_access()`:
/// - x86-64: Save PKRU, enable the driver's PKEY via WRPKRU, execute
///   closure, restore original PKRU. (~40 cycles round-trip.)
/// - AArch64 (POE): Save POR_EL0, grant overlay permission for the
///   driver's POE domain, execute closure, restore POR_EL0.
/// - AArch64 (ASID fallback): Switch ASID to a recovery ASID that
///   has the driver's MMIO pages mapped, execute closure, switch back.
/// - ARMv7: Save DACR, set driver's domain to Client, ISB, execute
///   closure, restore DACR + ISB.
/// - PPC32: Load driver's segment register, execute closure, restore.
/// - PPC64LE: Switch Radix PID to recovery PID, execute closure, restore.
/// - RISC-V / s390x / LoongArch64: These architectures run Tier 1
///   drivers as Tier 0 (no fast isolation), so MMIO pages are in the
///   Core domain and accessible without special handling.
///
/// This pattern is consistent with the ptrace MMIO access pattern
/// documented in [Section 11.3](#driver-isolation-tiers).
///
/// Returns the number of packets drained. Stops when the RX ring is
/// empty or the temporary buffer is full.
pub fn napi_recovery_drain(
    dev: &NetDevice,
    temp_ring: &mut RecoveryRxRing,
) -> u32;
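The PKRU save/enable/restore round-trip described above can be modeled in plain Rust. This is an illustrative sketch only: the thread-local `PKRU` cell stands in for the real per-CPU register (written via WRPKRU), using the x86-64 two-bits-per-key (AD/WD) encoding:

```rust
use std::cell::Cell;

thread_local! {
    /// Stand-in for the per-CPU PKRU register (illustrative only).
    /// All keys denied except key 0 (the Core/default key).
    static PKRU: Cell<u32> = Cell::new(0xFFFF_FFFC);
}

/// Temporarily enable access (clear the AD/WD bits) for `domain_key`,
/// run the closure, then restore the previous PKRU value — mirroring
/// the save/WRPKRU/restore pattern of `with_domain_access()`.
pub fn with_domain_access<R>(domain_key: u32, f: impl FnOnce() -> R) -> R {
    let mask = 0b11u32 << (2 * domain_key); // AD|WD bits for this key
    let saved = PKRU.with(|p| {
        let old = p.get();
        p.set(old & !mask); // grant read+write for the domain's key
        old
    });
    let result = f();
    PKRU.with(|p| p.set(saved)); // restore on the way out
    result
}

/// Helper: is `domain_key` currently accessible?
pub fn key_accessible(domain_key: u32) -> bool {
    PKRU.with(|p| p.get() & (0b11 << (2 * domain_key)) == 0)
}
```

A production version would restore via a drop guard so the saved value also survives early returns and unwinds.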

11.9.2.2 RecoveryRxRing Definition

/// Temporary RX ring buffer used during NIC driver crash recovery.
/// Pre-allocated per NIC at driver probe time to ensure availability
/// during crash recovery (no allocation on the crash path).
///
/// Capacity: `max(256, 2 * nic.rx_ring_size)`. For a typical NIC with
/// `rx_ring_size = 4096`, this is 8192 descriptors. The 2x factor
/// accounts for packets that arrive during the recovery window
/// (driver reload typically takes 50-150ms).
///
/// Entry type: `RecoveryRxEntry` contains a DMA-mapped buffer handle
/// and packet metadata (length, RSS hash, VLAN tag) extracted from
/// the hardware RX descriptor. The actual packet data is copied into
/// pre-allocated recovery pages (not the original NIC DMA buffers,
/// which are freed during driver teardown).
///
/// Allocated from the buddy allocator at Phase 1.1 (driver probe),
/// pinned in physical memory (no swap, no reclaim). The ring is
/// reusable across multiple crash/reload cycles.
// kernel-internal, not KABI
#[repr(C)]
pub struct RecoveryRxRing {
    /// Base virtual address of the ring entry array.
    pub entries: *mut RecoveryRxEntry,
    /// Number of entries in the ring (power of 2 for mask-based indexing).
    pub capacity: u32,
    /// Producer index (written by napi_recovery_drain).
    pub head: u32,
    /// Consumer index (read by the replacement driver's init path).
    pub tail: u32,
    /// Explicit alignment padding for `pages` (PhysAddr = u64, alignment 8).
    /// On 64-bit: entries(8)+capacity(4)+head(4)+tail(4) = offset 20; this 4-byte
    /// field brings offset to 24 (aligned for u64). On 32-bit: entries(4)+
    /// capacity(4)+head(4)+tail(4) = offset 16; this 8-byte field brings offset
    /// to 24 (aligned for u64). Per CLAUDE.md rule 11 — all padding explicit.
    #[cfg(target_pointer_width = "64")]
    pub _pad_align: [u8; 4],
    #[cfg(target_pointer_width = "32")]
    pub _pad_align: [u8; 8],
    /// Physical pages backing the ring entries. Freed when the NIC
    /// device is permanently removed (not on crash recovery — the ring
    /// persists across reload cycles).
    pub pages: PhysAddr,
    /// Number of physical pages allocated.
    pub page_count: u32,
    pub _pad: u32,
}
// RecoveryRxRing (64-bit): entries(ptr=8,off=0) + capacity(4,off=8) + head(4,off=12) +
//   tail(4,off=16) + _pad_align(4,off=20) + pages(u64=8,off=24) + page_count(4,off=32) +
//   _pad(4,off=36) = 40 bytes.
// RecoveryRxRing (32-bit): entries(ptr=4,off=0) + capacity(4,off=4) + head(4,off=8) +
//   tail(4,off=12) + _pad_align(8,off=16) + pages(u64=8,off=24) + page_count(4,off=32) +
//   _pad(4,off=36) = 40 bytes.
const_assert!(core::mem::size_of::<RecoveryRxRing>() == 40);

/// Single entry in the RecoveryRxRing.
// kernel-internal, not KABI
#[repr(C)]
pub struct RecoveryRxEntry {
    /// Virtual address of the packet data buffer (allocated from
    /// recovery page pool, NOT the original NIC DMA buffer).
    pub data: *mut u8,
    /// Packet length in bytes.
    pub len: u32,
    /// RSS hash from hardware (0 if not available).
    pub rss_hash: u32,
    /// VLAN TCI (0 if no VLAN tag).
    pub vlan_tci: u16,
    /// Flags: bit 0 = VLAN present, bit 1 = checksum valid.
    pub flags: u16,
    pub _pad: u32,
}
// RecoveryRxEntry (64-bit): data(ptr=8,off=0) + len(u32=4,off=8) + rss_hash(u32=4,off=12) +
//   vlan_tci(u16=2,off=16) + flags(u16=2,off=18) + _pad(u32=4,off=20) = 24 bytes.
// RecoveryRxEntry (32-bit): data(ptr=4,off=0) + len(u32=4,off=4) + rss_hash(u32=4,off=8) +
//   vlan_tci(u16=2,off=12) + flags(u16=2,off=14) + _pad(u32=4,off=16) = 20 bytes.
#[cfg(target_pointer_width = "64")]
const_assert!(core::mem::size_of::<RecoveryRxEntry>() == 24);
#[cfg(target_pointer_width = "32")]
const_assert!(core::mem::size_of::<RecoveryRxEntry>() == 20);

11.9.2.3 DMA Quiescence During Crash Recovery

When a Tier 1 driver faults, the device it controls may have in-flight DMA operations — reads or writes that were initiated before the crash and have not yet completed. If these DMA operations target memory that is freed or reassigned during crash cleanup, the result is silent data corruption. This subsection specifies the DMA quiescence protocol that runs between fault detection (step 1) and driver state cleanup (step 6).

DMA quiescence sits between steps 2 and 3 of the main recovery sequence. The complete interleaved ordering (resolving the DMA/NMI step numbering) is:

Step 1:   FAULT DETECTED
Step 1a:  TIER CHECK
Step 2:   ISOLATE (revoke domain hardware permissions)
Step 2':  SET RING STATE (Disconnected on all rings)
DMA-1:    initiate_flr() — non-blocking FLR (or vendor-specific reset)
Step 2a:  NMI EJECTION (eject all CPUs from the revoked domain)
DMA-2:    invalidate_iotlb() — revoke IOMMU mappings
DMA-3:    wait_dma_quiesce() — wait for all in-flight DMA to complete
DMA-4:    quiesce_domain() — assert DMA-safe state
Step 3:   RECOVER RING BUFFERS
Step 4:   RESUME I/O FOR SURVIVING DOMAINS (CQE delivery)
Step 4a:  INCREMENT CRASH COUNT AND EMIT FMA EVENT
Step 5:   DEVICE RESET VERIFICATION — confirm FLR completed
          (this is a post-FLR verification step, not a reset initiation;
          the actual reset was initiated in DMA-1. For non-FLR devices,
          DMA-1 already issued the vendor-specific reset.)
Step 6:   LOCK CLEANUP
Step 7:   UNLOAD DRIVER
Step 8:   RELOAD
Step 9:   RESUME

FLR initiation (DMA-1) is non-blocking and is placed BEFORE NMI ejection (Step 2a) to maximize the overlap time: while CPUs are being ejected from the domain, the device is already processing the FLR. NMI ejection must complete BEFORE ring recovery (Step 3) to ensure no CPU is still writing to ring memory.

After the faulting driver's isolation domain is revoked (step 2, no driver code can execute), but before pending I/O is drained (step 3), the kernel must ensure the device's DMA engine is stopped and no new DMA translations can occur.

IOMMU domain access path during DMA crash recovery: The crash recovery worker obtains IOMMU domain references by iterating domain_desc.devices (the DomainDescriptor.devices field, populated at driver probe time). For each DeviceHandle, the DmaDeviceHandle.iommu_domain field (an RcuCell<Option<Arc<IommuDomain>>>) is read under rcu_read_lock(). The Arc is cloned to extend the lifetime beyond the RCU critical section. This cloned Arc<IommuDomain> is passed to invalidate_iotlb(), wait_dma_quiesce(), and quiesce_domain(). For domain-group crashes, each device may belong to a different IOMMU group — the orchestrator deduplicates IOMMU domains (using DomainDescriptor.iommu_domains, populated at domain creation time) before issuing invalidations.
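The deduplication step for domain-group crashes can be sketched as follows — a hypothetical helper that collapses per-device `Arc<IommuDomain>` references down to unique domains by pointer identity, so each IOMMU domain receives exactly one invalidation:

```rust
use std::sync::Arc;

/// Stand-in for the real IommuDomain (illustrative).
pub struct IommuDomain {
    pub id: u32,
}

/// Collapse per-device domain references to unique domains by pointer
/// identity — two devices in the same IOMMU group share one Arc, so
/// Arc::ptr_eq identifies the shared domain without needing Eq on the type.
pub fn dedup_domains(per_device: &[Arc<IommuDomain>]) -> Vec<Arc<IommuDomain>> {
    let mut unique: Vec<Arc<IommuDomain>> = Vec::new();
    for d in per_device {
        if !unique.iter().any(|u| Arc::ptr_eq(u, d)) {
            unique.push(Arc::clone(d));
        }
    }
    unique
}
```

The real orchestrator reads `DomainDescriptor.iommu_domains` instead of recomputing this per crash, but the identity-based collapse is the same idea.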

/// DMA device quiescence for crash recovery. Implemented by the PCIe
/// subsystem (Tier 0) for all bus-mastering devices. Called by the kernel's
/// crash recovery manager after revoking the faulted driver's MPK domain.
///
/// # Safety
/// Called with the faulted driver's isolation domain revoked — no driver code
/// is executing. The device may still have in-flight DMA operations targeting
/// memory within the (now-revoked) driver's address range.
pub trait DmaCrashRecovery {
    /// DMA-1: Issue Function-Level Reset (FLR) to stop the device's DMA engine.
    ///
    /// For PCIe devices: write bit 15 of the PCI Express Device Control
    /// register (`PCI_EXP_DEVCTL_BCR_FLR`). FLR guarantees that all in-flight
    /// TLPs (Transaction Layer Packets) complete or are discarded. The PCIe
    /// Base Specification requires FLR to complete within 100 ms.
    ///
    /// For non-PCIe devices or devices without FLR capability: issue a
    /// vendor-specific soft reset (register write to the device's reset
    /// control register). The device-class driver (if Tier 0) or the bus
    /// framework provides the reset sequence.
    ///
    /// Returns `Ok(())` when the device acknowledges the reset (config space
    /// returns valid data). Returns `Err(DeviceError::FlrTimeout)` if the
    /// device does not respond within 100 ms — escalation proceeds via the
    /// FLR Timeout Recovery sequence ([Section 11.9](#crash-recovery-and-state-preservation--flr-timeout-recovery)).
    fn initiate_flr(&self, dev: DeviceHandle) -> Result<(), DeviceError>;

    /// DMA-2: Invalidate the device's IOTLB entries in the IOMMU.
    ///
    /// Issues an IOTLB invalidation command targeting all entries in the
    /// device's IOMMU domain. After this call returns, any new DMA initiated
    /// by the device (if FLR has not yet completed) will fault at the IOMMU
    /// rather than reaching physical memory.
    ///
    /// On Intel VT-d: an IOTLB invalidation descriptor submitted through
    /// the Invalidation Queue (QI), targeting the device's domain ID.
    /// On ARM SMMU: `TLBI_NH_ASID` or `TLBI_S12_VMALL` for the stream ID.
    /// On AMD-Vi: `INVALIDATE_IOTLB_PAGES` command for the domain.
    fn invalidate_iotlb(&self, domain: &IommuDomain);

    /// DMA-3: Wait for in-flight DMA completion.
    ///
    /// Polls for FLR completion (config space returns valid Vendor ID, not
    /// 0xFFFF) with a bounded timeout. After this call returns `Ok(())`, no
    /// device-initiated DMA is active — the device has completed its reset
    /// and is in a quiescent state.
    ///
    /// `timeout_ms`: Maximum wait time. Default: 100 ms for PCIe FLR (per spec).
    /// Returns `Err(DeviceError::FlrTimeout)` if the device does not quiesce
    /// within the deadline.
    fn wait_dma_quiesce(&self, timeout_ms: u32) -> Result<(), DeviceError>;

    /// DMA-4: Transition the IOMMU domain to "quiesced" state.
    ///
    /// The domain's page table mappings are preserved but their permissions
    /// are downgraded to **read-only**. This ensures:
    /// 1. No device can write to the old driver's DMA buffers (even if a
    ///    stale IOTLB entry somehow survives the invalidation).
    /// 2. The new driver instance can inspect the buffer contents during
    ///    `dma_reclaim_buffers()` without risking corruption from the device.
    ///
    /// The domain remains in "quiesced" state until the replacement driver
    /// calls `dma_reclaim_buffers()`, which transitions it back to "active"
    /// with full read-write permissions under the new driver's MPK domain.
    fn quiesce_domain(&self, domain: &mut IommuDomain);
}
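The quiesced/active transition driven by `quiesce_domain()` and `dma_reclaim_buffers()` can be modeled as a small state machine. A hedged sketch with illustrative types (the real `IommuDomain` carries page tables, not an enum):

```rust
/// IOMMU domain DMA permission state during crash recovery (sketch).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum DomainState {
    /// Normal operation: mappings are read-write.
    Active,
    /// Crash recovery: mappings preserved but downgraded to read-only.
    Quiesced,
}

pub struct IommuDomainModel {
    pub state: DomainState,
}

impl IommuDomainModel {
    /// DMA-4: downgrade all mappings to read-only; buffer contents preserved.
    pub fn quiesce(&mut self) {
        self.state = DomainState::Quiesced;
    }

    /// DMA step 8a: the replacement driver reclaims the domain,
    /// restoring read-write under its own isolation domain.
    pub fn reclaim(&mut self) -> Result<(), &'static str> {
        if self.state != DomainState::Quiesced {
            return Err("reclaim outside recovery sequence");
        }
        self.state = DomainState::Active;
        Ok(())
    }

    /// Would a device DMA write be allowed? Quiesced domains fault writes.
    pub fn write_allowed(&self) -> bool {
        self.state == DomainState::Active
    }
}
```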

Complete DMA crash recovery sequence (interleaved with the main recovery steps):

Main step 1: FAULT DETECTED
Main step 2: ISOLATE (revoke driver's MPK domain)

  DMA-1: initiate_flr(dev)
      → Issue FLR to stop the device's DMA engine.
      → If device lacks FLR: vendor-specific soft reset.

  DMA-2: invalidate_iotlb(domain)
      → Set IOMMU domain to read-only (blocks new DMA writes).
        Memory ordering for the IOMMU page table write (RW → RO
        permission downgrade): the page table entry store uses Release
        ordering. The subsequent IOTLB invalidation acts as a full
        memory barrier — after invalidation completes, all CPUs and
        devices observe the read-only permission.
      → Issue IOTLB invalidation for the driver's IOMMU domain
        (flushes cached IOMMU translations).
      → Wait for IOTLB invalidation completion (architecture-specific:
        Intel VT-d: QI descriptor completion — write an Invalidation Wait
        Descriptor to the Invalidation Queue, poll the IWC (Invalidation
        Wait Descriptor Complete) bit in ICS_REG;
        ARM SMMU: CMD_SYNC — issue TLBI_NH_ASID followed by CMD_SYNC,
        poll the SMMU_GERROR or sync completion signal;
        AMD-Vi: INVALIDATE_IOTLB_PAGES command followed by
        COMPLETION_WAIT, poll the completion semaphore).
        No device DMA write can succeed after this point — the IOMMU
        will fault any write transaction targeting the downgraded domain.
      → IOTLB invalidation MUST complete before FLR completes (DMA-3).
        This prevents stale IOTLB entries from allowing the device
        (post-FLR) to DMA to memory it no longer owns.
      → New DMA from the device now faults at the IOMMU.

  DMA-3: wait_dma_quiesce(100)
      → Poll for FLR completion (max 100 ms).
      → After this, no device DMA is active.

  DMA-4: quiesce_domain(domain)
      → Confirm IOMMU mappings are read-only (set in DMA-2).
      → DMA buffers are preserved, not freed.

Main step 3: RECOVER RING BUFFER IN-FLIGHT SLOTS
Main step 4: DRAIN PENDING I/O (complete with -EIO)
Main step 4a: EMIT FMA EVENT
Main step 5: DEVICE RESET VERIFICATION (reset already initiated by FLR in DMA-1)
Main step 6: RELEASE KABI LOCKS
Main step 7: UNLOAD DRIVER
    → Free driver-private memory, release capabilities, unmap MMIO.
    → DMA buffers are NOT freed — ownership transfers to the new instance.

Main step 8: RELOAD DRIVER
    → Load fresh driver binary, new bilateral vtable exchange.
    → The Hello protocol creates new consumer threads for the replacement
      driver. Each new consumer thread's TaskId is pushed to
      `DomainDescriptor.consumer_threads` (cleared in Step 7). This
      ensures the NMI handler targets correct CPUs if a second crash occurs.
    → Ring state (head/tail pointers) is reset to zero AFTER the new driver
      instance is loaded but BEFORE I/O is resumed, preventing premature
      submission to an uninitialized driver.

  DMA step 8a: dma_reclaim_buffers(dev, domain)
      → New driver instance inherits the preserved DMA mappings.
      → IOMMU domain transitions from "quiesced" back to "active"
        with full read-write permissions under the new driver's MPK domain.
      → Returns descriptors of all preserved DMA buffers so the new
        driver can inspect their contents and decide whether to reuse
        them (warm restart) or release them.

Main step 9: RESUME

/// Errors that can occur during DMA buffer reclamation after a driver crash.
pub enum DmaReclaimError {
    /// IOMMU reported a fault while walking the device's DMA mappings.
    /// The IOMMU page table may be corrupted by the crashed driver.
    IommuFault,
    /// The device did not respond to config-space reads within the timeout.
    /// The device may be in a wedged state requiring a bus-level reset.
    DeviceNotResponding,
    /// The preserved DMA metadata (shadow ring, buffer descriptors) failed
    /// integrity checks (CRC mismatch or out-of-range IOVA/size values).
    MetadataCorrupted,
}

/// Called by the NEW driver instance after reload to reclaim DMA buffers
/// from the previous (crashed) instance.
///
/// The IOMMU domain transitions from "quiesced" (read-only, no driver
/// association) back to "active" (read-write, associated with the new
/// driver's MPK domain).
///
/// Returns a list of DMA buffer descriptors that the previous instance
/// had mapped. Each descriptor contains:
/// - `iova`: I/O virtual address (device-visible address).
/// - `phys`: Physical address of the backing pages.
/// - `size`: Buffer size in bytes.
/// - `direction`: DMA direction (ToDevice, FromDevice, Bidirectional).
///
/// The new driver inspects these buffers and either:
/// - **Reuses** them (warm restart): re-programs the device's DMA rings
///   to point at the preserved buffers. No data is lost.
/// - **Releases** them: calls `dma_unmap()` + `free_pages()` for each
///   buffer it does not need. This is the cold-restart fallback.
///
/// If no DMA buffers were preserved (e.g., the device had no active DMA
/// at crash time), an empty array is returned.
///
/// # Precondition
///
/// Called after Step 2a (NMI ejection) of crash recovery — all consumers
/// are disconnected and no CPU is executing driver code. If called outside
/// the recovery sequence, the caller must ensure no live consumer holds
/// DMA mappings.
pub fn dma_reclaim_buffers(
    dev: DeviceHandle,
    domain: &mut IommuDomain,
) -> Result<ArrayVec<DmaBufferDescriptor, MAX_PHYS_RECLAIM>, DmaReclaimError>;

/// Maximum number of DMA buffers preserved across a crash recovery.
///
/// Runtime-discovered: `min(device.queue_depth * device.num_queues, MAX_PHYS_RECLAIM)`.
/// The per-device queue depth and queue count are read from the device's
/// capability registers at probe time (e.g., NVMe CAP.MQES, NIC descriptor
/// ring size). `MAX_PHYS_RECLAIM` is a system-wide ceiling to bound memory
/// reserved for crash recovery buffers.
///
/// Default `MAX_PHYS_RECLAIM`: 8192 (sufficient for high-end NVMe with 128
/// queues x 64 entries, or NICs with 32 queues x 256 descriptors). Tunable
/// via `/ukfs/kernel/fma/max_phys_reclaim`.
///
/// **System-wide memory footprint**: Each `DmaBufferDescriptor` is 32 bytes.
/// 8192 descriptors × 32 B = 256 KB per device. For a server with 64 PCIe
/// devices, worst-case system-wide reservation is ~16 MB. Descriptor arrays
/// are slab-allocated lazily at driver probe time (not at boot), so devices
/// that never use crash recovery do not consume descriptor memory.
pub const MAX_PHYS_RECLAIM: usize = 8192;

/// Per-device reclaim buffer count, computed at driver probe time.
/// The `ArrayVec` capacity for `dma_reclaim_buffers()` is bounded by this value.
pub fn max_reclaim_buffers(dev: &DeviceNode) -> usize {
    let queue_depth = dev.resources.queue_depth as usize;
    let num_queues = dev.resources.num_queues as usize;
    core::cmp::min(queue_depth * num_queues, MAX_PHYS_RECLAIM)
}
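To make the sizing arithmetic above concrete, a small sketch (the `reclaim_count` helper is hypothetical, mirroring `max_reclaim_buffers()` with raw parameters): a 128-queue NVMe with 64-entry queues hits the 8192 ceiling exactly, costing 256 KB of descriptors, while a 4-queue NIC with 256-descriptor rings reserves only 1024.

```rust
/// System-wide ceiling on preserved DMA buffer descriptors (from the text).
pub const MAX_PHYS_RECLAIM: usize = 8192;

/// Mirror of max_reclaim_buffers() taking raw parameters, for illustration.
pub fn reclaim_count(queue_depth: usize, num_queues: usize) -> usize {
    core::cmp::min(queue_depth * num_queues, MAX_PHYS_RECLAIM)
}

/// Per-device descriptor memory footprint in bytes
/// (32 B per DmaBufferDescriptor, per the layout above).
pub fn reclaim_footprint_bytes(queue_depth: usize, num_queues: usize) -> usize {
    reclaim_count(queue_depth, num_queues) * 32
}
```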

/// Descriptor for a DMA buffer preserved across crash recovery.
///
/// Passed between the crash recovery manager and the replacement driver
/// via `dma_reclaim_buffers()`.  Layout must be stable across subsystem
/// boundaries.
///
/// Layout (64-bit, `#[repr(C)]`):
///   iova: u64 (8 B, offset 0), phys: PhysAddr/u64 (8 B, offset 8),
///   size: u64 (8 B, offset 16), direction: DmaDirection/u8 (1 B, offset 24),
///   _pad: [u8; 7] (7 B, offset 25).  Total: 32 bytes.
#[repr(C)]
pub struct DmaBufferDescriptor {
    /// I/O virtual address (the address the device uses for DMA).
    pub iova: u64,
    /// Physical address of the backing page(s).
    pub phys: PhysAddr,
    /// Size of the buffer in bytes.
    pub size: u64,
    /// DMA transfer direction.
    pub direction: DmaDirection,
    /// Explicit padding to 8-byte struct alignment (prevents info leak).
    pub _pad: [u8; 7],
}
const_assert!(core::mem::size_of::<DmaBufferDescriptor>() == 32);

// DmaDirection: uses the canonical definition from the DMA subsystem
// ([Section 4.14](04-memory.md#dma-subsystem--dmadevice-trait-and-typed-dma-api)).
// Variants: ToDevice = 0, FromDevice = 1, Bidirectional = 2, None = 3.

Key invariant: DMA buffers are never freed during crash cleanup. Their ownership transfers from the crashed driver instance to the kernel's crash recovery manager, and then to the new driver instance via dma_reclaim_buffers(). This prevents use-after-free of buffers that hardware may still reference during the FLR completion window (up to 100 ms). The IOMMU read-only downgrade (step 2d) provides an additional safety net: even if a stale IOTLB entry allows a device DMA after the invalidation, the write will fault at the IOMMU rather than corrupting the preserved buffer contents.

Cross-references:

  • DMA subsystem (DmaDevice, CoherentDmaBuf, StreamingDmaMap): Section 4.14
  • IOMMU and IOMMUFD framework: Section 18.5
  • FLR timeout escalation sequence: Section 11.9 (this chapter)

11.9.2.4 PPC64LE EEH (Enhanced Error Handling)

PPC64LE POWER systems have a hardware-level PCI error recovery mechanism called EEH (Enhanced Error Handling) that operates below the OS crash recovery layer. EEH is unique to IBM POWER platforms and has no direct equivalent on x86, ARM, or RISC-V: the closest analogue is PCIe AER, but AER only reports errors for software to act on, whereas EEH isolates the failing endpoint in hardware.

Mechanism: When the PHB (PCI Host Bridge) detects a PCI error on a PE (Partitionable Endpoint), it "freezes" the PE:

  • MMIO reads from the frozen PE return 0xFFFFFFFF (all-ones)
  • MMIO writes to the frozen PE are silently discarded
  • DMA transactions from the frozen PE are silently discarded
  • The freeze prevents error propagation to the rest of the system

Detection: Drivers detect a frozen PE when MMIO reads return 0xFFFFFFFF. Because this is a valid value for some registers, the driver must check a second source: the eeh_dev_check_failure() function reads the PE freeze state from firmware (OPAL on PowerNV, RTAS on pseries) to confirm whether the PE is actually frozen or the register genuinely contains 0xFFFFFFFF.

/// EEH frozen-PE check. Called by driver code when an MMIO read returns all-ones.
///
/// Returns `true` if the PE is frozen (genuine hardware error), `false` if the
/// register value is legitimate. On non-PPC platforms, always returns `false`.
///
/// # Architecture
/// - PowerNV: OPAL call `opal_pci_eeh_freeze_status(phb_id, pe_number)`
/// - pseries: RTAS call `ibm,read-slot-reset-state2`
pub fn eeh_dev_check_failure(dev: DeviceHandle) -> bool;
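From the driver's side, the detection idiom looks like the following hedged sketch — the MMIO read result and the firmware freeze query are mocked here; real code goes through the OPAL/RTAS calls listed above:

```rust
/// Outcome of classifying an MMIO read on PPC64LE.
#[derive(Debug, PartialEq)]
pub enum MmioReadResult {
    /// The register genuinely contains this value.
    Value(u32),
    /// The PE is frozen; trigger crash recovery.
    PeFrozen,
}

/// Classify an MMIO read result: consult the (mocked) firmware freeze
/// query only when the read returns all-ones, since 0xFFFFFFFF is a
/// valid value for some registers.
pub fn classify_mmio_read(raw: u32, pe_frozen: impl Fn() -> bool) -> MmioReadResult {
    if raw == 0xFFFF_FFFF && pe_frozen() {
        MmioReadResult::PeFrozen
    } else {
        MmioReadResult::Value(raw)
    }
}
```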

Recovery flow (integrated with UmkaOS crash recovery):

1. Driver MMIO read returns 0xFFFFFFFF
2. eeh_dev_check_failure() → true (PE is frozen)
3. Kernel triggers the standard crash recovery sequence (steps 1-8 above)
   with the following PPC-specific modifications:

   Step 2 (ISOLATE): The PE is already hardware-frozen. The kernel's
   MPK domain revocation is redundant on PPC but is still performed for
   consistency with the cross-platform recovery protocol.

   Step 4 (DEVICE RESET): Instead of PCIe FLR, use EEH-specific reset:
   a. PE reset via OPAL: opal_pci_reset(phb_id, pe_number, OPAL_RESET_PCI_FUNDAMENTAL)
   b. Wait for PE unfreezing (firmware clears the freeze state)
   c. Re-enable MMIO and DMA for the PE

   Step 7 (RELOAD DRIVER): Standard driver reload. The PE is now unfrozen
   and the device is in a known-good state.

4. If recovery fails after 5 attempts: permanently disable the PE.
   OPAL: opal_pci_reset(phb_id, pe_number, OPAL_RESET_PHB_COMPLETE)
   for a full PHB reset as last resort.

EEH and UmkaOS Tier model interaction: EEH provides hardware-level isolation that supplements UmkaOS's software isolation tiers. For Tier 1 drivers on PPC64LE, an EEH freeze is an additional fault signal alongside MPK domain faults and watchdog timeouts. For Tier 2 drivers, EEH freeze is detected by the kernel's IOMMU fault handler (the frozen PE's discarded DMA triggers an IOMMU fault).

POWER9 EEH errata: "Intermittent failures for reset of a Virtual Function (VF) for SR-IOV adapters during EEH recovery" if a data packet is received during recovery. Fixed in firmware VH950. UmkaOS documents the minimum firmware version requirement for PPC64LE platforms.

Cross-references:

  • State preservation and warm restart: Section 11.9 (this chapter)
  • Tier 2 deferred DMA page lifetime: Section 11.9 (this chapter)

11.9.2.5 Network Buffer Handle Reclamation

When a NIC driver's isolation domain crashes:

  1. TX ring cleanup: The kernel walks the driver's TX completion ring. Each NetBufHandle still held by the driver is reclaimed:
     a. The generation counter in the NetBuf pool slot is incremented (invalidates any stale handles).
     b. DMA mappings are torn down via IOMMU unmap.
     c. Pages are returned to the NetBuf pool.

  2. RX ring cleanup: Pre-posted RX buffers are reclaimed similarly: DMA unmap + return to pool.

  3. In-flight NetBuf reclamation: NetBufs that were extracted from the KABI ring by the Tier 0 NAPI poll handler but not yet delivered to umka-net are tracked via NapiContext.in_flight_count: AtomicU32. This counter is incremented when a NetBuf is dequeued from the ring and decremented when it is handed to the protocol stack. On crash, the domain teardown handler sweeps all NapiContext instances for the crashed driver: if in_flight_count > 0, the corresponding NetBuf pool slots are reclaimed by scanning the pool's allocated bitmap for entries belonging to the crashed domain. This prevents NetBuf leaks from a crash mid-NAPI-poll.

  4. NAPI cleanup: napi_disable() followed by napi_unregister() for all NAPI instances associated with the crashed driver. GRO state in umka-net for affected queues is flushed.

  5. Timing: Handle reclamation happens in Step 3 of the crash recovery sequence (after the domain fault is handled, before driver reload). Total cleanup target: <10ms for a typical NIC with 1024 TX + 1024 RX descriptors.

  6. Double-free protection: The generation counter ensures that if the crashed driver's remnant code somehow calls netbuf_free() on a reclaimed handle, the generation mismatch causes the free to be silently ignored (stale handle).
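The generation-counter check in the double-free protection above can be sketched as follows (a hypothetical pool-slot model; the real pool stores the generation beside the allocation bitmap):

```rust
/// One NetBuf pool slot: allocation state plus a generation counter.
pub struct PoolSlot {
    pub generation: u32,
    pub allocated: bool,
}

/// Handle as held by a driver: slot index + generation at allocation time.
#[derive(Clone, Copy)]
pub struct NetBufHandle {
    pub slot: usize,
    pub generation: u32,
}

/// Free a NetBuf. A stale handle (generation mismatch after crash
/// reclamation bumped the slot's generation) is silently ignored.
/// Returns true only if the free actually took effect.
pub fn netbuf_free(pool: &mut [PoolSlot], h: NetBufHandle) -> bool {
    let s = &mut pool[h.slot];
    if s.generation != h.generation || !s.allocated {
        return false; // stale or already-freed handle: no-op
    }
    s.allocated = false;
    true
}

/// Crash-time reclamation: bump the generation to invalidate stale handles.
pub fn reclaim_slot(pool: &mut [PoolSlot], slot: usize) {
    pool[slot].generation = pool[slot].generation.wrapping_add(1);
    pool[slot].allocated = false;
}
```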

Socket-layer impact during NIC graceful evolution: Socket-layer operations (recv(), send(), accept()) are NOT disrupted by NIC driver graceful evolution (Section 13.18). The protocol stack (umka-net, Tier 1) is independent of the NIC driver — it continues running throughout the evolution window. During a crash reload, recv() callers receive -EIO (see NIC RX RING OVERFLOW bullet d in the recovery sequence above); during graceful evolution, packets are buffered and no error is returned. Established TCP connections survive: recv() waiters remain blocked on the socket's receive queue; send() callers enqueue data in the TCP send buffer. During the reload window, no new packets arrive from or depart to the NIC, so TCP retransmission timers handle data in flight (typical RTO is 200ms-1s, well above the reload window). After the new NIC driver is operational and carrier is re-established, normal TCP flow resumes — ACKs drain the retransmit queue, and pending recv() waiters are woken when retransmitted data arrives. UDP packets in flight during the reload window are lost (no retransmission) — this is equivalent to transient packet loss on the wire.

11.9.3 Reload Failure Handling

If the new driver instance fails to initialize after a crash, UmkaOS handles the failure as follows:

  1. Detection: reload failure is defined as the new driver instance crashing during initialization, OR initialization not completing within 500 ms (hard timeout).
  2. FMA event emission: a crash during Step 8 (driver init) is a new crash event. The crash_count was already incremented at Step 4a of the recovery sequence — this reload failure triggers a SECOND crash recovery pass which starts at Step 1 again. The second pass's Step 4a will increment crash_count again (correctly reflecting two crashes). If the reload failure is a timeout (not a hardware exception), emit the FMA event directly:

         fma_emit(FaultEvent::DriverCrash {
             device: dev.device_node_id(),
             driver_id: driver_instance.id(),
             tier: driver_instance.isolation_tier() as u8,
             crash_count: driver_instance.crash_count(),
         });

     FMA escalation rules apply: if crash_count >= 3, escalate to DEMOTE or QUARANTINE (Section 20.1). This ensures that a driver which repeatedly crashes during init (e.g., firmware bug triggered by probe(), incompatible hardware revision, corrupted module binary) is correctly escalated rather than retrying indefinitely.
  3. Device offline: the device is marked DeviceState::Error; no new I/O is accepted.
  4. Client notification: all processes with open file descriptors to this device receive SIGHUP; any pending I/O syscalls return EIO.
  5. Kernel continues: a Tier 1 reload failure does not panic the kernel — the device is simply unavailable. All other drivers and subsystems continue operating normally.
  6. Audit: a kernel warning is logged with the device canonical name, failure reason (crash vs timeout), and driver version.
  7. Manual recovery: an operator can trigger a fresh reload attempt via the umkafs control interface at /ukfs/kernel/drivers/<name>/reload after investigating the cause; the failure counter (Section 11.6) may also trigger automatic demotion to Tier 2 on repeated failures.
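The detection-and-escalation decision in steps 1–2 can be sketched as a small decision function. This is an illustrative model, not the kernel's actual API: `ReloadFailure`, `ReloadAction`, and `decide_reload_action` are hypothetical names.

```rust
/// Sketch of the reload-failure decision (Section 11.9.3, steps 1-2).
/// All names here are illustrative, not the real kernel interface.

#[derive(Debug, PartialEq)]
enum ReloadFailure {
    /// New driver instance crashed during init (hardware exception).
    CrashDuringInit,
    /// Init did not complete within the 500 ms hard timeout.
    InitTimeout,
}

#[derive(Debug, PartialEq)]
enum ReloadAction {
    /// A crash during init is a new crash event: run the full recovery
    /// sequence again (its Step 4a increments crash_count).
    SecondRecoveryPass,
    /// A timeout emits the FMA DriverCrash event directly.
    EmitFmaEvent,
    /// crash_count >= 3: FMA escalates to DEMOTE or QUARANTINE.
    Escalate,
}

fn decide_reload_action(failure: ReloadFailure, crash_count: u32) -> ReloadAction {
    // FMA escalation rules take precedence once the threshold is reached.
    if crash_count >= 3 {
        return ReloadAction::Escalate;
    }
    match failure {
        ReloadFailure::CrashDuringInit => ReloadAction::SecondRecoveryPass,
        ReloadFailure::InitTimeout => ReloadAction::EmitFmaEvent,
    }
}

fn main() {
    assert_eq!(decide_reload_action(ReloadFailure::CrashDuringInit, 1),
               ReloadAction::SecondRecoveryPass);
    assert_eq!(decide_reload_action(ReloadFailure::InitTimeout, 2),
               ReloadAction::EmitFmaEvent);
    assert_eq!(decide_reload_action(ReloadFailure::InitTimeout, 3),
               ReloadAction::Escalate);
}
```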

Recovery timing breakdown — The ~50ms figure applies to the soft-reset path where the driver performs a vendor-specific device reset (register write + status poll) without a full PCIe Function Level Reset. Many devices (Intel NICs, AHCI controllers) support fast software reset in 1-10ms. The full PCIe FLR path takes longer: the PCIe spec requires the function to complete FLR within 100ms (the device must not be accessed until FLR completes; software polls the device's configuration space to detect completion). With driver reload overhead, the FLR path totals ~150ms. UmkaOS prefers the soft-reset path when the driver crash was a software bug (the device hardware is fine); FLR is used when the device itself appears hung (no response to MMIO reads, completion timeout). In either case, the recovery is 100-1000x faster than a full Linux reboot (30-60s).
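The soft-reset-vs-FLR choice described above reduces to a predicate over device responsiveness. A minimal sketch, with hypothetical names (`ResetPath`, `choose_reset_path`):

```rust
/// Illustrative model of the reset-path choice: soft reset when the
/// crash was a software bug and the device still responds; FLR when
/// the device itself appears hung. Names are hypothetical.

#[derive(Debug, PartialEq)]
enum ResetPath {
    /// Vendor-specific register reset, ~50 ms total recovery.
    SoftReset,
    /// PCIe Function Level Reset, ~150 ms total recovery.
    Flr,
}

/// The device is considered hung if MMIO reads get no response
/// (all-ones) or a completion timeout was observed.
fn choose_reset_path(mmio_read_ok: bool, completion_timeout: bool) -> ResetPath {
    if !mmio_read_ok || completion_timeout {
        ResetPath::Flr
    } else {
        ResetPath::SoftReset
    }
}

fn main() {
    assert_eq!(choose_reset_path(true, false), ResetPath::SoftReset);
    assert_eq!(choose_reset_path(false, false), ResetPath::Flr);
    assert_eq!(choose_reset_path(true, true), ResetPath::Flr);
}
```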

11.9.4 FLR Timeout Recovery

The PCIe Base Specification requires that a function complete FLR within 100 ms. UmkaOS enforces this deadline and defines an escalating recovery sequence for the case where FLR does not complete in time.

FLR with timeout enforcement:

/// Poll interval while waiting for FLR completion.
const FLR_POLL_INTERVAL_US: u64 = 1_000; // 1 ms
/// Maximum wait for FLR per PCIe Base Spec (Section 7.6.2).
const FLR_TIMEOUT_MS: u64 = 100;

/// Initiate FLR on a PCIe function and poll for completion.
/// Returns Ok(()) when the function's config space is accessible again.
/// Returns Err(PcieError::FlrTimeout) if the deadline elapses.
fn pcie_flr_with_timeout(dev: &mut PcieDevice) -> Result<(), PcieError> {
    // Initiate FLR: set bit 15 (Initiate Function Level Reset) of the
    // Device Control register. Read-modify-write so the other control
    // bits (MPS, MRRS, error-reporting enables) are preserved.
    // Cap offset is discovered via the PCIe Capability structure pointer.
    let devctl_offset = dev.pcie_cap_offset + PCI_EXP_DEVCTL;
    let devctl = dev.config_read_u16(devctl_offset);
    dev.config_write_u16(devctl_offset, devctl | PCI_EXP_DEVCTL_BCR_FLR);

    let deadline_ns = monotonic_ns() + FLR_TIMEOUT_MS * 1_000_000;
    loop {
        delay_us(FLR_POLL_INTERVAL_US);
        // FLR completion is indicated by config space returning valid data.
        // A device undergoing FLR returns 0xFFFF for any config read.
        if dev.config_read_u16(PCI_VENDOR_ID) != 0xFFFF {
            return Ok(());
        }
        if monotonic_ns() >= deadline_ns {
            break;
        }
    }
    Err(PcieError::FlrTimeout)
}

Escalation sequence on PcieError::FlrTimeout:

When FLR does not complete within 100 ms, UmkaOS escalates through the following steps in order, stopping at the first step that succeeds:

  1. IOMMU quarantine (immediate, before attempting any escalation): the device's IOMMU domain is placed in fault mode — all further DMA from the device is blocked by the IOMMU. This prevents the hung device from corrupting memory during the escalation sequence, regardless of how long escalation takes.

  2. Secondary bus reset: if the device is behind a PCIe bridge (not directly attached to the root complex), assert the bridge's secondary bus reset bit (PCI_BRIDGE_CTL_BUS_RESET, bit 6 of the Bridge Control register at config offset 0x3E). Hold for 1 ms, then deassert and wait up to 100 ms for the device's Vendor ID to become valid. A secondary bus reset resets all functions on the secondary bus, so sibling functions receive DeviceEvent::SiblingReset.

  3. Hot-plug slot power cycle: if the slot exposes Hot-Plug capability and the HPC_POWER_CTRL bit is set in the Slot Capabilities register, toggle slot power off and on. Wait up to 1 s for the slot's Presence Detect State to return to present and the device's config space to become accessible.

  4. Bus master disable (last resort): if both SBR and hot-plug fail (or are unavailable), disable bus mastering via the PCIe Command register: clear the BME bit (PCI_COMMAND_MASTER, bit 2). This is a host-side register write that the device cannot block — it prevents the PCIe root complex from forwarding the device's DMA requests. Combined with IOMMU quarantine (step 1), this provides defense-in-depth: even if the device somehow issues DMA, the IOMMU blocks it. The device is marked DeviceState::FaultedUnrecoverable and requires physical intervention (power cycle, slot reseat).

  5. Permanent fault: if no escalation step recovers the device:
     a. Transition the device to DeviceState::FaultedUnrecoverable.
     b. Remove the device from the active device registry (it is retained as a tombstone entry for diagnostic purposes, accessible via umkafs).
     c. Invoke the Tier 1 driver's teardown path (unload the driver, release its memory domain and capabilities) as if a crash occurred, but without attempting reload.
     d. Log: pcie: FLR timeout on [bus:dev.fn] (vid={vid} did={did}), secondary bus reset {"succeeded"|"failed"}, slot power cycle {"succeeded"|"failed"|"unavailable"}, device faulted permanently.
     e. The FMA subsystem (Section 20.1) receives a FaultEvent::PcieFlrTimeout event carrying the BDF, the vendor/device ID, and the escalation result. FMA may trigger a predictive replacement recommendation.
     f. User notification: after the fault is recorded, send a uevent to userspace (ACTION=change, SUBSYSTEM=pci, PCIE_EVENT=FLR_TIMEOUT, PCI_SLOT_NAME=<bdf>). Device manager daemons (udev, systemd-udevd) can trigger operator alerts or automated replacement workflows.

Invariants:

  • IOMMU quarantine (step 1) is unconditional and runs before any escalation attempt. The device must not be able to DMA during escalation.
  • Steps 2 and 3 each have their own 100 ms and 1 s timeouts respectively. Total worst-case escalation time before permanent fault: ~1.2 s.
  • No driver code runs after FlrTimeout is returned. The escalation sequence is entirely in the kernel's PCIe subsystem (Tier 0), not in the Tier 1 driver.
  • If a secondary bus reset is performed, the sibling functions' drivers are notified via DeviceEvent::SiblingReset before the reset is asserted, giving them 5 ms to quiesce outstanding I/O.
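The "stop at the first step that succeeds" control flow can be modeled as a pure function with step outcomes injected. This is an illustrative sketch only; the real sequence lives in the Tier 0 PCIe subsystem, and every name below is hypothetical.

```rust
/// Sketch of the FLR-timeout escalation order: IOMMU quarantine first
/// (unconditional), then secondary bus reset, slot power cycle, and
/// bus-master disable as the last resort. Names are illustrative.

#[derive(Debug, PartialEq)]
enum Outcome {
    RecoveredBySbr,
    RecoveredByPowerCycle,
    FaultedUnrecoverable,
}

struct EscalationResult {
    dma_quarantined: bool,
    outcome: Outcome,
}

fn escalate(behind_bridge: bool, sbr_recovers: bool,
            hotplug_power_ctrl: bool, power_cycle_recovers: bool) -> EscalationResult {
    // Step 1: IOMMU quarantine is unconditional and happens first.
    let dma_quarantined = true;
    // Step 2: secondary bus reset, only if the device is behind a bridge.
    if behind_bridge && sbr_recovers {
        return EscalationResult { dma_quarantined, outcome: Outcome::RecoveredBySbr };
    }
    // Step 3: hot-plug slot power cycle, only if the slot has power control.
    if hotplug_power_ctrl && power_cycle_recovers {
        return EscalationResult { dma_quarantined, outcome: Outcome::RecoveredByPowerCycle };
    }
    // Step 4: clear BME (host-side write the device cannot block);
    // the device is permanently faulted.
    EscalationResult { dma_quarantined, outcome: Outcome::FaultedUnrecoverable }
}

fn main() {
    let r = escalate(true, true, false, false);
    assert!(r.dma_quarantined);
    assert_eq!(r.outcome, Outcome::RecoveredBySbr);
    // Root-complex-attached device with no hot-plug power control:
    // no recovery step applies, so it faults permanently.
    let r2 = escalate(false, true, false, false);
    assert_eq!(r2.outcome, Outcome::FaultedUnrecoverable);
}
```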

11.9.5 Crash State Buffer Wire Format

When a Tier 1 driver panics, a pre-allocated crash state buffer is filled before the driver's isolation domain is destroyed. This buffer is stored in umka-core memory and remains accessible after teardown. It is used for post-mortem diagnostics, FMA fault reporting, and optionally for warm-restart state recovery.

/// Wire format of the crash state buffer saved when a Tier 1 driver panics.
/// Saved to a pre-allocated crash buffer in umka-core memory so it remains
/// accessible after the driver's memory domain is destroyed.
///
/// Total size: 512 bytes. Aligned to 64 bytes (cache-line boundary).
///
/// **Endianness**: All integer fields use **native endianness** — this buffer
/// is node-local (stored in umka-core memory, read by the same CPU that wrote
/// it). It is never transmitted cross-node. If FMA forwarding sends crash
/// diagnostics to a remote monitoring node, the FMA transport serializes
/// the fields into a wire-format message with explicit `Le32`/`Le64` types
/// ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery--fma-event-wire-format)).
/// `CrashReason` (`#[repr(u32)]`) is stored as a native-endian u32; the
/// `magic` field serves as a byte-order marker for crash dump readers on
/// heterogeneous-endian clusters: `0x554D4B4352415348` in native order.
#[repr(C, align(64))]
pub struct DriverCrashState {
    /// Magic number for validation: 0x554D4B4352415348 ("UMKCRASH" in ASCII).
    pub magic: u64,                    // offset  0, size 8
    /// TSC value at the time of crash (monotonic, CPU-local).
    pub crash_tsc: u64,                // offset  8, size 8
    /// Program counter (instruction pointer) at crash.
    pub crash_pc: u64,                 // offset 16, size 8
    /// Stack pointer at crash.
    pub crash_sp: u64,                 // offset 24, size 8
    /// Frame pointer at crash (for stack unwinding).
    pub crash_fp: u64,                 // offset 32, size 8
    /// Driver ID (same as in the driver registry).
    pub driver_id: u32,                // offset 40, size 4
    /// Crash reason code.
    pub crash_reason: CrashReason,     // offset 44, size 4 (repr(u32))
    /// Ring buffer head index at crash time (u64 to match KABI ring buffer indices).
    pub ring_head: u64,                // offset 48, size 8
    /// Ring buffer tail index at crash time (u64 to match KABI ring buffer indices).
    pub ring_tail: u64,                // offset 56, size 8
    /// Format version. Current: 2 (v1 had u32 ring indices).
    pub version: u16,                  // offset 64, size 2
    _pad0: [u8; 6],                    // offset 66, size 6
    /// First 248 bytes of the request being processed when the crash occurred
    /// (zero-padded if the request is shorter or unavailable).
    pub partial_request: [u8; 248],    // offset 72, size 248
    /// Crash backtrace: first 192 bytes (24 frames at 8 bytes each on 64-bit).
    /// Symbolicated if DWARF debug info is available at the time of crash;
    /// raw 8-byte addresses otherwise. 24 frames captures most driver call
    /// stacks including the KABI dispatch path (~4-6 frames) + driver-internal
    /// frames (~10-18 frames).
    pub backtrace: [u8; 192],          // offset 320, size 192
    // Field layout (no implicit repr(C) padding — all fields naturally aligned):
    // magic(8) + crash_tsc(8) + crash_pc(8) + crash_sp(8) + crash_fp(8)
    // + driver_id(4) + crash_reason(4) + ring_head(8) + ring_tail(8)
    // + version(2) + _pad0(6) + partial_request(248) + backtrace(192)
    // = 512 bytes total.
}
/// Crash buffer is pre-allocated to exactly size_of::<DriverCrashState>().
/// If a future change increases this past 512, the fault handler writes OOB.
const_assert!(size_of::<DriverCrashState>() == 512);

/// Crash reason codes stored in DriverCrashState.
#[repr(u32)]
pub enum CrashReason {
    /// Driver code invoked panic!() or hit an assertion failure.
    Panic         = 0,
    /// Page fault (null dereference, stack overflow, bad pointer).
    PageFault     = 1,
    /// Invalid opcode (#UD fault — executed an undefined instruction).
    InvalidOpcode = 2,
    /// Divide-by-zero (#DE fault).
    DivByZero     = 3,
    /// Capability access violation (attempted to cross an isolation boundary
    /// without a valid capability token).
    CapViolation  = 4,
    /// Watchdog timer expired (driver did not make forward progress).
    Timeout       = 5,
    /// Stack overflow detected (guard page fault at the bottom of the driver stack).
    StackOverflow = 6,
}

The crash buffer is pre-allocated per driver at load time (no allocation during the crash path). The domain fault handler fills it with whatever register state is available at fault entry, then proceeds with the normal recovery sequence.
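A post-mortem reader consuming this buffer would validate the magic (which doubles as a byte-order marker) and the format version before trusting any other field. The following sketch parses the offsets from the layout comment above; the reader function itself is hypothetical.

```rust
use std::convert::TryInto;

/// Sketch of a DriverCrashState header validator. Offsets follow the
/// layout documented above (magic at 0, driver_id at 40, crash_reason
/// at 44, version at 64); the function name is illustrative.

const CRASH_MAGIC: u64 = 0x554D_4B43_5241_5348; // "UMKCRASH" in ASCII

fn validate_crash_buffer(buf: &[u8; 512]) -> Result<(u32, u32, u16), &'static str> {
    let magic = u64::from_ne_bytes(buf[0..8].try_into().unwrap());
    if magic != CRASH_MAGIC {
        // A byte-swapped magic means the dump was written on a CPU of
        // the opposite endianness; anything else is corruption.
        return Err("bad magic (wrong endianness or corrupt buffer)");
    }
    let driver_id = u32::from_ne_bytes(buf[40..44].try_into().unwrap());
    let crash_reason = u32::from_ne_bytes(buf[44..48].try_into().unwrap());
    let version = u16::from_ne_bytes(buf[64..66].try_into().unwrap());
    if version != 2 {
        return Err("unsupported format version");
    }
    Ok((driver_id, crash_reason, version))
}

fn main() {
    let mut buf = [0u8; 512];
    buf[0..8].copy_from_slice(&CRASH_MAGIC.to_ne_bytes());
    buf[40..44].copy_from_slice(&7u32.to_ne_bytes());
    buf[44..48].copy_from_slice(&1u32.to_ne_bytes()); // CrashReason::PageFault
    buf[64..66].copy_from_slice(&2u16.to_ne_bytes());
    assert_eq!(validate_crash_buffer(&buf), Ok((7, 1, 2)));
}
```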

11.9.6 UmkaOS Tier 2 Recovery Sequence

Tier 2 (user-space process) driver recovery is even simpler:

1. Driver process crashes (SIGSEGV, SIGABRT, etc.)
2. UmkaOS Core's driver supervisor detects process exit
2a. EMIT FMA EVENT
    // Emit FMA health event for observability parity with Tier 1
    fma_emit(FaultEvent::DriverCrash {
        device:      dev.device_node_id(),
        driver_id:   driver_instance.id(),
        tier:        2,
        crash_count: driver_instance.crash_count(),
    });
3. REVOKE DEVICE ACCESS
   - Mark the device as "in recovery" in the device registry, preventing
     any new MMIO mappings or device access grants for this device.
   - Revoke the driver's IOMMU entries (tear down the device's IOMMU
     domain mappings). Any in-flight DMA that completes after this point
     hits an IOMMU fault and is dropped.
   - If the dying process's teardown has not yet completed MMIO unmapping
     (page table entry removal + TLB shootdown), force-invalidate the
     relevant page table entries. In practice, the process is already
     exiting at step 2, so MMIO unmapping is a cleanup operation — the
     device registry marking and IOMMU revocation are what actually
     prevent further device access.
4. RECOVER RING BUFFER IN-FLIGHT SLOTS
   - Same as Tier 1 step 3: publish poison markers for any MPSC slots
     claimed but unpublished by the dead driver, and set ring `state =
     Disconnected`. This unblocks live producers spinning on Phase 2.
5. Pending I/O completed with -EIO
6. Supervisor restarts driver process
7. New process re-initializes device, resumes service

TOTAL RECOVERY TIME: ~10ms

Why Tier 2 is faster than Tier 1 -- Counter-intuitively, the "weaker" isolation tier recovers faster. The reason is that Tier 2 recovery skips the most expensive step in the Tier 1 sequence: no device FLR in the normal case. Tier 2 drivers have direct MMIO access to their device's BAR regions (for performance), but MMIO revocation (step 3 above) cuts off device access immediately. The IOMMU prevents any DMA initiated through those MMIO registers from reaching non-driver memory, so there is no DMA safety hazard even if the device has in-flight operations.

IOTLB coherence and DMA page lifetime -- A lightweight IOMMU invalidation (not a full drain fence) suffices at step 3 because Tier 2 recovery defers freeing the crashed driver's DMA pages rather than draining all in-flight DMA. After IOMMU entry revocation, stale IOTLB entries may still allow in-flight DMA to complete to the old physical addresses. If those pages were freed immediately, this would be a use-after-free via hardware. Instead, the old DMA pages remain allocated (owned by the kernel, not the dead process) until the replacement driver instance calls init() and either reuses them (warm restart via the state buffer) or explicitly releases them back to the allocator. By the time pages are actually freed, the IOTLB has long since been flushed — either by the invalidation at step 3, by natural IOTLB eviction, or by the new driver's own IOMMU setup. This makes the IOTLB coherence window moot without requiring a synchronous drain fence.

DMA deferred-free lifetime bound -- The deferred-free strategy described above has a resource exhaustion risk: if the replacement driver never loads (or loads but never calls init()), the old DMA pages remain allocated indefinitely. An attacker could repeatedly crash Tier 2 drivers to exhaust DMA-capable memory (typically ZONE_DMA / ZONE_DMA32 on x86-64, or CMA regions on ARM). To bound this exposure, every deferred DMA page set carries a reclaim deadline:

  1. When a Tier 2 driver crashes and its DMA pages are moved to deferred-free status, each page set is tagged with deferred_deadline = now + 30_seconds.
  2. A kernel background task (dma_reclaim_worker, period = 10 seconds) scans all deferred-free DMA page sets. Any page set whose deadline has passed is reclaimed immediately — the "wait for replacement driver" check is bypassed. The reclaim frees the physical pages back to the allocator and logs a warning identifying the driver and number of pages force-reclaimed.
  3. Rationale: 30 seconds is ample time for the driver supervisor to restart the replacement process and for the new driver to call init() and either reuse or release the preserved pages. If no replacement has loaded after 30 seconds, the driver is presumed permanently crashed (or its supervisor has given up), and the pages are safe to reclaim. By the 30-second mark, any stale IOTLB entries have long since been flushed (IOTLB eviction typically occurs within microseconds to milliseconds), so reclaiming the pages carries no DMA safety hazard.
/// Maximum DMA pages preserved across a Tier 2 driver crash.
/// 512 pages × 4 KB = 2 MB maximum preserved DMA state per driver.
/// Drivers requiring more than 2 MB of preserved DMA state should use
/// persistent memory (DAX) or external state servers.
pub const MAX_DEFERRED_DMA_PAGES: usize = 512;

/// DMA pages held in deferred-free state after a Tier 2 driver crash.
///
/// These pages are preserved so the replacement driver can reuse them
/// (warm restart via the state buffer). If no replacement loads before
/// `deadline`, the `dma_reclaim_worker` force-reclaims them.
pub struct DeferredDmaPages {
    /// Physical pages held for replacement driver use after crash recovery.
    /// Fixed-size array: crash handlers MUST NOT allocate from heap.
    /// Pre-allocated at driver initialization time.
    pub pages:       ArrayVec<PhysPage, MAX_DEFERRED_DMA_PAGES>,
    // Actual count is pages.len() — no separate counter needed.
    // ArrayVec tracks its own length.
    /// Deadline after which pages are reclaimed regardless of driver state.
    pub deadline:    Instant,
    /// Which driver's state these pages belong to (for logging on forced reclaim).
    pub driver_name: DriverName,
}

The DriverRegistry maintains a counter for observability:

/// Number of times the 30-second deadline triggered forced DMA page
/// reclaim. Exposed via umkafs at `/Devices/<device>/dma_forced_reclaims`.
/// A sustained non-zero rate indicates drivers that crash without timely
/// replacement — investigate the driver supervisor and restart policy.
pub dma_forced_reclaims: AtomicU64,

If the device appears hung after the Tier 2 crash (the replacement driver's init() detects an unresponsive device), the registry escalates to FLR, but this fallback is rare. Tier 2 recovery is typically "revoke mappings, restart the process, reconnect to the ring buffer" -- a ~10ms operation dominated by process creation and driver init().
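The reclaim worker's deadline scan can be sketched with a simulated millisecond clock in place of kernel time. `DeferredSet` and `reclaim_expired` are illustrative names for this sketch.

```rust
/// Sketch of the dma_reclaim_worker scan: any deferred page set whose
/// 30-second deadline has passed is force-reclaimed, bypassing the
/// "wait for replacement driver" check. Names are illustrative.

struct DeferredSet {
    driver: &'static str,
    pages: usize,       // page count, reported in the forced-reclaim log
    deadline_ms: u64,   // deferred_deadline = crash time + 30_000 ms
}

/// Removes expired sets from the deferred list and returns the names
/// of the drivers whose pages were force-reclaimed (for the warning log).
fn reclaim_expired(sets: &mut Vec<DeferredSet>, now_ms: u64) -> Vec<&'static str> {
    let mut reclaimed = Vec::new();
    sets.retain(|s| {
        if now_ms >= s.deadline_ms {
            // In the kernel this frees the physical pages back to the
            // allocator and bumps dma_forced_reclaims.
            let _ = s.pages;
            reclaimed.push(s.driver);
            false
        } else {
            true
        }
    });
    reclaimed
}

fn main() {
    let mut sets = vec![
        DeferredSet { driver: "nvme0", pages: 128, deadline_ms: 30_000 },
        DeferredSet { driver: "eth0",  pages: 64,  deadline_ms: 45_000 },
    ];
    // Worker wakes at t = 31 s: nvme0's 30 s deadline has passed.
    assert_eq!(reclaim_expired(&mut sets, 31_000), vec!["nvme0"]);
    assert_eq!(sets.len(), 1);
}
```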

11.9.7 State Preservation and Checkpointing

Driver recovery (Section 11.9 steps 1–6) restarts a new driver instance, but without state preservation the new instance starts cold — losing in-flight I/O, device configuration, and connection state. UmkaOS uses a Theseus-inspired state spill design to enable warm restarts.

State buffer — Each Tier 1 driver has an associated kernel-managed "state buffer" that resides outside the driver's isolation domain. The buffer is allocated by umka-core and mapped read-write into the driver's address space. On crash, the isolation domain is destroyed but the state buffer survives (it belongs to umka-core).

Driver Isolation Domain (destroyed on crash)    umka-core (survives)
┌─────────────────────────┐                ┌──────────────────────┐
│  Driver code + heap     │  checkpoint →  │  State Buffer        │
│  Internal caches        │  ──────────→   │  ┌────────────────┐  │
│  (NOT preserved)        │                │  │ Version: 3     │  │
│                         │                │  │ DevCmdQueue[]   │  │
│                         │                │  │ RingBufPos      │  │
│                         │                │  │ ConnState[]     │  │
│                         │                │  │ HMAC Tag        │  │
└─────────────────────────┘                │  └────────────────┘  │
                                           └──────────────────────┘

State buffer format:

  • Driver-defined structure: the driver author decides what to checkpoint.
  • Versioned via KABI version field: the state buffer header includes a format version number so a newer driver binary can detect and handle (or reject) state from an older version.
  • HMAC-SHA256 integrity tag: computed by umka-core using a per-driver key and verified before handing the buffer to the new driver instance. Corrupt or tampered buffers are discarded.
  • Key lifecycle: the HMAC key is generated by umka-core on the first load of a driver for a given DeviceHandle. The key is stored in the DeviceNode (Section 11.7 Device Registry) and persists across driver crash/reload cycles; it is only discarded when the DeviceHandle is removed from the registry (device unplugged or explicitly deregistered). On reload, umka-core verifies the existing state buffer using the persisted key, then continues using the same key for the new driver instance. The driver writes state data, but only umka-core can produce valid integrity tags, preventing a buggy driver from poisoning the state buffer with corrupted data.

Note: Tier 1 drivers run in Ring 0, so a deliberately compromised driver (with arbitrary code execution) could read the HMAC key from umka-core memory by bypassing MPK via WRPKRU (Section 11.2, WRPKRU threat model). This is within the documented Tier 1 threat model — MPK provides crash containment, not exploitation prevention. The HMAC protects state buffer integrity against bugs (the common case), not against active exploitation (which requires Tier 2 for defense).

Checkpoint frequency:

  • Configurable per-driver. Default: checkpoint after every I/O batch completion, or every 1ms, whichever comes first.
  • A checkpoint is a memcpy from driver-local structures to the inactive state buffer slot (~1–4 KB typical) plus an atomic doorbell write. At 1ms intervals, the overhead is negligible.

Torn checkpoint protection (double buffering):

The driver cannot compute the HMAC (only umka-core can), so a driver crash mid-write would leave a torn (partially written) state buffer. To prevent this, the state buffer uses a double-buffering protocol:

  • The state buffer contains two slots (A and B). At any time, one slot is active (the last successfully checkpointed state) and the other is inactive (the write target for the next checkpoint).
  • The driver writes its checkpoint data to the inactive slot. When the write is complete, the driver signals umka-core by writing a completion flag to a shared doorbell — a single atomic write visible to umka-core.
  • Umka-core, on observing the doorbell (polled during periodic work or on driver crash), computes HMAC-SHA256 over the completed slot and atomically swaps the active slot pointer.
  • On crash recovery, umka-core verifies the active slot's HMAC. If valid, that state is used for the new driver instance. If invalid (corruption or incomplete swap), umka-core falls back to the previous active slot, which still holds the last known-good checkpoint.
  • The double-buffer swap is an atomic pointer update. There is no race with driver writes because the driver only ever writes to the inactive slot.
  • After ringing the doorbell, the inactive slot is considered "pending" -- the driver must not begin a new checkpoint until umka-core completes the swap and clears the doorbell flag. If the next 1 ms checkpoint interval arrives while a swap is still pending, the driver skips that checkpoint cycle. In practice, umka-core processes the doorbell within a few microseconds (HMAC-SHA256 on 4 KB takes ~2–5 µs with hardware SHA acceleration, ~15–30 µs without — see HMAC-SHA256 performance note below), so skipped checkpoints are rare.
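The protocol above can be illustrated with a single-threaded simulation in which plain fields stand in for the atomic doorbell and slot-pointer operations, and the HMAC step is reduced to a comment. Struct and method names are illustrative.

```rust
/// Single-threaded sketch of the A/B double-buffer checkpoint protocol:
/// the driver writes only the inactive slot and skips checkpoint cycles
/// while a swap is pending; umka-core swaps the active slot pointer.

struct StateBuffer {
    slots: [Vec<u8>; 2],
    active: usize,  // index of the last verified checkpoint
    doorbell: bool, // set by the driver, cleared by umka-core
}

impl StateBuffer {
    /// Driver side: returns false if a swap is pending (cycle skipped).
    fn driver_checkpoint(&mut self, data: &[u8]) -> bool {
        if self.doorbell {
            return false; // previous checkpoint not yet swapped: skip
        }
        let inactive = 1 - self.active;
        self.slots[inactive] = data.to_vec();
        self.doorbell = true; // a single atomic write in the real design
        true
    }

    /// umka-core side: verify the completed slot and swap the pointer.
    fn core_process_doorbell(&mut self) {
        if self.doorbell {
            // Real design: compute HMAC-SHA256 over the completed slot here.
            self.active = 1 - self.active;
            self.doorbell = false;
        }
    }

    fn active_state(&self) -> &[u8] {
        &self.slots[self.active]
    }
}

fn main() {
    let mut sb = StateBuffer { slots: [vec![], vec![]], active: 0, doorbell: false };
    assert!(sb.driver_checkpoint(b"ckpt1"));
    assert!(!sb.driver_checkpoint(b"ckpt2")); // swap pending: skipped
    sb.core_process_doorbell();
    assert_eq!(sb.active_state(), b"ckpt1");
    assert!(sb.driver_checkpoint(b"ckpt2")); // doorbell cleared: accepted
}
```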

TOCTOU mitigation (verify-then-use atomicity):

The state buffer is mapped read-write into the driver's address space, which creates a potential Time-Of-Check-Time-Of-Use (TOCTOU) vulnerability: a compromised driver could modify the active slot after umka-core verifies the HMAC but before the new driver instance reads it. UmkaOS prevents this attack through the following mechanisms:

  1. Slot revocation on crash: When a driver crashes, umka-core immediately revokes the crashed driver's write access to both state buffer slots by unmapping the entire state buffer from the old isolation domain. This is step 2 of the recovery sequence (Section 11.9) — it happens before HMAC verification (step 4). After revocation, the crashed driver's code cannot execute and its page tables are destroyed, so there is no entity that can modify the buffer between verification and use.

  2. Copy-on-verify to kernel-private storage: After HMAC verification succeeds, umka-core copies the verified slot contents to a kernel-private buffer (not mapped into any driver's address space). The new driver instance receives a read-only snapshot of this copy, not a pointer to the original state buffer. This ensures that even if an attacker could somehow gain write access to the original buffer (which they cannot, per point 1), the verified data cannot be altered.

  3. New driver isolation: The new driver instance is created with a fresh isolation domain. The state buffer is not mapped into this new domain until after the new driver calls init() and signals that it has finished consuming the checkpoint data. During initialization, the driver reads from the kernel-private copy (provided via a read-only mapping or explicit copy to the driver's local heap). Only after init() returns successfully does umka-core map the state buffer (both slots) read-write into the new driver's address space for future checkpoints.

  4. Atomicity guarantee: The sequence — unmap from old domain, verify HMAC, copy to kernel-private storage, create new domain — is performed with preemption disabled on the recovery CPU. There is no window during which any user-space code (driver or otherwise) can execute while holding write access to the verified buffer.

This design ensures that HMAC verification and data consumption are effectively atomic: once verified, the data cannot be modified by any entity before the new driver reads it. The cost is one additional memcpy (~4 KB) per recovery, which is negligible compared to the overall recovery latency (~50-150 ms).

HMAC-SHA256 performance:

HMAC-SHA256 for a 4 KB message:

  • With hardware SHA acceleration (SHA-NI on x86-64 Skylake+, SHA1/SHA256 extensions on AArch64/ARMv7, Zknh on RISC-V): ~2.1 cycles/byte → ~8,600 cycles → ~2–5 µs at 3 GHz.
  • Without hardware acceleration (software implementation, SSSE3, or generic): ~13 cycles/byte → ~53,000 cycles → ~15–30 µs at 3 GHz.

UmkaOS selects the optimal implementation at boot via algorithm priority: hardware-SHA > SSSE3 > generic. The crypto_shash_alloc() API transparently selects the fastest available implementation for the running CPU.

HMAC-SHA256 computation is performed by umka-core asynchronously — not on the driver's hot path. The driver's checkpoint cost is limited to the memcpy plus an atomic doorbell write.
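As a sanity check on the figures above, the pure hashing cost (ignoring per-call overhead) can be recomputed for a 4 KB checkpoint at 3 GHz:

```rust
/// Back-of-envelope check of the quoted HMAC-SHA256 cost figures.
/// Pure cycles-per-byte rates only; real calls add fixed overhead,
/// which is why the quoted bands are wider than these point values.

fn main() {
    let bytes = 4096.0_f64;
    let ghz = 3.0_f64;
    // Hardware SHA path: ~2.1 cycles/byte.
    let hw_cycles = bytes * 2.1;            // ≈ 8,600 cycles
    let hw_us = hw_cycles / (ghz * 1000.0); // cycles / (cycles per µs)
    // Software path: ~13 cycles/byte.
    let sw_cycles = bytes * 13.0;           // ≈ 53,000 cycles
    let sw_us = sw_cycles / (ghz * 1000.0);
    assert!((hw_cycles - 8601.6).abs() < 1.0);
    assert!(hw_us > 2.0 && hw_us < 5.0);   // within the quoted 2–5 µs band
    assert!(sw_us > 15.0 && sw_us < 30.0); // within the quoted 15–30 µs band
}
```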

What is preserved vs. rebuilt:

| Preserved (in state buffer) | NOT preserved (rebuilt from scratch) |
|---|---|
| Device command queue positions | Driver-internal caches |
| Hardware register snapshots | Deferred work queues |
| In-flight I/O descriptors | Timers and timeout state |
| Ring buffer head/tail pointers | Debug/logging state |
| Connection/session state | Statistics counters (reset to zero) |
| Device configuration (MTU, features, etc.) | |

NVMe example:

  • Checkpointed: submission queue tail doorbell position, completion queue head position, in-flight command IDs with their scatter-gather lists, namespace configuration.
  • On reload: the new driver reads the state buffer, re-maps device BARs, verifies queue state against hardware registers, and resumes submission. In-flight commands that were submitted but not completed are re-issued.

NIC example:

  • Checkpointed: active flow table entries, RSS (Receive Side Scaling) indirection table and hash key, interrupt coalescing settings, VLAN filter table, MAC address list.
  • On reload: the new driver re-programs the NIC with the checkpointed configuration. Active TCP connections see a brief pause (~50-150ms) but do not reset — the connection state lives in umka-net (Tier 1), not in the NIC driver.

Fallback:

  • If HMAC verification of the state buffer fails, or the version is incompatible, the new driver instance performs a cold restart (current behavior: full device reset, all in-flight I/O returned as -EIO).
  • Cold restart is always safe — state preservation is an optimization, not a requirement.
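The warm-vs-cold decision can be sketched with a trivial XOR checksum standing in for the real HMAC-SHA256 tag. All names here are illustrative, and the checksum is only a placeholder for the verification step.

```rust
/// Sketch of the verify-then-restore fallback: a bad tag or an
/// incompatible version forces a cold restart, which is always safe.
/// The XOR checksum is a stand-in for HMAC-SHA256.

const SUPPORTED_VERSION: u16 = 2;

fn checksum(data: &[u8]) -> u8 {
    data.iter().fold(0, |acc, b| acc ^ b)
}

enum Restart {
    Warm(Vec<u8>), // verified checkpoint handed to the new instance
    Cold,          // full device reset; in-flight I/O returned as -EIO
}

fn plan_restart(state: &[u8], tag: u8, version: u16) -> Restart {
    if version != SUPPORTED_VERSION || checksum(state) != tag {
        return Restart::Cold; // cold restart is always safe
    }
    Restart::Warm(state.to_vec())
}

fn main() {
    let state = b"queue-positions";
    let tag = checksum(state);
    assert!(matches!(plan_restart(state, tag, 2), Restart::Warm(_)));
    assert!(matches!(plan_restart(state, tag ^ 1, 2), Restart::Cold)); // bad tag
    assert!(matches!(plan_restart(state, tag, 1), Restart::Cold));     // old version
}
```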

11.9.8 Crash Dump Infrastructure

When umka-core itself faults (not a driver — the core kernel), the system needs to capture diagnostic state for post-mortem analysis. Unlike driver crashes (which are recoverable), a core panic is fatal.

Reserved memory region:

  • At boot, UmkaOS reserves a contiguous physical memory region for crash dumps, configured via boot parameter: umka.crashkernel=256M (similar to Linux crashkernel=).
  • This region is excluded from the normal physical memory allocator — it survives a warm reboot if the firmware doesn't clear RAM.

Panic sequence:

1. Core panic triggered (null deref, assertion failure, double fault, etc.)
2. Disable interrupts on all CPUs (IPI NMI broadcast)
3. Panic handler (Tier 0 code, always resident, minimal dependencies):
   a. Save register state for the faulting CPU:
      - x86-64: GPRs, CR3, IDTR, RSP, RFLAGS, RIP, segment selectors
      - AArch64: GPRs (x0-x30), SP_EL1, ELR_EL1, SPSR_EL1, ESR_EL1, FAR_EL1
      - ARMv7: GPRs (r0-r15), CPSR, DFAR, DFSR, IFAR, IFSR
      - RISC-V: GPRs (x0-x31), sepc, scause, stval, sstatus, satp
   b. Walk the stack, generate backtrace (using .eh_frame / DWARF unwind info)
   c. Snapshot key data structures:
      - Active process list + their states
      - Capability table summary
      - Driver registry state
      - IRQ routing table
      - Recent ring buffer entries (last 64KB of klog)
   d. Write all of the above into the reserved crash region as an ELF core dump
4. Flush panic message to serial console (polled serial output, no interrupts)
5. If a pre-registered NVMe region exists (configured at boot):
   a. Use the NVMe driver's Tier 0 "panic write" path (polled mode, no interrupts)
   b. Write the crash dump from reserved memory to the NVMe region
6. Halt or reboot (configurable: `umka.panic=halt|reboot`, default: halt)

Crash stub:

  • The panic handler is Tier 0 code: statically linked, no dynamic dispatch, no allocation, no locks (or only try-lock with immediate fallback). It must work even if the heap, scheduler, or interrupt subsystem is corrupted.
  • Serial output always works (Tier 0 serial driver, polled mode).
  • NVMe panic write uses polled I/O (no interrupts, no completion queues) — a simplified write path that can function with a partially-corrupted kernel.

Next boot recovery:

1. Bootloader loads UmkaOS kernel
2. Early init checks the reserved crash region for a valid dump header
3. If found:
   a. Copy dump to a temporary in-memory buffer
   b. After filesystem mount, write to /var/crash/umka-dump-<timestamp>.elf
   c. Log "Previous crash dump saved to /var/crash/umka-dump-<timestamp>.elf"
   d. Clear the reserved crash region
4. The dump can be analyzed with standard tools:
   - `crash` utility (same as Linux kdump analysis)
   - GDB with the UmkaOS kernel debug symbols
   - `umka-crashdump` tool (UmkaOS-specific, extracts structured summaries)

Dump format:

  • ELF core dump format, compatible with the crash utility and GDB.
  • Contains: register state, memory regions (kernel text, data, stack pages for active threads, page tables), and a note section with UmkaOS-specific metadata (kernel version, boot parameters, uptime, driver state).

No kexec on day one:

  • Linux uses kexec to boot a second "crash kernel" that writes the dump. This is reliable but complex.
  • UmkaOS uses a simpler "in-place dump" to reserved memory: the panic handler writes directly to the reserved region without booting a second kernel.
  • kexec-based crash dump is a future enhancement for systems where the in-place approach is insufficient (e.g., very large memory dumps requiring a full kernel to compress and transmit).

11.9.9 Recovery Comparison

| Scenario | Linux | UmkaOS |
|---|---|---|
| NVMe driver null deref | Kernel panic, full reboot | Reload driver, ~50-150ms (design target) |
| NIC driver infinite loop | System freeze | Watchdog kill, reload, ~50-150ms (design target) |
| USB driver buffer overflow | Kernel panic | Restart process, ~10ms |
| FS driver corruption | Kernel panic + fsck | Reload driver, fsck on mount |
| Audio driver crash | Kernel panic | Restart process, ~10ms |

Architecture caveat: On RISC-V, s390x, and LoongArch64, Tier 1 drivers are promoted to Tier 0 — a promoted driver crash causes kernel panic (same as Linux). The recovery times above apply to architectures with Tier 1 hardware isolation (x86-64, AArch64, ARMv7, PPC). Tier 2 drivers on all architectures (including RISC-V, s390x, LoongArch64) retain full crash recovery via Ring 3 + IOMMU process isolation.

11.9.10 Crash History and Auto-Demotion

The kernel tracks per-driver crash statistics within a sliding window (default: 60 seconds). The following table is the canonical crash escalation policy — the FMA rules below implement exactly these thresholds:

| Crashes in Window | Action |
|---|---|
| 1-2 | Restart driver at same isolation tier |
| 3-4 in 60 s | Demote to next lower tier (Tier 1 -> Tier 2); log warning, notify admin |
| 5+ in 300 s | Quarantine driver (permanently disabled); manual re-enable via sysfs; log critical alert |

A Tier 1 driver that crashes 3 times is demoted to Tier 2 (full process isolation), accepting the performance penalty for increased safety. An administrator can override this policy.
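The escalation policy above reduces to a small decision function over the two sliding-window counts. A sketch with hypothetical names (`EscalationAction`, `escalation_action`); the real kernel additionally tracks per-crash timestamps to maintain the windows:

```rust
/// Crash-window escalation decision, mirroring the canonical table
/// (hypothetical names, illustrative only).
#[derive(Debug, PartialEq)]
pub enum EscalationAction {
    /// 1-2 crashes: restart at the same isolation tier.
    Restart,
    /// 3-4 crashes in 60 s: demote to the next lower tier.
    Demote,
    /// 5+ crashes in 300 s: quarantine until manual re-enable.
    Quarantine,
}

/// `crashes_60s` counts crashes in the trailing 60-second window,
/// `crashes_300s` in the trailing 300-second window.
pub fn escalation_action(crashes_60s: u32, crashes_300s: u32) -> EscalationAction {
    if crashes_300s >= 5 {
        EscalationAction::Quarantine
    } else if crashes_60s >= 3 {
        EscalationAction::Demote
    } else {
        EscalationAction::Restart
    }
}
```

The quarantine check runs first so that a fifth crash escalates even if fewer than three of those crashes fall in the most recent 60-second window.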

11.9.10.1 FMA-Driven Crash Escalation

The crash history and auto-demotion logic above is the crash recovery subsystem's local policy. The FMA subsystem (Section 20.1) provides a global, rule-based escalation layer on top of it. The FaultEvent::DriverCrash event emitted in step 4a (EMIT FMA EVENT) of the recovery sequence feeds into the FMA diagnosis engine, which can apply cross-device correlation rules and persistent fault tracking that the local crash counter cannot.

Default FMA rules for DriverCrash events:

RULE driver_crash_reload:
  WHEN EVENT(DriverCrash) AND crash_count <= 2
  THEN RELOAD(SELF), ALERT(WARNING, "Driver crashed, reloading")
  PRIORITY 20
  COOLDOWN 1000 ms

RULE driver_crash_demote:
  // COUNT(DriverCrash) within a sliding window,
  // matching the canonical table's "3-4 in 60s" time-windowed semantics.
  WHEN EVENT(DriverCrash) AND COUNT(DriverCrash) >= 3 WITHIN 60000 AND COUNT(DriverCrash) < 5 WITHIN 60000
  THEN RELOAD(SELF), ALERT(CRITICAL, "Driver repeatedly crashing, demoting tier")
  PRIORITY 15
  COOLDOWN 5000 ms

RULE driver_crash_quarantine:
  // 300-second window per the canonical escalation table (5+ in 300s).
  WHEN EVENT(DriverCrash) AND COUNT(DriverCrash) >= 5 WITHIN 300000
  THEN ISOLATE(SELF), FMA_TICKET(CRITICAL, "Driver quarantined after 5+ crashes")
  PRIORITY 10
  COOLDOWN 60000 ms

FMA RELOAD action implementation:

The RELOAD(dev) FMA action invokes the crash recovery subsystem's driver reload mechanism. This creates a bidirectional link: crash recovery emits DriverCrash events to FMA, and FMA's RELOAD action calls back into crash recovery to perform the reload.

/// FMA response executor implementation for the RELOAD action.
/// Called by the FMA kworker thread when a diagnosis rule fires RELOAD(dev).
///
/// This function bridges FMA → crash recovery: it invokes the same driver
/// reload path used by the crash recovery sequence (steps 6-8), but
/// initiated by FMA rule evaluation rather than by a fault handler.
pub fn fma_action_reload(dev: &DeviceNode) -> Result<(), FmaActionError> {
    // 1. Check that the device is in a reloadable state.
    //    Devices in Quarantined or PermanentFailure state cannot be reloaded
    //    by FMA — they require manual administrator intervention.
    let state = dev.driver_state();
    if matches!(state, DriverState::Quarantined | DriverState::PermanentFailure) {
        return Err(FmaActionError::DeviceNotReloadable);
    }

    // 2. Determine the target tier. If the FMA rule that fired is
    //    driver_crash_demote, the crash recovery subsystem's auto-demotion
    //    logic has already selected the next lower tier based on crash_count.
    //    FMA does not override the tier decision — it only triggers the reload.
    let target_tier = dev.crash_recovery_target_tier();

    // 3. Invoke the crash recovery reload sequence (steps 6-8):
    //    a. Unload the current driver instance (free memory, release caps).
    //    b. Load a fresh driver binary at the target tier.
    //    c. Perform bilateral vtable exchange.
    //    d. Call driver probe() for device re-initialization.
    //    e. Re-register interrupt handlers.
    //    f. Broadcast DriverRecoveryEvent::Reloaded to subscribers.
    crash_recovery::reload_driver(dev, target_tier)?;

    // 4. Emit a recovery completion event to FMA (closes the fault case).
    // FMA_DRIVER_RELOAD_COMPLETE = 0x1003 (HealthEventClass::Generic),
    // emitted at step 9 (RESUME) after successful driver reload.
    fma_emit(FaultEvent::Generic {
        device_id: dev.device_node_id(),
        event_code: FMA_DRIVER_RELOAD_COMPLETE, // 0x1003
        payload: {
            let mut p = [0u8; 16];
            p[0] = target_tier as u8;
            p
        },
    });

    Ok(())
}

Sequencing guarantee: The FMA diagnosis engine processes DriverCrash events in a dedicated kworker thread. The RELOAD action runs in that thread's context (process context, may sleep). The crash recovery sequence steps 6-8 are serialized by the per-device recovery_mutex — if a second crash occurs while a reload is in progress, the second DriverCrash event is queued in the FMA ring and processed after the first reload completes. The crash_count field in the second event reflects the updated count, so the correct escalation rule fires.

Cross-device correlation: FMA rules can correlate DriverCrash events across multiple devices. For example, if three different drivers on the same PCIe root complex crash within 10 seconds, an FMA rule can detect this pattern and escalate to a PCIe link-level investigation (AER check, link retrain) rather than individually reloading each driver. This cross-device awareness is something the local per-driver crash counter cannot provide.
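Cross-device correlation of this kind amounts to grouping recent DriverCrash events by root complex and counting distinct crashed drivers. A minimal sketch under assumed event shapes (the tuple layout, threshold, and function name are illustrative, not the FMA engine's actual representation):

```rust
use std::collections::{HashMap, HashSet};

/// Return the root complexes on which >= 3 distinct drivers crashed
/// within the trailing window. Events are (timestamp_ms, root_complex,
/// driver_id) tuples; this layout is an illustrative assumption.
pub fn correlated_root_complexes(
    events: &[(u64, u32, u32)],
    now_ms: u64,
    window_ms: u64,
) -> Vec<u32> {
    let mut per_rc: HashMap<u32, HashSet<u32>> = HashMap::new();
    for &(t, rc, drv) in events {
        // Keep only events inside the sliding window.
        if now_ms.saturating_sub(t) <= window_ms {
            per_rc.entry(rc).or_default().insert(drv);
        }
    }
    let mut hot: Vec<u32> = per_rc
        .into_iter()
        .filter(|(_, drivers)| drivers.len() >= 3)
        .map(|(rc, _)| rc)
        .collect();
    hot.sort_unstable();
    hot
}
```

A rule matching on this condition would escalate to a PCIe link-level investigation (AER check, link retrain) instead of reloading each driver individually.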

11.9.10.2 Post-Reload Subsystem Notification

After a Tier 1 or Tier 2 driver successfully reloads (step 8 of the recovery sequence), the kernel broadcasts a recovery event through the device registry event bus. Subsystems that depend on the recovered device subscribe to these events and perform device-specific re-initialization without kernel-wide coordination.

/// Recovery event broadcast after successful driver reload.
/// Delivered via the device registry event bus to all registered subscribers.
pub enum DriverRecoveryEvent {
    /// Driver has reloaded and is ready to accept I/O.
    Reloaded {
        /// Canonical device identifier from the device registry.
        device_id: DeviceNodeId,
        /// New device handle for I/O submission. The old handle is invalid
        /// after crash — any subsystem caching the old handle must replace it.
        new_handle: DeviceHandle,
        /// Tier the driver loaded at. May differ from the previous tier if
        /// the crash counter triggered demotion (e.g., Tier 1 → Tier 2).
        tier: IsolationTier,
    },
    /// Driver reload failed after max retries. Device is permanently offline
    /// until manual intervention (see Reload Failure Handling above).
    PermanentFailure {
        device_id: DeviceNodeId,
    },
}

Subsystems register for recovery events at init time via a per-device subscription. The event bus is a simple callback list protected by a spinlock — recovery events are infrequent (crash recovery is an exceptional path), so the callback dispatch overhead is negligible.

/// Register a callback for driver recovery events on a specific device.
/// The callback is invoked synchronously on the recovery CPU after the
/// new driver instance has completed initialization. Callbacks must not
/// block — they should enqueue deferred work if complex re-initialization
/// is needed.
///
/// Returns a handle that must be held for the lifetime of the subscription.
/// Dropping the handle unregisters the callback.
pub fn device_recovery_subscribe(
    device_id: DeviceNodeId,
    callback: fn(DriverRecoveryEvent),
) -> SubscriptionHandle
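
The callback-list bus behind this API can be sketched as follows, using a `std::sync::Mutex` in place of the kernel spinlock and plain `fn` pointers; all names in this sketch are illustrative:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type DeviceNodeId = u64;

/// Simplified recovery event (the full DriverRecoveryEvent enum
/// appears above).
#[derive(Clone, Copy, Debug)]
pub enum Event {
    Reloaded,
    PermanentFailure,
}

/// Locked callback-list event bus. A sketch: std Mutex stands in for
/// the kernel spinlock, and unsubscription (handle drop) is omitted.
pub struct RecoveryEventBus {
    subs: Mutex<HashMap<DeviceNodeId, Vec<fn(Event)>>>,
}

impl RecoveryEventBus {
    pub fn new() -> Self {
        RecoveryEventBus { subs: Mutex::new(HashMap::new()) }
    }

    /// Register a callback for one device's recovery events.
    pub fn subscribe(&self, dev: DeviceNodeId, cb: fn(Event)) {
        self.subs.lock().unwrap().entry(dev).or_default().push(cb);
    }

    /// Synchronously invoke every subscriber for `dev`.
    /// Returns how many callbacks were dispatched.
    pub fn broadcast(&self, dev: DeviceNodeId, ev: Event) -> usize {
        let subs = self.subs.lock().unwrap();
        let cbs = subs.get(&dev).map(|v| v.as_slice()).unwrap_or(&[]);
        for cb in cbs {
            cb(ev);
        }
        cbs.len()
    }
}
```

Because recovery is an exceptional path, dispatching under the lock is acceptable here; callbacks must stay non-blocking and defer any heavy re-initialization, as the API contract above requires.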

Subsystem responses to DriverRecoveryEvent::Reloaded:

Block I/O (NVMe, VirtIO-blk, AHCI):

  1. Replace the cached DeviceHandle in the superblock's block device reference with new_handle.
  2. Process in-flight bios based on their BioFlags::PERSISTENT flag (Section 15.2):
     - PERSISTENT bios (journal commits, superblock writes): re-queue from the per-device bio retry list (allocated in umka-core memory, outside the driver's isolation domain). Before re-submitting PERSISTENT bios, the recovery path must re-map their DMA addresses through the new driver's IOMMU domain (the old IOVAs reference the crashed driver's now-invalid IOMMU mappings). For each PERSISTENT bio: (1) unmap old IOVAs from the crashed domain (if the domain is still accessible; otherwise mark them as leaked for IOMMU garbage collection); (2) map physical pages through new_domain.map_pages() to obtain fresh IOVAs; (3) update the bio's DMA address fields with the new IOVAs; (4) clear BIO_ERROR flags before re-submission.
     - Non-PERSISTENT bios (data reads/writes): drain with bio.status = -EIO. Fire completion callbacks so waiters unblock. Applications retry via standard I/O error handling.
  3. Resume I/O submission to the new driver instance. The block layer's I/O scheduler (mq-deadline or BFQ) re-dispatches queued requests in priority order — no special re-ordering is needed.
  4. If the new driver loaded at a different tier (e.g., demoted from Tier 1 to Tier 2), update the block device's I/O path to use the appropriate IPC mechanism (shared-memory ring for Tier 1, message-passing for Tier 2).
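The PERSISTENT-bio re-mapping step can be sketched with stand-in types. Everything here is illustrative (`Bio`, `IommuDomain`, and `remap_persistent_bio` are hypothetical names, and the stub domain simply hands out sequential IOVAs):

```rust
#[derive(Debug, PartialEq)]
pub enum BioStatus {
    Ok,
    Error,
}

/// Stand-in bio: pinned physical pages plus the IOVAs that are only
/// valid within one IOMMU domain.
pub struct Bio {
    pub persistent: bool,
    pub phys_pages: Vec<u64>, // pinned physical page frame numbers
    pub dma_addrs: Vec<u64>,  // IOVAs from the (now-crashed) domain
    pub status: BioStatus,
}

/// Illustrative IOMMU domain that maps pages to fresh sequential IOVAs.
pub struct IommuDomain {
    next_iova: u64,
}

impl IommuDomain {
    pub fn new(base: u64) -> Self {
        IommuDomain { next_iova: base }
    }

    /// Stand-in for new_domain.map_pages(): one fresh IOVA per page.
    pub fn map_pages(&mut self, pages: &[u64]) -> Vec<u64> {
        pages
            .iter()
            .map(|_| {
                let iova = self.next_iova;
                self.next_iova += 4096;
                iova
            })
            .collect()
    }
}

/// Re-map a PERSISTENT bio through the new driver's domain and clear
/// its error flag before re-submission. (The real path also unmaps the
/// old IOVAs from the crashed domain when it is still accessible.)
pub fn remap_persistent_bio(bio: &mut Bio, new_domain: &mut IommuDomain) {
    assert!(bio.persistent, "only PERSISTENT bios are re-queued");
    bio.dma_addrs = new_domain.map_pages(&bio.phys_pages);
    bio.status = BioStatus::Ok; // clear BIO_ERROR before re-submission
}
```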

Networking (NIC drivers):

  1. ifindex preservation: The NetDevice.ifindex is preserved across driver reload. The DeviceNode in the device registry retains its identity — the new driver instance inherits the same ifindex as the old one. This ensures that userspace tools (ip link show), socket bindings (SO_BINDTODEVICE), routing table entries, and netlink monitors continue to reference the correct interface without re-resolution. The ifindex is owned by the DeviceNode (allocated in umka-core memory), not by the driver. On reload, the new driver receives the existing DeviceNode reference with the original ifindex.
  2. Re-register NAPI instance via NicDriverVTable::napi_register(). The new driver allocates fresh RX/TX descriptor rings; the old rings were cleaned up during crash recovery step 3 (Network Buffer Handle Reclamation above).
  3. Restore NIC configuration from the preserved state buffer: MAC address, MTU, RSS indirection table, hardware offload settings (checksum, TSO, GRO), VLAN filter table, and multicast filter list.
  4. Re-apply packet filters: eBPF programs attached to the device are re-loaded from the kernel's BPF program registry (BPF programs are stored in umka-core memory, not in the driver's domain). Netfilter rules are stateless with respect to the NIC driver and require no re-application.
  5. Re-apply qdisc configuration: Traffic control (qdisc) configuration is persisted in the DeviceNode, not in the driver's isolation domain. The qdisc tree (root qdisc, child qdiscs, filter chains, class hierarchies) and all associated shaping parameters (rate limits, burst sizes, priorities) are stored in umka-core memory and survive driver crashes. After the new driver instance is initialized, the networking stack re-applies the saved qdisc configuration by calling qdisc_attach() for each qdisc in the tree, starting from the root. This restores the exact traffic control policy that was active before the crash — no tc commands need to be re-issued by userspace. eBPF programs attached to qdisc filters (cls_bpf) are similarly re-attached from the kernel's BPF program registry.
  6. NetDevice state transition: DOWN → carrier detection → UP. The link state machine waits for physical link detection (PHY polling or link-change interrupt from the new driver instance) before transitioning to UP.
  7. netif_carrier_on(): Once the new driver confirms physical link is up (via PHY polling result or link-change interrupt), the kernel calls netif_carrier_on() to signal to the network stack that the link layer is operational. This is the final step before normal packet flow resumes:
     - TX queues are unblocked (netif_wake_queue() is called internally).
     - Pending TX packets queued during the reload window are transmitted.
     - Upper-layer protocols (bonding, bridge, VLAN) are notified via NETDEV_CHANGE event and update their port states accordingly.
     Without this explicit netif_carrier_on() call, the network stack would not resume packet transmission even though the driver is functional — the carrier state is the authoritative signal for TX queue enablement.
  8. Gratuitous ARP/NDP: once the link is UP, the networking stack sends a gratuitous ARP (IPv4) or unsolicited Neighbor Advertisement (IPv6) to announce the MAC address to the local network segment. This ensures switches and peers update their forwarding tables, preventing temporary black-holing of traffic destined for this host.
  9. TCP connections: in-flight packets are handled by TCP retransmission. The brief link-down period (~50-150 ms) is well within TCP's retransmission timeout (default RTO: 200 ms minimum). No TCP connections are dropped unless the recovery exceeds the application's keepalive timeout.

VFS (filesystem drivers — ext4, XFS, Btrfs on block devices):

  1. Re-open the device handle for each superblock that was using the recovered device. Replace sb.s_bdev with the new DeviceHandle.
  2. Replay the dirty inode writeback queue: all inodes with I_DIRTY flags that were pending writeback at crash time are re-submitted to the writeback work queue. The inode dirty list is maintained by umka-core (outside the driver's domain) and survives the crash.
  3. If the crash occurred during a journaled transaction (ext4 JBD2, XFS log): the filesystem's journal replay runs on the new driver instance. This is the same journal replay that would run after a power failure — the filesystem treats the driver crash as equivalent to a sudden device disappearance and re-appearance.
  4. Re-validate open file descriptors: file descriptors referencing inodes on the recovered device remain valid (the inode and dentry caches survive in umka-core). However, if the device changed identity (different serial number or capacity — indicating the wrong device was hot-plugged), all open fds are marked with EIO and the superblock is forced read-only.

Subsystem responses to DriverRecoveryEvent::PermanentFailure:

All subsystems treat permanent failure as a device removal event:

- Block I/O: fail all pending bios with EIO, mark the block device offline.
- Networking: transition NetDevice to DOWN, drop carrier. TCP connections time out normally.
- VFS: force the superblock to read-only, return EIO for new I/O operations. Open file descriptors remain valid for metadata reads (stat, readdir) but all data I/O returns EIO.

11.9.11 Compound Recovery Scenarios

The recovery protocol above describes individual steps in isolation. Real workloads combine multiple subsystems under concurrent load. This section traces three representative compound scenarios end-to-end, documenting observable application behavior, bounded latency analysis, and cascade containment conditions.

11.9.11.1 Scenario 1: ext4 Crash Under Database Workload

Initial state: PostgreSQL with 500 active connections, ~120 MB dirty page cache, ext4 journaled filesystem on NVMe via block layer. WAL (write-ahead log) fsync rate: ~2000 fsyncs/sec. 64 in-flight bios in the block layer submission ring. ext4 JBD2 journal occupancy: ~40% (3 open transactions with ~800 dirty metadata buffers).

Trigger: ext4 filesystem driver faults (e.g., corrupted extent tree triggers out-of-bounds access within its Tier 1 domain).

Recovery protocol step-by-step:

  1. DETECT (t=0 ms): Hardware exception (MPK/POE violation or GPF) in the ext4 driver's isolation domain. The fault handler identifies the faulting domain and reads effective_tier() == Tier::One.

  2. ISOLATE (t=0.1 ms): ext4's domain permissions are revoked. The ext4 driver can no longer execute. Interrupt lines for the ext4 driver's KABI calls are masked. The block layer and NVMe driver remain fully operational in their own domains (on x86-64; on AArch64 POE, see Per-Architecture Crash Isolation Comparison below).

  3. DRAIN RING (t=0.5 ms): The 64 in-flight bios are drained. PERSISTENT bios (journal commits) are preserved on the per-device bio retry list in umka-core memory. Non-PERSISTENT bios (data writes from PostgreSQL) are completed with -EIO. io_uring CQEs (Section 19.3) for affected submissions are posted with -EIO status.

  4. FMA EVENT (t=0.6 ms): FaultEvent::DriverCrash emitted to the FMA ring.

  5. DEVICE RESET (t=0.6 ms): Skipped -- the NVMe device itself is healthy. Only the ext4 driver faulted; the block layer continues operating. No FLR is needed.

  6. RELEASE KABI LOCKS (t=0.7 ms): Any Core locks held by ext4 KABI calls are force-released in LIFO order (typically 0-1 locks: the superblock lock or an inode lock).

  7. UNLOAD (t=1 ms): ext4 driver memory freed, capabilities released.

  8. RELOAD (t=80-120 ms): Fresh ext4 driver loaded. Bilateral vtable exchange. ext4 init() discovers the existing superblock (preserved in umka-core memory) and triggers journal replay. JBD2 replays uncommitted transactions from the journal (same as power-failure recovery). The 3 open transactions are either committed (if the journal commit bio was PERSISTENT and succeeds on replay) or rolled back (if the transaction was not yet committed to the journal).

  9. RESUME (t=120-150 ms): ext4 re-opens the superblock, processes the dirty inode writeback queue, and resumes accepting VFS operations.

Observable PostgreSQL behavior:

| Time Window | PostgreSQL Observes | Application Impact |
|---|---|---|
| t=0 to t=0.5 ms | Nothing (in-kernel, sub-millisecond) | None |
| t=0.5 to t=120 ms | write()/fsync() calls return -EIO | PostgreSQL's WAL writer receives EIO on fsync |
| t=120 ms | fsync() succeeds again | WAL writer retries and succeeds |
| t=120 to t=500 ms | Slight latency spike (journal replay) | P99 latency spike of ~50-100 ms for write transactions |

PostgreSQL surfaces fsync() EIO to the committing backend. Note that since the post-fsyncgate releases (PostgreSQL 12, with backports to earlier branches), PostgreSQL by default treats a failed fsync as a PANIC condition (data_sync_retry=off) rather than retrying; a sustained EIO window can therefore trigger a PostgreSQL crash-recovery restart, while with data_sync_retry=on the WAL writer retries and the ~120 ms window of EIO responses does NOT cause connection drops. Client transactions that were mid-commit during the crash window receive a transaction abort (PostgreSQL's normal behavior when fsync fails) and retry at the application level.

Bounded latency analysis: Total recovery time is bounded by T_unload + T_journal_replay + T_reload. Journal replay time is bounded by journal size (default 128 MB, sequential read at NVMe speeds: ~1-5 ms). Maximum: 200 ms (including FLR path for worst-case controller reset). The block layer remains operational throughout, so NVMe latency for non-ext4 I/O (e.g., other filesystems on the same device) is unaffected.

Cascade containment: The cascade is bounded because: (a) the block layer and NVMe driver are in separate domains (x86-64) or share a domain but are structurally independent (AArch64 POE), (b) dirty page cache is in umka-core memory and survives the crash, and (c) journal replay is idempotent. The only unbounded scenario is if the ext4 driver crashes repeatedly (>3 times in 60 seconds), triggering FMA escalation to permanent failure and forced read-only.

11.9.11.2 Scenario 2: NIC Driver Crash Under High-Connection Web Server

Initial state: nginx with 10,000 active HTTP/2 connections (mix of keep-alive and active transfers). 25 Gbps NIC (e.g., Mellanox ConnectX-6). RX ring: 4096 descriptors. TX ring: 2048 descriptors. ~200 packets/ms inbound, ~150 packets/ms outbound. TCP connection table: 10,000 entries with active timers. 3 RSS queues active.

Trigger: NIC driver faults (e.g., corrupted TX descriptor causes hardware exception within the NIC driver's Tier 1 domain).

Recovery protocol step-by-step:

  1. DETECT (t=0 ms): Hardware exception in the NIC driver's domain.

  2. ISOLATE (t=0.1 ms): NIC driver domain revoked. Interrupt lines masked. The crash recovery manager calls napi_schedule() to begin draining hardware RX descriptors into the pre-allocated RecoveryRxRing (capacity: max(256, 2 * 4096) = 8192 descriptors).

  3. DRAIN RING (t=0.5 ms): TX ring drained. In-flight TX packets that were DMA-mapped are reclaimed (network buffer handles returned to the NetBuf pool in umka-core memory). RX ring: napi_recovery_drain() copies pending RX descriptors to the temporary ring.

  4. FMA EVENT (t=0.6 ms): FaultEvent::DriverCrash emitted.

  5. DEVICE RESET (t=1-100 ms): FLR issued to the NIC. ConnectX-6 FLR typically completes in ~10-50 ms. During this window, napi_recovery_drain() continues processing hardware RX ring entries that were DMA'd before the reset took effect.

  6. RELEASE KABI LOCKS (t=0.7 ms, concurrent with FLR): KABI lock cleanup.

  7. UNLOAD (t=50-100 ms, after FLR): NIC driver memory freed. MMIO regions unmapped.

  8. RELOAD (t=100-150 ms): Fresh NIC driver loaded. New RX/TX descriptor rings allocated. Configuration restored from preserved state: MAC address, MTU (9000 for jumbo frames), RSS indirection table (3 queues), hardware offloads (TSO, GRO, CSUM), VLAN filters, multicast list. eBPF XDP programs re-attached from the BPF program registry. Qdisc tree re-applied from DeviceNode state.

  9. RESUME (t=150-180 ms): netif_carrier_on() called after PHY link detection (~10-30 ms for copper, near-instant for fiber/DAC). Gratuitous ARP/NDP sent. TX queues unblocked. napi_recovery_drain() flushes buffered packets to the new driver.

Observable nginx behavior:

| Time Window | nginx Observes | Client Impact |
|---|---|---|
| t=0 to t=0.5 ms | Nothing | None |
| t=0.5 ms | Sockets with pending recv() return -EIO (POLLERR) | Active HTTP/2 streams see read error |
| t=0.5 to t=180 ms | send()/recv() return -EAGAIN or -EIO | HTTP/2 streams stall; nginx marks connections as temporarily errored |
| t=180 ms onward | TCP retransmits recover in-flight data | Connections resume within 1 RTT |

TCP connection survival: The ~150-180 ms outage is within the Linux-default minimum RTO of 200 ms (RFC 6298 itself specifies a 1-second lower bound, but mainstream stacks use 200 ms). TCP retransmission timers for the 10,000 connections fire at staggered intervals (not simultaneously), spreading retransmit load over ~200-400 ms after recovery. Connection survival rate: >99.9% for connections with a keepalive timeout >1 second (nginx default: 75 seconds). Only connections whose application-level timeout is shorter than the recovery window (~180 ms) may be dropped by the application.

Bounded latency analysis: Total recovery bounded by T_flr + T_reload + T_carrier. FLR: max 100 ms (PCIe spec). Reload: ~50-80 ms. Carrier detection: ~10-30 ms. Maximum: 210 ms. The RecoveryRxRing capacity of 8192 descriptors at 200 packets/ms provides ~40 ms of buffering; at sustained 200 packets/ms for 180 ms, ~36,000 packets arrive, of which ~8,192 are buffered and ~27,800 are tail-dropped with rx_recovery_drops incremented. TCP retransmission recovers dropped packets.
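The buffering arithmetic above follows from a constant-rate model. A sketch (the function name is illustrative; real traffic is bursty, so these figures are bounds, not predictions):

```rust
/// Back-of-envelope RecoveryRxRing model: at a constant arrival rate,
/// how many packets are buffered vs. tail-dropped during an outage.
/// Returns (buffered, dropped).
pub fn recovery_rx_drops(
    rate_pkts_per_ms: u64,
    outage_ms: u64,
    ring_capacity: u64,
) -> (u64, u64) {
    let arrived = rate_pkts_per_ms * outage_ms;
    let buffered = arrived.min(ring_capacity);
    (buffered, arrived - buffered)
}
```

For the scenario above (200 packets/ms, 180 ms outage, 8,192-descriptor ring), this reproduces the ~27,800 tail-dropped packets that TCP retransmission subsequently recovers.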

Cascade containment: The cascade is bounded because: (a) TCP state (connection table, sequence numbers, window sizes) is maintained by the TCP stack in umka-core memory, (b) socket buffers survive in umka-core, (c) the NIC driver's domain is isolated from the networking stack's control plane, and (d) TCP retransmission is self-healing. The unbounded scenario is sustained NIC hardware failure (FLR does not restore function), which escalates to PermanentFailure → ifdown → TCP connections RST after keepalive timeout.

11.9.11.3 Scenario 3: NVMe Driver Crash During fsync-Heavy Write Workload

Initial state: Database workload (mixed PostgreSQL + application logging) performing ~5,000 fsyncs/sec across 4 NVMe namespaces. NVMe submission queue depth: 1024 per namespace (4,096 total in-flight commands). NVMe completion queue: 256 entries per namespace. Write bandwidth: ~3 GB/s sustained. Dirty page cache: ~500 MB. Block layer bio retry list: ~200 PERSISTENT bios (journal commits from ext4 and XFS on different namespaces).

Trigger: NVMe driver faults (e.g., completion queue processing bug causes out-of-bounds read within the NVMe driver's Tier 1 domain).

Recovery protocol step-by-step:

  1. DETECT (t=0 ms): Hardware exception in the NVMe driver's domain.

  2. ISOLATE (t=0.1 ms): NVMe driver domain revoked. All 4 NVMe submission queues are frozen (no new commands can be submitted by the block layer).

  3. DRAIN RING (t=0.5-2 ms): 4,096 in-flight NVMe commands are categorized:
     - PERSISTENT bios (~200): preserved on the bio retry list for re-submission after reload. DMA mappings are invalidated (old IOVAs from the crashed domain). Physical page references are retained (pages are pinned in umka-core memory).
     - Non-PERSISTENT bios (~3,896): completed with -EIO. Completion callbacks fire, unblocking filesystem writeback workers and userspace fsync() callers.

  4. FMA EVENT (t=2.1 ms): FaultEvent::DriverCrash emitted.

  5. DEVICE RESET (t=2-102 ms): NVMe controller reset: set CC.EN=0, wait for CSTS.RDY=0 (NVMe spec: up to 10 seconds, typical: 5-50 ms), then set CC.EN=1, wait for CSTS.RDY=1. This resets all submission and completion queues. In-flight NVMe commands that completed on the device side but whose completions were not yet processed by the (crashed) driver are lost — the block layer treats them as failed and either retries (PERSISTENT) or returns -EIO (non-PERSISTENT).

  6. RELEASE KABI LOCKS (t=2.2 ms): KABI lock cleanup (typically 0 locks for the NVMe driver, which uses lock-free submission rings).

  7. UNLOAD (t=102 ms): NVMe driver memory freed. Admin queue and I/O queue doorbell MMIO regions unmapped.

  8. RELOAD (t=150-350 ms): Fresh NVMe driver loaded. Admin queue re-created. I/O queue pairs re-created for all 4 namespaces. PERSISTENT bios are re-mapped through the new IOMMU domain (new_domain.map_pages()) and re-submitted. NVMe controller state restored: namespace identification, queue parameters, interrupt coalescing settings.

  9. RESUME (t=350-500 ms): All 4 namespaces operational. Block layer I/O scheduler re-dispatches queued requests. Filesystems resume writeback.

Observable application behavior:

| Time Window | Application Observes | Data Integrity Impact |
|---|---|---|
| t=0 to t=2 ms | Nothing | None |
| t=2 to t=350 ms | fsync() returns -EIO; write() may block (page cache full) | Uncommitted data in page cache is safe (umka-core memory) |
| t=350 ms | fsync() succeeds again | PERSISTENT bios (journal commits) are retried and succeed |
| t=350 to t=1000 ms | Elevated fsync latency (~2-5 ms vs normal ~0.1 ms) | Journal replay completes; filesystem consistent |

Filesystem data integrity: The journal (ext4 JBD2, XFS log) protects metadata consistency. Data pages that were dirty but not yet written are retained in the page cache (umka-core memory) and re-submitted after reload. The only data loss scenario is for O_DIRECT writes that were in non-PERSISTENT bios and had already been DMA'd to the device but whose completions were lost — these writes may or may not have reached persistent media. Applications using O_DIRECT + fsync() will retry the fsync and the filesystem journal ensures metadata consistency regardless. See Section 15.1 for the full durability analysis.

Bounded latency analysis: Total recovery bounded by T_nvme_reset + T_reload + T_persistent_bio_resubmit. NVMe reset: 5-500 ms (controller-dependent; spec allows 10 s but well-designed controllers complete in <100 ms). Reload: ~50-80 ms. PERSISTENT bio re-submission: ~1-5 ms (200 bios at NVMe submission rate). Maximum: 600 ms for a slow controller, 200 ms typical. The NVMe controller reset is the dominant factor and is hardware-dependent.

Cascade containment: The cascade is bounded because: (a) the block layer persists PERSISTENT bios in umka-core memory, (b) filesystem journals provide idempotent recovery, (c) page cache contents survive in umka-core, and (d) NVMe controller reset is specified to complete in bounded time (NVMe spec CAP.TO field, max 255 * 500 ms = 127.5 s, but FMA escalates to PermanentFailure after the configured timeout, default 10 s). The unbounded scenario is NVMe controller firmware hang during reset (CSTS.RDY never transitions), which triggers FMA timeout escalation → mark device offline → all filesystems on the device forced read-only.
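The CAP.TO bound interacts with the FMA escalation timeout as a simple minimum. A sketch (the function name is illustrative; CAP.TO is in units of 500 ms per the NVMe specification):

```rust
/// Effective wait before FMA escalates an NVMe controller reset to
/// PermanentFailure: the spec-advertised worst case (CAP.TO * 500 ms)
/// capped by the configured FMA timeout (default 10 s).
pub fn nvme_reset_timeout_ms(cap_to: u8, fma_cap_ms: u64) -> u64 {
    let spec_ms = cap_to as u64 * 500;
    spec_ms.min(fma_cap_ms)
}
```

With CAP.TO at its maximum of 255, the spec allows 127,500 ms, but the default FMA cap bounds the wait to 10,000 ms before the device is marked offline.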

11.9.12 Power Failure During Recovery

A power failure (or hardware watchdog reset) can occur at any point during the crash recovery sequence. This section documents why the recovery protocol is safe under power loss at every step — the key property is that power-cycle resets are strictly stronger than software recovery, making every step idempotent with respect to power loss.

11.9.12.1 Idempotency by Power-Cycle Reset

PCIe power-on reset (Fundamental Reset) overrides any in-progress Function-Level Reset. Per PCIe Base Specification 5.0, Section 5.4: a Fundamental Reset places all Functions of a device into the initial state, regardless of any other reset mechanism that may be active. After power is restored, the device is in a fully defined initial state — the same state as first power-on. The kernel's next boot sequence performs normal device discovery and driver probe, with no dependency on whether a crash recovery was in progress when power was lost.

This means the crash recovery protocol does NOT need checkpoint/restart semantics. Power loss at any step reduces to "normal cold boot."

11.9.12.2 pstore Recovery-in-Progress Marker

The FMA event emitted at step 4a (FaultEvent::DriverCrash) is written to pstore/NVRAM (Section 20.7). On next boot, the kernel checks pstore for incomplete recovery events and logs diagnostic information:

umka: pstore: previous boot had in-progress driver recovery for nvme0 (crash_count=2)
umka: pstore: recovery was at step DEVICE_RESET when power was lost

This is informational only — it aids post-mortem diagnosis but does not affect boot behavior. The power cycle already reset everything; the kernel does not attempt to resume the interrupted recovery. The pstore marker is written as part of the existing FMA event emission (step 4a) and adds no new I/O to the recovery path. The recovery step identifier is encoded in the FaultEvent::DriverCrash payload's recovery_phase field (a u8 enum matching the step numbers 1-9).
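The recovery_phase encoding can be sketched as a `#[repr(u8)]` enum whose discriminants match the protocol step numbers. This is an illustration only; sub-steps such as 1a and 4a share their parent step's number here, and the decoder name is hypothetical:

```rust
/// Recovery step identifier carried in the FaultEvent::DriverCrash
/// payload's recovery_phase field. Discriminants match steps 1-9
/// (illustrative sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(u8)]
pub enum RecoveryPhase {
    Detect = 1,
    Isolate = 2,
    DrainRing = 3,
    FmaEvent = 4,
    DeviceReset = 5,
    ReleaseKabiLocks = 6,
    Unload = 7,
    Reload = 8,
    Resume = 9,
}

/// Decode the u8 read back from pstore on the next boot; unknown
/// values (e.g., a partially written entry) yield None.
pub fn decode_phase(b: u8) -> Option<RecoveryPhase> {
    match b {
        1 => Some(RecoveryPhase::Detect),
        2 => Some(RecoveryPhase::Isolate),
        3 => Some(RecoveryPhase::DrainRing),
        4 => Some(RecoveryPhase::FmaEvent),
        5 => Some(RecoveryPhase::DeviceReset),
        6 => Some(RecoveryPhase::ReleaseKabiLocks),
        7 => Some(RecoveryPhase::Unload),
        8 => Some(RecoveryPhase::Reload),
        9 => Some(RecoveryPhase::Resume),
        _ => None,
    }
}
```

The boot-time log line above ("recovery was at step DEVICE_RESET") is produced by decoding this field from the pstore entry.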

11.9.12.3 Filesystem Data Integrity Under Power Loss During Recovery

Dirty page cache flush during recovery writes through the block layer using the normal I/O path. The filesystem's journal (ext4 JBD2, XFS log) protects metadata consistency on power failure — this is the same guarantee provided by Linux. The crash recovery protocol does NOT bypass the filesystem's durability mechanisms: all writes go through the block layer's write path, which respects FUA (Force Unit Access) and barrier semantics (Section 15.1).

Specific scenarios:

  • Power loss during DRAIN (step 3-4): In-flight bios that were completed with -EIO have already returned errors to the filesystem. The filesystem treats this as equivalent to a device error during normal operation. On next boot, journal replay recovers metadata to the last committed transaction.

  • Power loss during DEVICE RESET (step 5): The device may be mid-FLR. Power-cycle Fundamental Reset overrides FLR and places the device in initial state. Next boot performs normal probe.

  • Power loss during RELOAD (step 8): The new driver instance was partially initialized. Power cycle discards all volatile state. Next boot loads the driver fresh. If PERSISTENT bios were being re-submitted, they are lost in volatile memory — but the filesystem journal on the NVMe persistent media is intact, so journal replay on next boot recovers to a consistent state.

11.9.12.4 Per-Step Idempotency Analysis

| Recovery Step | State if Power Lost | Next Boot Behavior |
|---|---|---|
| 1. DETECT | Fault detected, pre-recovery | Normal boot; device operational; pstore may have partial fault log |
| 1a. TIER CHECK | Tier determination in progress | Normal boot; tier re-evaluated at probe time |
| 2. ISOLATE | Domain permissions cleared, interrupts masked | Power cycle resets all CPU domain registers and interrupt controllers; normal boot |
| 3. DRAIN RING | Partial ring drain; some bios completed with -EIO | Rings re-initialized on boot; filesystem journal replay recovers metadata consistency |
| 4. FMA EVENT | pstore entry may be partial or missing | pstore reports prior crash (if entry was fully written) or is empty (if not) |
| 4a. FMA EMIT | FMA ring entry partially written | FMA ring is volatile; reinitialized on boot; pstore captures what was flushed |
| 5. DEVICE RESET (FLR) | FLR may be in-progress | Power-cycle Fundamental Reset overrides FLR (PCIe 5.0 §5.4); device in initial state |
| 6. RELEASE LOCKS | Some locks force-released, others still held | All locks are volatile (in-memory); reinitialized on boot |
| 7. UNLOAD | Driver partially unloaded | All driver memory is volatile; freed on boot |
| 8. RELOAD | New driver partially loaded | Partial load discarded; normal driver probe on boot |
| 9. RESUME | Recovery complete or nearly complete | Normal boot; if recovery had completed, next boot is clean |

Key invariant: Every row reduces to "normal boot" because all recovery state is volatile (in-memory). The only persistent state is: (a) the device's on-media data (protected by filesystem journal), (b) the pstore NVRAM region (informational), and (c) the device's PCIe configuration (reset by Fundamental Reset). No recovery step writes persistent state that would leave the system in an inconsistent state on power loss.

11.9.13 Per-Architecture Crash Isolation Comparison

The crash recovery protocol described above applies uniformly across architectures, but the scope of a single crash — how many drivers are affected — varies significantly depending on the number of available hardware isolation domains. This section compares crash isolation granularity and recovery characteristics across the supported architectures.

See Section 11.2 for the full per-architecture isolation mechanism specification and Section 11.1 for tier definitions.

11.9.13.1 x86-64: MPK (12 Driver Domains)

x86-64 Memory Protection Keys provide 16 domains total, of which 12 are available for Tier 1 drivers (domains 1-12; domain 0 is reserved for Core, domains 13-15 for kernel infrastructure). Each driver can occupy its own domain, enabling per-driver crash isolation: a fault in the ext4 driver does NOT corrupt the block layer, VFS dentry cache, NIC driver, or any other driver.

  • Crash blast radius: Single driver only. No co-tenancy damage.
  • Recovery scope: Only the faulted driver is unloaded and reloaded.
  • Recovery time: ~100-200 ms (dominated by FLR + driver reload).
  • Typical deployment: ext4, XFS, NVMe, NIC, GPU each in separate domains. 12 domains comfortably cover a typical server's driver set.

11.9.13.2 AArch64: POE (3 Driver Domains)

AArch64 Permission Overlay Extension provides 4 overlay indices, of which 3 are available for Tier 1 drivers (index 0 is Core). Drivers are grouped into domains by subsystem: typically Domain 1 = network stack, Domain 2 = storage stack (VFS + block + filesystem drivers), Domain 3 = miscellaneous (GPU, input, etc.).

  • Crash blast radius: All drivers sharing the domain. A bug in ext4 that corrupts memory within Domain 2 may affect the block layer and other filesystem drivers in the same domain. Rust memory safety mitigates cross-driver corruption within a shared domain (buffer overruns and use-after-free are prevented at compile time), but logic bugs that produce incorrect data through safe Rust code are not contained by domain boundaries.
  • Recovery scope: The entire domain must be recovered — all drivers in the faulted domain are unloaded and reloaded. For the storage domain, this means recovering VFS, block layer, and filesystem drivers together.
  • Recovery time: ~200-500 ms (higher than x86-64 due to multi-driver reload and coordinated re-initialization of the storage stack).
  • Value proposition: "Storage-stack-level isolation" — a storage crash does not affect networking, and vice versa. This is still vastly better than Linux (where any driver crash kills the entire kernel).

See Section 24.5 for the security analysis of domain grouping and the conditions under which shared-domain co-tenancy is acceptable.

11.9.13.3 RISC-V, s390x, LoongArch64: Tier 1 Unavailable

These architectures lack hardware memory domain isolation mechanisms suitable for Tier 1 (no MPK/POE equivalent). Tier 1 is unavailable — drivers choose Tier 0 or Tier 2 depending on licensing, driver preference, and sysadmin decision (Section 11.2).

  • Tier 0 (in-kernel, no isolation): Driver runs with full kernel privileges. A crash is a kernel panic — the full crash recovery protocol does not apply. Mitigation is Rust memory safety (eliminates buffer overruns, use-after-free, data races at compile time) plus software integrity checks (ring buffer validation, watchdog timers). Recovery: full system reboot (~30-60 seconds).
  • Tier 2 (Ring 3, full process + IOMMU isolation): Driver runs as a userspace process with IOMMU-enforced DMA isolation. Available on ALL architectures that have an IOMMU. A crash is fully contained — the driver process is restarted. Recovery: ~10-50 ms (process restart, no FLR needed if IOMMU domain is intact). Performance cost: message-passing IPC instead of shared-memory rings (~2-5 us per I/O operation vs ~0.1-0.5 us for Tier 1).
  • Typical deployment: Upstream open-source drivers (high trust) run as Tier 0 for maximum performance. Proprietary or out-of-tree drivers run as Tier 2 for isolation.

11.9.13.4 ARMv7: DACR (4 Driver Domains)

The ARMv7 Domain Access Control Register provides 16 domains, but practical usability is limited by TLB pressure: each domain assignment requires page-table domain-field updates. In practice, 4 domains are allocated for Tier 1 drivers.

  • Crash blast radius: Grouped drivers within a domain (similar to AArch64 POE).
  • Recovery scope: All drivers in the faulted domain.
  • Recovery time: ~150-400 ms.

11.9.13.5 PPC32 / PPC64LE: Segment-Based Isolation

PPC32 uses segment registers (16 segments) and PPC64LE uses Radix Tree Translation with PID-based isolation. Both provide limited Tier 1 domain counts (4-6 usable domains).

  • Crash blast radius: Grouped drivers within a segment/PID domain.
  • Recovery scope: All drivers in the faulted domain.
  • Recovery time: ~150-400 ms (PPC64LE) to ~200-500 ms (PPC32).

11.9.13.6 Survivability Comparison

Per failure mode, behavior on x86-64 (12 domains), AArch64 POE (3 domains), ARMv7 DACR (4 domains), PPC32/PPC64LE (4-6 domains), and architectures where Tier 1 is unavailable (RISC-V, s390x, LoongArch64):

  • Single FS driver bug: x86-64 contains it to the FS domain. AArch64 (VFS+block+FS), ARMv7, and PPC suffer a shared-domain crash. Where Tier 1 is unavailable: Tier 0 kernel panic / Tier 2 process restart.
  • Block layer bug: x86-64 contains it to the block domain. AArch64 (VFS+block+FS), ARMv7, and PPC suffer a shared-domain crash. Where Tier 1 is unavailable: Tier 0 kernel panic / Tier 2 process restart.
  • NIC driver bug: x86-64 contains it to the NIC domain; AArch64, ARMv7, and PPC contain it to the network domain. Where Tier 1 is unavailable: Tier 0 kernel panic / Tier 2 process restart.
  • GPU driver bug: x86-64 contains it to the GPU domain; AArch64, ARMv7, and PPC contain it to the misc domain. Where Tier 1 is unavailable: Tier 0 kernel panic / Tier 2 process restart.
  • Recovery time (single driver): x86-64 ~100-200 ms; AArch64 ~200-500 ms (domain-wide); ARMv7 ~150-400 ms (domain-wide); PPC ~150-500 ms (domain-wide). Tier 0: reboot ~30-60 s / Tier 2: ~10-50 ms.
  • Max concurrent isolated crashes: x86-64 12 (one per domain); AArch64 3; ARMv7 4; PPC 4-6. Tier 0: 0 / Tier 2: unlimited (one per process).

Key observation: x86-64 provides the finest-grained crash isolation. AArch64 POE, ARMv7 DACR, and PPC provide subsystem-level isolation (still far better than Linux's all-or-nothing model). On architectures where Tier 1 is unavailable, Tier 2 provides the strongest isolation (full process boundary + IOMMU) at the cost of higher per-I/O latency, while Tier 0 provides maximum performance with Rust safety as the primary mitigation.

See also: Section 13.18 (Live Kernel Evolution) extends crash recovery to proactively replace core kernel components at runtime, reusing the same state-export/reload mechanism; it also covers planned (non-crash) driver replacement, i.e. deliberate upgrades where the old driver is healthy. The graceful path drains in-flight I/O to completion (no -EIO errors), skips device reset, and performs a cooperative state handoff, completing in ~7-35 ms for a typical NIC (vs ~50-150 ms for crash recovery). Section 20.1 (Fault Management) adds predictive telemetry and diagnosis before crashes occur.

11.9.14 Swap Device Crash Interaction

When a block device driver crashes and the device serves a swap area, the swap subsystem and OOM killer must be notified immediately. Without notification, the reclaim path attempts page-out to a device that is being recovered, stalling the entire memory management subsystem.

Notification protocol (integrated into step 1 of the recovery sequence):

Block driver crash detected (step 1, FAULT DETECTED):
  For each swap area backed by the crashed device:
    1. swap_device_suspend(swap_area_id)
       → Sets swap_area.state = SWAP_SUSPENDED (new state, distinct from
         SWAP_ACTIVE and SWAP_WRITEOUT).
       → Reclaim path (kswapd, direct reclaim) skips SWAP_SUSPENDED areas:
         if swap_area.state == SWAP_SUSPENDED { continue; }
       → No page-out attempts to the crashed device during recovery.

    2. OOM killer adjustment:
       → oom_available_memory() subtracts SWAP_SUSPENDED swap space from
         the available memory calculation.
       → This may trigger OOM kill earlier — correct: memory IS less
         available. The OOM killer makes decisions based on actual usable
         resources, not nominal capacity.
       → Pages already swapped out to the crashed device remain on-disk
         (the NVMe flash retains data across FLR). Swap-in of those pages
         stalls until recovery completes (step 9).

Block driver recovery complete (step 9, RESUME):
  For each swap area backed by the recovered device:
    1. swap_device_resume(swap_area_id)
       → Sets swap_area.state = SWAP_ACTIVE.
       → Normal reclaim and swap-in resume.
       → Stalled swap-in requests (tasks waiting for pages on the
         recovered device) are woken and retried.

Memory cost: Zero — one flag (state: AtomicU8) per swap area, already part of the swap area descriptor. Two function calls in the crash/resume paths.

Edge case — all swap devices crash: If all swap areas are SWAP_SUSPENDED (e.g., single NVMe serving both rootfs and swap), the system operates with zero swap until recovery. This is equivalent to running with swapoff -a. Memory pressure increases; OOM killer may activate. After recovery, swap resumes automatically.
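The suspend/resume protocol above can be sketched in Rust as follows. This is a host-testable simplification, not the kernel's actual code: the swap-area descriptor is reduced to the one `AtomicU8` state field the protocol touches plus a nominal page count, and the function names mirror the protocol description.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Swap-area states from the notification protocol. SWAP_SUSPENDED is the
// state added for crash recovery, distinct from ACTIVE and WRITEOUT.
const SWAP_ACTIVE: u8 = 0;
const SWAP_WRITEOUT: u8 = 1;
const SWAP_SUSPENDED: u8 = 2;

/// Simplified swap-area descriptor: one AtomicU8 of state (the protocol's
/// stated memory cost) plus nominal capacity in pages.
pub struct SwapArea {
    state: AtomicU8,
    pages: u64,
}

impl SwapArea {
    pub fn new(pages: u64) -> Self {
        Self { state: AtomicU8::new(SWAP_ACTIVE), pages }
    }
    /// Crash path (recovery step 1): mark the area suspended so reclaim skips it.
    pub fn swap_device_suspend(&self) {
        self.state.store(SWAP_SUSPENDED, Ordering::Release);
    }
    /// Resume path (recovery step 9): reactivate; stalled swap-ins retry.
    pub fn swap_device_resume(&self) {
        self.state.store(SWAP_ACTIVE, Ordering::Release);
    }
    fn suspended(&self) -> bool {
        self.state.load(Ordering::Acquire) == SWAP_SUSPENDED
    }
}

/// Reclaim-path filter: kswapd / direct reclaim iterate only non-suspended areas.
pub fn reclaim_candidates(areas: &[SwapArea]) -> impl Iterator<Item = &SwapArea> {
    areas.iter().filter(|a| !a.suspended())
}

/// OOM-killer adjustment: suspended swap space is subtracted from the
/// available-memory calculation.
pub fn oom_available_swap_pages(areas: &[SwapArea]) -> u64 {
    areas.iter().filter(|a| !a.suspended()).map(|a| a.pages).sum()
}
```

Note how the all-swap-devices-crashed edge case falls out naturally: when every area is suspended, `oom_available_swap_pages` returns 0 and `reclaim_candidates` is empty, which is exactly the swapoff -a behavior described above.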



11.10 Channel I/O Subsystem (s390x)

Summary: This section specifies the s390x Channel I/O bus subsystem — the only device I/O model in UmkaOS that does not use PCI, MMIO, or memory-mapped registers. s390x devices communicate through channel programs: linked lists of Channel Command Words (CCWs) executed autonomously by the channel subsystem hardware. This is fundamentally different from the PCI/platform/DT bus model used by all other seven supported architectures. The generic bus enumeration, device matching, and lifecycle management described in Section 11.4 apply to s390x devices — this section specifies the s390x-specific transport, addressing, enumeration, and I/O submission mechanisms that plug into that generic framework.

11.10.1 Architecture Overview

On every other UmkaOS-supported architecture, devices expose MMIO registers that the CPU reads and writes directly. On s390x, the CPU never touches device registers. Instead:

  1. The CPU constructs a channel program — an array of CCW instructions in main memory.
  2. The CPU issues a privileged instruction (SSCH — Start SubChannel) that hands the channel program to the channel subsystem, a hardware co-processor integrated into the I/O fabric.
  3. The channel subsystem executes the CCW chain autonomously, transferring data between device and memory without further CPU involvement (analogous to DMA, but driven by the channel hardware rather than the device).
  4. On completion (or error), the channel subsystem delivers an I/O interrupt via the PSW-swap mechanism at lowcore offset 0x1F0. See Section 3.8 for the s390x interrupt model.

Key s390x I/O concepts:

  • Subchannel addressing: every device is identified by a triple (css_id: u8, ssid: u8, devno: u16) — Channel Subsystem Image ID, Subchannel Set ID, and Device Number. This replaces PCI BDF (bus/device/function) addressing.
  • SCHIB (SubChannel Information Block): per-device configuration and status block, read via STSCH (Store SubChannel) and modified via MSCH (Modify SubChannel).
  • I/O instructions: SSCH (Start SubChannel), TSCH (Test SubChannel — read completion status), HSCH (Halt SubChannel), CSCH (Clear SubChannel), RSCH (Resume SubChannel). All are privileged (supervisor-state only).
  • I/O completion: delivered as I/O interrupts. The lowcore I/O interruption code contains the subchannel ID that completed, allowing the handler to dispatch to the correct driver.

11.10.2 Subchannel Enumeration

At boot time, the kernel discovers all attached devices by scanning the subchannel address space. There is no equivalent to PCI bus enumeration or device-tree parsing — the channel subsystem is the sole device discovery mechanism on s390x.

11.10.2.1 Core Data Structures

/// Subchannel identifier — uniquely addresses a device on the s390x channel subsystem.
///
/// Matches the hardware layout defined in the z/Architecture Principles of Operation
/// and Linux `arch/s390/include/uapi/asm/schid.h`:
///
/// ```text
/// Bits [31:24] = css_id  (8 bits)  — Channel Subsystem Image ID (0-255)
/// Bits [23:20] = reserved (4 bits) — Must be zero
/// Bit  [19]    = m       (1 bit)   — Multipath mode
/// Bits [18:17] = ssid    (2 bits)  — Subchannel Set ID (0-3)
/// Bit  [16]    = one     (1 bit)   — Must be 1 for valid subchannel IDs
/// Bits [15:0]  = sch_no  (16 bits) — Subchannel number (used in STSCH/SSCH)
/// ```
///
/// Stored as a packed `u32`; accessor methods extract individual fields.
/// Note: `sch_no` is the subchannel number used in I/O instructions (STSCH/SSCH),
/// NOT the device number (`devno`), which is a separate field inside the PMCW.
#[repr(C)]
pub struct SubchannelId {
    /// Raw 32-bit hardware representation.  Accessor methods below extract
    /// individual bitfields.
    pub raw: u32,
}

const_assert!(core::mem::size_of::<SubchannelId>() == 4);

impl SubchannelId {
    /// Construct from components. Sets the architecturally-required `one` bit.
    pub fn new(css_id: u8, ssid: u8, sch_no: u16, multipath: bool) -> Self {
        let raw = ((css_id as u32) << 24)
            | ((multipath as u32) << 19)
            | (((ssid & 0x3) as u32) << 17)
            | (1u32 << 16) // `one` bit — must be 1
            | (sch_no as u32);
        Self { raw }
    }
    /// Channel Subsystem Image ID (bits [31:24]).
    pub fn css_id(&self) -> u8  { (self.raw >> 24) as u8 }
    /// Multipath mode (bit [19]).
    pub fn m(&self) -> bool     { (self.raw >> 19) & 1 != 0 }
    /// Subchannel Set ID (bits [18:17], 0-3).
    pub fn ssid(&self) -> u8    { ((self.raw >> 17) & 0x3) as u8 }
    /// Must-be-one bit (bit [16]).  Hardware requires this set for valid IDs.
    pub fn one(&self) -> bool   { (self.raw >> 16) & 1 != 0 }
    /// Subchannel number (bits [15:0]).  Used in STSCH/SSCH instructions.
    pub fn sch_no(&self) -> u16 { self.raw as u16 }
}
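The bit layout above can be verified with a small host-side round trip. The `pack` helper below restates the packing formula from `SubchannelId::new` as a free function so it runs standalone; it is illustrative, not kernel API.

```rust
/// Standalone restatement of the SubchannelId packing formula documented
/// above: css_id in bits [31:24], m in bit [19], ssid in bits [18:17],
/// the mandatory `one` bit in [16], and sch_no in bits [15:0].
pub fn pack(css_id: u8, ssid: u8, sch_no: u16, multipath: bool) -> u32 {
    ((css_id as u32) << 24)
        | ((multipath as u32) << 19)
        | (((ssid & 0x3) as u32) << 17)
        | (1u32 << 16) // `one` bit — must be 1
        | (sch_no as u32)
}
```

For example, `pack(0, 1, 0x0042, true)` yields 0x000B_0042: m (0x0008_0000) + ssid 1 (0x0002_0000) + the `one` bit (0x0001_0000) + sch_no 0x42.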

/// SubChannel Information Block — the per-device state block read via the `STSCH`
/// instruction and modified via `MSCH`.
///
/// All multi-byte integer fields in s390x Channel I/O structs use native types
/// (not `Be32`/`Be64`) because s390x is big-endian and these structs are
/// architecture-specific hardware register blocks, never used cross-node or
/// cross-endianness. CLAUDE.md rule 12 (endian wrapper types) applies to wire
/// structs transmitted between nodes or stored on disk; it does not apply to
/// arch-specific hardware register interfaces.
///
/// Every subchannel has exactly one SCHIB. The kernel reads it to determine device
/// configuration (PMCW), current I/O status (SCSW), and path availability. The SCHIB
/// is 52 bytes, aligned to a 4-byte boundary (architectural requirement for `STSCH`).
///
/// Layout (matches Linux `struct schib` in `drivers/s390/cio/cio.h`):
/// - Bytes  0-27: PMCW (Path Management Control Word) — device configuration (28 bytes).
/// - Bytes 28-39: SCSW (SubChannel Status Word) — last I/O completion status (12 bytes).
/// - Bytes 40-47: MBA (Measurement Block Address) — u64.
/// - Bytes 48-51: MDA (Model-Dependent Area) — 4 bytes.
/// Note: `mba` is split into two `u32` halves to keep struct alignment at 4
/// (matching the s390x Principles of Operation which defines SCHIB as 52 bytes
/// at a word boundary). Using a native `u64` would force alignment 8 and pad
/// the struct to 56 bytes, breaking the hardware layout.
#[repr(C, align(4))]
pub struct Schib {
    /// Path Management Control Word — device configuration and path state.
    pub pmcw: Pmcw,
    /// SubChannel Status Word — I/O completion status from the last operation.
    pub scsw: Scsw,
    /// Measurement block address (high 32 bits). Reconstruct full address:
    /// `((mba_hi as u64) << 32) | mba_lo as u64`.
    pub mba_hi: u32,
    /// Measurement block address (low 32 bits).
    pub mba_lo: u32,
    /// Model-dependent area. Contents vary by machine generation; reserved.
    pub mda: [u8; 4],
}
/// 28(pmcw) + 12(scsw) + 4(mba_hi) + 4(mba_lo) + 4(mda) = 52 bytes
const_assert!(size_of::<Schib>() == 52);

/// Path Management Control Word — the configuration portion of the SCHIB.
///
/// The PMCW describes the device's channel paths, enablement state, and interrupt
/// routing. The kernel reads the PMCW via `STSCH` during enumeration and modifies
/// it via `MSCH` to enable/disable subchannels and change ISC (Interrupt Sub-Class)
/// routing.
///
/// Key fields for device discovery:
/// - `devno`: the device number visible to the guest.
/// - `pim` (Path Installed Mask): which of the 8 possible channel paths are wired.
/// - `pam` (Path Available Mask): which installed paths are currently operational.
/// - `flags`: PMCW word 1 upper 16 bits — subchannel control flags.
#[repr(C)]
pub struct Pmcw {
    /// Interrupt parameter — passed back in the I/O interruption code on completion.
    /// The kernel sets this to a value that identifies the device (e.g., a pointer
    /// to the device descriptor or an index into the subchannel table).
    pub intparm: u32,
    /// Flags word (upper 16 bits of PMCW word 1).
    /// s390x big-endian bit numbering within the u16:
    /// - Bit 0: QF (QDIO Facility) — set if the subchannel supports QDIO mode.
    /// - Bit 1: W (reserved for WSCH — Write Subchannel).
    /// - Bits 2-4: ISC (Interrupt Sub-Class, 0-7). Controls which CPUs receive
    ///   I/O interrupts for this subchannel (via CR6 ISC mask).
    /// - Bits 5-7: Reserved (must be zero).
    /// - Bit 8: E (Enabled). Subchannel is enabled for I/O operations.
    /// - Bits 9-10: LM (Limit Mode). Controls channel program address range.
    /// - Bits 11-12: MME (Measurement-Mode Enable). Controls measurement collection.
    /// - Bit 13: MP (Multipath Mode). If set, I/O can use multiple channel paths.
    /// - Bit 14: TF (Timing Facility). If set, timing measurements are available.
    /// - Bit 15: DNV (Device Number Valid). If clear, `devno` is meaningless.
    pub flags: u16,
    /// Device number. This is the `devno` as assigned by firmware, distinct from
    /// `SubchannelId::sch_no()` (the subchannel number used in I/O instructions).
    /// Included here for consistency with the hardware PMCW block layout.
    pub devno: u16,
    /// Logical Path Mask — identifies which channel paths may be used for I/O.
    /// Each bit corresponds to one of 8 possible channel paths (CHPID 0-7).
    pub lpm: u8,
    /// Path Not Operational Mask — channel paths detected as non-functional.
    pub pnom: u8,
    /// Last Path Used Mask — the channel path used for the most recent I/O operation.
    pub lpum: u8,
    /// Path Installed Mask — which channel paths are physically wired to this device.
    /// A bit is set if the corresponding CHPID is installed.
    pub pim: u8,
    /// Measurement Block Index — index into the measurement block area (if MM=1).
    pub mbi: u16,
    /// Path Operational Mask — channel paths that are installed AND operational.
    pub pom: u8,
    /// Path Available Mask — channel paths available for I/O (operational and not
    /// reserved by another partition).
    pub pam: u8,
    /// Channel Path IDs — the 8 possible CHPIDs connecting this subchannel to
    /// the channel subsystem. `chpid[i]` is meaningful only if `pim & (0x80 >> i)` is set.
    pub chpid: [u8; 8],
    /// Last 4 bytes of PMCW — subchannel type and measurement controls.
    /// Hardware layout (Linux `drivers/s390/cio/cio.h` `struct pmcw`):
    /// ```text
    /// Bits [31:24] = unused1  (8 bits)  — reserved zeros
    /// Bits [23:21] = st       (3 bits)  — subchannel type (I/O=0, CHSC=1, MSG=2)
    /// Bits [20:3]  = unused2  (18 bits) — reserved zeros
    /// Bit  [2]     = mbfc     (1 bit)   — measurement block format control
    /// Bit  [1]     = xmwme    (1 bit)   — extended measurement word mode enable
    /// Bit  [0]     = csense   (1 bit)   — concurrent sense
    /// ```
    /// Stored as raw `u32`; accessor methods extract fields.
    /// On MSCH, the kernel writes back the full `flags2` word read from STSCH,
    /// modifying only the fields it intends to change (principally `csense`).
    pub flags2: u32,
}

impl Pmcw {
    /// Subchannel type (bits [23:21]): 0=I/O, 1=CHSC, 2=MSG.
    pub fn st(&self) -> u8     { ((self.flags2 >> 21) & 0x7) as u8 }
    /// Measurement block format control (bit [2]).
    pub fn mbfc(&self) -> bool { (self.flags2 >> 2) & 1 != 0 }
    /// Extended measurement word mode enable (bit [1]).
    pub fn xmwme(&self) -> bool { (self.flags2 >> 1) & 1 != 0 }
    /// Concurrent sense (bit [0]).  May need to be set on MSCH for proper
    /// sense data retrieval.
    pub fn csense(&self) -> bool { self.flags2 & 1 != 0 }
}

/// 4(intparm) + 2(flags) + 2(devno) + 1(lpm) + 1(pnom) + 1(lpum) + 1(pim) +
/// 2(mbi) + 1(pom) + 1(pam) + 8(chpid) + 4(flags2) = 28 bytes
const_assert!(size_of::<Pmcw>() == 28);

/// SubChannel Status Word — the I/O completion status portion of the SCHIB.
///
/// After an I/O operation completes (or fails), the channel subsystem stores the
/// final status in the SCSW. The kernel reads it via `TSCH` (Test SubChannel) to
/// determine whether the operation succeeded and how many bytes were transferred.
///
/// Status bits are checked in a defined priority order:
/// 1. Primary status (channel-end, device-end).
/// 2. Secondary status (unit check, unit exception, attention).
/// 3. If unit check is set, the driver must issue a SENSE CCW to read detailed
///    sense data from the device.
#[repr(C)]
pub struct Scsw {
    /// Flags and control bits. Includes function control (start/halt/clear),
    /// activity control (resume/subchannel-active/device-active/suspended),
    /// and status control (alert/intermediate/primary/secondary/status-pending).
    pub flags: u32,
    /// CCW address — points to the CCW that was executing when the status was stored.
    /// This allows the driver to determine how far the channel program progressed.
    pub ccw_addr: u32,
    /// Device status byte. Bit layout:
    /// - Bit 0: Attention — device requesting service (unsolicited).
    /// - Bit 1: Status Modifier — modifies the meaning of other status bits.
    /// - Bit 2: Control Unit End — control unit completed its part.
    /// - Bit 3: Busy — device or control unit is busy.
    /// - Bit 4: Channel End (CE) — channel program execution completed.
    /// - Bit 5: Device End (DE) — device completed the operation.
    /// - Bit 6: Unit Check (UC) — device error; sense data available.
    /// - Bit 7: Unit Exception (UE) — end-of-medium, end-of-file, or similar.
    pub dev_status: u8,
    /// Subchannel status byte. Bit layout:
    /// - Bit 0: Program-controlled interrupt (PCI flag in CCW triggered this).
    /// - Bit 1: Incorrect length — transfer length did not match CCW count.
    /// - Bit 2: Program check — invalid CCW or address error in channel program.
    /// - Bit 3: Protection check — storage protection violation during transfer.
    /// - Bit 4: Channel data check — data error on the channel path.
    /// - Bit 5: Channel control check — protocol error on the channel path.
    /// - Bit 6: Interface control check — hardware failure on the channel path.
    /// - Bit 7: Chaining check — error during CCW chain fetch.
    pub sch_status: u8,
    /// Residual byte count — number of bytes NOT transferred from the last CCW.
    /// If the CCW requested 4096 bytes and 3000 were transferred, `residual` = 1096.
    pub residual: u16,
}
/// 4(flags) + 4(ccw_addr) + 1(dev_status) + 1(sch_status) + 2(residual) = 12 bytes
const_assert!(size_of::<Scsw>() == 12);

11.10.2.2 Enumeration Procedure

The boot-time subchannel scan runs after Phase 2 (SCLP console init) in the s390x boot sequence (Section 2.12):

  1. Iterate the subchannel address space: css_id 0..MAX_CSS (typically only css_id 0 on QEMU and most LPARs), ssid 0..=3, sch_no 0..=0xFFFF. The total address space is 4 * 65536 = 262,144 possible subchannels per CSS image. Note: the iteration variable is sch_no (subchannel number), NOT devno (device number). They are distinct hardware concepts — sch_no addresses the subchannel for I/O instructions (STSCH/SSCH/TSCH), while devno is a device-level identifier read from the PMCW after STSCH succeeds. Linux's for_each_subchannel in drivers/s390/cio/css.c iterates schid.sch_no.
  2. Issue STSCH for each (css_id, ssid, sch_no) triple, constructing a SubchannelId via SubchannelId::new(css_id, ssid, sch_no, false). The instruction returns a condition code (CC):
       • CC=0: subchannel exists. The SCHIB is stored at the provided address.
       • CC=3: subchannel does not exist. Skip.
       • CC=1 or CC=2: subchannel is busy or status pending — retry after draining status.
  3. Check enablement: read PMCW.flags bit 8 (E, in the big-endian bit numbering documented on Pmcw — mask 0x0080). If not set, the subchannel exists but is not configured for this guest — skip it. Extract devno from schib.pmcw.devno — this is the device number, distinct from sch_no.
  4. Issue SENSE ID CCW to each enabled subchannel (using the SubchannelId from step 2, NOT devno). The response contains the Control Unit type/model and Device type/model — the s390x equivalents of PCI vendor_id/device_id. This identifies the device class (DASD, tape, network, virtio, etc.).
  5. Create DeviceNode with BusIdentity::ChannelIo { css_id, ssid, sch_no, devno } and populate ChannelIoResources with the SubchannelId, SCHIB snapshot, CHPID list (from PMCW.pam), and SENSE ID results.
  6. Register in the device registry (Section 11.4). Driver matching proceeds by (cu_type, cu_model, dev_type, dev_model) — analogous to PCI (vendor_id, device_id, subsystem_vendor, subsystem_device).
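The scan loop above can be exercised on a host with STSCH modeled as a closure. This is a simplified sketch: `MiniSchib` keeps only the two PMCW fields the scan inspects (flags and devno), the busy/retry path is elided, and the names are illustrative rather than the kernel's actual API.

```rust
/// Reduced SCHIB carrying only what the enumeration scan inspects.
#[derive(Clone, Copy)]
pub struct MiniSchib {
    /// PMCW flags word; the enabled bit E is bit 8 in big-endian
    /// numbering, i.e. mask 0x0080 of the u16.
    pub flags: u16,
    /// Device number from the PMCW (distinct from sch_no).
    pub devno: u16,
}

/// STSCH outcome, mirroring the condition codes: CC=0 stores the SCHIB,
/// CC=3 means no subchannel, CC=1/2 mean busy or status pending.
pub enum StschCc {
    Stored(MiniSchib),
    NotOperational,
    BusyOrPending,
}

const PMCW_ENA: u16 = 0x0080; // E bit: subchannel enabled for this guest

/// Scan ssid 0..=3 x sch_no 0..=max, returning (ssid, sch_no, devno) for
/// every enabled subchannel. In the real kernel, busy subchannels are
/// retried after draining status; here they are simply skipped.
pub fn scan_subchannels(
    max_sch_no: u16,
    stsch: impl Fn(u8, u16) -> StschCc,
) -> Vec<(u8, u16, u16)> {
    let mut found = Vec::new();
    for ssid in 0u8..4 {
        for sch_no in 0..=max_sch_no {
            match stsch(ssid, sch_no) {
                StschCc::NotOperational => continue, // CC=3: no subchannel
                StschCc::BusyOrPending => continue,  // CC=1/2: retry elided
                StschCc::Stored(schib) if schib.flags & PMCW_ENA != 0 => {
                    found.push((ssid, sch_no, schib.devno));
                }
                StschCc::Stored(_) => continue, // exists but not enabled
            }
        }
    }
    found
}
```

The SENSE ID, DeviceNode creation, and registry steps would follow for each entry in the returned list.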

11.10.2.3 ChannelIoResources

Resources discovered during enumeration and needed by drivers for I/O submission. Stored alongside BusIdentity::ChannelIo in the device node. Drivers access these through the device-services accessor API.

/// Per-device Channel I/O resources populated during subchannel enumeration (step 5).
/// The `SubchannelId` is the key resource: drivers need it to issue SSCH/TSCH
/// instructions for I/O submission. In Linux, the `ccw_device` reaches its
/// subchannel via `to_subchannel(cdev->dev.parent)` → `sch->schid`.
// kernel-internal, not KABI — accessed via device-services accessors
pub struct ChannelIoResources {
    /// Full SubchannelId for I/O instruction operands (SSCH, TSCH, MSCH, HSCH, CSCH).
    /// Contains css_id, ssid, sch_no, and the architecturally-required `one` bit.
    pub subchannel_id: SubchannelId,
    /// Snapshot of the SCHIB read at enumeration time. Updated on path events
    /// (CHP vary-on/off) and after MSCH (Modify SubChannel). Drivers should
    /// re-read PMCW fields via the accessor API, not cache this snapshot.
    pub schib_snapshot: Schib,
    /// Channel Path ID mask from PMCW.pam — identifies which physical channel
    /// paths are available for this subchannel. Each bit corresponds to one of
    /// 8 possible CHPIDs. Used for multipath I/O decision-making.
    pub chpid_mask: u8,
    /// Control Unit type from SENSE ID response.
    pub cu_type: u16,
    /// Control Unit model from SENSE ID response.
    pub cu_model: u8,
    /// Device type from SENSE ID response.
    pub dev_type: u16,
    /// Device model from SENSE ID response.
    pub dev_model: u8,
    /// Device number read from PMCW.devno. Distinct from sch_no (see enumeration
    /// step 3). Used for sysfs display and user-facing device identification,
    /// NOT for I/O instructions.
    pub devno: u16,
}

11.10.2.4 Hotplug

Channel path (CHP) vary-on and vary-off events are delivered as Machine Check interrupts (lowcore offset 0x1E0). When a CHP state change is detected:

  1. The Machine Check handler reads the Channel Report Word (CRW) via STCRW (Store Channel Report Word) instruction. The CRW identifies the affected CHPID and the nature of the change (available, not available, error).
  2. For vary-on (path becomes available): rescan subchannels on the affected CHPID. New devices are enumerated and registered. Existing devices may gain additional paths (update PMCW.pam).
  3. For vary-off (path removed): update PMCW.pam for affected subchannels. If a device loses all paths, notify the bound driver via the standard device-removal callback. If alternate paths remain, the device continues operating on surviving paths (multipath failover).
  4. CRW processing is serialized — the kernel drains all pending CRWs in a loop until STCRW returns CC=1 (no more CRWs pending).
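The serialized CRW drain (step 4) can be sketched as follows, with STCRW modeled as a closure that returns `Some(crw)` until the pending queue is empty (the real kernel issues the privileged instruction and checks for CC=1). The `Crw` model here carries only the fields the handler dispatches on; the names are illustrative.

```rust
/// Nature of a channel-path state change reported in a CRW.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ChpEvent {
    Available,
    NotAvailable,
    Error,
}

/// Minimal Channel Report Word model: the affected CHPID and the event.
/// (Real CRWs also carry source and reporting-code fields.)
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Crw {
    pub chpid: u8,
    pub event: ChpEvent,
}

/// Drain all pending CRWs in order: loop until the STCRW stand-in
/// returns None (hardware CC=1, no more CRWs pending), handing each
/// report to the handler. Returns the number of CRWs processed.
pub fn drain_crws(
    mut stcrw: impl FnMut() -> Option<Crw>,
    mut handle: impl FnMut(Crw),
) -> usize {
    let mut n = 0;
    while let Some(crw) = stcrw() {
        // vary-on: rescan the CHPID; vary-off: update PMCW.pam and
        // notify drivers whose devices lost their last path.
        handle(crw);
        n += 1;
    }
    n
}
```

Draining inside a single loop preserves the ordering guarantee: a vary-off followed by a vary-on for the same CHPID is observed in that order.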

11.10.3 CCW Program Model

Channel programs are the fundamental I/O primitive on s390x. A channel program is a contiguous array of CCW instructions in main memory. The channel subsystem reads and executes them autonomously — the CPU is free to perform other work while the channel program runs.

11.10.3.1 CCW Format

/// Channel Command Word — a single I/O operation within a channel program.
///
/// CCWs are the s390x equivalent of scatter-gather DMA descriptors, but more powerful:
/// they support chaining (sequential execution), branching (TIC — Transfer In Channel),
/// mid-chain interrupts (PCI flag), and indirect data addressing (IDA/MIDA).
///
/// UmkaOS uses **Format 1 CCWs** exclusively (31-bit data addresses in a 32-bit
/// field). Format 0 CCWs (24-bit addresses) are a legacy artifact from ESA/390
/// and are not supported. For buffers above 2 GB, set the IDA flag and point
/// `data_addr` to an IDAL (Indirect Data Address List) of 64-bit entries.
///
/// Alignment: CCWs must be 8-byte aligned (architectural requirement). The channel
/// subsystem will program-check if it fetches a misaligned CCW.
#[repr(C, align(8))]
pub struct Ccw {
    /// Command code — identifies the I/O operation to perform.
    /// See the CCW command code table below. Device-specific command codes
    /// exist beyond the base set (e.g., DASD-specific seek, search, multi-track).
    pub cmd: u8,
    /// CCW flags — control chaining, data handling, and interrupt behavior.
    /// See the CCW flags table below.
    pub flags: u8,
    /// Byte count — number of bytes to transfer for this CCW.
    /// The channel subsystem decrements this as data is transferred. The residual
    /// count in the SCSW indicates how many bytes were NOT transferred.
    pub count: u16,
    /// Data address — 31-bit address in a 32-bit field. Bit 0 (MSB) must be
    /// zero; bits 29-31 must be zero (doubleword alignment for some commands).
    /// For read commands: destination where device data is stored.
    /// For write commands: source from which data is sent to the device.
    /// If the IDA flag is set, this points to an IDAL (Indirect Data Address
    /// List) of 64-bit entries instead of the data buffer directly.
    pub data_addr: u32,
}
/// 1(cmd) + 1(flags) + 2(count) + 4(data_addr) = 8 bytes
const_assert!(size_of::<Ccw>() == 8);

impl Ccw {
    /// Construct a CCW, validating the 31-bit data_addr constraint.
    ///
    /// Format 1 CCWs require `data_addr` bit 31 (the highest bit of the u32)
    /// to be zero — the address space is 31-bit (0-2 GB).  Buffers above
    /// 2 GB must use IDA (set `CcwFlags::IDA` and point `data_addr` to
    /// an IDAL of 64-bit entries instead).
    ///
    /// Returns `Err(KernelError::InvalidArgument)` if bit 31 is set and
    /// `CcwFlags::IDA` is not set.
    pub fn new(
        cmd: u8, flags: CcwFlags, count: u16, data_addr: u32,
    ) -> Result<Self, KernelError> {
        if data_addr & 0x8000_0000 != 0 && !flags.contains(CcwFlags::IDA) {
            return Err(KernelError::InvalidArgument);
        }
        // Store the raw bits: the struct field is the hardware-visible u8.
        Ok(Self { cmd, flags: flags.bits(), count, data_addr })
    }
}

11.10.3.2 CCW Command Codes

The base command codes are shared by all s390x device types. Device-specific commands (DASD seek, tape rewind, network-specific control) use the same CCW format with device-class-specific command code values.

Code Command Direction Description
0x02 READ Device to Memory Transfer data from the device into the buffer at data_addr.
0x01 WRITE Memory to Device Transfer data from the buffer at data_addr to the device.
0x03 CONTROL To Device Send a device-specific control command. data_addr points to the control block.
0x04 SENSE Device to Memory Read device status and sense data after a Unit Check condition.
0x08 TIC (Branch) Transfer In Channel: the channel subsystem jumps to the CCW at data_addr. No data transfer. count must be zero. The target CCW must not itself be a TIC (no TIC-to-TIC chaining).
0xE4 SENSE ID Device to Memory Read the device identification block (CU type/model, device type/model). Used during enumeration. Returns an 8-byte identification block.

11.10.3.3 CCW Flags

Flag Bit Mask Description
CC (Chain Command) 0x80 Continue execution with the next CCW in memory after this one completes. If not set, the channel program ends after this CCW.
SLI (Suppress Length Indication) 0x40 Do not report incorrect-length condition if the device transfers fewer bytes than count. Used when the exact transfer size is not known in advance.
SKIP 0x20 Suppress data transfer — the device sends/receives data but the channel subsystem does not store/fetch it in memory. Used for positioning (e.g., skipping tape records).
PCI (Program Controlled Interrupt) 0x10 Request an I/O interrupt when this CCW starts execution, before it completes. Enables mid-chain progress notification.
IDA (Indirect Data Addressing) 0x04 data_addr points to an IDAL (list of 4 KB-aligned page addresses) instead of a contiguous buffer. Required when the data buffer crosses page boundaries. Each IDAL entry is an 8-byte address pointing to one 4 KB page.
SUSPEND 0x02 Suspend channel program execution after this CCW. Resume with RSCH (Resume SubChannel). Used for flow control when the driver needs to inspect intermediate results before continuing.

11.10.3.4 CCW Chain Execution

The lifecycle of a channel program:

  1. Submission: The kernel constructs a CCW chain in physically contiguous memory (or uses IDA for discontiguous buffers). The first CCW address is stored in an Operation Request Block (ORB) and submitted via SSCH (Start SubChannel).
  2. Autonomous execution: The channel subsystem fetches CCWs from memory and executes them sequentially. Data transfers occur between device and main memory without CPU involvement. If the CC (Chain Command) flag is set, the next contiguous CCW is fetched; if TIC is encountered, execution jumps to the target CCW address.
  3. Completion: When the chain ends (no CC flag on the last CCW, or an error occurs), the channel subsystem stores final status in the SCSW and signals an I/O interrupt.
  4. Status retrieval: The kernel's I/O interrupt handler calls TSCH (Test SubChannel) to read the IRB (Interruption Response Block), which contains the SCSW with completion status and the residual byte count.

11.10.3.5 Error Handling

Channel program completion and error conditions are reported through the SCSW status bits and fall into the following categories:

  • Channel End + Device End (CE+DE): Normal completion. Both the channel path and the device have finished processing. The residual count indicates whether all bytes were transferred.
  • Unit Check (UC): The device detected an error. The driver must issue a SENSE CCW (command code 0x04) to read the device-specific sense data, which contains the error reason code. Sense data interpretation is device-class-specific (DASD sense differs from network sense differs from tape sense).
  • Channel/program errors: The channel subsystem itself detected an error (invalid CCW address, protection violation, data parity error). These are reported in the subchannel status byte of the SCSW. No SENSE CCW is needed — the error is in the channel program or the path, not the device.
  • Unit Exception (UE): End-of-medium (tape), end-of-file, or similar boundary condition. Not necessarily an error — the driver determines the appropriate response based on the device class.
  • Attention: Unsolicited status — the device is requesting service without a prior I/O command (e.g., a network device has incoming data, or a device was hot-plugged). The driver's interrupt handler processes attention conditions as asynchronous events.
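The classification above can be sketched as a status decoder. The device-status bit masks are the z/Architecture values; the enum and priority order are a sketch of one plausible handler structure (a real handler also inspects the subchannel-status byte for channel/program errors):

```rust
// Device-status byte masks from the SCSW (z/Architecture values).
const DS_ATTENTION: u8 = 0x80;
const DS_CHANNEL_END: u8 = 0x08;
const DS_DEVICE_END: u8 = 0x04;
const DS_UNIT_CHECK: u8 = 0x02;
const DS_UNIT_EXCEPTION: u8 = 0x01;
const DS_CE_DE: u8 = DS_CHANNEL_END | DS_DEVICE_END;

#[derive(Debug, PartialEq)]
enum IoOutcome {
    Complete,    // CE+DE: normal completion
    NeedsSense,  // UC: issue a SENSE CCW (0x04) for device-specific details
    Boundary,    // UE: end-of-medium / end-of-file
    Unsolicited, // Attention: asynchronous device request
    Pending,     // neither CE+DE nor an error condition yet
}

/// Map the SCSW device-status byte to the handling categories above.
fn classify(dev_status: u8) -> IoOutcome {
    if dev_status & DS_UNIT_CHECK != 0 {
        IoOutcome::NeedsSense
    } else if dev_status & DS_UNIT_EXCEPTION != 0 {
        IoOutcome::Boundary
    } else if dev_status & DS_ATTENTION != 0 {
        IoOutcome::Unsolicited
    } else if dev_status & DS_CE_DE == DS_CE_DE {
        IoOutcome::Complete
    } else {
        IoOutcome::Pending
    }
}
```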

11.10.4 Indirect Data Address Lists (IDAL)

When a data buffer is not physically contiguous (spans page boundaries), the IDA flag in the CCW directs the channel subsystem to use an IDAL — a list of physical page addresses — instead of a single contiguous buffer address.

/// Indirect Data Address List — enables scatter-gather for CCW data transfers.
///
/// Each entry is an 8-byte physical address pointing to a 4 KB page. The channel
/// subsystem processes entries sequentially, transferring data to/from each page
/// until the CCW byte count is exhausted.
///
/// The IDAL itself must be located in physically contiguous memory and must be
/// 8-byte aligned. The maximum number of entries is bounded by the CCW byte count
/// divided by 4096 (rounded up).
///
/// This serves the same purpose as scatter-gather lists on PCI architectures, but
/// is executed by the channel subsystem hardware rather than a device DMA engine.
#[repr(C, align(8))]
pub struct Idal {
    /// Page-aligned physical addresses. Each entry points to one 4 KB page.
    /// The number of valid entries is `ceil(ccw.count / 4096)`.
    /// Entries beyond that count are not accessed by the channel subsystem.
    pub entries: [u64; Self::MAX_ENTRIES],
}
// Idal: [u64; 16] = 128 bytes, align(8) satisfied.
const_assert!(core::mem::size_of::<Idal>() == 128);

impl Idal {
    /// Maximum IDAL entries. Supports transfers up to 64 KB per CCW
    /// (the architectural maximum for a single CCW byte count is 65535).
    pub const MAX_ENTRIES: usize = 16;
}

11.10.5 QDIO (Queued Direct I/O)

QDIO is the high-performance I/O path for modern s390x devices — primarily OSA-Express network adapters and ECKD/FCP DASD (storage). Instead of submitting individual CCW chains for each I/O operation (which incurs per-operation overhead from SSCH/TSCH instruction pairs), QDIO uses shared-memory queues with lightweight doorbell signaling.

QDIO is to s390x what virtqueue rings are to virtio: a shared-memory data plane with minimal CPU-to-device synchronization. The key difference is that QDIO is a hardware-level protocol implemented in the channel subsystem, not a software convention.

Tier M peer integration: For s390x Tier M peers (Section 11.1), QDIO queues serve as the ring pair backing — ClusterMessageHeader + payload are carried in SBAL entries instead of PCIe DomainRingBuffer entries. The peer protocol message format is identical; only the queue mechanics differ. The SIGA instruction serves as the doorbell. Tier M detection on s390x uses SENSE ID CU type 0x554D (reserved for UmkaOS peer devices). Standard non-Tier-M s390x devices (DASD, OSA) use the normal KABI driver path via CCW/QDIO.

11.10.5.1 QDIO Data Structures

/// Storage Block Address List — a buffer descriptor containing up to 16
/// scatter-gather entries for a single I/O operation.
///
/// QDIO queues are arrays of SBALs. Each SBAL describes one I/O operation
/// (one network packet, one disk block, etc.) with scatter-gather capability.
/// The 16-entry limit per SBAL is an architectural constant.
#[repr(C)]
pub struct Sbal {
    /// Scatter-gather entries for this buffer. Unused entries (beyond the
    /// last `SBAL_FLAG_LAST`-flagged entry) must be zeroed.
    pub entries: [SbalEntry; Self::ENTRIES_PER_SBAL],
}
// Sbal: [SbalEntry; 16]. SbalEntry alignment = 8 (u64 field),
// so SbalEntry = 24 bytes. Sbal = 24 × 16 = 384 bytes.
const_assert!(core::mem::size_of::<Sbal>() == 384);

impl Sbal {
    /// Number of scatter-gather entries per SBAL (architectural constant).
    pub const ENTRIES_PER_SBAL: usize = 16;
}

/// Single buffer entry within an SBAL — describes one contiguous memory region.
///
/// Analogous to a single entry in a PCI scatter-gather list. The `eflags` field
/// indicates whether this entry is the first, middle, or last in a multi-entry
/// buffer descriptor.
#[repr(C)]
pub struct SbalEntry {
    /// Entry flags (matches Linux `u8 eflags` in `struct qdio_buffer_element`):
    /// - Bits 6-7: Position indicator (00=only, 01=first, 10=middle, 11=last).
    /// - Bit 5: SBAL entry continuation — more entries follow for this buffer.
    /// Remaining bits are reserved (must be zero).
    pub eflags: u8,
    /// Reserved padding (3 bytes, matches Linux hardware struct layout).
    pub _reserved: [u8; 3],
    /// Explicit alignment padding: _reserved ends at offset 4; addr (u64)
    /// requires alignment 8. DMA'd to channel subsystem hardware ��� must be
    /// zeroed to avoid sending uninitialized bytes.
    pub _pad0: [u8; 4],
    /// Physical address of the data buffer for this entry.
    pub addr: u64,
    /// Length of the data buffer in bytes.
    pub length: u32,
    /// Explicit trailing padding: length ends at offset 20; struct alignment
    /// is 8 (from addr: u64), so size rounds to 24. Must be zeroed.
    pub _pad1: [u8; 4],
}
// SbalEntry: eflags(u8=1,off=0) + _reserved([u8;3]=3,off=1) + _pad0([u8;4]=4,off=4) +
//   addr(u64=8,off=8) + length(u32=4,off=16) + _pad1([u8;4]=4,off=20) = 24 bytes.
// All padding explicit. No implicit holes.
const_assert!(core::mem::size_of::<SbalEntry>() == 24);

/// QDIO queue — one direction of an I/O queue (input or output).
///
/// Each QDIO queue contains exactly 128 SBALs (architectural constant for z/Architecture
/// QDIO). The kernel and the device share the SBAL array; per-SBAL state bytes
/// indicate ownership (empty = kernel owns, primed = device may process,
/// active = device is processing, error = device encountered an error).
///
/// The queue operates as a circular buffer. The kernel tracks `first_to_check` as its
/// consumer index; the device tracks its own producer/consumer indices internally.
pub struct QdioQueue {
    /// The 128 SBALs that make up this queue. Shared between kernel and device.
    /// Must be page-aligned (allocated from physically contiguous memory).
    pub sbals: [Sbal; Self::QUEUE_DEPTH],
    /// Per-SBAL state indicators. Each byte represents the state of the
    /// corresponding SBAL:
    /// - `0x00` (EMPTY): owned by the kernel. May be filled with new data.
    /// - `0x01` (PRIMED): filled by the kernel, ready for device processing.
    /// - `0x02` (ACTIVE): being processed by the device.
    /// - `0x03` (ERROR): device encountered an error processing this SBAL.
    /// State transitions are atomic — the device and kernel never write the
    /// same state byte simultaneously (ownership protocol prevents races).
    pub sbal_state: [AtomicU8; Self::QUEUE_DEPTH],
    /// Consumer index — the first SBAL the kernel should check for completion
    /// on input queues, or the first SBAL the kernel should reclaim on output
    /// queues. Advanced by the kernel after processing completed SBALs.
    ///
    /// **Advance formula**: `first_to_check.store((old + count) % QUEUE_DEPTH, Release)`
    /// where `count` is the number of SBALs consumed in this batch.
    /// NOT a monotonic counter — it is a modular index in `[0, QUEUE_DEPTH)`.
    /// AtomicU32 is correct (modular arithmetic, not a generation counter).
    pub first_to_check: AtomicU32,
}

impl QdioQueue {
    /// Number of SBALs per queue (architectural constant on z/Architecture).
    pub const QUEUE_DEPTH: usize = 128;
}

11.10.5.2 QDIO Operation

  1. Establishment: The driver sets up QDIO queues via a CCW program that configures the device for QDIO mode. This involves a sequence of control CCWs specific to the device type (e.g., ESTABLISH QUEUES for OSA-Express, ENABLE QDIO for FCP). The CCW program provides the physical addresses of the SBAL arrays and state byte arrays to the device.
  2. Submission (output): The kernel fills one or more SBALs with outbound data (network packets, disk write blocks), sets their state to PRIMED (0x01), and issues SIGA-w (Signal Adapter — write) to notify the device that new SBALs are ready. SIGA is a privileged instruction that serves as a lightweight doorbell — it does not transfer data, only signals the device.
  3. Completion (input): The device fills SBALs with inbound data and sets their state to indicate completion. The device signals the kernel via an I/O interrupt (ISC-routed) or the kernel polls via SIGA-r (Signal Adapter — read). The kernel reads completed SBALs starting from first_to_check and advances the index.
  4. Three queue types: QDIO supports three queue directions — Input (device to host), Output (host to device), and Data (device-managed bidirectional). Most device types use one input and one output queue; FCP (Fibre Channel Protocol) uses the data queue type for direct I/O to storage.

11.10.5.3 QDIO Performance Properties

  • Batching: Multiple SBALs can be primed before a single SIGA-w, amortizing the instruction cost across multiple I/O operations. This is architecturally identical to the batched notification model used by virtio (VIRTQ_USED_F_NO_NOTIFY).
  • Polling mode: Under high I/O load, the kernel can suppress I/O interrupts and poll sbal_state directly, eliminating interrupt overhead. This is the s390x equivalent of NAPI polling on network devices.
  • No per-I/O privileged instructions: Once queues are established, the data plane requires only SIGA instructions (one per batch), not SSCH/TSCH pairs per operation. This reduces per-I/O overhead from ~1000 cycles (classical CCW) to ~100 cycles (QDIO with batching).
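The batching property can be made concrete with a small sketch. `siga_w` here is a hypothetical stand-in for the SIGA-w instruction that merely counts invocations, so the amortization (many SBALs, one doorbell) is visible; the state bytes follow the ownership protocol from `QdioQueue::sbal_state`:

```rust
const EMPTY: u8 = 0x00; // kernel-owned
const PRIMED: u8 = 0x01; // ready for device processing

/// Hypothetical doorbell stand-in for SIGA-w: counts invocations.
fn siga_w(doorbells: &mut u32) {
    *doorbells += 1;
}

/// Prime `count` SBALs starting at `first` (circular), then issue one
/// doorbell for the whole batch instead of one per SBAL.
fn submit_batch(states: &mut [u8], first: usize, count: usize, doorbells: &mut u32) {
    for i in 0..count {
        let slot = (first + i) % states.len(); // circular queue wrap
        assert_eq!(states[slot], EMPTY, "slot must be kernel-owned");
        states[slot] = PRIMED; // hand ownership to the device
    }
    siga_w(doorbells); // single SIGA-w amortized across the batch
}
```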

11.10.6 virtio-ccw Transport

VirtIO devices on s390x use the CCW transport (virtio-ccw) instead of the PCI or MMIO transports used on other architectures. The virtqueue ring format (split ring or packed ring) is identical across all three transports — only the transport layer differs.

11.10.6.1 Device Identification

/// Control Unit type identifying a virtio-ccw device.
///
/// During subchannel enumeration, the `SENSE ID` CCW returns a device identification
/// block. A CU type of `0x3832` indicates a virtio-ccw device. The CU model field
/// encodes the VirtIO device type (net=0x01, blk=0x02, console=0x03, etc.),
/// matching the VirtIO device type IDs from the VirtIO specification.
pub const VIRTIO_CCW_CU_TYPE: u16 = 0x3832;

11.10.6.2 Discovery and Initialization

virtio-ccw discovery is part of the standard subchannel enumeration:

  1. During the subchannel scan, SENSE ID returns CU type 0x3832 for virtio-ccw devices.
  2. The CU model field identifies the VirtIO device type (network, block, console, etc.).
  3. The virtio-ccw transport driver claims all devices with CU type 0x3832.

Queue setup is performed via CCW programs (not PCI configuration space access):

Step CCW Command Purpose
1 SET_VIRTIO_REV (cmd 0x83) Negotiate the virtio-ccw protocol revision (currently rev 1).
2 READ_FEAT (cmd 0x12) Read device feature bits from the device.
3 WRITE_FEAT (cmd 0x11) Write accepted feature bits back to the device.
4 SET_VQ (cmd 0x13) Configure a virtqueue: queue index, size, and physical addresses of descriptor/available/used ring areas. One SET_VQ CCW per virtqueue.
5 WRITE_STATUS (cmd 0x31) Set the VirtIO device status to DRIVER_OK, completing initialization.
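The five-step sequence can be expressed as a flat list of command codes. The `CCW_CMD_*` values below are the ones defined by the VirtIO specification's CCW transport; the data payloads for each CCW are elided in this sketch:

```rust
// virtio-ccw command codes per the VirtIO specification's CCW transport.
const CCW_CMD_SET_VIRTIO_REV: u8 = 0x83;
const CCW_CMD_READ_FEAT: u8 = 0x12;
const CCW_CMD_WRITE_FEAT: u8 = 0x11;
const CCW_CMD_SET_VQ: u8 = 0x13;
const CCW_CMD_WRITE_STATUS: u8 = 0x31;

/// Initialization order for a device with `nr_queues` virtqueues:
/// revision negotiation, feature read/write, one SET_VQ per queue,
/// then the final DRIVER_OK status write.
fn init_sequence(nr_queues: usize) -> Vec<u8> {
    let mut seq = vec![
        CCW_CMD_SET_VIRTIO_REV,
        CCW_CMD_READ_FEAT,
        CCW_CMD_WRITE_FEAT,
    ];
    seq.extend(std::iter::repeat(CCW_CMD_SET_VQ).take(nr_queues));
    seq.push(CCW_CMD_WRITE_STATUS);
    seq
}
```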

11.10.6.3 Notification Mechanism

Queue notifications — served on virtio-pci by notification-area writes (host-to-device) and MSI-X interrupts (device-to-host) — map to the SIGA instruction and subchannel I/O interrupts:

  • Host-to-device notification (SIGA-w): the kernel issues SIGA on the output virtqueue's subchannel to signal that new descriptors are available.
  • Device-to-host notification: the device delivers an I/O interrupt on the subchannel's ISC, which the kernel routes to the virtio interrupt handler.

11.10.6.4 Integration with Generic VirtIO Layer

The virtio-ccw transport maps directly to the generic VirtIO abstraction. The transport difference is encapsulated behind a VirtioTransport::Ccw variant — all virtio device drivers (virtio-net, virtio-blk, virtio-console, etc.) are transport-agnostic and work identically over CCW, PCI, and MMIO transports. See Section 11.3 for the VirtIO hosting model.

/// VirtIO transport variant for s390x channel I/O.
///
/// Implements the same `VirtioTransport` trait as `VirtioTransport::Pci` and
/// `VirtioTransport::Mmio`. The transport-specific operations (queue setup,
/// feature negotiation, notification) are implemented via CCW programs and
/// `SIGA` instructions instead of PCI config space and MSI-X.
pub struct VirtioTransportCcw {
    /// The subchannel hosting this virtio device.
    pub schid: SubchannelId,
    /// Cached SCHIB for this device.
    pub schib: Schib,
    /// Negotiated virtio-ccw protocol revision.
    pub revision: u8,
}

11.10.7 Integration with UmkaOS Device Registry

The channel I/O subsystem plugs into the generic device registry (Section 11.4) through s390x-specific bus identity and resource descriptors.

11.10.7.1 Bus Identity

/// Bus identity variant for s390x channel I/O devices.
///
/// Added to the `BusIdentity` enum alongside `Pci`, `Platform`, and `Dt` variants.
/// The triple `(css_id, ssid, devno)` uniquely identifies a device within the s390x
/// channel subsystem, serving the same role as PCI BDF on other architectures.
pub struct ChannelIoBus {
    /// Channel Subsystem Image identifier.
    pub css_id: u8,
    /// Subchannel Set identifier.
    pub ssid: u8,
    /// Device number.
    pub devno: u16,
}

11.10.7.2 Device Resources

/// Channel I/O resources — s390x-specific resource descriptor.
///
/// Added to `DeviceResources` alongside `PciResources` and `PlatformResources`.
/// Contains all information needed by a driver to operate the device: the cached
/// SCHIB, active channel paths, device identification from `SENSE ID`, and QDIO
/// queue configuration (if the device supports QDIO).
pub struct ChannelIoResources {
    /// Cached SCHIB from the most recent `STSCH`. Re-read via `STSCH` when:
    /// (a) a Channel Report Word (CRW) indicates a path state change,
    /// (b) the driver issues `MSCH` (Modify Subchannel) to change path masks,
    /// (c) I/O interrupt status indicates path-not-operational (PNO).
    /// Used for diagnostics and path selection; the device always consults
    /// the hardware SCHIB for I/O routing decisions.
    pub schib: Schib,
    /// Active channel path IDs. Populated from `PMCW.pim & PMCW.pam` — paths that
    /// are both installed and available. Maximum 8 paths per subchannel.
    pub chpids: ArrayVec<u8, 8>,
    /// Control Unit type from `SENSE ID`. Identifies the device class
    /// (e.g., 0x3832 = virtio-ccw, 0x3990 = ECKD DASD, 0x1731 = OSA-Express).
    pub cu_type: u16,
    /// Control Unit model from `SENSE ID`. Further identifies the device variant
    /// within a CU type (e.g., for virtio-ccw, the model encodes the VirtIO device type).
    pub cu_model: u16,
    /// Device type from `SENSE ID`. Some devices report a separate device type
    /// distinct from the CU type (e.g., DASD 3390, tape 3590).
    pub dev_type: u16,
    /// Device model from `SENSE ID`. Further narrows the device variant.
    pub dev_model: u16,
    /// QDIO queue configuration, if the device supports QDIO.
    /// `None` for classical CCW-only devices (e.g., tape drives, unit-record devices).
    /// `Some(...)` for QDIO-capable devices (OSA-Express, FCP, virtio with QDIO).
    pub qdio_setup: Option<QdioSetup>,
}

/// QDIO queue setup parameters — describes the queue topology for a QDIO-capable device.
///
/// Populated during device initialization when the driver establishes QDIO queues.
/// The kernel uses this to allocate SBAL arrays and state byte arrays in physically
/// contiguous memory.
pub struct QdioSetup {
    /// Number of input queues (device-to-host). Typically 1 for network devices,
    /// 0 for output-only devices.
    pub nr_input_queues: u8,
    /// Number of output queues (host-to-device). Typically 1 for network devices,
    /// 1 for FCP (block) devices.
    pub nr_output_queues: u8,
    /// Number of data queues (device-managed bidirectional). Used by FCP for
    /// direct I/O. Zero for most device types.
    pub nr_data_queues: u8,
}

11.10.7.3 Bus Enumerator

The s390x bus enumerator implements the generic BusEnumerator trait:

  1. Boot scan: The subchannel enumeration procedure (described above) creates a DeviceNode for each enabled subchannel.
  2. Bus identity: DeviceNode.bus = BusIdentity::ChannelIo { css_id, ssid, devno }.
  3. Resources: DeviceNode.resources includes ChannelIoResources with the SCHIB, CHPID list, and SENSE ID results.
  4. Driver matching: Drivers declare their supported (cu_type, cu_model, dev_type, dev_model) tuples in their manifest. The registry matches discovered devices to drivers by these tuples — analogous to PCI (vendor_id, device_id) matching on other architectures.
  5. Hotplug: CHP vary-on/vary-off events trigger a rescan of affected subchannels. New devices are registered; removed devices are unregistered through the standard device lifecycle (Section 11.4).
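Step 4's tuple matching can be sketched as follows. The `Option`-as-wildcard convention is an assumption of this sketch, mirroring how `PCI_ANY_ID` works in PCI device tables; the manifest format itself is defined elsewhere in this chapter:

```rust
/// One match entry from a driver manifest. `None` = wildcard,
/// analogous to PCI_ANY_ID in PCI match tables.
struct CcwMatch {
    cu_type: u16,
    cu_model: Option<u16>,
    dev_type: Option<u16>,
}

/// Does a discovered device's SENSE ID data match this manifest entry?
fn matches(m: &CcwMatch, cu_type: u16, cu_model: u16, dev_type: u16) -> bool {
    m.cu_type == cu_type
        && m.cu_model.map_or(true, |v| v == cu_model)
        && m.dev_type.map_or(true, |v| v == dev_type)
}
```

A virtio-ccw transport driver, for example, would declare `cu_type: 0x3832` with wildcard model fields to claim every virtio-ccw device regardless of VirtIO device type.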

11.10.8 Protection and Isolation

Channel I/O provides intrinsic device isolation through the channel subsystem hardware — a property that no other UmkaOS-supported bus architecture offers natively.

11.10.8.1 Subchannel Isolation

Each subchannel is an architecturally independent I/O path:

  • Memory protection: A channel program submitted to subchannel A cannot access memory allocated to subchannel B. The channel subsystem enforces memory boundaries per subchannel — the CCW data_addr and byte count define the accessible range, and the channel subsystem validates these addresses against the subchannel's configured limits (PMCW Limit Mode).
  • Subchannel enablement: Only the kernel (supervisor state) can enable or disable subchannels via MSCH. A malfunctioning or compromised driver in Tier 2 cannot enable additional subchannels to access other devices.
  • I/O authorization keys: The PMCW contains a key field that restricts which storage protection keys may be used for CCW data transfers. This prevents a driver from using channel programs to read or write memory outside its allocated protection domain.

11.10.8.2 Tier Integration

The channel subsystem's intrinsic isolation maps to UmkaOS driver tiers as follows:

Tier Channel I/O Behavior
Tier 0 (in-kernel) Direct access to SSCH/TSCH/MSCH instructions. The driver constructs CCW programs and submits them directly. No mediation overhead.
Tier 1 (domain-isolated) Tier 1 is not available on s390x (Storage Keys are too coarse for domain isolation). Drivers choose Tier 0 or Tier 2 based on licensing and admin policy. See Section 11.3.
Tier 2 (Ring 3, process-isolated) The kernel mediates all I/O operations. The Tier 2 driver submits I/O requests via IPC (Section 11.8); the kernel validates the CCW program (checks data_addr ranges, command codes, and flags) before issuing SSCH on the driver's behalf. The subchannel's I/O authorization key is set to match the kernel's protection domain, preventing the driver from bypassing mediation.
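The Tier 2 mediation step can be sketched as a validator the kernel runs over each submitted CCW before issuing SSCH on the driver's behalf. The allow-list and the DMA-window bounds are illustrative; the real validator also checks flags, IDAL contents, and TIC targets:

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Ok,
    BadCommand, // command code not on the driver's allow-list
    OutOfRange, // transfer touches memory outside the driver's DMA window
}

/// Validate one CCW from a Tier 2 driver's program against the driver's
/// DMA window [win_start, win_start + win_len) and allowed command codes.
fn validate_ccw(
    cmd: u8,
    data_addr: u32,
    count: u16,
    win_start: u32,
    win_len: u32,
    allowed: &[u8],
) -> Verdict {
    if !allowed.contains(&cmd) {
        return Verdict::BadCommand;
    }
    // Compute the exclusive end in u64 to avoid u32 overflow at the 2 GB edge.
    let end = data_addr as u64 + count as u64;
    if data_addr < win_start || end > win_start as u64 + win_len as u64 {
        return Verdict::OutOfRange;
    }
    Verdict::Ok
}
```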

11.10.8.3 IOMMU Relationship

s390x does not use an IOMMU for channel I/O devices. The channel subsystem is its own address translation and protection mechanism — CCW data addresses are physical addresses validated by the channel subsystem hardware. For virtio-ccw devices, the virtio-ccw protocol layer handles address translation between the guest and the virtual device. The DmaDevice trait from Section 4.14 applies to virtio-ccw devices via the standard virtio DMA mapping path; native channel I/O devices use the CCW program model directly and do not participate in the DMA subsystem.

11.10.9 Cross-References

11.11 D-Bus Bridge Service

11.11.1 Motivation

11.11.1.1 The Problem: Linux Has No Unified Driver Management Interface

Every Linux subsystem invents its own way for userspace to manage hardware:

  • Block devices: sysfs text files (/sys/block/sda/queue/scheduler), ioctl (BLKRRPART), udev properties, smartmontools passthrough. A desktop disk manager must combine all four to show one drive's status.
  • Networking: three separate IPC mechanisms — netlink (rtnetlink for routes, nl80211 for WiFi, genetlink for ethtool), sysfs (/sys/class/net/*/), and legacy ioctl (SIOCGIFFLAGS). NetworkManager implements parsers for all three.
  • Storage health: S.M.A.R.T. data requires a root-privilege passthrough ioctl (SG_IO or HDIO_DRIVE_CMD) plus a polling daemon (smartd) that logs to syslog. No structured event when a drive starts failing.
  • iSCSI: a three-way split across iscsiadm (CLI tool), iscsid (userspace daemon), and kernel modules (iscsi_tcp, libiscsi). Session management requires coordinating all three.
  • Bluetooth, WiFi, firmware updates: each runs its own userspace daemon (BlueZ, wpa_supplicant/iwd, fwupd) with its own D-Bus API, its own crash behavior, its own capability model, and its own restart logic.

The result: desktop environments (GNOME, KDE) and system management tools contain thousands of lines of per-subsystem glue code — sysfs parsers, netlink decoders, ioctl wrappers, daemon health monitors — to present a unified view of hardware that the kernel itself does not provide.

11.11.1.2 Two Problems, One Solution

UmkaOS's three-tier driver model creates two related needs:

1. Daemon replacement compatibility. Several high-value Tier 2 driver candidates — Bluetooth (BlueZ), WiFi control plane (wpa_supplicant/iwd), firmware update (fwupd), disk management (udisks2) — currently run as userspace daemons that expose D-Bus interfaces. Moving their hardware-facing logic into Tier 2 drivers gains KABI ring zero-copy I/O, IOMMU-fenced DMA, kernel-managed crash recovery (~10ms restart), and capability-based access control. But D-Bus is a userspace IPC protocol — applications (GNOME Settings, NetworkManager, GNOME Software) talk to these services via D-Bus method calls, property reads, and signal subscriptions. If the service moves into a Tier 2 driver, the D-Bus interface must be preserved. Applications cannot be modified.

2. Management plane for all drivers. Drivers that never had D-Bus interfaces — NVMe, SATA, RAID, network NICs, thermal zones, power supplies — currently expose management state through fragmented sysfs/procfs/ioctl interfaces. UmkaOS can do better: give every driver a structured, typed, event-driven management interface that desktop apps can consume without per-subsystem parsers. D-Bus is the natural protocol for this — it already has introspection, signal subscription, property change notification, and widespread desktop toolkit support (GDBus, sd-bus, QtDBus).

11.11.1.3 Why a Generic Bridge, Not Per-Driver Shims

A per-driver D-Bus shim approach would require writing N custom translation processes, each understanding the specific D-Bus interface and mapping it to KABI ring commands. This is fragile, duplicates boilerplate, and creates N additional failure points.

The D-Bus Bridge is a single, generic Tier 2 component that provides D-Bus interface exposure for any driver (Tier 0, Tier 1, or Tier 2), driven entirely by declarative schema in the driver's KABI manifest. No per-driver code is needed in the bridge. A driver declares its D-Bus interfaces, and the bridge handles marshaling, capability checks, crash recovery, object path management, and signal delivery — all from the schema.

11.11.2 Architecture

                    System D-Bus / Session D-Bus
                    ┌─────────┴──────────┐
                    │   D-Bus Bridge     │  ← Tier 2 driver (Ring 3)
                    │   (generic, one    │     Exposes D-Bus interfaces
                    │    instance per    │     declared by any driver
                    │    bus type)       │     (Tier 0, 1, or 2)
                    └──┬──────┬─────┬────┘
                       │      │     │
                  KABI ring  ring  ring     ← Inter-driver ring channels
                       │      │     │         (via kernel ring broker)
                    ┌──┴──┐ ┌─┴──┐ ┌┴────┐
                    │ BT  │ │WiFi│ │ GPU │  ← Any tier; bridge is
                    │ T2  │ │T0* │ │ T1  │      tier-agnostic
                    └─────┘ └────┘ └─────┘
                    * T0 on RISC-V/s390x/LoongArch (no Tier 1 HW)

The D-Bus Bridge is itself a Tier 2 driver. It does not access hardware. Its role is protocol translation: D-Bus wire format ↔ KABI ring typed messages.

Tier-agnostic: The bridge communicates with connected drivers via KABI ring channels, which work across any tier boundary — Tier 0 ↔ Tier 2, Tier 1 ↔ Tier 2, and Tier 2 ↔ Tier 2. The bridge does not know or care what tier a connected driver runs at — it only sees the ring and the DbusSchemaTable.

This means drivers at any tier can expose D-Bus interfaces:

  • Tier 2 drivers (BlueZ, fwupd, iSCSI) — the primary use case. Ring pair between two Ring 3 processes, brokered by the kernel.
  • Tier 1 drivers (GPU display, NIC management) — data path runs in Ring 0 at Tier 1 speed; management plane exposed via D-Bus through a Tier 1 ↔ Tier 2 ring channel. The driver declares both the hardware vtable and the dbus_interface blocks in a single .kabi manifest.
  • Tier 0 drivers — on architectures without Tier 1 hardware isolation (RISC-V, s390x, LoongArch64), drivers that would request Tier 1 elsewhere run as Tier 0 (if permitted by licensing and admin policy) or Tier 2. Tier 0 drivers can still declare dbus_interface blocks. The kernel creates a Tier 0 ↔ Tier 2 ring channel — the same mechanism used for any Tier 2 driver calling kernel services via KernelServicesVTable, just in the reverse direction (kernel-side driver pushes D-Bus events to the Ring 3 bridge). This ensures that D-Bus management interfaces work identically on all eight architectures, regardless of which tier the backing driver actually runs at. A WiFi control plane driver exposes net.connman.iwd.* via D-Bus whether it runs as Tier 1 on x86-64 (MPK-isolated) or Tier 0 on RISC-V (in-kernel). Desktop applications see the same D-Bus interface on both platforms.

Runtime tier changes: If an administrator changes a driver's tier assignment at runtime — e.g., promoting a Tier 1 driver to Tier 0 for performance via operator policy in /etc/umka/driver-policy.d/ — the kernel reloads the driver at the new tier and creates a new ring pair. The bridge detects the ring disconnect, reconnects to the new ring endpoint, and resumes operation. The DbusSchemaTable is unchanged (it is a property of the driver binary, not its tier). D-Bus clients see a brief latency blip identical to a driver restart — no service name change, no interface change, no application-visible difference. The tier is an operational deployment decision; the D-Bus management interface is a driver identity that survives tier changes.

Inter-driver ring channels: When a driver declares D-Bus interfaces in its manifest, the kernel's driver loader creates a dedicated ring pair between the bridge and that driver, regardless of the driver's tier. The ring broker ensures capability-gated access — the bridge can only reach drivers that explicitly declared D-Bus exposure. See Section 11.8 for the ring buffer infrastructure.
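Conceptually, the bridge's schema-driven dispatch reduces to a lookup from (interface, method) to a ring message index, gated by the declared permission. The types below are hypothetical stand-ins — the real `DbusSchemaTable` layout is generated by `kabi-gen` — but the control flow is the one described above:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Perm {
    Read,
    Write,
    Admin,
}

/// One method entry as the bridge sees it after parsing the schema table
/// (hypothetical layout for illustration).
struct MethodEntry {
    interface: &'static str,
    method: &'static str,
    ring_msg_id: u16, // index into the driver's generated ring message layout
    perm: Perm,       // from the @perm annotation in the manifest
}

/// Resolve a D-Bus call to a ring message id, enforcing the declared @perm.
/// Returns None if the method is unknown or the caller lacks permission.
fn dispatch(table: &[MethodEntry], iface: &str, method: &str, caller: Perm) -> Option<u16> {
    let e = table
        .iter()
        .find(|e| e.interface == iface && e.method == method)?;
    let allowed = match e.perm {
        Perm::Read => true,                  // any caller may invoke read-level methods
        Perm::Write => caller != Perm::Read, // Write or Admin
        Perm::Admin => caller == Perm::Admin,
    };
    if allowed {
        Some(e.ring_msg_id)
    } else {
        None
    }
}
```

No per-driver code appears anywhere in this path: the table rows come from the manifest, and the marshaling on either side of the ring is generated.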

11.11.3 D-Bus Interface Schema in KABI Manifest

Each driver that wants to expose D-Bus interfaces declares them in its KABI manifest (.kabi IDL file). The kabi-gen tool validates the schema at compile time and generates the ring message layout for the bridge.

11.11.3.1 Schema Declaration Syntax

// In bluetooth_hci.kabi:

// D-Bus interface declarations — parsed by kabi-gen, embedded in
// .kabi_manifest ELF section as a DbusSchemaTable.
// The D-Bus bridge reads this table at driver registration time.

dbus_interface "org.bluez.Adapter1" {
    object_pattern = "/org/bluez/hci{N}"

    @dbus_method("StartDiscovery")
    @perm(WRITE)
    fn adapter_start_discovery(adapter_index: u32) -> Result<(), ErrCode>;

    @dbus_method("StopDiscovery")
    @perm(WRITE)
    fn adapter_stop_discovery(adapter_index: u32) -> Result<(), ErrCode>;

    @dbus_method("RemoveDevice")
    @perm(ADMIN)
    fn adapter_remove_device(
        adapter_index: u32,
        device_path: [u8; 64],       // Object path, NUL-terminated
    ) -> Result<(), ErrCode>;

    @dbus_property("Address", read)
    fn adapter_get_address(adapter_index: u32) -> [u8; 18];  // "XX:XX:XX:XX:XX:XX\0"

    @dbus_property("Powered", readwrite)
    fn adapter_get_powered(adapter_index: u32) -> bool;
    fn adapter_set_powered(adapter_index: u32, value: bool) -> Result<(), ErrCode>;

    @dbus_property("Discovering", read)
    fn adapter_get_discovering(adapter_index: u32) -> bool;

    @dbus_signal("DeviceFound")
    struct DeviceFoundSignal {
        adapter_index: u32,
        device_path: [u8; 64],
        address: [u8; 18],
        rssi: i16,
        _pad: [u8; 2],
    }
}

dbus_interface "org.bluez.Device1" {
    object_pattern = "/org/bluez/hci{N}/dev_{ADDR}"

    @dbus_method("Connect")
    @perm(WRITE)
    fn device_connect(
        adapter_index: u32,
        device_addr: [u8; 18],
    ) -> Result<(), ErrCode>;

    @dbus_method("Disconnect")
    @perm(WRITE)
    fn device_disconnect(
        adapter_index: u32,
        device_addr: [u8; 18],
    ) -> Result<(), ErrCode>;

    @dbus_property("Connected", read)
    fn device_get_connected(adapter_index: u32, device_addr: [u8; 18]) -> bool;

    @dbus_property("Name", read)
    fn device_get_name(adapter_index: u32, device_addr: [u8; 18]) -> [u8; 248];
}

11.11.3.2 Schema Compilation

kabi-gen processes dbus_interface blocks and produces:

  1. Ring message types — one #[repr(C)] struct per method/property/signal, with a discriminant tag for dispatch. These are the payloads exchanged on the bridge ↔ driver ring channel.

  2. DbusSchemaTable — a compiled binary table embedded in the .kabi_manifest ELF section alongside the KabiDriverManifest. Contains interface names, object path patterns, method/property/signal descriptors with D-Bus type signatures and ring message ordinals.

  3. Driver-side dispatch stub — generated code that receives ring messages from the bridge and calls the driver's implementation functions. The driver author implements the fn adapter_start_discovery(...) etc. functions; the generated stub handles ring deserialization and response.
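The relationship between these outputs can be sketched as follows — a hypothetical, hand-written approximation of what kabi-gen emits for the Adapter1 methods. Names, the trait shape, and the error codes are illustrative, not the actual generated code:

```rust
// Sketch of kabi-gen output for "org.bluez.Adapter1" — illustrative only.

/// Discriminant tags mirror the ring_ordinal values assigned by kabi-gen.
#[repr(u16)]
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum AdapterRingOrdinal {
    StartDiscovery = 1,
    StopDiscovery = 2,
}

/// Ring payload for StartDiscovery — fixed layout, one struct per method.
#[repr(C)]
pub struct StartDiscoveryMsg {
    pub ordinal: u16, // = AdapterRingOrdinal::StartDiscovery as u16
    pub _pad: u16,
    pub adapter_index: u32,
}

/// Driver-side trait the author implements; the generated stub calls into it.
pub trait Adapter1 {
    fn adapter_start_discovery(&mut self, adapter_index: u32) -> Result<(), u32>;
    fn adapter_stop_discovery(&mut self, adapter_index: u32) -> Result<(), u32>;
}

/// Generated dispatch stub: ring ordinal tag → implementation call.
pub fn dispatch(drv: &mut impl Adapter1, ordinal: u16, adapter_index: u32) -> Result<(), u32> {
    match ordinal {
        x if x == AdapterRingOrdinal::StartDiscovery as u16 => drv.adapter_start_discovery(adapter_index),
        x if x == AdapterRingOrdinal::StopDiscovery as u16 => drv.adapter_stop_discovery(adapter_index),
        _ => Err(u32::MAX), // unknown ordinal → protocol error
    }
}
```

The driver author only supplies the `impl Adapter1`; deserialization and the ordinal match stay in generated code.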

/// Compiled D-Bus schema table. Embedded in .kabi_manifest.
/// Read by the D-Bus bridge at driver registration time.
#[repr(C)]
pub struct DbusSchemaTable {
    /// Magic: 0x44425553 ("DBUS") — identifies a valid D-Bus schema.
    pub magic: u32,
    /// Schema version (currently 1).
    pub version: u32,
    /// Number of D-Bus interfaces declared by this driver.
    pub interface_count: u16,
    /// Number of total methods + properties + signals across all interfaces.
    pub entry_count: u16,
    /// Offset (bytes from table start) to the interface descriptor array.
    pub interfaces_offset: u32,
    /// Offset to the entry descriptor array (methods, properties, signals).
    pub entries_offset: u32,
    /// Offset to the string table (interface names, method names, type sigs).
    pub strings_offset: u32,
    /// Total table size in bytes.
    pub total_size: u32,
}
// DbusSchemaTable: magic(4) + version(4) + interface_count(2) + entry_count(2) +
//   interfaces_offset(4) + entries_offset(4) + strings_offset(4) + total_size(4) = 28 bytes.
const_assert!(size_of::<DbusSchemaTable>() == 28);

/// One D-Bus interface declared by the driver.
#[repr(C)]
pub struct DbusInterfaceDesc {
    /// Offset into string table: interface name (e.g., "org.bluez.Adapter1").
    pub name_offset: u32,
    /// Offset into string table: object path pattern (e.g., "/org/bluez/hci{N}").
    pub object_pattern_offset: u32,
    /// Index of first entry (method/property/signal) for this interface.
    pub first_entry_index: u16,
    /// Number of entries belonging to this interface.
    pub entry_count: u16,
    /// D-Bus bus type: 0 = system bus, 1 = session bus.
    pub bus_type: u8,
    pub _pad: [u8; 3],
}
// DbusInterfaceDesc: name_offset(4) + object_pattern_offset(4) + first_entry_index(2) +
//   entry_count(2) + bus_type(1) + _pad(3) = 16 bytes.
const_assert!(size_of::<DbusInterfaceDesc>() == 16);

/// One method, property, or signal within a D-Bus interface.
#[repr(C)]
pub struct DbusEntryDesc {
    /// Offset into string table: D-Bus member name (e.g., "StartDiscovery").
    pub name_offset: u32,
    /// Offset into string table: D-Bus type signature for arguments
    /// (e.g., "" for no args, "s" for one string, "a{sv}" for dict).
    pub in_sig_offset: u32,
    /// Offset into string table: D-Bus type signature for return value.
    pub out_sig_offset: u32,
    /// Ring message ordinal — the discriminant tag used on the bridge ↔ driver
    /// ring channel. Assigned sequentially by kabi-gen starting from 1.
    pub ring_ordinal: u16,
    /// Entry kind: 0 = method, 1 = read-only property, 2 = readwrite property,
    /// 3 = signal (driver → bridge direction).
    pub kind: u8,
    /// Required PermissionBits (from @perm annotation). The bridge checks
    /// the caller's capability before forwarding.
    pub required_perm: u8,
}
// DbusEntryDesc: name_offset(4) + in_sig_offset(4) + out_sig_offset(4) +
//   ring_ordinal(2) + kind(1) + required_perm(1) = 16 bytes.
const_assert!(size_of::<DbusEntryDesc>() == 16);
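All three descriptor types refer to strings by offset into the shared string table. A minimal sketch of how the bridge might resolve such an offset, assuming NUL-terminated UTF-8 entries (the function name is illustrative):

```rust
/// Resolve a NUL-terminated string from the schema string table.
/// `offset` comes from a descriptor field (name_offset, in_sig_offset, ...).
/// Returns None if the offset is out of bounds, the string is unterminated,
/// or it is not valid UTF-8 — the bridge rejects such tables outright.
pub fn strtab_get(strings: &[u8], offset: u32) -> Option<&str> {
    let start = offset as usize;
    if start >= strings.len() {
        return None;
    }
    let rest = &strings[start..];
    let nul = rest.iter().position(|&b| b == 0)?;
    core::str::from_utf8(&rest[..nul]).ok()
}
```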

11.11.4 D-Bus Type Mapping

The bridge translates between D-Bus wire types and KABI ring field types. All conversions are lossless and reversible. The bridge performs no semantic interpretation — it is a pure type transducer.

D-Bus type   D-Bus sig  KABI ring type                  Notes
BYTE         y          u8
BOOLEAN      b          u8 (0/1)                        D-Bus uses 4-byte boolean; ring uses 1-byte
INT16        n          i16
UINT16       q          u16
INT32        i          i32
UINT32       u          u32
INT64        x          i64
UINT64       t          u64
DOUBLE       d          f64
STRING       s          [u8; N] NUL-terminated          N from schema; max 256 bytes
OBJECT_PATH  o          [u8; N] NUL-terminated          N from schema; max 256 bytes
SIGNATURE    g          [u8; N] NUL-terminated          N from schema; max 256 bytes
UNIX_FD      h          u32 (fd number)                 Bridge transfers fd via SCM_RIGHTS internally
ARRAY        a...       [T; MAX] + count: u16           Fixed max from schema; actual count in header
DICT_ENTRY   {kv}       Struct with key + value fields  kabi-gen expands known dict shapes
VARIANT      v          Tagged union                    Schema must enumerate allowed variants

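Two representative conversions from the table above can be sketched in a few lines — the 4-byte-to-1-byte BOOLEAN narrowing and the bounded, NUL-terminated STRING copy. Function names are illustrative, and the unit error type stands in for the bridge's real error handling:

```rust
/// D-Bus BOOLEAN is a 32-bit value on the wire; the ring uses one byte.
/// Values other than 0/1 are a protocol violation and are rejected.
pub fn marshal_bool(wire: u32) -> Result<u8, ()> {
    match wire {
        0 => Ok(0),
        1 => Ok(1),
        _ => Err(()),
    }
}

/// Copy a D-Bus string into a fixed-size, NUL-terminated ring field.
/// Fails (LimitsExceeded on the bus side) if it cannot fit with its NUL.
pub fn marshal_str<const N: usize>(s: &str) -> Result<[u8; N], ()> {
    let b = s.as_bytes();
    if b.len() + 1 > N {
        return Err(());
    }
    let mut out = [0u8; N]; // zero-filled → trailing NUL guaranteed
    out[..b.len()].copy_from_slice(b);
    Ok(out)
}
```

Both conversions are lossless in the accepted range, so the reverse direction (ring → D-Bus) is a straightforward widening.
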
Design constraint: D-Bus VARIANT (v) and unbounded ARRAYs are inherently dynamic-sized. KABI ring messages are fixed-layout (#[repr(C)]). The schema must constrain these:

  • VARIANTs: Each property/method that uses a{sv} must declare the known keys and their types in the schema. The bridge rejects unknown keys with org.freedesktop.DBus.Error.InvalidArgs. This is acceptable because the set of properties on a hardware interface is fixed at compile time.

  • Unbounded arrays: The schema declares a maximum element count. If a D-Bus caller sends more elements, the bridge returns org.freedesktop.DBus.Error.LimitsExceeded.

  • Maximum message size: The bridge rejects any D-Bus message exceeding DBUS_MAX_MESSAGE_SIZE (default: 16 MiB, matching systemd's sd-bus limit; well below the D-Bus spec limit of 128 MiB). Messages exceeding this limit are dropped before schema validation with org.freedesktop.DBus.Error.LimitsExceeded.

11.11.4.1 Variant Nesting Depth Limit

D-Bus VARIANT types can nest recursively (a variant containing a variant containing a variant...). Without a depth limit, a malicious or buggy client could send a deeply nested message that causes unbounded stack consumption during schema validation. The bridge enforces a compile-time nesting limit:

/// Maximum nesting depth for D-Bus variant types in schema-validated messages.
/// Deeper nesting is rejected during schema validation before any allocation.
/// Matches sd-bus default limit (sd_bus_message_enter_container depth).
pub const DBUS_VARIANT_MAX_DEPTH: u8 = 32;

The schema validation algorithm performs depth-first traversal of the incoming D-Bus message type tree with a stack-local depth counter (initialized to 0). Each time the validator enters a container type (VARIANT v, ARRAY a, STRUCT (...), or DICT_ENTRY {...}), the counter is incremented. If the counter exceeds DBUS_VARIANT_MAX_DEPTH, the message is rejected immediately with org.freedesktop.DBus.Error.InvalidArgs — no further parsing or allocation occurs. This check runs before any ring buffer write, ensuring that malformed messages cannot consume driver-side resources.

The depth counter is decremented on container exit, so sibling containers at the same depth level do not accumulate against the limit. Only true nesting depth (ancestor chain length) is measured.
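The depth rule can be illustrated with a simplified validator that walks a D-Bus type signature rather than a full parsed message tree (the real validator operates on the tree; this sketch only bounds the ancestor-chain depth encoded in the signature itself, and re-declares the limit constant so it is self-contained):

```rust
pub const DBUS_VARIANT_MAX_DEPTH: u8 = 32; // re-declared for the sketch

/// Compute the maximum container nesting depth of a D-Bus type signature.
/// 'a' (ARRAY), '(' (STRUCT), '{' (DICT_ENTRY) enter a container; a basic
/// type or a closing delimiter completes any pending arrays above it.
/// Returns None for an unbalanced signature.
pub fn signature_depth(sig: &str) -> Option<u32> {
    let mut stack: Vec<char> = Vec::new();
    let mut max_seen = 0u32;
    for c in sig.chars() {
        match c {
            'a' | '(' | '{' => {
                stack.push(c);
                max_seen = max_seen.max(stack.len() as u32);
            }
            ')' => {
                if stack.pop() != Some('(') { return None; }
                while stack.last() == Some(&'a') { stack.pop(); } // struct completes arrays
            }
            '}' => {
                if stack.pop() != Some('{') { return None; }
                while stack.last() == Some(&'a') { stack.pop(); }
            }
            _ => {
                // A basic type (i, s, v, ...) completes any pending arrays.
                while stack.last() == Some(&'a') { stack.pop(); }
            }
        }
    }
    if stack.is_empty() { Some(max_seen) } else { None }
}

/// Accept only signatures within the nesting limit.
pub fn signature_depth_ok(sig: &str) -> bool {
    matches!(signature_depth(sig), Some(d) if d <= DBUS_VARIANT_MAX_DEPTH as u32)
}
```

Note that sibling containers do not accumulate: "aiai" has depth 1, while "aai" has depth 2, matching the ancestor-chain rule above.
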

Phase: Phase 3.

11.11.5 Message Flow

11.11.5.1 Method Call (D-Bus Client → Tier 2 Driver)

1. Desktop app calls org.bluez.Adapter1.StartDiscovery()
   via D-Bus system bus (libdbus / sd-bus / GDBus).

2. D-Bus daemon routes the message to the bridge's well-known name
   (org.bluez on system bus — registered by the bridge based on the
   BT driver's schema).

3. Bridge receives D-Bus method call:
   a. Extracts interface name + method name + arguments
   b. Looks up DbusEntryDesc by (interface, method)
   c. Checks caller's capability (via SO_PEERCRED → pid → capability check)
   d. Marshals arguments from D-Bus wire format to KABI ring struct
   e. Writes ring message with ring_ordinal tag to the BT driver's ring

4. Kernel delivers ring message to BT driver (Tier 2 process).

5. BT driver's generated dispatch stub:
   a. Reads ring_ordinal tag
   b. Deserializes arguments
   c. Calls driver's adapter_start_discovery() implementation
   d. Writes response (success or error) to completion ring

6. Bridge receives completion:
   a. Marshals response from KABI ring struct to D-Bus wire format
   b. Sends D-Bus method return (or error) to the original caller
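Step 3b — resolving (interface, method) to a DbusEntryDesc — amounts to a per-driver dispatch table built once at registration time. A minimal sketch, with plain tuples standing in for the real descriptor structures:

```rust
use std::collections::HashMap;

/// Sketch of the bridge's per-driver dispatch table, built once from the
/// driver's DbusSchemaTable. Keys are (interface name, member name); values
/// carry the ring ordinal and the required permission bits from the entry
/// descriptor. Illustrative shape only.
pub struct DispatchTable {
    entries: HashMap<(String, String), (u16, u8)>,
}

impl DispatchTable {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Populate from one DbusEntryDesc (done at driver registration).
    pub fn register(&mut self, iface: &str, member: &str, ring_ordinal: u16, required_perm: u8) {
        self.entries.insert((iface.into(), member.into()), (ring_ordinal, required_perm));
    }

    /// Step 3b: look up the descriptor for an incoming method call.
    /// None → the bridge returns org.freedesktop.DBus.Error.UnknownMethod.
    pub fn lookup(&self, iface: &str, member: &str) -> Option<(u16, u8)> {
        self.entries.get(&(iface.to_string(), member.to_string())).copied()
    }
}
```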

11.11.5.2 Signal (Tier 2 Driver → D-Bus Clients)

1. BT driver discovers a new device during scan.

2. Driver writes DeviceFoundSignal struct to the bridge ring
   (signal ring_ordinal, push direction: driver → bridge).

3. Bridge receives ring message:
   a. Looks up DbusEntryDesc by ring_ordinal (kind = signal)
   b. Constructs D-Bus signal message from the struct fields
   c. Emits signal on the D-Bus bus (org.bluez.Adapter1.DeviceFound)

4. All subscribed D-Bus clients receive the signal.

11.11.5.3 Property Read/Write

Property reads and writes follow the standard org.freedesktop.DBus.Properties.Get / .Set / .GetAll pattern. The bridge intercepts these D-Bus methods and translates them:

  • Get(interface, property) → ring message with the property's read ordinal → driver returns value → bridge formats D-Bus VARIANT response.
  • Set(interface, property, value) → ring message with the property's write ordinal + value → driver validates and applies → bridge returns success/error.
  • GetAll(interface) → bridge sends one ring message per readable property, collects responses, assembles a{sv} dict → returns to D-Bus caller.

The bridge caches property values with a configurable TTL (default: 0 = no cache, every read hits the driver). Drivers that want caching set @dbus_property_cache(ms) in the schema. The cache is invalidated when the driver emits a PropertiesChanged signal via the ring.
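The caching behavior can be sketched as follows, with the monotonic clock passed in explicitly so the logic is testable. In this sketch, @dbus_property_cache(ms) would set ttl_ms, and a PropertiesChanged ring message would drive invalidate(); structure and names are illustrative:

```rust
use std::collections::HashMap;

/// Property cache sketch. ttl_ms == 0 means "never cache", matching the
/// default (every read hits the driver).
pub struct PropCache {
    ttl_ms: u64,
    vals: HashMap<String, (Vec<u8>, u64)>, // property → (value, stored_at_ms)
}

impl PropCache {
    pub fn new(ttl_ms: u64) -> Self {
        Self { ttl_ms, vals: HashMap::new() }
    }

    /// Return a cached value only if it is still within the TTL.
    pub fn get(&self, prop: &str, now_ms: u64) -> Option<&[u8]> {
        if self.ttl_ms == 0 {
            return None; // default: every read hits the driver
        }
        let (v, at) = self.vals.get(prop)?;
        if now_ms.saturating_sub(*at) < self.ttl_ms { Some(v.as_slice()) } else { None }
    }

    /// Store a fresh value after a driver round-trip.
    pub fn put(&mut self, prop: &str, val: Vec<u8>, now_ms: u64) {
        if self.ttl_ms > 0 {
            self.vals.insert(prop.into(), (val, now_ms));
        }
    }

    /// Called when the driver emits PropertiesChanged via the ring.
    pub fn invalidate(&mut self, prop: &str) {
        self.vals.remove(prop);
    }
}
```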

11.11.6 Bus Types: System vs Session

The bridge runs as two logical instances:

Instance        D-Bus bus                                 Scope                     Typical drivers
System bridge   System bus (/run/dbus/system_bus_socket)  Machine-wide, root-owned  Bluetooth, WiFi control, fwupd, disk management, TPM
Session bridge  Session bus ($DBUS_SESSION_BUS_ADDRESS)   Per-user login session    (reserved for future per-user hardware access)

The system bridge starts at boot as part of the driver infrastructure. It registers on the system D-Bus bus under the well-known names declared by its connected drivers (e.g., org.bluez, net.connman.iwd, org.freedesktop.fwupd).

The session bridge is optional and starts per-user if any Tier 2 driver declares session-bus interfaces (bus_type = 1 in DbusInterfaceDesc). Session-scoped capabilities are granted via the user's login credential.

Both instances are the same binary with different configuration (bus type selection, capability scope). They share the same DbusSchemaTable parsing and ring dispatch code.

11.11.7 Capability Integration

The bridge replaces polkit for hardware-access authorization decisions made by Tier 2 drivers. In the Linux world:

[App] → D-Bus → [BlueZ] → polkit check → [BlueZ continues or rejects]

In UmkaOS:

[App] → D-Bus → [Bridge] → capability check → ring → [BT driver]

The bridge checks the calling process's capabilities before forwarding to the driver. Each DbusEntryDesc carries a required_perm field (from the @perm annotation in the schema). The bridge:

  1. Extracts the caller's PID from the D-Bus connection (via SO_PEERCRED on the Unix domain socket, or via org.freedesktop.DBus.GetConnectionUnixProcessID). Note: UmkaOS uses u64 PIDs, so PID reuse TOCTOU is not a practical concern. Alternatively, the kernel can embed the caller's capability set in the ring message header, eliminating the PID lookup entirely.
  2. Looks up the process's capability set via the kernel's capability API.
  3. Checks that the process holds a capability with the required PermissionBits for the target device.
  4. If the check fails, returns org.freedesktop.DBus.Error.AccessDenied without forwarding to the driver.

This replaces polkit's rule-based authorization with UmkaOS's capability model. The advantage: capability checks are O(1) kernel lookups, not JavaScript rule evaluation in a separate polkitd daemon. The policy is embedded in the capability grants, not in /etc/polkit-1/rules.d/ files.
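Steps 3 and 4 above reduce to a bitwise rights check against the entry's required_perm field. A sketch, with illustrative PermissionBits values (the real definitions live in the capability chapter):

```rust
// Illustrative PermissionBits values — assumptions, not the real encoding.
pub const PERM_READ: u8 = 0b001;
pub const PERM_WRITE: u8 = 0b010;
pub const PERM_ADMIN: u8 = 0b100;

/// Step 3: the caller's capability must hold every bit the entry's
/// @perm annotation requires.
#[inline]
pub fn cap_has_rights(caller_perms: u8, required_perm: u8) -> bool {
    caller_perms & required_perm == required_perm
}

/// Steps 3–4 combined: map a failed check to the D-Bus error name the
/// bridge returns without forwarding to the driver.
pub fn authorize(caller_perms: u8, required_perm: u8) -> Result<(), &'static str> {
    if cap_has_rights(caller_perms, required_perm) {
        Ok(())
    } else {
        Err("org.freedesktop.DBus.Error.AccessDenied")
    }
}
```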

Backward compatibility: Applications that call org.freedesktop.PolicyKit1.Authority.CheckAuthorization directly (rare — most use implicit polkit via D-Bus activation) will receive NotAuthorized since polkitd does not exist. The bridge's capability check is the replacement. Applications that rely on D-Bus activation policies (.service files with SystemdService=) work unchanged — the bridge registers the well-known names, and D-Bus activation config points to the bridge.

11.11.7.1 Temporal Authorization Grants

Polkit supports time-bounded authorizations: a user can approve an action "for this session" or "keep this authorization for N minutes." The D-Bus bridge maps these to the UmkaOS capability model using the existing CapConstraints.expires_at field (Section 9.1), which holds a monotonic timestamp in nanoseconds (0 = no expiry).

Grant creation: When a D-Bus method call requires elevated privileges and the user approves a time-bounded grant (via the desktop authentication agent), the bridge creates a CapEntry with the following CapConstraints:

Authorization scope  expires_at value                                Revocation trigger
One-shot (default)   0 (no temporal grant — single-use, not stored)  Immediate after use
For N minutes        monotonic_now_ns() + N * 60 * 1_000_000_000     Timer expiry or explicit revoke
For this session     0 (no time limit) — revoked by session logout   Session CapNode drop

Per-call expiry check: On each D-Bus method call, after the bridge resolves the caller's capability via cap_lookup() and checks cap_has_rights(), it additionally checks the expires_at constraint:

/// Check temporal validity of a D-Bus authorization grant.
/// Called after cap_has_rights() succeeds, before forwarding to the driver ring.
///
/// Returns `true` if the capability has not expired.
/// Returns `false` if `expires_at` is non-zero and the current monotonic
/// timestamp exceeds it.
#[inline]
fn cap_temporally_valid(cap: &Capability, now_ns: u64) -> bool {
    cap.constraints.expires_at == 0 || now_ns <= cap.constraints.expires_at
}

If the capability has expired, the bridge returns org.freedesktop.DBus.Error.AccessDenied to the caller without forwarding the request to the driver ring. The expired CapEntry is lazily cleaned up — the bridge calls cap_revoke() on the expired entry, which sets REVOKED_FLAG and enqueues child revocation via the workqueue (Section 9.1).

Session logout revocation: Session-scoped temporal capabilities are associated with the user's login session CapNode in the capability tree. When the desktop session terminates (logout, lock-screen timeout, or loginctl terminate-session), the session's CapNode is dropped, which triggers cap_revoke() on all capabilities in the session subtree — including all temporal D-Bus authorization grants. This ensures no stale authorizations survive session boundaries, regardless of their expires_at value.

Phase: Phase 4.

11.11.8 Crash Recovery

11.11.8.1 Bridge Crash

The D-Bus bridge is a Tier 2 driver. If it crashes:

  1. Kernel detects process exit (Tier 2 recovery sequence, Section 11.9).
  2. Bridge restarts in ~10ms.
  3. On startup, the bridge:
     a. Reconnects to the D-Bus bus, re-registers well-known names.
     b. Re-reads DbusSchemaTable from all connected Tier 2 drivers.
     c. Re-establishes ring channels to all drivers.
  4. D-Bus clients see a brief service disappearance (~10-50ms). Applications using GDBus or sd-bus with auto-reconnect handle this transparently. Applications that cache D-Bus proxy objects may need to re-create them (standard D-Bus practice for service restarts).

11.11.8.2 Connected Driver Crash

If a Tier 2 driver (e.g., BT driver) crashes while the bridge is running:

  1. Kernel restarts the driver (~10ms).
  2. The bridge detects ring disconnection on the driver's channel.
  3. Pending method calls in flight are held in a bounded queue (max 32 entries, configurable per-driver).
  4. When the driver restarts and re-establishes its ring:
     a. Bridge replays queued method calls.
     b. New method calls from D-Bus clients are forwarded normally.
  5. D-Bus clients see a brief latency spike (10-50ms), not a service failure. The bridge does NOT emit NameOwnerChanged (service disappearance) unless the driver fails to restart within the hold window (default: 500ms).

Non-idempotent method calls: If a connected driver crashes after executing a non-idempotent D-Bus method (e.g., SetVolume) but before the reply is sent, the bridge marks the pending request as UNCERTAIN and returns org.freedesktop.DBus.Error.NoReply to the caller. The caller must query current state (e.g., GetVolume) after driver reload to determine whether the operation succeeded.
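The bounded hold queue from step 3 can be sketched as follows. The capacity mirrors the 32-entry default; the overflow policy (reject the newest call with a D-Bus error rather than block the bridge) is an assumption:

```rust
use std::collections::VecDeque;

/// Bounded hold queue for in-flight method calls while a driver restarts.
/// Sketch only: T stands in for a pending D-Bus method call.
pub struct HoldQueue<T> {
    cap: usize,
    q: VecDeque<T>,
}

impl<T> HoldQueue<T> {
    pub fn new(cap: usize) -> Self {
        Self { cap, q: VecDeque::new() }
    }

    /// Queue a call during the outage. Err returns the call so the bridge
    /// can answer the caller with a D-Bus error instead of waiting.
    pub fn hold(&mut self, call: T) -> Result<(), T> {
        if self.q.len() >= self.cap {
            return Err(call);
        }
        self.q.push_back(call);
        Ok(())
    }

    /// Replay queued calls in arrival order once the ring is back.
    pub fn replay(&mut self) -> impl Iterator<Item = T> + '_ {
        self.q.drain(..)
    }
}
```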

11.11.8.3 Both Crash Simultaneously

If both the bridge and a connected driver crash (e.g., kernel bug affecting Tier 2 scheduling), both restart independently via Tier 2 recovery. The bridge re-discovers all drivers at startup. Order does not matter — the bridge retries ring setup for drivers that are not yet available, with exponential backoff up to 1s.
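The backoff schedule can be sketched in one function; the 10 ms base is an assumption chosen to match the Tier 2 restart time, with the 1 s cap from above:

```rust
/// Exponential backoff for ring-setup retries: doubles per attempt from an
/// assumed 10 ms base, capped at 1 s. attempt is clamped to avoid shift
/// overflow.
pub fn backoff_ms(attempt: u32) -> u64 {
    const BASE_MS: u64 = 10;   // assumption: matches ~10ms Tier 2 restart
    const CAP_MS: u64 = 1_000; // "exponential backoff up to 1s"
    BASE_MS.saturating_mul(1u64 << attempt.min(16)).min(CAP_MS)
}
```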

Discovery mechanism: At startup, the bridge calls enumerate_service_providers(ServiceType::Dbus), a KABI method on KernelServicesVTable, to get the initial set of D-Bus-enabled drivers. For runtime changes (new driver loads, driver unloads), the bridge subscribes to RegistryEvent::ServiceChanged events with service_type == ServiceType::Dbus for push notifications. The ServiceChanged event includes the driver ID and action (Published/Unpublished).

11.11.9 Object Path Mapping

D-Bus interfaces are attached to object paths. The bridge must know which driver handles which object paths. The object_pattern field in the schema provides this:

Pattern                               Example match                            Mapping
/org/bluez/hci{N}                     /org/bluez/hci0                          adapter_index = 0
/org/bluez/hci{N}/dev_{ADDR}          /org/bluez/hci0/dev_AA_BB_CC_DD_EE_FF    adapter_index = 0, device_addr extracted
/org/freedesktop/fwupd                /org/freedesktop/fwupd                   Singleton (no parameters)
/org/freedesktop/UDisks2/drives/{ID}  /org/freedesktop/UDisks2/drives/WDC_...  drive_id extracted from path segment

Pattern variables ({N}, {ADDR}, {ID}) are extracted by the bridge and passed as arguments to the ring message. The bridge maintains an object tree that maps registered patterns to driver ring channels. D-Bus introspection (org.freedesktop.DBus.Introspectable.Introspect) is served from the compiled schema — no round-trip to the driver needed.
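Variable extraction can be sketched as segment-wise matching, where each pattern segment carries at most one trailing {VAR} after a literal prefix — a simplification of the bridge's real object-tree matcher:

```rust
/// Match a concrete D-Bus object path against an object_pattern and extract
/// its variables ({N}, {ADDR}, {ID}). Sketch: a variable consumes the rest
/// of its path segment; patterns with text after the closing brace are not
/// handled.
pub fn match_pattern(pattern: &str, path: &str) -> Option<Vec<(String, String)>> {
    let mut vars = Vec::new();
    let (mut pseg, mut aseg) = (pattern.split('/'), path.split('/'));
    loop {
        match (pseg.next(), aseg.next()) {
            (None, None) => return Some(vars), // both exhausted → match
            (Some(p), Some(a)) => {
                if let Some(open) = p.find('{') {
                    // Literal prefix + one variable, e.g. "hci{N}" or "dev_{ADDR}".
                    let close = p.find('}')?;
                    let (prefix, var) = (&p[..open], &p[open + 1..close]);
                    let value = a.strip_prefix(prefix)?;
                    if value.is_empty() {
                        return None; // variable must bind something
                    }
                    vars.push((var.to_string(), value.to_string()));
                } else if p != a {
                    return None; // literal segment mismatch
                }
            }
            _ => return None, // different segment counts
        }
    }
}
```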

Dynamic objects: When a BT device is discovered, the BT driver creates a new object path (e.g., /org/bluez/hci0/dev_AA_BB_CC_DD_EE_FF). The driver notifies the bridge via a ring message (DBUS_OBJECT_ADDED { path, interfaces }), and the bridge registers the new object on D-Bus. Similarly, DBUS_OBJECT_REMOVED for device departure. The bridge emits the standard org.freedesktop.DBus.ObjectManager.InterfacesAdded / InterfacesRemoved signals.

11.11.10 Canonical Tier 2 Migration Candidates

The following table summarizes which traditional userspace daemons are strong candidates for migration to Tier 2 drivers with D-Bus bridge exposure:

Daemon / Driver       Tier    D-Bus interface                              Migration value                          Bridge needed?  Notes
BlueZ (bluetoothd)    Tier 2  org.bluez.*                                  High — crash-prone, hardware-adjacent    Yes             HCI transport via KABI ring to BT controller
iwd / wpa_supplicant  Tier 2  net.connman.iwd.* / fi.w1.wpa_supplicant1.*  High — security-critical auth            Yes             Control plane only; WiFi data path stays Tier 1
fwupd                 Tier 2  org.freedesktop.fwupd                        Medium — capability-gated device access  Yes             Per-device firmware write caps
udisks2               Tier 2  org.freedesktop.UDisks2.*                    Medium — disk management                 Yes             Block device operations via KABI
GPU display ctrl      Tier 1  org.freedesktop.DisplayManager               Medium — config without perf loss        Yes             DRM/KMS data path at Tier 1; config via bridge
NVMe health           Tier 1  (custom)                                     Low — status/health queries              Yes             NVMe I/O at Tier 1; S.M.A.R.T. queries via bridge
tpm2-abrmd            Tier 2  com.intel.tss2.Tabrmd                        Low — kernel TPM RM already exists       Possibly        May be absorbed by kernel TPM subsystem
FUSE daemons          Tier 2  None (VFS syscalls)                          High — eliminates double-copy            No              Apps use read/write/stat; no D-Bus involved
iSCSI (iscsid)        Tier 2  None (block device)                          High — unifies kernel/userspace split    No              Apps see /dev/sdX
smartd                —       None (sysfs/procfs)                          Low — simple polling, FMA integration    No              Better as userspace → FMA event pipe
PipeWire (session)    —       org.freedesktop.ReserveDevice1               Not recommended                          N/A             Per-user session daemon; wrong scope for Tier 2
CUPS                  —       None (HTTP/IPP)                              Not recommended                          N/A             Application server, not hardware driver

Decision criteria for daemon replacement: A daemon is a strong Tier 2 + D-Bus bridge candidate if it is (a) hardware-adjacent (currently uses /dev/* + ioctl), (b) complex or crash-prone, (c) exposes D-Bus interfaces that desktop apps depend on, and (d) not a per-user session service or application-level server.

11.11.11 Universal Driver Management Bus

The D-Bus bridge is not limited to replacing existing daemons. It serves a broader role: a universal management plane for the entire driver ecosystem.

Any driver — Tier 1 or Tier 2, whether or not it replaces an existing daemon — can declare dbus_interface blocks in its KABI manifest to expose configuration, status, and diagnostics to desktop applications and system management tools. The data path (block I/O, packet forwarding, DMA transfers) stays on the driver's native hot path; the management plane goes through D-Bus via the bridge. Two orthogonal channels from the same driver, declared in one manifest.

11.11.11.1 Why This Matters

In Linux, every subsystem invents its own management interface:

Subsystem       Linux management interface                          Problems
Block devices   sysfs text files, ioctl, udev properties            Fragmented, text parsing, no type safety
Network         netlink (nl80211, rtnetlink), sysfs, ethtool ioctl  Three different IPC mechanisms for one subsystem
Storage health  S.M.A.R.T. passthrough ioctl + smartd daemon        Requires root, no structured event delivery
RAID            mdadm CLI parsing sysfs + /proc/mdstat              Text parsing, polling, no event notification
iSCSI           iscsiadm CLI + iscsid daemon + kernel module        Three-way split across CLI, daemon, and kernel
GPU             DRM ioctl + sysfs + vendor-specific tools           Per-vendor CLI tools, no unified interface

UmkaOS provides a single, structured channel for all driver management: D-Bus via the bridge. Desktop applications (GNOME Disks, NetworkManager, GNOME Settings, KDE Plasma) and system management tools get typed, introspectable, event-driven access to any driver's management interface — without per-subsystem parsers, without polling sysfs, without text parsing.

11.11.11.2 New Management Interfaces (Not Daemon Replacements)

These drivers never had D-Bus interfaces in Linux. The bridge enables them to expose structured management for the first time:

Driver             Tier    D-Bus interface            Exposed management functions
NVMe               Tier 1  org.umkaos.NVMe1.*         S.M.A.R.T. health, temperature, firmware slot info, namespace management, error log
AHCI/SATA          Tier 1  org.umkaos.Sata1.*         Drive health, link speed, hot-plug status, power state
RAID / dm          Tier 1  org.umkaos.Raid1.*         Array status, rebuild progress, spare management, scrub scheduling
iSCSI              Tier 2  org.umkaos.Iscsi1.*        Target discovery, session login/logout, CHAP config, connection status, error recovery
Network (per-NIC)  Tier 1  org.umkaos.Net1.*          Link state, speed negotiation, error counters, offload config, ring buffer sizing
USB hub            Tier 2  org.umkaos.UsbHub1.*       Port power control, device enumeration, over-current status
Thermal            Tier 1  org.umkaos.Thermal1.*      Zone temperatures, trip points, cooling device state, throttling status
Power supply       Tier 1  org.umkaos.PowerSupply1.*  Battery capacity, charge state, AC/DC status, health
Audio (ALSA)       Tier 1  org.umkaos.Audio1.*        Card enumeration, mixer control, PCM stream info, jack detection, power state

These interfaces use UmkaOS-specific D-Bus well-known names (org.umkaos.*) since they are new — no Linux backward-compatibility constraint. Desktop environments add support incrementally; the D-Bus introspection mechanism provides full self-description.

11.11.11.3 Data Path vs Management Plane

The critical design principle: D-Bus is for the management plane only. It is never on the data path.

Channel           Mechanism                                                  Latency       Use
Data path         Kernel syscalls (read/write/ioctl), block I/O, packet I/O  Microseconds  Per-packet, per-I/O, per-page-fault
Management plane  D-Bus via bridge → KABI ring → driver                      Milliseconds  Configuration, status queries, event subscription

A GNOME Disks user clicking "Show S.M.A.R.T. Data" goes through D-Bus. The actual disk I/O serving that user's files goes through the block layer. These two paths share no code, no locks, and no latency coupling. The bridge could be restarting from a crash while the NVMe driver continues serving I/O at full speed.

11.11.11.4 Linux Compatibility of New Interfaces

New org.umkaos.* D-Bus interfaces do not break Linux compatibility — they are additive. Applications that work on Linux today continue to work on UmkaOS via the existing syscall interface (sysfs, procfs, ioctl, netlink). The D-Bus management interfaces are a better alternative that new or updated applications can adopt.

For existing Linux D-Bus interfaces (BlueZ, NetworkManager, udisks2), the bridge provides exact compatibility — same well-known names, same object paths, same method signatures. Applications cannot distinguish UmkaOS's bridge-backed implementation from the original daemon.

11.11.12 Non-D-Bus Userspace Protocols

Some daemons expose non-D-Bus protocols:

Daemon          Protocol                          Recommendation
wpa_supplicant  wpa_ctrl Unix socket              Primary consumer (NetworkManager) already uses D-Bus. Legacy wpa_ctrl users can use a minimal compat shim or migrate to D-Bus.
CUPS            HTTP/IPP on port 631              Stays as userspace daemon — not a Tier 2 candidate.
Avahi           D-Bus + custom mDNS socket        D-Bus side via bridge; mDNS protocol handling in Tier 2 driver.
PulseAudio      Native protocol over Unix socket  PipeWire compat: stays in userspace. Audio HAL → Tier 2.

The D-Bus bridge handles the majority case. The remaining protocols are either served by daemons that stay in userspace (CUPS, PipeWire) or have D-Bus as their primary consumer interface (wpa_supplicant via NetworkManager).

11.11.13 Relationship to umkafs

The D-Bus bridge coexists with umkafs (Section 20.5). They serve different purposes:

Aspect          D-Bus bridge                                                             umkafs
Consumers       Desktop applications (GNOME, KDE), NetworkManager, system management GUIs  System tools, shell scripts, kernel introspection, admin automation
Protocol        D-Bus wire protocol (typed methods, signals, properties)                 File I/O (read/write/stat on pseudo-files)
Scope           Driver management interfaces (declared per-driver)                       All kernel objects (automatic)
Discovery       D-Bus introspection (org.freedesktop.DBus.Introspectable)                Directory traversal (ls, find)
Event delivery  D-Bus signals (push, subscribed)                                         inotify / FileWatchCap (poll or watch)
Access control  Capability check on caller PID                                           Capability on file handle

Both mechanisms expose the same underlying driver state through different protocols optimized for different consumers. A BT driver exposes org.bluez.Adapter1.Powered via D-Bus for GNOME Settings, and /ukfs/kernel/drivers/bluetooth/hci0/powered via umkafs for shell scripts. The driver writes the state once; the bridge and umkafs read it independently. These are complementary, not competing — the driver does not need to implement two management interfaces.

11.11.14 Cross-References