Chapter 12: KABI — Kernel Driver ABI¶
Stable driver ABI, KABI IDL, vtable design, driver signing, compatibility windows
KABI (Kernel ABI) is the stable binary interface between the kernel and drivers. Vtables
are append-only and versioned via kabi_version (primary discriminant) and vtable_size
(bounds guard). Drivers compiled against any supported KABI version load and run without
recompilation — a 5-release support window gives drivers ~5 years of binary compatibility.
The KABI IDL compiler (kabi-gen) generates all vtable boilerplate from .kabi files.
12.1 KABI Overview¶
12.1.1 The Problem We Solve¶
Linux has NO stable in-kernel ABI. This means:
- Drivers must recompile with every kernel update
- Nvidia ships binary blobs that constantly break
- DKMS rebuilds are fragile and fail
- The community's answer is "get upstream or suffer"
- Enterprise customers cannot independently update kernel and drivers
UmkaOS provides a stable, versioned, append-only C-ABI (called KABI) that survives kernel updates. A driver compiled against KABI v1 will load and run correctly on any future kernel that supports KABI v1 -- without recompilation.
12.1.2 Interface Definition Language (.kabi)¶
All driver interfaces are defined in .kabi IDL files. The umka-kabi-gen tool
(short form: kabi-gen) generates both Rust and C bindings from these definitions.
// interfaces/block_device.kabi
@version(1)
interface BlockDevice {
    fn submit_io(op: BioOp, lba: u64, count: u32, buf: DmaBuffer) -> IoResult;
    fn poll_completion(handle: RequestHandle) -> PollResult;
    fn get_capabilities() -> BlockCapabilities;
}

@version(2) @extends(BlockDevice, 1)
interface BlockDeviceV2 {
    // V2+ methods MUST have @default annotations — they are optional
    // extensions that a V1 kernel or V1 driver will not provide.
    @default(-EOPNOTSUPP)
    fn discard_blocks(lba: u64, count: u32) -> IoResult;
    @default(-EOPNOTSUPP)
    fn zone_management(op: ZoneOp, zone: u64) -> ZoneResult;
}
This compiles down to a C-compatible vtable. The Nucleus KABI dispatch trampoline (Section 2.21) validates every vtable call before dispatch:
- Bounds check: method_id (the zero-based ordinal of the requested method in the vtable) is checked against the vtable's method count, computed as (vtable.vtable_size - VTABLE_HEADER_SIZE) / size_of::<fn()>(). If method_id >= method_count, the trampoline returns ENOSYS without dereferencing the function pointer. This prevents out-of-bounds reads into memory beyond the vtable allocation.
- Null check for optional methods: For V2+ optional methods (Option<fn>), the trampoline checks for None (a null pointer) after the bounds check passes. If the method slot is None, the trampoline returns ENOSYS. This allows callers to attempt V2 methods against V1 drivers without crashing.
- vtable_size validation: At driver registration time, the kernel validates that vtable_size >= VTABLE_HEADER_SIZE (16 bytes: vtable_size + kabi_version) and that vtable_size is a multiple of size_of::<fn()>() (function-pointer alignment, matching the method-count computation above). A driver providing a malformed vtable_size is rejected at registration with EINVAL.
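These checks can be sketched in ordinary Rust. The sketch below is illustrative only: `RawVTable`, its `Vec`-based slot array, and the helper names are simplified stand-ins for the real trampoline, while `VTABLE_HEADER_SIZE` and the ENOSYS behavior come from the rules above (a 64-bit target with 8-byte function pointers is assumed).

```rust
// Illustrative sketch of the trampoline checks; RawVTable and its
// Vec-based slot array stand in for the real C-ABI vtable layout.
const VTABLE_HEADER_SIZE: u64 = 16; // vtable_size + kabi_version
const ENOSYS: i64 = 38;

struct RawVTable {
    vtable_size: u64,
    kabi_version: u64,
    slots: Vec<Option<fn() -> i64>>, // stand-in for C fn-pointer slots
}

/// Registration-time check: the header fits and the size is a multiple
/// of the function-pointer size (8 bytes on this assumed 64-bit target).
fn validate_registration(vtable_size: u64) -> bool {
    vtable_size >= VTABLE_HEADER_SIZE && vtable_size % 8 == 0
}

/// Per-call checks: bounds first, then the null check for optional slots.
fn dispatch(vt: &RawVTable, method_id: u64) -> i64 {
    let method_count = (vt.vtable_size - VTABLE_HEADER_SIZE) / 8;
    if method_id >= method_count {
        return -ENOSYS; // out of bounds: never dereference
    }
    match vt.slots[method_id as usize] {
        Some(f) => f(),
        None => -ENOSYS, // optional method absent in this driver
    }
}
```

A V1 driver registering a 40-byte vtable (header plus three mandatory methods) simply reports a smaller method count, so V2 method ids fail the bounds check before any pointer is read.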
#[repr(C)]
pub struct BlockDeviceVTable {
    pub vtable_size: u64,  // Bounds-safety check: byte count of this vtable
    pub kabi_version: u64, // Primary version discriminant: KabiVersion::as_u64()

    // V1 methods -- mandatory, never Option
    pub submit_io: unsafe extern "C" fn(
        ctx: *mut c_void, op: BioOp, lba: u64, count: u32, buf: DmaBuffer,
    ) -> IoResult,
    pub poll_completion: unsafe extern "C" fn(
        ctx: *mut c_void, handle: RequestHandle,
    ) -> PollResult,
    pub get_capabilities: unsafe extern "C" fn(
        ctx: *mut c_void,
    ) -> BlockCapabilities,

    // V2 methods -- optional, wrapped in Option for graceful absence
    pub discard_blocks: Option<unsafe extern "C" fn(
        ctx: *mut c_void, lba: u64, count: u32,
    ) -> IoResult>,
    pub zone_management: Option<unsafe extern "C" fn(
        ctx: *mut c_void, op: ZoneOp, zone: u64,
    ) -> ZoneResult>,
}

// BlockDeviceVTable: vtable_size(u64=8) + kabi_version(u64=8) + 3 mandatory fn ptrs +
// 2 optional fn ptrs. Function pointers are pointer-width-sized.
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<BlockDeviceVTable>() == 56);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<BlockDeviceVTable>() == 36);
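A caller attempting the V2 discard_blocks method against a possibly-V1 vtable must check vtable_size before reading the slot, and only then handle None. A minimal sketch, using a reduced stand-in struct and an illustrative errno value:

```rust
// Sketch: probing a V2 optional method on a possibly-V1 vtable. The
// reduced struct and errno value are illustrative stand-ins; in a real
// V1 vtable the discard_blocks field is physically absent, which is why
// the size check must come before the field is read.
const EOPNOTSUPP: i32 = 95;

#[repr(C)]
struct VTableV2 {
    vtable_size: u64,
    kabi_version: u64,
    // V2 optional slot; a V1 vtable ends before this field.
    discard_blocks: Option<fn(lba: u64, count: u32) -> i32>,
}

fn try_discard(vt: &VTableV2, lba: u64, count: u32) -> i32 {
    let header: u64 = 16; // vtable_size + kabi_version
    if vt.vtable_size < header + 8 {
        return -EOPNOTSUPP; // V1 vtable: slot not present at all
    }
    match vt.discard_blocks {
        Some(f) => f(lba, count),
        None => -EOPNOTSUPP, // slot present but unimplemented
    }
}
```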
The NIC driver interface follows the same vtable pattern:
/// KABI vtable for NIC drivers (Tier 1). Exported by NIC driver modules
/// via the tier entry point (`kabi_entry_tier1() -> *const NicDriverVTable`).
/// Used by umka-net to drive physical and virtual network interfaces.
///
/// The vtable is returned by the driver's entry point during the bilateral
/// capability exchange ([Section 12.3](#kabi-bilateral-capability-exchange)). umka-net holds a pointer to this
/// vtable for the lifetime of the driver instance. On driver crash and
/// reload, a new vtable is obtained from the replacement instance.
// kernel-internal, not KABI
#[repr(C)]
pub struct NicDriverVTable {
    /// Bounds-safety check: byte count of this vtable. The kernel reads only
    /// the first `min(vtable_size, KERNEL_NIC_VTABLE_SIZE)` bytes. Methods
    /// beyond `vtable_size` are treated as absent (null-equivalent).
    pub vtable_size: u64,
    /// Primary version discriminant: `KabiVersion::as_u64()`. Disambiguates
    /// vtables that may have the same `vtable_size` after deprecation cycles
    /// (see [Section 12.2](#kabi-abi-rules-and-lifecycle) Rule 2a). Monotonically increasing.
    pub kabi_version: u64,

    // --- V1 methods (mandatory) ---

    /// Bring interface up. Allocate RX/TX ring buffers, enable device
    /// interrupts, program MAC filters, start NAPI instances.
    /// Called when userspace runs `ip link set <dev> up`.
    /// Returns 0 on success, negative errno on failure.
    pub open: unsafe extern "C" fn(dev: DeviceHandle) -> i32,

    /// Bring interface down. Drain TX/RX queues, disable device interrupts,
    /// disable NAPI, free ring buffers. Called on `ip link set <dev> down`.
    /// Returns 0 on success, negative errno on failure.
    pub stop: unsafe extern "C" fn(dev: DeviceHandle) -> i32,

    /// Transmit a packet. Called from umka-net's TX path after queue
    /// selection and traffic control. The driver sets up DMA descriptors
    /// pointing at the NetBuf's data pages and writes the TX doorbell.
    ///
    /// Returns 0 on successful queueing to hardware.
    /// Returns `-EBUSY` if the TX ring is full — umka-net stops the
    /// corresponding TxQueue and retries after `netif_tx_wake_queue()`.
    /// The NetBufHandle ownership transfers to the driver on success;
    /// on `-EBUSY` the handle remains owned by the caller.
    pub start_xmit: unsafe extern "C" fn(dev: DeviceHandle, buf: NetBufHandle) -> i32,

    /// NAPI poll callback. Called by the NAPI subsystem when a softirq
    /// fires for this device's NAPI instance. The driver processes up to
    /// `budget` packets from its RX ring, calling `napi_receive_buf()` for
    /// each received packet (accumulated into the NAPI batch for delivery
    /// to umka-net).
    ///
    /// Returns the number of packets processed (0..=budget).
    /// If return < budget: driver calls `napi_complete_done()` and re-enables
    /// device RX interrupts (the NAPI polling cycle ends until the next IRQ).
    /// If return == budget: NAPI re-schedules the poll (the device has more
    /// packets waiting; polling continues without re-enabling interrupts).
    pub napi_poll: unsafe extern "C" fn(dev: DeviceHandle, napi_id: u32, budget: i32) -> i32,

    /// Register a NAPI instance for this device. Called during `open()`.
    /// `napi_id`: driver-assigned NAPI instance identifier (typically one per
    /// RX queue; multi-queue NICs register one NAPI per queue).
    /// `weight`: default poll budget per NAPI poll cycle (typically 64).
    /// Returns 0 on success, negative errno on failure.
    pub napi_register: unsafe extern "C" fn(dev: DeviceHandle, napi_id: u32, weight: i32) -> i32,

    /// Query device statistics. Fills `stats` with current RX/TX packet
    /// counts, byte counts, error counts, and drop counts. Called by
    /// ethtool, /proc/net/dev, and netlink RTM_GETSTATS.
    pub get_stats: unsafe extern "C" fn(dev: DeviceHandle, stats: *mut NetDevStats),

    /// Negotiate hardware offload features. umka-net calls this after
    /// userspace changes feature flags via ethtool (`ethtool -K <dev> ...`).
    /// The driver enables or disables the requested offloads in hardware.
    /// `features` is the desired feature set (intersection of hw_features
    /// and the user's requested features). Returns 0 on success.
    pub set_features: unsafe extern "C" fn(dev: DeviceHandle, features: NetDevFeatures) -> i32,

    /// Get the hardware feature set supported by this NIC. Called once at
    /// device registration to populate `NetDevice::hw_features`. The returned
    /// flags advertise which offloads the hardware can perform (checksum,
    /// TSO, GRO, scatter-gather, VLAN offload, etc.).
    pub get_features: unsafe extern "C" fn(dev: DeviceHandle) -> NetDevFeatures,

    /// Set the device MAC address. Called when userspace runs
    /// `ip link set <dev> address <mac>`. Returns 0 on success,
    /// `-EADDRNOTAVAIL` if the address is invalid for this device type.
    pub set_mac_addr: unsafe extern "C" fn(dev: DeviceHandle, addr: *const [u8; 6]) -> i32,

    /// Change the device MTU. Called when userspace runs
    /// `ip link set <dev> mtu <val>`. The driver validates against its
    /// hardware minimum/maximum MTU range. Returns 0 on success,
    /// `-EINVAL` if the requested MTU is outside the hardware range.
    pub change_mtu: unsafe extern "C" fn(dev: DeviceHandle, new_mtu: u32) -> i32,

    /// Set receive mode flags. Called when promiscuous mode, all-multicast
    /// mode, or the multicast filter list changes (e.g., userspace joins
    /// a multicast group via `setsockopt(IP_ADD_MEMBERSHIP)`).
    /// `flags` is a bitmask of `RxModeFlags` (IFF_PROMISC, IFF_ALLMULTI,
    /// IFF_MULTICAST). The driver programs the NIC's MAC filter
    /// accordingly.
    pub set_rx_mode: unsafe extern "C" fn(dev: DeviceHandle, flags: RxModeFlags),
}
// NicDriverVTable: vtable_size(u64=8) + kabi_version(u64=8) + 11 mandatory fn ptrs.
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<NicDriverVTable>() == 104);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<NicDriverVTable>() == 60);
/// Receive mode flags passed to `NicDriverVTable::set_rx_mode`.
bitflags::bitflags! {
    #[repr(transparent)]
    pub struct RxModeFlags: u32 {
        /// Receive all frames regardless of destination MAC.
        const PROMISC = 1 << 0;
        /// Receive all multicast frames (bypass multicast filter).
        const ALLMULTI = 1 << 1;
        /// Multicast filtering enabled (driver should program its
        /// hardware multicast filter table with the current group list).
        const MULTICAST = 1 << 2;
    }
}
NIC interrupt model (unified 4-step): When the NIC raises an RX interrupt,
the Tier 0 generic_irq_handler() handles it — the Tier 1 NIC driver does NOT
have an ISR (Tier 0 manages all hardware interrupts). The four steps are:
- Hardware IRQ → Tier 0 handler (~10 cycles): ACK, mask vector.
- Write IRQ event to per-driver IRQ ring (Section 12.8).
- Driver consumer loop dequeues event (domain service schedules it).
- Driver runs NAPI-equivalent poll, drains RX ring, re-enables interrupt.
Key UmkaOS difference: no softirq needed. The IRQ ring wake IS the scheduling mechanism. The driver's consumer thread has its own priority (not tied to ksoftirqd). The authoritative specification is in Section 16.13. See also Section 16.14 for the poll path dispatch (Tier 0 Direct, Tier 1 KabiRing, Tier 2 IpcRpc).
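The four steps can be sketched with a std::sync::mpsc channel standing in for the per-driver IRQ ring; every name here is illustrative, and the hardware ACK/mask and interrupt re-enable steps are reduced to comments.

```rust
use std::sync::mpsc;

// Step 2 stand-in: Tier 0 writes an IRQ event into the per-driver ring.
// mpsc::channel is an illustrative substitute for the real IRQ ring.
#[derive(Debug, PartialEq)]
struct IrqEvent {
    vector: u32,
}

fn tier0_handler(ring: &mpsc::Sender<IrqEvent>, vector: u32) {
    // Step 1: ACK + mask happen in hardware-touching code (omitted here).
    ring.send(IrqEvent { vector }).unwrap(); // Step 2: enqueue IRQ event
}

// Steps 3-4 stand-in: the driver's consumer loop dequeues events, runs
// its NAPI-equivalent poll to drain the RX ring, and (in the real
// system) re-enables the interrupt afterwards.
fn driver_consumer(ring: &mpsc::Receiver<IrqEvent>, poll: impl Fn(u32) -> u32) -> u32 {
    let mut total = 0;
    while let Ok(ev) = ring.try_recv() {
        total += poll(ev.vector); // Step 4: drain RX ring for this vector
    }
    total
}
```

The channel send doubling as the wake-up illustrates the point in the text: the ring write itself is the scheduling mechanism, with no softirq layer in between.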
12.2 ABI Rules and Version Lifecycle¶
12.2.1 ABI Rules (Enforced by CI)¶
These rules are non-negotiable and enforced by the kabi-compat-check tool in CI:
1. Vtables are append-only -- new methods are added at the end only.
2. Existing methods are never removed, reordered, or changed in signature. Deprecated methods are replaced with a tombstone stub (kabi_deprecated_stub, returns -ENOSYS) at the end of the deprecation cycle (Rule 2a). The vtable slot is preserved — vtable_size does not shrink.
   - Rule 2a (Deprecation lifecycle): a method deprecated in KABI vN gains #[deprecated(since = "KABI_vN")] and emits a kernel log warning when called. After +3 minor versions (or +5 for LTS), the method pointer is replaced with kabi_deprecated_stub. Callers receive -ENOSYS; the slot is not removed.
3. All types crossing the ABI use #[repr(C)] with explicit sizes (u32, u64; never usize, which varies by platform).
4. Enums use #[repr(u32)] with explicit discriminant values.
5. New struct fields are appended only, never removed or reordered.
6. The kabi_version field is the primary version discriminant. Every vtable's second field is kabi_version: u64 (a packed KabiVersion::as_u64()). This is the authoritative identity for version negotiation and compatibility checking. The vtable_size field (first field) remains as a bounds-safety check — the kernel reads only min(vtable_size, KERNEL_VTABLE_SIZE) bytes — but is no longer the sole discriminant, since deprecation tombstones (Rule 2a) mean that two different KABI versions could theoretically have the same vtable_size.
7. Padding fields are reserved and must be zero-initialized for forward compatibility.
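The packed KabiVersion::as_u64() layout is not spelled out in this section. One plausible encoding, assumed here purely for illustration, puts the major version in the high 32 bits so that the packed value is monotonically increasing across versions:

```rust
// Hypothetical sketch of KabiVersion packing: the text specifies a
// packed u64 but not its layout. Placing major in the high 32 bits
// makes as_u64() monotonically increasing, matching the requirement
// stated for the kabi_version field.
#[derive(Clone, Copy, PartialEq, PartialOrd, Debug)]
struct KabiVersion {
    major: u32,
    minor: u32,
}

impl KabiVersion {
    fn as_u64(self) -> u64 {
        ((self.major as u64) << 32) | self.minor as u64
    }
    fn from_u64(v: u64) -> Self {
        KabiVersion { major: (v >> 32) as u32, minor: v as u32 }
    }
}
```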
12.2.1.1 Tombstone Stub Protocol¶
When a KABI method is deprecated (Rule 2a above), the deprecation lifecycle produces a tombstone stub — a placeholder function pointer that occupies the original vtable slot:
/// Tombstone stub for deprecated KABI methods. Returns -ENOSYS to all callers.
/// The vtable slot is preserved (vtable_size does not shrink) so that older
/// drivers compiled against the previous KABI version do not read out-of-bounds.
///
/// Each deprecated method may optionally declare a per-method removal error code
/// in the KABI manifest (e.g., `-EOPNOTSUPP` instead of `-ENOSYS`). If declared,
/// the tombstone stub returns that code instead. This allows subsystems to
/// distinguish "method was removed" from "method was never implemented".
/// Shared tombstone stub for simple integer-returning methods only.
/// Struct-returning and void-returning methods require per-method tombstones
/// generated by `umka-kabi-gen` to match the original method's C ABI signature.
/// See [Section 12.4](#kabi-version-negotiation--deprecation-tombstones) for the full
/// tombstone generation protocol.
pub extern "C" fn kabi_deprecated_stub_i64() -> i64 {
    -(ENOSYS as i64)
}
The tombstone stub is referenced by the live evolution framework when replaying pending
ops against a new component: if a pending op's method_id targets a tombstoned slot,
the op is not replayed and the tombstone's error code is returned instead.
kabi-compat-check tool specification:
The CI tool enforces the rules above by diffing the current .kabi IDL against the
previous release baseline. The algorithm:
1. Parse both versions: Load old.kabi and new.kabi into an AST representation (vtable definitions, struct definitions, enum definitions, constant declarations).
2. Vtable diff: For each vtable present in old.kabi:
   - Reject if any method was removed, reordered, or had its signature changed.
   - Accept appended methods (new entries after the old vtable's last method).
   - Reject if the vtable name or module path changed.
3. Struct diff: For each #[repr(C)] struct in old.kabi:
   - Reject if any existing field was removed, reordered, or changed type.
   - Accept appended fields (new fields after the last old field).
   - Verify explicit padding (_pad: [u8; N]) is preserved (not repurposed).
   - Verify the #[repr(C)] attribute is present on all ABI-crossing types.
4. Enum diff: For each #[repr(u32)] enum:
   - Reject if any existing variant was removed or had its discriminant changed.
   - Accept new variants appended at the end.
5. Type size check: Verify that no usize, isize, bool, or Vec appears in ABI-crossing types. Only fixed-width types (u8-u64, i8-i64, f32, f64, *const T, *mut T) and #[repr(C)] composites are permitted.
6. Report: On any rule violation, emit a structured error identifying the breaking change (field name, old type, new type, line number in the .kabi file) and exit non-zero. CI treats this as a hard failure — no KABI-breaking change can merge.
The .kabi files are the single source of truth for the stable ABI surface. They are
checked into the repository alongside the Rust source and are versioned with the KABI
major version number (e.g., kabi-v1.kabi, kabi-v2.kabi).
12.2.2 KABI Version Lifecycle and Deprecation Policy¶
Append-only vtables ensure forward compatibility indefinitely — a driver compiled against KABI v1 runs on a kernel implementing KABI v47. But without a deprecation policy, vtables grow without bound, accumulating dead methods that no driver uses, wasting cache lines, and complicating auditing. This section defines the lifecycle.
Version numbering — KABI versions are integer-incremented (v1, v2, v3...). Each version corresponds to a vtable layout. A new version is minted when methods are appended or struct fields are added (never removed or reordered). Major kernel releases bump the KABI version; minor releases do not.
Support window — each KABI version is supported for 5 major kernel releases from the release that introduced it. This provides a concrete, predictable window:
KABI v1: introduced in UmkaOS 1.0 → supported through UmkaOS 5.x → removed in 6.0
KABI v5: introduced in UmkaOS 5.0 → supported through UmkaOS 9.x → removed in 10.0
Deprecation process:
- Deprecation announcement (N-2 releases before removal): KABI v1 is marked deprecated when UmkaOS 4.0 ships. Loading a driver built against a deprecated KABI version logs a warning:
  umka: driver nvme.uko uses deprecated KABI v1 (supported until UmkaOS 5.x, rebuild recommended)
- Compatibility shim (during deprecation window): deprecated vtable methods are backed by shim implementations that translate old calls to current equivalents. This is a vtable-level adapter, not per-call overhead.
- Removal (at window expiry): when UmkaOS 6.0 ships, the KABI v1 compatibility shim is removed. Drivers compiled against KABI v1 fail to load with a clear error:
  umka: driver nvme.uko requires KABI v1 (minimum supported: v2)
- Never break within window: a driver compiled against any supported KABI version must load and function correctly. This is a hard contract, verified by CI testing with driver binaries compiled against every supported KABI version.
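The window arithmetic implied by the examples, assuming annual major releases: a KABI version introduced in major release N is deprecated at N+3 and removed at N+5. A minimal sketch:

```rust
// Sketch of the support-window arithmetic from the lifecycle above:
// 5-release support window, deprecation warning starting two major
// releases before removal.
fn removal_release(introduced: u32) -> u32 {
    introduced + 5
}

fn deprecation_release(introduced: u32) -> u32 {
    removal_release(introduced) - 2
}
```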
Vtable compaction — when a KABI version is removed, the kernel MAY reorganize internal vtable storage to reclaim space from removed shims. This is invisible to drivers (they see only their own KABI version's vtable layout, which never changes within the support window). Compaction is an implementation optimization, not a semantic change.
Practical impact — with annual major releases and a 5-release window, drivers have ~5 years before they must recompile. This is dramatically longer than Linux's "recompile every kernel update" reality, while avoiding the "append forever" problem.
12.2.3 Behavioral Compatibility Rules¶
The ABI rules above (Rules 1-7) guarantee structural compatibility: vtable layouts, field offsets, and type sizes remain stable across KABI versions. But structural stability alone is insufficient — a method can preserve its signature while changing its observable behavior in ways that break callers. This section defines the rules governing behavioral evolution of KABI methods.
These rules complement the structural rules and are enforced by the KABI behavioral test
suite (see below) rather than by the kabi-compat-check CI tool.
12.2.3.1 Error Code Evolution Policy¶
KABI methods return error codes (typically i64 errno values or KabiResult variants).
When a method's set of possible error codes changes across KABI versions, backward
compatibility depends on whether the caller can already handle the code:
Category 1: Existing error code, newly returned. A method that previously never
returned a particular error code begins returning it, where that code already exists in
the KABI error namespace (e.g., ENOMEM, EBUSY, EINVAL). Example: a block device
submit_io that previously never returned ENOMEM now does so because an internal
path gained an allocation.
- Classification: backward-compatible.
- Rationale: all KABI callers are required to handle the full KabiResult error space or propagate unknown errors to their own callers. A well-written driver treats any unexpected negative errno as a transient or permanent failure. The error code itself is not new to the ABI — only its association with this specific method is new.
- Shim requirement: none.
- Documentation requirement: the KABI changelog entry for the version that introduces the new return path MUST document the method name, the newly-returned error code, and the conditions under which it occurs.
Category 2: Truly new error code. A KABI version introduces a new error code that
does not exist in any prior version's error namespace (e.g., a hypothetical
EDEVICE_PARTITIONED specific to cluster-aware block devices).
- Classification: breaking for callers below the version that introduced the code.
- Shim requirement: the dispatch trampoline (Section 12.4) MUST map the new error code to a semantically appropriate legacy error code when the caller's negotiated KABI version predates the introduction. The mapping is declared in the .kabi IDL file using the @error_compat annotation:
@version(5)
@error_compat(EDEVICE_PARTITIONED => ENETUNREACH, since = 5)
interface BlockDeviceV5 {
    fn submit_io(op: BioOp, lba: u64, count: u32, buf: DmaBuffer) -> IoResult;
}
The @error_compat annotation specifies:
- The new error code name.
- The legacy error code it maps to for callers below the since version.
- The KABI version (since) at which callers are expected to handle the new code natively.
kabi-gen generates the mapping logic into the dispatch trampoline. For callers at or above the since version, the new error code is passed through unmodified.
- CI enforcement: kabi-compat-check rejects any new error code in a .kabi file that lacks an @error_compat annotation. This prevents accidental introduction of unshimmed error codes.
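The trampoline-side mapping that kabi-gen would emit for the @error_compat example above can be sketched as follows; the numeric errno values are illustrative stand-ins:

```rust
// Sketch of the generated @error_compat shim for the example above.
// ENETUNREACH uses the common Linux value; EDEVICE_PARTITIONED is a
// hypothetical code with an invented number.
const ENETUNREACH: i64 = 101;
const EDEVICE_PARTITIONED: i64 = 530; // hypothetical, introduced in KABI v5

fn map_error_for_caller(raw: i64, caller_kabi_version: u64) -> i64 {
    // Callers below `since = 5` get the declared legacy mapping;
    // callers at or above it handle the new code natively.
    if raw == -EDEVICE_PARTITIONED && caller_kabi_version < 5 {
        return -ENETUNREACH;
    }
    raw // passed through unmodified
}
```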
Category 3: Error code removal. A method that previously returned a specific error code stops returning it (the error condition no longer arises).
- Classification: backward-compatible.
- Rationale: callers already handle the code; they simply never see it. No behavioral contract is violated by the absence of an error.
- Shim requirement: none.
- Documentation requirement: the KABI changelog SHOULD note the removal for clarity, but it is not a breaking change.
12.2.3.2 Semantic Change Documentation¶
When a KABI method's observable behavior changes beyond its error code set — for example, new side effects, changed ordering guarantees, altered completion semantics, or modified resource ownership transfer rules — the change must be documented in the KABI changelog with the following mandatory fields:
| Field | Description |
|---|---|
| Method | Fully qualified method name (e.g., BlockDevice::submit_io) |
| KABI version | The minimum version in which the new behavior is present |
| Change type | One of: SHIMMED (old callers see old behavior), TRANSPARENT (old callers see new behavior, non-breaking), BREAKING (old callers see new behavior, potentially breaking) |
| Description | What changed and why |
| Old behavior | Observable behavior for callers below the stated KABI version |
| New behavior | Observable behavior for callers at or above the stated KABI version |
| Shim mechanism | For SHIMMED changes: how the dispatch trampoline or wrapper preserves old behavior |
Rules governing each change type:
- SHIMMED: The kernel maintains a version-conditional code path in the dispatch trampoline or a wrapper around the method implementation. Callers below the stated KABI version observe the old behavior exactly. The shim is maintained for the full support window of the caller's KABI version (Section 12.2). Example: a flush method that previously guaranteed synchronous completion now returns asynchronously; the shim inserts a blocking wait for old callers.
- TRANSPARENT: The new behavior is strictly compatible with all existing callers. No shim is needed because the behavioral change falls within the existing contract's latitude. Example: a poll_completion method returns results faster due to an internal optimization — all callers benefit, none break.
- BREAKING: The new behavior is incompatible with some callers, and no shim is feasible (e.g., a fundamental change in ownership semantics that cannot be transparently bridged). Breaking changes are permitted ONLY at a KABI major version boundary and MUST be accompanied by a migration guide. The old KABI version remains supported for its full deprecation window.
Semantic changes MUST NOT be introduced silently. A method whose observable behavior changes without a changelog entry is a specification bug. The behavioral test suite (below) exists to catch such regressions.
12.2.3.3 Behavioral Test Suite¶
Each KABI version defines a behavioral test suite that validates the semantic contracts
of every method — not just structural compatibility (which kabi-compat-check handles)
but the observable behavior that callers depend on.
Test suite structure:
tests/kabi/
├── v1/
│ ├── block_device_behavioral.rs # Tests for BlockDevice v1 contracts
│ ├── net_device_behavioral.rs # Tests for NetDevice v1 contracts
│ └── ...
├── v2/
│ ├── block_device_behavioral.rs # Tests for BlockDevice v2 additions
│ └── ...
└── common/
├── error_code_coverage.rs # Verify all @error_compat mappings
└── shim_correctness.rs # Verify shimmed behaviors match old spec
Test categories:
- Contract tests: For each method, test that preconditions, postconditions, and invariants documented in the .kabi IDL are upheld. Example: submit_io with a valid DMA buffer returns IoResult::Pending or IoResult::Complete, never IoResult::InvalidHandle.
- Error mapping tests: For each @error_compat annotation, test that a caller negotiating an older KABI version receives the legacy error code, while a caller negotiating the current version receives the new error code.
- Shim fidelity tests: For each SHIMMED semantic change, test that a caller negotiating the pre-change KABI version observes the documented old behavior exactly. This requires a test harness that can present a vtable at a specific negotiated KABI version.
- Regression tests: When a behavioral bug is fixed, a test is added to the version in which the fix shipped, ensuring the bug does not recur. The test documents the buggy behavior and the corrected behavior.
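A contract test of the first category might look like the sketch below; the reduced IoResult enum and the fake driver are illustrative, standing in for a harness-provided vtable:

```rust
// Sketch of a contract test: submit_io with a valid DMA buffer must
// return Pending or Complete, never InvalidHandle. The reduced enum
// and the fake driver are illustrative stand-ins for the real harness.
#[derive(Debug, PartialEq)]
enum IoResult {
    Pending,
    Complete,
    InvalidHandle,
}

fn fake_submit_io(valid_buf: bool) -> IoResult {
    if valid_buf { IoResult::Pending } else { IoResult::InvalidHandle }
}

/// Contract: a valid DMA buffer never yields InvalidHandle.
fn contract_submit_io_valid_buffer() -> bool {
    let r = fake_submit_io(true);
    r == IoResult::Pending || r == IoResult::Complete
}
```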
CI integration:
- The behavioral test suite runs in CI on every change to a .kabi file, a KABI dispatch trampoline, or a method implementation backing a KABI vtable.
- Tests are executed against vtables negotiated at every supported KABI version (not just the latest). For a kernel supporting KABI v3 through v7, the test matrix includes v3, v4, v5, v6, and v7.
- A behavioral test failure is a hard CI failure, equivalent in severity to a kabi-compat-check structural failure.
Relationship to WDK-style validation:
This test suite serves the same role as Windows Driver Kit (WDK) driver verifier tests: it validates that a driver (or kernel subsystem) conforms to its documented behavioral contract, not just its binary interface. The key difference is that KABI behavioral tests run bidirectionally — testing both the kernel's promises to drivers and drivers' adherence to the contracts the kernel depends on.
Phase assignment: The behavioral test framework is a Phase 2 deliverable
(Section 24.2). Phase 1 establishes the structural ABI rules and
kabi-compat-check. Phase 2 adds behavioral contract testing as the KABI surface
area grows beyond the initial device classes.
12.3 Bilateral Capability Exchange¶
Unlike Linux's global kernel symbol table (EXPORT_SYMBOL), UmkaOS uses a bilateral
vtable exchange model. There is exactly one well-known exported symbol per driver:
__kabi_driver_entry. No other global symbols, no symbol versioning, and no
uncontrolled dependencies.
Driver Loading Sequence:
1. Kernel resolves ONE well-known symbol: __kabi_driver_entry
2. __kabi_driver_entry returns *const KabiDriverManifest
(the manifest contains the three tier entry points: entry_direct, entry_ring, entry_ipc)
3. Kernel reads transport_mask, selects the entry point for the assigned tier
4. Kernel calls the selected entry, passing KernelServicesVTable TO driver
(this is what the kernel provides to the driver)
5. Driver returns DriverVTable TO kernel
(this is what the driver provides to the kernel)
6. All further communication flows through these two vtables
7. No other symbols are resolved -- ever
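Steps 1 through 5 can be sketched as follows; the KabiDriverManifest field names follow the sequence above, while the tier bit values and the simplified *const c_void vtable types are assumptions made for illustration:

```rust
use std::ffi::c_void;

// Sketch of the bilateral exchange. The manifest's three tier entry
// points come from the sequence above; the field types, tier bit
// values, and function bodies are simplified stand-ins.
#[repr(C)]
struct KabiDriverManifest {
    transport_mask: u32,
    entry_direct: Option<extern "C" fn(kernel_vt: *const c_void) -> *const c_void>,
    entry_ring: Option<extern "C" fn(kernel_vt: *const c_void) -> *const c_void>,
    entry_ipc: Option<extern "C" fn(kernel_vt: *const c_void) -> *const c_void>,
}

// What a driver's selected entry point does (step 5): receive the
// KernelServicesVTable pointer, return the driver's own vtable.
extern "C" fn driver_entry_direct(_kernel_vt: *const c_void) -> *const c_void {
    std::ptr::null() // a real driver returns its DriverVTable here
}

fn load_driver(
    manifest: &KabiDriverManifest,
    tier_bit: u32,
    kernel_vt: *const c_void,
) -> Option<*const c_void> {
    // Step 3: read transport_mask, select the entry for the assigned tier.
    if manifest.transport_mask & tier_bit == 0 {
        return None;
    }
    let entry = match tier_bit {
        1 => manifest.entry_direct,
        2 => manifest.entry_ring,
        _ => manifest.entry_ipc,
    }?;
    // Step 4: pass the kernel's vtable in; step 5: get the driver's back.
    Some(entry(kernel_vt))
}
```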
The KernelServicesVTable is also versioned and append-only. The vtable pointer
passed to a driver is valid for the entire driver lifetime (from entry_init to
module unload). The vtable is a static kernel object — it is never deallocated or
moved. Drivers may call services from any non-IRQ context.
// kernel-internal, not KABI
#[repr(C)]
pub struct KernelServicesVTable {
pub vtable_size: u64, // Bounds-safety check: byte count of this vtable
pub kabi_version: u64, // Primary version discriminant: KabiVersion::as_u64()
// Memory management.
// All sizes use u64, not usize, to maintain ABI stability across
// 32-bit (ARMv7, PPC32) and 64-bit targets (rule 3, [Section 12.2](#kabi-abi-rules-and-lifecycle)).
pub alloc_dma_buffer: unsafe extern "C" fn(
size: u64, align: u64, flags: AllocFlags,
) -> AllocResult,
pub free_dma_buffer: unsafe extern "C" fn(
handle: DmaBufferHandle,
) -> FreeResult,
// Interrupt management
pub register_interrupt: unsafe extern "C" fn(
irq: u32, handler: InterruptHandler, ctx: *mut c_void,
) -> IrqResult,
pub deregister_interrupt: unsafe extern "C" fn(
irq: u32,
) -> IrqResult,
// Logging
pub log: unsafe extern "C" fn(
level: u32, msg: *const u8, len: u32,
),
// Ring buffer creation (added in v2)
pub create_ring_buffer: Option<unsafe extern "C" fn(
entries: u32, entry_size: u32, flags: RingFlags,
) -> RingResult>,
// FMA health reporting (added in v3; `HealthEventClass` and `HealthSeverity`
// are defined in [Section 20.1](20-observability.md#fault-management-architecture))
pub fma_report_health: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
event_class: HealthEventClass,
event_code: u32,
severity: HealthSeverity,
data: *const u8,
data_len: u32,
) -> IoResultCode>,
// Socket wakeup notification (added in v4). Tier 1 NIC drivers call this
// to notify Tier 0 that data is available on a socket's receive queue,
// triggering epoll/io_uring wakeup for userspace waiters. Without this,
// Tier 1 drivers have no mechanism to wake socket-level waiters after
// delivering packets via the NAPI completion ring — the NAPI path
// delivers NetBufs to umka-net, but socket-level wakeup requires Tier 0
// involvement because the socket wait queue lives in Tier 0 address space.
//
// `sock_handle`: opaque handle identifying the socket (obtained from
// the flow table lookup in Tier 0 during connection setup).
// `events`: bitmask of ready events (EPOLLIN, EPOLLOUT, EPOLLERR).
//
// # Safety
// `sock_handle` must be a valid handle previously returned by
// `register_flow()`. Invalid handles are detected (generation check)
// and return `IoResultCode::InvalidHandle` without side effects.
pub wake_socket: Option<unsafe extern "C" fn(
sock_handle: SocketHandle,
events: u32,
) -> IoResultCode>,
// Character device registration (added in v5). Tier 1 drivers call these
// to register device number regions and create/remove device nodes in
// devtmpfs. The actual operations execute in Tier 0 (umka-core) because
// devtmpfs and the device number registry are Tier 0 data structures that
// Tier 1 drivers cannot access directly.
//
// These are V5+ optional methods. Drivers on older KABI versions (v1-v4)
// check `vtable_size` to confirm these fields are present before calling.
// A `None` value means the kernel does not support character device
// registration via KABI (should not occur in practice — v5+ kernels
// always populate these).
/// Register a character device region for a Tier 1 driver.
/// Called from the driver's probe/init to claim a contiguous range of
/// (major, minor) device numbers. Returns 0 on success, negative errno
/// on failure (e.g., -EBUSY if the region overlaps an existing
/// registration).
///
/// `ctx`: opaque kernel context pointer (passed to the driver at
/// entry_init; must be forwarded unchanged).
/// `major`: major device number. 0 = dynamically allocate.
/// `first_minor`: first minor number in the range.
/// `count`: number of consecutive minors to register.
/// `name`: driver name (ASCII, for /proc/devices listing).
/// `name_len`: length of `name` in bytes.
///
/// # Safety
/// `ctx` must be the context pointer from entry_init.
/// `name` must point to `name_len` valid bytes. Names longer than
/// 64 bytes are truncated to 64.
pub register_chrdev_region: Option<unsafe extern "C" fn(
ctx: *mut c_void,
major: u32,
first_minor: u32,
count: u32,
name: *const u8,
name_len: u32,
) -> i32>,
/// Create a device node in devtmpfs for a Tier 1 driver.
/// Called after `register_chrdev_region()` to make the device visible
/// to userspace at `/dev/<name>`.
///
/// `ctx`: opaque kernel context pointer.
/// `devt`: device number packed as `(major << 20) | minor`. This
/// encoding matches Linux's `MKDEV()` for 12-bit major / 20-bit minor.
/// `name`: device node name (e.g., "ttyS0", "dri/card0"). May contain
/// a single '/' for subdirectory creation (e.g., "input/event0").
/// `name_len`: length of `name` in bytes.
/// `mode`: file permissions (e.g., 0o666 for world-readable char devs).
///
/// Returns 0 on success, negative errno on failure.
///
/// # Safety
/// `ctx` must be the context pointer from entry_init.
/// `name` must point to `name_len` valid bytes.
pub devtmpfs_create_node: Option<unsafe extern "C" fn(
ctx: *mut c_void,
devt: u64,
name: *const u8,
name_len: u32,
mode: u32,
) -> i32>,
/// Remove a device node from devtmpfs.
/// Called during driver teardown (before `unregister_chrdev_region()`).
///
/// `ctx`: opaque kernel context pointer.
/// `devt`: device number (same encoding as `devtmpfs_create_node`).
///
/// Returns 0 on success, negative errno on failure (e.g., -ENOENT if
/// the node does not exist).
///
/// # Safety
/// `ctx` must be the context pointer from entry_init.
pub devtmpfs_remove_node: Option<unsafe extern "C" fn(
ctx: *mut c_void,
devt: u64,
) -> i32>,
/// Unregister a previously registered character device region.
/// Called during driver teardown after all device nodes in the region
/// have been removed via `devtmpfs_remove_node()`. Frees the device
/// number range for reuse.
///
/// `ctx`: opaque kernel context pointer.
/// `major`: major device number (must match the registration).
/// `first_minor`: first minor number (must match).
/// `count`: number of consecutive minors (must match).
///
/// Returns 0 on success, negative errno on failure.
///
/// # Safety
/// `ctx` must be the context pointer from entry_init.
pub unregister_chrdev_region: Option<unsafe extern "C" fn(
ctx: *mut c_void,
major: u32,
first_minor: u32,
count: u32,
) -> i32>,
// P2P DMA mapping (added in v6). Enables direct device-to-device DMA
// without host memory bounce. See [Section 4.14](04-memory.md#dma-subsystem) for lifecycle.
pub dma_p2p_map: Option<unsafe extern "C" fn(
src: DeviceHandle, dst: DeviceHandle, size: u64,
) -> i64>,
pub dma_p2p_unmap: Option<unsafe extern "C" fn(
handle: i64,
) -> i32>,
// ... extends over time, always append-only ...
}
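Driver-side, the vtable_size check for v5+ fields looks roughly like this. The struct below is an illustrative stub, not the real vtable above: the v1-v4 function slots are collapsed into a fixed placeholder so the v5 field lands at a stable offset, and has_register_chrdev is a hypothetical helper name.

```rust
use std::mem::size_of;

// Illustrative vtable stub — NOT the real KABI header.
#[repr(C)]
struct KernelVtable {
    kabi_version: u32,
    vtable_size: u32, // bounds guard: number of bytes the kernel populated
    earlier_entries: [usize; 4], // stand-in for the v1-v4 function pointers
    register_chrdev_region: Option<unsafe extern "C" fn()>, // v5+
}

/// A v5-built driver may run on a v1-v4 kernel whose vtable is shorter.
/// The field is usable only if it lies within vtable_size AND is non-null.
fn has_register_chrdev(vt: &KernelVtable) -> bool {
    // The v5 field is this stub's last member, so its end offset equals
    // size_of::<KernelVtable>(); real code would use offset_of! per field.
    let end = size_of::<KernelVtable>();
    (vt.vtable_size as usize) >= end && vt.register_chrdev_region.is_some()
}
```

Note that both conditions are required: an older kernel never allocated the field (vtable_size is too small), and a v5+ kernel may still leave an optional slot as None.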
12.3.1 CapValidationToken: Amortized Capability Validation¶
Capability checks on every KABI dispatch add per-call overhead. The CapValidationToken
mechanism amortizes this cost: a caller validates a capability once against a driver
and receives a long-lived token. Subsequent dispatches present the token rather than
re-validating the full capability tree. At each dispatch, the KABI trampoline checks
the token and produces a short-lived ValidatedCap<'dispatch>
(Section 9.1) scoped to that call via RCU guard.
/// Opaque identifier for a driver isolation domain. Allocated from a
/// monotonically increasing u64 counter at driver load time.
///
/// **Longevity**: At 100 driver loads per second (sustained crash loop —
/// abnormal operation), u64 exhausts in ~5.8 billion years. No recycling
/// mechanism is needed. If the counter were somehow exhausted, the driver
/// load syscall returns `ENOSPC` and an FMA event is emitted.
///
/// **Performance**: On 64-bit architectures (x86-64, AArch64, RISC-V 64,
/// PPC64LE), domain ID comparison is a single register compare — zero
/// overhead vs u32. On 32-bit architectures (ARMv7, PPC32), it becomes two
/// 32-bit comparisons (+1 cycle), negligible on a KABI dispatch path that
/// already costs 23+ cycles for the domain switch.
pub struct DriverDomainId(pub u64);
/// A long-lived capability validation token for KABI dispatch.
/// Created by presenting a `Capability` to the target driver domain via
/// `validate_cap()`. Stored by the caller and presented at each KABI
/// dispatch. Valid until the driver crashes (generation mismatch) or the
/// capability is revoked.
///
/// **Distinction from `ValidatedCap<'dispatch>`** ([Section 9.1](09-security.md#capability-based-foundation--capability-validation-amortization-validatedcapguard)):
/// `CapValidationToken` is a long-lived, cross-call token that drivers store
/// between KABI dispatches. At each dispatch, the KABI trampoline checks the
/// token's generation fields and, if valid, produces a short-lived
/// `ValidatedCap<'dispatch>` scoped to that single call via RCU guard.
/// The two types form a two-layer validation pipeline:
/// 1. `CapValidationToken` — amortizes initial validation (cap lookup, LSM, type check)
/// 2. `ValidatedCap<'dispatch>` — guarantees the capability remains valid for
/// the duration of the dispatch (RCU prevents revocation mid-call)
pub struct CapValidationToken {
/// The CapId of the CapEntry that was validated at token creation time.
/// Used for cap_table_lookup() on re-validation. This is the XArray key
/// in the capability table, known at token creation time from the
/// CapEntry lookup that produced the Capability object.
pub cap_id: CapId,
/// The underlying capability.
pub cap: Capability,
/// Driver domain that validated this capability.
/// Checked on every dispatch to prevent cross-domain misuse.
pub domain_id: DriverDomainId,
/// Generation counter of the driver domain at validation time.
/// The driver domain's `generation` field is incremented each time
/// the driver crashes and reloads. A mismatch means the token is stale.
pub driver_generation: u64,
/// Capability generation at validation time. The cap table entry's
/// generation is incremented when a capability is revoked and re-issued.
/// A mismatch means the capability was revoked after this token was created.
pub cap_generation: u64,
/// Global capability generation at validation time.
/// `GLOBAL_CAP_GENERATION` is incremented by bulk revocation operations
/// (`cap_revoke_all`, `setenforce`). A mismatch means a bulk revocation
/// occurred after this token was created.
pub global_gen: u64,
/// Credential generation snapshot from `current_task().cred_generation`
/// at token creation time. KABI dispatch compares this against the task's
/// current `cred_generation`; a mismatch forces re-validation because
/// the task's credentials changed (e.g., setuid/setgid). This prevents
/// a process from retaining capabilities that the new credential set
/// would not grant. See [Section 9.9](09-security.md#credential-model-and-capabilities).
///
/// **Tier 1 driver-initiated calls**: For KABI calls initiated by Tier 1
/// drivers (not from a task context), `cred_gen` is 0. Drivers do not
/// have task credentials — they operate under their `DeviceCapGrant`
/// scope, not under any process's credential set. The dispatch
/// trampoline skips the `cred_generation` comparison entirely when
/// `cred_gen == 0` — it does not call `current_task()` in interrupt
/// context where the interrupted task's credentials are irrelevant.
/// Credential-based revocation only applies to task-initiated calls
/// after `setuid()` / `setgid()` / `commit_creds()`.
pub cred_gen: u64,
/// LSM policy generation at token creation time. Populated from
/// `LSM_REGISTRY.policy_generation.load(Acquire)` when the token is
/// created. On each KABI dispatch, compared against the current
/// `LSM_REGISTRY.policy_generation`; a mismatch means an LSM policy
/// reload (SELinux `selinux_policy_load`, AppArmor `.replace`) has
/// occurred since this token was validated, and the cached LSM
/// decision may be stale. Forces full re-validation including
/// `security_file_receive()` / `security_capable()` re-evaluation.
/// See [Section 9.8](09-security.md#linux-security-module-framework--lsm-policy-generation).
pub policy_gen: u64,
/// Opaque rights bitmask extracted from the capability at validation time.
/// Cached here to avoid re-parsing the capability on each dispatch.
/// Uses `PermissionBits` encoding from [Section 9.1](09-security.md#capability-based-foundation).
pub cached_rights: u64,
}
Revocation semantics: Generation fields (driver_generation, cap_generation) are checked on every KABI dispatch — not just at validation time. A revoked capability takes effect at the next dispatch attempt, giving revocation latency = time until next dispatch (typically <1 μs under load). The generation check costs 12–25 ns (two u64 compares + branch). There is no background sweep or lazy invalidation — stale tokens are detected inline and rejected with KabiError::DomainCrashed.
Revocation latency (TOCTOU window): There is a bounded window between the CapValidationToken generation check (step 1 of dispatch entry) and the creation of ValidatedCap<'dispatch> (step 3, under RCU read lock). During this window, the capability may be revoked — the generation check passes because revocation has not yet incremented the generation, but by the time the RCU read lock is acquired the capability is logically invalid. The in-flight dispatch completes because the RCU read lock prevents the underlying resource from being freed. Window size: bounded by one dispatch duration — at most ~100 μs, enforced by the KABI dispatch timeout watchdog (Section 11.4).
Security implication: This window is equivalent to Linux's revocation latency at syscall boundaries: a capability (or file descriptor) revoked while a syscall is in progress takes effect only after the syscall returns. The in-flight dispatch was authorized at entry time and does not constitute a privilege escalation.
Security-critical revocation tightening: For security-critical revocations (container termination, credential change), the KABI dispatch trampoline performs an additional REVOKED_FLAG check on the CapEntry.active_ops field immediately after the generation check, before creating the ValidatedCap. This closes the TOCTOU window at the cost of one AtomicU32::load(Relaxed) (~1–5 cycles per dispatch). The check is:
// In KABI dispatch trampoline, after generation check passes:
if cap_is_revoked(cap_entry) {
return Err(KabiError::CapRevoked { cap_id: cap_entry.id });
}
// Proceed to create ValidatedCap<'dispatch> under RCU read lock.
The REVOKED_FLAG is set atomically by drain() during revocation (Section 9.1). Because drain() uses fetch_or(REVOKED_FLAG, AcqRel) and the trampoline check uses load(Relaxed), a revocation that completes before the trampoline check is guaranteed to be visible. The Relaxed ordering is sufficient because the check is a best-effort tightening — the generation check already provides correctness, and the REVOKED_FLAG check merely narrows the window.
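The drain()/trampoline interplay reduces to a single atomic bit. A minimal standalone sketch, with an illustrative flag value and the low bits standing in for the in-flight operation count (the real constant and field layout are defined in Section 9.1):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Illustrative flag bit in CapEntry.active_ops (assumed value).
const REVOKED_FLAG: u32 = 1 << 31;

/// Trampoline-side best-effort check. Relaxed suffices: the generation
/// check already guarantees correctness; this only narrows the window.
fn cap_is_revoked(active_ops: &AtomicU32) -> bool {
    active_ops.load(Ordering::Relaxed) & REVOKED_FLAG != 0
}

/// Revoker-side: drain() sets the flag with AcqRel and returns the
/// in-flight operation count packed in the low bits.
fn drain(active_ops: &AtomicU32) -> u32 {
    active_ops.fetch_or(REVOKED_FLAG, Ordering::AcqRel) & !REVOKED_FLAG
}
```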
12.3.1.1 CapValidationToken Creation API¶
A CapValidationToken is created by presenting a raw capability handle to the kernel's
capability validation path. This is the only entry point for obtaining a CapValidationToken;
there is no public constructor.
The canonical CapError enum is defined in Section 9.1.
Capability validation (validate_cap()) returns the following subset of CapError
variants, plus Expired, which is specific to the time-bounded capabilities used
in KABI validation:
// CapError — canonical definition in [Section 9.1](09-security.md#capability-based-foundation).
// Validation context uses variants: InvalidHandle, Revoked,
// InsufficientPermissions, WrongType, Expired.
/// Validate a raw capability handle against a target driver domain and return
/// a reusable `CapValidationToken` for subsequent KABI dispatches.
///
/// This function is called once per (caller, capability, driver) triple. The
/// returned `CapValidationToken` amortizes the cost of all subsequent dispatches
/// until the driver crashes (generation mismatch) or the capability is revoked.
///
/// # Arguments
/// - `handle`: The caller's local CapHandle obtained from `request_service()` or
/// inherited via `fork()`/`exec()`. This is a per-process local handle (index
/// into the caller's CapSpace), NOT a global CapId. The kernel resolves it
/// through the caller's CapSpace to obtain the underlying capability — this
/// resolution IS the ownership check, preventing cross-process privilege escalation.
/// - `required_perms`: The minimum `PermissionBits` the caller needs for its
/// intended operations. Validated against the capability's rights bitmask.
/// - `caller`: The calling task's credentials (UID, cgroup, security context).
/// Used for CapSpace lookup, LSM checks, and audit logging.
/// - `domain`: The target driver domain. The capability's service type must
/// match the domain's registered service type.
///
/// # Returns
/// - `Ok(CapValidationToken)` — token is valid for dispatch until invalidated.
/// - `Err(CapError::InvalidHandle)` — handle does not exist in caller's CapSpace.
/// - `Err(CapError::Revoked)` — capability was revoked; caller must re-acquire.
/// - `Err(CapError::InsufficientPermissions(missing))` — caller lacks required rights.
/// - `Err(CapError::WrongType)` — capability type does not match the domain.
/// - `Err(CapError::Expired)` — capability lifetime has elapsed.
///
/// # Security invariant
/// The caller can only validate capabilities present in their own CapSpace.
/// A process that learns a CapId through a side channel (e.g., by observing
/// another driver's IPC) CANNOT use it — the CapSpace lookup rejects handles
/// the process does not hold. This is analogous to how Unix file descriptors
/// are per-process: knowing fd 42 exists in another process does not let you
/// read from it.
///
/// # Performance
/// - O(1): CapSpace array lookup + XArray lookup (cap table) + three comparisons
/// + one LSM hook.
/// - Typical cost: ~60–120ns (CapSpace lookup ~10ns, rest as before).
/// - Called once per session, not per-dispatch — amortized to zero on the hot path.
pub fn validate_cap(
handle: CapHandle,
required_perms: PermissionBits,
caller: &TaskCredentials,
domain: &DriverDomain,
) -> Result<CapValidationToken, CapError> {
// Step 1: Resolve through caller's CapSpace (O(1) array index lookup).
// This is the OWNERSHIP CHECK. The caller can only present handles that
// exist in their own CapSpace — they cannot reference capabilities held
// by other processes or driver domains. CapSpace is indexed by local
// CapHandle (like a file descriptor table), not by global CapId.
let local_entry = caller.cap_space.lookup(handle)
.ok_or(CapError::InvalidHandle)?;
// Step 2: Cross-check against global cap table for revocation detection.
// The local CapSpace entry may be stale if the capability was revoked
// after delegation. The global table's generation is authoritative.
let global_entry = cap_table_lookup(local_entry.id)
.ok_or(CapError::Revoked)?;
// Step 3: Generation check — detect revocation since delegation.
if global_entry.generation != local_entry.cap.generation {
return Err(CapError::Revoked);
}
// Step 4: Expiry check (monotonic time comparison).
if global_entry.expires_ns != 0 {
let now = arch::current::cpu::read_cycle_counter_ns();
if now > global_entry.expires_ns {
return Err(CapError::Expired);
}
}
// Step 5: Type check — capability service type must match domain.
if global_entry.service_type != domain.service_type {
return Err(CapError::WrongType);
}
// Step 6: Permission check — required_perms must be a subset of held rights.
let held = global_entry.rights;
if !permission_bits_contains(held, required_perms) {
let missing = required_perms.bits() & !held.bits();
return Err(CapError::InsufficientPermissions(missing));
}
// Step 7: LSM hook (SELinux/AppArmor policy check).
// NOTE: LSM policy is evaluated here at validate_cap() time (token creation),
// not per-dispatch. This is intentional: dispatches happen millions of times
// per second; LSM checks add ~100-500ns each. Policy changes take effect when
// existing tokens expire or are invalidated (driver crash, capability
// revocation, explicit cap_revoke()). For immediate enforcement after
// setenforce(1), call cap_revoke_all() to invalidate all outstanding tokens.
lsm_check_cap_validate(caller, &global_entry, domain)?;
// Step 8: Build and return the validated token.
Ok(CapValidationToken {
cap_id: local_entry.id,
cap: global_entry.capability,
domain_id: domain.id,
driver_generation: domain.generation.load(Ordering::Acquire),
cap_generation: global_entry.generation,
global_gen: GLOBAL_CAP_GENERATION.load(Acquire),
cred_gen: current_task().cred_generation.load(Acquire),
policy_gen: LSM_REGISTRY.policy_generation.load(Acquire),
cached_rights: held.bits(),
})
}
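The subset test in Step 6 and the missing-bits computation reduce to two bitwise operations. A standalone sketch, assuming PermissionBits is an ordinary u64 bitmask (the helper name mirrors the one used in validate_cap above; missing_bits is a hypothetical companion):

```rust
/// True when every bit of `required` is present in `held`.
fn permission_bits_contains(held: u64, required: u64) -> bool {
    held & required == required
}

/// The bits the caller asked for but does not hold — this is the value
/// reported back in CapError::InsufficientPermissions(missing).
fn missing_bits(held: u64, required: u64) -> u64 {
    required & !held
}
```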
12.3.1.2 Calling Context for validate_cap()¶
validate_cap() is always called in task context — during module load, ioctl
dispatch, or syscall open. It is never called from interrupt or softirq context
(the CapSpace lookup requires a stable current_task() reference, and LSM hooks
may sleep on policy lookups).
Tier 1 drivers in interrupt/softirq context: Interrupt handlers and softirq
callbacks (NAPI poll, timer callbacks) cannot call validate_cap(). Instead,
Tier 1 drivers use pre-created tokens obtained during the device-open path.
When a userspace task opens a device file, the open handler validates the capability
and stores the resulting CapValidationToken in the per-file driver state. Subsequent
interrupt-context operations (DMA completion, packet RX) use this pre-validated token
for KABI dispatch without re-validating.
Multiple userspace tasks through the same driver: Each task creates its own
CapValidationToken via their own CapSpace during their open() call. The
token is bound to (task_cred_generation, cap_generation, domain_generation)
and is invalidated if any of these change. A single driver instance may hold
tokens from many different tasks simultaneously — each token is scoped to its
originating task's credentials and capability state.
Interrupt/softirq context credential scope: Tokens used in interrupt/softirq
context (e.g., NAPI poll, timer callback, DMA completion) have cred_gen = 0,
indicating device-initiated context with no associated task. These tokens are
exempt from credential-based revocation (setuid/setgroups changes) — only
cap_generation and domain_generation invalidate them. This is safe because
interrupt-context operations are on behalf of the device (bounded by
DeviceCapGrant), not a specific user. The dispatch trampoline skips the
cred_generation comparison entirely when cred_gen == 0 — it does NOT call
current_task(), which would return the interrupted task (irrelevant credentials).
See CapValidationToken.cred_gen field documentation.
12.3.1.3 Token Lifetime Summary¶
A CapValidationToken is invalidated by ANY of the following triggers:
- Driver crash: domain_generation mismatch — the driver has been reloaded since the token was created.
- Capability revocation: cap_generation mismatch — the specific capability referenced by this token has been revoked or modified.
- Credential change: setuid(), setgroups(), or similar calls bump cred_generation on the originating task, invalidating all tokens created under the old credentials.
- Bulk revocation: cap_revoke_all() increments GLOBAL_CAP_GENERATION, invalidating all outstanding tokens system-wide. Used after setenforce(1) or emergency security response.
- Explicit token drop: The holder drops the token voluntarily (e.g., on file close, driver unload, or session teardown).
The KABI dispatch trampoline checks all generation fields on every dispatch. A stale token is detected in O(1) — a few integer comparisons.
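The per-dispatch staleness test can be sketched over the token's generation snapshots. Stub types only: the field names loosely follow CapValidationToken, and the cred == 0 arm implements the device-initiated exemption described in Section 12.3.1.2.

```rust
/// Generation snapshots carried by a token (subset of CapValidationToken).
struct TokenGens { driver: u64, cap: u64, global: u64, cred: u64, policy: u64 }

/// Current authoritative counters read at dispatch time.
struct CurrentGens { driver: u64, cap: u64, global: u64, cred: u64, policy: u64 }

/// O(1) freshness test. cred == 0 marks a device-initiated token, which
/// is exempt from credential-based revocation (no current_task() lookup).
fn token_is_fresh(t: &TokenGens, now: &CurrentGens) -> bool {
    t.driver == now.driver
        && t.cap == now.cap
        && t.global == now.global
        && t.policy == now.policy
        && (t.cred == 0 || t.cred == now.cred)
}
```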
Using a CapValidationToken: The KABI dispatch trampoline checks the token before forwarding the call to the driver:
/// Errors returned by KABI dispatch and driver operations.
///
/// This is the canonical error type for all KABI-level operations.
/// Driver-specific errors are returned as `KabiError::DriverError(i32)`
/// where the i32 is a negated errno value from the driver.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KabiError {
/// The driver domain has crashed or been replaced since the handle
/// or CapValidationToken was issued. The caller knows which domain
/// crashed from context (the handle they called, or the
/// `CapValidationToken::domain_id` they validated against).
/// Caller must: (1) discard the stale handle/token, (2) wait for
/// the driver to recover (poll generation for odd, or use
/// service_recovered callback), (3) re-resolve the handle or
/// re-validate the capability against the new driver instance via
/// `request_service()`.
/// See "Caller recovery: DomainCrashed" below.
DomainCrashed,
/// The capability has been revoked since the CapValidationToken was
/// issued. The cap table entry's generation has advanced past the
/// token's `cap_generation`. The `cap_id` identifies which capability
/// is no longer valid. Caller must re-acquire the capability from
/// the authority that originally granted it (which may fail
/// permanently if the grant was one-shot or the granting authority
/// revoked the entire subtree). See "Caller recovery: CapRevoked" below.
CapRevoked { cap_id: CapId },
/// The CapValidationToken has been invalidated by a bulk revocation
/// (cap_revoke_all / setenforce) or by a credential change (setuid).
/// GLOBAL_CAP_GENERATION or cred_generation advanced past the token's
/// snapshot. Caller must discard the token and re-validate.
TokenRevoked,
/// The caller's cached_rights do not include the PermissionBits
/// required by the requested KABI operation. Each vtable method
/// declares its required permissions (see "KABI Operation Permission
/// Requirements" below). The caller must hold a capability with
/// sufficient rights, or the dispatch is rejected.
InsufficientPermissions,
/// The driver domain does not hold the required SystemCaps for the
/// requested operation. This is the dual-check counterpart to
/// `InsufficientPermissions` (which checks PermissionBits).
/// Example: alloc_dma_buffer requires CAP_DMA from SystemCaps
/// in addition to PermissionBits::WRITE. The domain's
/// `granted_syscaps` is set at device_init time via the
/// DeviceCapGrant and is immutable for the domain's lifetime.
InsufficientSystemCaps,
/// The requested vtable method is not implemented by this driver.
/// Returned when a null fn pointer is encountered in the vtable.
NotSupported,
/// The service is not authorized for the requested operation.
/// Capability rights check failed.
UnauthorizedService,
/// The driver's command submission queue is full.
/// Caller should retry after a brief delay.
QueueFull,
/// Too many dependencies in a single submission batch.
TooManyDeps,
/// The referenced semaphore or handle does not exist or was destroyed.
InvalidHandle,
/// The semaphore or resource is in use and cannot be destroyed.
Busy,
/// The semaphore was destroyed while a command was waiting on it.
SemaphoreDestroyed,
/// Another command is already signaling this semaphore.
AlreadySignaling,
/// Module load path validation failed (path traversal, invalid prefix).
InvalidPath,
/// Driver-specific error (negated errno from the driver implementation).
DriverError(i32),
// --- Dispatch-path variants (hot path, used in kabi_call!/kabi_call_async!) ---
/// The handle's cached generation does not match the module's current
/// generation counter. The module was unloaded, crashed, or replaced
/// since the handle was obtained. The caller must re-resolve the handle
/// via `DomainService::resolve()`.
StaleHandle,
/// The dispatch timed out waiting for a completion from the target
/// domain. The timeout is per-handle (`KabiHandle::timeout_ns`).
/// The request may still be in-flight — the caller cannot assume it
/// was not executed. Used by ring-based dispatch when
/// `wait_completion()` exceeds the deadline.
Timeout,
/// The component is undergoing live evolution (Phase A' quiescence).
/// The vtable is about to be swapped. Direct-path callers should
/// back off briefly (spin or yield) and retry — the evolution
/// typically completes within microseconds to milliseconds. After
/// evolution completes, the generation counter changes, so the
/// caller's next attempt will either succeed (same generation) or
/// get `StaleHandle` (generation advanced, re-resolve needed).
ComponentQuiescing,
/// The driver panicked during request processing. Detected by the
/// `catch_domain_panic()` setjmp/longjmp recovery wrapper in the
/// consumer loop. The domain is in an inconsistent state and will
/// be torn down by the crash recovery path.
DriverPanic,
/// The target domain has not completed initialization (its ring
/// is not yet set up). Returned by `dispatch_to_domain()` when
/// `domain.t1_ring` is `None`. The caller should retry after
/// the domain announces readiness.
DomainNotReady,
// --- Lifecycle variants (cold path, used in DomainService operations) ---
/// The requested service is not registered in any domain. Returned
/// by `DomainService::resolve()` when no module provides the
/// requested `ServiceId`.
ServiceNotFound,
/// The requested service version is incompatible with the caller's
/// requirements. Returned by `DomainService::resolve()` when the
/// provider's KABI version does not satisfy `is_compatible_with()`.
VersionMismatch,
/// Cross-domain ring setup failed (memory allocation, shared region
/// mapping, or doorbell registration). Returned by
/// `DomainService::resolve()` when a ring-transport binding cannot
/// be established.
RingSetupFailed,
/// The module manifest failed validation (invalid magic, unsupported
/// KABI version, malformed dependency list). Returned by
/// `DomainService::register()`.
ManifestInvalid,
/// The target domain has reached its maximum module capacity.
/// Returned by `DomainService::register()`.
DomainFull,
/// The specified module is not loaded in this domain. Returned by
/// `DomainService::resolve_all()` and `announce_ready()`.
ModuleNotFound,
/// A mandatory dependency is not yet available but may become
/// available later (the providing module is being loaded). The
/// caller should wait for a service-available notification and
/// retry. Returned by `DomainService::resolve_all()`.
DependencyDeferred,
/// The rebinding/migration operation timed out. The old domain's
/// quiescence drain did not complete within the migration deadline.
/// Returned by the rebinding path.
MigrationTimeout,
/// Internal error: a code path that should be unreachable was reached.
/// Used as a catch-all for exhaustive `RingError` matching on variants
/// that `wait_completion()` never produces (`Full`, `Overloaded`).
/// If this error is ever observed, it indicates a KABI subsystem bug.
InternalError,
}
impl KabiError {
/// Convert a `KabiError` to a negated errno value for the completion
/// ring `status` field. Used by the consumer loop when posting
/// completions for failed requests. Callers converting `KabiError` to
/// userspace-visible errors (e.g., syscall return) also use this.
pub fn to_errno(&self) -> i32 {
match self {
KabiError::DomainCrashed => -EIO,
KabiError::CapRevoked { .. } => -EACCES,
KabiError::TokenRevoked => -EACCES,
KabiError::InsufficientPermissions => -EPERM,
KabiError::InsufficientSystemCaps => -EPERM,
KabiError::NotSupported => -ENOSYS,
KabiError::UnauthorizedService => -EACCES,
KabiError::QueueFull => -EAGAIN,
KabiError::TooManyDeps => -E2BIG,
KabiError::InvalidHandle => -EBADF,
KabiError::Busy => -EBUSY,
KabiError::SemaphoreDestroyed => -EIDRM,
KabiError::AlreadySignaling => -EBUSY,
KabiError::InvalidPath => -EINVAL,
KabiError::DriverError(errno) => *errno,
KabiError::StaleHandle => -ESTALE,
KabiError::Timeout => -ETIMEDOUT,
KabiError::DriverPanic => -EIO,
KabiError::DomainNotReady => -EAGAIN,
KabiError::ServiceNotFound => -ENODEV,
KabiError::VersionMismatch => -ENOTSUP,
KabiError::RingSetupFailed => -ENOMEM,
KabiError::ManifestInvalid => -EINVAL,
KabiError::DomainFull => -ENOSPC,
KabiError::ModuleNotFound => -ENOENT,
KabiError::DependencyDeferred => -EAGAIN,
KabiError::MigrationTimeout => -ETIMEDOUT,
KabiError::InternalError => -EIO,
}
}
}
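The recovery guidance embedded in the variant doc comments above suggests a caller-side classifier. The sketch below mirrors only a subset of KabiError, and the Recovery type is hypothetical — not part of the KABI surface:

```rust
// Subset stub of KabiError, just enough to drive the classifier.
enum KabiErr {
    QueueFull, DomainNotReady, ComponentQuiescing, DependencyDeferred,
    DomainCrashed, StaleHandle, CapRevoked, TokenRevoked, DriverError(i32),
}

#[derive(Debug, PartialEq)]
enum Recovery {
    RetrySoon,      // transient: back off briefly and resubmit
    ReResolve,      // handle is stale: re-resolve via request_service()
    ReAcquireCap,   // capability gone: re-acquire from the granting authority
    Propagate(i32), // surface to the caller as a negated errno
}

fn classify(e: &KabiErr) -> Recovery {
    use KabiErr::*;
    match e {
        QueueFull | DomainNotReady | ComponentQuiescing | DependencyDeferred
            => Recovery::RetrySoon,
        DomainCrashed | StaleHandle => Recovery::ReResolve,
        CapRevoked | TokenRevoked => Recovery::ReAcquireCap,
        DriverError(errno) => Recovery::Propagate(*errno),
    }
}
```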
/// Status codes for cross-domain KABI call responses.
///
/// Returned in every `KabiResponse` to indicate the outcome of a dispatched
/// request. Values 0x0000–0x00FF are reserved for KABI-level status codes;
/// values 0x0100–0xFFFF are available for driver-class-specific extensions
/// (defined per-vtable in the `.kabi` IDL, allocated by `kabi-gen`).
///
/// Maps to `KabiError` variants for the error cases: the dispatch trampoline
/// converts `KabiStatus` into `Result<KabiResponse, KabiError>` at the
/// call site boundary. `KabiStatus` is the wire encoding; `KabiError` is
/// the Rust-idiomatic caller-facing type.
#[repr(u32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KabiStatus {
/// The call completed successfully. `KabiResponse::result` contains
/// the serialized return value.
Success = 0x0000,
/// The target vtable method is not implemented (null function pointer
/// or method_index beyond vtable bounds).
NotSupported = 0x0001,
/// The caller's PermissionBits are insufficient for this method.
InsufficientPermissions = 0x0002,
/// The caller's SystemCaps are insufficient for this method.
InsufficientSystemCaps = 0x0003,
/// The capability token used by the caller has been revoked
/// (e.g., driver unload, live evolution swap, or administrative
/// revocation via `cap_revoke()`). The caller must re-acquire
/// the capability before retrying.
TokenRevoked = 0x0004,
/// The driver domain has crashed or been replaced since the
/// CapValidationToken was issued.
DomainCrashed = 0x0005,
/// The capability was revoked after the CapValidationToken was created.
CapRevoked = 0x0006,
/// The argument buffer is malformed (failed deserialization or
/// constraint validation). The driver did not execute the method.
InvalidArgument = 0x0007,
/// The driver's inbound ring is full (T1/T2 transports only).
/// Caller should back off and retry.
QueueFull = 0x0008,
/// The driver returned a driver-specific error. The negated errno
/// value is stored in `KabiResponse::driver_errno`.
DriverError = 0x0009,
/// Internal dispatch error (e.g., domain lookup failure, transport
/// fault). Should not occur under normal operation; logged to FMA.
InternalError = 0x00FF,
}
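The code-space split (0x0000-0x00FF KABI-level, 0x0100-0xFFFF driver-class) implies a first-stage decode on the raw wire value before any variant-specific handling. A sketch with illustrative names:

```rust
/// Which code space a raw on-wire status value belongs to.
#[derive(Debug, PartialEq)]
enum StatusClass {
    Kabi(u32),        // 0x0000-0x00FF: KABI-level, fixed meanings
    DriverClass(u32), // 0x0100-0xFFFF: per-vtable, allocated by kabi-gen
    OutOfRange,       // above 0xFFFF: malformed response
}

fn classify_status(raw: u32) -> StatusClass {
    match raw {
        0x0000..=0x00FF => StatusClass::Kabi(raw),
        0x0100..=0xFFFF => StatusClass::DriverClass(raw),
        _ => StatusClass::OutOfRange,
    }
}
```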
/// Cross-domain KABI call request.
///
/// Kernel-internal dispatch context for a single vtable method invocation.
/// This struct is NOT the wire format written to shared ring buffers.
/// The ring wire formats are `T1CommandEntry` ([Section 12.8](#kabi-domain-runtime))
/// and `T2CommandEntry` ([Section 12.6](#kabi-transport-classes)), which contain only
/// the transport-relevant fields (method_index, args, cookie). The
/// `vtable_perm_table` and `vtable_syscap_table` pointer fields are consumed
/// by the kernel-side dispatch trampoline (Steps 3-4 of
/// `kabi_dispatch_with_vcap`) BEFORE the request is serialized into the
/// transport-specific ring entry format — they never cross a domain boundary.
///
/// For T0 (direct) transport, `KabiRequest` is constructed on the caller's
/// stack — no serialization overhead. For T1/T2 transports, the trampoline
/// constructs a `T1CommandEntry` or `T2CommandEntry` from the relevant fields.
///
/// The `method_index` field is the zero-based ordinal of the target method
/// in the vtable (same encoding as `T2CommandEntry::method_index` in
/// [Section 12.6](#kabi-transport-classes)). The dispatch trampoline validates it
/// against `vtable_size` before dereferencing the function pointer.
#[repr(C)]
pub struct KabiRequest {
/// Vtable method ordinal (zero-based index). Validated against
/// `(vtable.vtable_size - VTABLE_HEADER_SIZE) / size_of::<fn()>()`
/// before dispatch. Out-of-bounds values produce `KabiStatus::NotSupported`.
pub method_index: u32,
/// Explicit alignment padding: method_index (u32, offset 0-3) is followed by
/// cap_token (u64, requires 8-byte alignment). Per CLAUDE.md rule 11.
pub _pad0: u32,
/// Capability handle authorizing this call. Resolved through the
/// caller's CapSpace to the underlying capability. The dispatch
/// trampoline uses this to locate the target driver domain and
/// to verify PermissionBits and SystemCaps. For T0 direct calls
/// within the Core domain, this field carries `CAP_HANDLE_KERNEL`
/// (a sentinel value indicating intra-Core dispatch with no
/// capability check — Core trusts itself).
pub cap_token: CapHandle,
/// Length of the serialized argument data in bytes. The dispatch
/// trampoline validates `args_len <= KABI_MAX_ARGS_LEN` (default:
/// 64 KiB) before reading argument data. Zero is valid (methods
/// with no parameters).
pub args_len: u32,
/// Padding for pointer alignment of vtable_perm_table. Must be zero.
pub _pad1: u32,
/// Pointer to the static per-vtable permission table, set by the
/// generated caller stub. Each entry maps a method ordinal to its
/// required `PermissionBits`. The dispatch trampoline indexes this
/// table by `method_index` — O(1), no branching.
pub vtable_perm_table: *const PermissionBits,
/// Pointer to the static per-vtable SystemCaps table, set by the
/// generated caller stub. Methods without `@syscap` annotation have
/// `SystemCaps::empty()` in their slot. The dispatch trampoline
/// skips the SystemCaps check when the entry is empty.
pub vtable_syscap_table: *const SystemCaps,
// --- Variable-length argument data follows the header ---
// For T0 transport: `args_ptr` points directly to caller's stack frame
// (zero-copy, same address space).
// For T1 transport: `args_ptr` points into the shared ring argument
// region (zero-copy, cross-domain shared memory).
// For T2 transport: `args_ptr` points into the mmap'd shared argument
// buffer (offset validated against buffer_size).
/// Pointer to the serialized argument data. The encoding format is
/// defined by `kabi-gen` per-method: fixed-layout `#[repr(C)]` structs
/// for methods with multiple parameters, or the raw parameter value
/// for single-parameter methods. The dispatch trampoline passes this
/// pointer (with `args_len`) to the target function after validation.
pub args_ptr: *const u8,
}
// KabiRequest (64-bit): method_index(u32=4,off=0) + _pad0(u32=4,off=4) +
// cap_token(u64=8,off=8) + args_len(u32=4,off=16) + _pad1(u32=4,off=20) +
// vtable_perm_table(ptr=8,off=24) + vtable_syscap_table(ptr=8,off=32) +
// args_ptr(ptr=8,off=40) = 48 bytes. All padding explicit.
// KabiRequest (32-bit): method_index(4,off=0) + _pad0(4,off=4) +
// cap_token(u64=8,off=8) + args_len(4,off=16) + _pad1(4,off=20) +
// vtable_perm_table(ptr=4,off=24) + vtable_syscap_table(ptr=4,off=28) +
// args_ptr(ptr=4,off=32) = 36 bytes. _pad0 provides alignment for cap_token.
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<KabiRequest>() == 48);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<KabiRequest>() == 36);
/// Sentinel CapHandle value for intra-Core-domain dispatch (T0 transport).
/// When the caller and callee are both in the Core domain, no capability
/// check is needed — Core trusts itself. The dispatch trampoline recognizes
/// this sentinel and skips Steps 1–4 of `kabi_dispatch_with_vcap`.
pub const CAP_HANDLE_KERNEL: CapHandle = CapHandle(u64::MAX);
/// Maximum argument data length per KABI request. Requests exceeding this
/// limit are rejected with `KabiStatus::InvalidArgument` without entering
/// the driver domain. 64 KiB is sufficient for all current vtable methods
/// (the largest argument struct is `AccelSubmitParams` at ~512 bytes).
/// This bound prevents a malicious or buggy caller from causing unbounded
/// memcpy in the dispatch path.
pub const KABI_MAX_ARGS_LEN: u32 = 64 * 1024;
/// Maximum result data length per KABI response. Responses exceeding this
/// limit are truncated and `KabiStatus::InternalError` is set.
pub const KABI_MAX_RESULT_LEN: u32 = 64 * 1024;
/// Cross-domain KABI call response.
///
/// Returned by `dispatch_to_domain()` after the target method completes
/// (or immediately on dispatch failure). The fixed-size header contains
/// the status code and result metadata; the variable-length result data
/// follows. For T0 (direct) transport, the result is written into the
/// caller's stack-allocated response buffer. For T1/T2 (ring) transports,
/// the result is written into the shared completion ring / result buffer.
#[repr(C)]
pub struct KabiResponse {
/// Outcome of the dispatch. `KabiStatus::Success` indicates the
/// method executed and `result_ptr` / `result_len` contain valid
/// return data. Any other value indicates a dispatch-level or
/// driver-level failure; `result_len` is zero on failure.
pub status: KabiStatus,
/// Driver-specific errno (valid only when `status == KabiStatus::DriverError`).
/// Contains the negated errno value returned by the driver's vtable
/// method (e.g., -EIO = -5). Zero when `status != DriverError`.
pub driver_errno: i32,
/// Length of the serialized result data in bytes. Zero on failure
/// or for void-returning methods. The caller must not read beyond
/// `result_ptr + result_len`.
pub result_len: u32,
/// Padding for 8-byte alignment. Must be zero.
pub _pad: u32,
/// Pointer to the serialized result data. Encoding matches the
/// method's return type as generated by `kabi-gen` (same `#[repr(C)]`
/// convention as `KabiRequest::args_ptr`). Null when `result_len == 0`.
///
/// Lifetime: for T0 transport, this points into the caller's stack
/// frame and is valid until the caller returns. For T1/T2 transport,
/// this points into the shared result buffer and is valid until the
/// caller acknowledges the completion entry.
pub result_ptr: *const u8,
}
// KabiResponse: status(KabiStatus=u32=4) + driver_errno(i32=4) + result_len(u32=4) +
// _pad(u32=4) + result_ptr(ptr).
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<KabiResponse>() == 24);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<KabiResponse>() == 20);
/// Dispatch a validated KABI request to the target driver domain.
///
/// This is the core cross-domain dispatch function. It is called by
/// `kabi_dispatch_with_vcap` after all capability, permission, and
/// SystemCaps checks have passed. The transport mechanism is selected
/// based on the target domain's assigned tier.
///
/// # Arguments
/// - `domain`: The target driver domain (contains the vtable pointer,
/// ring buffer handles, tier assignment, and isolation key).
/// - `request`: The validated request (method_index, args, cap_token).
/// - `caller_rights`: The caller's cached PermissionBits as u64 (from the
/// CapValidationToken.cached_rights). Passed to the driver domain for optional
/// fine-grained authorization within the driver's own logic.
///
/// # Transport selection
///
/// The dispatch path is determined by `domain.assigned_tier`, which was
/// set at driver load time by the loader algorithm
/// ([Section 12.6](#kabi-transport-classes--kabidrivermanifest-transport-capability-advertisement)).
/// The tier never changes for a loaded driver instance — transport
/// selection is a single branch on a cache-hot field.
///
/// # Returns
/// - `Ok(KabiResponse)` with `status == Success` on successful dispatch.
/// - `Err(KabiError::NotSupported)` if `method_index` exceeds the vtable.
/// - `Err(KabiError::QueueFull)` if the T1/T2 ring is full.
/// - `Err(KabiError::DriverError(errno))` if the driver returns an error.
fn dispatch_to_domain(
domain: &DriverDomain,
request: &KabiRequest,
caller_rights: u64,
) -> Result<KabiResponse, KabiError> {
match domain.assigned_tier {
// Tier 0: Direct vtable call. No domain boundary to cross.
// The caller and callee share the same address space (Core domain).
// Dispatch cost: ~2-5 cycles (indirect call through vtable pointer).
//
// The method_index is bounds-checked against vtable_size. The
// function pointer is read from the vtable and called directly
// with the deserialized arguments. The return value is serialized
// into the KabiResponse on the caller's stack.
IsolationTier::Tier0 => {
let vtable = domain.vtable_ptr;
let method_count = vtable_method_count(vtable);
if request.method_index as usize >= method_count {
return Err(KabiError::NotSupported);
}
// SAFETY: method_index is bounds-checked above. vtable_ptr
// is valid for the domain's lifetime (Tier 0 modules are
// load_once — never unloaded). The args_ptr and args_len
// were validated by the caller stub.
let result = unsafe {
vtable_dispatch_direct(vtable, request.method_index, request.args_ptr, request.args_len)
};
Ok(result)
}
// Tier 1: Ring buffer + domain switch (Transport T1).
// The callee runs in Ring 0 in a hardware-isolated memory domain
// (MPK/POE/DACR). Communication uses the T1 ring buffer protocol
// defined in [Section 12.6](#kabi-transport-classes).
//
// The driver runs its own consumer loop inside the isolated domain.
// The kernel submits requests to the T1 command ring; the driver's
// consumer dequeues and processes them within the domain boundary.
// Responses are written to the T1 completion ring.
//
// Dispatch cost: ~200-500 cycles for individual operations. With
// batching (N operations per domain-switch cycle), amortized cost
// drops to ~23-80 cycles per operation — achieving NEGATIVE overhead
// at N≥12 vs Linux's monolithic direct-call model.
//
// Crash recovery: if the driver faults, its consumer loop dies.
// The ring transitions to Disconnected state. Pending requests
// receive KabiError::DomainCrashed. The kernel never enters the
// driver domain directly — no kernel stack frames are corrupted,
// no kernel locks are leaked, no exception fixup tables are needed.
//
// This is the fundamental advantage of ring dispatch for Tier 1:
// crash containment is TRIVIAL because the kernel and driver never
// share a call stack.
IsolationTier::Tier1 => {
let ring = domain.t1_ring.as_ref()
.ok_or(KabiError::DomainNotReady)?;
if ring.is_disconnected() {
return Err(KabiError::DomainCrashed);
}
// Serialize request into a T1 ring command entry. The ring is
// passed to compute arg_offset (relative to the shared arg buffer)
// and to generate a correlation cookie.
let cmd = T1CommandEntry::from_request(request, ring);
// Submit to the command ring. If ring is full, return QueueFull
// for caller to backoff/retry.
let seq = ring.submit(cmd).map_err(|_| KabiError::QueueFull)?;
// Wait for the corresponding completion (or domain crash).
// The driver's consumer loop processes the request within its
// isolated domain and writes the result to the completion ring.
match ring.wait_completion(seq, domain.timeout_ns) {
Ok(completion) => Ok(validate_t1_completion(completion, domain)?),
Err(RingError::Disconnected) => Err(KabiError::DomainCrashed),
Err(RingError::Timeout) => Err(KabiError::Timeout),
// Full and Overloaded are never returned by wait_completion.
Err(_) => Err(KabiError::InternalError),
}
}
// Tier 2: Ring buffer + syscall. The callee runs in Ring 3 as a
// separate process. Communication is fully asynchronous via the
// T2 command/completion rings defined in [Section 12.6](#kabi-transport-classes).
// Dispatch cost: ~1-5 microseconds (ring enqueue + wake + dequeue +
// privilege transition).
//
// 1. Serialize the request into a T2CommandEntry.
// 2. Enqueue into the driver's command ring (atomic head update).
// 3. If the driver is sleeping on poll(), send a wake-up event.
// 4. Wait for the corresponding T2CompletionEntry (poll the
// completion ring or block on the completion eventfd).
// 5. Deserialize the completion into a KabiResponse.
//
// Ring full: if the command ring has no free slots, return
// KabiError::QueueFull immediately (non-blocking). The caller
// is responsible for backoff and retry.
IsolationTier::Tier2 => {
// Compute offset relative to the shared buffer base.
// `request.args_ptr` is a kernel virtual address pointing into the
// mmap'd shared argument buffer. The T2 driver process maps the
// same physical pages at its own virtual address, so we communicate
// the offset within the buffer (not the pointer value).
let arg_offset = (request.args_ptr as usize)
.checked_sub(domain.shared_buf_base as usize)
.expect("args_ptr outside shared buffer") as u32;
let cmd = T2CommandEntry {
method_index: request.method_index,
flags: 0, // T2_CMD_NOTIFY set by default
arg_offset,
arg_len: request.args_len,
cookie: domain.next_cookie.fetch_add(1, Ordering::Relaxed),
_reserved: [0u8; 40],
};
let enqueued = domain.cmd_ring.try_enqueue(&cmd);
if !enqueued {
return Err(KabiError::QueueFull);
}
// Wake the Tier 2 driver process if it is blocked on poll().
domain.wake_driver();
// Block until the completion entry with matching cookie arrives.
let completion = domain.completion_ring.wait_for_cookie(cmd.cookie);
Ok(validate_t2_completion(completion, domain)?)
}
}
}
/// Validate a Tier 1 ring completion entry before converting to `KabiResponse`.
///
/// Tier 1 completions are written by the driver's consumer loop within its
/// isolated Ring 0 domain. Although the driver runs in Ring 0, its memory
/// domain is hardware-isolated — the kernel cannot trust completion data
/// without validation, because a buggy (not malicious) driver could produce
/// malformed completions.
///
/// Validation rules:
/// 1. `status` is a valid errno (in `[-4095, 0]`) or zero (success).
/// 2. `result_len <= KABI_MAX_RESULT_LEN` (prevents buffer overread).
/// 3. `result_offset + result_len <= domain.result_buffer_size` (bounds check).
///
/// Invalid values are treated as `KabiError::DriverError(EIO)` — the driver
/// produced a malformed completion, logged as a soft fault via FMA.
fn validate_t1_completion(
completion: T1CompletionEntry,
domain: &DriverDomain,
) -> Result<KabiResponse, KabiError> {
// Rule 1: status must be 0 (success) or a valid negated errno.
if completion.status > 0 || completion.status < -4095 {
domain.fma_report_soft_fault(SoftFaultKind::MalformedCompletion);
return Err(KabiError::DriverError(EIO));
}
// Rule 2: result_len within maximum.
if completion.result_len > KABI_MAX_RESULT_LEN {
domain.fma_report_soft_fault(SoftFaultKind::MalformedCompletion);
return Err(KabiError::DriverError(EIO));
}
// Rule 3: result data within the shared result buffer bounds.
let end = (completion.result_offset as u64)
.checked_add(completion.result_len as u64);
if end.is_none() || end.unwrap() > domain.result_buffer_size as u64 {
domain.fma_report_soft_fault(SoftFaultKind::MalformedCompletion);
return Err(KabiError::DriverError(EIO));
}
Ok(KabiResponse {
status: if completion.status == 0 {
KabiStatus::Success
} else {
KabiStatus::DriverError
},
driver_errno: completion.status,
result_len: completion.result_len,
_pad: 0,
// SAFETY: bounds checked above; result_offset + result_len <= buffer_size.
result_ptr: unsafe {
domain.result_buffer_base.add(completion.result_offset as usize)
},
})
}
/// Validate a Tier 2 ring completion entry before converting to `KabiResponse`.
///
/// Tier 2 completions are written by the driver process in Ring 3. The driver
/// is explicitly untrusted — malicious completions must be caught. The same
/// three validation rules as T1 apply, but are SECURITY-CRITICAL here (T1
/// validation catches bugs; T2 validation prevents exploitation).
fn validate_t2_completion(
completion: T2CompletionEntry,
domain: &DriverDomain,
) -> Result<KabiResponse, KabiError> {
// Rule 1: status must be 0 or a valid negated errno.
if completion.status > 0 || completion.status < -4095 {
domain.fma_report_soft_fault(SoftFaultKind::MalformedCompletion);
return Err(KabiError::DriverError(EIO));
}
// Rule 2: result_len within maximum.
if completion.result_len > KABI_MAX_RESULT_LEN {
domain.fma_report_soft_fault(SoftFaultKind::MalformedCompletion);
return Err(KabiError::DriverError(EIO));
}
// Rule 3: result data within the shared result buffer bounds.
let end = (completion.result_offset as u64)
.checked_add(completion.result_len as u64);
if end.is_none() || end.unwrap() > domain.result_buffer_size as u64 {
domain.fma_report_soft_fault(SoftFaultKind::MalformedCompletion);
return Err(KabiError::DriverError(EIO));
}
Ok(KabiResponse {
status: if completion.status == 0 {
KabiStatus::Success
} else {
KabiStatus::DriverError
},
driver_errno: completion.status,
result_len: completion.result_len,
_pad: 0,
// SAFETY: bounds checked above; result_offset + result_len <= buffer_size.
result_ptr: unsafe {
domain.result_buffer_base.add(completion.result_offset as usize)
},
})
}
/// RAII guard that holds an RCU read-side critical section for the
/// duration of a KABI dispatch. Prevents capability revocation from
/// completing while the dispatch is in progress: `cap_revoke()` calls
/// `synchronize_rcu()` after setting `REVOKED_FLAG`, so as long as this
/// guard is live, the capability entry and its backing resources remain
/// valid.
///
/// See [Section 9.1](09-security.md#capability-based-foundation) for the revocation protocol.
pub struct KabiDispatchGuard { _private: () }
impl KabiDispatchGuard {
/// Enter an RCU read-side critical section for KABI dispatch.
pub fn enter() -> Self {
rcu_read_lock();
Self { _private: () }
}
}
impl Drop for KabiDispatchGuard {
fn drop(&mut self) {
rcu_read_unlock();
}
}
fn kabi_dispatch_with_vcap(
vcap: &CapValidationToken,
domain: &DriverDomain,
request: &KabiRequest,
) -> Result<KabiResponse, KabiError> {
// RCU read-side critical section: prevents concurrent cap_revoke()
// from completing (it calls synchronize_rcu() before freeing) while
// this dispatch is in flight. The guard also anchors the lifetime of
// any ValidatedCap<'dispatch> produced downstream — see
// [Section 9.1](09-security.md#capability-based-foundation--capability-validation-amortization-validatedcapguard).
let _guard = KabiDispatchGuard::enter();
// Step 0: Check global cap generation (incremented by cap_revoke_all / setenforce).
// This is a rare-path check — only fires after bulk revocation.
if vcap.global_gen != GLOBAL_CAP_GENERATION.load(Acquire) {
return Err(KabiError::TokenRevoked);
}
// Step 0b: Credential generation check. Detects credential changes
// (setuid, setgid, setgroups, commit_creds) since the token was
// created. A mismatch means the task's privileges changed and the
// cached capability validation may no longer be valid.
// See [Section 9.9](09-security.md#credential-model-and-capabilities).
//
// Skip if cred_gen is 0: tokens created for interrupt/softirq
// context (e.g., NAPI poll, timer callback, DMA completion, or
// the Tier 0 IRQ handler posting to a driver's IRQ ring) have
// cred_gen = 0 because there is no meaningful task credential
// context. In interrupt context, current_task() returns the
// interrupted task, whose credentials are irrelevant to the
// handler's dispatch. These tokens are validated against the
// driver's DeviceCapGrant, not against any task's credentials.
// See CapValidationToken.cred_gen field documentation above.
if vcap.cred_gen != 0 {
if vcap.cred_gen != current_task().cred_generation.load(Acquire) {
return Err(KabiError::TokenRevoked);
}
}
// Step 1: Driver liveness check.
// Single atomic load — L1-resident, ~1-3 cycles.
let current_gen = domain.generation.load(Ordering::Acquire);
if vcap.domain_id != domain.id || vcap.driver_generation != current_gen {
return Err(KabiError::DomainCrashed);
}
// Step 2: Capability generation check (prevents use of revoked caps).
// Re-read the capability object's generation from the cap table.
// If it has advanced past vcap.cap_generation, the capability was
// revoked (and possibly re-issued to a different principal).
// cap_table_lookup is O(1) — XArray lookup by cap_id.
//
// **Confusion attack prevention**: Steps 1+2 together prevent token
// reuse across driver replacements. If driver A crashes and driver B
// starts on the same domain_id, domain.generation increments,
// invalidating all of A's tokens at Step 1. A token cannot be used
// against B even if B reuses the domain_id.
let cap = cap_table_lookup(vcap.cap_id)?;
if cap.generation != vcap.cap_generation {
return Err(KabiError::CapRevoked { cap_id: vcap.cap_id });
}
// Step 2b: Explicit revocation flag check.
// Catches the window between cap_revoke() setting REVOKED_FLAG
// (bit 63 of active_ops, via fetch_or(REVOKED_FLAG, AcqRel)) and
// the generation increment (which happens on re-issue, not on
// revoke). The generation check (Step 2) catches re-issue; this
// check catches revocation-before-reissue.
//
// Cost: one AtomicU64::load(Acquire) — ~1-5 cycles.
// REVOKED_FLAG is defined in [Section 9.1](09-security.md#capability-based-foundation)
// as bit 63 of CapEntry.active_ops.
//
// Uses cap_is_revoked() which loads active_ops with Acquire
// ordering. This is stronger than the Relaxed ordering mentioned
// in the TOCTOU narrative above, but correct: the Acquire pairs
// with the AcqRel in drain()'s fetch_or(REVOKED_FLAG), ensuring
// visibility of the revocation across all architectures.
if cap_is_revoked(&cap) {
return Err(KabiError::CapRevoked { cap_id: vcap.cap_id });
}
// Step 3: Permission check (verifies required PermissionBits for this op).
// Each KabiRequest carries a method index that maps to a static
// per-vtable permission table (see "KABI Operation Permission
// Requirements" section below). The lookup is a constant-time
// array index, not a runtime computation.
let required = request.required_permissions();
if !permission_bits_contains(vcap.cached_rights, required) {
return Err(KabiError::InsufficientPermissions);
}
// Step 4: SystemCaps dual-check (verifies the domain holds the
// required administrative capabilities for this method).
//
// PermissionBits (Step 3) control what a capability allows on its
// target object. SystemCaps control what system-wide operations a
// domain may perform. Both checks must pass — they are orthogonal.
//
// Example: alloc_dma_buffer requires PermissionBits::WRITE (the
// caller may write/allocate via this capability) AND SystemCaps::
// CAP_DMA (the domain is permitted to perform DMA operations).
// A domain with WRITE but without CAP_DMA cannot allocate DMA
// buffers — this prevents a driver that only needs memory allocation
// from escalating to DMA access by calling alloc_dma_buffer.
//
// The required_syscaps() lookup is the same pattern as
// required_permissions(): a static per-vtable table indexed by
// method ordinal, generated by kabi-gen from @syscap annotations.
// Methods with no @syscap annotation have SystemCaps::empty()
// (no additional SystemCaps required beyond PermissionBits).
let required_sys = request.required_syscaps();
if !required_sys.is_empty() {
// domain.granted_syscaps is an RcuCell — read via RCU for
// lock-free access. Live policy updates are picked up automatically
// on the next dispatch without token invalidation.
let syscaps = domain.granted_syscaps.rcu_read();
if !syscaps.contains(required_sys) {
return Err(KabiError::InsufficientSystemCaps);
}
}
// Step 5: Dispatch to driver domain.
// All checks passed — forward with the caller's cached rights.
// _guard drops here → rcu_read_unlock(). The RCU grace period
// cannot complete until after dispatch returns, ensuring the
// capability entry remains valid for the entire call.
dispatch_to_domain(domain, request, vcap.cached_rights)
}
// Performance note: Steps 2-4 add ~14-28ns per dispatch.
// - rcu_read_lock / rcu_read_unlock: nesting counter inc/dec (~1-2ns total,
// CpuLocal register — no memory barrier on most arches).
// - cap_table_lookup: O(1) XArray read (~5-10ns cache-warm / ~50-100ns cache-cold;
// amortized cost for repeatedly accessed capabilities is at the low end — KABI
// dispatch typically accesses the same capability thousands of times).
// - cap.generation comparison: single u64 compare (~1ns).
// - cap_is_revoked (Step 2b): AtomicU64::load(Acquire) on active_ops (~1-5ns;
// typically L1-resident because cap_table_lookup just touched the CapEntry).
// - required_permissions(): static table index (~1ns).
// - permission_bits_contains(): single bitwise AND + compare (~1ns).
// - required_syscaps(): static table index (~1ns).
// - syscaps contains(): single u128 AND + compare (~1-2ns).
// The SystemCaps check adds ~2-3ns. For methods with no @syscap
// annotation, is_empty() returns true immediately (branch predicted).
// Total KABI dispatch is ~200-500ns for the domain crossing itself,
// so the re-validation overhead is <10% of the total call cost.
12.3.2 CapValidationToken Invalidation on Driver Crash¶
When a Tier 1 driver crashes and is reloaded, any CapValidationTokens issued
against the old instance are stale. The generation-counter mechanism closes this
window without requiring a global scan of all callers.
DriverDomain generation counter:
// All generation counters in UmkaOS use u64.
// Rationale: u64 provides 584 years of wraparound-free operation at 1 billion
// driver reloads per second. Silent u64→u32 truncation in generation comparisons
// could allow stale handles to pass validation after a counter wraparound,
// creating a security vulnerability. The uniform u64 policy prevents this class
// of bug entirely.
/// Kernel-managed state for a Tier 1 driver isolation domain.
/// Stored in umka-core memory (never in the driver's own domain) so it
/// remains valid and writable after the driver domain is torn down.
pub struct DriverDomain {
/// Unique domain identifier. Never reused after domain destruction.
pub id: DriverDomainId,
/// Generation counter. Starts at 1 (odd = active). Incremented to an
/// even value on crash (marking inactive), then to the next odd value
/// when the replacement driver instance is ready (marking active again).
/// Stored with `Ordering::SeqCst` writes, `Ordering::Acquire` reads.
pub generation: AtomicU64,
/// SystemCaps granted to this domain. Initially set at device_init
/// time from the DeviceCapGrant bundle
/// ([Section 11.4](11-drivers.md#device-registry-and-bus-management--device-capability-grant-bundle)).
///
/// **Live policy update**: Unlike PermissionBits (cached in tokens),
/// SystemCaps are read from the domain on every KABI dispatch (Step 4
/// of `kabi_dispatch_with_vcap`). To update policy without crash/reload:
/// 1. Build new `SystemCaps` bitmap.
/// 2. `self.granted_syscaps.rcu_swap(new_bitmap)` — old readers finish
/// their current dispatch; new readers see the updated bitmap.
/// 3. `synchronize_rcu()` — ensures no in-flight dispatch can see the
/// old bitmap after this returns.
///
/// No token invalidation is needed because tokens do not cache
/// SystemCaps — the check is always live against the domain.
///
/// On driver reload after crash, this field is re-populated from the
/// new DeviceCapGrant (which may differ if system policy changed).
pub granted_syscaps: RcuCell<SystemCaps>,
// ... isolation key, ring buffer references, etc.
}
On driver crash (performed by the domain fault handler before teardown):
- The fault handler atomically increments `DriverDomain::generation` from odd (active) to even (inactive).
- Any subsequent `kabi_dispatch_with_vcap` call that compares `vcap.driver_generation` against the now-even generation will find a mismatch and return `KabiError::DomainCrashed` immediately — no stale dispatch reaches the crashed (or reloaded) driver.
- After the replacement driver completes initialization, the fault handler increments `generation` again (even → odd), activating the new instance.
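The odd/even protocol can be modeled in a few lines (a userspace sketch using `std` atomics; the type and method names are illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal model of `DriverDomain::generation`: odd = active, even = inactive.
struct DomainGeneration(AtomicU64);

impl DomainGeneration {
    /// New domains start at generation 1 (odd = active).
    fn new() -> Self { Self(AtomicU64::new(1)) }

    /// Crash path: odd -> even. Every outstanding token now mismatches.
    fn mark_crashed(&self) { self.0.fetch_add(1, Ordering::SeqCst); }

    /// Replacement instance ready: even -> odd. Fresh tokens are issued
    /// against this new value.
    fn mark_active(&self) { self.0.fetch_add(1, Ordering::SeqCst); }

    fn is_active(&self) -> bool { self.0.load(Ordering::Acquire) % 2 == 1 }

    fn current(&self) -> u64 { self.0.load(Ordering::Acquire) }
}
```

A token carrying `driver_generation == 1` fails the Step 1 comparison the moment `mark_crashed()` runs, and stays stale forever: the counter is monotonic and never returns to 1.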
Per-CPU CapValidationToken cache flush:
Tier 1 drivers maintain a per-CPU cache of recently issued CapValidationTokens
(up to 16 entries per CPU) to avoid repeatedly re-creating tokens for the same
capability.
/// Per-CPU cache of recently issued CapValidationTokens.
///
/// Amortizes the cost of capability validation across repeated
/// KABI calls to the same driver with the same capability. Each
/// entry caches the result of a full validation pass (capability
/// lookup + generation check + permission check + SystemCaps check).
///
/// Lookup: linear scan of 16 entries keyed by (cap_id, domain_id).
/// Cache is small enough that linear scan beats a hash table.
/// Eviction: LRU — the least-recently-used entry is replaced on
/// insert when the cache is full. LRU position is tracked by a
/// 4-bit counter per entry (stored in the low bits of a `u8`;
/// incremented on access, decayed on insert).
///
/// Placement: PerCpu section (accessed via `PerCpu::get()` with
/// `&PreemptGuard`). No locks — each CPU has its own cache instance.
pub struct CapTokenCache {
/// Cached tokens. Index 0..count-1 are valid.
entries: [CapTokenCacheEntry; 16],
/// Number of valid entries (0..=16).
count: u8,
}
/// Single cache entry: a (cap_id, domain_id) → CapValidationToken mapping.
pub struct CapTokenCacheEntry {
/// The capability ID this token was validated for.
pub cap_id: CapId,
/// The domain this token is valid for.
pub domain_id: DomainId,
/// The cached CapValidationToken (contains cached_rights, generations).
pub token: CapValidationToken,
/// LRU counter: incremented on hit, used for eviction selection.
pub lru_counter: u8,
}
When a domain's generation is incremented, these caches must be purged:
- The fault handler sets a `cap_flush_pending` bit in each CPU's `CpuLocal` data (Section 3.12), targeted at the crashing domain's `DriverDomainId`.
- The fault handler issues a cross-CPU IPI to all CPUs that have touched this domain since the last quiescent state (tracked via a per-domain CPU bitmask updated on each KABI call).
- Each IPI handler clears all `CapValidationToken` cache entries with `domain_id == crashed_domain.id`.
- The IPI completes before the fault handler releases the domain's memory. Any in-flight dispatch that passed the generation check before the IPI but has not yet completed will fault into the (now revoked) isolation domain and be caught by the domain fault handler — returning `KabiError::DomainCrashed` to the caller via the crash-recovery path, not a kernel panic.
Caller recovery: A caller that receives `KabiError::DomainCrashed` must:
- Discard the stale `CapValidationToken`.
- Wait for the driver to recover (poll the domain's generation for an odd value, or use the `service_recovered` callback defined in Section 11.6).
- Re-validate the original `Capability` against the new driver instance to obtain a fresh `CapValidationToken`.
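The wait step can be sketched as a bounded poll on the generation counter (illustrative; a real caller would sleep or use the `service_recovered` callback rather than spin):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Poll a domain's generation until it is odd (active) again, up to
/// `max_polls` attempts. Returns the new generation — the value to
/// re-validate against for a fresh CapValidationToken — or None on timeout.
fn wait_for_recovery(generation: &AtomicU64, max_polls: u32) -> Option<u64> {
    for _ in 0..max_polls {
        let g = generation.load(Ordering::Acquire);
        if g % 2 == 1 {
            return Some(g);
        }
        std::thread::yield_now(); // placeholder for sleep / callback wait
    }
    None
}
```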
This three-step recovery parallels the ServiceHandle re-open protocol described
in Section 11.6. Both
mechanisms force callers to observe the crash boundary rather than silently continuing
through a reloaded driver. The difference in representation: KabiServiceHandle (this
chapter) is the C-ABI stable handle with a generation counter (detects driver
instance replacement); liveness is maintained by the capability system, not by a
per-handle reference count. ServiceHandle (Section 11.6) is the lower-level
cross-domain handle used in the trampoline dispatch path. A KabiServiceHandle is
resolved from a ServiceHandle — the registry fills in the generation from the
provider's current state_generation at lookup time.
Caller recovery (CapRevoked): A caller that receives `KabiError::CapRevoked` must:
- Discard the stale `CapValidationToken`.
- Re-request the capability from the granting authority. This follows the standard `ServiceBind` protocol (Section 9.1): the caller invokes `request_service()` targeting the original service provider.
- If re-acquisition succeeds, validate the new capability to obtain a fresh `CapValidationToken` and resume operations.
- If re-acquisition fails (the authority revoked the entire subtree, or the capability was one-shot), the caller must propagate the error to its own callers — this is a permanent failure, not a transient one.
Unlike DomainCrashed, which is always transient (the driver will eventually
reload), CapRevoked may be permanent. Callers must not retry indefinitely —
a single re-acquisition attempt is sufficient. If the authority intended to
re-grant, the ServiceBind will succeed; if the revocation is permanent (e.g.,
a security policy change via Section 9.8),
repeated retries waste cycles and risk spinning.
Performance: The generation check is a single Ordering::Acquire atomic load
(~3-5 cycles, L1-resident in the domain descriptor cache line). This is cheaper than
re-validating the capability tree on each dispatch. The IPI on crash is infrequent
(driver crashes are exceptional events) and bounded to CPUs that have actively used
the domain.
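The generation check described above can be sketched as a minimal model. `DriverDomain` here is an illustrative stand-in for the kernel's domain descriptor (the real one carries more state than a single counter):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Minimal model of the per-dispatch generation check.
struct DriverDomain {
    generation: AtomicU64, // odd = active driver instance, even = crashed/reloading
}

// A CapValidationToken captures the generation at validation time; every
// dispatch re-checks it with a single Acquire load.
fn generation_ok(domain: &DriverDomain, token_generation: u64) -> bool {
    let current = domain.generation.load(Ordering::Acquire);
    current == token_generation && current % 2 == 1
}
```

A stale token fails the check after either half of the crash cycle: both the crash increment (odd to even) and the reload increment (even to odd) change the observed value.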
12.3.3 KABI Operation Permission Requirements¶
Every vtable method declares the PermissionBits required for invocation. The
dispatch trampoline (kabi_dispatch_with_vcap, Step 3) checks the caller's
cached_rights against this requirement before forwarding the call. If the
caller's rights are insufficient, dispatch returns KabiError::InsufficientPermissions
without entering the driver domain.
Permission requirements are stored in a static, per-vtable lookup table indexed by
method ordinal. The required_permissions() call is a constant-time array index
(no runtime computation, no branching). The kabi-gen tool generates this table
from @perm annotations in the .kabi IDL source.
PermissionBits flags (bitmask, u32):
The KABI dispatch layer uses the canonical `PermissionBits` definition from
[Section 9.1](09-security.md#capability-based-foundation--permissionbits-flags). Only three flags are
relevant to driver dispatch — `READ` (bit 0), `WRITE` (bit 1), and `ADMIN`
(bit 6). The remaining flags (EXECUTE, DEBUG, DELEGATE, MAP_*, KERNEL_READ)
are never set on driver capabilities and are ignored by the dispatch
trampoline. The `kabi-gen` tool validates at compile time that `.kabi` IDL
files only use `READ`, `WRITE`, or `ADMIN` in `@perm` annotations.
```rust
// Re-exported from capability-based-foundation — NOT a separate definition.
// KABI drivers see the same PermissionBits type as the rest of the kernel.
// Only these three flags appear in driver capability grants:
// PermissionBits::READ = 1 << 0 (bit 0)
// PermissionBits::WRITE = 1 << 1 (bit 1)
// PermissionBits::ADMIN = 1 << 6 (bit 6)
```
**`BlockDeviceVTable` permission requirements:**
| Method | Required `PermissionBits` | Required `SystemCaps` | Rationale |
|--------|---------------------------|-----------------------|-----------|
| `submit_io` (read) | `READ` | (none) | Reading block data |
| `submit_io` (write) | `WRITE` | (none) | Writing block data |
| `poll_completion` | `READ` | (none) | Polling does not mutate device state |
| `get_capabilities` | `READ` | (none) | Querying device capabilities |
| `discard_blocks` | `WRITE` | (none) | Discards mutate on-device state |
| `zone_management` | `ADMIN` | (none) | Zone operations are administrative |
Note: `submit_io` permission depends on the `BioOp` direction. The dispatch
trampoline inspects `BioOp::is_write()` to select `READ` or `WRITE`.
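That direction-dependent selection can be sketched as follows. The `BioOp` variants shown are illustrative (the real enum lives in the block IDL); the bit values follow the flag table above:

```rust
// Illustrative BioOp subset; the real enum is defined in block_device.kabi.
#[derive(Clone, Copy)]
enum BioOp { Read, Write }

impl BioOp {
    fn is_write(self) -> bool { matches!(self, BioOp::Write) }
}

const READ_BIT: u32 = 1 << 0;  // PermissionBits::READ
const WRITE_BIT: u32 = 1 << 1; // PermissionBits::WRITE

// submit_io's required permission depends on the I/O direction.
fn submit_io_required_perm(op: BioOp) -> u32 {
    if op.is_write() { WRITE_BIT } else { READ_BIT }
}
```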
**`KernelServicesVTable` permission requirements (dual-check):**
Methods that perform DMA or IRQ operations require both `PermissionBits` AND
`SystemCaps`. The dispatch trampoline checks both independently — a domain that
holds `PermissionBits::WRITE` but lacks `CAP_DMA` cannot call `alloc_dma_buffer`.
| Method | Required `PermissionBits` | Required `SystemCaps` | Rationale |
|--------|---------------------------|-----------------------|-----------|
| `alloc_dma_buffer` | `WRITE` | `CAP_DMA` | DMA allocation is a privileged hardware operation |
| `free_dma_buffer` | `WRITE` | `CAP_DMA` | Releasing DMA buffers unmaps IOMMU entries |
| `register_interrupt` | `ADMIN` | `CAP_IRQ` | IRQ vector configuration is privileged hardware access |
| `deregister_interrupt` | `ADMIN` | `CAP_IRQ` | IRQ vector deconfiguration is privileged hardware access |
| `log` | `READ` | (none) | Diagnostic output, low-privilege |
| `create_ring_buffer` | `WRITE` | (none) | Allocates shared memory (no hardware access) |
| `fma_report_health` | `READ` | (none) | Reporting health is observational |
| `crypto_register_alg` | `ADMIN` | `CAP_CRYPTO` | Registering crypto algorithms modifies the global algorithm table; required by hardware crypto accelerator drivers (QAT, CCP, etc.) that run as Tier 1 |
| `crypto_unregister_alg` | `ADMIN` | `CAP_CRYPTO` | Unregistering crypto algorithms; triggers forced drain of in-flight transforms via `TfmRegistry` |
`CAP_CRYPTO` (bit 95) is a UmkaOS-native SystemCap defined in
[Section 9.2](09-security.md#permission-and-acl-model--system-administration-capabilities). It gates
crypto algorithm registration/unregistration to prevent a non-crypto driver
from replacing system-wide cryptographic implementations.
`CAP_IRQ` (bit 94) is a new UmkaOS-native SystemCap defined in
[Section 9.2](09-security.md#permission-and-acl-model--system-administration-capabilities). It gates
interrupt vector registration/deregistration. Previously, `register_interrupt`
required only `PermissionBits::ADMIN`; the dual-check adds `CAP_IRQ` to
prevent a driver with administrative permissions on its own capability from
registering arbitrary interrupt vectors without explicit IRQ authorization.
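The dual-check itself reduces to two independent mask tests. A sketch with plain integer masks, where the `CAP_DMA` bit position is a placeholder (Section 9.2 defines the real layout; note that SystemCaps bits extend past 64, so `u128` is used here):

```rust
const WRITE: u32 = 1 << 1;             // PermissionBits::WRITE
const CAP_DMA: u128 = 1u128 << 90;     // placeholder bit; see Section 9.2

#[derive(Debug)]
struct DispatchError;

fn check_dual(
    cached_rights: u32,
    domain_syscaps: u128,
    required_perm: u32,
    required_syscap: u128,
) -> Result<(), DispatchError> {
    // Check 1 (per-capability): "may this capability do X?"
    if cached_rights & required_perm != required_perm {
        return Err(DispatchError);
    }
    // Check 2 (per-domain): "may this domain perform Y?"
    if domain_syscaps & required_syscap != required_syscap {
        return Err(DispatchError);
    }
    Ok(())
}
```

A domain holding `WRITE` on its capability but lacking `CAP_DMA` fails the second test, which is exactly the `alloc_dma_buffer` refusal described above.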
**`NetworkDeviceVTable` permission requirements:**
| Method | Required `PermissionBits` | Required `SystemCaps` | Rationale |
|--------|---------------------------|-----------------------|-----------|
| `start_xmit` | `WRITE` | (none) | Transmitting packets mutates NIC state |
| `set_rx_mode` | `ADMIN` | `CAP_NET_ADMIN` | Changing receive filters is administrative |
| `change_mtu` | `ADMIN` | `CAP_NET_ADMIN` | MTU changes affect all consumers |
| `get_stats` | `READ` | (none) | Reading statistics is observational |
**`KabiRequest::required_permissions()` implementation:**
```rust
impl KabiRequest {
    /// Returns the PermissionBits required to dispatch this request.
    /// Looked up from a static per-vtable table generated by `kabi-gen`.
    /// Cost: single array index + load (~1ns).
    pub fn required_permissions(&self) -> PermissionBits {
        // Each vtable type has a static array: PERM_TABLE_<VtableName>.
        // The method_index field identifies the vtable slot being called.
        // Example for BlockDeviceVTable:
        //   static PERM_TABLE_BLOCK_DEVICE: [PermissionBits; 6] = [
        //       PermissionBits::READ,   // submit_io (read path)
        //       PermissionBits::READ,   // poll_completion
        //       PermissionBits::READ,   // get_capabilities
        //       PermissionBits::WRITE,  // discard_blocks
        //       PermissionBits::ADMIN,  // zone_management
        //       // ... appended as vtable grows
        //   ];
        self.vtable_perm_table[self.method_index as usize]
    }

    /// Returns the SystemCaps required to dispatch this request.
    /// Looked up from a static per-vtable table generated by `kabi-gen`
    /// from `@syscap` annotations. Methods with no `@syscap` annotation
    /// have `SystemCaps::empty()` — the dispatch trampoline skips the
    /// SystemCaps check entirely for those methods (branch predicted).
    /// Cost: single array index + load (~1ns).
    ///
    /// The dual-check protocol requires BOTH `required_permissions()` AND
    /// `required_syscaps()` to pass. They are orthogonal:
    /// - `PermissionBits` (per-capability): "may this capability do X?"
    /// - `SystemCaps` (per-domain): "may this domain perform Y?"
    ///
    /// Example: `alloc_dma_buffer` requires `PermissionBits::WRITE`
    /// (object-level write) AND `SystemCaps::CAP_DMA` (domain-level DMA
    /// authorization). A driver that has WRITE on a block device but was
    /// not granted CAP_DMA cannot allocate DMA buffers.
    pub fn required_syscaps(&self) -> SystemCaps {
        // Each vtable type has a static array: SYSCAP_TABLE_<VtableName>.
        // Example for KernelServicesVTable:
        //   static SYSCAP_TABLE_KERNEL_SERVICES: [SystemCaps; 7] = [
        //       SystemCaps::CAP_DMA,  // alloc_dma_buffer
        //       SystemCaps::CAP_DMA,  // free_dma_buffer
        //       SystemCaps::CAP_IRQ,  // register_interrupt
        //       SystemCaps::CAP_IRQ,  // deregister_interrupt
        //       SystemCaps::empty(),  // log
        //       SystemCaps::empty(),  // create_ring_buffer
        //       SystemCaps::empty(),  // fma_report_health
        //   ];
        self.vtable_syscap_table[self.method_index as usize]
    }
}
```
IDL @perm and @syscap annotation syntax:
Custom .kabi files declare per-method permission requirements using @perm
(for PermissionBits) and optionally @syscap (for SystemCaps):
```
@version(1)
vtable MyDeviceVTable {
    @version(1)
    vtable_size: u64,

    @version(1)
    @perm(READ)
    fn get_status() -> DeviceStatus;

    @version(1)
    @perm(WRITE)
    @syscap(CAP_DMA)
    fn submit_dma_command(cmd: DmaCommand) -> CommandResult;

    @version(2)
    @optional
    @perm(ADMIN)
    fn reset_device() -> ResetResult;
}
```
Methods without @syscap have no SystemCaps requirement — the dispatch trampoline
skips the SystemCaps check entirely for those methods. @syscap is optional
because most methods only need PermissionBits; the dual-check is reserved for
methods that touch privileged hardware resources (DMA, IRQ, IOMMU).
The kabi-gen tool:
- Parses `@perm` annotations and emits the static permission table for each vtable.
- Parses `@syscap` annotations and emits the static SystemCaps table for each vtable.
  Methods without `@syscap` get `SystemCaps::empty()` in the table.
- Rejects any method that lacks a `@perm` annotation (every method must declare its
  required permission level; there is no default).
- Validates that `@perm` values are valid `PermissionBits` flag names (`READ`, `WRITE`,
  `ADMIN`, or `|`-combined like `READ | WRITE`).
- Validates that `@syscap` values are valid `SystemCaps` flag names (e.g., `CAP_DMA`,
  `CAP_IRQ`, `CAP_NET_ADMIN`, or `|`-combined like `CAP_DMA | CAP_DMA_IDENTITY`).

The `kabi-compat-check` tool additionally verifies that permission requirements on
existing methods are never weakened across versions (a method that required `ADMIN`
cannot be relaxed to `READ` in a later version, as that would widen the attack
surface). The same rule applies to `@syscap` — a method that required `CAP_DMA`
cannot drop that requirement.
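The `@perm` value validation can be sketched as a small parser. Flag names and bit positions follow the table in Section 9.1; the function name is illustrative, not part of kabi-gen's documented API:

```rust
// Resolve a `@perm(...)` annotation value to a PermissionBits mask.
// Accepts single names or `|`-combined lists; rejects unknown flag names.
fn parse_perm_annotation(value: &str) -> Result<u32, String> {
    let mut bits = 0u32;
    for name in value.split('|').map(str::trim) {
        bits |= match name {
            "READ" => 1 << 0,   // bit 0
            "WRITE" => 1 << 1,  // bit 1
            "ADMIN" => 1 << 6,  // bit 6
            other => return Err(format!("@perm: unknown PermissionBits flag `{other}`")),
        };
    }
    Ok(bits)
}
```

Flags outside the driver-relevant set (e.g., `EXECUTE`) are rejected at compile time, matching the constraint stated in Section 12.3.3.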
12.3.4 Generation Counter Wrap Policy¶
The DriverDomain::generation counter is a u64 that increments by 1 on every driver
crash (odd → even) and again when the replacement driver becomes ready (even → odd).
Two increments per crash cycle means ~9.2 × 10^18 crash cycles (about 1.8 × 10^19
increments) before the counter reaches u64::MAX. At one billion increments per second
(far beyond any realistic scenario) the counter would take approximately 584 years
to exhaust. Although this is
effectively unreachable in practice, production correctness requires explicit handling
of wrap so that no combination of inputs can produce a silent invariant violation.
On increment — wrap detection:
```rust
/// Increment the generation counter from odd (active) to even (inactive),
/// marking the domain as crashed. Returns Err(EOVERFLOW) if the next
/// generation value would be zero (wrap boundary).
pub fn mark_crashed(domain: &DriverDomain) -> Result<(), KernelError> {
    loop {
        let prev = domain.generation.load(Ordering::Acquire);
        let next = prev.wrapping_add(1);
        if next == 0 {
            // The counter has wrapped. The slot must be reset by the operator
            // before it can accept another driver load. Log and refuse.
            log::error!(
                "DriverDomain {:?}: generation counter exhausted after {} cycles. \
                 Operator must clear the slot before reloading.",
                domain.id,
                prev / 2,
            );
            return Err(KernelError::EOVERFLOW);
        }
        // CAS prevents TOCTOU: two concurrent crash handlers cannot both
        // advance the counter (only one CAS succeeds; the other retries).
        match domain.generation.compare_exchange(
            prev, next, Ordering::SeqCst, Ordering::Acquire,
        ) {
            Ok(_) => return Ok(()),
            Err(_) => continue, // Another crash handler won; retry.
        }
    }
}
```
When EOVERFLOW is returned, the crash recovery path logs the event to the FMA
telemetry ring (Section 20.1)
and marks the driver slot as SlotState::GenerationExhausted. The device node
remains in the registry (so userspace observability tools can see its state) but all
driver_load() calls for that slot return -EOVERFLOW until an operator issues
driver_slot_reset() via the management KABI. The reset zeroes the generation counter
and clears the exhausted state, allowing the slot to be used again.
```rust
/// Reset a driver slot to its initial state. Requires CAP_SYS_ADMIN and DebugCap.
/// Clears the generation counter, deallocates domain resources, and logs an FMA event.
/// Used by operators to recover slots whose generation counter has reached the
/// exhaustion threshold.
pub fn driver_slot_reset(slot: DriverSlotId) -> Result<(), DriverError> {
    // Validate: caller holds CAP_SYS_ADMIN + DebugCap
    // Verify:   slot has no active driver loaded
    // Clear:    generation = 0, domain = None, state = SlotState::Empty
    // Log:      FMA event HealthEventClass::DriverSlotReset
    unimplemented!("signature sketch; body elided in this chapter")
}
```
Outstanding handles — invalidation across wrap:
Any CapValidationToken or DriverHandle that carries a generation value from before the
wrap is automatically invalidated by the existing mismatch check in
kabi_dispatch_with_vcap (Section 12.3): after an operator reset the generation
restarts at zero (becoming 1 once a new driver loads) and will never match a
previously issued token carrying a generation value near u64::MAX. No additional
logic is required.
HMAC key rotation across wrap:
The DriverHmacKey.generation field (Section 11.4) is an input to the HKDF
derivation. After an operator slot reset the generation restarts at 1 once a new driver
loads. A key derived with generation=1 after a reset is cryptographically distinct from
a key derived with generation=1 at initial driver load, because HKDF's Info field
includes created_at_ns (a monotonic kernel timestamp, captured at key allocation
time). The DriverHmacKey struct carries this timestamp to disambiguate keys that
share a (slot, generation) pair across a reset:
```rust
pub struct DriverHmacKey {
    key: Zeroize<[u8; 32]>,
    driver_slot: DriverSlot,
    generation: u64,
    /// Monotonic nanosecond timestamp at which this key was created.
    /// Included in HKDF `Info` to distinguish keys with the same (slot, generation)
    /// pair that arise after a generation counter reset. Never exported.
    created_at_ns: u64,
}
```
The HKDF Info field is therefore:
```text
b"umka-driver-hmac" || slot_id.to_le_bytes() || generation.to_le_bytes() || created_at_ns.to_le_bytes()
```
This prevents an attacker who can observe an HMAC tag from a pre-reset driver from
replaying it against a post-reset driver with the same (slot, generation) tuple.
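A sketch of that `Info` construction follows. The `u32` width of `slot_id` is an assumption (the spec does not pin `DriverSlot`'s representation), and the function name is illustrative:

```rust
// Build the HKDF Info field for a driver HMAC key derivation:
// domain separator || slot || generation || creation timestamp, all little-endian.
fn hmac_key_info(slot_id: u32, generation: u64, created_at_ns: u64) -> Vec<u8> {
    let mut info = Vec::with_capacity(16 + 4 + 8 + 8);
    info.extend_from_slice(b"umka-driver-hmac"); // 16-byte domain separator
    info.extend_from_slice(&slot_id.to_le_bytes());
    info.extend_from_slice(&generation.to_le_bytes());
    info.extend_from_slice(&created_at_ns.to_le_bytes());
    info
}
```

Two keys with the same (slot, generation) pair but different creation timestamps yield different `Info` strings and therefore different derived keys.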
Operational note:
Generation counter exhaustion requires approximately 9.2 × 10^18 crash-and-reload increments on a single driver slot. At one crash per second (an extremely unstable driver), this takes over 292 billion years. The wrap check satisfies production correctness requirements and makes the invariant explicit in code, but it is not expected to trigger in any operational environment. Systems that monitor driver crash rates via FMA telemetry will have flagged a repeatedly crashing driver long before generation exhaustion becomes possible.
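The arithmetic behind those figures, as a checkable helper:

```rust
// Years until generation exhaustion at a given crash rate. One crash-and-reload
// cycle consumes two increments, so u64::MAX / 2 cycles exhaust the counter.
fn exhaustion_years(crashes_per_second: f64) -> f64 {
    let cycles = (u64::MAX / 2) as f64; // ≈ 9.2e18 crash cycles
    let seconds_per_year = 365.25 * 24.0 * 3600.0;
    cycles / crashes_per_second / seconds_per_year
}
```

At one crash per second this yields roughly 2.9 × 10^11 years, matching the operational note above.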
12.4 Version Negotiation¶
When a driver loads, version negotiation proceeds as follows:
1. Kernel resolves __kabi_driver_entry and calls it to obtain *const KabiDriverManifest
2. Kernel selects the tier entry point (entry_direct/entry_ring/entry_ipc) and calls it,
passing kernel_vtable; the entry returns driver_vtable
3. Both sides read `kabi_version` (second field) and check compatibility:
- Driver: `KabiVersion::from_u64(kernel_vtable.kabi_version)?.is_compatible_with(driver_min_version)`
- Kernel: `KabiVersion::from_u64(driver_vtable.kabi_version)?.is_compatible_with(kernel_min_version)`
- Mismatch → return KABI_ERR_VERSION_MISMATCH
4. Both sides read `vtable_size` (first field) as a bounds-safety check:
- Read only `min(their_vtable_size, our_compiled_vtable_size)` bytes
- Methods beyond the smaller of the two sizes are treated as absent
5. `kabi_version` is the primary version discriminant; `vtable_size` is the bounds guard
6. Optional methods (Option<fn>) are checked before each call
This allows:
- Old drivers on new kernels (kabi_version compatibility check passes; kernel
ignores methods beyond the driver's vtable_size)
- New drivers on old kernels (kabi_version compatibility check passes; driver
checks for method presence via Option<fn>, degrades gracefully)
- Independent kernel and driver update cycles
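The support-window acceptance test in Step 3 can be modeled as follows. This is a sketch assuming a plain monotonically increasing version number and a symmetric check on both sides; `kernel_accepts` is an illustrative name, not the real `KabiVersion` API:

```rust
const SUPPORT_WINDOW: u64 = 5; // releases of binary compatibility

// Each side computes its minimum supported version and checks the peer's
// declared kabi_version against it. There is no upper cap: a newer peer is
// acceptable because vtable_size bounds how much of its vtable is read.
fn kernel_accepts(kernel_version: u64, driver_kabi_version: u64) -> bool {
    let min_supported = kernel_version.saturating_sub(SUPPORT_WINDOW - 1);
    driver_kabi_version >= min_supported
}
```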
See also: Section 19.9 (Safe Kernel Extensibility) extends the KABI vtable pattern to kernel policy modules, enabling hot-swappable scheduler policies, memory policies, and fault handlers using the same append-only ABI mechanism.
12.4.1 Vtable Bounds Safety (Zero-Extension Contract)¶
Step 3 of the negotiation protocol uses kabi_version to determine version compatibility.
Step 4 uses vtable_size as an independent bounds-safety guard: it tells each side
how many bytes to read, regardless of which KABI version the other side declares.
This distinction matters because vtable_size does not uniquely identify a KABI version.
Deprecation tombstones preserve vtable slots (Rule 2a), so two different KABI versions can
have the same vtable_size. The kabi_version field is the authoritative version identity;
vtable_size is a memory-safety mechanism that prevents out-of-bounds reads.
This subsection defines the exact binary-level rules — the zero-extension contract — for all three size relationships.
12.4.1.1.1 Case 1: Driver vtable larger than kernel expects¶
The driver's vtable_size exceeds the kernel's compiled KERNEL_VTABLE_SIZE. This
typically means the driver was compiled against a KABI version that appended methods
the kernel does not know about.
Rule: The kernel reads only the first min(driver_vtable_size, KERNEL_VTABLE_SIZE)
bytes of the driver's vtable. Methods beyond KERNEL_VTABLE_SIZE are silently ignored.
The kernel MUST NOT access any byte at or beyond its own compiled vtable size.
```rust
// At driver load time, in the kernel's vtable acceptance path:
let effective_size = driver_vtable.vtable_size.min(KERNEL_VTABLE_SIZE as u64);
// All subsequent method dispatches use effective_size as the bound.
```
This is safe because vtables are append-only (Rule 1): the first KERNEL_VTABLE_SIZE
bytes of any larger driver vtable are layout-identical to the kernel's own definition
of that vtable.
12.4.1.1.2 Case 2: Driver vtable smaller than kernel expects¶
The driver's vtable_size is less than the kernel's compiled KERNEL_VTABLE_SIZE. This
typically means the driver was compiled against an older KABI version that did not include
methods the kernel now defines.
Rule: Every byte in the kernel's vtable definition that lies beyond
driver_vtable.vtable_size MUST be treated as zero. Concretely: any fn pointer in
a slot beyond the driver's declared size is null. The kernel MUST check for null before
calling any such method pointer and MUST use the per-method default declared in the IDL
(see below) when the pointer is null.
This rule extends to Option<fn(...)> fields that happen to fall inside the driver's
declared size but are nonetheless null: null-check is always required regardless of
whether the slot might be beyond vtable_size.
The canonical dispatch helper is:
```rust
/// Extract the function pointer type of a named field from a vtable struct.
///
/// Given `kabi_fn_type!(BlockDeviceVTable, submit_io)`, this resolves to the
/// declared function pointer type of the `submit_io` field in
/// `BlockDeviceVTable` — e.g., `unsafe extern "C" fn(*mut c_void, *mut Bio) -> i32`.
///
/// Implemented via the `field_type_of!` built-in — a proc-macro that reads the
/// struct definition at compile time. This is equivalent to what C achieves with
/// `typeof(((struct block_vtable *)0)->submit_io)`.
///
/// Alternative: callers can write the function pointer type explicitly instead of
/// using this macro. The macro exists to avoid duplicating type signatures between
/// the vtable struct definition and every call site.
macro_rules! kabi_fn_type {
    ($type:ty, $method:ident) => {
        // Resolves to the declared type of `$type::$method`.
        // In implementation: `field_type_of!($type, $method)`.
        <$type as KabiVtable>::MethodType
    };
}
```
```rust
/// Dispatch a vtable method with bounds-check, null-check, and fallback default.
///
/// **SDK-internal** — driver code MUST NOT call this macro directly. Drivers
/// use `kabi_call!(handle, method, args)` ([Section 12.8](#kabi-domain-runtime)) which is
/// the public transport-abstraction API. This macro is used internally by:
/// - The `kabi_call!` Direct path (for same-domain vtable dispatch)
/// - `kabi_call_t0!` (for Tier 0 Core-domain vtable dispatch under RCU)
/// - The consumer loop's `vtable_dispatch()` (for ring-received commands)
///
/// Evaluates to `$default` when the method slot lies beyond the driver's
/// declared `vtable_size` or the function pointer is null (older driver that
/// predates this method).
///
/// The `effective_size` is computed internally as
/// `min((*$vtable).vtable_size, size_of::<$type>())`. The caller does NOT
/// pass this value — it is derived from the vtable header and the kernel's
/// compile-time knowledge of the vtable struct size. This prevents a
/// malicious driver from bypassing bounds checks with `u64::MAX`.
///
/// # Safety
///
/// The caller must ensure `$vtable` points to a valid, immutable vtable for
/// the lifetime of the call. Callers obtain vtable pointers exclusively
/// through the driver load path, which validates `vtable_size`, bounds-checks
/// all slots, and copies the vtable into kernel-owned memory. The raw pointer
/// precondition is established once at load time, not per-call.
///
/// # Layering
///
/// ```text
/// Driver code:  kabi_call!(handle, method, args)                        // public API
/// Direct path:  kabi_vtable_call!(vtable, Type, method, default, args)  // SDK internal
/// Ring path:    serialize -> ring.submit -> [consumer uses kabi_vtable_call!]
/// Tier 0 hooks: kabi_call_t0!(ptr, Type, method, default, args)         // wraps kabi_vtable_call! with RCU
/// ```
#[doc(hidden)]
macro_rules! kabi_vtable_call {
    ($vtable:expr, $type:ty, $method:ident, $default:expr $(, $args:expr)*) => {{
        let vt = $vtable;
        // Compute effective_size internally: min of the driver's declared
        // vtable_size and the kernel's compile-time vtable struct size.
        // This prevents a malicious driver from using u64::MAX to bypass
        // bounds checks. The driver's vtable_size was already validated
        // against KABI_MAX_VTABLE_SIZE at load time.
        let effective_size: usize = {
            // SAFETY: vt points to a valid vtable with a VtableHeader prefix.
            let driver_size = unsafe { (*vt).vtable_size } as usize;
            let kernel_size = core::mem::size_of::<$type>();
            core::cmp::min(driver_size, kernel_size)
        };
        let offset = core::mem::offset_of!($type, $method);
        let fn_size = core::mem::size_of::<*const ()>();
        if offset + fn_size <= effective_size {
            // Read raw bytes as *const () first to avoid UB on null fn pointers.
            // Bare fn types have a non-null validity invariant; reading null bytes
            // into `fn(...)` is instant UB. Reading as *const () is always safe.
            let raw_ptr = unsafe {
                core::ptr::read(
                    core::ptr::addr_of!((*vt).$method) as *const *const ()
                )
            };
            if !raw_ptr.is_null() {
                // SAFETY: bounds-checked and non-null; transmute to fn pointer.
                let f: $crate::kabi_fn_type!($type, $method) =
                    unsafe { core::mem::transmute(raw_ptr) };
                unsafe { f($($args),*) }
            } else {
                // Null method slot. Two cases:
                // - Optional method (declared as Option<fn>): returning $default
                //   is the correct behavior — the driver chose not to implement
                //   this method and the caller provided the appropriate default.
                // - Mandatory method (declared as bare fn): a null slot indicates
                //   vtable corruption (overwritten memory, malformed driver binary,
                //   or partial initialization). Log a warning for crash recovery
                //   diagnostics, then return $default as a defensive fallback.
                //   The caller will typically receive an error code or zero value,
                //   which is preferable to a kernel panic from calling through a
                //   null function pointer.
                //
                // The macro cannot distinguish optional from mandatory at expansion
                // time. kabi-gen wrappers for mandatory methods should use a $default
                // that signals an error (e.g., KabiResult::Err(EFAULT)) and log:
                //   log::warn!("KABI: null mandatory method {} in vtable {:?}",
                //              stringify!($method), core::ptr::addr_of!(*$vtable));
                $default
            }
        } else {
            $default
        }
    }};
}
```
In practice, the macro is wrapped by per-method helper functions generated by
kabi-gen so driver authors and kernel subsystems never write the offset arithmetic
by hand. The name kabi_vtable_call! distinguishes it from the public
kabi_call! macro (Section 12.8) which is the transport-abstraction
API for driver code.
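The bounds-and-null discipline the macro implements can be demonstrated in miniature. This self-contained sketch is not the SDK macro; it hand-expands the same checks for one method slot of an invented two-method vtable:

```rust
use std::mem::{offset_of, size_of};

// Miniature vtable: a size header followed by two optional method slots.
#[repr(C)]
struct MiniVTable {
    vtable_size: u64,
    get_status: Option<extern "C" fn() -> i64>,
    reset: Option<extern "C" fn() -> i64>, // appended in a later KABI version
}

extern "C" fn reset_impl() -> i64 { 0 }

// Hand-expanded equivalent of kabi_vtable_call!(vt, MiniVTable, reset, default):
// bounds-check the slot against the driver's declared size, then null-check.
fn dispatch_reset(vt: &MiniVTable, default: i64) -> i64 {
    let effective = (vt.vtable_size as usize).min(size_of::<MiniVTable>());
    let slot_end = offset_of!(MiniVTable, reset) + size_of::<*const ()>();
    if slot_end <= effective {
        match vt.reset {
            Some(f) => f(),
            None => default, // present in layout but unimplemented
        }
    } else {
        default // slot lies beyond the older driver's declared vtable_size
    }
}
```

An older driver that declares a `vtable_size` covering only the header and `get_status` never has its `reset` slot read; the caller receives the default instead.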
12.4.1.1.3 Case 3: Equal vtable sizes¶
Both sides declare the same vtable_size. No special handling is required. All method
pointers within the vtable may still be null if they are declared Option<fn(...)> in
the IDL; null-checks on optional methods are always required regardless of version
match.
12.4.1.1.4 Per-method null-pointer defaults in the IDL¶
Each method in a .kabi IDL file MUST carry a default annotation. kabi-gen
uses this annotation to generate the fallback arm of the dispatch helper and to
enforce that every call site provides a default value. The annotation syntax is:
```
// Accepted default forms:
fn on_suspend() -> ()        = default_noop;  // null → do nothing, return ()
fn get_capabilities() -> u64 = 0u64;          // null → return literal zero
fn handle_interrupt() -> bool = false;        // null → interrupt not claimed
fn custom_ioctl(cmd: u32, arg: u64)
    -> KabiResult<u64, IoctlError>
    = default_err(KabiError::NOT_SUPPORTED);  // null → return error variant
```
Methods without a default annotation are mandatory: they must be non-null in
every driver vtable, and their absence causes driver load rejection (see below).
kabi-gen marks these fields as bare unsafe extern "C" fn(...) (not Option<fn>).
12.4.1.1.5 Minimum vtable size and load rejection¶
If driver_vtable.vtable_size < KABI_MINIMUM_VTABLE_SIZE, the kernel MUST reject the
driver at load time with ENOEXEC and log the rejection.
KABI_MINIMUM_VTABLE_SIZE is the byte size of the vtable struct as it existed at KABI
v1 — the first release that established the mandatory baseline method set. It is a
compile-time constant derived from the v1 IDL snapshot and never changes (because KABI
v1 methods are never removed or reordered). A driver that cannot even fill the v1 layout
is either corrupt, built against a pre-release ABI, or targeting a different device
class entirely; loading it would be unsafe.
This check is distinct from the KABI version support-window check (Section 12.2):
a driver can be within the support window yet still produce ENOEXEC if its vtable
is truncated below the v1 baseline (e.g., due to a linker error).
Load rejection decision tree:
```text
vtable_size < KABI_MINIMUM_VTABLE_SIZE  → ENOEXEC (corrupt / pre-baseline)
vtable_size > KABI_MAX_VTABLE_SIZE      → ENOEXEC (corrupt / implausible)
driver version < minimum supported      → ENOEXEC (outside support window)
vtable_size < KERNEL_VTABLE_SIZE        → load OK, smaller vtable (Case 2 above)
vtable_size = KERNEL_VTABLE_SIZE        → load OK, equal size (Case 3)
vtable_size > KERNEL_VTABLE_SIZE        → load OK, larger vtable (Case 1 above)
```
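The decision tree maps onto a small classifier. `KABI_MAX_VTABLE_SIZE = 4096` is fixed by Section 12.4.1.1.6; the other constants here are illustrative, and the support-window check (which needs the version, not the size) is omitted from this sketch:

```rust
const KABI_MINIMUM_VTABLE_SIZE: u64 = 64; // assumed v1 baseline size
const KABI_MAX_VTABLE_SIZE: u64 = 4096;   // fixed by the spec
const KERNEL_VTABLE_SIZE: u64 = 256;      // assumed compiled size

#[derive(Debug, PartialEq)]
enum LoadDecision { Enoexec, SmallerVtable, EqualSize, LargerVtable }

// Size-based half of the load-rejection decision tree (version check omitted).
fn classify(vtable_size: u64) -> LoadDecision {
    if vtable_size < KABI_MINIMUM_VTABLE_SIZE || vtable_size > KABI_MAX_VTABLE_SIZE {
        return LoadDecision::Enoexec; // corrupt / pre-baseline / implausible
    }
    match vtable_size.cmp(&KERNEL_VTABLE_SIZE) {
        core::cmp::Ordering::Less => LoadDecision::SmallerVtable,    // Case 2
        core::cmp::Ordering::Equal => LoadDecision::EqualSize,       // Case 3
        core::cmp::Ordering::Greater => LoadDecision::LargerVtable,  // Case 1
    }
}
```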
12.4.1.1.6 Maximum vtable size guard¶
KABI_MAX_VTABLE_SIZE = 4096 is a sanity check upper bound. No device class vtable
has more than ~50 function pointers (8 bytes each = 400 bytes); even with generous
future growth, 4096 bytes allows ~500 function pointers. A driver declaring
vtable_size > KABI_MAX_VTABLE_SIZE is treated as corrupt — the kernel rejects it
with ENOEXEC and logs a warning. This prevents the kabi_vtable_call! macro's bounds
check (offset + size <= vtable_size) from using an attacker-controlled vtable_size
to read arbitrary memory from the driver's address space. The effective size used by
kabi_vtable_call! is min(driver_vtable.vtable_size, KERNEL_VTABLE_SIZE), but the
pre-validation at load time catches implausible values early.
12.4.2 Deprecation Tombstones¶
When a vtable method completes its deprecation cycle (Rule 2a: deprecated at KABI vN,
tombstoned at vN+3 or vN+5 for LTS), the method pointer in the kernel's vtable is
replaced with kabi_deprecated_stub:
```rust
/// Tombstone generator for deprecated KABI vtable methods.
///
/// Each deprecated method gets a **type-correct tombstone** that matches the
/// original method's C-ABI signature exactly. This is necessary because:
///
/// - On x86-64 System V ABI, methods returning structs >16 bytes use a hidden
///   first pointer parameter (rdi). A generic `() -> i64` tombstone would not
///   consume this hidden parameter, corrupting the caller's stack.
/// - On AArch64, struct-returning methods may use x8 as the result location
///   register. Signature mismatch leaves the result buffer uninitialized.
///
/// The KABI IDL compiler (`umka-kabi-gen`) generates a per-method tombstone at
/// deprecation time. The tombstone has the same parameter list as the original
/// method, ignores all arguments, and returns an error-indicating value:
///
/// - Integer-returning methods: return `-(ENOSYS as i64)`.
/// - Struct-returning methods: zero-fill the output struct and set the embedded
///   `status` field (all KABI structs with results include one) to `-ENOSYS`.
/// - Void-returning methods: no-op.
///
/// Example generated tombstone for a struct-returning method:
/// ```rust
/// pub unsafe extern "C" fn foo_deprecated(out: *mut FooResult, ...) -> *mut FooResult {
///     core::ptr::write_bytes(out, 0, 1); // zero-fill
///     (*out).status = -(Errno::ENOSYS as i32);
///     out // return the pointer per ABI contract
/// }
/// ```
///
/// For simple integer-returning methods, a single shared tombstone suffices:
pub unsafe extern "C" fn kabi_deprecated_stub_i64() -> i64 {
    -(Errno::ENOSYS as i64)
}
```
Why tombstones instead of removal:
A method deprecated in vN may still be called by drivers compiled against vN
that are within the support window (Section 12.2). The tombstone provides a clean
error (-ENOSYS) rather than undefined behavior from a null or reused slot. Because
tombstones preserve the slot, vtable_size is monotonically non-decreasing, and the
zero-extension contract (Cases 1-3 above) remains valid.
The kabi_version field (Rule 6) is the authoritative version identity. Even in the
theoretical case where a future vtable reorganization changes vtable_size, kabi_version
unambiguously identifies the version. vtable_size is a bounds guard, not an identity.
Deprecation lifecycle summary:
| KABI version | Method state | Caller behavior |
|---|---|---|
| vN | Active | Normal operation |
| vN (deprecated) | Active + `#[deprecated]` | Kernel log warning on each call |
| vN+3 (or vN+5 LTS) | Tombstone (`kabi_deprecated_stub`) | Caller receives `-ENOSYS` |
| Forever | Tombstone preserved | Slot never reused for a different method |
12.5 KABI IDL Language Specification¶
The .kabi IDL defines the stable driver ABI. The umka-kabi-gen compiler
transforms .kabi source files into C headers and Rust modules for use by
drivers and the kernel. This section is the canonical reference for authoring
.kabi files; implementors of umka-kabi-gen must conform to every rule
described here.
KabiResult<T, E> (used in vtable return types) is a #[repr(C)] result type
with a discriminant tag and a union payload, defined in
umka-driver-sdk/src/abi.rs. Rust's Result<T, E> has no guaranteed
#[repr(C)] layout and must never cross the ABI boundary directly.
KabiResult<T, E> layout (canonical definition):
```rust
/// ABI-stable Result type. Crosses the KABI boundary in place of Rust's Result<T, E>.
///
/// Total size = 8 + max(size_of::<T>(), size_of::<E>()).
/// The `kabi-gen` tool generates C headers with an equivalent layout:
///   struct KabiResult_FooBar { uint32_t discriminant; uint8_t _pad[4]; union { Foo ok; Bar err; } payload; };
// kernel-internal, not KABI
#[repr(C)]
pub struct KabiResult<T, E> {
    /// 0 = Ok, 1 = Err. u32 for ABI stability (`bool` is forbidden in repr(C) ABI types —
    /// its size and valid bit patterns are not guaranteed across compilers and languages).
    pub discriminant: u32,
    /// Align the payload union to 8 bytes regardless of T/E alignment.
    pub _pad: [u8; 4],
    /// Payload. Only `ok` is valid when discriminant==0; only `err` when discriminant==1.
    /// Reading the wrong variant is undefined behavior.
    pub payload: KabiResultPayload<T, E>,
}

/// Payload union for `KabiResult`. The active variant is determined by the
/// enclosing `KabiResult.discriminant` field.
#[repr(C)]
pub union KabiResultPayload<T, E> {
    pub ok: ManuallyDrop<T>,
    pub err: ManuallyDrop<E>,
}
```
ManuallyDrop<T> is a Rust core type with #[repr(transparent)] — it has the
same layout as T and prevents the compiler from running Drop on the inactive
union variant.
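To make the discriminant/union contract concrete, here is a minimal sketch of converting back to an idiomatic Rust `Result` on the receiving side. The `into_result` method is illustrative, not a defined SDK API; the type definitions are repeated from above for self-containment.

```rust
use core::mem::ManuallyDrop;

#[repr(C)]
pub struct KabiResult<T, E> {
    pub discriminant: u32, // 0 = Ok, 1 = Err
    pub _pad: [u8; 4],
    pub payload: KabiResultPayload<T, E>,
}

#[repr(C)]
pub union KabiResultPayload<T, E> {
    pub ok: ManuallyDrop<T>,
    pub err: ManuallyDrop<E>,
}

impl<T, E> KabiResult<T, E> {
    /// Convert to a Rust Result, trusting the discriminant invariant.
    /// Reading the variant not selected by `discriminant` would be UB,
    /// which is why the match is the only place the union is touched.
    pub fn into_result(self) -> Result<T, E> {
        unsafe {
            match self.discriminant {
                0 => Ok(ManuallyDrop::into_inner(self.payload.ok)),
                _ => Err(ManuallyDrop::into_inner(self.payload.err)),
            }
        }
    }
}
```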
The single resolved symbol: Drivers export exactly one symbol —
__kabi_driver_entry: extern "C" fn() -> *const KabiDriverManifest.
The kernel resolves this single symbol at load time and calls it to obtain a
pointer to the driver's KabiDriverManifest (defined in
Section 12.6).
The manifest contains the three typed entry points (entry_direct, entry_ring,
entry_ipc) that correspond to the three transport tiers. The kernel reads the
manifest's transport_mask to determine which entry points are present, selects
the appropriate one based on the assigned tier, and calls it — passing the
kernel's services vtable and receiving the driver's vtable in return.
This reconciles the "one resolved symbol" design (the linker only resolves
__kabi_driver_entry) with the three-tier transport model (the manifest provides
the typed dispatch table). All subsequent communication flows through vtable method
calls; no further symbol resolution occurs. This eliminates the entire class of
symbol versioning problems that plagues Linux's EXPORT_SYMBOL / MODULE_VERSION
mechanism.
umka-kabi-gen vs kabi-compat-check: umka-kabi-gen generates code
(Rust structs, C headers, validation stubs) from .kabi IDL files.
kabi-compat-check (described in Section 12.2)
validates that a new .kabi file is backward-compatible with the previous release
baseline. Both tools share the same .kabi parser frontend but serve different
purposes: umka-kabi-gen runs at build time, kabi-compat-check runs in CI.
12.5.1 File Format¶
A .kabi file is UTF-8 encoded plain text. Line comments use //. Block
comments use /* */ (may not be nested). File extension: .kabi. Convention:
one vtable definition per file, stored in umka-driver-sdk/interfaces/.
The first non-comment, non-blank statement in a .kabi file must be the
version declaration `kabi_version <N>;`, where <N> is a positive integer
giving the highest version number defined in this file. Fields and methods
introduced in earlier versions remain present.
When a new field or method is added, kabi_version is bumped and the new
item carries a @version(N) annotation matching the new version number.
12.5.2 Type System¶
All types in the IDL map to fixed-layout C and Rust equivalents. The type system deliberately excludes types whose layout is platform-dependent.
12.5.2.1 Primitive Types¶
| IDL type | C type (64-bit targets) | C type (32-bit targets) | Rust output | Width |
|---|---|---|---|---|
| `u8` | `uint8_t` | `uint8_t` | `u8` | 8-bit |
| `u16` | `uint16_t` | `uint16_t` | `u16` | 16-bit |
| `u32` | `uint32_t` | `uint32_t` | `u32` | 32-bit |
| `u64` | `uint64_t` | `uint64_t` | `u64` | 64-bit |
| `u128` | `__uint128_t` | `umka_kabi_u128_t` (struct, field order endian-dependent — see §12.5.2.2 below) | `u128` | 128-bit |
| `i8` | `int8_t` | `int8_t` | `i8` | 8-bit |
| `i16` | `int16_t` | `int16_t` | `i16` | 16-bit |
| `i32` | `int32_t` | `int32_t` | `i32` | 32-bit |
| `i64` | `int64_t` | `int64_t` | `i64` | 64-bit |
| `i128` | `__int128_t` | `umka_kabi_i128_t` (struct, field order endian-dependent — see §12.5.2.2 below) | `i128` | 128-bit |
| `f32` | `float` | `float` | `f32` | IEEE 754 32-bit |
| `f64` | `double` | `double` | `f64` | IEEE 754 64-bit |
| `bool` | `uint8_t` | `uint8_t` | `u8` | 0=false, 1=true; other values undefined |
Warning:
`f32` and `f64` have platform-defined NaN representations and must not be used to carry values that must be bit-identical across architectures. Use scaled integers (e.g., `u32` in milliunits) instead.
12.5.2.2 128-bit Integer Portability Shim¶
__uint128_t and __int128_t are GCC/Clang extensions available on 64-bit
targets only. On 32-bit UmkaOS targets (ARMv7, PPC32), the compiler-defined
__SIZEOF_INT128__ macro is absent and these types do not exist. The
umka-kabi-gen tool emits the following shim in the C header preamble for
all generated headers, selecting the correct representation at compile time:
/* --- umka_kabi_u128_t / umka_kabi_i128_t portability shim ---
*
* Generated by umka-kabi-gen in the C header preamble for all targets.
* On 64-bit targets with __SIZEOF_INT128__, maps directly to compiler built-ins.
* On 32-bit targets (ARMv7, PPC32) where __SIZEOF_INT128__ is absent, uses a
* two-u64-field struct. Field order is endian-aware: little-endian targets have
* lo first; big-endian targets (PPC32 BE) have hi first.
*
* Use the UMKA_KABI_U128_LO / UMKA_KABI_U128_HI macros for portable access.
*/
#if defined(__SIZEOF_INT128__)
typedef unsigned __int128 umka_kabi_u128_t;
typedef __int128 umka_kabi_i128_t;
#define UMKA_KABI_U128_LO(v) ((uint64_t)((v) & 0xFFFFFFFFFFFFFFFFULL))
#define UMKA_KABI_U128_HI(v) ((uint64_t)((v) >> 64))
#elif __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
typedef struct { uint64_t lo; uint64_t hi; } umka_kabi_u128_t; /* LE layout */
typedef struct { uint64_t lo; int64_t hi; } umka_kabi_i128_t; /* LE layout */
#define UMKA_KABI_U128_LO(v) ((v).lo)
#define UMKA_KABI_U128_HI(v) ((v).hi)
#else
typedef struct { uint64_t hi; uint64_t lo; } umka_kabi_u128_t; /* BE layout */
typedef struct { int64_t hi; uint64_t lo; } umka_kabi_i128_t; /* BE layout */
#define UMKA_KABI_U128_LO(v) ((v).lo)
#define UMKA_KABI_U128_HI(v) ((v).hi)
#endif
Big-endian ABI note: On big-endian targets (PPC32 in BE mode), Rust's u128 is
stored with the most-significant 64 bits at the lower memory address (hi first).
The shim's #else branch mirrors this: { uint64_t hi; uint64_t lo; }. This matches
the compiler's layout of __uint128_t on big-endian 64-bit targets. Cross-node wire
protocols that carry u128 values must use explicit Le128/Be128 wrapper types
with documented byte order — raw u128 in a #[repr(C)] struct is endian-dependent.
Design guidance: In KABI IDL, avoid u128 in interfaces unless mathematically
required (e.g., cryptographic nonces, UUIDs, 128-bit packet counters). Prefer two
u64 fields with explicit semantics for cross-platform clarity and debuggability.
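The two-u64 guidance can be sketched as follows; `split_u128`/`join_u128` are illustrative helper names, not SDK functions.

```rust
/// Illustrative helpers (not part of the SDK): carry a 128-bit value as two
/// explicit u64 halves with documented semantics, per the design guidance.
pub fn split_u128(v: u128) -> (u64, u64) {
    (v as u64, (v >> 64) as u64) // (lo, hi)
}

pub fn join_u128(lo: u64, hi: u64) -> u128 {
    ((hi as u128) << 64) | lo as u128
}
```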
The IDL has no usize or isize type. These are intentionally excluded:
pointer-sized integers differ between 32-bit and 64-bit UmkaOS targets (ARMv7,
PPC32 are 32-bit; x86-64, AArch64, RISC-V 64, PPC64LE are 64-bit). Use u64
or i64 for ABI-stable pointer-sized values. For actual pointers, use *const T
or *mut T.
12.5.2.3 Pointer and Aggregate Types¶
| IDL syntax | C output | Rust output | Notes |
|---|---|---|---|
| `*const T` | `const T*` | `*const T` | Non-null; use `Option<*const T>` for nullable |
| `*mut T` | `T*` | `*mut T` | Non-null; use `Option<*mut T>` for nullable |
| `Option<*const T>` | `const T*` | `Option<*const T>` | Nullable pointer; C uses `NULL` |
| `Option<*mut T>` | `T*` | `Option<*mut T>` | Nullable pointer; C uses `NULL` |
| `[T; N]` | `T arr[N]` | `[T; N]` | `N` must be a positive integer literal |
Pointers in the IDL are always raw. Rust references (&T, &mut T) are never
permitted in .kabi files because lifetime annotations cannot cross the ABI
boundary. Raw pointers carry no lifetime, which is correct for cross-language
vtable dispatch.
12.5.2.4 Type Aliases¶
Type aliases give semantic names to primitive types for readability. They have no effect on ABI layout.
type DeviceId = u64;
type Timeout = u32; // milliseconds; a comment documenting units is required
type ErrCode = i32;
Type aliases must appear before their first use in the file. An alias to another alias is allowed; cycles are rejected with a compile error.
12.5.3 Struct Definition¶
Structs map to #[repr(C)] in Rust and to a typedef struct with explicit
alignment in C. Field layout follows the standard C ABI rules for the target
architecture.
12.5.3.1 Syntax¶
@version(<N>) // highest version any field in this struct was introduced
@align(<A>) // optional; force struct alignment to A bytes (power of 2)
struct <Name> {
@version(<V>) // version this field was introduced; required on every field
<field_name>: <Type>,
@version(<V>)
@deprecated(since = <D>) // optional; informational only
<field_name>: <Type>,
}
Rules enforced by the compiler:
- For vtable structs, `vtable_size` must be the first field declared. For plain structs not used as vtables, there is no mandatory first field.
- Field version annotations must be monotonically non-decreasing top to bottom (a field annotated `@version(5)` may not appear before a field annotated `@version(3)`).
- No field may be removed between IDL versions (enforced by `kabi-compat-check`).
- No field may be reordered between IDL versions.
- `@align` must be a power of 2 between 1 and 4096 inclusive.
- `@deprecated(since = N)` is informational only; the field remains in generated output with no layout or runtime effect.
12.5.3.2 Example¶
// interfaces/block_device.kabi
kabi_version 2;
@version(1)
@align(8)
struct BlockDeviceInfo {
@version(1)
vtable_size: u32, // set by caller to sizeof(BlockDeviceInfo)
@version(1)
device_flags: u32,
@version(1)
block_size: u32, // bytes per logical block
@version(1)
queue_depth: u32, // maximum number of in-flight commands
@version(2) // field added in version 2
numa_node: u32, // preferred NUMA node for DMA allocation
@version(2)
_pad: [u8; 4], // explicit padding for 8-byte alignment of capacity_blocks
@version(2)
capacity_blocks: u64, // device capacity in logical blocks
}
12.5.3.3 Generated C Output¶
/* Generated by umka-kabi-gen from interfaces/block_device.kabi */
/* DO NOT EDIT — regenerate with: umka-kabi-gen block_device.kabi */
#define KABI_BLOCK_DEVICE_INFO_V1_SIZE \
((size_t)offsetof(kabi_BlockDeviceInfo, numa_node))
#define KABI_BLOCK_DEVICE_INFO_V2_SIZE \
((size_t)sizeof(kabi_BlockDeviceInfo))
typedef struct __attribute__((aligned(8))) {
uint32_t vtable_size;
uint32_t device_flags;
uint32_t block_size;
uint32_t queue_depth;
/* Version 2+: present only if vtable_size >= KABI_BLOCK_DEVICE_INFO_V2_SIZE */
uint32_t numa_node;
uint8_t _pad[4];
uint64_t capacity_blocks;
} kabi_BlockDeviceInfo;
12.5.3.4 Generated Rust Output¶
// Generated by umka-kabi-gen from interfaces/block_device.kabi
// DO NOT EDIT — regenerate with: umka-kabi-gen block_device.kabi
#[repr(C, align(8))]
pub struct BlockDeviceInfo {
pub vtable_size: u32,
pub device_flags: u32,
pub block_size: u32,
pub queue_depth: u32,
// Version 2+:
pub numa_node: u32,
pub _pad: [u8; 4],
pub capacity_blocks: u64,
}
impl BlockDeviceInfo {
/// Size of the struct through the last V1 field (up to, not including, numa_node).
pub const V1_SIZE: usize =
core::mem::offset_of!(BlockDeviceInfo, numa_node);
/// Full size of the struct including all fields through version 2.
pub const V2_SIZE: usize =
core::mem::size_of::<BlockDeviceInfo>();
}
// BlockDeviceInfo: vtable_size(u32=4) + device_flags(u32=4) + block_size(u32=4) +
// queue_depth(u32=4) + numa_node(u32=4) + _pad([u8;4]=4) + capacity_blocks(u64=8)
// = 32 bytes, align(8).
const_assert!(size_of::<BlockDeviceInfo>() == 32);
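A consumer honors the size-based versioning like this. The struct and size constants are repeated from the generated output above for self-containment; `numa_node_of` is an illustrative helper, not generated code.

```rust
// Repeated from the generated output above, for self-containment.
#[repr(C, align(8))]
pub struct BlockDeviceInfo {
    pub vtable_size: u32,
    pub device_flags: u32,
    pub block_size: u32,
    pub queue_depth: u32,
    pub numa_node: u32,       // Version 2+
    pub _pad: [u8; 4],        // Version 2+
    pub capacity_blocks: u64, // Version 2+
}

impl BlockDeviceInfo {
    pub const V1_SIZE: usize = core::mem::offset_of!(BlockDeviceInfo, numa_node);
    pub const V2_SIZE: usize = core::mem::size_of::<BlockDeviceInfo>();
}

/// Illustrative consumer-side check: read a V2 field only when the
/// producer's declared size covers it. A V1 producer fills only the first
/// V1_SIZE (16) bytes; reading numa_node then would read garbage.
pub fn numa_node_of(info: &BlockDeviceInfo) -> Option<u32> {
    if info.vtable_size as usize >= BlockDeviceInfo::V2_SIZE {
        Some(info.numa_node)
    } else {
        None
    }
}
```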
12.5.4 Vtable Definition¶
Vtables are the primary unit of KABI. A vtable is a C-compatible struct of
function pointers with a mandatory vtable_size: u64 field as the first member.
12.5.4.1 Syntax¶
@version(<N>)
vtable <Name> {
@version(1)
vtable_size: u64, // MUST be first, MUST be version 1, MUST be u64
@version(<V>)
@perm(<PermissionBits>) // MANDATORY: required permission for dispatch
@syscap(<SystemCaps>) // OPTIONAL: required SystemCaps for dual-check
fn <method_name>(<param>: <Type>, ...) -> <ReturnType>;
@version(<V>)
@optional // null function pointer is permitted
@perm(<PermissionBits>) // still mandatory even on optional methods
@syscap(<SystemCaps>) // optional even on optional methods
fn <method_name>(<param>: <Type>, ...) -> <ReturnType>;
}
The vtable_size field type is always u64, not u32. u64 ensures the same
wire encoding on both 32-bit and 64-bit UmkaOS targets (ARMv7 and PPC32 are 32-bit
platforms where usize is 32 bits; using u64 avoids a layout discrepancy when
a 64-bit kernel talks to a 32-bit driver in a cross-arch scenario).
Function parameters and return types use the same primitive and aggregate types
as structs. The return type () denotes no return value (void in C). The
KabiResult<T, E> type for error-returning methods is defined in the KABI
runtime support library (umka-driver-sdk/src/abi.rs) and referenced by name
in the IDL.
@optional marks a method whose function pointer may be null in a loaded
driver. Callers generated by umka-kabi-gen always check for null before
calling an @optional method and invoke the fallback branch when it is null.
Methods without @optional must be non-null in every loaded vtable; a null
mandatory pointer causes the driver loader to reject the driver with ENOEXEC.
@perm(<PermissionBits>) declares the PermissionBits required to invoke the method.
This annotation is mandatory on every method (including @optional methods).
Valid values: READ, WRITE, ADMIN, or |-combined (e.g., READ | WRITE).
The kabi-gen tool rejects methods without a @perm annotation. See the
"KABI Operation Permission Requirements" section for per-vtable tables and the
dispatch enforcement path.
12.5.4.2 Versioning Contract¶
- The kernel fills `vtable_size` to `sizeof(KernelServicesVTable)` (the kernel's own compile-time size) before passing its vtable to the driver.
- The driver fills `vtable_size` to `sizeof(DriverVTable)` (the driver's own compile-time size) before passing its vtable to the kernel.
- The receiver of a vtable must check `vtable_size >= offset_of!(VTable, method)` before calling any method that may not be present in an older vtable.
- The `umka-kabi-gen`-generated `_or_fallback` helpers perform this check automatically; driver code must always call the helper, never the raw function pointer, for a versioned (non-V1) method.
- New methods are always appended at the end; the append-only rule is enforced by `kabi-compat-check` in CI.
12.5.4.3 Example¶
kabi_version 3;
/// Kernel-provided services vtable. The kernel fills this struct and passes
/// a pointer to the driver at load time via the tier entry point
/// (entry_direct/entry_ring/entry_ipc from the KabiDriverManifest).
/// See [Section 12.3](#kabi-bilateral-capability-exchange) for the canonical Rust definition.
@version(3)
vtable KernelServicesVTable {
@version(1)
vtable_size: u64,
@version(1)
@perm(WRITE)
@syscap(CAP_DMA)
fn alloc_dma_buffer(size: u64, align: u64, flags: AllocFlags) -> AllocResult;
@version(1)
@perm(WRITE)
@syscap(CAP_DMA)
fn free_dma_buffer(handle: DmaBufferHandle) -> FreeResult;
@version(1)
@perm(ADMIN)
@syscap(CAP_IRQ)
fn register_interrupt(irq: u32, handler: InterruptHandler, ctx: *mut c_void) -> IrqResult;
@version(1)
@perm(ADMIN)
@syscap(CAP_IRQ)
fn deregister_interrupt(irq: u32) -> IrqResult;
@version(1)
@perm(READ)
fn log(level: u32, msg: *const u8, len: u32);
@version(2)
@optional
@perm(WRITE)
fn create_ring_buffer(entries: u32, entry_size: u32, flags: RingFlags) -> RingResult;
@version(3)
@optional
@perm(READ)
fn fma_report_health(device_handle: DeviceHandle, event_class: HealthEventClass,
event_code: u32, severity: HealthSeverity,
data: *const u8, data_len: u32) -> IoResultCode;
}
12.5.4.4 Generated C Output¶
/* Generated by umka-kabi-gen from interfaces/kernel_services.kabi */
/* DO NOT EDIT */
#define KABI_KERNEL_SERVICES_V1_SIZE \
((size_t)offsetof(kabi_KernelServicesVTable, create_ring_buffer))
#define KABI_KERNEL_SERVICES_V2_SIZE \
((size_t)offsetof(kabi_KernelServicesVTable, fma_report_health))
#define KABI_KERNEL_SERVICES_V3_SIZE \
((size_t)sizeof(kabi_KernelServicesVTable))
typedef struct {
uint64_t vtable_size;
uint64_t kabi_version;
kabi_AllocResult (*alloc_dma_buffer)(uint64_t size, uint64_t align,
kabi_AllocFlags flags);
kabi_FreeResult (*free_dma_buffer)(kabi_DmaBufferHandle handle);
kabi_IrqResult (*register_interrupt)(uint32_t irq,
kabi_InterruptHandler handler,
void *ctx);
kabi_IrqResult (*deregister_interrupt)(uint32_t irq);
void (*log)(uint32_t level, const uint8_t *msg, uint32_t len);
/* Version 2+ (@optional): may be NULL; check vtable_size before calling */
kabi_RingResult (*create_ring_buffer)(uint32_t entries, uint32_t entry_size,
kabi_RingFlags flags);
/* Version 3+ (@optional): FMA health telemetry ([Section 20.1](20-observability.md#fault-management-architecture)) */
kabi_IoResultCode (*fma_report_health)(kabi_DeviceHandle device_handle,
kabi_HealthEventClass event_class,
uint32_t event_code,
kabi_HealthSeverity severity,
const uint8_t *data, uint32_t data_len);
} kabi_KernelServicesVTable;
12.5.4.5 Generated Rust Output¶
// Generated by umka-kabi-gen from interfaces/kernel_services.kabi
// DO NOT EDIT
// kernel-internal, not KABI
#[repr(C)]
pub struct KernelServicesVTable {
pub vtable_size: u64,
pub kabi_version: u64,
pub alloc_dma_buffer: unsafe extern "C" fn(u64, u64, AllocFlags) -> AllocResult,
pub free_dma_buffer: unsafe extern "C" fn(DmaBufferHandle) -> FreeResult,
pub register_interrupt: unsafe extern "C" fn(u32, InterruptHandler, *mut c_void) -> IrqResult,
pub deregister_interrupt: unsafe extern "C" fn(u32) -> IrqResult,
pub log: unsafe extern "C" fn(u32, *const u8, u32),
// Version 2+ (@optional):
pub create_ring_buffer: Option<unsafe extern "C" fn(u32, u32, RingFlags) -> RingResult>,
// Version 3+ (@optional):
pub fma_report_health: Option<unsafe extern "C" fn(
DeviceHandle, HealthEventClass, u32, HealthSeverity, *const u8, u32,
) -> IoResultCode>,
}
impl KernelServicesVTable {
pub const V1_SIZE: usize =
core::mem::offset_of!(KernelServicesVTable, create_ring_buffer);
pub const V2_SIZE: usize =
core::mem::offset_of!(KernelServicesVTable, fma_report_health);
pub const V3_SIZE: usize =
core::mem::size_of::<KernelServicesVTable>();
/// Version-safe wrapper for `create_ring_buffer` (V2 optional method).
/// Falls back gracefully when the kernel's vtable predates V2 or
/// when the method pointer is null.
/// Always use this helper; never call `create_ring_buffer` directly.
///
/// # Safety
///
/// `self` must point to a valid kernel-provided vtable. All pointer
/// arguments must satisfy the documented preconditions of the underlying
/// `create_ring_buffer` method.
    pub unsafe fn create_ring_buffer_or_fallback(
        &self,
        entries: u32,
        entry_size: u32,
        flags: RingFlags,
    ) -> RingResult {
        if self.vtable_size as usize >= Self::V2_SIZE {
            if let Some(f) = self.create_ring_buffer {
                return f(entries, entry_size, flags);
            }
        }
        // V1 kernel: no ring buffer support. Driver falls back to
        // polling-based I/O or returns an error to the caller.
        // RingResult is a KabiResult instantiation; build the Err variant
        // directly per the canonical layout in Section 12.5.
        RingResult {
            discriminant: 1,
            _pad: [0; 4],
            payload: KabiResultPayload {
                err: ManuallyDrop::new(-ENOSYS),
            },
        }
    }
}
umka-kabi-gen generates one _or_fallback helper per versioned method. Driver
code must call the helper instead of the raw function pointer for any method
introduced in version 2 or later. Calling the raw pointer directly on an older
vtable produces undefined behavior.
12.5.5 Enum Definition¶
Enums map to #[repr(C, <repr>)] in Rust and to a typedef of the underlying
integer type in C. Every enum requires an explicit @repr annotation.
12.5.5.1 Syntax¶
@version(<N>)
@repr(<UnsignedIntType>) // required; one of: u8, u16, u32, u64
@flags // optional; see [Section 12.5](#kabi-idl-language-specification--example-flag-enums)
enum <Name> {
@version(<V>)
<Variant> = <IntegerLiteral>,
}
Rules:
- Every variant must carry an explicit integer discriminant value.
- Discriminant values must be unique within the enum.
- For `@flags` enums, every discriminant value must be a power of 2. The compiler validates this and rejects any non-power-of-2 value.
- New variants may only be appended; existing discriminant values may never be reassigned.
- Code receiving an enum value from the ABI must handle unknown discriminants gracefully (return an error or, for `@flags`, silently mask unknown bits). The generated Rust type carries `#[non_exhaustive]` to enforce this at compile time. C code must include a `default:` branch in every switch on a KABI enum.
12.5.5.2 Example: Exclusive States¶
@version(2)
@repr(u32)
enum DriverState {
@version(1)
Initializing = 0,
@version(1)
Running = 1,
@version(1)
Suspended = 2,
@version(2)
Degraded = 3, // new in version 2; code built against V1 never receives this
}
Generated Rust:
#[repr(u32)]
#[non_exhaustive]
pub enum DriverState {
Initializing = 0,
Running = 1,
Suspended = 2,
// Version 2+:
Degraded = 3,
}
Generated C:
typedef uint32_t kabi_DriverState;
#define KABI_DRIVER_STATE_INITIALIZING ((kabi_DriverState)0u)
#define KABI_DRIVER_STATE_RUNNING ((kabi_DriverState)1u)
#define KABI_DRIVER_STATE_SUSPENDED ((kabi_DriverState)2u)
/* Version 2+: */
#define KABI_DRIVER_STATE_DEGRADED ((kabi_DriverState)3u)
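On the receiving side, the unknown-discriminant rule means decoding from the raw integer rather than transmuting. A sketch, with the enum repeated from above; the `driver_state_from_raw` helper is illustrative, not generated code.

```rust
#[repr(u32)]
#[non_exhaustive]
#[derive(Debug, PartialEq, Eq)]
pub enum DriverState {
    Initializing = 0,
    Running = 1,
    Suspended = 2,
    Degraded = 3, // Version 2+
}

/// Decode a raw discriminant received over the ABI. Matching on the raw
/// integer (never transmuting) keeps an unknown value from a newer sender
/// from becoming undefined behavior.
pub fn driver_state_from_raw(raw: u32) -> Option<DriverState> {
    match raw {
        0 => Some(DriverState::Initializing),
        1 => Some(DriverState::Running),
        2 => Some(DriverState::Suspended),
        3 => Some(DriverState::Degraded),
        _ => None, // unknown discriminant: report an error upstream
    }
}
```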
12.5.5.3 Example: Flag Enums¶
Enums annotated with @flags represent bitmask sets. Variant values must be
non-overlapping powers of 2. Unknown bits from a newer sender must be silently
masked out by older receivers. The generated Rust type is an integer typedef
with named constants rather than a Rust enum, because Rust's match exhaustiveness
cannot handle arbitrary bitmask combinations.
@version(2)
@repr(u32)
@flags
enum DriverFlags {
@version(1)
SupportsHotplug = 0x0001,
@version(1)
SupportsSuspend = 0x0002,
@version(2)
SupportsMigration = 0x0004,
}
Generated Rust:
/// Bitmask type for DriverFlags. Unknown bits must be ignored for forward
/// compatibility with newer kernel versions.
pub mod DriverFlags {
pub type Type = u32;
pub const SUPPORTS_HOTPLUG: Type = 0x0001;
pub const SUPPORTS_SUSPEND: Type = 0x0002;
/// Version 2+:
pub const SUPPORTS_MIGRATION: Type = 0x0004;
/// Mask of all bits defined through the version this code was compiled against.
pub const KNOWN_BITS: Type = 0x0007;
}
Generated C:
typedef uint32_t kabi_DriverFlags;
#define KABI_DRIVER_FLAGS_SUPPORTS_HOTPLUG ((kabi_DriverFlags)0x0001u)
#define KABI_DRIVER_FLAGS_SUPPORTS_SUSPEND ((kabi_DriverFlags)0x0002u)
/* Version 2+: */
#define KABI_DRIVER_FLAGS_SUPPORTS_MIGRATION ((kabi_DriverFlags)0x0004u)
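An older receiver applies the masking rule like this; `sanitize_flags` is an illustrative helper, and `KNOWN_BITS` matches the generated constant shown above.

```rust
/// KNOWN_BITS as generated for code compiled against DriverFlags version 2.
pub const KNOWN_BITS: u32 = 0x0007;

/// Illustrative receiver-side sanitization: unknown bits set by a newer
/// sender are silently masked out, per the @flags compatibility rule.
pub fn sanitize_flags(raw: u32) -> u32 {
    raw & KNOWN_BITS
}
```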
12.5.6 requires and provides Declarations¶
Every .kabi file that defines a loadable module's interface must declare the
KABI services it provides and the services it requires from other modules. These
declarations are checked at build time (topological sort to reject cycles) and
embedded as metadata in the compiled module binary (see
Section 12.7).
// mdio.kabi — MDIO bus framework (Tier 0 loadable)
requires pci_bus; // this module needs the pci_bus interface
provides mdio_service; // this module exports the mdio_service interface
Circular requires/provides graphs are rejected at build time with an error
that identifies the full cycle:
error[KABI-E0021]: circular dependency detected
→ mdio_framework requires pci_bus
→ pci_bus requires mdio_framework
12.5.7 Version Compatibility Rules¶
The following rules apply uniformly to structs, vtables, and enums.
- **Append-only fields and methods:** Fields and methods may only be added at the end. Removal, reordering, or type changes require a new `kabi_version` number and a compatibility shim (see Section 12.2).
- **Size-based version detection:** Both sides of a KABI boundary set their own `vtable_size` to the compile-time `sizeof` of the vtable. The receiver uses `min(sender_size, receiver_size)` as the safe access boundary.
- **Unknown enum variants:** Receivers must not panic or invoke undefined behavior when they receive an unknown discriminant from a newer sender. Rust's `#[non_exhaustive]` enforces this at compile time. C code must include a `default:` case in every switch.
- **Unknown flag bits:** Unknown bits in a `@flags` value must be silently ignored. Code must not assert on the exact set of bits present.
- **Deprecation:** `@deprecated(since = N)` is informational only. Deprecated items remain in the ABI indefinitely with no layout or behavioral change.
- **Mandatory presence:** The `vtable_size` field must always be present and always first. Drivers that expose a vtable smaller than the offset of the last mandatory (non-`@optional`) method are rejected at load time with `ENOEXEC`.
12.5.8 Compiler Invocation¶
umka-kabi-gen is the single tool for generating KABI bindings and validating
ABI compatibility. The build system invokes it automatically; the forms below
support driver SDK development and manual CI operations.
12.5.8.1 Code Generation¶
# Generate C and Rust bindings from a single .kabi file:
umka-kabi-gen \
--input interfaces/block_device.kabi \
--output-c generated/kabi_block_device.h \
--output-rs generated/kabi_block_device.rs \
--transport ring # direct | ring | ipc (see [Section 12.6](#kabi-transport-classes))
# Generate all three transport variants at once (build system default):
umka-kabi-gen \
--input interfaces/block_device.kabi \
--output-dir generated/
# Produces: kabi_block_device_direct.rs, kabi_block_device_ring.rs,
# kabi_block_device_ipc.rs, kabi_block_device.h
The --transport flag selects the call dispatch mechanism
(Section 12.6):
| Value | Use case | Dispatch cost |
|---|---|---|
| `direct` | Core-domain callers, Tier 0 modules | ~2-5 cycles |
| `ring` | Tier 1 drivers (domain-isolated, Ring 0) | ~150-300 cycles |
| `ipc` | Tier 2 drivers (process-isolated, Ring 3) | ~1-5 microseconds |
When --output-dir is given without --transport, all three transport
variants are generated and the C header is shared across all three.
12.5.8.2 ABI Compatibility Validation¶
# Validate that new.kabi is backward-compatible with baseline old.kabi:
umka-kabi-gen \
--validate interfaces/block_device.kabi \
--previous baseline/block_device_v1.kabi
# Exit code 0: compatible.
# Non-zero: at least one incompatibility; errors printed to stderr.
The validator rejects any of the following:
- A field or method present in `--previous` is absent in `--validate`.
- Fields or methods appear in a different order than in `--previous`.
- The `@repr` type of an enum is changed.
- `vtable_size` is absent or is not the first field of a vtable.
- An enum discriminant value present in `--previous` is changed in `--validate`.
- A new field or method in `--validate` carries a `@version` annotation not greater than every annotation in `--previous`.
- A `@flags` enum variant value is not a power of 2.
- Two enum variants share an integer discriminant value.
Validation runs in CI on every commit that modifies a .kabi file (see
Section 24.3, step 4). Failures block merge to master.
12.5.8.3 Diagnostic Format¶
All errors and warnings are reported in a structured format with source locations and stable error codes:
interfaces/block_device.kabi:14:5: error[KABI-E0011]: field 'block_size' removed
block_size: u32,
^~~~~~~~~~
note: this field was present in baseline/block_device_v1.kabi:14:5
help: fields may never be removed; annotate with @deprecated(since = 2) instead
interfaces/block_device.kabi:18:5: error[KABI-E0012]: field 'queue_depth' reordered
queue_depth: u32,
^~~~~~~~~~~
note: expected at position 3 (matching baseline); found at position 4
help: fields may never be reordered; append new fields at the end
error: aborting due to 2 previous errors
Each diagnostic carries a unique error code (KABI-E<NNNN>) for stable
reference in CI logs, changelogs, and issue trackers. umka-kabi-gen exits
with a non-zero status whenever any error is emitted.
12.6 KABI Transport Classes¶
KABI bundles two orthogonal concerns that must be kept distinct: the interface contract (IDL types, vtable layout, version fields) and the transport (how a call physically crosses from caller to callee). The interface contract is universal — it is the stable ABI. The transport is determined by whether the call crosses a hardware-enforced domain boundary.
12.6.1 Why Transport Is Separate from Interface¶
The ring buffer transport (Section 11.7) exists for one reason: to safely cross a hardware memory domain boundary (MPK/POE/DACR). At that boundary the caller cannot directly call into the callee's address range — the domain switch must happen first, and the ring buffer is the handshake. The transport IS the isolation mechanism.
This reasoning does not apply inside the Core domain. Tier 0 loadable modules (Section 11.3) run in Ring 0 in the same memory domain as the static kernel binary. There is no MPK boundary. Forcing ring buffers between them would add 100–500 cycles per call with zero safety benefit, because:
- No domain boundary exists to enforce — both sides share the same address space
- Ring buffers do not contain crashes when there is no memory isolation behind them
- Synchronous framework APIs (bus register read, SCSI command dispatch, SPI transfer) require a result before the caller can proceed; the async ring model adds roundtrip cost for no latency benefit
- Debugging is harder — async rings split call chains, hiding the source of a bug
The ring buffer's safety guarantee requires the combination of (ring buffer) + (hardware domain isolation). Without the isolation, the ring buffer is overhead with false safety intuition attached.
12.6.2 Three Transport Classes¶
The kabi-gen toolchain generates bindings for three transport classes from a single
.kabi IDL source. The transport class is a parameter to kabi-gen, not a property
of the interface.
Transport T0 — Direct Vtable Call (Core domain)
Used between static Core and Tier 0 loadable modules, and between Tier 0 loadable modules. Both caller and callee are in the same memory domain.
/// Generated by: kabi-gen --transport=direct mdio.kabi
///
/// Caller is in Core domain. Callee (mdio_framework) is a Tier 0 loadable module
/// in the same domain. The call is a direct indirect branch through the vtable
/// function pointer — ~2-5 cycles dispatch overhead.
///
/// # Safety
/// `handle` must be a valid T0 service handle obtained from `KabiServiceRegistry`.
/// The module providing this handle is guaranteed loaded and never unloaded
/// (Tier 0 load_once semantics; see [Section 12.7](#kabi-service-dependency-resolution--tier-0-module-lifecycle-load_once)).
pub unsafe fn mdio_read_reg(handle: &MdioServiceHandleT0, dev: u32, reg: u16) -> u16 {
((*handle.vtable).read_reg)(handle.ctx, dev, reg)
}
Properties:
- Cost: ~2–5 cycles (vtable pointer dereference + indirect call)
- Synchronous: caller blocks until return; no queue management
- Stack: uses caller's stack; no separate consumer thread
- Data: zero-copy — arguments passed in registers or by pointer, same address space
- Crash consequence: a fault in the callee panics the kernel (same as static Core)
- Debugging: full contiguous call stack visible in backtraces and panic dumps
Transport T1 — Ring Buffer + Domain Switch (Cross-domain)
Used at every boundary that crosses hardware domain isolation: Core domain → Tier 1, Tier 1 → Tier 1, Core domain → Tier 2, Tier 1 → Tier 2. This is the existing KABI transport described in Section 11.7. Ring buffers ARE the isolation mechanism at these boundaries.
Properties:
- Cost: ~200–500 cycles minimum (atomic head/tail update, potential cache miss, domain switch, wake-up, dequeue on the far side)
- Async-capable: producer and consumer can run independently; completions delivered via ring notifications
- Data: zero-copy via shared memory ring descriptors
- Crash consequence: contained within the isolated domain; far side survives
- Debugging: split stack traces; correlation requires ring sequence numbers
Transport T2 — Ring Buffer + Syscall (Ring 3 boundary)
Used at the Tier 2 (Ring 3 process) boundary. Structurally identical to T1 but the domain switch is a privilege level change (Ring 0 → Ring 3 or vice versa). The ring buffer crossing also acts as a syscall interception point for capability validation.
T2 Ring Entry Formats:
The Tier 2 command and completion rings use cache-line-aligned entries for optimal performance across the Ring 0/Ring 3 boundary. Arguments and results are passed through a separate shared memory region (mmap'd into both the driver process and the kernel), not inline in the ring — this keeps the ring entries fixed-size and avoids variable-length parsing on the kernel side.
/// Command entry placed by a Tier 2 driver into the submission ring.
/// The driver writes one entry per vtable method invocation.
#[repr(C, align(64))]
pub struct T2CommandEntry {
/// Vtable method ordinal (0-based index into the service vtable).
/// The kernel validates this against vtable_size before dispatch.
pub method_index: u32,
/// T2_CMD_NOTIFY: request doorbell notification on completion.
/// T2_CMD_BATCH: more commands follow — defer doorbell until last.
pub flags: u32,
/// Byte offset into the shared argument buffer where serialized
/// arguments begin. The kernel validates offset + arg_len <= buffer_size.
pub arg_offset: u32,
/// Argument data length in bytes.
pub arg_len: u32,
/// Opaque value echoed verbatim in the corresponding T2CompletionEntry.
/// The driver uses this to correlate completions with outstanding requests
/// (e.g., as an index into a pending-request table).
pub cookie: u64,
/// Reserved for future use. Must be zero.
pub _reserved: [u8; 40],
}
// T2CommandEntry: method_index(4) + flags(4) + arg_offset(4) + arg_len(4) +
// cookie(8) + _reserved(40) = 64 bytes (one cache line).
const_assert!(core::mem::size_of::<T2CommandEntry>() == 64);
/// Completion entry placed by the kernel into the completion ring.
/// The driver reads completions to determine method call results.
#[repr(C, align(64))]
pub struct T2CompletionEntry {
/// Matches the cookie from the corresponding T2CommandEntry.
pub cookie: u64,
/// 0 = success, negative = -errno (matches Linux error convention).
pub status: i32,
/// Result data length in bytes (in the shared result buffer).
pub result_len: u32,
/// Byte offset into the shared result buffer where the result begins.
pub result_offset: u32,
/// Reserved for future use. Must be zero.
pub _reserved: [u8; 44],
}
// T2CompletionEntry: cookie(8) + status(4) + result_len(4) +
// result_offset(4) + _reserved(44) = 64 bytes (one cache line).
const_assert!(core::mem::size_of::<T2CompletionEntry>() == 64);
Both structs are exactly 64 bytes (one cache line). The shared argument/result
buffer is a separate mmap'd region allocated per Tier 2 driver instance; its size
is negotiated at driver attach time (default: 256 KiB, configurable via driver
policy). Method dispatch: the kernel reads method_index from the command entry,
validates it against the service's vtable_size (bounds check — rejects any index
that would exceed the vtable), and dispatches to the corresponding vtable function
pointer. An out-of-bounds method_index returns status = -EINVAL in the
completion entry.
T2 Copy-Before-Process Invariant: The kernel MUST copy T2 command arguments from the shared argument buffer into kernel-private memory before any processing, validation, or dispatch. The shared buffer is writable by the Tier 2 driver process; reading arguments directly from shared memory creates a TOCTOU (time-of-check to time-of-use) vulnerability where the driver can modify argument bytes between validation and use.
The T2 dispatch sequence:
- Read `method_index`, `arg_offset`, and `arg_len` from the command ring entry (the ring entry is in shared memory — read each field exactly once into local variables).
- Bounds-check: `arg_offset + arg_len <= buffer_size`. Reject with `-EINVAL` if out of bounds.
- Copy: `memcpy` `arg_len` bytes from `shared_buf[arg_offset..]` into a kernel-private argument buffer (stack for small arguments, slab allocation for arguments exceeding 512 bytes).
- Deserialize the kernel-private copy into the vtable method's argument struct.
- Dispatch to the vtable function pointer with the deserialized arguments.
The copy cost is bounded by arg_len (typical: <1 KiB for method arguments;
maximum: buffer_size, default 256 KiB). For the common case of small arguments,
the copy targets a stack buffer — zero heap allocation.
This invariant is non-negotiable for the Tier 2 security boundary. A Tier 2 driver is explicitly untrusted (Section 11.3). Any code path that reads from the shared buffer more than once for the same argument field is a security bug.
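The copy-before-process sequence can be sketched in isolation. This is illustrative only — `CmdFields` and `capture_args` are hypothetical stand-ins, and a `Vec` substitutes for the kernel's stack/slab argument buffers:

```rust
use std::ptr;

/// One command entry as read from shared memory (subset of T2CommandEntry fields).
#[derive(Copy, Clone)]
struct CmdFields {
    method_index: u32,
    arg_offset: u32,
    arg_len: u32,
}

/// TOCTOU-safe argument capture: read the shared entry exactly once
/// (volatile, so the compiler cannot re-read shared memory), bounds-check
/// the private snapshot, then copy the argument bytes before any parsing.
fn capture_args(entry: *const CmdFields, shared_buf: &[u8]) -> Result<Vec<u8>, i32> {
    // Step 1: single volatile read into a kernel-private snapshot.
    let snap = unsafe { ptr::read_volatile(entry) };
    // Step 2: overflow-safe bounds check on the snapshot, never on shared memory.
    let off = snap.arg_offset as usize;
    let end = off.checked_add(snap.arg_len as usize).ok_or(-22 /* -EINVAL */)?;
    if end > shared_buf.len() {
        return Err(-22);
    }
    // Step 3: copy into private memory BEFORE deserialization. (The real
    // kernel uses a stack buffer for args <= 512 bytes and slab above that;
    // a Vec stands in for both here.)
    Ok(shared_buf[off..end].to_vec())
}
```

From the returned private copy onward, the driver process can scribble on the shared buffer without affecting validation or dispatch.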
12.6.3 Call Direction at the Tier 0 Boundary¶
The transport class is determined by the calling side's domain, not the callee's:
Tier 1 driver → Tier 0 loadable service:
Tier 1 enqueues request into Core-domain ring (T1 transport).
Core domain receives, dispatches to Tier 0 vtable via direct call (T0 transport).
Tier 1 does not know or care that the service is Tier 0 loadable vs static Core.
From Tier 1's perspective: one ring buffer call to "the kernel", same as always.
Tier 0 loadable → Tier 1 driver (callback / event):
Tier 0 module writes into Tier 1 driver's inbound ring (T1 transport, outbound direction).
Same mechanism as static Core → Tier 1 today.
No new ring buffer infrastructure needed.
Static Core → Tier 0 loadable:
Direct vtable call (T0 transport).
~2-5 cycles (steady state).
Generation-checked: every T0 call site loads
t0_vtable_generation (AtomicU64::load(Acquire)) and compares against the
caller's cached generation. On mismatch (Evolvable evolution in progress):
caller blocks on evolution waitqueue. Cost: ~1-3 extra cycles (one
predicted-taken branch in steady state). During evolution Phase B
(stop-the-world, step 5e), the generation increments — all new T0 calls
arriving after Phase B see the mismatch and block. In-flight T0 calls
complete against the old vtable (still valid, retained until Phase C).
T0 calls MUST be non-blocking; any potentially blocking Evolvable
operation uses T1 ring transport instead. Phase B swaps vtable pointer
and releases waitqueue.
Tier 0 loadable → static Core:
Direct call — both in the same domain, Core exports functions through the
KernelServicesVTable (T0 transport, same direct vtable call mechanism).
No generation check needed: static Core is Nucleus, never evolved.
The Tier 0 loadable module is transparent to Tier 1 and Tier 2 callers. The domain dispatch inside Core routes the inbound ring buffer request to the appropriate service, whether that service is static or dynamically loaded.
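The generation check performed at every T0 call site can be modeled in a few lines. This is a sketch, not the kernel implementation — the evolution waitqueue is replaced by a re-cache-and-retry loop, and `T0CallSite` is a hypothetical name:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Global vtable generation; evolution Phase B increments it (per the text).
static T0_VTABLE_GENERATION: AtomicU64 = AtomicU64::new(1);

/// A caller's cached generation plus the fast-path check it performs on
/// every T0 call. In steady state this is one Acquire load plus one
/// predicted-taken branch before the direct vtable dispatch.
struct T0CallSite {
    cached_generation: u64,
}

impl T0CallSite {
    fn call<R>(&mut self, dispatch: impl Fn() -> R) -> R {
        loop {
            let gen = T0_VTABLE_GENERATION.load(Ordering::Acquire);
            if gen == self.cached_generation {
                return dispatch(); // fast path: call through the cached vtable
            }
            // Mismatch: an evolution happened (or is in progress). A real call
            // site blocks on the evolution waitqueue until Phase B completes;
            // here we simply adopt the new generation and retry.
            self.cached_generation = gen;
        }
    }
}
```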
Ring buffer drain during KABI service evolution: When a KABI service is evolved
via Section 13.18, all T1 ring buffers
connected to that service are drained during Phase A' (quiescence). The orchestration
layer stops accepting new ring submissions (producer side returns EAGAIN), then
waits for the consumer side to process all in-flight entries before proceeding to
the Phase B vtable swap. This ensures no ring entry is interpreted by the old
vtable's dispatch logic after the new vtable is installed.
12.6.4 IDL Toolchain Transport Parameter¶
umka-kabi-gen has two distinct jobs, both driven from the same .kabi source:
Job 1 — Caller-side bindings (how a module calls out to a service it depends on).
The transport is determined by the calling module's tier relative to the service's tier.
A Tier 0 module calling a Tier 0 service uses --transport=direct; a Tier 1 module
calling into Core uses --transport=ring:
# A Core-domain module calling the MDIO service (T0→T0: direct)
umka-kabi-gen --transport=direct --input mdio.kabi --output-rs mdio_caller_direct.rs
# A Tier 1 module calling the MDIO service (T1→T0: ring, tunnelled through Core)
umka-kabi-gen --transport=ring --input mdio.kabi --output-rs mdio_caller_ring.rs
Job 2 — Driver-side entry points (the three entry functions the kernel loader calls when loading the driver binary, one per tier). Always generate all three:
# Default: generate all three entry stubs into a directory
umka-kabi-gen --output-dir generated/ --input mdio.kabi
# Produces:
# generated/kabi_entry_direct.rs — T0 entry point
# generated/kabi_entry_ring.rs — T1 entry point
# generated/kabi_entry_ipc.rs — T2 entry point
# generated/kabi_types.h — shared C header
The generated entry files share the same interface types (argument structs, return types, error enums) — the IDL defines these. Only the call dispatch code differs.
12.6.5 KabiDriverManifest: Transport Capability Advertisement¶
Every driver binary embeds a KabiDriverManifest structure in the .kabi_manifest ELF
section. The kernel loader reads this section before resolving any other driver symbols.
It is the single source of truth for which transport entry points the binary implements,
which tier the driver prefers, and which tiers it will accept.
/// ELF-embedded driver transport manifest.
/// Placed in section `.kabi_manifest` by the linker script.
/// Generated by `umka-kabi-gen --output-dir`; linked automatically.
/// Driver authors do not write or modify this struct directly.
// kernel-internal, not KABI
#[repr(C)]
pub struct KabiDriverManifest {
/// Magic: 0x4B424944 ("KBID") — identifies a valid manifest.
pub magic: u32,
/// Manifest structure version (currently 1). Loader rejects unknown versions.
pub manifest_version: u32,
/// Transport implementations present in this binary (bitmask):
/// bit 0 = T0 Direct entry point present (entry_direct non-null)
/// bit 1 = T1 Ring Buffer entry point present (entry_ring non-null)
/// bit 2 = T2 Ring+Syscall entry point present (entry_ipc non-null)
/// Default (all drivers, no manifest constraints): 0b111.
///
/// **Security note**: The default of 0b111 is deliberately permissive to
/// maximise deployment flexibility — it is NOT a security control. The
/// actual tier assignment is governed by:
/// (a) signing certificate `max_tier` (hard ceiling),
/// (b) license compatibility (soft ceiling),
/// (c) operator policy in `/etc/umka/driver-policy.d/<name>.toml`, and
/// (d) hardware capability (Tier 1 unavailable on RISC-V/s390x/LoongArch64).
/// A driver with `transport_mask = 0b111` does NOT automatically get Tier 0
/// access. It means the binary *contains* all three entry points; which one
/// is invoked is decided by the loader pipeline (steps 2-6 in the loading
/// sequence above). Operators who want to restrict a specific driver to
/// Tier 2 only should set `maximum_tier = 2` in the driver's policy file.
pub transport_mask: u8,
/// Driver's preferred tier (0, 1, or 2).
/// The loader assigns this tier if hardware supports it and policy allows.
pub preferred_tier: u8,
/// Minimum tier this binary accepts (0 = any).
/// Loader returns ENOTSUP if assigned tier < minimum_tier.
pub minimum_tier: u8,
/// Maximum tier this binary accepts (2 = any).
/// Loader returns ENOTSUP if assigned tier > maximum_tier.
pub maximum_tier: u8,
/// Driver's fallback hint when `preferred_tier` is unavailable on this
/// platform (e.g., Tier 1 on RISC-V). The loader uses this to decide
/// whether to fall back toward Tier 0 (lower latency) or Tier 2 (stronger
/// isolation). Operator policy in `/etc/umka/driver-policy.d/<name>.toml`
/// overrides this hint. Default: `FallbackBias::Isolation` (prefer safety).
pub fallback_bias: FallbackBias,
pub _pad0: [u8; 3], // alignment padding, must be zero
/// Null-terminated UTF-8 driver name (max 63 bytes + null).
pub driver_name: [u8; 64],
/// Driver version (major << 16 | minor).
pub driver_version: u32,
/// SPDX license identifier index. Maps to the Approved Linking License
/// Registry (ALLR) defined in the OKLF ([Section 24.7](24-roadmap.md#licensing-model-open-kernel-license-framework-v13)).
///
/// **Trust model**: This field is a self-declaration by the driver author —
/// like Linux's `MODULE_LICENSE()`. It is NOT a security boundary. A vendor
/// could set `license_id = GPL-2.0-only` on a proprietary binary; the kernel
/// cannot verify the claim. The real technical enforcement is
/// `DriverCertEntry.max_tier` in the signing certificate, which is controlled
/// by the distro and cannot be faked by the vendor.
///
/// `license_id` serves three purposes:
/// 1. **Distro-built drivers**: the distro builds from source and knows the
/// actual license. `license_id` automates tier policy without manual
/// per-driver configuration.
/// 2. **Audit and attestation**: a vendor lying about their license is
/// discoverable (source code inspection, legal review) and legally
/// actionable under the OKLF. The IMA measurement log records the
/// declared `license_id` for remote attestation.
/// 3. **Kernel taint**: a mismatch between `license_id` and signing cert
/// source (e.g., vendor-signed cert with `license_id = GPL-2.0`) sets
/// `TAINT_PROPRIETARY_MODULE` and logs a warning.
///
/// The module loader uses this field for tier policy (for distro-built
/// drivers where the declaration is trustworthy):
/// - ALLR Tier 1/2 licenses (GPL-compatible or permissive): Tier 0/1/2 eligible
/// - ALLR Tier 3 licenses (CDDL, GPLv3-only, EUPL-1.2): max Tier 1
/// (KABI IPC provides the license boundary; no static linking into Core)
/// - Proprietary (0xF000-0xFFFF) or unspecified (0): max Tier 2
/// (Ring 3 only, no kernel address space access)
///
/// Well-known values:
/// 0x0000 = unspecified (treated as proprietary → Tier 2 only)
/// 0x0001 = GPL-2.0-only
/// 0x0002 = GPL-2.0-or-later
/// 0x0003 = MIT
/// 0x0004 = BSD-2-Clause
/// 0x0005 = BSD-3-Clause
/// 0x0006 = Apache-2.0
/// 0x0007 = MPL-2.0
/// 0x0008 = LGPL-2.1-only
/// 0x0009 = EPL-2.0-with-secondary (GPLv2 Secondary License designated)
/// 0x000A = CDDL-1.0 (→ max Tier 1, no Tier 0)
/// 0x000B = ISC
/// 0x000C = Zlib
/// 0x000D = EPL-2.0-no-secondary (→ Tier 2 only)
/// 0x000E = GPL-3.0-only (→ max Tier 1, no Tier 0)
/// 0x000F = LGPL-3.0-only (→ Tier 2 only)
/// 0x0010 = EUPL-1.2 (→ max Tier 1, no Tier 0; same as CDDL/GPLv3)
/// 0xF000-0xFFFF = proprietary (→ Tier 2 only)
///
/// The `umka-kabi-gen` compiler reads the `license` field from the `.kabi`
/// IDL file and encodes it into this field. Driver authors declare:
/// `license "GPL-2.0-only";` in their `.kabi` file.
pub license_id: u16,
/// Module Binary Store exclusion flag. When set, the module's raw ELF
/// binary is NOT retained in the kernel-resident Module Binary Store
/// (MBS) after loading. Crash recovery for excluded modules reloads
/// from the filesystem (best-effort — if the filesystem serving
/// `/lib/modules/` is available).
///
/// Default: `false` (module binary IS cached in MBS). Set `true` for
/// driver classes where a brief service interruption on crash is
/// acceptable and immediate reload is not system-critical:
/// audio/sound, video capture, display/GPU, bluetooth.
///
/// Drivers providing storage, filesystem, block, network, or crypto
/// services should NEVER set this flag — those drivers may be on the
/// root mount path, and the filesystem needed to reload them from disk
/// may itself be served by the crashed driver (circular dependency).
///
/// The `.kabi` IDL declaration: `mbs_exclude true;`
pub mbs_exclude: u8, // 0 = false, 1 = true
pub _reserved: [u8; 9], // must be zero
// Entry points. null = transport not implemented in this binary.
/// T0 Direct entry — called when driver runs as a Tier 0 loadable module.
pub entry_direct: Option<KabiT0EntryFn>,
/// T1 Ring Buffer entry — called when driver runs at Tier 1.
pub entry_ring: Option<KabiT1EntryFn>,
/// T2 IPC entry — called when driver runs at Tier 2 (Ring 3).
pub entry_ipc: Option<KabiT2EntryFn>,
}
/// 64-bit: 4+4+1+1+1+1+1+3+64+4+2+1+9+8+8+8 = 120 bytes
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<KabiDriverManifest>() == 120);
/// 32-bit: 4+4+1+1+1+1+1+3+64+4+2+1+9+4+4+4 = 108 bytes
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<KabiDriverManifest>() == 108);
/// Fallback direction when the driver's preferred tier is unavailable.
///
/// When `preferred_tier` cannot be assigned (e.g., Tier 1 on RISC-V where no
/// fast isolation mechanism exists), the loader uses this hint to choose between
/// moving the driver toward Tier 0 (performance) or Tier 2 (isolation).
///
/// The driver author sets this based on domain knowledge: a NIC handling untrusted
/// packets should prefer `Isolation` (Tier 2); a real-time audio driver should
/// prefer `Performance` (Tier 0). Operator policy overrides this hint.
#[repr(u8)]
pub enum FallbackBias {
/// Prefer stronger isolation: fall back toward Tier 2.
/// Default for most drivers — safety over speed.
Isolation = 0,
/// Prefer lower latency: fall back toward Tier 0.
/// Appropriate for latency-sensitive drivers (audio, GPU, NVMe).
Performance = 1,
}
/// T0 entry: receives direct-call KernelServicesVTable, returns a pointer to the
/// driver-class-specific vtable.
///
/// The concrete vtable type depends on the driver class. For example, a NIC driver
/// returns `*const NicDriverVTable` ([Section 12.1](#kabi-overview)), a block driver returns
/// `*const BlockDeviceVTable` ([Section 12.1](#kabi-overview)). The kernel validates the `vtable_size`
/// field to ensure ABI compatibility.
pub type KabiT0EntryFn = unsafe extern "C" fn(
ksvc: *const KernelServicesVTable,
) -> *const ();
/// T1 entry: receives ring-variant KernelServicesVTable and pre-allocated ring pair.
pub type KabiT1EntryFn = unsafe extern "C" fn(
ksvc: *const KernelServicesVTable,
inbound: *mut RingBuffer, // Core → Driver requests
outbound: *mut RingBuffer, // Driver → Core completions
) -> u32; // 0 = success, errno on failure
/// Tier 2 driver entry point function type.
///
/// Called by the kernel when a Tier 2 driver process is started (or restarted after crash).
/// The driver executes entirely within this function; return means the driver is shutting down.
///
/// # Parameters
///
/// ## `outbound_fd` — Command Ring (kernel → driver)
/// A kernel-created `umka_ring_fd` (ring buffer file descriptor) for the COMMAND ring.
/// The kernel writes commands and events; the driver reads them.
/// - Ring type: `UmkaRingFd` (shared memory ring, same design as [Section 11.7](11-drivers.md#zero-copy-io-path))
/// - Created by the kernel before calling this function, pre-populated with any pending
/// commands from the quiescence buffer (accumulated during crash recovery).
/// - **Read blocking**: `poll(outbound_fd, POLLIN)` blocks until a command is available.
/// - Capacity: `KABI_T2_CMD_RING_CAPACITY = 4096` entries.
///
/// ## `inbound_fd` — Completion Ring (driver → kernel)
/// A kernel-created `umka_ring_fd` for the COMPLETION ring.
/// The driver writes completions and events; the kernel reads them.
/// - **Write non-blocking**: O_NONBLOCK on write side; returns EAGAIN if full.
/// A full completion ring indicates the kernel is not draining it — the driver
/// should back off and retry after a short poll.
/// - Capacity: `KABI_T2_COMPLETION_RING_CAPACITY = 4096` entries.
///
/// # FD Ownership
/// Both FDs are **kernel-owned**. The driver MUST NOT close them. The kernel closes
/// them when the driver process is detached or terminated. Both FDs remain valid for
/// the entire lifetime of the entry function and become invalid immediately after return.
///
/// # Return Value
/// - `0`: Clean shutdown (graceful driver exit, no error).
/// - Non-zero: Error code; triggers the crash recovery protocol in umka-core.
/// The kernel will attempt to restart the driver up to `DRIVER_MAX_RESTART_ATTEMPTS` times.
pub type KabiT2EntryFn = unsafe extern "C" fn(
ksvc: *const KernelServicesVTable,
outbound_fd: i32, // Command ring: kernel → driver (POLLIN to receive commands)
inbound_fd: i32, // Completion ring: driver → kernel (O_NONBLOCK writes)
) -> u32; // 0 = clean shutdown; non-zero = error, triggers crash recovery
pub const KABI_T2_CMD_RING_CAPACITY: usize = 4096;
pub const KABI_T2_COMPLETION_RING_CAPACITY: usize = 4096;
/// Maximum number of automatic restart attempts for a crashing driver before
/// the kernel marks the driver slot as failed and engages the auto-demotion policy.
pub const DRIVER_MAX_RESTART_ATTEMPTS: u32 = 3;
Loader algorithm (in driver_load()):
1. Map driver ELF into memory (read-only staging area).
2. Locate `.kabi_manifest` section. If absent → ENOEXEC.
3. Validate manifest.magic == 0x4B424944. If wrong → ENOEXEC.
4. Check manifest.manifest_version ≤ supported. If newer → ENOTSUP.
5. Determine assigned tier T:
a. Start with T = manifest.preferred_tier.
b. If operator policy exists (/etc/umka/driver-policy.d/<name>.toml) → apply
its tier override (this takes absolute precedence; skip to step 5e).
c. Query hardware capability for tier T (see [Section 11.2](11-drivers.md#isolation-mechanisms-and-performance-modes)). If T is
available on this platform → keep T, skip to 5e.
d. T is unavailable (e.g., Tier 1 on RISC-V). Apply manifest.fallback_bias:
- FallbackBias::Performance → search downward: try T-1, T-2, ...
- FallbackBias::Isolation → search upward: try T+1, T+2, ...
First tier within [minimum_tier, maximum_tier] that the platform supports
and whose transport bit is set in transport_mask becomes T.
If no viable tier found → ENOTSUP.
e. Validate constraints:
If T < manifest.minimum_tier → ENOTSUP.
If T > manifest.maximum_tier → ENOTSUP.
6. Confirm transport bit for T is set in manifest.transport_mask:
T=0: bit 0 set AND entry_direct non-null → else ENOEXEC.
T=1: bit 1 set AND entry_ring non-null → else ENOEXEC.
T=2: bit 2 set AND entry_ipc non-null → else ENOEXEC.
7. Set up tier resources (MPK domain / Ring 3 process / none).
8. Call the entry point for tier T. On non-zero return → unmap, return errno.
9. Record (driver_name, driver_version, assigned_tier, fallback_bias,
transport_mask) in the driver registry. Expose via /ukfs/kernel/drivers/<name>/.
The umkafs record allows operators to inspect any loaded driver's tier, fallback bias,
and available transports: /ukfs/kernel/drivers/<name>/tier,
/ukfs/kernel/drivers/<name>/fallback_bias,
/ukfs/kernel/drivers/<name>/transport_mask.
12.6.6 Default Policy: All Drivers Ship All Three Transports¶
Every driver binary must include all three transport entry points by default.
The umka-kabi-gen --output-dir invocation (the build system default) generates all
three receiver stubs. The driver's build.rs includes them automatically:
// In driver/build.rs — emitted by the umka-driver-sdk build helper:
println!("cargo:rerun-if-changed=my_driver.kabi");
umka_kabi_gen::build("my_driver.kabi"); // generates all three into OUT_DIR
// In driver/src/lib.rs:
include!(concat!(env!("OUT_DIR"), "/kabi_entry_direct.rs"));
include!(concat!(env!("OUT_DIR"), "/kabi_entry_ring.rs"));
include!(concat!(env!("OUT_DIR"), "/kabi_entry_ipc.rs"));
All three entry points are linked. The manifest's transport_mask = 0b111. The binary
is tier-agnostic at compile time.
Consequence: tier change requires no recompilation. The kernel loader reads
transport_mask, confirms the desired tier's bit is set, and calls the matching entry
point. The driver binary is unchanged. Moving a driver from Tier 1 to Tier 0 (because
the hardware has no fast isolation mechanism, e.g. RISC-V) is an operator action:
update /etc/umka/driver-policy.d/<name>.toml, reload. No kernel rebuild, no driver
rebuild.
Opting out of a transport — the only legitimate exception — requires an explicit
manifest constraint in the .kabi module declaration:
// In my_rt_audio_driver.kabi:
module my_rt_audio_driver {
provides alsa_driver >= 1.0;
requires alsa_core >= 1.0;
minimum_tier: 0; // real-time audio: cannot tolerate ring buffer latency
maximum_tier: 0; // must run at Tier 0 (direct call)
fallback_bias: performance; // (redundant here since min==max, shown for completeness)
}
umka-kabi-gen omits the T1 and T2 stubs, sets transport_mask = 0b001, and the
loader enforces the constraint. Drivers without explicit declarations default to
minimum_tier: 0, maximum_tier: 2, transport_mask: 0b111,
fallback_bias: FallbackBias::Isolation.
Why this is production-correct: a driver that only ships one transport is a redeployment risk. If hardware changes (e.g., a Tier 1 driver on a RISC-V system where no fast isolation exists, requiring Tier 0 or Tier 2 placement), the system cannot adapt without rebuilding the driver binary. Shipping all three transports eliminates this at the cost of ~2–5 KB of additional binary size per transport stub — negligible for any real driver.
12.7 KABI Service Dependency Resolution¶
12.7.1 The Problem¶
A Tier 1 NIC driver loads. Its probe function calls request_service::<MdioService>().
The MDIO bus framework (a Tier 0 loadable module) is not yet loaded. Without a
resolution mechanism, the driver fails to probe, the device never initialises, and the
administrator has no clear explanation.
This scenario is not exceptional — it is the normal operating condition for Tier 0 loadable framework modules. The SCSI mid-layer must load before any HBA driver. The cfg80211 framework must load before any WiFi driver. The SoundWire bus core must load before any SoundWire audio codec driver. Dependency resolution is a first-class requirement, not an edge case.
12.7.2 IDL requires and provides Declarations¶
Every module's .kabi file declares the KABI services it provides and the services it
requires. These declarations are checked by kabi-gen and embedded in the compiled
module's metadata section.
// mdio.kabi — the MDIO bus framework (Tier 0 loadable)
@version(1)
module mdio_framework {
provides mdio_service >= 1.0;
requires pci_bus >= 2.0; // always in static Core; always satisfied
load_once: true; // Tier 0 module: never unloaded once loaded
load_phase: boot; // load before device enumeration begins
}
// ixgbe.kabi — an Intel 10G NIC driver (Tier 1)
@version(1)
module ixgbe_driver {
provides ethernet_driver >= 4.2;
requires mdio_service >= 1.0; // provided by mdio_framework
requires pci_bus >= 3.0; // always in static Core
load_once: false;
load_phase: on_demand;
fallback_bias: isolation; // handles untrusted packets → prefer Tier 2
}
// mpt3sas.kabi — LSI SAS HBA driver (Tier 1)
@version(1)
module mpt3sas_driver {
provides scsi_host >= 1.0;
requires scsi_midlayer >= 1.0; // provided by scsi_framework (Tier 0 loadable)
requires pci_bus >= 3.0;
load_once: false;
load_phase: on_demand;
fallback_bias: performance; // SAS HBA is latency-sensitive storage → prefer Tier 0
}
The requires entries are minimum version constraints: >= 1.0 means any provider
with version ≥ 1.0 satisfies the dependency. The provides entry declares what version
of the service this module exports.
12.7.3 KabiProviderIndex — Boot-Time Service Map¶
The provider index and all registry types use KabiVersion, a three-field version
triple defined here. Every KABI service carries a KabiVersion that is compared at
bind time to enforce the compatibility rules from
Section 12.2.
/// KABI version triple carried in every driver vtable and compared at registration.
///
/// Compatibility rule: a driver compiled against KABI (`major`, `minor`, `patch`) is
/// accepted by a kernel with KABI (`kmajor`, `kminor`, `kpatch`) if and only if:
/// - `kmajor == major` (major version must match exactly — breaking changes)
/// - `kminor >= minor` (kernel minor must be >= driver minor — additive extensions)
/// - `kpatch` is ignored for compatibility (patch = bug-fix only, no ABI change)
///
/// Vtable size (`vtable_size` field) is checked independently of version.
#[repr(C)]
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd)]
pub struct KabiVersion {
/// Breaking change counter. Incompatible across major versions.
pub major: u16,
/// Additive extension counter. Backwards-compatible within same major.
pub minor: u16,
/// Bug-fix counter. No ABI impact.
pub patch: u16,
/// Reserved; must be zero.
pub _pad: u16,
}
// KabiVersion: major(u16=2) + minor(u16=2) + patch(u16=2) + _pad(u16=2) = 8 bytes.
const_assert!(core::mem::size_of::<KabiVersion>() == 8);
impl KabiVersion {
pub const fn new(major: u16, minor: u16, patch: u16) -> Self {
Self { major, minor, patch, _pad: 0 }
}
/// Returns true if a driver built against `self` (the driver's required version) is
/// compatible with `kernel` (the running kernel's version).
///
/// Compatibility rules:
/// - Same major version required (major version bumps are breaking changes).
/// - Driver minor ≤ kernel minor: the kernel must expose at least the vtable fields
/// the driver was compiled against. A driver requiring minor=3 cannot load on a
/// kernel that only provides minor=2.
///
/// The condition `kernel.minor >= self.minor` is an **asymmetric** check:
/// `self` = driver's required version, `kernel` = running kernel version.
/// These are two different `KabiVersion` values — NOT a self-comparison.
pub const fn is_compatible_with(&self, kernel: KabiVersion) -> bool {
// self.major == kernel.major : breaking change guard
// kernel.minor >= self.minor : kernel must be >= what driver requires
self.major == kernel.major && kernel.minor >= self.minor
}
/// Pack version into a `u64` for atomic storage in vtable headers.
///
/// Layout (most-significant to least-significant byte group):
/// `[major:16][minor:16][patch:16][_pad:16]` in native byte order.
/// This layout ensures that `v1.as_u64() < v2.as_u64()` iff `v1` is an
/// older version than `v2` (within the same major). Comparison across
/// major versions also works correctly (major is in MSB), but the
/// primary use case is version ordering within a compatible major range.
/// Enables lock-free version checks via `AtomicU64::compare_exchange`.
///
/// Note: `_pad` (bits 0-15) is omitted from the packed representation
/// because it must be zero. Repurposing `_pad` in a future KABI version
/// requires a KABI major version bump OR a migration strategy where
/// zero = "field not present / old format" (standard reserved field
/// pattern). All existing driver binaries have bits 0-15 = 0.
pub const fn as_u64(self) -> u64 {
// Assert _pad == 0 unconditionally (not debug-only) to match the
// from_u64() rejection of non-zero bits 0-15. This prevents a
// KabiVersion with corrupted _pad from round-tripping through
// as_u64()/from_u64() in release builds. The only constructors
// (new() and from_u64()) guarantee _pad == 0, so this assert
// fires only on memory corruption or unsafe mutation.
assert!(self._pad == 0, "KabiVersion._pad must be zero");
((self.major as u64) << 48)
| ((self.minor as u64) << 32)
| ((self.patch as u64) << 16)
}
/// Unpack a `u64` vtable header word back into a `KabiVersion`.
///
/// Returns `Err(KernelError::EINVAL)` if bits 0-15 are non-zero.
/// A buggy driver writing garbage into the version field's reserved
/// bits is caught at load time rather than silently discarded.
pub const fn from_u64(v: u64) -> Result<Self, KernelError> {
if v & 0xFFFF != 0 {
return Err(KernelError::EINVAL); // Non-zero reserved bits 0-15
}
Ok(Self {
major: ((v >> 48) & 0xffff) as u16,
minor: ((v >> 32) & 0xffff) as u16,
patch: ((v >> 16) & 0xffff) as u16,
_pad: 0,
})
}
}
/// Current kernel KABI version. Drivers must be built against a compatible version.
///
/// **Clarification**: `KABI_CURRENT` = 1.0.0 is the baseline KABI version at
/// initial release. Individual vtable interfaces may define methods up to
/// `@version(N)` where N > 1 — these are minor-version extensions within the
/// same major KABI version. A kernel at KABI 1.6.0 would accept drivers built
/// against 1.0.0 through 1.6.0. The vtable's `kabi_version` field carries the
/// specific version of that interface, not `KABI_CURRENT`.
pub const KABI_CURRENT: KabiVersion = KabiVersion::new(1, 0, 0);
/// Entry in the KABI provider index. Populated at boot by scanning module headers
/// in the verified module store. Read-only after boot.
#[derive(Debug)]
pub struct KabiProviderEntry {
/// Stable identifier for the service (e.g., `b"mdio_service\0\0..."`).
pub service_id: ServiceId,
/// Minimum version of the service that this module provides.
pub min_version: KabiVersion,
/// Maximum version this module's implementation is compatible with.
pub max_version: KabiVersion,
/// Path to the module in the verified module store.
/// Read-only static string; never heap-allocated after boot.
pub module_path: &'static str,
/// When this module must be loaded relative to the boot sequence.
pub load_phase: LoadPhase,
/// Priority for multi-provider resolution. When multiple providers
/// serve the same ServiceId, the registry selects the highest priority.
/// Default: 50 for KABI drivers. Tier M PeerServiceProxy entries use
/// priority 100 (higher, preferred when available). When a Tier M peer
/// disconnects, its proxy is deregistered and the registry auto-resolves
/// to the next-best provider (the KABI driver at priority 50).
/// See [Section 5.11](05-distributed.md#smartnic-and-dpu-integration--service-registry-integration-peerserviceproxy).
pub priority: u32,
}
/// Service identifier — a 64-byte fixed-width name plus major version namespace.
/// Two services with the same name but different major versions are considered
/// distinct services (incompatible API change).
#[repr(C)]
pub struct ServiceId {
/// ASCII service name, NUL-padded. Maximum 59 characters + at least one NUL
/// terminator. Names longer than 59 characters are rejected at registration
/// time. The entire 60-byte field is zero-initialized; unused bytes are NUL.
pub name: [u8; 60],
/// Major version namespace — part of the identity, not just metadata.
pub major: u32,
}
// ServiceId: name([u8;60]=60) + major(u32=4) = 64 bytes.
const_assert!(core::mem::size_of::<ServiceId>() == 64);
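Populating the 60-byte NUL-padded `name` field by hand is error-prone. A small `const fn` helper (illustrative only, not part of the KABI spec) pads the name at compile time and rejects over-long names at const-evaluation:

```rust
// Illustrative helper (not in the KABI spec): NUL-pad an ASCII name into the
// fixed 60-byte ServiceId name field. The assert fires at const-eval time if
// the name exceeds the 59-character limit described above.
pub const fn service_name(name: &str) -> [u8; 60] {
    let bytes = name.as_bytes();
    assert!(bytes.len() <= 59, "service name too long (max 59 chars)");
    let mut out = [0u8; 60];
    let mut i = 0;
    while i < bytes.len() {
        out[i] = bytes[i];
        i += 1;
    }
    out
}
```

With such a helper, `ServiceId { name: service_name("mdio_service"), major: 1 }` replaces the hand-written NUL padding shown in later examples.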
The `KabiDriverManifest` (embedded in the signed ELF `.kabi_manifest` section) declares
the list of `ServiceId` values the module is authorized to provide. At registration time,
the kernel validates that the claimed `ServiceId` appears in the registering module's
manifest. A module cannot register services not declared in its signed manifest —
attempting to do so returns `KabiError::UnauthorizedService`.
/// Load phase controls when demand-loading is triggered.
#[repr(u32)]
pub enum LoadPhase {
/// Load before device enumeration. Required by bus frameworks (MDIO, SPI,
/// SCSI mid-layer) that must exist before any device driver can bind.
/// Handled by the kernel-internal Tier 0 module loader (no userspace needed).
Boot = 0,
/// Load when first requested by a driver probe. Handled by either the
/// kernel-internal loader (if module is in initramfs) or the userspace
/// `umka-modload` daemon (for post-boot installations).
OnDemand = 1,
}
/// The index is built once at boot and never mutated.
/// Stored in read-only kernel memory after construction.
pub struct KabiProviderIndex {
/// Sorted by service_id for O(log n) lookup.
entries: &'static [KabiProviderEntry],
}
impl KabiProviderIndex {
/// Find the provider for `service_id` that is compatible with `min_version`.
///
/// A provider is compatible when the requested version falls within its
/// supported range: `e.min_version <= min_version <= e.max_version`.
///
/// When multiple providers match (same ServiceId, overlapping version ranges),
/// returns the one with the highest `priority`. This enables automatic fallback:
/// a Tier M peer's `PeerServiceProxy` (priority 100) is preferred over a host
/// KABI driver (priority 50). When the peer disconnects and its entry is removed,
/// the next call returns the KABI driver.
///
/// Returns `None` if no registered provider covers the requested version.
pub fn find(&self, service_id: &ServiceId, min_version: KabiVersion)
-> Option<&KabiProviderEntry>
{
self.entries.iter()
.filter(|e| e.service_id == *service_id
&& e.min_version <= min_version
&& e.max_version >= min_version)
.max_by_key(|e| e.priority)
}
}
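The priority-based fallback is easiest to see with concrete entries. Below is a simplified, self-contained model of `find` — `u32` versions and a string id stand in for `KabiVersion`/`ServiceId`, but the selection logic (filter by version range, pick the highest priority) is the same:

```rust
// Simplified model of KabiProviderIndex::find. Plain u32 versions and a
// &'static str service id stand in for the real KabiVersion/ServiceId types.
struct Entry {
    service_id: &'static str,
    min_version: u32,
    max_version: u32,
    priority: u32,
}

// Filter entries whose version range covers the request, then take the
// highest-priority match — this is what enables Tier M proxy preference
// with automatic fallback to the KABI driver.
fn find<'a>(entries: &'a [Entry], service_id: &str, min_version: u32) -> Option<&'a Entry> {
    entries
        .iter()
        .filter(|e| e.service_id == service_id
            && e.min_version <= min_version
            && e.max_version >= min_version)
        .max_by_key(|e| e.priority)
}
```

With a KABI driver (priority 50, versions 1..=3) and a Tier M proxy (priority 100, versions 1..=2) registered for the same service, a request for version 2 resolves to the proxy; a request for version 3 falls back to the KABI driver because only its range covers it.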
The KabiProviderIndex is populated during early boot by scanning the initramfs module
store. All entries are verified against the kernel's ML-DSA signing key before being
accepted (Section 9.3). The index is sealed
read-only before any driver loads. A Tier 1 driver cannot add entries to the index —
it can only request services whose entries already exist.
12.7.4 KabiServiceRegistry — Runtime Service Map¶
/// C-ABI stable handle to a KABI service provider.
/// Passed across isolation domain boundaries (Ring 0 driver ↔ UmkaOS Core).
///
/// **Type erasure**: `vtable` and `ctx` are `*const ()` because
/// `KabiServiceHandle` is stored in a heterogeneous registry
/// (`ArrayVec<(ServiceId, KabiServiceHandle), MAX_KABI_SERVICES>`).
/// Call sites should use `TypedServiceHandle<V>` (below) for compile-time
/// type safety when dispatching through the vtable.
///
/// **Liveness guarantee**: the module providing this service cannot be unloaded
/// while any live KABI service capability references it. This struct does NOT
/// hold an Arc or Rc — liveness is a capability-level invariant, not a per-handle
/// reference count. Users must not retain a raw `KabiServiceHandle` beyond their
/// capability's lifetime.
///
/// **Generation**: `generation` is incremented on driver hot-reload. Callers
/// detect stale handles by comparing against the registry's current generation.
#[repr(C)]
pub struct KabiServiceHandle {
/// Opaque vtable pointer. Callee casts to the service-specific vtable type.
/// Use `TypedServiceHandle<V>::from_raw()` at call sites for type safety.
pub vtable: *const (),
/// Opaque context pointer; passed as first argument to all vtable methods.
pub ctx: *const (),
/// Generation counter; incremented each time this service provider is reloaded.
/// A stale handle has `generation < registry.current_generation(service_id)`.
pub generation: u64,
/// KABI version reported by the service provider's vtable `kabi_version` field at
/// registration time. This is the primary version discriminant — callers use it to
/// gate access to methods added in later versions and to detect deprecation cycles.
pub version: KabiVersion,
}
// KabiServiceHandle: vtable(ptr) + ctx(ptr) + generation(u64=8) + version(KabiVersion=8).
#[cfg(target_pointer_width = "64")]
const_assert!(core::mem::size_of::<KabiServiceHandle>() == 32);
#[cfg(target_pointer_width = "32")]
const_assert!(core::mem::size_of::<KabiServiceHandle>() == 24);
/// Compile-time-safe wrapper around `KabiServiceHandle` that prevents
/// cross-service handle confusion. `V` is a marker trait implemented by
/// each service vtable type (e.g., `BlockDeviceVTable`, `NicDriverVTable`).
///
/// The registry stores untyped `KabiServiceHandle` (necessary for the
/// heterogeneous service table). Call sites construct `TypedServiceHandle<V>`
/// from a raw handle to gain compile-time type safety. Zero runtime cost
/// (PhantomData is zero-sized).
pub struct TypedServiceHandle<V: ServiceVtable> {
inner: KabiServiceHandle,
_marker: core::marker::PhantomData<V>,
}
impl<V: ServiceVtable> TypedServiceHandle<V> {
/// Construct a typed handle from a raw `KabiServiceHandle`.
///
/// # Safety
/// The caller must ensure the handle's `vtable` pointer actually points
/// to a `V`-typed vtable. This is guaranteed when the handle was obtained
/// from a registry lookup keyed by the correct `ServiceId` for `V`.
pub unsafe fn from_raw(handle: KabiServiceHandle) -> Self {
Self { inner: handle, _marker: core::marker::PhantomData }
}
/// Get a typed reference to the vtable.
pub fn vtable(&self) -> *const V {
self.inner.vtable as *const V
}
/// Get the context pointer.
pub fn ctx(&self) -> *const () {
self.inner.ctx
}
}
/// Marker trait for KABI service vtable types. Implemented by each
/// generated vtable struct (e.g., `BlockDeviceVTable`, `NicDriverVTable`).
pub trait ServiceVtable {}
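To make the vtable-plus-context calling convention concrete, here is a toy dispatch example. `EchoVTable` and `echo_impl` are invented for illustration — real vtables are generated by `kabi-gen` — but the shape (opaque `ctx` passed as first argument to every `extern "C"` method) matches the convention above:

```rust
// Illustrative only: a toy vtable showing how a call site dispatches through
// a (vtable, ctx) pair. EchoVTable/echo_impl are hypothetical names.
#[repr(C)]
struct EchoVTable {
    // The opaque context pointer is always the first argument.
    echo: unsafe extern "C" fn(ctx: *const (), x: u32) -> u32,
}

unsafe extern "C" fn echo_impl(ctx: *const (), x: u32) -> u32 {
    // SAFETY: ctx was created from a &u32 by the caller and outlives the call.
    let offset = unsafe { *(ctx as *const u32) };
    x + offset
}

fn call_echo(vtable: *const EchoVTable, ctx: *const (), x: u32) -> u32 {
    // SAFETY: vtable points to a valid EchoVTable and ctx matches its impl —
    // exactly the invariant TypedServiceHandle::from_raw documents.
    unsafe { ((*vtable).echo)(ctx, x) }
}
```

In the real system the `(vtable, ctx)` pair comes out of a `KabiServiceHandle`, and `TypedServiceHandle<V>` pins the vtable type at compile time so a `BlockDevice` handle cannot be dispatched as a NIC.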
// Type hierarchy for service handles:
// ServiceHandle (u64, C-ABI token) — crosses KABI boundary, passed by value
// ↓ registry lookup by id
// InternalServiceRef (kernel-private) — kernel's bookkeeping with raw ptr + generation
// ↓ vtable resolution
// KabiServiceHandle (*const (), *const ()) — passed to vtable call site
//
// This separation keeps C-ABI types at boundaries and Rust types in kernel internals.
/// RCU-protected immutable snapshot. Wraps a value published via RCU:
/// - `load()` takes an RCU read guard and returns a reference — lock-free.
/// - `swap(new)` takes a boxed replacement value, atomically swaps the pointer,
///   and returns the old pointer; the caller schedules it for deferred drop
///   after one RCU grace period (e.g. via `rcu_call_box_drop`).
/// Analogous to `RcuCell<T>` ([Section 3.4](03-concurrency.md#cumulative-performance-budget)) but enforces
/// that the contained value is always initialized (no `Option` wrapping).
pub struct RcuSnapshot<T> {
/// Atomic pointer to the current snapshot (Box-allocated).
/// Reads acquire an RCU read guard and load the pointer.
/// Writes swap the pointer under a writer mutex and schedule
/// deferred drop of the old allocation via `rcu_call()`.
ptr: AtomicPtr<T>,
}
impl<T> RcuSnapshot<T> {
/// Create a new snapshot with an initial value (heap-allocated).
pub fn new(initial: T) -> Self {
Self { ptr: AtomicPtr::new(Box::into_raw(Box::new(initial))) }
}
/// Load current snapshot under the caller's RCU read guard.
/// Returns a shared reference valid for the guard's lifetime.
/// The guard prevents the referent from being freed during access.
pub fn load<'g>(&self, _guard: &'g RcuReadGuard) -> &'g T {
// SAFETY: pointer is always valid while an RCU read guard is held;
// the writer defers deallocation until after all readers have passed
// through a grace period.
unsafe { &*self.ptr.load(Ordering::Acquire) }
}
/// Atomically replace the snapshot, returning the old pointer. The caller
/// must schedule the returned pointer for deferred free after one RCU grace
/// period (all current readers have released their guards), e.g. via
/// `rcu_call_box_drop`, and must hold the external writer mutex.
pub fn swap(&self, new: Box<T>) -> *mut T {
self.ptr.swap(Box::into_raw(new), Ordering::Release)
}
}
/// Named workqueue for deferred driver probe retries. Created at KABI
/// subsystem init (canonical Phase 3.7). Uses `SCHED_OTHER` because probe retries
/// are not latency-sensitive — they run after a service registration event.
/// Depth 256 matches `MAX_PROBE_WAITERS_PER_SERVICE`.
static KABI_PROBE_WQ: OnceCell<Arc<WorkQueue>> = OnceCell::new();
/// Callback for deferred driver probe retry. Receives a raw pointer to an
/// `Arc<DeviceNode>` (created via `Arc::into_raw`). Reconstructs the Arc
/// and re-runs the device probe sequence.
///
/// # Safety
/// `data` must be a pointer produced by `Arc::into_raw(Arc<DeviceNode>)`.
unsafe fn kabi_probe_retry_fn(data: *mut ()) {
// SAFETY: caller guarantees `data` was produced by `Arc::into_raw`.
let device = unsafe { Arc::from_raw(data as *const DeviceNode) };
kabi_driver_probe(&device);
}
/// Waker token used to re-trigger a driver probe when a missing service is registered.
/// Stored in `KabiServiceRegistry::waiters` keyed by `ServiceId`.
/// When the service registers, the kernel dequeues all `ProbeWaker` tokens and
/// re-submits the waiting drivers to the device probe work queue.
pub struct ProbeWaker {
/// Reference to the device that needs re-probing.
/// (`DeviceNode` is defined in [Section 11.4](11-drivers.md#device-registry-and-bus-management).)
pub device: Arc<DeviceNode>,
}
impl ProbeWaker {
pub fn new(device: Arc<DeviceNode>) -> Self { Self { device } }
/// Re-queue the device for deferred probe. Submits a work item to the
/// `KABI_PROBE_WQ` workqueue that will retry init after the service
/// becomes available. Uses the `WorkQueue::queue_work` API
/// ([Section 3.11](03-concurrency.md#workqueue-deferred-work)).
pub fn schedule_retry(&self) {
let wq = KABI_PROBE_WQ.get().expect("KABI probe WQ not initialized");
let data = Arc::into_raw(self.device.clone()) as *mut ();
if let Err(_) = wq.queue_work(WorkItem {
f: kabi_probe_retry_fn,
data,
deadline_ns: None,
}) {
// Queue full — reconstruct the Arc to avoid a leak, then log.
// SAFETY: `data` was just created by `Arc::into_raw` above.
unsafe { Arc::from_raw(data as *const DeviceNode); }
log::warn!("kabi_probe WQ full; probe retry for {:?} dropped", self.device);
}
}
}
/// Low-level I/O error from device operations. Used by `ProbeError::Io` to
/// distinguish hardware errors (bus errors, timeouts, DMA faults) from
/// higher-level KABI errors.
pub enum IoError {
Timeout,
BusError { addr: u64 },
DmaFault { iova: u64 },
HardwareReset,
Other { code: i32 },
}
/// Maximum probe waiters per service. When this limit is reached,
/// `request_service` returns `ProbeError::TooManyWaiters` (EAGAIN at the
/// syscall layer), causing the deferred-probe framework to back off.
pub const MAX_PROBE_WAITERS_PER_SERVICE: usize = 256;
/// The runtime service registry. Lives in Core, using RCU for read-mostly access.
///
/// The service table is an immutable sorted `Vec<(ServiceId, KabiServiceHandle)>`
/// published under RCU. Lookups acquire an RCU read guard and binary-search the
/// sorted array with no lock contention. Registration is rare and uses a single-
/// writer mutex to clone, update, sort, and publish a new snapshot.
pub struct KabiServiceRegistry {
/// RCU-protected immutable snapshot of the service table.
/// Sorted by `ServiceId` for O(log n) binary search.
services: RcuSnapshot<ServiceTable>,
/// Single-writer mutex for service registration and unregistration.
/// Held only during the clone-update-publish sequence; never held during lookups.
registry_write_mutex: Mutex<()>,
/// Waiters blocked on a not-yet-registered service.
/// Key: service_id. Value: list of waker tokens for deferred probe retry.
/// Bounded to MAX_PROBE_WAITERS_PER_SERVICE per service; excess returns EAGAIN.
/// **Single push site invariant**: `request_service()` is the ONLY function
/// that pushes to these Vec instances (verified by grep). The bound check
/// at the push site ensures `vec.len() < MAX_PROBE_WAITERS_PER_SERVICE`
/// before every push. Any new push site MUST enforce the same bound.
waiters: Mutex<BTreeMap<ServiceId, Vec<ProbeWaker>>>, // len() <= MAX_PROBE_WAITERS_PER_SERVICE
}
/// Maximum number of simultaneously registered KABI services.
/// Covers all built-in + loadable modules on the largest configurations.
/// Bounded at compile time so that clone-on-write during registration
/// copies a fixed-size inline buffer — no heap allocation per registration.
pub const MAX_KABI_SERVICES: usize = 128;
/// Immutable sorted service table published under RCU.
/// Binary search gives O(log n) lookup without any locking.
/// Uses `ArrayVec` so that the clone in the register path copies the
/// inline buffer (stack allocation) rather than heap-allocating a new `Vec`.
pub struct ServiceTable {
pub entries: ArrayVec<(ServiceId, KabiServiceHandle), MAX_KABI_SERVICES>,
}
impl KabiServiceRegistry {
/// Look up a registered service. Returns None if not yet registered.
/// Does NOT trigger loading — call `request_service` for that.
///
/// Lookup: acquire `rcu_read_lock()`, load the snapshot pointer, binary-search
/// the sorted array (O(log n)), release guard. Zero lock contention; zero
/// cache-line bouncing on multi-core systems.
pub fn get(&self, id: &ServiceId) -> Option<KabiServiceHandle> {
let guard = rcu_read_lock();
let table = self.services.load(&guard);
let result = table.entries
.binary_search_by_key(&id, |(k, _)| k)
.ok()
.map(|idx| table.entries[idx].1.clone());
drop(guard);
result
}
/// Register a service. Called by a Tier 0 module during its init function.
/// Notifies all waiters blocked on this service_id.
///
/// Registration: acquire `registry_write_mutex` (single-writer), clone the
/// existing `ServiceTable` (ArrayVec inline copy — no heap allocation),
/// append the new entry, sort, publish via `rcu_assign_pointer()`, release
/// mutex, defer-free the old table via `rcu_call()`.
///
/// **Latency**: Clone is O(n) memmove of the ArrayVec inline buffer
/// (128 entries × 96 bytes — a 64-byte ServiceId plus a 32-byte handle —
/// ≈ 12 KB). Sort is O(n log n). At n=128, total ≈ 50-100 μs — acceptable
/// on a cold path (driver registration).
///
/// Returns `KernelError::ResourceExhausted` if `MAX_KABI_SERVICES` is reached.
pub fn register(&self, id: ServiceId, handle: KabiServiceHandle)
-> Result<(), KernelError>
{
let _write_guard = self.registry_write_mutex.lock();
let guard = rcu_read_lock();
let old_table = self.services.load(&guard);
let mut new_entries = old_table.entries.clone();
drop(guard);
if new_entries.is_full() {
return Err(KernelError::ResourceExhausted);
}
new_entries.push((id.clone(), handle));
new_entries.sort_by(|(a, _), (b, _)| a.cmp(b));
let new_table = Box::new(ServiceTable { entries: new_entries });
let old_ptr = self.services.swap(new_table);
drop(_write_guard);
// Defer-free the old table after all RCU readers have passed through.
// SAFETY: old_ptr was allocated via Box::new; after all RCU readers
// have passed through, no references remain and deallocation is safe.
unsafe { rcu_call_box_drop(old_ptr) };
// Wake all deferred probes waiting on this service.
let mut waiters = self.waiters.lock();
if let Some(wakers) = waiters.remove(&id) {
for waker in wakers {
waker.schedule_retry();
}
}
Ok(())
}
/// Unregister a service (called on module unload — Tier 1 only;
/// load_once Tier 0 modules never call this).
pub fn unregister(&self, id: &ServiceId) {
let _write_guard = self.registry_write_mutex.lock();
let guard = rcu_read_lock();
let old_table = self.services.load(&guard);
let mut new_entries = old_table.entries.clone();
drop(guard);
new_entries.retain(|(k, _)| k != id);
let new_table = Box::new(ServiceTable { entries: new_entries });
let old_ptr = self.services.swap(new_table);
drop(_write_guard);
// SAFETY: old_ptr was allocated via Box::new; after all RCU readers
// have passed through, no references remain and deallocation is safe.
unsafe { rcu_call_box_drop(old_ptr) };
}
}
The service registry uses an RCU-protected immutable snapshot table rather than a
mutable map under a lock. The internal type is `RcuSnapshot<ServiceTable>`, where
`ServiceTable` is an immutable sorted `ArrayVec<(ServiceId, KabiServiceHandle)>`.
`KabiServiceHandle` (defined above) is the C-ABI stable variant with generation
counter; the plain u64 `ServiceHandle` token is the lower-level cross-domain form
that crosses the KABI boundary by value.
- Lookups (`request_service()`): acquire `rcu_read_lock()`, load the snapshot pointer, binary-search the sorted array (O(log n)), release guard. No lock contention, no cache-line bouncing on multi-core systems. Zero overhead when the registry is stable.
- Registration (rare): acquire `registry_write_mutex` (single-writer), clone the existing `ServiceTable`, append the new entry, sort, publish via `rcu_assign_pointer()`, release mutex, defer-free the old table via `rcu_call()`.
This follows the standard UmkaOS pattern for read-mostly shared state: RCU for readers, single-writer mutex for updates.
12.7.5 Requesting a Service: Probe Deferral¶
A driver's probe function requests services via request_service. If the service is
not yet registered, the probe is deferred rather than blocking or failing.
KabiService trait — how kernel subsystems declare services that drivers can request:
/// Marker trait for KABI service declarations.
///
/// Implementing this trait is how a kernel subsystem (or Tier 0 loadable module)
/// declares a named, versioned service that drivers can request via `request_service`.
/// `SERVICE_ID` is a compile-time constant generated by `kabi-gen` from the `.kabi`
/// IDL file (it encodes the interface name and major version as a fixed-size byte
/// array + u32). `MIN_VERSION` is the oldest wire-compatible version of this service
/// that this implementation supports. `Vtable` is the `#[repr(C)]` vtable struct
/// generated from the IDL.
///
/// Example (generated by `kabi-gen` from `mdio.kabi`):
///
/// ```rust
/// pub struct MdioService;
/// impl KabiService for MdioService {
/// // name: 12 chars + 48 NUL bytes = 60 bytes total (max 59 chars + NUL)
/// const SERVICE_ID: ServiceId = ServiceId { name: *b"mdio_service\0\0\0\0...", major: 1 };
/// const MIN_VERSION: KabiVersion = KabiVersion::new(1, 0, 0);
/// type Vtable = MdioServiceVTable;
/// }
/// ```
pub trait KabiService: 'static {
/// Stable identifier for this service (name + major version).
const SERVICE_ID: ServiceId;
/// Minimum version this implementation is wire-compatible with.
const MIN_VERSION: KabiVersion;
/// The `#[repr(C)]` vtable type for this service.
type Vtable: 'static;
}
/// Request a KABI service with the given minimum version.
///
/// Returns:
/// - `Ok(handle)` — service is registered and version-compatible.
/// - `Err(ProbeError::Deferred)` — service not yet available; probe will be
/// retried automatically when the service is registered. The driver must
/// return `Err(ProbeError::Deferred)` from its probe function immediately
/// after receiving this — no partial initialisation.
/// - `Err(ProbeError::ServiceUnavailable)` — service is not in the
/// KabiProviderIndex at all; it will never become available. The driver
/// should fail permanently and log the missing dependency.
///
/// **Security note**: `request_service` does not perform per-call capability checks.
/// Service requests are gated by the signed provider index ([Section 12.7](#kabi-service-dependency-resolution--security-model)): a
/// driver can only trigger loading of modules already present in the ML-DSA-verified
/// provider index. Unsigned or unknown modules cannot be requested.
///
/// **Calling context**: `request_service()` acquires a spinlock and may block on
/// module loading. It MUST NOT be called from IRQ context or with interrupts disabled.
/// Callers must be in process context.
pub fn request_service<S: KabiService>(
registry: &KabiServiceRegistry,
provider_index: &KabiProviderIndex,
device: &Arc<DeviceNode>,
min_version: KabiVersion,
) -> Result<KabiServiceHandle, ProbeError> {
let id = S::SERVICE_ID;
// Fast path: service already registered.
if let Some(handle) = registry.get(&id) {
if min_version.is_compatible_with(handle.version) {
return Ok(handle);
}
// Registered but wrong version — permanent failure.
return Err(ProbeError::ServiceVersionMismatch {
service: id,
have: handle.version,
need: min_version,
});
}
// Check if a provider exists at all.
let entry = provider_index.find(&id, min_version)
.ok_or(ProbeError::ServiceUnavailable { service: id })?;
// Provider exists but not loaded yet. Register waiter and trigger load.
{
let mut waiters = registry.waiters.lock();
let list = waiters.entry(id.clone()).or_default();
if list.len() >= MAX_PROBE_WAITERS_PER_SERVICE {
return Err(ProbeError::TooManyWaiters { service: id });
}
list.push(ProbeWaker::new(Arc::clone(device)));
}
// Trigger demand loading of the providing module.
// For LoadPhase::Boot modules this is a no-op (already loaded or loading).
// For LoadPhase::OnDemand this schedules the module loader.
if let Err(ProbeError::LoadQueueFull) = schedule_module_load(entry) {
// Queue is full — the waiter is already registered (above), so the probe
// will be retried when the module loader drains the queue and wakes waiters.
// Return Deferred with the same retry semantics.
return Err(ProbeError::Deferred { waiting_for: id });
}
Err(ProbeError::Deferred { waiting_for: id })
}
/// Driver probe return type.
pub enum ProbeError {
/// Permanent failure — log and do not retry.
Io(IoError),
NotSupported,
ServiceUnavailable { service: ServiceId },
ServiceVersionMismatch { service: ServiceId, have: KabiVersion, need: KabiVersion },
/// Temporary — the kernel will retry this probe when `waiting_for` is registered.
/// The driver MUST return immediately after receiving Deferred from request_service.
/// Partial initialisation state is not allowed — no allocations, no side effects.
Deferred { waiting_for: ServiceId },
/// The module loader work queue (MODULE_LOADER_QUEUE) is at capacity.
/// Treated as transient: the probe waiter is registered and will be retried
/// when the queue drains. Callers should convert this to Deferred.
LoadQueueFull,
/// The waiter list for this service has reached MAX_PROBE_WAITERS_PER_SERVICE.
/// Maps to EAGAIN at the syscall layer. The driver probe framework treats this
/// as transient and will retry with exponential back-off.
TooManyWaiters { service: ServiceId },
}
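The variants split into a transient class (the probe is re-queued) and a permanent class (the probe fails and logs). A helper making that split explicit — an illustrative sketch over a simplified enum with payloads elided, not the kernel's actual API:

```rust
// Illustrative sketch: classify ProbeError variants into retryable vs
// permanent. Payloads are elided; variant names mirror the enum above.
#[derive(Debug, PartialEq)]
enum ProbeError {
    Io,
    NotSupported,
    ServiceUnavailable,
    ServiceVersionMismatch,
    Deferred,
    LoadQueueFull,
    TooManyWaiters,
}

/// True when the probe framework should re-queue the probe rather than
/// fail the device permanently.
fn is_transient(e: &ProbeError) -> bool {
    matches!(
        e,
        ProbeError::Deferred | ProbeError::LoadQueueFull | ProbeError::TooManyWaiters
    )
}
```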
Retry semantics: when KabiServiceRegistry::register is called (a Tier 0 module
finishes loading and registers its service), all ProbeWaker entries for that
service_id are dequeued and each deferred driver's probe is re-submitted to the
device registry's probe work queue. The retry is asynchronous — the registering module
is not blocked waiting for all dependent drivers to probe successfully.
No partial initialisation invariant: a driver that returns Deferred must not have
allocated resources, registered character devices, or modified shared state. The device
registry enforces this by checking that the device node remains in the Matching state
(Section 11.4) after a Deferred return — any device that
has advanced to Probing and returns Deferred triggers a warning and a state reset.
12.7.6 Demand Loading¶
Tier 0 loadable modules are loaded by the kernel-internal module loader — a small
ELF loader in static Core that does not require userspace to be running. This covers
both boot-phase loads (before init starts) and on-demand loads of framework modules
referenced in the initramfs module store.
The loader is driven by a bounded work queue. The types below define the queue entries and the reason codes that drive loader policy (signature requirements, timeout budget, error handling).
/// Reason a module is being loaded. Drives loader policy (signature requirements,
/// timeout budget, error handling).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LoadReason {
/// Loaded on boot from initramfs or built-in driver list.
Boot,
/// Loaded because a device was hot-plugged and matched this module's alias.
HotPlug,
/// Loaded by explicit userspace request (e.g., `modprobe`).
UserRequest,
/// Loaded as a dependency of another module.
Dependency,
/// Loaded because a running driver declared a KABI `requires` dependency that
/// resolved to an on-demand module (see `schedule_module_load` below).
ServiceDependency,
/// Loaded to replace a crashed Tier 1 driver (live recovery path).
/// CrashRecovery applies to Tier 1 and Tier 2 modules only. A Tier 0 module crash
/// is a kernel panic — there is no recovery path. The `LoadReason::CrashRecovery`
/// variant is not valid for Tier 0 modules.
CrashRecovery,
}
/// Errors that can occur during kernel module loading.
pub enum ModuleLoadError {
/// Module file not found in the verified module store.
NotFound { path: ArrayString<256> },
/// ML-DSA signature verification failed.
SignatureInvalid,
/// Module's declared KABI ABI version is incompatible with the running kernel.
KabiVersionMismatch { module_version: KabiVersion, kernel_version: KabiVersion },
/// Module init function returned an error.
InitFailed { errno: i32 },
/// Loader I/O error while reading the module file.
Io(IoError),
/// Module path does not end with `.uko` or fails other validation.
InvalidPath,
/// Module's license (from `KabiDriverManifest.license_id`) is incompatible
/// with the assigned tier. For example, a proprietary driver requesting
/// Tier 0 or Tier 1, or a CDDL driver requesting Tier 0.
LicenseTierViolation { license_id: u16, requested_tier: u8, max_allowed: u8 },
/// IMA appraisal failed (content hash mismatch or missing xattr).
ImaAppraisalFailed,
/// Module exceeds `MODULE_MAX_SIZE` (uncompressed size limit).
TooLarge { size: usize },
}
/// Maximum length of a module path in the verified module store.
const MAX_MODULE_PATH_LEN: usize = 255;
/// A pending module load request queued to the module loader worker.
pub struct ModuleLoadRequest {
/// Module path in the initramfs or sysfs module store. Fixed inline buffer
/// (no heap allocation) — all module paths are bounded by MAX_MODULE_PATH_LEN.
///
/// **Path validation** (enforced by `validate_module_path()` before queuing):
/// - Must start with `/lib/modules/umka/` prefix (the canonical module store).
/// - Must not contain `..` path components (prevents traversal attacks).
/// - Must not contain NUL bytes.
/// - Symlinks in the path are NOT followed (the loader uses `O_NOFOLLOW` at
/// each component). This prevents symlink-based path canonicalization attacks.
/// - Rejected paths return `Err(ModuleLoadError::InvalidPath)` with `EINVAL`.
pub path: ArrayString<256>,
/// Why this module is being loaded (affects policy checks).
pub reason: LoadReason,
/// Completion channel: the loader sends `Ok(())` or an error when done.
pub completion: oneshot::Sender<Result<(), ModuleLoadError>>,
}
/// Maximum module binary size (uncompressed). Modules larger than this are
/// rejected at load time with `ModuleLoadError::TooLarge`. Default: 64 MiB —
/// covers the largest known Linux modules (nvidia.ko at ~40 MiB). Tunable
/// via kernel parameter `umka.module_max_size_mb` (default: 64).
pub const MODULE_MAX_SIZE: usize = 64 * 1024 * 1024;
/// Validate a module path before constructing a `ModuleLoadRequest`.
///
/// Returns `Ok(())` if the path is safe; `Err(ModuleLoadError::InvalidPath)` otherwise.
/// Called by `schedule_module_load()` before pushing to `MODULE_LOADER_QUEUE`.
fn validate_module_path(path: &str) -> Result<(), ModuleLoadError> {
// 1. Must start with the canonical module store prefix.
if !path.starts_with("/lib/modules/umka/") {
return Err(ModuleLoadError::InvalidPath);
}
// 2. Must end with `.uko` (Umka Kernel Object). UmkaOS driver binaries
// use `.uko` to distinguish them from Linux `.ko` modules — prevents
// accidental cross-loading on dual-boot systems.
if !path.ends_with(".uko") {
return Err(ModuleLoadError::InvalidPath);
}
// 3. Must not contain ".." components.
if path.split('/').any(|c| c == "..") {
return Err(ModuleLoadError::InvalidPath);
}
// 4. Must not contain NUL bytes.
if path.bytes().any(|b| b == 0) {
return Err(ModuleLoadError::InvalidPath);
}
Ok(())
}
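The rules are easy to exercise. A standalone copy of the same checks (boolean result for brevity, purely illustrative) behaves as follows:

```rust
// Standalone restatement of the validate_module_path rules, returning bool
// instead of Result for brevity: canonical prefix, .uko extension, no ".."
// components, no NUL bytes.
fn is_valid_module_path(path: &str) -> bool {
    path.starts_with("/lib/modules/umka/")
        && path.ends_with(".uko")
        && !path.split('/').any(|c| c == "..")
        && !path.bytes().any(|b| b == 0)
}
```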
/// The global module loader queue is defined in
/// [Section 11.4](11-drivers.md#device-registry-and-bus-management--module-loader-queue) as
/// `ModuleLoaderQueue` — a `SpinLock<BinaryHeap<PrioritizedLoadRequest>>`
/// with capacity 256 and 4 concurrent loader worker threads. The priority-
/// ordered heap ensures crash recovery loads (priority 180) preempt deferred
/// on-demand loads (priority 100).
///
/// This section uses the same queue via the `schedule_module_load()` function
/// below. See [Section 11.4](11-drivers.md#device-registry-and-bus-management--module-loader-queue) for the `ModuleLoaderQueue` struct, `PrioritizedLoadRequest`
/// ordering, and worker thread lifecycle.
/// Schedule loading of the module that provides a KABI service.
/// Non-blocking: queues the load on the module-loader work queue.
///
/// Returns `Ok(())` if the load was successfully queued (or is a boot module
/// that will be loaded by the init sequence). Returns `Err(ProbeError::LoadQueueFull)`
/// if the `MODULE_LOADER_QUEUE` is at capacity — the caller (`request_service`)
/// converts this into `ProbeError::Deferred` with a retry hint so the probe is
/// re-attempted when the queue drains.
fn schedule_module_load(entry: &KabiProviderEntry) -> Result<(), ProbeError> {
match entry.load_phase {
LoadPhase::Boot => {
// Boot modules are loaded by the kernel-internal loader
// during init sequence. If called after boot, this is a
// programming error — boot modules must be pre-loaded.
// Log and do nothing; probe deferral handles the retry.
log::warn!("Boot-phase module {} not loaded at boot — \
will retry when loaded manually", entry.module_path);
Ok(())
}
LoadPhase::OnDemand => {
// Queue on the module loader work queue.
// The loader verifies ML-DSA signature, maps into Core domain,
// runs the module's init function.
let (tx, _rx) = oneshot::channel();
let request = ModuleLoadRequest {
path: ArrayString::from(entry.module_path)
.expect("module paths are bounded by MAX_MODULE_PATH_LEN"),
reason: LoadReason::ServiceDependency,
completion: tx,
};
match MODULE_LOADER_QUEUE.push(request) {
Ok(()) => {
// The caller does not await `_rx` here — probe deferral
// handles the retry when the module finishes loading.
Ok(())
}
Err(_rejected) => {
// Queue is full. The probe waiter is already registered
// in KabiServiceRegistry::waiters and will be woken when
// the queue drains (the module loader wakes all pending
// probes after completing each load).
Err(ProbeError::LoadQueueFull)
}
}
}
}
}
The kernel-internal module loader handles all Tier 0 loadable module loads.
For Tier 1/2 modules, the kernel also loads them internally via
KabiProviderIndex matching on device discovery — no userspace tool is
involved in the common auto-load path. The Linux-compatible syscall interface
(below) provides manual load/unload for administrative use and compatibility
with existing Linux tools.
12.7.7 Linux Module Tool Compatibility (Dual-Boot Support)¶
UmkaOS uses .uko modules, not .ko. But unmodified Linux module management
tools (modprobe, rmmod, lsmod, modinfo from the kmod package) work
on UmkaOS without recompilation. This enables dual-boot: the same distro rootfs
boots with either a Linux kernel or an UmkaOS kernel, and the same tools manage
modules on both.
12.7.7.1 Filesystem Layout¶
/lib/modules/
5.15.0-generic/ ← Linux kernel's modules (.ko files)
modules.alias
modules.dep
kernel/drivers/...
umka-1.0.0/ ← UmkaOS kernel's modules (.uko files)
modules.alias ← generated by umka-depmod
modules.dep ← generated by umka-depmod
modules.symbols ← generated by umka-depmod
kernel/drivers/
net/ethernet/intel/ixgbe/ixgbe.uko
block/nvme/nvme.uko
fs/ext4/ext4.uko
...
Both directories coexist on the same rootfs. uname -r returns the active
kernel's version string (5.15.0-generic for Linux, umka-1.0.0 for UmkaOS),
and modprobe reads from the corresponding /lib/modules/$(uname -r)/ directory.
Package managers install both module sets; GRUB selects which kernel boots.
12.7.7.2 Syscall Compatibility¶
UmkaOS implements the Linux module management syscalls. These accept .uko
files (not .ko) and use the KABI loader internally:
/// Load a module from a file descriptor. Linux-compatible syscall.
/// modprobe opens a .uko file, calls finit_module(fd, params, flags).
/// UmkaOS reads the .uko ELF from the fd and loads via the KABI loader.
///
/// If the module is already loaded (kernel-internal auto-load raced with
/// modprobe): returns -EEXIST. modprobe treats this as success.
///
/// Requires CAP_SYS_MODULE (Linux capability bit 16 in SystemCaps).
pub fn sys_finit_module(fd: i32, params: *const u8, flags: i32) -> i64;
/// Load a module from a memory buffer. Linux-compatible syscall.
/// Used by insmod. Same behavior as finit_module but reads from
/// userspace buffer instead of file descriptor.
pub fn sys_init_module(image: *const u8, len: u64, params: *const u8) -> i64;
/// Unload a module by name. Linux-compatible syscall.
/// Used by rmmod. Triggers the KABI module unload sequence:
/// 1. `cancel_work_sync()` all pending deferred work items
/// ([Section 3.11](03-concurrency.md#workqueue-deferred-work)) — ensures no work callbacks fire
/// after the module's code pages are freed.
/// 2. Quiesce ring buffer connections (drain in-flight KABI calls).
/// 3. Drain I/O (wait for outstanding DMA and interrupt handlers).
/// 4. Call module exit function.
/// 5. Unregister KABI services ([Section 12.7](#kabi-service-dependency-resolution)).
/// 6. Release isolation domain ([Section 11.2](11-drivers.md#isolation-mechanisms-and-performance-modes)).
/// 7. Free module memory (code + data pages).
/// Flags: O_NONBLOCK (fail if module is in use), O_TRUNC (force unload).
pub fn sys_delete_module(name: *const u8, flags: u32) -> i64;
12.7.7.3 /proc/modules Compatibility¶
UmkaOS provides /proc/modules in Linux-compatible format so that lsmod
(which reads this file) works unmodified:
module_name size refcount deps state offset
nvme 53248 1 - Live 0xffffffffa0000000
ext4 204800 2 - Live 0xffffffffa0010000
ixgbe 102400 0 - Live 0xffffffffa0040000
Fields are populated from the KABI module registry. The offset field
contains the module's .text section base address (for debugging tools).
The deps field lists KABI service dependencies (modules this module
requires). The state field maps from KABI module state:
Loading → Loading, Active → Live, Unloading → Unloading,
CrashRecovery → Live (recovery is transparent to userspace).
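The field mapping above can be sketched as a formatter over a registry entry. ModuleEntry, KabiModuleState, and proc_modules_line are illustrative names for this sketch, not the spec's actual types:

```rust
// Sketch only: illustrative types standing in for the KABI module registry.
#[derive(Clone, Copy, PartialEq)]
enum KabiModuleState { Loading, Active, Unloading, CrashRecovery }

struct ModuleEntry {
    name: &'static str,
    size: usize,    // total code + data bytes
    refcount: u32,
    text_base: u64, // .text section base address (the "offset" column)
    state: KabiModuleState,
}

/// Map KABI module state to the Linux-compatible /proc/modules column.
/// CrashRecovery reports as "Live" — recovery is transparent to userspace.
fn proc_state(s: KabiModuleState) -> &'static str {
    match s {
        KabiModuleState::Loading => "Loading",
        KabiModuleState::Active | KabiModuleState::CrashRecovery => "Live",
        KabiModuleState::Unloading => "Unloading",
    }
}

/// Render one /proc/modules line: name size refcount deps state offset.
fn proc_modules_line(m: &ModuleEntry) -> String {
    format!("{} {} {} - {} 0x{:016x}",
            m.name, m.size, m.refcount, proc_state(m.state), m.text_base)
}
```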
12.7.7.4 umka-depmod Tool¶
umka-depmod is the UmkaOS equivalent of depmod. It reads .uko KABI
manifests and generates Linux-format module index files:
umka-depmod [--basedir /lib/modules/umka-X.Y.Z]
Reads: *.uko files in the module directory
→ parses .kabi_manifest ELF section for:
- driver_name (module name)
- match rules (PCI vendor/device, USB vid/pid, platform name,
OF compatible, ACPI HID — same data as Linux MODULE_DEVICE_TABLE)
- KABI service dependencies (requires/provides)
Writes: modules.alias — MODALIAS patterns → module names
modules.dep — module name → file path + dependencies
modules.symbols — exported service symbols
Run by the package manager after installing/updating .uko files, same as
depmod runs after apt install linux-image-*. On dual-boot systems,
depmod manages the Linux module index; umka-depmod manages the UmkaOS
module index. Both coexist.
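As a sketch of one umka-depmod step, the following builds a modules.alias line from a PCI match rule. PciMatchRule and alias_line are hypothetical names for this illustration; the MODALIAS pattern layout is the standard Linux one (uppercase hex vendor/device IDs, '*' wildcards for unconstrained fields):

```rust
// Sketch only: turning one PCI match rule from a .kabi_manifest into a
// Linux-format modules.alias line that modprobe can match against the
// kernel's MODALIAS uevent string.
struct PciMatchRule {
    vendor: u16,
    device: u16,
}

/// Build "alias pci:v...d...sv*sd*bc*sc*i* <module>". Subvendor, subdevice,
/// and class fields are unconstrained here, so they become '*' wildcards.
fn alias_line(rule: &PciMatchRule, module: &str) -> String {
    format!("alias pci:v{:08X}d{:08X}sv*sd*bc*sc*i* {}",
            rule.vendor as u32, rule.device as u32, module)
}
```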
12.7.7.5 How systemd-udevd Works on UmkaOS¶
systemd-udevd is unmodified. The flow:
- UmkaOS kernel detects new device (PCI enumeration, USB hotplug, etc.)
- Kernel emits uevent via NETLINK_KOBJECT_UEVENT (Section 19.5) with Linux-compatible ACTION, DEVPATH, SUBSYSTEM, MODALIAS attributes.
- udevd receives uevent, matches against udev rules.
- If a rule says RUN+="modprobe $env{MODALIAS}":
  - modprobe reads /lib/modules/umka-1.0.0/modules.alias
  - Finds the matching .uko module name
  - Opens the .uko file
  - Calls finit_module(fd, "", 0)
  - UmkaOS loads the module
  - If the module was already loaded by kernel-internal auto-load: finit_module returns -EEXIST, modprobe returns success silently
- udevd creates device nodes (/dev/*), sets permissions, creates symlinks — all based on uevent attributes, same as on Linux.
In practice, the kernel-internal auto-loader (KabiProviderIndex +
schedule_module_load()) loads most drivers BEFORE udevd's modprobe call
arrives. The modprobe call becomes a harmless no-op (-EEXIST). This is
normal — Linux also has this race (kernel auto-loads some modules via
request_module() before udevd gets to them).
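A minimal sketch of how a modprobe-style caller collapses that race, assuming the raw syscall wrapper returns a negative errno (load_result is a hypothetical helper; 17 is Linux's EEXIST value):

```rust
// Sketch only: modprobe-style handling of the kernel auto-load race.
const EEXIST: i64 = 17; // Linux errno for "File exists" / already loaded

/// Treat "already loaded" as success, mirroring modprobe's behavior when
/// the kernel-internal auto-loader won the race against udevd's modprobe.
fn load_result(raw: i64) -> Result<(), i64> {
    match raw {
        0 => Ok(()),
        e if e == -EEXIST => Ok(()), // auto-load raced us; module is present
        e => Err(e),
    }
}
```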
12.7.7.6 Explicit modprobe in Scripts¶
Systemd units and scripts that explicitly load modules work:
# These all work on UmkaOS:
modprobe ip_tables # loads /lib/modules/umka-1.0.0/kernel/net/ipv4/ip_tables.uko
modprobe fuse # loads fuse.uko
modprobe loop # loads loop.uko
modprobe -r ixgbe # unloads ixgbe (calls delete_module)
lsmod # reads /proc/modules
modinfo ixgbe.uko # reads .uko ELF sections
12.7.7.7 No Dedicated Daemon Needed¶
Linux doesn't have a module-loading daemon — modprobe is invoked on-demand
by the kernel (call_usermodehelper) or by udevd. UmkaOS follows the same
model: kernel-internal auto-loading for device discovery, finit_module()
syscall for explicit user requests. No umka-modload daemon.
For post-boot module installation (admin installs a new .uko package):
run umka-depmod to regenerate modules.alias/modules.dep, then the
next device hotplug or explicit modprobe finds the new module. Same as
running depmod after installing a new Linux kernel package.
12.7.8 Circular Dependency Prohibition¶
Circular dependencies between Tier 0 loadable modules are statically prohibited.
The kabi-gen toolchain runs a topological sort over the complete requires/provides
graph at build time and rejects cycles with a build error identifying the cycle:
error[KABI-E0021]: circular dependency detected
mdio_framework requires pci_bus (ok)
scsi_framework requires block_layer (ok)
hypothetical_a requires hypothetical_b
hypothetical_b requires hypothetical_a ← cycle here
fix: merge hypothetical_a and hypothetical_b into one module, or
break the cycle by moving shared state into static Core
If two services genuinely need each other, they must either be merged into one module
(which can then provide both services and call between them as direct internal calls)
or their shared state must be factored into a third module that neither depends on the
other. Circular dependencies between a Tier 0 module and static Core are impossible by
construction — static Core has no requires declarations and is always available.
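The build-time check can be sketched with Kahn's algorithm over the requires graph. find_cycle_members is a hypothetical stand-in for the kabi-gen pass; it reports the modules whose requires can never resolve (members of a cycle, or modules blocked behind one):

```rust
use std::collections::HashMap;

// Sketch only: build-time cycle detection over a requires-graph keyed by
// module name. An empty result means the graph is acyclic.
fn find_cycle_members(requires: &HashMap<&str, Vec<&str>>) -> Vec<String> {
    // In-degree = number of not-yet-resolved requires for each module.
    let mut indeg: HashMap<&str, usize> =
        requires.iter().map(|(m, deps)| (*m, deps.len())).collect();
    let mut ready: Vec<&str> =
        indeg.iter().filter(|(_, d)| **d == 0).map(|(m, _)| *m).collect();
    while let Some(done) = ready.pop() {
        // `done` is fully resolved: release every module that requires it.
        for (m, deps) in requires {
            if deps.contains(&done) {
                let d = indeg.get_mut(m).unwrap();
                *d -= 1;
                if *d == 0 { ready.push(*m); }
            }
        }
    }
    // Anything still unresolved is in a cycle or blocked behind one.
    let mut cycle: Vec<String> = indeg.iter()
        .filter(|(_, d)| **d > 0).map(|(m, _)| m.to_string()).collect();
    cycle.sort();
    cycle
}
```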
Runtime circular dependency detection: While the build-time topological sort
catches cycles in the declared dependency graph, the runtime module loader also
performs cycle detection as a defense-in-depth measure (protects against tampered
or hand-crafted .uko files with inconsistent metadata):
1. The loader maintains a per-load-operation "currently resolving" set (thread-local
ArrayVec<ServiceId, 32>).
2. Before resolving each requires entry, the loader checks if the service is already
in the "currently resolving" set.
3. If found: the load fails with ModuleLoadError::CircularDependency { cycle: [ids] }.
4. On success or failure, the service is removed from the set (RAII guard pattern).
This is O(N) per dependency check where N <= 32 (maximum dependency depth). The
constant bound ensures the detection has negligible cost on the module load path.
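The four steps above can be sketched as an RAII guard. Resolving is a hypothetical name, and a RefCell<Vec<u32>> stands in for the spec's thread-local ArrayVec<ServiceId, 32>:

```rust
use std::cell::RefCell;

// Sketch only: the per-load-operation "currently resolving" set with an
// RAII guard that removes the entry on success or failure (step 4).
type ServiceId = u32;

#[derive(Debug, PartialEq)]
enum ModuleLoadError { CircularDependency { cycle: Vec<ServiceId> } }

struct Resolving<'a> {
    set: &'a RefCell<Vec<ServiceId>>,
}

impl<'a> Resolving<'a> {
    /// Enter resolution of `id`. Fails if `id` is already being resolved —
    /// the dependency chain has looped back on itself (step 3).
    fn enter(set: &'a RefCell<Vec<ServiceId>>, id: ServiceId)
        -> Result<Self, ModuleLoadError>
    {
        let mut s = set.borrow_mut();
        if s.contains(&id) {
            // Report the cycle: everything from the repeated id onward.
            let pos = s.iter().position(|x| *x == id).unwrap();
            let mut cycle = s[pos..].to_vec();
            cycle.push(id);
            return Err(ModuleLoadError::CircularDependency { cycle });
        }
        s.push(id);
        drop(s);
        Ok(Resolving { set })
    }
}

impl Drop for Resolving<'_> {
    /// Leaving the resolution scope pops our entry — no manual cleanup.
    fn drop(&mut self) { self.set.borrow_mut().pop(); }
}
```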
12.7.9 Tier 0 Module Lifecycle (load_once)¶
Tier 0 loadable modules are never unloaded. Once a Tier 0 module's init function
completes and it registers its services, it remains in Core's memory domain for the
lifetime of the system. This is enforced by the load_once: true declaration in the
module's .kabi file and by the module loader, which never processes an unload request
for a load_once module.
Rationale: Tier 0 modules execute in the same address space as static Core. An interrupt handler, timer callback, or RCU deferred callback anywhere in the kernel might hold a function pointer into a Tier 0 module's code. Reference counting alone cannot guarantee that no stale pointer exists — safe unloading would require auditing every possible execution context. The cost (permanent resident memory) is acceptable because Tier 0 loadable modules are framework code (SCSI mid-layer, MDIO, SPI bus core, etc.) — they are small compared to the hardware they enable, and a system that loads them has implicitly declared a need for them.
Tier 1 modules (which are domain-isolated) can be unloaded safely because the isolation boundary prevents stale intra-kernel pointers. Unloading a Tier 1 module revokes its MPK domain and all ring buffer connections to it; no part of the Core domain retains a callable pointer into Tier 1 code.
12.7.10 Version Negotiation¶
When driver X requests service Y at >= version 1.2, and the registered provider
exports version 2.0, the registry negotiates the binding version:
fn negotiate_version(
handle: &KabiServiceHandle,
caller_min: KabiVersion,
caller_max: KabiVersion,
) -> Result<KabiServiceHandle, ProbeError> {
// Provider is newer than caller expects: caller gets a downgraded view.
// The vtable_size field limits which methods are visible to the caller.
// The provider's vtable is laid out append-only ([Section 12.2](#kabi-abi-rules-and-lifecycle) Rule 1),
// so limiting to caller's known size is always safe.
let effective_version = handle.version.min(caller_max);
if effective_version < caller_min {
return Err(ProbeError::ServiceVersionMismatch { ... });
}
Ok(KabiServiceHandle { version: effective_version, ..handle.clone() })
}
This uses the kabi_version field as the primary version discriminant (Section 12.2
Rule 6). vtable_size remains as a bounds-safety check for the zero-extension contract.
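A worked sketch of the clamping rule, with KabiVersion modeled as a (major, minor) tuple purely for illustration (the real type is defined elsewhere in the chapter):

```rust
// Sketch only: the negotiation rule from negotiate_version, reduced to its
// version arithmetic. Tuples compare lexicographically, so min() picks the
// older of provider version and caller's compiled-against maximum.
type Version = (u16, u16); // (major, minor) — illustrative stand-in

/// effective = min(provider, caller_max); error if that falls below
/// caller_min (the provider is too old for this caller).
fn negotiate(provider: Version, caller_min: Version, caller_max: Version)
    -> Option<Version>
{
    let effective = provider.min(caller_max);
    if effective < caller_min { None } else { Some(effective) }
}
```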
12.7.11 Security Model¶
The dependency resolution mechanism is a potential privilege escalation path: a compromised Tier 1 driver requesting a service could cause the kernel to load a Tier 0 module. UmkaOS's defence:
- KabiProviderIndex is sealed after boot. The index is populated from ML-DSA-signed module headers during early init and marked read-only before any driver loads. A Tier 1 driver cannot add entries to the index.
- Service requests are by opaque ID, not by module path. A Tier 1 driver calls request_service::<MdioService>() — it cannot specify which module file to load. The resolution from service ID to module path happens entirely inside Core, using the pre-verified index.
- A Tier 1 driver can only trigger loading of modules that the system already trusts. If a module is not in the signed provider index, request_service returns ServiceUnavailable and no loading occurs.
- All module loads verify the ML-DSA signature (Section 9.3; see that section for the full ML-DSA-65 signature format) before the module's code is executed. A Tier 1 driver cannot cause execution of unsigned code even if it somehow injected an entry into the provider index (which it cannot, per point 1).
12.7.12 Signing Key Initialization¶
Driver signature format and verification algorithm are defined in Section 9.3. This section specifies the KABI loader's integration with that verification infrastructure.
The KABI loader's signature verification keyring is initialized during Phase 3.7
(kabi_keyring_init), before any driver loading occurs. This ensures that no
KABI binary can be loaded before the verification infrastructure is ready.
/// KABI loader key initialization sequence.
///
/// Called during Phase 3.7, after `device_registry_init()` and before
/// any driver probe. The sequence is:
///
/// 1. **Allocate the `.kabi` keyring**: `kabi_keyring = keyring_alloc(".kabi")`.
/// This is a kernel-internal keyring ([Section 10.2](10-security-extensions.md#kernel-key-retention-service)),
/// not visible to userspace via `/proc/keys`. Its `KeyFlags` include
/// `KEY_FLAG_KERNEL_ONLY` to prevent userspace `keyctl()` manipulation.
///
/// 2. **Copy trusted signing keys** from the kernel's embedded `.driver_certs`
/// ELF section into `.kabi` keyring. These keys were placed in the kernel
/// binary at build time and verified during the boot chain
/// ([Section 9.3](09-security.md#verified-boot-chain) step 4).
/// Each key is an ML-DSA-65 public key (1952 bytes). The `.driver_certs`
/// section is a packed array of `DriverCertEntry` structs:
/// ```rust
/// // kernel-internal, not KABI
/// #[repr(C)]
/// pub struct DriverCertEntry {
/// /// Key identifier (SHA3-256 hash of the DER-encoded public key).
/// pub key_id: [u8; 32],
/// /// DER-encoded SubjectPublicKeyInfo (SPKI) containing the ML-DSA-65
/// /// public key. SPKI wraps the raw 1952-byte ML-DSA-65 public key
/// /// (NIST FIPS 204) in an ASN.1 SEQUENCE with an AlgorithmIdentifier
/// /// OID, producing a self-describing key blob. The SPKI format is
/// /// required because `AsymmetricKey.instantiate()` in the Key
/// /// Retention Service ([Section 10.2](10-security-extensions.md#kernel-key-retention-service)) expects
/// /// DER SPKI for all asymmetric key types. Using raw bytes here
/// /// would require a format conversion at import time, adding
/// /// complexity and a potential mismatch.
/// ///
/// /// Maximum size: 1980 bytes (1952-byte ML-DSA-65 key + ~28 bytes
/// /// ASN.1 overhead). The actual length is determined by parsing the
/// /// DER SEQUENCE length field; trailing bytes are ignored.
/// pub public_key: [u8; 1980],
/// /// Actual length of the DER SPKI data in `public_key`.
/// /// Must be <= 1980. Zero indicates an empty/invalid entry.
/// pub public_key_len: u16,
/// /// Maximum tier this key can vouch for (tier ceiling).
/// /// 0 = Tier 0/1/2 eligible (distro-built, open-source verified)
/// /// 1 = Tier 1/2 only (open-source but not audited for Tier 0)
/// /// 2 = Tier 2 only (vendor binary blob, or unaudited code)
/// /// The module loader enforces: effective_tier = min(requested_tier,
/// /// max_tier). A vendor whose CA cert is pinned with max_tier=2 cannot
/// /// have its drivers loaded at Tier 0 or Tier 1, regardless of admin
/// /// overrides. This is the distro's trust policy lever.
/// pub max_tier: u8,
/// /// Certificate source — determines hard constraints on max_tier.
/// /// 0 = CERT_SOURCE_BUILTIN: embedded at distro kernel build time.
/// /// max_tier is respected as-is.
/// /// 1 = CERT_SOURCE_MOK: enrolled at runtime via Machine Owner Key
/// /// (shim/MOK manager). Always capped at Tier 2 by the kernel
/// /// regardless of the max_tier field value — the kernel enforces
/// /// `effective_max_tier = 2` for all MOK-enrolled certs. This
/// /// prevents a user from granting Ring 0 access to unvetted code
/// /// via MOK enrollment alone.
/// pub source: u8,
/// }
/// const_assert!(core::mem::size_of::<DriverCertEntry>() == 2016);
/// ```
///
/// 3. **Import UEFI db keys** (conditional): if Secure Boot is active and the
/// kernel was loaded via UEFI, also import keys from the UEFI Signature
/// Database (`db`) that have the `EFI_CERT_TYPE_DRIVER_SIGNING` usage flag.
/// These are converted from X.509 DER to the internal key format. Keys from
/// `dbx` (revocation list) are checked first — any key present in `dbx` is
/// rejected. This path is skipped on non-UEFI boots (legacy BIOS, DTB-only).
///
/// 3a. **Import MOK keys** (conditional): if the shim bootloader enrolled
/// Machine Owner Keys (MOK), import them from the
/// `.secondary_trusted_keys` keyring ([Section 10.2](10-security-extensions.md#kernel-key-retention-service)).
/// MOK keys are imported with `source = CERT_SOURCE_MOK` (1) and their
/// `max_tier` is hardcoded to 2 by the kernel — regardless of any
/// `max_tier` value the key might carry. This means:
/// - A user can enroll any vendor's signing key via MOK, and that
/// vendor's drivers will load and work — but always at Tier 2
/// (Ring 3, process-isolated).
/// - To get Tier 0 or Tier 1, the signing key must be embedded in
/// `.driver_certs` at kernel build time (`source = CERT_SOURCE_BUILTIN`).
/// - This prevents a compromised admin or social engineering attack
/// from granting Ring 0 access to unvetted binary blobs via MOK
/// enrollment alone.
/// MOK keys that match any entry in `dbx` or the KRL are rejected.
///
/// 4. **Lock the keyring**: after initialization, the `.kabi` keyring is sealed
/// (`KEY_FLAG_SEALED`). No new keys can be added at runtime. This prevents
/// key injection attacks where a compromised process with `CAP_SYS_ADMIN`
/// might attempt to add a rogue signing key.
///
/// **ML-DSA signing key rotation**: Since the `.kabi` keyring is sealed at boot,
/// key rotation requires a kernel update. The rotation procedure:
/// 1. Generate new ML-DSA-65 keypair. Add the new public key to `.driver_certs`.
/// 2. Re-sign all `.uko` modules with the new private key.
/// 3. Keep the old public key in `.driver_certs` for a transition period (at least
/// two kernel releases) so that modules signed with the old key remain loadable.
/// 4. After the transition period, remove the old public key from `.driver_certs`
/// and add its `key_id` to the KRL ([Section 9.3](09-security.md#verified-boot-chain)) to explicitly
/// revoke it. Modules signed with the revoked key will be rejected at load time.
/// This is a deliberate design choice: runtime key addition would expand the attack
/// surface. Key rotation is an administrative, not a runtime, operation.
///
/// # Errors
/// - `KeyError::CryptoInit`: the kernel crypto subsystem is not initialized
/// (programming error — crypto init at Phase 1.3.1 must precede Phase 3.7).
/// - `KeyError::NoKeys`: the `.driver_certs` section is empty and no UEFI db
/// keys were found. This is fatal in Secure Boot mode (the system cannot
/// verify any driver). In non-Secure-Boot mode, a warning is logged but
/// boot continues (all KABI loads will fail signature verification).
pub fn kabi_keyring_init() -> Result<KeyringHandle, KeyError>
/// Verify a KABI driver binary's signature.
///
/// Uses `crypto_verify_signature()` from the Kernel Crypto API
/// ([Section 10.1](10-security-extensions.md#kernel-crypto-api)). No parallel
/// cryptographic implementation — the same ML-DSA/SLH-DSA code paths used
/// for boot verification are reused here.
///
/// The signature is stored in a `.kabi_sig` ELF section within the driver
/// binary. The section layout is defined by `KabiSigSection`:
///
/// ```rust
/// // kernel-internal, not KABI
/// #[repr(C)]
/// pub struct KabiSigSection {
/// /// Signature algorithm identifier.
/// /// 1 = ML-DSA-65 (primary, mandatory support).
/// /// 2 = SLH-DSA-SHA2-128s (fallback for hardware without ML-DSA).
/// pub algo: u8,
/// /// Reserved (must be zero).
/// pub _reserved: [u8; 3],
/// /// Key identifier: SHA3-256 hash of the signing key's public key.
/// /// Used to look up the correct key in the `.kabi` keyring without
/// /// trying all keys.
/// pub key_id: [u8; 32],
/// /// Signature length in bytes. ML-DSA-65 signatures are 3309 bytes.
/// pub sig_len: u32,
/// /// The signature bytes (variable length, immediately follows this header).
/// /// The signed message is the entire driver binary EXCLUDING the
/// /// `.kabi_sig` section itself (the section is zeroed before hashing).
/// pub signature: [u8; 0], // flexible array member
/// }
/// // Fixed header size (excluding flexible array):
/// const_assert!(core::mem::size_of::<KabiSigSection>() == 40);
/// ```
///
/// # Errors
/// - `SignatureError::KeyNotFound`: no key in the `.kabi` keyring matches `key_id`.
/// - `SignatureError::VerificationFailed`: the signature does not match the binary.
/// - `SignatureError::UnsupportedAlgorithm`: `algo` is not a recognized value.
/// - `SignatureError::MalformedSection`: the `.kabi_sig` section is truncated or
/// has invalid field values.
pub fn kabi_verify_signature(
binary: &[u8],
sig_section: &KabiSigSection,
) -> Result<(), SignatureError> {
// 1. Look up the signing key by key_id.
let key = KABI_KEYRING.find_key(&sig_section.key_id)
.ok_or(SignatureError::KeyNotFound)?;
// 2. Verify using the kernel crypto API (same code as boot verification).
crypto_verify_signature(&key, &sig_section.signature(), binary)
}
/// Global reference to the sealed `.kabi` keyring.
/// Initialized by `kabi_keyring_init()` during Phase 3.7.
/// Immutable after initialization (the keyring is sealed).
static KABI_KEYRING: OnceCell<KeyringHandle> = OnceCell::new();
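A sketch of the bounds checks that MalformedSection implies, parsing the 40-byte header from raw section bytes. ParsedSig is an illustrative owned view, and little-endian encoding of sig_len is an assumption not stated by the struct definition:

```rust
// Sketch only: defensive parsing of a .kabi_sig section. The real loader
// would validate KabiSigSection in place; this copy-out version shows the
// same checks (header size, reserved bytes, algo, sig_len bounds).
const KABI_SIG_HEADER_LEN: usize = 40; // matches const_assert above
const ML_DSA_65_SIG_LEN: usize = 3309;

#[derive(Debug, PartialEq)]
enum SigParseError { MalformedSection, UnsupportedAlgorithm }

struct ParsedSig {
    algo: u8,
    key_id: [u8; 32],
    signature: Vec<u8>,
}

fn parse_kabi_sig(section: &[u8]) -> Result<ParsedSig, SigParseError> {
    if section.len() < KABI_SIG_HEADER_LEN {
        return Err(SigParseError::MalformedSection); // truncated header
    }
    let algo = section[0];
    if section[1..4] != [0, 0, 0] {
        return Err(SigParseError::MalformedSection); // _reserved must be 0
    }
    if algo != 1 && algo != 2 {
        return Err(SigParseError::UnsupportedAlgorithm);
    }
    let mut key_id = [0u8; 32];
    key_id.copy_from_slice(&section[4..36]);
    // Assumption: sig_len is stored little-endian.
    let sig_len = u32::from_le_bytes(section[36..40].try_into().unwrap()) as usize;
    if algo == 1 && sig_len != ML_DSA_65_SIG_LEN {
        return Err(SigParseError::MalformedSection); // wrong ML-DSA-65 length
    }
    let end = KABI_SIG_HEADER_LEN.checked_add(sig_len)
        .ok_or(SigParseError::MalformedSection)?;
    if section.len() < end {
        return Err(SigParseError::MalformedSection); // truncated signature
    }
    Ok(ParsedSig { algo, key_id, signature: section[KABI_SIG_HEADER_LEN..end].to_vec() })
}
```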
12.7.13 IMA Measurement Hook¶
After signature verification succeeds and before loading the binary into memory, the module loader records the driver in IMA's runtime measurement log. This ensures all loaded KABI drivers appear in the IMA measurement log for remote attestation (Section 9.5).
The measurement uses the same hash computed during signature verification (no re-hash), so the IMA hook adds zero cryptographic overhead:
/// Record a KABI driver binary in IMA's measurement log.
///
/// Called after signature verification succeeds, before the binary is
/// mapped into the Core domain for execution. Inserting the measurement
/// at this point ensures:
///
/// 1. **No wasted cycles on invalid binaries**: unsigned or tampered binaries
/// are rejected by `kabi_verify_signature()` before reaching this point.
/// 2. **Complete attestation record**: every successfully loaded KABI driver
/// appears in the IMA log. A remote verifier can reconstruct the exact
/// set of drivers loaded on the system.
/// 3. **Hash reuse**: the `binary_hash` parameter is the SHA3-256 digest
/// computed during signature verification. The IMA hook does not re-hash
/// the binary.
///
/// The IMA event record contains:
/// - `pcr`: PCR 10 (the standard IMA measurement PCR).
/// - `template`: `ima-sig` (hash + signature + filename).
/// - `digest`: `binary_hash` (SHA3-256, 32 bytes).
/// - `filename`: `driver_name` (e.g., "nvme_driver.uko").
///
/// If IMA is not enabled (no TPM or IMA policy not loaded), this function
/// is a no-op — the driver load proceeds without measurement.
pub fn ima_kabi_measure(binary_hash: &[u8; 32], driver_name: &str) {
ima_measure_record(ImaEventType::KabiDriver, binary_hash, driver_name);
}
/// Appraise a KABI driver binary against its stored IMA extended attribute.
///
/// Called after `ima_kabi_measure()` during driver load. Appraisal verifies
/// that the driver binary's content hash matches the signed hash stored in
/// the file's `security.ima` extended attribute. If the IMA policy includes
/// an `appraise func=KABI_CHECK appraise_type=imasig` rule (which is the
/// default built-in policy), this function enforces the check.
///
/// **Appraisal steps**:
/// 1. Read the `security.ima` xattr from the driver file. If missing and
/// appraisal is enforced (`umka.ima=enforce`), return `ImaError::NoXattr`.
/// 2. Parse the xattr: extract the algorithm ID and the signature over the
/// file hash. Verify the signature against the IMA keyring (`.ima_mok`
/// or `.builtin_trusted_keys`).
/// 3. Compare the signed hash against `binary_hash` (the hash computed
/// during the load sequence). If they differ, the file has been modified
/// since signing — return `ImaError::HashMismatch`.
///
/// In `umka.ima=log` mode, mismatches are logged but the driver load
/// proceeds (audit-only). In `umka.ima=enforce` mode, mismatches cause
/// the driver load to fail.
///
/// # Errors
/// - `ImaError::NoXattr`: the file lacks a `security.ima` xattr and
/// appraisal is enforced.
/// - `ImaError::HashMismatch`: the file's content hash does not match the
/// signed hash in the xattr.
/// - `ImaError::SignatureInvalid`: the xattr signature verification failed
/// (the xattr was tampered or signed with an untrusted key).
/// - `ImaError::UnsupportedAlgorithm`: the xattr specifies a hash algorithm
/// not supported by IMA.
pub fn ima_kabi_appraise(
file: &OpenFile,
binary_hash: &[u8; 32],
driver_name: &str,
) -> Result<(), ImaError> {
// 1. Read security.ima xattr.
let xattr = vfs_getxattr(file, "security.ima")
.map_err(|_| ImaError::NoXattr)?;
// 2. Parse algorithm ID and signature from xattr.
let (algo_id, sig_data) = ima_parse_xattr(&xattr)?;
// 3. Verify the xattr signature against IMA keyrings.
ima_verify_xattr_signature(algo_id, sig_data, binary_hash)?;
Ok(())
}
Integration with the module load sequence: The module loader worker
(MODULE_LOADER_QUEUE consumer) processes each ModuleLoadRequest in the
following order:
1. Read module binary from the verified module store (/lib/modules/umka/). Module files use the .uko extension ("Umka Kernel Object") to distinguish them from Linux .ko modules. validate_module_path() rejects paths that do not end with .uko.
2. Verify ML-DSA signature: kabi_verify_signature(binary, sig_section). Reject on failure (ModuleLoadError::SignatureInvalid).
3. Signing certificate tier ceiling: look up the signing key's DriverCertEntry in the .kabi keyring. Read max_tier and source. If source == CERT_SOURCE_MOK, force effective_max_tier = 2 regardless of the max_tier field; otherwise use max_tier as-is. If the driver's requested tier exceeds effective_max_tier, demote the driver to effective_max_tier. This enforces the distro's trust policy: a vendor CA pinned with max_tier=2 cannot have its drivers loaded at Tier 0/1. Log: "driver {name}: cert tier ceiling {max_tier}, effective tier {tier}".
4. License-tier validation: read license_id from the .kabi_manifest section and look up the license in the ALLR table (Section 24.7):
   - ALLR Tier 1/2 licenses (GPL-2.0, MIT, BSD, Apache, MPL-2.0, LGPL-2.1, EPL-2.0-with-secondary, ISC, Zlib): Tier 0/1/2 eligible — no restriction.
   - ALLR Tier 3 licenses (CDDL-1.0, GPL-3.0-only, EUPL-1.2): cap at Tier 1. Tier 0 static linking is prohibited (creates a derivative work); KABI IPC at Tier 1 provides the license boundary.
   - Process-isolation-only licenses (LGPL-3.0, EPL-2.0-no-secondary): cap at Tier 2. No kernel address space access.
   - Proprietary (0xF000+) or unspecified (0x0000): cap at Tier 2.

   If the driver's effective tier (after step 3) still exceeds the license ceiling, demote further. Reject with ModuleLoadError::LicenseTierViolation if no valid tier remains (i.e., the driver's minimum_tier exceeds the license ceiling). Trust note: license_id is a self-declaration (like Linux's MODULE_LICENSE()). The full ALLR license-to-tier mapping is defined in Section 24.7; the list above summarizes the tiers relevant to driver loading. For distro-built drivers (signed by the distro key, source=CERT_SOURCE_BUILTIN), the distro built from source and the declaration is trustworthy. For vendor-signed drivers, the signing cert's max_tier (step 3) is the real enforcement — license_id is advisory. A vendor-signed driver claiming GPL while the cert has max_tier=2 is still capped at Tier 2 by step 3. A vendor-signed driver falsely claiming GPL with max_tier=0 would only occur if the distro pinned the vendor cert at max_tier=0, which implies the distro trusts the vendor — and that trust decision is the distro's responsibility, not the kernel's.
5. IMA measurement: ima_kabi_measure(&binary_hash, &request.path). Records the driver hash in IMA's measurement log (PCR 10).
6. IMA appraisal: ima_kabi_appraise(&file, &binary_hash, &request.path). Verifies that the driver binary's content hash matches its signed security.ima xattr. In enforce mode, a mismatch or missing xattr rejects the load (ModuleLoadError::ImaAppraisalFailed). This step ensures that even if an attacker replaces a driver binary with a validly-signed binary for a different driver (the signature is per-binary, but the IMA xattr is per-file), the appraisal catches the mismatch. IMA-KABI interaction detail: IMA appraisal for KABI modules is triggered by the KABI_CHECK IMA policy rule (not by the generic FILE_CHECK hook). The IMA policy must include appraise func=KABI_CHECK for this step to run; without this rule, step 6 is a no-op (measurement in step 5 still occurs). On ima_policy=tcb (the default on secure systems), KABI_CHECK appraisal is enabled automatically. The hash algorithm for IMA xattrs on .uko files must match the kernel's IMA hash (ima_hash=sha3-256 by default); if the xattr was signed with a different algorithm, appraisal fails with ImaAppraisalFailed and the mismatch is logged.
7. Parse ELF headers: validate architecture, section layout, relocations.
8. Map into Core domain: allocate pages, apply relocations, set page permissions (code = RX, data = RW, rodata = RO).
9. KABI version check: verify that the module's declared KABI version is compatible with the running kernel.
10. Call init function: module.init() registers KABI services and providers.
11. Wake waiters: notify all request_service() callers waiting on services that this module provides.
Steps 3-4 enforce the three-layer trust model: signing cert trust policy (step 3) and license-tier compatibility (step 4) together determine the effective tier before any code is loaded into memory. Steps 5-6 are the IMA additions. Measurement (step 5) is positioned after signature verification (step 2) to avoid measuring rejected binaries. Appraisal (step 6) follows measurement so the measurement log records the driver regardless of appraisal outcome (for audit trail completeness). Both steps occur before memory mapping (step 8) to ensure the check reflects the on-disk binary, not the post-relocation in-memory image.
12.8 Domain Runtime — Unified Domain Model Mechanics¶
Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules. &self methods use interior mutability for mutation. Atomic fields use .store()/.load(). All #[repr(C)] structs have const_assert! size verification. See CLAUDE.md §Spec Pseudocode Quality Gates.
The design philosophy (Section 1.1) defines the Unified Domain Model conceptually: every piece of code runs in a domain; same-domain calls are direct, cross-domain calls use rings. The transport classes (Section 12.6) define the wire formats: T0 (direct vtable), T1 (ring + domain switch), T2 (ring + syscall). This section specifies the runtime mechanics that connect those two layers: how modules discover each other, how rings are created, how handles are cached, how interrupts flow across domains, and how the system adapts when modules change domains at runtime.
12.8.1.1 Canonical Type Definitions¶
/// Domain identifier. A u64 counter assigned by the domain registry at domain
/// creation time. Domain 0 is the Core domain (Tier 0). Domains 1-N are Tier 1
/// hardware-isolated domains. Tier 2 domains have per-process IDs.
///
/// This is the canonical definition — all references to `DomainId` across
/// the KABI subsystem (including [Section 12.3](#kabi-bilateral-capability-exchange),
/// [Section 12.6](#kabi-transport-classes), [Section 11.3](11-drivers.md#driver-isolation-tiers)) resolve to
/// this type.
pub type DomainId = u64;
/// The Core domain (Tier 0) always has domain ID 0. All Tier 0 modules
/// share this single domain. Direct vtable calls within this domain.
pub const CORE_DOMAIN_ID: DomainId = 0;
/// Domain lifecycle state. Stored as `AtomicU8` in `DomainDescriptor.state`
/// using the discriminant values below. All transitions are performed via
/// `compare_exchange(Acquire, Relaxed)` on the atomic — no external lock
/// required for the state field itself (but some transitions require the
/// domain registry lock for broader invariants).
///
/// State machine:
/// ```text
/// Stopped ──[create]──► Active
/// Active ──[fault]───► Crashed
/// Active ──[shutdown]─► Quiescing ──[drained]──► Stopped
/// Crashed ──[recover]──► Recovering ──[ready]──► Active
/// Crashed ──[disable]──► Stopped
/// Recovering ──[fault]─► Crashed (double-fault during recovery)
/// ```
///
/// Used by `domain_crashed()` in [Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)
/// and by the workqueue subsystem ([Section 3.11](03-concurrency.md#workqueue-deferred-work)) to skip
/// work items from crashed domains.
///
/// This is the canonical definition — all references to `DomainState` across
/// the architecture resolve to this type.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DomainState {
/// Domain is operational. Modules are loaded and serving requests.
/// Rings are active. This is the normal steady-state.
Active = 0,
/// Domain is being gracefully shut down. In-flight ring requests
/// are being drained. No new requests are accepted (producers
/// observe the quiescing flag via the ring header and return
/// `EAGAIN`). Transition to `Stopped` occurs when all in-flight
/// requests have completed and all rings are empty.
Quiescing = 1,
/// A fault was detected in this domain (e.g., hardware memory
/// protection violation, panic in driver code, watchdog timeout).
/// All rings to/from this domain are frozen. The crash recovery
/// sequence ([Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)) has been
/// initiated. Modules in this domain are not executing.
Crashed = 2,
/// The domain is being recovered after a crash. Driver binary is
/// being reloaded, state is being restored from the Module Binary
/// Store snapshot. Rings are not yet re-enabled. If a second fault
/// occurs during recovery, the domain transitions back to `Crashed`
/// (double-fault — recovery may be retried with a different driver
/// version or the domain may be permanently stopped).
Recovering = 3,
/// Domain is inactive. Either it was never started, was gracefully
/// shut down (from `Quiescing`), or was permanently disabled after
/// a crash (operator decision or exceeded retry limit). All
/// resources have been freed. The `DomainId` may be recycled.
Stopped = 4,
}
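The CAS transition protocol described in the `DomainState` doc comment can be sketched standalone. A minimal illustration using userspace `std` atomics in place of the kernel's; `try_transition` is a hypothetical helper for this sketch, not part of the spec surface:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Mirror of the DomainState discriminants above.
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq)]
enum DomainState { Active = 0, Quiescing = 1, Crashed = 2, Recovering = 3, Stopped = 4 }

/// Attempt one edge of the state machine via compare_exchange(Acquire, Relaxed).
/// Returns false if another CPU won the race (the atomic no longer holds `from`).
fn try_transition(state: &AtomicU8, from: DomainState, to: DomainState) -> bool {
    state
        .compare_exchange(from as u8, to as u8, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}

fn main() {
    let state = AtomicU8::new(DomainState::Active as u8);
    // Active --[shutdown]--> Quiescing succeeds.
    assert!(try_transition(&state, DomainState::Active, DomainState::Quiescing));
    // A racing [fault] from Active now fails: the CAS observes Quiescing.
    assert!(!try_transition(&state, DomainState::Active, DomainState::Crashed));
    // Quiescing --[drained]--> Stopped succeeds.
    assert!(try_transition(&state, DomainState::Quiescing, DomainState::Stopped));
    assert_eq!(state.load(Ordering::Acquire), DomainState::Stopped as u8);
}
```

Losing the CAS is not an error: it means another CPU already moved the domain, and the caller re-reads the state to decide what to do next.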
/// Maximum number of devices per isolation domain. Bounded by practical
/// domain grouping: AArch64 POE has 3 driver domains, x86 MPK has ~12.
/// A domain with >16 devices is a configuration error (too much blast radius).
pub const MAX_DEVICES_PER_DOMAIN: usize = 16;
/// Maximum number of inbound/outbound rings per domain. Each ring connects
/// this domain to one other domain for one service. 64 is generous —
/// most domains have 2-8 rings (one per bound service × number of peer domains).
pub const MAX_RINGS_PER_DOMAIN: usize = 64;
/// Domain descriptor. The canonical definition of a domain's runtime state.
///
/// One `DomainDescriptor` exists per active domain in the system. Stored in
/// the global `domain_registry: XArray<DomainId, DomainDescriptor>` (Tier 0
/// memory, protected by RCU for read access and a per-entry SpinLock for
/// mutations).
///
/// # Lifecycle
///
/// Created by `create_domain()` during driver probe or domain-group formation.
/// Fields are populated incrementally: `id`, `state`, `tier`, `isolation_key`
/// at creation; `modules`, `devices`, rings at probe/bind time; `iommu_domains`
/// cached at first device assignment.
///
/// Destroyed by `destroy_domain()` after successful crash recovery teardown
/// or operator-initiated domain removal. The `generation` counter is
/// incremented on every crash/recovery cycle; handles with stale generation
/// values detect the domain change via the generation check in `kabi_call!`.
///
/// # Crash recovery usage
///
/// The crash recovery orchestrator ([Section 11.9](11-drivers.md#crash-recovery-and-state-preservation))
/// uses `DomainDescriptor` fields as follows:
/// - Step 1 (DETECT): `domain_registry.get(domain_id)` to retrieve descriptor.
/// - Step 1a: `tier` to select recovery path (Tier 0 = panic, Tier 1 = recover,
/// Tier 2 = process restart). `IsolationTier::TierM` is unreachable for local
/// domains (TierM is a transport-level concept for remote peers).
/// - Step 2 (ISOLATE): `isolation_key` to revoke hardware permissions;
/// `state` set to `Crashed`; `generation` incremented.
/// - Step 2' (SET RING STATE): `inbound_rings` and `outbound_rings` enumerated
/// to set all ring states to `RING_STATE_DISCONNECTED`.
/// - Step 2a (NMI): `NmiCrashContext.revoked_domain_id` broadcast.
/// - DMA-1: `devices` iterated to issue FLR per device.
/// - DMA-2: `iommu_domains` iterated (deduplicated) for IOTLB invalidation.
/// Each `Arc<IommuDomain>` is obtained by iterating `domain_desc.devices`:
/// for each `DeviceHandle`, the `DmaDeviceHandle.iommu_domain` field (an
/// `RcuCell<Option<Arc<IommuDomain>>>`) is read under `rcu_read_lock()` at
/// domain creation time. The `Arc` is cloned into this array to extend the
/// lifetime beyond the RCU critical section. IOMMU domains are deduplicated
/// (multiple devices in the same IOMMU group share one `IommuDomain`).
/// - Step 7 (UNLOAD): `modules` iterated for teardown.
/// - Step 8 (RELOAD): modules re-loaded, rings re-created.
// kernel-internal, not KABI
pub struct DomainDescriptor {
/// Unique domain identifier. Immutable after creation.
pub id: DomainId,
/// Current lifecycle state. Transitions under `crash_lock`.
/// Readers use `Acquire` ordering; writers use `Release` ordering.
pub state: AtomicU8,
/// Effective isolation tier. Immutable after domain creation.
/// `IsolationTier::TierM` is never assigned to a local domain (TierM
/// represents remote peer kernels; local domains are Tier 0, 1, or 2).
pub tier: IsolationTier,
/// Hardware isolation key index (MPK PKEY on x86, POE domain on AArch64,
/// DACR domain on ARMv7, segment register on PPC32, Radix PID on PPC64LE).
/// 0 = Core domain (no isolation key needed). Tier 2 domains use IOMMU +
/// process address space, not a key — this field is 0 for Tier 2.
pub isolation_key: u32,
/// Generation counter. Incremented on every crash/recovery cycle.
/// Handles cache this value at bind time; the generation check in
/// `kabi_call!` detects stale handles from crashed/reloaded domains.
/// u64: at 10^6 crashes/sec (absurd), wraps in 584K years.
pub generation: AtomicU64,
/// Consumer threads running in this domain. One per inbound service ring.
/// The crash recovery orchestrator sends NMI to CPUs where these threads
/// are executing (identified via `CpuLocal.active_domain == self.id`).
pub consumer_threads: ArrayVec<TaskId, MAX_RINGS_PER_DOMAIN>,
/// Devices assigned to this domain. Populated at driver probe time.
/// Used by crash recovery for FLR (DMA-1) and device re-initialization.
pub devices: ArrayVec<DeviceHandle, MAX_DEVICES_PER_DOMAIN>,
/// Inbound rings (this domain is the consumer). Each ring connects a
/// remote producer domain to a service provided by a module in this domain.
/// Used by crash recovery Step 2' to set all ring states to Disconnected.
pub inbound_rings: ArrayVec<*const CrossDomainRing, MAX_RINGS_PER_DOMAIN>,
/// Outbound rings (this domain is the producer). Each ring connects a
/// module in this domain to a service in a remote consumer domain.
/// Used by crash recovery to notify peer domains of the crash.
pub outbound_rings: ArrayVec<*const CrossDomainRing, MAX_RINGS_PER_DOMAIN>,
/// Cached IOMMU domain references for devices in this domain. Deduplicated
/// from device IOMMU groups at domain creation time — multiple devices in
/// the same IOMMU group share one `IommuDomain`. Used by crash recovery
/// DMA-2 for IOTLB invalidation without traversing the device registry
/// under RCU during the time-critical crash path.
pub iommu_domains: ArrayVec<Arc<IommuDomain>, MAX_DEVICES_PER_DOMAIN>,
/// Module manifest references. One entry per module loaded into this domain.
/// Bounded by practical domain sizing — most domains have 1-4 modules.
pub modules: ArrayVec<ModuleId, 16>,
/// Exception-context crash lock. Held ONLY during Steps 1-2a
/// (exception/NMI context, non-sleeping). Protects the domain
/// revocation, ring state transition, and NMI broadcast. Released
/// by the exception handler BEFORE pushing the `CrashRecoveryRequest`
/// to `CRASH_RECOVERY_RING`. NOT held during Steps 3-9 (which
/// require sleeping — use `recovery_mutex` instead).
///
/// Level assignment: CRASH_LOCK(70) in the lock ordering table
/// ([Section 3.4](03-concurrency.md#cumulative-performance-budget--lock-ordering)).
pub crash_lock: SpinLock<()>,
/// Process-context recovery mutex. Held during Steps 3-9 (DRAIN,
/// DMA quiesce, state snapshot, FMA, UNLOAD, RELOAD, RESUME).
/// These steps require sleeping (memory allocation, zstd
/// decompression, module loading). Acquired by the
/// `CrashRecoveryWorker` kthread after dequeuing a request.
///
/// Serializes per-domain recovery: if a second crash occurs for
/// the same domain while Steps 3-9 are in progress, the second
/// `CrashRecoveryRequest` queues in `CRASH_RECOVERY_RING` and the
/// worker blocks on `recovery_mutex` until the first recovery
/// completes.
///
/// Lock ordering: `crash_lock`(70) < `recovery_mutex`(75).
/// They are never held simultaneously (crash_lock is released
/// before recovery_mutex is acquired).
pub recovery_mutex: Mutex<()>,
}
impl DomainDescriptor {
/// Return the effective isolation tier for this domain.
///
/// Used by crash recovery Step 1a to select the recovery path.
/// `IsolationTier::TierM` is unreachable for local domains — a debug
/// assertion enforces this invariant.
pub fn effective_tier(&self) -> IsolationTier {
debug_assert!(
!matches!(self.tier, IsolationTier::TierM),
"TierM is a transport-level concept; local domains cannot have TierM"
);
self.tier
}
/// Check whether this domain is in a crashed, recovering, or stopped state.
///
/// Used by the workqueue subsystem to drop pending work items for a
/// crashed domain instead of dispatching them. Returns `true` if
/// `state` is `Crashed`, `Recovering`, or `Stopped`.
pub fn domain_crashed(&self) -> bool {
let s = self.state.load(Ordering::Acquire);
// AtomicU8 yields the raw discriminant; compare against the enum values.
s == DomainState::Crashed as u8
|| s == DomainState::Recovering as u8
|| s == DomainState::Stopped as u8
}
}
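The `generation` staleness check that handles perform reduces to a single atomic load and compare. A minimal sketch (the `Handle` type here is a stand-in for `KabiHandle`, whose real layout appears in Section 12.8.2):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the handle's cached bind-time generation.
struct Handle { cached_generation: u64 }

/// A handle is stale when the domain's generation counter has advanced
/// past the value cached at bind time (i.e., a crash/recovery cycle
/// occurred since the handle was resolved).
fn is_stale(domain_generation: &AtomicU64, handle: &Handle) -> bool {
    domain_generation.load(Ordering::Acquire) != handle.cached_generation
}

fn main() {
    let generation = AtomicU64::new(3);
    let handle = Handle { cached_generation: generation.load(Ordering::Acquire) };
    assert!(!is_stale(&generation, &handle));
    // Crash recovery increments the domain generation...
    generation.fetch_add(1, Ordering::Release);
    // ...and the handle is now detectably stale.
    assert!(is_stale(&generation, &handle));
}
```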
/// NMI crash communication context. Written by the faulting CPU before
/// sending the NMI IPI broadcast; read by the NMI handler on remote CPUs.
///
/// A single global instance suffices: concurrent domain crashes are
/// serialized by each domain's `crash_lock`. The second crash waits for
/// the first NMI cycle to complete before using this context.
///
/// Memory ordering: the faulting CPU writes all fields with `Release`
/// before the NMI IPI send (which includes an architecture-appropriate
/// barrier — see [Section 11.9](11-drivers.md#crash-recovery-and-state-preservation) CR-22
/// barrier table). The NMI handler reads with `Acquire`.
// kernel-internal, not KABI
pub struct NmiCrashContext {
/// Domain ID of the crashed domain. NMI handler compares this against
/// `CpuLocal.active_domain` to determine if the current CPU is executing
/// in the crashed domain.
pub revoked_domain_id: AtomicU64,
/// Number of CPUs ejected by the NMI handler (those that were executing
/// in the crashed domain). The faulting CPU spins until this equals the
/// expected count (number of CPUs with `active_domain == revoked_domain_id`).
pub ejected_count: AtomicU64,
/// Whether a crash NMI cycle is in progress. Serializes access to
/// `revoked_domain_id` and `ejected_count` across concurrent domain
/// crashes. 0 = idle, 1 = active.
///
/// **Protocol** (enforced by crash exception handler):
/// 1. CAS `active` from 0 to 1 (Acquire). If CAS fails, another
/// domain's NMI cycle is in progress — spin-wait on `active`
/// until it reads 0, then retry CAS.
/// 2. Write `revoked_domain_id` (Release) and reset `ejected_count`
/// to 0 (Release).
/// 3. Send NMI IPI broadcast.
/// 4. Spin-poll `ejected_count` until all targeted CPUs have acked.
/// 5. Store `active` to 0 (Release) — releases the context for the
/// next domain crash.
///
/// Note: per-domain `crash_lock` does NOT serialize cross-domain
/// access to this global context. Two different domains crashing
/// concurrently acquire different `crash_lock` instances — only
/// this `active` CAS serializes their access to `NMI_CRASH_CTX`.
pub active: AtomicU8,
}
/// Global NMI crash context. Single instance — concurrent domain crashes
/// are serialized by `NMI_CRASH_CTX.active` (CAS 0→1 before writing
/// fields; store 0 after NMI cycle completes). The per-domain `crash_lock`
/// serializes same-domain re-crashes but does NOT serialize cross-domain
/// access to this global context — `active` does.
pub static NMI_CRASH_CTX: NmiCrashContext = NmiCrashContext {
revoked_domain_id: AtomicU64::new(0),
ejected_count: AtomicU64::new(0),
active: AtomicU8::new(0),
};
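The five-step `active` protocol can be expressed directly. A sketch under stated assumptions: userspace `std` atomics model the kernel atomics, `begin_crash_nmi_cycle` is an illustrative name, and the actual NMI IPI send (step 3) is elided since it is architecture-specific:

```rust
use std::hint;
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

struct NmiCrashContext {
    revoked_domain_id: AtomicU64,
    ejected_count: AtomicU64,
    active: AtomicU8,
}

/// Claim the global context, publish the crashed domain, wait for all
/// targeted CPUs to ack, then release the context for the next crash.
fn begin_crash_nmi_cycle(ctx: &NmiCrashContext, domain_id: u64, expected_ejections: u64) {
    // Step 1: claim the context (CAS 0 -> 1); spin-wait on contention.
    while ctx.active.compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed).is_err() {
        hint::spin_loop();
    }
    // Step 2: publish the crashed domain and reset the ack counter.
    ctx.revoked_domain_id.store(domain_id, Ordering::Release);
    ctx.ejected_count.store(0, Ordering::Release);
    // Step 3: send NMI IPI broadcast (architecture-specific, elided).
    // Step 4: spin-poll until all targeted CPUs have acked.
    while ctx.ejected_count.load(Ordering::Acquire) < expected_ejections {
        hint::spin_loop();
    }
    // Step 5: release the context for the next domain crash.
    ctx.active.store(0, Ordering::Release);
}

fn main() {
    let ctx = NmiCrashContext {
        revoked_domain_id: AtomicU64::new(0),
        ejected_count: AtomicU64::new(0),
        active: AtomicU8::new(0),
    };
    // With no CPUs executing in the crashed domain, the cycle completes
    // immediately and leaves the context released.
    begin_crash_nmi_cycle(&ctx, 7, 0);
    assert_eq!(ctx.revoked_domain_id.load(Ordering::Acquire), 7);
    assert_eq!(ctx.active.load(Ordering::Acquire), 0);
}
```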
/// Crash recovery handoff request. Written by the exception handler
/// (non-sleeping context) and consumed by the `CrashRecoveryWorker`
/// kthread (sleeping context). This bridges the exception→process context
/// boundary: Steps 1-2a run in exception/NMI context (cannot sleep);
/// Steps 3-9 are offloaded via this request to the worker which can sleep
/// (module loading at Step 8 requires memory allocation and zstd
/// decompression).
// kernel-internal, not KABI
pub struct CrashRecoveryRequest {
/// Domain that crashed.
pub domain_id: DomainId,
/// Number of CPUs ejected by NMI (for logging/diagnostics).
pub ejected_cpu_count: u64,
/// Fault PC from the exception handler (for FMA crash records).
pub fault_pc: u64,
/// Fault address (if applicable — e.g., data abort on ARM).
pub fault_addr: u64,
/// Architecture-specific exception type code.
pub exception_type: u32,
}
/// Bounded SPSC ring for crash recovery handoff requests. The exception
/// handler (producer) writes requests; the CrashRecoveryWorker (consumer)
/// dequeues and processes them. Capacity 16 is sufficient: each domain
/// crash produces exactly one request, and the system has at most
/// `MAX_ISOLATION_DOMAINS` domains (platform-dependent, typically 8-16).
/// If the ring is full (16 concurrent domain crashes), the 17th crash
/// increments `CRASH_OVERFLOW_COUNT` and falls back to panic — an extreme
/// cascade scenario where the system is unrecoverable anyway.
pub static CRASH_RECOVERY_RING: BoundedSpscRing<CrashRecoveryRequest, 16> =
BoundedSpscRing::new();
/// Atomic counter for diagnostic tracking of ring-full crash overflow events.
pub static CRASH_OVERFLOW_COUNT: AtomicU64 = AtomicU64::new(0);
/// Global crash recovery worker. A single kthread that processes all
/// domain crash recovery requests from `CRASH_RECOVERY_RING`. Created
/// once at boot by `crash_recovery_init()`.
///
/// **Why a single global worker** (not per-domain):
/// - Domain crashes are rare events (~0-3 per year in production).
/// - A per-domain worker would waste memory and task slots for a
/// never-run kthread in every domain.
/// - The `DomainDescriptor.recovery_mutex` serializes per-domain
/// recovery anyway, so multiple workers would serialize on the mutex.
/// - If parallel recovery is desired (multiple domains crashing
/// simultaneously), the single worker dequeues all pending requests
/// and spawns one-shot kthreads for each (bounded by
/// `MAX_CONCURRENT_RECOVERIES = 4`).
///
/// **Lifecycle**:
/// - Created at boot by `crash_recovery_init()` (after scheduler is live).
/// - Sleeps on `CRASH_RECOVERY_WAITQUEUE` when the ring is empty.
/// - Woken by the exception handler after pushing a `CrashRecoveryRequest`.
/// - Never exits (50-year uptime requirement).
// kernel-internal, not KABI
pub struct CrashRecoveryWorker {
/// TaskId of the worker kthread.
pub task_id: TaskId,
}
/// WaitQueue used by the exception handler to wake the CrashRecoveryWorker
/// after pushing a request to `CRASH_RECOVERY_RING`.
pub static CRASH_RECOVERY_WAITQUEUE: WaitQueue = WaitQueue::new();
/// Maximum concurrent recovery operations. When the worker dequeues
/// multiple requests, it spawns up to this many one-shot kthreads.
/// The worker always handles at least one itself (inline).
pub const MAX_CONCURRENT_RECOVERIES: usize = 4;
/// Initialize the crash recovery subsystem. Called once during boot,
/// after the scheduler and kthread infrastructure are available.
///
/// Creates the `CrashRecoveryWorker` kthread. The kthread runs at
/// `SCHED_FIFO` priority 90 (below watchdog at 99, above normal
/// kworkers) to ensure timely recovery processing.
pub fn crash_recovery_init() {
let task = kthread_create(
"crash_recovery_worker",
crash_recovery_worker_main,
SchedPolicy::Fifo { priority: 90 },
);
// Store task_id for diagnostic queries (e.g., via /proc/umka/crash_recovery).
CRASH_RECOVERY_WORKER.init(CrashRecoveryWorker { task_id: task.id });
}
/// Main loop of the crash recovery worker kthread.
///
/// Dequeues `CrashRecoveryRequest` entries from `CRASH_RECOVERY_RING`
/// and executes the process-context portion of crash recovery (Steps 3-9
/// from [Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)).
///
/// For each request:
/// 1. Acquire `DomainDescriptor.recovery_mutex` (sleeping Mutex).
/// 2. Execute Steps 3 (DRAIN), 4 (DMA quiesce), 5 (state snapshot),
/// 6 (FMA record), 7 (UNLOAD), 8 (RELOAD), 9 (RESUME).
/// 3. Release `recovery_mutex`.
///
/// The `crash_lock` (SpinLock) is NOT held during Steps 3-9. It is held
/// only during Steps 1-2a (exception context) and released by the
/// exception handler before pushing the request to `CRASH_RECOVERY_RING`.
fn crash_recovery_worker_main() {
loop {
// Sleep until a crash recovery request is available.
wait_event!(CRASH_RECOVERY_WAITQUEUE, !CRASH_RECOVERY_RING.is_empty());
// Dequeue all pending requests (batch processing).
while let Some(request) = CRASH_RECOVERY_RING.try_pop() {
let domain = domain_registry.get(request.domain_id);
// Acquire the per-domain recovery_mutex. This serializes
// process-context recovery for the same domain (a second crash
// during recovery queues behind the first).
let _recovery_guard = domain.recovery_mutex.lock();
// Execute Steps 3-9 (sleeping operations allowed).
crash_recovery_process_context(&request, &domain);
// recovery_mutex released when _recovery_guard drops.
}
}
}
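The exception-to-worker handoff pattern above (push, wake, batch-drain) has a direct userspace analogue. A sketch where `Mutex<VecDeque>` plus `Condvar` stand in for the lock-free `CRASH_RECOVERY_RING` and `CRASH_RECOVERY_WAITQUEUE`:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

struct Request { domain_id: u64 }

fn main() {
    let shared = Arc::new((Mutex::new(VecDeque::new()), Condvar::new()));
    let worker_shared = Arc::clone(&shared);
    // "Worker kthread": sleeps until the ring is non-empty, then
    // batch-drains all pending requests (as in crash_recovery_worker_main).
    let worker = thread::spawn(move || {
        let (ring, wq) = &*worker_shared;
        let mut guard = ring.lock().unwrap();
        // Condition is re-checked under the lock, so a wakeup sent
        // before the wait cannot be lost.
        while guard.is_empty() {
            guard = wq.wait(guard).unwrap();
        }
        let drained: Vec<u64> = guard.drain(..).map(|r: Request| r.domain_id).collect();
        drained
    });
    // "Exception handler": push a request, then wake the worker.
    let (ring, wq) = &*shared;
    ring.lock().unwrap().push_back(Request { domain_id: 3 });
    wq.notify_one();
    assert_eq!(worker.join().unwrap(), vec![3]);
}
```

The kernel version replaces the mutex with the SPSC ring's lock-free push/pop, but the wakeup discipline is the same: the sleeping side re-checks the emptiness condition before blocking.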
12.8.1.2 T1 Ring Entry Definitions¶
T1 command and completion entries are the ring wire format for Tier 1 (in-kernel, hardware-isolated) domain crossings. They mirror the T2 variants defined in Section 12.6 but operate entirely within Ring 0 — the consumer reads entries directly from shared memory without copying them to a private buffer first (unlike T2 where the kernel copies before processing to prevent TOCTOU from a Ring 3 producer).
Each ring slot carries inline argument space — arguments are serialized directly
into the command entry, not into a shared argument buffer. This eliminates the
concurrent-corruption bug where multiple kabi_call! callers would write to the same
shared arg_buffer region. Each slot's argument space is disjoint by construction.
The inline argument size (INLINE_ARG_SIZE) is determined per-service-interface by the
IDL compiler based on the maximum argument size across all methods in the .kabi
definition. Typical sizes:
| Service type | Inline size | Entry size | Ring memory (256 slots) |
|---|---|---|---|
| Control/metadata (most KABI) | 64 bytes | 128 bytes (2 cache lines) | 32 KiB |
| Small I/O (VFS read/write) | 192 bytes | 256 bytes (4 cache lines) | 64 KiB |
| Large I/O (DMA descriptors) | 64 bytes inline + overflow | 128 bytes + 256 bytes overflow/slot | 96 KiB |
For services whose largest method exceeds the inline capacity, each slot owns a
fixed-size overflow chunk in a partitioned overflow region appended after the ring
data. The arg_overflow_offset field in the command entry points into this region.
Each slot's overflow chunk is disjoint — no concurrent access is possible.
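The memory figures in the table follow from slots × entry size plus the per-slot overflow partition. A quick arithmetic check (`ring_memory` is an illustrative helper, not a spec function):

```rust
/// Total ring memory: the entry array plus the partitioned overflow
/// region appended after it (one fixed-size chunk per slot).
fn ring_memory(slots: usize, entry_size: usize, overflow_chunk: usize) -> usize {
    slots * entry_size + slots * overflow_chunk
}

fn main() {
    // Control/metadata: 256 slots x 128 B = 32 KiB.
    assert_eq!(ring_memory(256, 128, 0), 32 * 1024);
    // Small I/O: 256 slots x 256 B = 64 KiB.
    assert_eq!(ring_memory(256, 256, 0), 64 * 1024);
    // Large I/O: 256 x 128 B entries + 256 x 256 B overflow chunks = 96 KiB.
    assert_eq!(ring_memory(256, 128, 256), 96 * 1024);
}
```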
/// Default inline argument size in bytes. Used when no `.kabi` override is
/// specified. Sufficient for most control/metadata services. The IDL compiler
/// overrides this per service based on the maximum serialized argument size.
pub const DEFAULT_INLINE_ARG_SIZE: usize = 64;
/// Command entry placed by a producer into a T1 (Tier 1) ring.
///
/// The producer is typically the kabi_call! macro expansion or the
/// dispatch_to_domain() path in [Section 12.3](#kabi-bilateral-capability-exchange).
/// The consumer is the IDL-generated consumer loop running in the target
/// Tier 1 domain.
///
/// Unlike T2CommandEntry (where the kernel must copy-before-process to
/// prevent TOCTOU from a Ring 3 producer), T1 entries are accessed directly
/// from shared memory — both producer and consumer run in Ring 0 and the
/// hardware isolation domain prevents tampering after submission.
///
/// Each slot carries inline argument space (`args` field). Arguments are
/// serialized directly into the slot by the producer, eliminating the
/// shared-arg-buffer concurrent-corruption bug. The IDL compiler determines
/// the inline size per service interface; `DEFAULT_INLINE_ARG_SIZE` (64 bytes)
/// is used when no override is specified.
///
/// For arguments exceeding the inline capacity, the producer writes overflow
/// data to the slot's dedicated overflow chunk (at `arg_overflow_offset` in
/// the partitioned overflow region appended after the ring data). Each slot
/// owns a fixed-size chunk — no concurrent access across slots.
#[repr(C, align(64))]
pub struct T1CommandEntry {
/// Vtable method ordinal (0-based index into the service vtable).
/// The consumer loop validates this against `vtable_method_count()`
/// before dispatch.
pub method_index: u32,
/// Flags bitfield.
/// - `T1_CMD_NOTIFY` (0x01): request doorbell notification on completion.
/// - `T1_CMD_BATCH` (0x02): more commands follow — defer doorbell.
/// - `T1_CMD_OVERFLOW` (0x04): arguments exceed inline capacity;
/// overflow data is at `arg_overflow_offset`.
pub flags: u32,
/// Argument data length in bytes (inline portion).
/// The consumer reads `args[0..arg_len]` for the serialized arguments.
pub arg_len: u32,
/// Byte offset into the per-slot overflow region, relative to the
/// overflow region base. Only valid when `T1_CMD_OVERFLOW` is set.
/// Each slot owns `overflow_chunk_size` bytes starting at
/// `overflow_base + slot_index * overflow_chunk_size`.
pub arg_overflow_offset: u32,
/// Opaque value echoed verbatim in the corresponding T1CompletionEntry.
/// The producer uses this to correlate completions with outstanding
/// requests (e.g., as a sequence number or pending-request index).
pub cookie: u64,
/// Inline argument data. Serialized by the IDL-generated code in
/// the `kabi_call!` macro expansion. Size is `DEFAULT_INLINE_ARG_SIZE`
/// (64 bytes) unless overridden per-service by the IDL compiler.
/// The consumer reads `args[0..arg_len]`.
pub args: [u8; DEFAULT_INLINE_ARG_SIZE],
/// Reserved for future use. Must be zero.
pub _reserved: [u8; 32],
}
// T1CommandEntry: method_index(4) + flags(4) + arg_len(4) +
// arg_overflow_offset(4) + cookie(8) + args(64) + _reserved(32)
// = 16 + 8 + 64 + 32 = 120 bytes. Padded to 128 bytes (2 cache lines)
// by repr(C, align(64)).
const_assert!(core::mem::size_of::<T1CommandEntry>() == 128);
/// Flag: request doorbell notification on completion.
pub const T1_CMD_NOTIFY: u32 = 0x01;
/// Flag: more commands follow — defer doorbell until last in batch.
pub const T1_CMD_BATCH: u32 = 0x02;
/// Flag: arguments exceed inline capacity; overflow data present.
pub const T1_CMD_OVERFLOW: u32 = 0x04;
/// Completion entry placed by the consumer loop into the T1 completion ring.
///
/// Written by the IDL-generated consumer loop after dispatching through the
/// vtable. The producer reads completions to determine method call results.
///
/// Unlike T2CompletionEntry (where the driver writes completions in Ring 3
/// and the kernel reads them), T1 completions are written within the
/// isolated Ring 0 domain — no copy-before-read needed.
#[repr(C, align(64))]
pub struct T1CompletionEntry {
/// Matches the cookie from the corresponding T1CommandEntry.
pub cookie: u64,
/// 0 = success, negative = -errno (matches Linux error convention).
pub status: i32,
/// Result data length in bytes (in the shared result buffer).
pub result_len: u32,
/// Byte offset into the shared result buffer where the result begins.
pub result_offset: u32,
/// Reserved for future use. Must be zero.
pub _reserved: [u8; 44],
}
// T1CompletionEntry: cookie(8) + status(4) + result_len(4) +
// result_offset(4) + _reserved(44) = 64 bytes (one cache line).
const_assert!(core::mem::size_of::<T1CompletionEntry>() == 64);
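The `cookie` correlation contract can be illustrated with a toy pending-request table (names hypothetical): the producer records each outstanding request under its cookie and matches completions, which may arrive out of order, against it:

```rust
use std::collections::HashMap;

struct Completion { cookie: u64, status: i32 }

// Toy pending-request table keyed by cookie.
struct Pending { inflight: HashMap<u64, &'static str> }

impl Pending {
    /// Match a completion to its outstanding request. Returns None for
    /// an unknown (or already-completed) cookie.
    fn complete(&mut self, c: &Completion) -> Option<(&'static str, i32)> {
        self.inflight.remove(&c.cookie).map(|req| (req, c.status))
    }
}

fn main() {
    let mut p = Pending { inflight: HashMap::new() };
    p.inflight.insert(1, "read_page");
    p.inflight.insert(2, "discard_blocks");
    // Completions may arrive out of order; the echoed cookie disambiguates.
    let done = p.complete(&Completion { cookie: 2, status: 0 }).unwrap();
    assert_eq!(done, ("discard_blocks", 0));
    // A duplicate completion for the same cookie no longer matches.
    assert!(p.complete(&Completion { cookie: 2, status: 0 }).is_none());
}
```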
impl T1CommandEntry {
/// Construct a T1 command entry from a KabiRequest and a ring reference.
///
/// Used by `dispatch_to_domain()` in [Section 12.3](#kabi-bilateral-capability-exchange)
/// for the Tier 1 dispatch path. Copies arguments inline into the entry's
/// `args` field and generates a monotonic `cookie` from the ring's cookie
/// counter. If arguments exceed `DEFAULT_INLINE_ARG_SIZE`, copies the excess
/// into the slot's dedicated overflow chunk and sets `T1_CMD_OVERFLOW`.
///
/// # Arguments
///
/// - `request`: The validated `KabiRequest` from the caller.
/// - `ring`: The `CrossDomainRing` for the target domain, used to
/// allocate a cookie and access the overflow region.
/// - `slot_index`: The ring slot index assigned to this entry (from
/// the CAS on `header.head`). Used to compute the overflow offset.
///
/// # Panics
///
/// Debug-asserts that `request.args_len` does not exceed the inline +
/// overflow capacity for this ring.
pub fn from_request(
request: &KabiRequest,
ring: &CrossDomainRing,
slot_index: usize,
) -> Self {
let cookie = ring.next_cookie();
let mut entry = Self {
method_index: request.method_index,
flags: 0,
arg_len: request.args_len,
arg_overflow_offset: 0,
cookie,
args: [0u8; DEFAULT_INLINE_ARG_SIZE],
_reserved: [0u8; 32],
};
if request.args_len > 0 && !request.args_ptr.is_null() {
let inline_copy = core::cmp::min(
request.args_len as usize,
DEFAULT_INLINE_ARG_SIZE,
);
// SAFETY: args_ptr is validated by the caller; inline_copy <=
// DEFAULT_INLINE_ARG_SIZE. Both source and destination are in
// kernel address space.
unsafe {
core::ptr::copy_nonoverlapping(
request.args_ptr,
entry.args.as_mut_ptr(),
inline_copy,
);
}
if request.args_len as usize > DEFAULT_INLINE_ARG_SIZE {
// Overflow: copy excess into this slot's dedicated overflow chunk.
let overflow_len = request.args_len as usize - DEFAULT_INLINE_ARG_SIZE;
let overflow_offset = slot_index * ring.overflow_chunk_size as usize;
debug_assert!(
overflow_len <= ring.overflow_chunk_size as usize,
"args exceed inline + overflow capacity"
);
// SAFETY: overflow region is partitioned per slot; this slot's
// chunk is disjoint from all other slots.
unsafe {
core::ptr::copy_nonoverlapping(
request.args_ptr.add(inline_copy),
ring.overflow_base.add(overflow_offset),
overflow_len,
);
}
entry.flags |= T1_CMD_OVERFLOW;
entry.arg_overflow_offset = overflow_offset as u32;
}
}
entry
}
}
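The consumer-side counterpart of `from_request` reassembles the serialized arguments from the inline `args` plus the slot's overflow chunk. A hypothetical sketch (the real consumer loop is IDL-generated; `Entry` here carries only the fields used, and safe slices stand in for the shared-memory pointers):

```rust
const INLINE: usize = 64; // DEFAULT_INLINE_ARG_SIZE
const T1_CMD_OVERFLOW: u32 = 0x04;

// Minimal stand-in for the command-entry fields this sketch needs.
struct Entry {
    flags: u32,
    arg_len: u32,
    arg_overflow_offset: u32,
    args: [u8; INLINE],
}

/// Rebuild the full serialized argument buffer for one slot:
/// inline bytes first, then the slot's overflow chunk if flagged.
fn gather_args(entry: &Entry, overflow_region: &[u8]) -> Vec<u8> {
    let total = entry.arg_len as usize;
    let inline = total.min(INLINE);
    let mut out = Vec::with_capacity(total);
    out.extend_from_slice(&entry.args[..inline]);
    if entry.flags & T1_CMD_OVERFLOW != 0 {
        let off = entry.arg_overflow_offset as usize;
        out.extend_from_slice(&overflow_region[off..off + (total - inline)]);
    }
    out
}

fn main() {
    // Producer wrote 100 bytes: 64 inline, 36 into this slot's chunk.
    let mut args = [0u8; INLINE];
    for (i, b) in args.iter_mut().enumerate() { *b = i as u8; }
    let entry = Entry { flags: T1_CMD_OVERFLOW, arg_len: 100, arg_overflow_offset: 0, args };
    let overflow: Vec<u8> = (64u8..100).collect();
    let full = gather_args(&entry, &overflow);
    assert_eq!(full.len(), 100);
    assert_eq!((full[63], full[64], full[99]), (63, 64, 99));
}
```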
12.8.2 kabi_call! Macro Specification¶
kabi_call! is the primary transport abstraction for all cross-module invocations. Every
caller in every tier uses kabi_call! — it is the single entry point to the domain model.
Callers never specify the transport directly; the handle carries the transport decision
made at bind time.
12.8.2.1 Macro Signature¶
/// Invoke a KABI service method through a cached transport handle.
///
/// Expands to either a direct vtable call (~2-5 cycles) or a ring buffer
/// submission (~200-500 cycles individual, ~23-80 amortized) depending on
/// the handle's transport, which was determined at bind time.
///
/// # Arguments
///
/// - `$handle`: `&KabiHandle<S>` — cached handle obtained from
/// `DomainService::resolve()` at module initialization time.
/// - `$method`: identifier — the vtable method name, matching the `.kabi`
/// IDL method declaration.
/// - `$($arg),*`: method arguments, matching the IDL signature.
///
/// # Returns
///
/// `Result<$ReturnType, KabiError>` — the method's return type wrapped in
/// `Result`. Direct calls propagate the vtable method's return value.
/// Ring calls may additionally return `KabiError::QueueFull`,
/// `KabiError::DomainCrashed`, or `KabiError::Timeout`.
///
/// # Relationship to `kabi_dispatch_with_vcap()`
///
/// `kabi_call!` is the **transport-layer dispatch** macro — it handles
/// the generation check and same-domain vs cross-domain transport
/// decision. It does NOT perform capability validation.
///
/// `kabi_dispatch_with_vcap()` ([Section 12.3](#kabi-bilateral-capability-exchange))
/// is the **capability-validated dispatch** function — it performs the
/// full CapValidationToken validation (RCU lock, global/cred/driver
/// generation checks, permission checks, SystemCaps checks) and then
/// calls `dispatch_to_domain()` for the actual transport.
///
/// The two are layered: `kabi_dispatch_with_vcap()` is the high-level
/// entry point that validates capabilities and then delegates to
/// `dispatch_to_domain()`. `kabi_call!` is the low-level transport
/// abstraction used by code that already has validated handles.
/// Module-to-module calls within the same domain use `kabi_call!`
/// directly (capabilities were validated at bind time). External
/// callers (e.g., syscall dispatch) use `kabi_dispatch_with_vcap()`.
///
/// # Transport branch cost
///
/// The branch on `handle.transport` is a single predicted conditional.
/// In steady state the branch predictor learns the pattern (virtually all
/// handles remain on the same transport for their lifetime), so the
/// effective cost is 0-1 cycles for the branch plus the transport cost.
///
/// # Result flattening
///
/// Vtable methods may return `Result<T, i32>` (driver error as negated errno).
/// The `kabi_call!` macro uses the `IntoKabiResult` trait to flatten the
/// return value into `Result<T, KabiError>`:
///
/// - If the vtable method returns a bare value `T`, `IntoKabiResult` wraps it
/// in `Ok(T)`.
/// - If the vtable method returns `Result<T, i32>`, `IntoKabiResult` maps
/// `Err(errno)` to `Err(KabiError::DriverError(errno))` and `Ok(v)` to
/// `Ok(v)`, producing `Result<T, KabiError>` — NOT `Result<Result<T,i32>, KabiError>`.
///
/// This is critical: without flattening, `kabi_call!(h, read_page, ...)?`
/// would yield `Result<PageHandle, i32>` instead of `PageHandle`, requiring
/// the caller to double-unwrap.
///
/// # Example
///
/// ```rust
/// let page = kabi_call!(block_handle, read_page, dev_id, offset)?;
/// // `page` is `PageHandle`, not `Result<PageHandle, i32>`
/// ```
macro_rules! kabi_call {
($handle:expr, $method:ident $(, $arg:expr)*) => {{
let __handle: &KabiHandle<_> = $handle;
match __handle.transport {
KabiTransport::Direct { vtable, ctx } => {
// Same domain: direct vtable dispatch.
// Cost: ~2-5 cycles (indirect call through function pointer).
//
// SAFETY: vtable pointer validity is guaranteed by the domain
// service — it was resolved from a live module's registration.
// ctx is the module's opaque context, valid for the module's
// lifetime. The generation check below detects stale handles
// from crashed/unloaded modules.
// SAFETY: generation pointer is allocated separately
// (Box::into_raw) and outlives all handles. Dereference
// is safe while the module is loaded or crash-recovering.
let gen = unsafe { (*__handle.generation).load(Ordering::Acquire) };
if gen != __handle.cached_generation {
Err(KabiError::StaleHandle)
} else {
// Check if the component is quiescing (live evolution
// Phase A'). When quiescing == 1, the vtable is about
// to be swapped — direct calls must be intercepted to
// prevent racing with the state export.
let header = vtable as *const VtableHeader;
if unsafe { (*header).quiescing.load(Ordering::Acquire) } != 0 {
// Component is quiescing. Yield a retryable error as the
// macro's expression value (NOT a `return`, which would
// exit the caller's enclosing function and break
// `kabi_call!` in functions with other return types).
// The caller backs off and retries after the evolution
// completes (generation will change).
Err(KabiError::ComponentQuiescing)
} else {
// SAFETY: vtable and ctx are valid for the module's
// loaded lifetime, guaranteed by the domain service.
//
// The vtable pointer is type-erased (*const ()).
// Cast to the concrete vtable type via the KabiService
// trait's associated type. S is the service type
// parameter on KabiHandle<S>.
//
// NOTE: No runtime vtable bounds check or null fn
// pointer check is performed on the Direct path.
// This is by design: the Direct path relies on
// compile-time type safety (the KabiService trait
// ensures type correctness) and load-time validation
// (the module loader validates vtable method count and
// non-null fn pointers at load time). Per-call runtime
// checks would add ~5-10 cycles to every Direct call,
// negating the performance advantage over Ring dispatch.
let typed_vtable = vtable as *const <S as KabiService>::VTable;
let raw_result = unsafe {
((*typed_vtable).$method)(ctx $(, $arg)*)
};
// Flatten: if raw_result is Result<T, i32>, convert to
// Result<T, KabiError>. If raw_result is bare T, wrap
// in Ok(T). The IntoKabiResult trait handles both.
IntoKabiResult::into_kabi_result(raw_result)
}
}
}
KabiTransport::Ring { ref ring } => {
// Generation check — same as Direct path. Detect crashed/
// replaced domains BEFORE ring submission (~3-5 cycles)
// instead of waiting for the 5-second completion timeout.
// SAFETY: generation pointer is allocated separately
// (Box::into_raw) and outlives all handles.
let gen = unsafe { (*__handle.generation).load(Ordering::Acquire) };
if gen != __handle.cached_generation {
return Err(KabiError::StaleHandle);
}
// Different domain: serialize → submit → wait → deserialize.
// Cost: ~200-500 cycles individual. With batch amortization
// at N>=12 concurrent operations per domain-switch cycle,
// the amortized domain-switch cost drops to ~23 cycles.
// Total per-op cost at N=12 is ~50-70 cycles (domain switch
// amortized + serialization ~15-20cy + ring CAS ~5-10cy).
// This is much better than microkernel IPC (~1000+ cycles)
// but higher than Linux's direct function call (~2-5 cycles).
// The NEGATIVE overhead target is achieved by OTHER UmkaOS
// optimizations (CpuLocal registers, lock-free structures,
// per-CPU batching, cache-friendly layouts) that save
// more cycles than the ring dispatch adds. See
// [Section 3.4](03-concurrency.md#cumulative-performance-budget) for the full accounting.
//
// Serialization is IDL-generated: the kabi-gen compiler
// produces a method_index constant and a serialize_args
// function for each method in the .kabi definition.
// Serialize arguments inline into the command entry.
// The IDL compiler generates serialize_args_inline() which
// writes directly into the entry's `args` field, avoiding
// the shared-arg-buffer concurrent-corruption bug.
let method_index = <_ as KabiMethodIndex>::$method();
let cookie = ring.next_cookie();
let mut cmd = T1CommandEntry {
method_index,
flags: 0,
arg_len: 0,
arg_overflow_offset: 0,
cookie,
args: [0u8; DEFAULT_INLINE_ARG_SIZE],
_reserved: [0u8; 32],
};
// serialize_args_inline writes directly into cmd.args and
// returns the total serialized length. If the arguments
// exceed DEFAULT_INLINE_ARG_SIZE, the overflow portion is
// handled inside ring.submit() which copies to the slot's
// dedicated overflow chunk.
cmd.arg_len = serialize_args_inline(
&mut cmd.args $(, &$arg)*
) as u32;
// submit() returns the cookie for completion matching.
// The cookie is embedded in the command entry and echoed
// in the completion entry by the consumer.
ring.submit(cmd).map_err(|e| match e {
RingError::Full => KabiError::QueueFull,
RingError::Disconnected => KabiError::DomainCrashed,
_ => KabiError::InternalError,
})?;
let completion = ring.wait_completion(cookie,
__handle.timeout_ns)
.map_err(|e| match e {
RingError::Disconnected => KabiError::DomainCrashed,
RingError::Timeout => KabiError::Timeout,
// Full and Overloaded are never returned by
// wait_completion (only by submit/enqueue).
_ => KabiError::InternalError,
})?;
deserialize_result::<_>(&completion)
}
}
}};
}
12.8.2.1.1 IntoKabiResult Trait and KabiService Trait¶
The IntoKabiResult trait enables kabi_call! to handle both bare-value and
Result-returning vtable methods uniformly.
ABI boundary handling: Vtable methods are extern "C" functions. Rust's
Result<T, i32> does not have a stable C ABI representation. The KABI IDL
compiler handles this by generating extern "C" wrapper functions that return
an i64 encoding:
- Success: return value >= 0 (the result, cast or encoded per method)
- Error: return value < 0 (negated errno)
For methods returning complex types (structs), the return is via an out-pointer
parameter (*mut T) and the extern "C" function returns i32 (0 success or
negative errno). The IntoKabiResult trait operates on the CALLER side of the
boundary (after the extern "C" call returns), converting the C-ABI encoded
result back into Result<T, KabiError>.
For the Ring path, this encoding is not needed — arguments and results are
serialized into the ring entry's inline buffer by the IDL-generated
serialize_args_inline() / deserialize_result() functions. The serialization
format uses a length-prefixed byte buffer with explicit error codes.
For the Direct path, the extern "C" vtable function returns the C-ABI
encoded result, and the IntoKabiResult impl decodes it.
/// Trait for converting vtable method return values into `Result<T, KabiError>`.
///
/// Implemented for:
/// - `Result<T, i32>`: maps `Err(errno)` to `KabiError::DriverError(errno)`.
/// - Bare `T` (via the `KabiOk<T>` newtype wrapper, below): wraps in `Ok(T)`.
///
/// The IDL compiler generates vtable method signatures that return either
/// `Result<T, i32>` (for fallible operations) or bare `T` (for infallible
/// operations). The `kabi_call!` macro uses this trait to produce a uniform
/// `Result<T, KabiError>` regardless of which form the method uses.
pub trait IntoKabiResult {
type Output;
fn into_kabi_result(self) -> Result<Self::Output, KabiError>;
}
/// Fallible vtable methods: `Result<T, i32>` → `Result<T, KabiError>`.
impl<T> IntoKabiResult for Result<T, i32> {
type Output = T;
fn into_kabi_result(self) -> Result<T, KabiError> {
self.map_err(KabiError::DriverError)
}
}
/// Infallible vtable methods: bare `T` → `Ok(T)`.
/// This is a **newtype wrapper impl** (not a blanket impl). The IDL
/// compiler generates `KabiOk<T>(T)` for infallible methods to avoid
/// coherence conflicts with the `Result<T, i32>` impl above. A true
/// blanket `impl<T> IntoKabiResult for T` would conflict with the
/// `Result<T, i32>` impl (both cover `Result<T, i32>`).
impl<T> IntoKabiResult for KabiOk<T> {
type Output = T;
fn into_kabi_result(self) -> Result<T, KabiError> { Ok(self.0) }
}
/// Newtype wrapper for infallible vtable method returns.
/// IDL-generated code wraps bare return values in this type so that
/// `IntoKabiResult` can distinguish them from `Result<T, i32>`.
pub struct KabiOk<T>(pub T);
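A self-contained version of the two impls above (with a stand-in `KabiError`, since the full error enum is defined elsewhere) demonstrates the flattening behavior `kabi_call!` relies on:

```rust
/// Stand-in for the document's KabiError (illustrative only).
#[derive(Debug, PartialEq)]
pub enum KabiError {
    DriverError(i32),
}

pub trait IntoKabiResult {
    type Output;
    fn into_kabi_result(self) -> Result<Self::Output, KabiError>;
}

/// Fallible methods: Result<T, i32> → Result<T, KabiError>.
impl<T> IntoKabiResult for Result<T, i32> {
    type Output = T;
    fn into_kabi_result(self) -> Result<T, KabiError> {
        self.map_err(KabiError::DriverError)
    }
}

/// Infallible methods: the KabiOk<T> newtype avoids a coherence
/// conflict with the Result impl above.
pub struct KabiOk<T>(pub T);

impl<T> IntoKabiResult for KabiOk<T> {
    type Output = T;
    fn into_kabi_result(self) -> Result<T, KabiError> {
        Ok(self.0)
    }
}
```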
The KabiService trait defines the associated vtable type, enabling the
kabi_call! macro to cast the type-erased *const () to the concrete
vtable struct:
/// Trait implemented by IDL-generated service types.
///
/// Provides the associated vtable type and service identity so that
/// `kabi_call!` can safely cast the type-erased vtable pointer.
pub trait KabiService {
/// The concrete vtable struct type generated by kabi-gen.
/// Example: `BlockDeviceVTable`, `NetworkDeviceVTable`.
type VTable: 'static;
/// The service identifier used for binding and discovery.
const SERVICE_ID: ServiceId;
}
/// Trait mapping vtable method names to their ordinal indices.
///
/// IDL-generated: the `kabi-gen` compiler generates one implementation per
/// service type. Each method name maps to a `u32` ordinal that corresponds
/// to the method's position in the vtable's function pointer array (after
/// the `VtableHeader`). The `kabi_call!` Ring path uses this trait to
/// serialize the method index into `T1CommandEntry.method_index`.
///
/// Example (generated by kabi-gen for `BlockDevice`):
/// ```rust
/// impl KabiMethodIndex for BlockDeviceService {
/// fn read_page() -> u32 { 0 }
/// fn write_page() -> u32 { 1 }
/// fn flush() -> u32 { 2 }
/// // ...
/// }
/// ```
///
/// The trait itself has no associated functions — method names are passed
/// as identifiers to the `kabi_call!` macro, which calls
/// `<_ as KabiMethodIndex>::$method()` where `$method` is the method name.
/// The IDL compiler generates a function for each method in the `.kabi`
/// definition. This is a trait-dispatch pattern, not a method-table pattern.
pub trait KabiMethodIndex {
// Methods are generated per service — no fixed associated functions.
// The kabi_call! macro calls <_ as KabiMethodIndex>::method_name()
// where method_name is an identifier from the .kabi file.
}
/// Result buffer for vtable dispatch return values.
///
/// The consumer loop's `vtable_dispatch()` returns a `ResultBuffer` on success,
/// which the consumer serializes into the completion ring. The buffer holds
/// the method's return value in serialized form.
// kernel-internal, not KABI
pub struct ResultBuffer {
/// Pointer to the serialized result data.
data: *const u8,
/// Length of the serialized result in bytes.
len: usize,
/// Byte offset into the result region (for completion entry).
offset: usize,
}
impl ResultBuffer {
/// Empty result (for void-returning methods or error paths).
pub fn empty() -> Self {
Self { data: core::ptr::null(), len: 0, offset: 0 }
}
/// Result data length.
pub fn len(&self) -> usize { self.len }
/// Byte offset for T1CompletionEntry.result_offset.
pub fn offset(&self) -> usize { self.offset }
}
12.8.2.2 Asynchronous Variant¶
kabi_call_async! submits to the ring without waiting for a completion. It returns
a KabiCookie that the caller can poll or await later. Direct-transport handles
execute synchronously and return an immediately-ready cookie (direct calls cannot
be deferred — the vtable call blocks the caller).
/// Submit a KABI method call without blocking for the result.
///
/// Returns a `KabiCookie` that can be passed to `kabi_poll!` or
/// `kabi_await!` to retrieve the result later.
///
/// On Direct transport: executes synchronously and wraps the result
/// in an immediately-ready cookie. The async variant adds ~3 cycles
/// of cookie wrapping overhead on the direct path — use `kabi_call!`
/// when the result is needed immediately.
///
/// On Ring transport: enqueues the command and returns without waiting.
/// The caller may submit additional calls before polling — this is the
/// mechanism that enables natural batching. N sequential `kabi_call_async!`
/// calls followed by N `kabi_await!` calls amortize the domain switch
/// over N operations.
macro_rules! kabi_call_async {
($handle:expr, $method:ident $(, $arg:expr)*) => {{
let __handle: &KabiHandle<_> = $handle;
match __handle.transport {
KabiTransport::Direct { vtable, ctx } => {
let gen = unsafe { (*__handle.generation).load(Ordering::Acquire) };
if gen != __handle.cached_generation {
KabiCookie::error(KabiError::StaleHandle)
} else {
                    // KABI-10 fix: Check if the component is quiescing
                    // (live evolution Phase A'). Same check as the
                    // synchronous kabi_call! Direct path. Without this,
                    // an async Direct caller would bypass the quiescence
                    // barrier and dispatch through the old vtable, racing
                    // with state export.
                    let header = vtable as *const VtableHeader;
                    if unsafe { (*header).quiescing.load(Ordering::Acquire) } != 0 {
                        // Expression position (no `return`): this macro
                        // must evaluate to a `KabiCookie`, not return one
                        // from the enclosing function.
                        KabiCookie::error(KabiError::ComponentQuiescing)
                    } else {
                        // Cast to typed vtable (same as sync kabi_call! Direct path).
                        let typed_vtable = vtable as *const <S as KabiService>::VTable;
                        let raw_result = unsafe {
                            ((*typed_vtable).$method)(ctx $(, $arg)*)
                        };
                        KabiCookie::ready(IntoKabiResult::into_kabi_result(raw_result))
                    }
}
}
KabiTransport::Ring { ref ring } => {
                // Generation check — same as Direct path. Detect crashed/
                // replaced domains BEFORE ring submission. Expression
                // position (no `return`): this macro must evaluate to a
                // `KabiCookie`, not return one from the enclosing function.
                let gen = unsafe { (*__handle.generation).load(Ordering::Acquire) };
                if gen != __handle.cached_generation {
                    KabiCookie::error(KabiError::StaleHandle)
                } else {
                    let method_index = <_ as KabiMethodIndex>::$method();
                    let cookie = ring.next_cookie();
                    let mut cmd = T1CommandEntry {
                        method_index,
                        flags: T1_CMD_BATCH, // defer doorbell
                        arg_len: 0,
                        arg_overflow_offset: 0,
                        cookie,
                        args: [0u8; DEFAULT_INLINE_ARG_SIZE],
                        _reserved: [0u8; 32],
                    };
                    cmd.arg_len = serialize_args_inline(
                        &mut cmd.args $(, &$arg)*
                    ) as u32;
                    match ring.submit(cmd) {
                        Ok(_) => KabiCookie::pending(
                            ring, cookie,
                            __handle.generation,
                            __handle.cached_generation,
                        ),
                        Err(RingError::Disconnected) => {
                            KabiCookie::error(KabiError::DomainCrashed)
                        }
                        Err(_) => KabiCookie::error(KabiError::QueueFull),
                    }
                }
}
}
}};
}
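The submit-then-await batching this enables can be illustrated with a toy model (illustrative only — `ToyRing` is not the real `CrossDomainRing`) that counts domain switches and applies the document's 280-cycle round-trip figure:

```rust
/// Toy stand-in for the ring: the "domain switch" happens once per
/// drain, however many commands were queued before it.
struct ToyRing {
    queued: Vec<u64>,
    domain_switches: u32,
}

impl ToyRing {
    fn new() -> Self {
        Self { queued: Vec::new(), domain_switches: 0 }
    }
    /// Models kabi_call_async!: enqueue without waiting.
    fn submit(&mut self, cookie: u64) {
        self.queued.push(cookie);
    }
    /// Models the consumer draining the batch in one domain-active
    /// period; returns the number of operations served by that switch.
    fn drain(&mut self) -> usize {
        self.domain_switches += 1;
        let n = self.queued.len();
        self.queued.clear();
        n
    }
}

/// Amortized domain-switch cost per operation (integer cycles),
/// using the 280-cycle round-trip figure from this chapter.
fn amortized_switch_cycles(batch: u32) -> u32 {
    280 / batch
}
```

At a batch of 12, one 280-cycle switch serves 12 operations, giving the ~23 cycles/op amortized figure quoted in the consumer-loop documentation.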
/// Cookie representing a pending or completed KABI call.
// kernel-internal, not KABI
pub struct KabiCookie<R> {
state: KabiCookieState<R>,
}
enum KabiCookieState<R> {
/// Result is already available (direct transport or error).
Ready(Result<R, KabiError>),
/// Waiting for ring completion.
Pending {
ring: *const CrossDomainRing,
/// Cookie value from the submitted T1CommandEntry. The consumer
/// echoes this in T1CompletionEntry.cookie. `wait_completion()`
/// searches for a completion entry matching this cookie value.
/// Note: this is the cookie, NOT the ring head sequence number —
/// they diverge after any failed submit (cookie is consumed by
/// `next_cookie()` but head is not advanced on submit failure).
cookie: u64,
/// Generation counter pointer from the handle that created this
/// cookie. Checked before dereferencing `ring` to detect domain
/// crashes that may have freed the ring memory.
generation: *const AtomicU64,
/// Generation value at cookie creation time.
cached_generation: u64,
},
}
impl<R> KabiCookie<R> {
/// Block until the result is available.
///
/// For `Ready` cookies: returns immediately (zero additional cost).
/// For `Pending` cookies: checks the generation counter before
/// dereferencing the ring pointer. If the generation has changed
/// (indicating the domain crashed and the ring may have been freed),
/// returns `DomainCrashed` without touching the ring pointer.
pub fn wait(self, timeout_ns: u64) -> Result<R, KabiError> {
match self.state {
KabiCookieState::Ready(result) => result,
KabiCookieState::Pending {
ring, cookie, generation, cached_generation,
} => {
// Check generation BEFORE dereferencing ring. If the
// domain crashed between cookie creation and wait(), the
// ring memory may have been freed by teardown_cross_domain_ring.
// The generation counter is separately allocated (Box)
// and outlives the ring, so it is safe to dereference.
//
// SAFETY: generation pointer is a Box-allocated AtomicU64
// that outlives all handles and cookies (freed only when
// the service is permanently removed).
let current_gen = unsafe {
(*generation).load(Ordering::Acquire)
};
if current_gen != cached_generation {
return Err(KabiError::DomainCrashed);
}
// SAFETY: ring pointer is valid because the generation
// check above confirmed the domain has not crashed.
// The ring memory is freed only after the generation is
// incremented (teardown_cross_domain_ring step 1).
let ring_ref = unsafe { &*ring };
let completion = ring_ref.wait_completion(cookie, timeout_ns)
.map_err(|e| match e {
RingError::Disconnected => KabiError::DomainCrashed,
RingError::Timeout => KabiError::Timeout,
// Full and Overloaded are never returned by
// wait_completion (only by submit/enqueue).
_ => KabiError::InternalError,
})?;
deserialize_result::<R>(&completion)
}
}
}
fn ready(r: Result<R, KabiError>) -> Self {
Self { state: KabiCookieState::Ready(r) }
}
fn error(e: KabiError) -> Self {
Self { state: KabiCookieState::Ready(Err(e)) }
}
    /// Construct a pending cookie for a cross-domain ring submission.
    /// The `cookie` parameter carries the T1 command cookie from
    /// `next_cookie()`; it initializes `KabiCookieState::Pending::cookie`
    /// via field-init shorthand (the parameter name matches the field).
fn pending(
ring: &CrossDomainRing,
cookie: u64,
generation: *const AtomicU64,
cached_generation: u64,
) -> Self {
Self {
state: KabiCookieState::Pending {
ring: ring as *const CrossDomainRing,
cookie,
generation,
cached_generation,
},
}
}
}
12.8.2.3 Handle Types¶
/// Cached transport handle for a KABI service binding.
///
/// Created by `DomainService::resolve()` at module initialization time.
/// Stored in the module's state struct, reused for all calls to the bound
/// service. No per-call lookup — the transport decision is cached once.
///
/// The generic parameter `S` is a marker type generated by the IDL compiler
/// for each service (e.g., `BlockDeviceService`, `CryptoService`). It
/// provides compile-time type safety: a `KabiHandle<BlockDeviceService>`
/// cannot be passed where a `KabiHandle<CryptoService>` is expected.
///
/// **Lifetime**: A handle is valid from `resolve()` until the module is
/// unloaded or the bound service migrates to a different domain. Migration
/// triggers atomic rebinding (see "Rebinding on Promotion/Demotion" below).
///
/// **Stale detection**: `cached_generation` is compared against the live
/// `generation` counter on every call. A mismatch indicates the target
/// module has crashed, been reloaded, or migrated — the caller receives
/// `KabiError::StaleHandle` and must re-resolve from the domain service.
// kernel-internal, not KABI
pub struct KabiHandle<S: KabiService> {
/// The resolved transport for this binding.
pub transport: KabiTransport,
/// Generation counter at the time this handle was created.
/// Compared against the live generation on every dispatch.
pub cached_generation: u64,
/// Pointer to the live generation counter. Allocated separately
/// from the registry entry (via `Box::into_raw`) to ensure a stable
/// address independent of XArray node reallocation. Read with
/// `Acquire` ordering on every `kabi_call!` dispatch.
///
/// # Safety
///
/// The pointed-to `AtomicU64` is allocated at service registration
/// and freed only when the service is permanently removed (after all
/// handles have been invalidated by a generation bump). The domain
/// service's crash handler increments the generation (making all
/// handles stale) before any structural change that could invalidate
/// the pointer.
pub generation: *const AtomicU64,
    // SAFETY: vtable/ctx/generation pointers are valid for the module's
    // loaded lifetime. The generation pointer outlives all handles
    // because it is separately allocated and reference-counted by the
    // global registry. The handle is stored in SpinLock-protected
    // module state and accessed from multiple threads. On that basis,
    // Send/Sync are implemented manually, outside this struct body:
    //   unsafe impl<S: KabiService> Send for KabiHandle<S> {}
    //   unsafe impl<S: KabiService> Sync for KabiHandle<S> {}
/// Default timeout for ring completions (nanoseconds).
/// Set from the service's QoS policy at bind time.
/// Direct calls ignore this field.
pub timeout_ns: u64,
/// Type marker for compile-time service discrimination.
_service: core::marker::PhantomData<S>,
}
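The stale-detection mechanism described above reduces to comparing a cached snapshot against a live atomic counter. A minimal sketch using std atomics (the helper name is hypothetical):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Returns true when the handle's cached generation still matches the
/// live counter — i.e., the target module has not crashed, been
/// reloaded, or migrated since the handle was resolved.
fn handle_is_live(live: &AtomicU64, cached: u64) -> bool {
    live.load(Ordering::Acquire) == cached
}
```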
/// Type-erased KABI handle for storage in module descriptor arrays.
///
/// `KabiHandle<S>` is generic over the service type `S`, which provides
/// compile-time type safety at call sites. However, the module descriptor
/// must store handles for heterogeneous services in a single `ArrayVec`.
/// `KabiHandleOpaque` wraps the transport + generation without the type
/// parameter, enabling uniform storage.
///
/// Modules retrieve typed handles via `get::<S>(service_id)` which
/// performs a runtime `service_id` check and returns `KabiHandle<S>`.
// kernel-internal, not KABI
pub struct KabiHandleOpaque {
/// The resolved transport for this binding.
pub transport: KabiTransport,
/// Generation counter at the time this handle was created.
pub cached_generation: u64,
/// Pointer to the live generation counter (same as KabiHandle).
pub generation: *const AtomicU64,
/// Default timeout for ring completions (nanoseconds).
pub timeout_ns: u64,
/// Service ID this handle is bound to. Used for runtime type
/// checking when converting back to `KabiHandle<S>`.
pub service_id: ServiceId,
}
impl KabiHandleOpaque {
/// Convert to a typed `KabiHandle<S>` with a runtime service ID check.
///
/// Returns `None` if the stored `service_id` does not match `S`'s
/// service ID. This prevents type confusion when retrieving handles
/// from the opaque storage.
pub fn typed<S: KabiService>(&self) -> Option<KabiHandle<S>> {
if self.service_id != S::SERVICE_ID {
return None;
}
Some(KabiHandle {
transport: self.transport.clone(),
cached_generation: self.cached_generation,
generation: self.generation,
timeout_ns: self.timeout_ns,
_service: PhantomData,
})
}
}
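The runtime service-ID check in `typed()` can be modeled with plain IDs (toy types standing in for the real `ServiceId` and handle structs):

```rust
type ServiceId = u32;

trait KabiService {
    const SERVICE_ID: ServiceId;
}

// Two toy services with distinct IDs (values are illustrative).
struct BlockDeviceService;
impl KabiService for BlockDeviceService {
    const SERVICE_ID: ServiceId = 1;
}

struct CryptoService;
impl KabiService for CryptoService {
    const SERVICE_ID: ServiceId = 2;
}

/// Toy opaque handle: stores only the bound service's ID.
struct Opaque {
    service_id: ServiceId,
}

impl Opaque {
    /// Conversion succeeds only when the stored ID matches the
    /// requested service's ID — type confusion is rejected at runtime.
    fn typed<S: KabiService>(&self) -> Option<ServiceId> {
        if self.service_id != S::SERVICE_ID {
            return None;
        }
        Some(self.service_id)
    }
}
```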
/// Transport discriminant for a KABI handle.
///
/// `Direct` and `Ring` are the only two variants. This is the ONLY
/// branch in the entire dispatch path — and it is predicted after the
/// first call (handles almost never change transport during their lifetime).
// kernel-internal, not KABI
pub enum KabiTransport {
/// Same domain: direct vtable dispatch.
///
/// Both `vtable` and `ctx` are raw pointers into the target module's
/// memory region. For Tier 0 modules (Core domain), these point into
/// kernel address space. For Tier 1 same-domain modules, they point
/// into the shared isolation domain.
///
/// # Safety
///
/// `vtable` must point to a valid vtable struct generated by kabi-gen.
/// `ctx` must be the opaque context pointer returned by the module's
/// entry function. Both are valid for the module's loaded lifetime.
Direct {
vtable: *const (),
ctx: *mut (),
},
/// Different domain: ring buffer submission.
///
/// The `CrossDomainRing` contains the shared memory ring, doorbell,
/// and completion wait mechanism. Created by `setup_cross_domain_ring()`
/// during the binding phase.
    /// `Arc<CrossDomainRing>` rather than `Box`, since multiple handles may
    /// share the same ring (e.g., multiple callers bound to the same
    /// provider). Either way, the pointer indirection keeps the `Ring`
    /// variant pointer-sized — storing the ring inline would inflate the
    /// enum from the `Direct` variant's 16 bytes to ~140+ bytes.
Ring {
ring: Arc<CrossDomainRing>,
},
}
12.8.3 IDL-Generated Consumer Loop¶
Every non-Tier-0 domain runs a consumer loop for each inbound ring. The consumer
loop is the ring-side counterpart to kabi_call!: it dequeues serialized requests,
dispatches them through the local vtable, and posts completions. The IDL compiler
generates this loop from the .kabi service definition — driver authors implement
only the vtable methods, never the ring protocol.
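A condensed sketch of that division of labor (the method set, errno values, and toy argument decoding are hypothetical — real kabi-gen reads arguments from the entry's inline buffer): the author writes the method bodies, and the generated code maps `method_index` ordinals onto them with a bounds check.

```rust
/// What the driver author writes: plain method implementations.
fn flush() -> Result<u64, i32> {
    Ok(0)
}

fn discard(lba: u64, count: u32) -> Result<u64, i32> {
    if count == 0 {
        Err(22) // 22 = EINVAL: empty discard range
    } else {
        Ok(lba + count as u64) // toy result: end of the discarded range
    }
}

/// What kabi-gen emits (sketch): bounds-checked ordinal dispatch.
/// An out-of-range ordinal produces an error without calling anything,
/// mirroring the consumer loop's method_index bounds check.
fn vtable_dispatch(method_index: u32, args: &[u64]) -> Result<u64, i32> {
    match method_index {
        0 => flush(),
        1 => discard(args[0], args[1] as u32),
        _ => Err(95), // 95 = EOPNOTSUPP: unknown ordinal, no call made
    }
}
```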
12.8.3.1 Consumer Loop Implementation¶
/// IDL-generated consumer loop for a KABI service ring.
///
/// Runs in the target module's domain (Tier 1 isolated domain or Tier 2
/// process). Processes requests from the inbound command ring, dispatches
/// to the module's vtable implementation, and posts results to the
/// completion ring.
///
/// # Thread model
///
/// One consumer thread per ring. The thread is created by the domain
/// service when the ring is established during module binding. The thread
/// is affinity-bound to the CPU that the ring serves (matching the
/// producer's CPU affinity) to minimize cross-cache-line traffic.
///
/// # Batch processing
///
/// The consumer dequeues ALL available entries before yielding, not one
/// at a time. This naturally amortizes the domain switch cost: if 12
/// entries accumulated while the consumer was sleeping, all 12 are
/// processed in a single domain-active period. At N=12, the amortized
/// domain switch cost is:
///
/// domain_switch_cycles / N = 280 / 12 = ~23 cycles per operation
///
/// Total per-operation amortized cost at N=12 is ~50-70 cycles:
/// - Domain switch (amortized): ~23 cycles
/// - Serialization/deserialization: ~15-20 cycles
/// - Ring CAS + store: ~5-10 cycles
/// - Completion ring read: ~5-10 cycles
///
/// This is significantly more expensive than a Linux direct function
/// call (~2-5 cycles), but far cheaper than microkernel IPC (~1000+
/// cycles). The per-ring-operation overhead is compensated by OTHER
/// UmkaOS optimizations (CpuLocal registers saving ~2-3 cycles per
/// access at millions/sec, lock-free data structures eliminating
/// contention, compile-time lock ordering eliminating runtime lockdep)
/// that together achieve negative total overhead. See
/// [Section 3.4](03-concurrency.md#cumulative-performance-budget) for the full performance
/// accounting.
///
/// The 280-cycle domain switch figure is the round-trip cost (enter
/// domain + exit domain) for x86 MPK WRPKRU×2 (~46 cycles) plus
/// cache-miss overhead for ring head/tail (~200 cycles worst case)
/// plus doorbell IPI (~34 cycles amortized).
///
/// # Arguments
///
/// - `ring`: The `CrossDomainRing` descriptor containing both the inbound
/// command ring (`ring_header`) and the outbound completion ring
/// (`completion_ring_header`). This single parameter replaces the former
/// separate `ring: &DomainRingBuffer, completion_ring: &DomainRingBuffer`
/// parameters, establishing the structural relationship between command
/// and completion rings.
/// - `vtable`: The module's vtable implementation.
/// - `ctx`: The module's opaque context pointer.
/// - `domain_id`: This domain's identifier (written to CpuLocal on entry).
///
/// # Panics
///
/// If the vtable dispatch panics, the consumer loop catches the panic
/// via the domain's fault handler (the panic unwinds into the domain
/// fault trampoline, which posts an error completion and continues —
/// or triggers full domain recovery for unrecoverable faults).
fn kabi_consumer_loop(
ring: &CrossDomainRing,
vtable: *const (),
ctx: *mut (),
domain_id: DomainId,
) -> ! {
// The CrossDomainRing contains both the command ring (ring_header) and
// the completion ring (completion_ring_header) as a paired structure.
// The consumer reads commands from cmd_ring and posts completions
// to completion_ring_header via ring.enqueue_spsc().
//
// SAFETY: Both ring_header and completion_ring_header are initialized
// by setup_cross_domain_ring() and are valid for the ring's lifetime.
let cmd_ring = unsafe { &*ring.ring_header };
// Set CpuLocal.active_domain so the NMI crash handler knows which
// domain this CPU is executing in. See
// [Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path--cpulocalblock-field-inventory).
let cpu_local = arch::current::cpu::cpulocal();
loop {
// Phase 1: Sleep until work is available.
// The doorbell mechanism wakes this thread when the producer
// submits new entries. On x86, this is a monitor/mwait or
// a futex-like wait on the ring's head counter.
cmd_ring.wait_for_entries();
// Phase 1.5: Check ring state before entering the domain.
// After crash recovery Step 2' sets header.state = RING_STATE_DISCONNECTED
// (Release), any consumer waking from wait_for_entries() must observe
// the Disconnected state and exit cleanly. Without this check, the
// consumer would enter the revoked domain (Step 2 already revoked
// hardware permissions), causing a secondary hardware exception.
// The Acquire ordering pairs with the Release in Step 2' to ensure
// visibility on all architectures (including ARM, RISC-V).
if cmd_ring.state.load(Ordering::Acquire) == RING_STATE_DISCONNECTED {
// Domain is being torn down or has crashed. This kthread
// exits cleanly. We are in Phase 1 (Core domain): active_domain
// is still 0 and domain_valid is still 0 — no domain switch
// cleanup needed.
kthread_exit(); // -> ! (never returns)
}
// Phase 2: Enter domain. First switch the hardware isolation
// register (WRPKRU on x86, MSR POR_EL0 on AArch64, DACR on ARMv7,
// etc.), then set active_domain. The hardware switch MUST precede
// the active_domain store so that the NMI crash handler never sees
// a stale active_domain that lags behind the actual hardware state
// (documented invariant in CpuLocalBlock.active_domain).
//
// **Benign race window** (1-2 instructions): Between switch_domain()
// and active_domain.store(), if an NMI arrives the NMI handler sees
// `active_domain == 0` (Core) while the hardware is already in the
// driver domain. The handler takes no action (domain != revoked_id).
// If the domain has been revoked during this window, the CPU faults
// on the next instruction (deny-all memory), enters the exception
// handler with `active_domain == 0`, and panics as a false Core
// panic rather than triggering clean domain recovery. This window is
// ~1ns and requires simultaneous NMI + domain revocation — an
// astronomically unlikely event. The alternative ordering (store
// active_domain FIRST) is worse: the NMI handler would redirect a
// CPU whose hardware isolation has not yet switched, corrupting Core
// domain execution.
arch::current::isolation::switch_domain(domain_id);
cpu_local.active_domain.store(domain_id, Ordering::Relaxed);
// Mark this CPU's domain as valid for trampoline TOCTOU checks.
// Release ordering: the domain_valid store must be visible to any
// trampoline reader (on this CPU) AFTER the hardware domain switch
// and active_domain store have committed.
cpu_local.domain_valid.store(1, Ordering::Release);
// Phase 3: Drain available entries in a batch, up to a maximum
// batch size. The batch limit prevents unbounded latency when 256+
// entries have accumulated (e.g., burst of I/O completions).
//
// **Batch size limit**: CONSUMER_MAX_BATCH (default: 32 entries).
// After processing CONSUMER_MAX_BATCH entries, the consumer exits
// the domain (Phase 4), calls `cond_resched()` to yield if needed,
// and re-enters for the next batch. This provides a preemption
// point within the batch loop.
//
// **Consumer kthread priority**: The consumer kthread runs under
// SCHED_FIFO priority 1 (lowest RT priority). This ensures it runs
// promptly when entries are available but does not starve CFS tasks.
// For latency-sensitive services (block I/O, network), the kthread
// priority can be elevated to SCHED_FIFO 50 via the domain
// configuration.
const CONSUMER_MAX_BATCH: u64 = 32;
let published = cmd_ring.published.load(Ordering::Acquire);
let mut tail = cmd_ring.tail.load(Ordering::Relaxed);
let batch_end = core::cmp::min(published, tail + CONSUMER_MAX_BATCH);
while tail < batch_end {
let idx = (tail % cmd_ring.size as u64) as usize;
// SAFETY: idx is within ring bounds (tail < batch_end <= published,
// published <= head, head < tail + size maintained by
// producer). Entry contains serialized request data.
let entry: &T1CommandEntry = unsafe {
cmd_ring.read_entry_as(idx)
};
// Dispatch through vtable. The method_index is bounds-checked
// against vtable_method_count(). If out of bounds, post an error
// completion without calling any vtable function.
// KABI-02 fix: catch_domain_panic<F, R> returns Result<R, KabiError>
// where R = Result<ResultBuffer, KabiError> (vtable_dispatch's return
// type), producing Result<Result<ResultBuffer, KabiError>, KabiError>.
// Flatten with .and_then(|inner| inner) to produce
// Result<ResultBuffer, KabiError>. Both error channels (DriverPanic
// from catch_domain_panic, KabiError from vtable_dispatch) collapse
// into the same Err variant -- both are dispatch failures.
let result: Result<ResultBuffer, KabiError> =
if (entry.method_index as usize) < vtable_method_count(vtable) {
// SAFETY: method_index is bounds-checked above.
// vtable is valid for this module's lifetime.
// ctx is the module's context, valid while loaded.
// Arguments are in the entry's inline args field.
catch_domain_panic(|| unsafe {
vtable_dispatch(
vtable,
entry.method_index,
&entry.args[..entry.arg_len as usize],
ctx,
)
})
.and_then(|inner| inner) // Flatten Result<Result<..>, ..>
} else {
Err(KabiError::NotSupported)
};
// Post completion to the completion ring.
let completion = T1CompletionEntry {
cookie: entry.cookie,
status: match &result {
Ok(_) => 0,
Err(e) => e.to_errno(),
},
result_len: match &result {
Ok(buf) => buf.len() as u32,
Err(_) => 0,
},
result_offset: match &result {
Ok(buf) => buf.offset() as u32,
Err(_) => 0,
},
_reserved: [0u8; 44],
};
// Completion ring is SPSC (this consumer is the only producer).
// enqueue_spsc writes to ring.completion_ring_header.
// SAFETY: ring is valid for this module's lifetime.
unsafe { ring.enqueue_spsc(&completion) };
tail += 1;
}
// Advance command ring consumer tail after processing the batch.
// Single consumer: a Release store publishes the new tail (no CAS needed).
debug_assert!(cmd_ring.tail.load(Ordering::Relaxed) <= tail);
cmd_ring.tail.store(tail, Ordering::Release);
// Phase 4: Exit domain. Clear domain_valid FIRST (invalidate
// trampoline TOCTOU checks), then clear active_domain, then
// switch hardware isolation back to Core domain. The NMI crash
// handler must not see this CPU as in-domain after the hardware
// switch. The shadow elision mechanism
// ([Section 11.2](11-drivers.md#isolation-mechanisms-and-performance-modes--pkru-write-elision-mandatory))
// may skip the hardware write if already in Core domain.
// **Benign race window** (exit-side mirror of Phase 2 entry):
// Between active_domain.store(0) and switch_domain(CORE), the
// hardware isolation register still has the driver domain, but the
// NMI handler sees Core domain (active_domain == 0). This is less
// severe than the entry-side window because: (a) domain_valid is
// already 0 (trampoline TOCTOU rejects), (b) the CPU is about to
// switch to Core on the very next instruction, (c) even if the
// domain is revoked, the NMI handler sees Core and takes no action,
// and the CPU completes switch_domain(CORE) after NMI return.
cpu_local.domain_valid.store(0, Ordering::Release);
cpu_local.active_domain.store(0, Ordering::Relaxed);
arch::current::isolation::switch_domain(CORE_DOMAIN_ID);
// Phase 5: Completion doorbell signaling is handled inside
// enqueue_spsc() per entry (it signals completion_doorbell
// after each completion write). No separate batch-level
// doorbell needed -- the per-entry signal wakes waiters in
// wait_completion() as soon as their cookie appears.
}
}
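The KABI-02 flattening described in the loop above is ordinary `Result` algebra. A standalone sketch (the `KabiError` variants here are just the two named in the comments):

```rust
// Collapse Result<Result<T, E>, E> into Result<T, E> with and_then, as the
// consumer loop does for catch_domain_panic's nested return value. Both
// error channels end up in the same Err variant.
#[derive(Debug, PartialEq)]
enum KabiError {
    DriverPanic,      // from catch_domain_panic (recoverable panic)
    DriverError(i32), // from vtable_dispatch (driver returned an errno)
}

fn flatten<T>(nested: Result<Result<T, KabiError>, KabiError>) -> Result<T, KabiError> {
    nested.and_then(|inner| inner)
}
```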
/// Terminate the current kthread. Diverging function (`-> !`).
///
/// Sets the current task's state to `TASK_DEAD` and calls `schedule()` in a
/// loop. The task will never be scheduled again; `finish_task_switch()` on the
/// CPU where the dead task last ran observes `TASK_DEAD` and frees the kernel
/// stack and Task struct.
///
/// Used by the consumer loop when the ring transitions to `RING_STATE_DISCONNECTED`
/// (domain teardown or crash recovery). The consumer thread exits cleanly
/// without returning from the `-> !` function.
fn kthread_exit() -> ! {
let task = arch::current::cpu::cpulocal().current_task;
// SAFETY: current_task is always valid while executing.
unsafe { &*task }.state.store(TaskState::DEAD.bits(), Ordering::Release);
loop {
schedule();
}
}
/// Return the number of methods in a vtable.
///
/// IDL-generated code knows the count statically (each vtable struct has
/// a `const METHOD_COUNT: u32` associated constant); across the ABI
/// boundary, however, the count is derived from the header's
/// `vtable_size` field (set by the module at registration time).
/// The consumer loop uses this for bounds-checking before dispatch.
///
/// # Safety
///
/// `vtable` must point to a valid vtable struct whose header is a
/// `VtableHeader`. The pointer is validated at module load time.
unsafe fn vtable_method_count(vtable: *const ()) -> usize {
// SAFETY: caller guarantees vtable points to a valid VtableHeader.
let header = &*(vtable as *const VtableHeader);
// method_count is the number of function pointer slots AFTER the header.
// Computed as (vtable_size - header_size) / pointer_size.
let header_size = core::mem::size_of::<VtableHeader>();
let ptr_size = core::mem::size_of::<*const ()>();
(header.vtable_size as usize - header_size) / ptr_size
}
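The bounds-guard arithmetic above can be exercised in isolation. A minimal sketch, assuming a hypothetical two-field `VtableHeader` (the real header layout is defined by `kabi-gen` and carries more fields):

```rust
// Illustrative model of the method-count derivation. The field set here is
// an assumption based on the chapter text (kabi_version as discriminant,
// vtable_size as bounds guard); only the arithmetic mirrors the real code.
#[repr(C)]
struct VtableHeader {
    kabi_version: u32, // primary version discriminant
    vtable_size: u32,  // total bytes: header + function-pointer slots
}

/// Number of function-pointer slots following the header.
fn method_count(vtable_size: u32) -> usize {
    let header_size = core::mem::size_of::<VtableHeader>();
    let ptr_size = core::mem::size_of::<*const ()>();
    (vtable_size as usize - header_size) / ptr_size
}
```

Because the count is computed from `vtable_size`, an older kernel that only knows a shorter vtable automatically bounds-checks newer drivers' extra slots away.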
/// Dispatch a method call through a vtable by ordinal index.
///
/// IDL-generated: resolves the function pointer at the given index in the
/// vtable's method array (after the VtableHeader) and calls it with the
/// deserialized arguments.
///
/// # Safety
///
/// - `vtable` must point to a valid vtable struct.
/// - `method_index` must be less than `vtable_method_count(vtable)`.
/// - `args` must contain correctly serialized arguments for this method.
/// - `ctx` must be a valid module context pointer.
unsafe fn vtable_dispatch(
vtable: *const (),
method_index: u32,
args: &[u8],
ctx: *mut (),
) -> Result<ResultBuffer, KabiError> {
// Compute the function pointer address. The vtable layout is:
// [VtableHeader][fn_ptr_0][fn_ptr_1]...[fn_ptr_N-1]
let header_size = core::mem::size_of::<VtableHeader>();
let ptr_size = core::mem::size_of::<*const ()>();
let fn_ptr_addr = (vtable as *const u8)
.add(header_size + method_index as usize * ptr_size);
let fn_ptr: *const () = core::ptr::read(fn_ptr_addr as *const *const ());
if fn_ptr.is_null() {
return Err(KabiError::NotSupported);
}
// Transmute to the dispatch function signature. The IDL-generated
// dispatch function deserializes args and calls the actual method.
let dispatch_fn: unsafe extern "C" fn(*mut (), *const u8, u32) -> i32 =
core::mem::transmute(fn_ptr);
let status = dispatch_fn(ctx, args.as_ptr(), args.len() as u32);
if status == 0 {
Ok(ResultBuffer::empty())
} else {
Err(KabiError::DriverError(status))
}
}
/// Catch a recoverable panic from vtable dispatch within the domain.
///
/// UmkaOS does NOT use Rust's `core::panic::catch_unwind` — the kernel
/// is compiled with `panic = "abort"` and has no unwind runtime. Instead,
/// domain fault handling uses a two-path mechanism:
///
/// **Path 1: Recoverable panic (driver calls `panic!()` or triggers a
/// bounds check / arithmetic overflow)**
///
/// The per-domain panic hook (installed by the domain service at domain
/// creation) intercepts the panic before abort:
/// 1. Records the panic message and location in the domain's crash log.
/// 2. Sets a per-CPU `domain_panic_result` flag to `DriverPanic`.
/// 3. Performs a `longjmp`-style return to the `catch_domain_panic`
/// save point (set up via `setjmp` before the vtable dispatch).
///
/// The consumer loop observes the `DriverPanic` result, posts an error
/// completion for this specific request, and continues processing the
/// remaining batch entries.
///
/// **Path 2: Unrecoverable fault (hardware exception, page fault into
/// unmapped memory, double panic)**
///
/// The CPU triggers an exception (e.g., GPF on x86, data abort on ARM).
/// The Tier 0 exception handler inspects `CpuLocal.active_domain`:
/// 1. If `active_domain != 0` (not Core domain): the fault is in an
/// isolated domain. The handler redirects control to the domain
/// fault trampoline ([Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)).
/// 2. The trampoline tears down the entire domain. All pending
/// requests on all rings for this domain receive `DomainCrashed`.
/// 3. The crash recovery path is initiated (module reload from MBS).
///
/// `catch_domain_panic` implements Path 1. Path 2 is handled by the
/// architecture-specific exception handler and is transparent to this code.
///
/// # Safety
///
/// The `setjmp`/`longjmp` mechanism uses architecture-specific save/restore
/// of callee-saved registers. The save point is valid only for the duration
/// of the closure call — it is stack-allocated and goes out of scope when
/// `catch_domain_panic` returns.
fn catch_domain_panic<F, R>(f: F) -> Result<R, KabiError>
where
F: FnOnce() -> R,
{
// Capture the previous save point BEFORE setjmp so it is available on
// both the normal path and the longjmp path (locals written after
// setjmp are not reliably preserved across a longjmp).
let cpu_local = arch::current::cpu::cpulocal();
let prev = cpu_local.domain_panic_jmpbuf.load(Ordering::Relaxed);
// Save return point for recoverable panics.
// SAFETY: jmp_buf is stack-allocated and valid for the duration of
// the closure call. The per-domain panic hook performs the longjmp.
let mut jmp_buf = arch::current::cpu::JmpBuf::new();
if unsafe { arch::current::cpu::setjmp(&mut jmp_buf) } != 0 {
// We got here via longjmp from the panic hook — recoverable panic.
// Restore the previous save point so no dangling pointer to our
// (now dead) jmp_buf remains installed.
cpu_local.domain_panic_jmpbuf.store(prev, Ordering::Release);
return Err(KabiError::DriverPanic);
}
// Install this save point as the current domain's recovery target.
cpu_local.domain_panic_jmpbuf.store(
&mut jmp_buf as *mut _ as *mut (),
Ordering::Release,
);
let result = f();
// Restore previous save point (for nested dispatch, though uncommon).
cpu_local.domain_panic_jmpbuf.store(prev, Ordering::Release);
Ok(result)
}
12.8.3.2 IDL Generation Example¶
Given a .kabi service definition:
@version(1)
service block_device {
fn read_page(dev_id: u32, offset: u64) -> Result<PageHandle, i32>;
fn write_page(dev_id: u32, offset: u64, data: DmaBufferHandle) -> Result<(), i32>;
fn flush(dev_id: u32) -> Result<(), i32>;
}
The IDL compiler generates:
- Method index constants — `const READ_PAGE: u32 = 0; const WRITE_PAGE: u32 = 1; const FLUSH: u32 = 2;`
- Serialization functions — per-method `serialize_args`/`deserialize_result` implementations
- Consumer loop body — a `match entry.method_index { 0 => ..., 1 => ..., 2 => ..., _ => NotSupported }` dispatch
- Producer stubs — `kabi_call!` wrapper functions with typed signatures
- Vtable struct — `BlockDeviceVTable { read_page: fn(...), write_page: fn(...), flush: fn(...) }`
The consumer loop and producer stubs share the same serialization format, generated
from the same IDL source. A mismatch between producer and consumer serialization
is a build-time error (both are generated by the same kabi-gen invocation).
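Because both sides come from one `kabi-gen` invocation, they agree on a single wire encoding. A round-trip sketch for `read_page(dev_id: u32, offset: u64)`, assuming (hypothetically) that fields are packed in declaration order, little-endian, into the entry's inline args buffer:

```rust
use core::convert::TryInto;

// Hypothetical shape of the generated per-method serializers. The packing
// convention shown (declaration order, little-endian, no padding) is an
// illustrative assumption, not the documented wire format.
fn serialize_read_page_args(dev_id: u32, offset: u64, out: &mut [u8]) -> usize {
    out[0..4].copy_from_slice(&dev_id.to_le_bytes());
    out[4..12].copy_from_slice(&offset.to_le_bytes());
    12 // bytes written; stored as `arg_len` in the command entry
}

fn deserialize_read_page_args(args: &[u8]) -> (u32, u64) {
    let dev_id = u32::from_le_bytes(args[0..4].try_into().unwrap());
    let offset = u64::from_le_bytes(args[4..12].try_into().unwrap());
    (dev_id, offset)
}
```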
12.8.4 Bind-Time Transport Selection¶
When a module declares a dependency via the requires clause in its .kabi file
(Section 12.7), the domain service resolves it at module
initialization time. The resolution produces a KabiHandle with the transport decision
cached — no per-call lookup occurs at runtime.
12.8.4.1 Handle Creation¶
impl DomainService {
/// Resolve a service dependency and create a cached transport handle.
///
/// Called during the RESOLVING phase of module initialization (see
/// "Module Hello Protocol" below). The domain service queries the
/// global service registry to find the provider, determines whether
/// the provider is in the same domain (direct) or a different domain
/// (ring), and creates the appropriate handle.
///
/// # Arguments
///
/// - `service_id`: The service to bind to (from the module's `requires` list).
/// - `min_version`: Minimum acceptable version (from `requires ... >= N.M`).
///
/// # Returns
///
/// - `Ok(KabiHandle<S>)` with transport cached for all future calls.
/// - `Err(KabiError::ServiceNotFound)` if no provider exists.
/// - `Err(KabiError::VersionMismatch)` if no compatible version is available.
/// - `Err(KabiError::RingSetupFailed)` if cross-domain ring creation fails
/// (out of memory, domain limit reached).
pub fn resolve<S: KabiService>(
&self,
service_id: &ServiceId,
min_version: KabiVersion,
) -> Result<KabiHandle<S>, KabiError> {
// Type safety check: the caller's service type S must match the
// service_id being resolved. Catches misuse where resolve::<A>()
// is called with B's service_id, which would produce a handle
// that casts the vtable pointer to the wrong type.
debug_assert_eq!(
S::SERVICE_ID, *service_id,
"resolve<S> called with service_id that does not match S::SERVICE_ID"
);
// Step 1: Query global service registry for the provider.
let provider = GLOBAL_SERVICE_REGISTRY.lookup(service_id, min_version)
.ok_or(KabiError::ServiceNotFound)?;
// Step 2: Determine transport based on domain membership.
// Derive a stable pointer to the generation counter. The generation
// is a separately-allocated Box<AtomicU64> inside the RegistryEntry,
// so its address is independent of XArray node reallocation.
let gen_ptr: *const AtomicU64 = &*provider.generation as *const AtomicU64;
if provider.domain_id == self.domain_id {
// Same domain → direct vtable call. No ring needed.
// This covers: Tier 0↔Tier 0, Tier 1↔Tier 1 within same domain.
Ok(KabiHandle {
transport: KabiTransport::Direct {
vtable: provider.vtable_ptr,
ctx: provider.ctx_ptr,
},
cached_generation: provider.generation.load(Ordering::Acquire),
generation: gen_ptr,
timeout_ns: DEFAULT_KABI_TIMEOUT_NS,
_service: PhantomData,
})
} else {
// Different domain → ring buffer transport.
// Step 2a: Create or reuse a cross-domain ring.
let ring = self.setup_or_reuse_ring(
self.domain_id,
provider.domain_id,
service_id,
)?;
Ok(KabiHandle {
transport: KabiTransport::Ring { ring },
cached_generation: provider.generation.load(Ordering::Acquire),
generation: gen_ptr,
timeout_ns: DEFAULT_KABI_TIMEOUT_NS,
_service: PhantomData,
})
}
}
}
/// Default KABI completion timeout: 5 seconds.
///
/// Covers worst-case I/O latency (NVMe: ~10ms typical, ~100ms on queue
/// depth exhaustion; disk: ~100ms typical, ~2s on bad sector retry).
/// Drivers that need longer timeouts (e.g., SCSI tape: 60s) override
/// this via the `timeout_ns` field in the service manifest.
pub const DEFAULT_KABI_TIMEOUT_NS: u64 = 5_000_000_000;
12.8.4.2 Handle Validation and Re-Resolution¶
A KabiHandle becomes stale when the target module crashes, is reloaded, or migrates
to a different domain. Detection is immediate: the generation check in kabi_call!
returns KabiError::StaleHandle on the first call after the event.
Crash recovery handle lifecycle:
- Domain crashes → Step 2 increments `DomainDescriptor.generation`.
- All existing `KabiHandle` instances for that domain have `cached_generation != live generation` → `kabi_call!` returns `KabiError::StaleHandle`.
- Subsystem recovery handlers (registered via `DriverRecoveryEvent` subscription) receive `DriverRecoveryEvent::Reloaded { new_handle }` when Step 9 (RESUME) completes. The `new_handle` is a freshly-resolved `DeviceHandle` with the correct generation. Handlers replace their cached handles:
  - Block I/O: `DeviceIoQueues` replaces its block device handle
  - VFS: `SuperBlock` replaces its filesystem driver handle
  - Networking: `NetDevice` replaces its NIC driver handle
- Generic callers (not subscribed to recovery events) encounter `StaleHandle` on the next `kabi_call!` and must call `rebind()` to re-resolve from the domain service. The domain service returns the new module's handle if the replacement driver is available, or `KabiError::ServiceUnavailable` if recovery is still in progress.
The caller must re-resolve from the domain service:
/// Re-resolve a stale handle. Called when kabi_call! returns StaleHandle.
///
/// This is NOT a hot path — it runs once per crash/migration event per
/// affected handle. The domain service lookup is an XArray get() — O(1).
pub fn rebind<S: KabiService>(
domain_service: &DomainService,
handle: &mut KabiHandle<S>,
service_id: &ServiceId,
min_version: KabiVersion,
) -> Result<(), KabiError> {
let new_handle = domain_service.resolve::<S>(service_id, min_version)?;
*handle = new_handle;
Ok(())
}
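The generation check that produces `StaleHandle`, and the refresh performed by a rebind, can be modeled in a few lines. This is a toy model with simplified stand-in types (the odd/even activity encoding of the real counter is omitted):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Toy model of the generation check kabi_call! performs before dispatch.
#[derive(Debug, PartialEq)]
enum KabiError { StaleHandle }

struct Handle<'a> {
    cached_generation: u64,
    live_generation: &'a AtomicU64, // stable pointer into the registry entry
}

impl<'a> Handle<'a> {
    fn call(&self) -> Result<(), KabiError> {
        if self.live_generation.load(Ordering::Acquire) != self.cached_generation {
            return Err(KabiError::StaleHandle); // caller must rebind
        }
        Ok(()) // ...dispatch through the cached transport...
    }
    fn rebind(&mut self) {
        // Re-resolution refreshes the cached generation from the live counter.
        self.cached_generation = self.live_generation.load(Ordering::Acquire);
    }
}
```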
12.8.5 Module Hello Protocol¶
When a module is loaded into a domain, it follows a defined lifecycle protocol. The protocol is the same regardless of tier — the domain service makes the tier-specific decisions transparently.
12.8.5.1 State Machine¶
| State | Entry Condition | Exit Condition | Duration |
|---|---|---|---|
| Loading | Module binary mapped, relocated | `.init` function returns successfully | ~1-10ms (ELF parsing, relocation) |
| Registering | Init complete | `domain_service.register(manifest)` succeeds | ~10-100us (manifest validation) |
| Resolving | Registration accepted | All `requires` dependencies resolved | 0 (all deps available) to unbounded (waiting for dependency) |
| Ready | All dependencies bound | `domain_service.announce_ready(module_id)` called | ~1us (announcement) |
| Serving | Readiness announced, global registry updated | Unload or crash triggered | Steady state (seconds to years) |
| Draining | Unload/crash signal received | All in-flight requests completed | ~1-100ms (ring drain) |
| Unloading | Drain complete | Module memory freed, domain resources released | ~10-100us (cleanup) |
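The entry/exit conditions above imply a linear transition relation; a sketch (the Serving → Draining edge covers both unload and crash, per the table, and the exact edge set should be treated as illustrative):

```rust
// Sketch of the lifecycle transition relation implied by the state table.
#[derive(Clone, Copy, PartialEq, Debug)]
enum ModuleState { Loading, Registering, Resolving, Ready, Serving, Draining, Unloading }

fn transition_allowed(from: ModuleState, to: ModuleState) -> bool {
    use ModuleState::*;
    matches!(
        (from, to),
        (Loading, Registering)
            | (Registering, Resolving)
            | (Resolving, Ready)
            | (Ready, Serving)
            | (Serving, Draining)   // unload or crash signal received
            | (Draining, Unloading) // all in-flight requests completed
    )
}
```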
12.8.5.2 ModuleManifest and Registration¶
/// Module manifest: declares services provided, dependencies required,
/// interrupt vectors, and resource needs. Populated from the `.kabi`
/// IDL file and the `KabiDriverManifest` ELF section.
///
/// The manifest is validated by the domain service at registration time.
/// Invalid manifests (unknown service IDs, version conflicts, exceeding
/// resource limits) are rejected with a specific error code.
// kernel-internal, not KABI
pub struct ModuleManifest {
/// Unique module identifier (SipHash of name + version).
/// See [Section 11.9](11-drivers.md#crash-recovery-and-state-preservation--module-binary-store-mbs)
/// for the computation.
pub module_id: ModuleId,
/// Human-readable module name (from `KabiDriverManifest.driver_name`).
pub name: ArrayString<64>,
/// Module version (from `KabiDriverManifest.driver_version`).
pub version: u32,
/// Services this module provides. Each entry includes the vtable
/// pointer and the service version range.
/// Bounded: MAX_SERVICES_PER_MODULE = 8. A module providing more than
/// 8 services should be split into multiple modules.
pub provides: ArrayVec<ServiceProvision, MAX_SERVICES_PER_MODULE>,
/// Services this module requires (dependencies).
/// Bounded: MAX_DEPS_PER_MODULE = 16.
pub requires: ArrayVec<ServiceRequirement, MAX_DEPS_PER_MODULE>,
/// Hardware interrupt vectors this module claims.
/// Bounded: MAX_IRQS_PER_MODULE = 32 (covers multi-queue NVMe with
/// 32 MSI-X vectors — the practical maximum for a single device).
pub interrupt_vectors: ArrayVec<u32, MAX_IRQS_PER_MODULE>,
/// Memory requirements (DMA-able, cacheable, total).
pub memory_requirements: MemoryRequirements,
}
pub const MAX_SERVICES_PER_MODULE: usize = 8;
pub const MAX_DEPS_PER_MODULE: usize = 16;
pub const MAX_IRQS_PER_MODULE: usize = 32;
/// A service this module provides.
pub struct ServiceProvision {
/// Service identifier (name + major version).
pub service_id: ServiceId,
/// Minimum version of this service the module implements.
pub min_version: KabiVersion,
/// Maximum version of this service the module implements.
pub max_version: KabiVersion,
/// Pointer to the vtable struct for this service.
/// The vtable is in the module's read-only data segment.
pub vtable_ptr: *const (),
/// Opaque context pointer passed as the first argument to every
/// vtable method. Typically points to the module's per-device state.
pub ctx_ptr: *mut (),
}
/// A service this module requires (dependency).
pub struct ServiceRequirement {
/// Service to bind to.
pub service_id: ServiceId,
/// Minimum acceptable version.
pub min_version: KabiVersion,
/// Whether this dependency is optional. If `true`, the module can
/// enter READY without this dependency resolved — the corresponding
/// KabiHandle will be `None`. If `false` (default), the module stays
/// in RESOLVING until the dependency is satisfied.
pub optional: bool,
}
/// Memory requirements declared in the module manifest.
pub struct MemoryRequirements {
/// Total memory budget in bytes (DMA + cacheable + stack).
pub total_bytes: u64,
/// DMA-capable memory needed (for device ring buffers, descriptors).
pub dma_bytes: u64,
/// Whether the module needs physically contiguous DMA buffers.
pub needs_contiguous_dma: bool,
}
12.8.5.3 Registration and Resolution Sequence¶
impl DomainService {
/// Domain service registration entry point.
///
/// Called by a module during its init function. Validates the manifest,
/// records the module in the domain's module table, and begins dependency
/// resolution.
///
/// # Returns
///
/// - `Ok(ModuleRegistration)` containing the module's assigned ID and
/// a reference to the domain service for subsequent calls.
/// - `Err(KabiError::ManifestInvalid)` if validation fails.
/// - `Err(KabiError::DomainFull)` if the domain's module table is at capacity.
/// - `Err(KabiError::UnauthorizedService)` if a provided service is not
/// declared in the module's signed ELF manifest.
pub fn register(
&self,
manifest: &ModuleManifest,
) -> Result<ModuleRegistration, KabiError> {
// Step 1: Validate manifest integrity.
self.validate_manifest(manifest)?;
// Step 2: Check domain capacity.
let mut modules = self.modules.lock();
if modules.is_full() {
return Err(KabiError::DomainFull);
}
// Step 3: Verify provided services against signed ELF manifest.
for provision in &manifest.provides {
if !self.elf_manifest_authorizes(&manifest.module_id, &provision.service_id) {
return Err(KabiError::UnauthorizedService);
}
}
// Step 4: Record module descriptor.
let descriptor = ModuleDescriptor {
module_id: manifest.module_id,
state: AtomicU8::new(ModuleState::Registering as u8),
provides: manifest.provides.clone(),
requires: manifest.requires.clone(),
handles: SpinLock::new(ArrayVec::new()),
irq_ring: None,
service_rings: ArrayVec::new(),
};
modules.push(descriptor);
// Step 5: Begin dependency resolution (non-blocking).
// The domain service spawns resolution for each requirement.
// Optional dependencies that cannot be resolved immediately
// are deferred; mandatory ones block progression to READY.
let reg = ModuleRegistration {
module_id: manifest.module_id,
domain_service: self as *const DomainService,
};
Ok(reg)
}
/// Resolve all dependencies for a registered module.
///
/// For each `requires` entry in the manifest:
/// - If the provider is in this domain → create Direct handle.
/// - If the provider is in another domain → create Ring handle
/// (calls `setup_cross_domain_ring` if no ring exists yet).
/// - If the provider is not yet loaded → for mandatory deps, return
/// `Err(Deferred)` and register a wakeup callback with the global
/// registry. For optional deps, set the handle to `None`.
///
/// Returns `Ok(handles)` when all mandatory dependencies are resolved.
pub fn resolve_all(
&self,
module_id: ModuleId,
) -> Result<ArrayVec<Option<KabiHandleOpaque>, MAX_DEPS_PER_MODULE>, KabiError> {
// Phase 1: Collect requirements while holding the modules lock.
// The lock protects the module table but we must NOT hold it
// during resolve_single() — that function may call into
// GLOBAL_SERVICE_REGISTRY (different lock) and setup_or_reuse_ring()
// which may allocate. Holding modules lock during those operations
// would create lock ordering violations and unbounded hold times.
let requirements: ArrayVec<ServiceRequirement, MAX_DEPS_PER_MODULE>;
{
let modules = self.modules.lock();
let module = modules.iter()
.find(|m| m.module_id == module_id)
.ok_or(KabiError::ModuleNotFound)?;
requirements = module.requires.clone();
} // modules lock released here
// Phase 2: Resolve each dependency without holding the modules lock.
let mut handles: ArrayVec<Option<KabiHandleOpaque>, MAX_DEPS_PER_MODULE> =
ArrayVec::new();
for req in &requirements {
match self.resolve_single(&req.service_id, req.min_version) {
Ok(handle) => handles.push(Some(handle)),
Err(KabiError::ServiceNotFound) if req.optional => {
handles.push(None);
}
Err(KabiError::ServiceNotFound) => {
// Mandatory dependency not available. Register a wakeup
// callback so we're notified when the provider loads.
GLOBAL_SERVICE_REGISTRY.register_wakeup(
&req.service_id,
module_id,
self.domain_id,
);
return Err(KabiError::DependencyDeferred);
}
Err(e) => return Err(e),
}
}
// Phase 3: Re-acquire modules lock and store resolved handles.
{
let modules = self.modules.lock();
let module = modules.iter()
.find(|m| m.module_id == module_id)
.ok_or(KabiError::ModuleNotFound)?;
// Store the resolved handles into the module descriptor.
let mut module_handles = module.handles.lock();
*module_handles = handles.clone();
// All mandatory dependencies resolved.
module.state.store(ModuleState::Ready as u8, Ordering::Release);
}
Ok(handles)
}
/// Announce module readiness to the global registry.
///
/// Called after resolve_all() succeeds. Publishes all provided services
/// to the global registry, making them available to other modules.
/// Wakes any modules that were deferred waiting for these services.
pub fn announce_ready(&self, module_id: ModuleId) -> Result<(), KabiError> {
let modules = self.modules.lock();
let module = modules.iter()
.find(|m| m.module_id == module_id)
.ok_or(KabiError::ModuleNotFound)?;
module.state.store(ModuleState::Serving as u8, Ordering::Release);
// Publish each provided service to the global registry.
for provision in &module.provides {
GLOBAL_SERVICE_REGISTRY.publish(
&provision.service_id,
RegistryEntry {
domain_id: self.domain_id,
module_id,
vtable_ptr: provision.vtable_ptr,
ctx_ptr: provision.ctx_ptr,
min_version: provision.min_version,
max_version: provision.max_version,
generation: Box::new(AtomicU64::new(1)), // odd = active
},
);
}
// Wake any modules blocked in RESOLVING waiting for our services.
GLOBAL_SERVICE_REGISTRY.wake_deferred_modules();
Ok(())
}
}
/// Module lifecycle states.
#[repr(u8)]
pub enum ModuleState {
Loading = 0,
Registering = 1,
Resolving = 2,
Ready = 3,
Serving = 4,
Draining = 5,
Unloading = 6,
}
/// Registration receipt returned to the module after successful registration.
// kernel-internal, not KABI
pub struct ModuleRegistration {
pub module_id: ModuleId,
/// Pointer to the owning domain service. Valid for the module's lifetime
/// within the domain.
pub domain_service: *const DomainService,
}
/// Per-module descriptor maintained by the domain service.
// kernel-internal, not KABI
pub struct ModuleDescriptor {
pub module_id: ModuleId,
/// Current lifecycle state (see ModuleState enum).
pub state: AtomicU8,
/// Services provided by this module (vtable pointers).
pub provides: ArrayVec<ServiceProvision, MAX_SERVICES_PER_MODULE>,
/// Dependency declarations (for re-resolution on migration).
pub requires: ArrayVec<ServiceRequirement, MAX_DEPS_PER_MODULE>,
/// Resolved handles (populated during RESOLVING phase).
/// Uses `KabiHandleOpaque` for type-erased storage — the module
/// retrieves typed `KabiHandle<S>` via `handle.typed::<S>()` with
/// a runtime service_id check. Protected by SpinLock because
/// rebinding updates handles atomically.
pub handles: SpinLock<ArrayVec<Option<KabiHandleOpaque>, MAX_DEPS_PER_MODULE>>,
/// Per-driver IRQ ring, if this module claims interrupt vectors.
pub irq_ring: Option<IrqRing>,
/// Inbound service rings for cross-domain callers.
/// Each entry represents a ring from a different domain that sends
/// requests to services provided by this module. Bounded by
/// MAX_INBOUND_RINGS_PER_MODULE (one per external domain that binds
/// to any of this module's services — typically 2-4, max 8).
pub service_rings: ArrayVec<ServiceRingDescriptor, MAX_INBOUND_RINGS_PER_MODULE>,
}
pub const MAX_INBOUND_RINGS_PER_MODULE: usize = 8;
/// Descriptor for an inbound service ring on a module.
// kernel-internal, not KABI
pub struct ServiceRingDescriptor {
/// The cross-domain ring backing this service connection.
pub ring: CrossDomainRing,
/// Domain that produces commands on this ring.
pub caller_domain: DomainId,
/// Service this ring carries.
pub service_id: ServiceId,
}
impl ModuleDescriptor {
/// Return an iterator over all inbound service rings. The IRQ ring, if
/// any, is a separate type (`IrqRing`) and is quiesced independently.
/// Used by `migrate_module()` for quiescing and teardown.
pub fn inbound_rings(&self) -> impl Iterator<Item = &CrossDomainRing> {
self.service_rings.iter().map(|sr| &sr.ring)
}
/// Return the module's manifest (reconstructed from stored fields).
pub fn manifest(&self) -> ModuleManifest {
ModuleManifest {
module_id: self.module_id,
name: ArrayString::new(), // populated from ELF
version: 0, // populated from ELF
provides: self.provides.clone(),
requires: self.requires.clone(),
interrupt_vectors: ArrayVec::new(),
memory_requirements: MemoryRequirements {
total_bytes: 0,
dma_bytes: 0,
needs_contiguous_dma: false,
},
}
}
/// Return the domain class for grouping (e.g., "block", "network").
/// Derived from the primary service's class tag in the IDL.
pub fn domain_class(&self) -> DomainClass {
self.provides.first()
.map(|p| p.service_id.domain_class())
.unwrap_or(DomainClass::Misc)
}
}
12.8.6 Cross-Domain Ring Setup¶
Shared memory rings are the sole communication channel between different domains. The kernel orchestrates ring creation, mapping, and teardown. After setup, the two parties communicate directly — no kernel involvement per message.
12.8.6.1 Ring Descriptor¶
/// Descriptor for a cross-domain shared memory ring pair (command + completion).
///
/// Created by `setup_cross_domain_ring()`. Contains the physical memory
/// backing, domain access permissions, and lifecycle metadata. Stored
/// in the kernel's ring registry (Tier 0 memory, not in either domain).
///
/// **Memory layout** (single contiguous allocation):
///
/// ```text
/// +--------------------------------------------------------------+
/// | Command DomainRingBuffer header (128 bytes) |
/// +--------------------------------------------------------------+
/// | Command ring data: capacity * entry_size bytes |
/// +--------------------------------------------------------------+
/// | Overflow region: capacity * overflow_chunk_size bytes |
/// | (omitted if overflow_chunk_size == 0) |
/// +--------------------------------------------------------------+
/// | Completion DomainRingBuffer header (128 bytes) |
/// +--------------------------------------------------------------+
/// | Completion ring data: capacity * sizeof(T1CompletionEntry) |
/// +--------------------------------------------------------------+
/// ```
///
/// The command ring carries `T1CommandEntry` from producer to consumer.
/// The completion ring carries `T1CompletionEntry` from consumer back
/// to producer. Both rings share the same `capacity` (power of two).
/// The completion ring is SPSC: the consumer loop is the sole writer,
/// and `wait_completion()` is the sole reader (per cookie match).
// kernel-internal, not KABI
pub struct CrossDomainRing {
/// Physical base address of the entire allocation (command + completion).
pub pages: PhysAddr,
/// Total size in bytes (command header + command data + overflow +
/// completion header + completion data).
pub total_size: usize,
/// Number of ring entries (power of two). Shared by both the command
/// ring and the completion ring.
pub capacity: u32,
/// Command entry size in bytes.
pub entry_size: u32,
/// Domain that produces commands (writes to command ring head).
pub producer_domain: DomainId,
/// Domain that consumes commands (reads from command ring tail) and
/// produces completions (writes to completion ring head).
pub consumer_domain: DomainId,
/// Service this ring carries. Used for diagnostics and rebinding.
pub service_id: ServiceId,
/// Generation counter. Incremented on ring teardown/recreation.
/// Handles that cache a ring pointer compare this against
/// `KabiHandle.cached_generation` to detect stale rings.
pub generation: AtomicU64,
/// Pointer to the command ring's `DomainRingBuffer` header (kernel VA).
pub ring_header: *mut DomainRingBuffer,
/// Pointer to the completion ring's `DomainRingBuffer` header (kernel VA).
/// The completion ring uses the same `DomainRingBuffer` layout with
/// `entry_size = size_of::<T1CompletionEntry>()` (64 bytes). Allocated
/// contiguously after the command ring's data + overflow region.
pub completion_ring_header: *mut DomainRingBuffer,
/// Pointer to the per-slot overflow region (kernel VA). Each slot owns
/// `overflow_chunk_size` bytes starting at `overflow_base + slot * chunk_size`.
/// Used when arguments exceed `DEFAULT_INLINE_ARG_SIZE`. NULL if no overflow
/// region is configured for this service (all arguments fit inline).
pub overflow_base: *mut u8,
/// Per-slot overflow chunk size in bytes. Determined per-service by the
/// IDL compiler. 0 if no overflow region is configured.
pub overflow_chunk_size: u32,
/// Doorbell for waking the command ring consumer when entries are posted.
pub doorbell: DoorbellRegister,
/// Doorbell for waking the completion ring consumer (the producer thread
/// waiting in `wait_completion()`) when completion entries are posted.
pub completion_doorbell: DoorbellRegister,
/// Monotonic cookie counter for request correlation. Each `next_cookie()`
/// call returns a unique value on this ring, used as the `cookie` field
/// in `T1CommandEntry` to correlate completions with outstanding requests.
pub cookie_counter: AtomicU64,
}
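The layout diagram in the descriptor's doc comment pins down the size of the single contiguous allocation. A sketch of the arithmetic, using the 128-byte `DomainRingBuffer` header and 64-byte `T1CompletionEntry` sizes stated in this chapter (the helper name is illustrative):

```rust
// Header and completion-entry sizes as stated in this chapter's layout
// diagram and T1CompletionEntry definition.
const RING_HEADER_SIZE: usize = 128;
const COMPLETION_ENTRY_SIZE: usize = 64;

/// Total bytes for the contiguous allocation: command header + command
/// data + overflow region (omitted when chunk size is 0) + completion
/// header + completion data. `capacity` is shared by both rings.
fn ring_total_size(capacity: u32, entry_size: u32, overflow_chunk_size: u32) -> usize {
    let cap = capacity as usize;
    RING_HEADER_SIZE
        + cap * entry_size as usize
        + cap * overflow_chunk_size as usize
        + RING_HEADER_SIZE
        + cap * COMPLETION_ENTRY_SIZE
}
```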
impl CrossDomainRing {
/// Submit a command entry to the ring. Returns the claimed slot
/// sequence number (the `head` value at submission time).
///
/// The producer writes the command entry to the slot at `head % capacity`,
/// then advances `head` via CAS. Note that completion correlation uses
/// the entry's `cookie` field (allocated via `next_cookie()`), which the
/// consumer echoes in the completion entry — not the returned sequence
/// number. See `ring_submit_and_wait()`.
///
/// # Errors
///
/// - `RingError::Full`: No free slots. Caller should return `KabiError::QueueFull`.
/// - `RingError::Disconnected`: Ring partner has crashed. Caller should
/// return `KabiError::DomainCrashed`.
pub fn submit(&self, entry: T1CommandEntry) -> Result<u64, RingError> {
// SAFETY: ring_header was initialized by setup_cross_domain_ring()
// and is valid for the ring's lifetime. The generation check in
// kabi_call! ensures the ring has not been torn down.
let header = unsafe { &*self.ring_header };
// Check for disconnected ring (crash recovery in progress).
if header.state.load(Ordering::Acquire) == RING_STATE_DISCONNECTED {
return Err(RingError::Disconnected);
}
// CAS loop to claim the next slot. MPSC protocol: multiple producers
// may race on different CPUs (e.g., multiple kabi_call! from different
// threads bound to the same ring).
loop {
let head = header.head.load(Ordering::Relaxed);
let tail = header.tail.load(Ordering::Acquire);
// Ring full: head has lapped tail by `capacity` entries.
if head.wrapping_sub(tail) >= self.capacity as u64 {
return Err(RingError::Full);
}
// Attempt to claim slot at `head`. Take the preempt guard BEFORE
// the CAS: if a producer were preempted between claiming a slot
// and publishing it, every later producer would spin for a full
// scheduling quantum (~1-10ms). Holding the guard across the whole
// CAS-to-publish window eliminates that convoy. On CAS failure the
// guard drops at the end of the loop iteration.
let preempt = preempt_disable();
if header.head.compare_exchange_weak(
head,
head.wrapping_add(1),
Ordering::AcqRel,
Ordering::Relaxed,
).is_ok() {
// Write the entry to the claimed slot.
let idx = (head % self.capacity as u64) as usize;
// SAFETY: idx is within ring bounds (head < tail + capacity).
// The slot is exclusively ours after the CAS succeeded.
unsafe {
header.write_entry_as(idx, &entry);
}
// Publish: advance `published` so the consumer can see this entry.
// For MPSC, we must wait until all entries before ours are published.
// Spin until published == head (our slot number). This is bounded:
// at most N concurrent producers, each publishing in slot order,
// and none of them preemptible while holding the guard.
while header.published.load(Ordering::Acquire) != head {
core::hint::spin_loop();
}
header.published.store(head.wrapping_add(1), Ordering::Release);
drop(preempt);
// Signal the consumer doorbell if the entry requests it.
if entry.flags & T1_CMD_NOTIFY != 0 {
self.doorbell.signal();
}
return Ok(head);
}
// CAS failed — another producer claimed this slot. Retry.
}
}
/// Wait for a completion matching the given sequence number.
///
/// Polls the completion ring for an entry whose `cookie` matches
/// the sequence number returned by `submit()`. Uses adaptive polling:
/// spin for a short burst, then sleep on the completion ring's doorbell.
///
/// # Arguments
///
/// - `seq`: The cookie placed in the command entry (allocated via
/// `next_cookie()`), echoed back in the matching completion entry.
/// - `timeout_ns`: Maximum time to wait in nanoseconds. 0 = non-blocking poll.
///
/// # Errors
///
/// - `RingError::Disconnected`: Ring partner has crashed. The domain's
/// crash handler set ring state to `Disconnected`.
/// - `RingError::Timeout`: No matching completion arrived within `timeout_ns`.
pub fn wait_completion(
&self,
seq: u64,
timeout_ns: u64,
) -> Result<T1CompletionEntry, RingError> {
// Read from the COMPLETION ring header, not the command ring header.
// The completion ring is a separate DomainRingBuffer allocated
// contiguously after the command ring's data + overflow region.
// SAFETY: completion_ring_header was initialized by
// setup_cross_domain_ring() and is valid for the ring's lifetime.
let comp_header = unsafe { &*self.completion_ring_header };
let cmd_header = unsafe { &*self.ring_header };
let deadline_ns = arch::current::cpu::ktime_get_ns()
.saturating_add(timeout_ns);
loop {
// Check for disconnect on the command ring before each poll attempt.
// If the consumer domain crashed, the command ring state is set to
// DISCONNECTED by crash recovery Step 2'.
if cmd_header.state.load(Ordering::Acquire) == RING_STATE_DISCONNECTED {
return Err(RingError::Disconnected);
}
// Scan available completions. The completion ring is a standard
// DomainRingBuffer: `published` tracks how many entries the
// consumer loop has posted, `tail` tracks how many entries
// the producer has consumed (read and matched).
//
// Since completions may arrive out of order (multiple in-flight
// requests), we scan all unread completions for our cookie.
// In practice, with FIFO dispatch in the consumer loop, completions
// are nearly always in order — the scan length is typically 1.
let comp_published = comp_header.published.load(Ordering::Acquire);
let comp_tail = comp_header.tail.load(Ordering::Relaxed);
let mut scan = comp_tail;
while scan < comp_published {
let idx = (scan % self.capacity as u64) as usize;
// SAFETY: idx is within ring bounds, entry was published.
let entry: &T1CompletionEntry = unsafe {
comp_header.read_entry_as(idx)
};
if entry.cookie == seq {
// Found our completion. Copy the result.
let result = *entry;
// Mark this slot as consumed by zeroing its cookie.
// This allows other waiters to skip consumed entries
// without losing their own completions.
let zero_cookie_offset = core::mem::size_of::<DomainRingBuffer>()
+ idx * comp_header.entry_size as usize;
// SAFETY: offset is within bounds. We write only the
// cookie field (first 8 bytes of T1CompletionEntry).
unsafe {
let slot_ptr = (comp_header as *const DomainRingBuffer as *mut u8)
.add(zero_cookie_offset) as *mut u64;
core::ptr::write(slot_ptr, 0);
}
// Advance the tail past all consecutive consumed entries
// (cookie == 0) starting from the current tail. This
// prevents tail stall: if earlier entries are still
// unconsumed (their waiters haven't found them yet),
// the tail stays at the first unconsumed entry.
let mut new_tail = comp_tail;
while new_tail < comp_published {
let tidx = (new_tail % self.capacity as u64) as usize;
let cookie_offset = core::mem::size_of::<DomainRingBuffer>()
+ tidx * comp_header.entry_size as usize;
let cookie = unsafe {
let slot_ptr = (comp_header as *const DomainRingBuffer as *const u8)
.add(cookie_offset) as *const u64;
core::ptr::read(slot_ptr)
};
if cookie != 0 {
break; // Unconsumed entry — stop advancing
}
new_tail += 1;
}
if new_tail > comp_tail {
comp_header.tail.store(new_tail, Ordering::Release);
}
return Ok(result);
}
scan += 1;
}
// No matching completion yet. Check timeout.
let now = arch::current::cpu::ktime_get_ns();
if now >= deadline_ns {
return Err(RingError::Timeout);
}
// Adaptive wait: spin briefly, then sleep on the completion doorbell.
// The completion_doorbell is signaled by the consumer loop after
// posting completions (Phase 5 of kabi_consumer_loop).
let remaining_ns = deadline_ns.saturating_sub(now);
self.completion_doorbell.wait_timeout(remaining_ns);
}
}
/// Allocate the next monotonic cookie value for request correlation.
///
/// Returns a unique u64 value that the producer places in
/// `T1CommandEntry.cookie`. The consumer echoes it in
/// `T1CompletionEntry.cookie` so the producer can correlate
/// completions with outstanding requests.
///
/// Uses `fetch_add(1, Relaxed)` — ordering is not needed because the
/// cookie is an opaque correlation value, not a synchronization primitive.
/// The u64 counter cannot wrap within operational lifetime (2^64 at 10^9
/// ops/sec = 584 years).
pub fn next_cookie(&self) -> u64 {
self.cookie_counter.fetch_add(1, Ordering::Relaxed)
}
/// Check whether the ring is in Disconnected state (partner crashed).
pub fn is_disconnected(&self) -> bool {
// SAFETY: ring_header is valid for the ring's lifetime.
let header = unsafe { &*self.ring_header };
header.state.load(Ordering::Acquire) == RING_STATE_DISCONNECTED
}
/// Enqueue a completion entry on the SPSC completion ring.
///
/// Called by the consumer loop after processing a command entry.
/// The completion ring is SPSC (single producer = this consumer thread,
/// single consumer = the waiting producer thread via `wait_completion()`).
///
/// Writes the completion entry to the completion ring at
/// `published % capacity`, then advances the completion ring's
/// `published` with `Release` ordering so the producer can observe
/// the new entry.
///
/// # Safety
///
/// - `self.completion_ring_header` must point to a valid, live ring.
/// - The caller must be the sole producer on this completion ring
/// (guaranteed by the SPSC invariant: one consumer loop per ring).
pub unsafe fn enqueue_spsc(&self, entry: &T1CompletionEntry) {
let comp_header = &*self.completion_ring_header;
let published = comp_header.published.load(Ordering::Relaxed);
let idx = (published % self.capacity as u64) as usize;
// Write the completion entry to the completion ring slot.
// SAFETY: idx < capacity, we are the sole producer (SPSC),
// and completion_ring_header is valid for the ring's lifetime.
comp_header.write_entry_as(idx, entry);
// Advance the completion ring's published count. Release ensures
// the entry write is visible before the producer reads the new
// published value in wait_completion(). wrapping_add for consistency
// with the command ring's counters.
comp_header.published.store(published.wrapping_add(1), Ordering::Release);
// Signal the completion doorbell to wake any sleeping waiter
// in wait_completion().
self.completion_doorbell.signal();
}
}
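The head/published protocol in `submit()` can be modeled in isolation. The sketch below is a toy, single-file reduction of the claim-write-publish sequence using only standard atomics — `ToyRing` and its fields are illustrative stand-ins, not the kernel's `DomainRingBuffer`:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const CAPACITY: u64 = 8;

// Toy ring header mirroring the head/tail/published counters of the
// kernel's DomainRingBuffer (names and layout are illustrative).
struct ToyRing {
    head: AtomicU64,      // next slot to claim (producers)
    tail: AtomicU64,      // entries consumed (consumer)
    published: AtomicU64, // entries visible to the consumer
    slots: [AtomicU64; CAPACITY as usize],
}

impl ToyRing {
    fn new() -> Self {
        ToyRing {
            head: AtomicU64::new(0),
            tail: AtomicU64::new(0),
            published: AtomicU64::new(0),
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    // Same shape as CrossDomainRing::submit: claim via CAS, write the
    // slot, then publish in slot order.
    fn submit(&self, value: u64) -> Result<u64, ()> {
        loop {
            let head = self.head.load(Ordering::Relaxed);
            let tail = self.tail.load(Ordering::Acquire);
            if head.wrapping_sub(tail) >= CAPACITY {
                return Err(()); // ring full
            }
            if self.head.compare_exchange_weak(
                head,
                head.wrapping_add(1),
                Ordering::AcqRel,
                Ordering::Relaxed,
            ).is_ok() {
                self.slots[(head % CAPACITY) as usize].store(value, Ordering::Relaxed);
                // Wait for earlier slots to publish, then publish ours.
                while self.published.load(Ordering::Acquire) != head {
                    std::hint::spin_loop();
                }
                self.published.store(head.wrapping_add(1), Ordering::Release);
                return Ok(head);
            }
        }
    }
}
```

The publish-in-order spin is what lets the consumer treat `published` as a simple high-water mark: every slot below it is fully written, even when producers claimed slots concurrently.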
/// Shared ring submission helper. Used by both `kabi_call!` Ring path
/// and `dispatch_to_domain()` Tier1 path to avoid duplicating the ring
/// submit + wait + error-mapping logic.
///
/// Submits a pre-constructed `T1CommandEntry` to the ring and waits
/// for the matching completion. Error mapping is unified: Disconnected
/// maps to `DomainCrashed`, Full to `QueueFull`, Timeout to `Timeout`.
///
/// Both `kabi_call!` and `dispatch_to_domain()` construct the command
/// entry differently (kabi_call! uses IDL-generated serialization,
/// dispatch_to_domain uses capability-validated request construction),
/// but the submit-and-wait phase is identical.
fn ring_submit_and_wait(
ring: &CrossDomainRing,
cmd: T1CommandEntry,
timeout_ns: u64,
) -> Result<T1CompletionEntry, KabiError> {
let cookie = cmd.cookie;
ring.submit(cmd).map_err(|e| match e {
RingError::Full => KabiError::QueueFull,
RingError::Disconnected => KabiError::DomainCrashed,
_ => KabiError::InternalError,
})?;
ring.wait_completion(cookie, timeout_ns).map_err(|e| match e {
RingError::Disconnected => KabiError::DomainCrashed,
RingError::Timeout => KabiError::Timeout,
_ => KabiError::InternalError,
})
}
/// IDL-generated inline argument serializer.
///
/// Writes method arguments directly into a `T1CommandEntry.args` field.
/// Returns the total serialized length in bytes. This replaces the former
/// shared `ArgBuffer` model which suffered from concurrent-corruption bugs
/// when multiple producers wrote to the same shared region.
///
/// The IDL compiler (`kabi-gen`) generates one `serialize_args_inline()`
/// function per service method. Each function writes the method's arguments
/// in wire format (little-endian, packed, no padding between fields) into
/// the provided `args` slice.
///
/// If the serialized data exceeds `DEFAULT_INLINE_ARG_SIZE`, the caller
/// (ring submit path) copies the overflow into the slot's dedicated overflow
/// chunk. The consumer reassembles inline + overflow before dispatch.
///
/// # Example (generated)
///
/// ```rust
/// fn serialize_args_inline_read_page(
/// args: &mut [u8; DEFAULT_INLINE_ARG_SIZE],
/// dev_id: &DeviceId,
/// offset: &u64,
/// ) -> usize {
/// let mut cursor = 0;
/// cursor += serialize_into(&mut args[cursor..], dev_id);
/// cursor += serialize_into(&mut args[cursor..], offset);
/// cursor // total bytes written
/// }
/// ```
fn serialize_args_inline(args: &mut [u8; DEFAULT_INLINE_ARG_SIZE], /* ... */) -> usize {
// IDL-generated: method-specific serialization.
unimplemented!("generated by kabi-gen")
}
/// IDL-generated result deserializer for the Ring path.
///
/// Extracts the method return value from a `T1CompletionEntry`. Called by
/// `kabi_call!` Ring path (line 688) and `KabiCookie::wait()` (line 983)
/// after `wait_completion()` returns successfully.
///
/// The IDL compiler (`kabi-gen`) generates one `deserialize_result::<R>()`
/// specialization per service method. The generic signature is:
///
/// ```rust
/// fn deserialize_result<R>(completion: &T1CompletionEntry) -> Result<R, KabiError>
/// ```
///
/// # Deserialization protocol
///
/// 1. Check `completion.status`. If non-zero, return
/// `Err(KabiError::DriverError(status))` without reading result data.
/// 2. If `completion.result_len == 0` and the method return type is `()`,
/// return `Ok(())`.
/// 3. Read `completion.result_len` bytes from the completion ring's result
/// region at `completion.result_offset`. Deserialize the byte buffer into
/// `R` using the same wire format as `serialize_args_inline` (little-endian,
/// packed, no inter-field padding).
///
/// The result data is written by the consumer loop into the completion ring
/// entry's inline result area (for small results) or the completion ring's
/// overflow region (for large results, e.g., returning a page of data).
///
/// # Type safety
///
/// The `R` type parameter is bound by the `KabiService` trait's associated
/// return types. The IDL compiler generates a concrete deserializer per
/// (service, method) pair, ensuring wire format matches at compile time.
/// A mismatch between producer and consumer deserializers is a build error
/// (both generated from the same `.kabi` source).
///
/// # Example (generated for `block_device::read_page`)
///
/// ```rust
/// fn deserialize_result_read_page(
/// completion: &T1CompletionEntry,
/// ) -> Result<PageHandle, KabiError> {
/// if completion.status != 0 {
/// return Err(KabiError::DriverError(completion.status));
/// }
/// // Result data is in the completion entry's inline result region.
/// // PageHandle is 8 bytes (u64 page frame number).
/// let pfn = u64::from_le_bytes(
/// completion.result_data[..8].try_into().unwrap()
/// );
/// Ok(PageHandle(pfn))
/// }
/// ```
fn deserialize_result<R>(completion: &T1CompletionEntry) -> Result<R, KabiError> {
// IDL-generated: method-specific deserialization.
// Each specialization checks completion.status, then deserializes
// completion.result_len bytes from the result region into R.
unimplemented!("generated by kabi-gen")
}
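To make the wire format concrete, here is a hand-written equivalent of what `kabi-gen` would emit for the `read_page` example above. The argument names (`dev_id`, `offset`) come from the generated-code sketch; representing `DeviceId` as a bare `u32` is an assumption for illustration. Only the format rules — little-endian, packed, no inter-field padding — come from the text:

```rust
const DEFAULT_INLINE_ARG_SIZE: usize = 64;

// Hand-written stand-in for the generated serializer: fields are written
// back-to-back in declaration order, little-endian, no padding.
fn serialize_read_page_args(
    args: &mut [u8; DEFAULT_INLINE_ARG_SIZE],
    dev_id: u32, // assumption: DeviceId modeled as u32 for this sketch
    offset: u64,
) -> usize {
    args[0..4].copy_from_slice(&dev_id.to_le_bytes());
    args[4..12].copy_from_slice(&offset.to_le_bytes());
    12 // total bytes written
}

// Consumer-side counterpart: reads the same layout back out.
fn deserialize_read_page_args(args: &[u8]) -> (u32, u64) {
    let dev_id = u32::from_le_bytes(args[0..4].try_into().unwrap());
    let offset = u64::from_le_bytes(args[4..12].try_into().unwrap());
    (dev_id, offset)
}
```

Because both sides are generated from the same `.kabi` source, the offsets can be hard-coded like this — there is no runtime schema negotiation on the ring fast path.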
12.8.6.2 Setup Per Crossing Type¶
/// Create a shared memory ring between two domains.
///
/// The physical pages are allocated from kernel memory (not from either
/// domain's private allocation). Isolation domain permissions are
/// configured so both domains can access the shared region.
///
/// # Per-crossing setup
///
/// | Crossing | Pages allocated in | Producer access | Consumer access |
/// |----------|-------------------|-----------------|-----------------|
/// | Tier 0 ↔ Tier 1 | Kernel memory | Always (Domain 0) | MPK/POE/DACR key grants read/write |
/// | Tier 1 ↔ Tier 1 | Kernel memory | MPK/POE/DACR key for producer domain | MPK/POE/DACR key for consumer domain |
/// | Tier 0/1 ↔ Tier 2 | Kernel memory, mmap'd into Tier 2 process | Kernel VA | Userspace VA (mmap) |
/// | Tier 2 ↔ Tier 2 | Kernel memory, mmap'd into both processes | Userspace VA (mmap) | Userspace VA (mmap) |
///
/// # Ring sizing
///
/// Default: 256 entries, 64 bytes per entry. Configurable per-service
/// via the `.kabi` service definition or operator policy. The per-slot
/// overflow region totals 64 KiB at the defaults (256 slots × 256-byte
/// chunks) — sufficient for most vtable argument sets; large DMA
/// transfers use separate DMA buffer handles, not inline arguments.
///
/// # Errors
///
/// - `ENOMEM`: Insufficient memory for ring pages.
/// - `ENOSPC`: Domain isolation key exhausted (too many Tier 1 domains
/// on architectures with limited keys, e.g., x86 MPK has 12 usable).
/// - `EINVAL`: `producer == consumer` (same domain does not need a ring).
pub fn setup_cross_domain_ring(
producer: DomainId,
consumer: DomainId,
service_id: &ServiceId,
capacity: u32,
entry_size: u32,
) -> Result<CrossDomainRing, KernelError> {
if producer == consumer {
return Err(KernelError::EINVAL);
}
// Capacity must be a non-zero power of two for efficient modular
// indexing (ring index = counter % capacity compiles to a bitwise AND
// when capacity is 2^N). is_power_of_two() returns false for 0, so
// zero is rejected by the same check.
if !capacity.is_power_of_two() {
return Err(KernelError::EINVAL);
}
let cmd_data_size = capacity as usize * entry_size as usize;
// Overflow region: per-slot chunk for arguments exceeding inline capacity.
// If overflow_chunk_size is 0 (all args fit inline), no overflow region.
// The overflow_chunk_size is determined by the IDL compiler per service.
let overflow_chunk_size = DEFAULT_OVERFLOW_CHUNK_SIZE;
let overflow_size = if overflow_chunk_size > 0 {
capacity as usize * overflow_chunk_size
} else {
0
};
// Completion ring: same capacity, entry_size = sizeof(T1CompletionEntry).
let comp_entry_size = core::mem::size_of::<T1CompletionEntry>();
let comp_data_size = capacity as usize * comp_entry_size;
// Total allocation: command ring (header + data + overflow) +
// completion ring (header + data).
let total_size = size_of::<DomainRingBuffer>() + cmd_data_size + overflow_size
+ size_of::<DomainRingBuffer>() + comp_data_size;
let page_count = (total_size + PAGE_SIZE - 1) / PAGE_SIZE;
// Step 1: Allocate physically contiguous pages from kernel memory.
// These pages are in kernel address space, not in either domain.
let pages = phys_alloc::alloc_pages(page_count, GfpFlags::KERNEL | GfpFlags::ZERO)?;
let ring_va = phys_to_virt(pages);
// Step 2a: Initialize the command ring DomainRingBuffer header.
// SAFETY: ring_va points to zeroed, kernel-owned memory of sufficient size.
let header = unsafe { &mut *(ring_va as *mut DomainRingBuffer) };
header.size = capacity;
header.entry_size = entry_size;
header.consumer_size = capacity;
header.consumer_entry_size = entry_size;
header.state.store(0, Ordering::Release); // Active
// Step 2b: Initialize the completion ring DomainRingBuffer header.
// The completion ring follows the command ring data + overflow region.
let comp_offset = size_of::<DomainRingBuffer>() + cmd_data_size + overflow_size;
let comp_header = unsafe {
&mut *((ring_va as *mut u8).add(comp_offset) as *mut DomainRingBuffer)
};
comp_header.size = capacity;
comp_header.entry_size = comp_entry_size as u32;
comp_header.consumer_size = capacity;
comp_header.consumer_entry_size = comp_entry_size as u32;
comp_header.state.store(0, Ordering::Release); // Active
// Step 3: Grant access to both domains.
let producer_tier = domain_registry::tier_of(producer);
let consumer_tier = domain_registry::tier_of(consumer);
match (producer_tier, consumer_tier) {
// Tier 0 ↔ Tier 1: Grant the Tier 1 domain's isolation key
// read/write access to the ring pages. Tier 0 (Core domain)
// always has access to all kernel memory.
(IsolationTier::Tier0, IsolationTier::Tier1)
| (IsolationTier::Tier1, IsolationTier::Tier0) => {
let tier1_domain = if producer_tier == IsolationTier::Tier1 {
producer
} else {
consumer
};
arch::current::isolation::grant_domain_access(
tier1_domain,
pages,
page_count,
AccessPermission::ReadWrite,
);
}
// Tier 1 ↔ Tier 1 (different domains): Grant both domains
// read/write access. On x86 MPK, this sets both PKEYs in the
// page table entries. On AArch64 POE, both overlay indices get
// access. On architectures without per-page multi-domain support,
// the pages are placed in a shared PKEY/overlay (PKEY 1 on x86,
// overlay 1 on AArch64) accessible to all Tier 1 domains.
(IsolationTier::Tier1, IsolationTier::Tier1) => {
arch::current::isolation::grant_shared_access(
producer,
consumer,
pages,
page_count,
AccessPermission::ReadWrite,
);
}
// Tier 2 ↔ Tier 2: Map the ring pages into BOTH Tier 2
// processes' address spaces. Each gets its own mmap (different
// userspace VAs, same physical pages). Both IOMMU domains are
// configured to allow access to these shared pages.
(IsolationTier::Tier2, IsolationTier::Tier2) => {
let producer_process = domain_registry::process_of(producer);
producer_process.mmap_ring_pages(
pages,
page_count,
MmapFlags::SHARED | MmapFlags::LOCKED,
)?;
let consumer_process = domain_registry::process_of(consumer);
if let Err(e) = consumer_process.mmap_ring_pages(
pages,
page_count,
MmapFlags::SHARED | MmapFlags::LOCKED,
) {
// Rollback producer mapping on failure.
producer_process.munmap_ring_pages(pages, page_count);
return Err(e);
}
}
// Tier 0/1 ↔ Tier 2: Map the ring pages into the Tier 2
// process's address space via mmap. The kernel retains its
// own mapping. IOMMU restricts the Tier 2 process's DMA to
// only these shared pages (not arbitrary kernel memory).
// The Tier 0/1 side accesses via kernel VA (always mapped).
(_, IsolationTier::Tier2) | (IsolationTier::Tier2, _) => {
let tier2_domain = if producer_tier == IsolationTier::Tier2 {
producer
} else {
consumer
};
let tier2_process = domain_registry::process_of(tier2_domain);
tier2_process.mmap_ring_pages(
pages,
page_count,
MmapFlags::SHARED | MmapFlags::LOCKED,
)?;
}
// Tier 0 ↔ Tier 0: should not reach here (same domain check above).
(IsolationTier::Tier0, IsolationTier::Tier0) => {
unreachable!("same-domain ring rejected by producer == consumer check");
}
}
let overflow_offset = size_of::<DomainRingBuffer>() + cmd_data_size;
let overflow_base = if overflow_size > 0 {
unsafe { (ring_va as *mut u8).add(overflow_offset) }
} else {
core::ptr::null_mut()
};
Ok(CrossDomainRing {
pages,
total_size,
capacity,
entry_size,
producer_domain: producer,
consumer_domain: consumer,
service_id: service_id.clone(),
generation: AtomicU64::new(1),
ring_header: header as *mut DomainRingBuffer,
completion_ring_header: comp_header as *mut DomainRingBuffer,
overflow_base,
overflow_chunk_size: overflow_chunk_size as u32,
doorbell: DoorbellRegister::new(),
completion_doorbell: DoorbellRegister::new(),
cookie_counter: AtomicU64::new(1),
})
}
/// Default per-slot overflow chunk size: 256 bytes.
///
/// Used when arguments exceed `DEFAULT_INLINE_ARG_SIZE` (64 bytes).
/// Each ring slot owns one overflow chunk — no concurrent access across
/// slots. The IDL compiler may override this per service based on the
/// maximum serialized argument size. Set to 0 for services where all
/// arguments fit inline (most control/metadata services).
///
/// Override per-service via `.kabi` `overflow_chunk_size: 1024;` for
/// services with large argument payloads (e.g., crypto operations with
/// key material).
pub const DEFAULT_OVERFLOW_CHUNK_SIZE: usize = 256;
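Putting the defaults together, the allocation arithmetic in `setup_cross_domain_ring()` works out as follows. This is a standalone check assuming the 128-byte `DomainRingBuffer` header mentioned in Section 12.8.7 and the documented defaults (256 entries, 64-byte command and completion entries, 256-byte overflow chunks):

```rust
// Standalone sizing check. HEADER is the assumed 128-byte
// DomainRingBuffer header; the other constants are the documented
// defaults for this chapter.
const HEADER: usize = 128;
const CAPACITY: usize = 256;
const CMD_ENTRY: usize = 64;       // T1CommandEntry
const OVERFLOW_CHUNK: usize = 256; // DEFAULT_OVERFLOW_CHUNK_SIZE
const COMP_ENTRY: usize = 64;      // T1CompletionEntry
const PAGE_SIZE: usize = 4096;

// Mirrors the layout in setup_cross_domain_ring(): command header +
// command data + overflow region, then completion header + data.
fn ring_total_size() -> usize {
    HEADER + CAPACITY * CMD_ENTRY
        + CAPACITY * OVERFLOW_CHUNK
        + HEADER + CAPACITY * COMP_ENTRY
}

// Same round-up-to-pages computation as the setup path.
fn ring_page_count() -> usize {
    (ring_total_size() + PAGE_SIZE - 1) / PAGE_SIZE
}
```

At these defaults the overflow region dominates: 64 KiB of the roughly 96 KiB total, with the two 16 KiB entry arrays making up most of the rest.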
12.8.6.3 Ring Teardown¶
/// Tear down a cross-domain ring.
///
/// Called during module unload or after crash recovery completes.
/// The ring must be fully drained (no in-flight requests) before teardown.
///
/// # Preconditions
///
/// - `ring.state` is `Disconnected` (set by crash handler or unload path).
/// - Consumer thread has exited (joined or aborted by crash handler).
/// - All in-flight completions have been posted or cancelled.
///
/// # Steps
///
/// 1. Increment ring generation (invalidates any cached ring pointers).
/// 2. Revoke domain access permissions for the ring pages.
/// 3. Unmap from Tier 2 process if applicable.
/// 4. Free physical pages.
pub fn teardown_cross_domain_ring(ring: &mut CrossDomainRing) {
// Step 1: Invalidate generation.
ring.generation.fetch_add(1, Ordering::SeqCst);
// Step 2: Revoke access.
let producer_tier = domain_registry::tier_of(ring.producer_domain);
let consumer_tier = domain_registry::tier_of(ring.consumer_domain);
let page_count = (ring.total_size + PAGE_SIZE - 1) / PAGE_SIZE;
// Revoke only for Tier 1 domains — Tier 0 (Core) always has kernel
// memory access, and Tier 2 mappings are removed by munmap below.
if producer_tier == IsolationTier::Tier1 {
arch::current::isolation::revoke_domain_access(
ring.producer_domain,
ring.pages,
page_count,
);
}
if consumer_tier == IsolationTier::Tier1 {
arch::current::isolation::revoke_domain_access(
ring.consumer_domain,
ring.pages,
page_count,
);
}
// Step 3: Unmap from Tier 2 process address spaces, if applicable.
if producer_tier == IsolationTier::Tier2 {
let proc = domain_registry::process_of(ring.producer_domain);
proc.munmap_ring_pages(ring.pages, page_count);
}
if consumer_tier == IsolationTier::Tier2 {
let proc = domain_registry::process_of(ring.consumer_domain);
proc.munmap_ring_pages(ring.pages, page_count);
}
// Step 4: Free physical pages via RCU-deferred callback.
// The generation increment in Step 1 causes KabiCookie::wait() to return
// DomainCrashed, but there is a TOCTOU window: a cookie holder may have
// passed the generation check and be about to dereference the ring pointer.
// RCU-deferred freeing eliminates this race: any reader in an RCU read-side
// critical section (which KabiCookie::wait() callers are, since they hold
// the handle which implies an RCU read section) sees consistent ring data
// until the grace period completes.
//
// SAFETY: pages were allocated by setup_cross_domain_ring, page_count
// matches the original allocation, and no domain can access them
// after revocation above. rcu_call defers the actual free until all
// pre-existing RCU readers have exited.
let pages = ring.pages;
let count = page_count;
rcu_call(move || {
unsafe { phys_alloc::free_pages(pages, count) };
});
}
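The generation bump in Step 1 is what makes cached ring pointers safe to invalidate. The minimal model below shows the check a handle performs; the names are illustrative — in the kernel the comparison happens inside `kabi_call!` against `KabiHandle.cached_generation`:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative stand-ins for CrossDomainRing / KabiHandle.
struct RingState {
    generation: AtomicU64,
}

struct Handle {
    cached_generation: u64,
}

// At bind time the handle snapshots the current generation.
fn bind(ring: &RingState) -> Handle {
    Handle { cached_generation: ring.generation.load(Ordering::Acquire) }
}

// Before each use, the cached value is compared against the live
// counter; any mismatch means the ring was torn down (and possibly
// recreated) since bind, so the cached pointer must not be used.
fn is_stale(ring: &RingState, h: &Handle) -> bool {
    ring.generation.load(Ordering::Acquire) != h.cached_generation
}

// Teardown bumps the counter, invalidating every outstanding handle.
fn teardown(ring: &RingState) {
    ring.generation.fetch_add(1, Ordering::SeqCst);
}
```

The generation check alone is not sufficient against the TOCTOU window described above — that is why the actual page free is RCU-deferred.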
12.8.7 Per-Driver IRQ Ring¶
Hardware interrupts follow a two-phase model that keeps the Tier 0 IRQ handler minimal (~10-15 cycles) and delegates all interrupt processing to the driver's domain. This ensures the kernel never enters a driver's isolated domain during interrupt handling — if the driver crashes, interrupts stay masked and the kernel remains stable.
12.8.7.1 Phase 1: Tier 0 IRQ Handler¶
The Tier 0 IRQ handler runs immediately when a hardware interrupt fires. Its job is limited to three operations: ACK the interrupt, mask the vector, and post a notification to the driver's IRQ ring.
/// Type tag discriminating the kind of event on the per-driver IRQ ring.
///
/// The IRQ ring carries three event types via a single MPSC ring:
///
/// - `HwIrq`: Hardware interrupt notification (original sole purpose).
/// - `TimerExpiry`: Timer wheel or hrtimer expiry event, crossing the
/// Tier 0 (timer softirq) to Tier 1 (driver domain) boundary. The
/// timer wheel enqueues this when a timer registered with a non-zero
/// `domain_id` expires ([Section 7.8](07-scheduling.md#timekeeping-and-clock-management)).
/// - `Completion`: I/O completion forwarded from another domain. Used
/// when a Tier 1 driver completes a bio and needs to notify the Tier 0
/// block layer via the outbound KABI ring (the Tier 0 consumer then
/// calls `bio_complete()`).
///
/// Discriminant values are fixed ABI (shared between Tier 0 producer
/// and Tier 1 consumer). New variants are append-only.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IrqRingEventType {
/// Hardware interrupt notification.
HwIrq = 0,
/// Timer expiry event from the Tier 0 timer wheel or hrtimer subsystem.
TimerExpiry = 1,
/// I/O completion event forwarded across domains.
Completion = 2,
}
/// IRQ ring notification entry posted by Tier 0 producers.
///
/// Fixed-size, no heap allocation, fits in a single cache line.
/// The `event_type` discriminant determines which fields are meaningful:
///
/// - `HwIrq`: `vector` identifies the interrupt; `payload` unused (zero).
/// - `TimerExpiry`: `vector` unused (zero); `payload` contains the
/// `TimerExpiryPayload` (timer_id, expiry_ns).
/// - `Completion`: `vector` unused (zero); `payload` contains the
/// `CompletionPayload` (cookie, status).
///
/// The union-in-struct approach uses the `payload` bytes (32 bytes within
/// the reserved area) to carry event-specific data. The discriminant is
/// checked by the consumer loop to select the dispatch path.
// kernel-internal, not KABI
#[repr(C, align(64))]
pub struct IrqNotification {
/// Event type discriminant.
pub event_type: IrqRingEventType,
/// Padding for alignment after u8 discriminant.
pub _pad0: [u8; 3],
/// Hardware interrupt vector number (meaningful for `HwIrq`; zero for
/// timer and completion events).
pub vector: u32,
/// CPU that posted the event (for affinity tracking / latency diagnosis).
pub cpu: u32,
/// Padding for alignment before u64.
pub _pad1: [u8; 4],
/// Timestamp from the architecture's cycle counter at event posting time.
/// x86: RDTSC. AArch64: CNTVCT_EL0. ARMv7: PMCCNTR (if available,
/// else 0). RISC-V: rdcycle. PPC: mftb. s390x: STCK.
/// LoongArch64: RDTIME.
pub timestamp: u64,
/// Event-type-specific payload. Interpretation depends on `event_type`:
/// - `HwIrq`: all zeros (reserved for future coalescing hints).
/// - `TimerExpiry`: `TimerExpiryPayload` (16 bytes used, 16 reserved).
/// - `Completion`: `CompletionPayload` (16 bytes used, 16 reserved).
pub payload: [u8; 32],
/// Reserved for future use.
pub _reserved: [u8; 8],
}
// IrqNotification layout: event_type(1) + _pad0(3) + vector(4) + cpu(4)
// + _pad1(4) + timestamp(8) + payload(32) + _reserved(8)
// = 1 + 3 + 4 + 4 + 4 + 8 + 32 + 8 = 64 bytes.
const_assert!(core::mem::size_of::<IrqNotification>() == 64);
/// Timer expiry payload. Stored in `IrqNotification.payload` when
/// `event_type == TimerExpiry`. Read by the per-domain service thread
/// to dispatch to the correct Tier 1 module's timer handler.
///
/// The timer wheel (Tier 0 softirq) enqueues this when a timer whose
/// `domain_id != CORE_DOMAIN_ID` expires. The `timer_id` is opaque to
/// the timer wheel — it is the value provided by the Tier 1 module at
/// timer registration time (typically an index into a per-module timer
/// table, or a packed `(connection_id, timer_type)` discriminant).
// kernel-internal, not KABI
#[repr(C)]
pub struct TimerExpiryPayload {
/// Opaque timer identifier provided at registration time.
/// For TCP: packed `(sock_handle: u48, timer_type: u16)`.
/// For other modules: module-defined meaning.
pub timer_id: u64,
/// Absolute expiry time in nanoseconds (CLOCK_MONOTONIC).
/// Allows the consumer to detect stale expiries (timer was
/// rearmed after this event was enqueued).
pub expiry_ns: u64,
/// Reserved (zero). Fills the 32-byte payload.
pub _reserved: [u8; 16],
}
const_assert!(core::mem::size_of::<TimerExpiryPayload>() == 32);
/// I/O completion payload. Stored in `IrqNotification.payload` when
/// `event_type == Completion`. Used by Tier 1 drivers to signal bio
/// completion back to Tier 0 without direct cross-domain function calls.
///
/// The Tier 1 driver enqueues this on its **outbound** KABI completion
/// ring (not the IRQ ring — the outbound ring targets the Tier 0 block
/// layer consumer). The Tier 0 consumer reads the payload and calls
/// `bio_complete()` with the indicated status.
// kernel-internal, not KABI
#[repr(C)]
pub struct CompletionPayload {
/// Opaque cookie identifying the bio/request. Typically the bio
/// pointer cast to u64 (the Tier 0 block layer uses this to
/// recover the Bio reference).
pub cookie: u64,
/// Completion status: 0 = success, negative = -errno.
pub status: i32,
/// Reserved padding.
pub _pad: [u8; 4],
/// Reserved (zero). Fills the 32-byte payload.
pub _reserved: [u8; 16],
}
const_assert!(core::mem::size_of::<CompletionPayload>() == 32);
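The event-specific payloads are copied into `IrqNotification.payload` positionally. The helpers below sketch that packing for `TimerExpiryPayload`, assuming the little-endian wire convention used elsewhere in this chapter and the offsets implied by the struct (`timer_id` at offset 0, `expiry_ns` at offset 8):

```rust
// Pack a timer expiry event into the 32-byte payload area of an
// IrqNotification (sketch; offsets follow TimerExpiryPayload's layout,
// little-endian byte order is an assumption of this example).
fn pack_timer_expiry(timer_id: u64, expiry_ns: u64) -> [u8; 32] {
    let mut p = [0u8; 32];
    p[0..8].copy_from_slice(&timer_id.to_le_bytes());
    p[8..16].copy_from_slice(&expiry_ns.to_le_bytes());
    // Bytes 16..32 stay zero (the _reserved field).
    p
}

// Consumer-side counterpart: recover (timer_id, expiry_ns) from the
// payload after checking event_type == TimerExpiry.
fn unpack_timer_expiry(p: &[u8; 32]) -> (u64, u64) {
    (
        u64::from_le_bytes(p[0..8].try_into().unwrap()),
        u64::from_le_bytes(p[8..16].try_into().unwrap()),
    )
}
```

The consumer uses `expiry_ns` to discard stale expiries: if the timer was rearmed after this event was enqueued, the recorded expiry no longer matches the timer table and the event is dropped.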
/// Per-driver IRQ ring. MPSC producer, SPSC consumer.
///
/// **Producer side (MPSC)**: Multiple CPUs can simultaneously receive
/// interrupts for this driver's vectors and post notifications to the
/// ring. The producer side uses a CAS on `head` for safe concurrent
/// enqueue — the same MPSC protocol as the general `DomainRingBuffer`
/// ([Section 11.8](11-drivers.md#ipc-architecture-and-message-passing--domain-ring-buffer-design)).
/// A failed CAS means another CPU claimed the slot; the retrying CPU
/// claims the next slot. With 64 entries and typical IRQ rates, CAS
/// contention is negligible.
///
/// **Consumer side (SPSC)**: Exactly one consumer thread (the driver's
/// IRQ consumer loop) drains the ring. No consumer-side synchronization
/// is needed — the consumer has exclusive ownership of the tail pointer.
///
/// The ring uses `DomainRingBuffer` internally but is typed to
/// `IrqNotification` entries. The Tier 0 handler writes entries
/// directly (no serialization — `IrqNotification` is fixed-layout).
// kernel-internal, not KABI
pub struct IrqRing {
/// The underlying ring buffer (header + 64 IrqNotification entries).
pub ring: DomainRingBuffer,
/// Monotonically increasing count of dropped notifications (ring full).
/// Exposed via `/ukfs/kernel/drivers/<name>/irq_lost`.
pub lost_irqs: AtomicU64,
/// Flag set by the consumer when it enters sleep (waiting for work).
/// The Tier 0 handler checks this to decide whether to send an IPI.
/// `1` = consumer sleeping, send IPI to wake.
/// `0` = consumer running, skip IPI (it will poll the ring).
pub consumer_sleeping: AtomicU8, // 0 = awake, 1 = sleeping
}
/// Default IRQ ring capacity: 64 entries per driver.
///
/// At 64 bytes per entry, the ring data occupies 4 KiB (one page) plus
/// 128 bytes for the DomainRingBuffer header. Total: ~4.1 KiB per driver.
///
/// 64 entries covers burst scenarios: an NVMe device completing 64 I/Os
/// simultaneously via MSI-X will post 64 IRQ notifications. If a burst
/// exceeds 64, the overflow policy drops the new notification (the
/// enqueue fails) and increments `lost_irqs`. The driver detects the
/// loss via the `lost_irqs` counter and polls the device directly.
pub const IRQ_RING_DEFAULT_CAPACITY: u32 = 64;
/// Tier 0 generic IRQ handler. Called directly from the architecture's
/// interrupt dispatch table. Runs with interrupts disabled on the
/// receiving CPU.
///
/// Total cost: ~10-15 cycles.
/// - Dispatch-table lookup: ~2-4 cycles (XArray get, hot cache).
/// - Mask: ~1-3 cycles (MMIO or register write, architecture-dependent).
/// - ACK (EOI): ~1-3 cycles (MMIO or register write).
/// - Ring push: ~3-5 cycles (atomic CAS on head, store notification).
/// - Conditional IPI: ~0-3 cycles (branch + optional IPI send).
///
/// The 10-15 cycle estimate assumes: (a) the IRQ ring's head cache line
/// is hot (L1 hit, ~4 cycles for CAS), (b) the consumer_sleeping flag
/// is in L1 (~1 cycle load), (c) no IPI needed (consumer is awake —
/// the common case under load). Under cold-cache conditions, the CAS
/// and flag load may each incur an L2 miss (~12 cycles each), bringing
/// worst-case to ~30-40 cycles — still within the Tier 0 handler budget.
///
/// # Safety
///
/// Called from interrupt context. Must not acquire any locks, allocate
/// memory, or sleep. The `IrqOwner` entry reached through the dispatch
/// table (including its `IrqRing` and the driver's device descriptor)
/// lives in kernel-owned memory, not driver-domain memory, so it is
/// valid for the driver's lifetime and remains valid after a driver
/// crash (the kernel owns it).
pub unsafe fn generic_irq_handler(vector: u32) {
// Look up the driver owning this vector. O(1) via the IRQ dispatch
// table (XArray<u32, &IrqOwner>, [Section 3.8](03-concurrency.md#interrupt-handling)).
let owner = match IRQ_DISPATCH_TABLE.get(vector as u64) {
Some(o) => o,
None => {
// Spurious interrupt (no driver registered). ACK and return.
arch::current::interrupts::eoi(vector);
return;
}
};
// Mask this specific vector FIRST, then EOI. For level-triggered
// interrupts, doing EOI-before-mask creates a window where the
// still-asserted IRQ line re-fires immediately after EOI, causing
// a re-entry into this handler before the mask takes effect.
// Masking first prevents re-entry; the subsequent EOI de-asserts
// the pending state at the interrupt controller. The driver's
// consumer unmasks after processing.
arch::current::interrupts::mask_vector(vector);
arch::current::interrupts::eoi(vector);
// Post notification to the driver's IRQ ring.
let notif = IrqNotification {
event_type: IrqRingEventType::HwIrq,
_pad0: [0u8; 3],
vector,
cpu: arch::current::cpu::cpu_id() as u32,
_pad1: [0u8; 4],
timestamp: arch::current::cpu::read_cycle_counter(),
payload: [0u8; 32],
_reserved: [0u8; 8],
};
let ring = &owner.irq_ring;
match ring.ring.try_enqueue_cas(&notif) {
Ok(()) => {}
Err(()) => {
// Ring full — drop notification and count the loss.
ring.lost_irqs.fetch_add(1, Ordering::Relaxed);
}
}
// Wake the consumer if it's sleeping.
// Relaxed load: the consumer sets this flag with Release before
// sleeping. We may observe a stale `awake` value (missing a wake)
// but the consumer will re-check the ring on its next poll cycle.
// A stale `sleeping` value causes a spurious IPI — harmless.
if ring.consumer_sleeping.load(Ordering::Relaxed) != 0 {
// Swap to 0 (awake) before sending IPI to avoid double-wake.
if ring.consumer_sleeping.swap(0, Ordering::AcqRel) != 0 {
arch::current::interrupts::send_ipi(
owner.consumer_cpu,
KABI_DOORBELL_VECTOR,
);
}
}
}
/// IPI vector used to wake KABI consumer threads.
///
/// A fixed vector reserved during boot for doorbell IPIs. The IPI
/// handler simply returns — the actual wake-up occurs because the
/// consumer thread's poll/mwait is interrupted.
///
/// The value is architecture-specific because interrupt numbering differs
/// across platforms. Each architecture module defines its own constant:
///
/// | Architecture | Value | Mechanism |
/// |---|---|---|
/// | x86-64 | `0xFB` | Local APIC vector (0xF0-0xFF reserved range) |
/// | AArch64 | `14` | GICv3 SGI (range 0-15) |
/// | ARMv7 | `14` | GIC SGI (range 0-15) |
/// | RISC-V | `0` | Software interrupt via ACLINT (single SW IRQ) |
/// | PPC32/PPC64LE | `0` | Doorbell interrupt (msgsnd, single type) |
/// | s390x | `0` | SIGP external call (single external IRQ type) |
/// | LoongArch64 | `0` | IPI mailbox (single IPI mechanism) |
///
/// Architectures that have a single IPI mechanism (RISC-V, PPC, s390x,
/// LoongArch64) use 0 as a sentinel — the IPI dispatch handler
/// recognizes the KABI doorbell by the sending context, not the vector.
pub const KABI_DOORBELL_VECTOR: u32 = arch::current::interrupts::KABI_DOORBELL_VECTOR;
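The sleeping-flag handshake in the handler above uses `swap` rather than a plain store so that when several CPUs race to wake the same consumer, exactly one of them sends the doorbell IPI. A minimal single-threaded sketch of that property, with std atomics standing in for the kernel's:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Model of the doorbell handshake: producers swap consumer_sleeping to 0
// before sending the IPI, so concurrent producers send at most one IPI.
fn try_wake(flag: &AtomicU8) -> bool {
    // Returns true if this caller is the one that must send the IPI.
    flag.swap(0, Ordering::AcqRel) != 0
}

fn main() {
    let consumer_sleeping = AtomicU8::new(1); // consumer went to sleep
    // Two CPUs observe interrupts for this driver back to back.
    let first = try_wake(&consumer_sleeping);
    let second = try_wake(&consumer_sleeping);
    assert!(first);   // first producer wins the swap and sends the IPI
    assert!(!second); // second sees 0: consumer is already being woken
    assert_eq!(consumer_sleeping.load(Ordering::Relaxed), 0);
}
```

The first caller's swap returns the old value 1 and wins; every later caller sees 0 and skips the IPI, which is what makes the double-wake window harmless.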
12.8.7.2 Phase 2: Driver IRQ Consumer¶
The driver's IRQ consumer loop runs in the driver's domain and processes notifications from the IRQ ring. It unmasks the interrupt vector after processing, re-enabling hardware interrupt delivery for the next event.
/// Driver-side IRQ consumer loop. Runs in the driver's isolation domain
/// (Tier 1 or Tier 2). Generated by the IDL compiler from the driver's
/// `interrupt_vectors` declaration.
///
/// The consumer alternates between sleeping (waiting for the doorbell IPI)
/// and draining the IRQ ring. It batches: if multiple IRQs accumulated
/// while the consumer was processing, they are all handled in one
/// domain-active period.
///
/// The ring carries multiple event types distinguished by
/// `IrqNotification.event_type`:
///
/// - `HwIrq`: Hardware interrupt — dispatch to `handler.handle_irq()`,
/// then unmask the vector.
/// - `TimerExpiry`: Timer expiry from the Tier 0 timer wheel — dispatch
/// to `handler.handle_timer_expiry()`. No vector unmasking needed
/// (timer events are not associated with hardware interrupt vectors).
/// - `Completion`: I/O completion forwarded from another domain — dispatch
/// to `handler.handle_completion()`. No vector unmasking needed.
///
/// This unified dispatch eliminates the need for separate per-event-type
/// rings. The MPSC CAS-based enqueue is shared across all Tier 0 producers
/// (hardware IRQ handler, timer softirq, block layer completion consumer).
fn irq_consumer_loop(
irq_ring: &IrqRing,
handler: &dyn DriverIrqHandler,
domain_id: DomainId,
) -> ! {
let cpu_local = arch::current::cpu::cpulocal();
loop {
// Mark ourselves as sleeping so the Tier 0 handler sends an IPI.
irq_ring.consumer_sleeping.store(1, Ordering::Release);
// Check ring once more after setting the flag (prevent lost-wake race):
// if the Tier 0 handler posted between our last drain and the flag set,
// we would miss it without this check.
if irq_ring.ring.published.load(Ordering::Acquire)
== irq_ring.ring.tail.load(Ordering::Relaxed)
{
// Ring is empty — sleep until IPI wakes us.
arch::current::cpu::wait_for_interrupt();
}
// Clear sleeping flag.
irq_ring.consumer_sleeping.store(0, Ordering::Release);
// Enter domain.
cpu_local.active_domain.store(domain_id, Ordering::Relaxed);
// Drain all available notifications.
let published = irq_ring.ring.published.load(Ordering::Acquire);
let mut tail = irq_ring.ring.tail.load(Ordering::Relaxed);
while tail < published {
let idx = (tail % irq_ring.ring.size as u64) as usize;
// SAFETY: idx is within ring bounds. Entry was written by
// the Tier 0 handler and is complete (published).
let notif: &IrqNotification = unsafe {
irq_ring.ring.read_entry_as(idx)
};
match notif.event_type {
IrqRingEventType::HwIrq => {
// Hardware interrupt — dispatch and unmask.
handler.handle_irq(notif.vector, notif.timestamp);
// Unmask the vector so the next hardware interrupt can fire.
// This is done per-notification, not per-batch, because each
// notification may be for a different vector (shared IRQ or
// multi-queue device with multiple MSI-X vectors).
arch::current::interrupts::unmask_vector(notif.vector);
}
IrqRingEventType::TimerExpiry => {
// Timer expiry from Tier 0 timer wheel. Extract payload.
// SAFETY: payload is 32 bytes, same size as TimerExpiryPayload.
// The entry was written by the timer softirq with the correct
// layout. Both producer and consumer are in Ring 0.
let timer_payload: &TimerExpiryPayload = unsafe {
&*(notif.payload.as_ptr() as *const TimerExpiryPayload)
};
handler.handle_timer_expiry(
timer_payload.timer_id,
timer_payload.expiry_ns,
notif.timestamp,
);
// No vector unmasking — timer events are not hardware interrupts.
}
IrqRingEventType::Completion => {
// I/O completion forwarded from another domain.
// SAFETY: same layout argument as TimerExpiry above.
let comp_payload: &CompletionPayload = unsafe {
&*(notif.payload.as_ptr() as *const CompletionPayload)
};
handler.handle_completion(
comp_payload.cookie,
comp_payload.status,
notif.timestamp,
);
// No vector unmasking — completion events are not hardware interrupts.
}
}
tail += 1;
}
irq_ring.ring.tail.store(tail, Ordering::Release);
// Exit domain.
cpu_local.active_domain.store(0, Ordering::Relaxed);
}
}
/// Trait implemented by drivers that handle IRQ ring events.
///
/// The IDL compiler generates an implementation of this trait from the
/// driver's `.kabi` `interrupt_vectors` declaration. The driver author
/// provides the bodies of the handler methods.
///
/// The three methods correspond to the three `IrqRingEventType` variants.
/// The consumer loop dispatches to the correct method based on the
/// notification's `event_type` discriminant.
pub trait DriverIrqHandler {
/// Handle a hardware interrupt.
///
/// Called in the driver's domain with the vector's interrupt masked.
/// The driver should read device status, process completions, and
/// return. The consumer loop unmasks the vector after this returns.
///
/// # Arguments
///
/// - `vector`: The hardware interrupt vector that fired.
/// - `timestamp`: Cycle counter value when the interrupt was received
/// by the Tier 0 handler. The driver can compute interrupt latency
/// as `current_cycles - timestamp`.
fn handle_irq(&self, vector: u32, timestamp: u64);
/// Handle a timer expiry event from the Tier 0 timer wheel.
///
/// Called when a timer registered with this driver's `domain_id`
/// expires. The timer wheel (Tier 0 softirq) cannot call Tier 1
/// timer handlers directly — it enqueues a `TimerExpiry` event
/// on the driver's IRQ ring instead.
///
/// The driver uses `timer_id` to identify which timer expired
/// (e.g., for TCP: packed `(sock_handle, timer_type)` to dispatch
/// to the correct connection's retransmit/delack/keepalive handler).
///
/// **Stale expiry detection**: The driver should compare `expiry_ns`
/// against the timer's current armed time. If the timer was rearmed
/// after this event was enqueued (armed_time > expiry_ns), the event
/// is stale and should be discarded.
///
/// # Arguments
///
/// - `timer_id`: Opaque timer identifier provided at registration.
/// - `expiry_ns`: Absolute expiry time (CLOCK_MONOTONIC nanoseconds).
/// - `timestamp`: Cycle counter at the time the timer softirq posted
/// this event (for latency measurement).
///
/// # Default implementation
///
/// Drivers that do not use cross-domain timers (pure Tier 0 drivers,
/// or drivers whose timers run in their own domain) may leave this
/// as the default no-op.
fn handle_timer_expiry(&self, _timer_id: u64, _expiry_ns: u64, _timestamp: u64) {}
/// Handle an I/O completion event forwarded from another domain.
///
/// Called when a completion event arrives on this driver's IRQ ring.
/// This is used for cross-domain completion notification (e.g., a
/// Tier 1 block driver completing a bio and notifying the Tier 0
/// block layer via the outbound ring).
///
/// # Arguments
///
/// - `cookie`: Opaque completion identifier (e.g., bio pointer as u64).
/// - `status`: 0 = success, negative = -errno.
/// - `timestamp`: Cycle counter at posting time.
///
/// # Default implementation
///
/// Drivers that do not receive cross-domain completions may leave
/// this as the default no-op.
fn handle_completion(&self, _cookie: u64, _status: i32, _timestamp: u64) {}
}
12.8.7.3 Shared IRQ Handling¶
When multiple drivers register for the same interrupt vector (legacy INTx shared interrupts on PCI), the Tier 0 handler dispatches to ALL registered IRQ rings. Each driver's consumer checks whether the interrupt was for its device (by reading the device's interrupt status register):
/// IRQ dispatch table entry for shared interrupts.
// kernel-internal, not KABI
pub struct IrqOwner {
/// IRQ ring for this driver — used for non-NAPI device interrupts
/// (block completion, USB events, etc.). For NAPI vectors, the IRQ
/// ring is still used: the driver's consumer loop dequeues the IRQ
/// event and calls its NAPI poll function.
pub irq_ring: IrqRing,
/// CPU where the consumer thread is running. Used for IPI targeting
/// to wake the consumer on interrupt arrival.
pub consumer_cpu: u32,
/// Domain ID of the owning driver. Used by crash detection: if the
/// domain is in `DomainState::Crashed`, the IRQ handler masks the
/// vector permanently (no ring posting) until crash recovery restarts
/// the driver.
pub domain_id: DomainId,
/// NAPI context ID for NAPI-capable vectors. If `napi_id != 0`, the
/// vector is a NAPI vector and the IRQ handler writes to the IRQ ring
/// as usual (the driver's consumer loop dequeues and calls NAPI poll).
/// If `napi_id == 0`, this is a non-NAPI device vector.
pub napi_id: u32,
/// Next driver sharing this vector (linked list for shared IRQs).
/// `None` for exclusive (MSI/MSI-X) vectors — the common case.
pub next_shared: Option<*const IrqOwner>,
}
/// Dispatch to all drivers sharing a vector.
///
/// For exclusive vectors (MSI/MSI-X), `next_shared` is `None` and this
/// dispatches to exactly one driver — no overhead from the sharing path.
unsafe fn dispatch_shared_irq(vector: u32) {
let mut owner_opt = IRQ_DISPATCH_TABLE.get(vector as u64);
while let Some(owner) = owner_opt {
// Post to this driver's ring.
let notif = IrqNotification {
event_type: IrqRingEventType::HwIrq,
_pad0: [0u8; 3],
vector,
cpu: arch::current::cpu::cpu_id() as u32,
_pad1: [0u8; 4],
timestamp: arch::current::cpu::read_cycle_counter(),
payload: [0u8; 32],
_reserved: [0u8; 8],
};
if owner.irq_ring.ring.try_enqueue_cas(&notif).is_err() {
// Ring full: drop and count the loss, matching the exclusive path.
owner.irq_ring.lost_irqs.fetch_add(1, Ordering::Relaxed);
}
// Wake if sleeping.
if owner.irq_ring.consumer_sleeping.swap(0, Ordering::AcqRel) != 0 {
arch::current::interrupts::send_ipi(
owner.consumer_cpu,
KABI_DOORBELL_VECTOR,
);
}
owner_opt = owner.next_shared.map(|p| &*p);
}
}
12.8.7.4 Timer-to-Domain Dispatch Protocol¶
When a Tier 1 module (e.g., umka-net TCP, IPVS) registers a timer with the Tier 0
timer wheel or hrtimer subsystem, it includes its domain_id: DomainId in the
registration. On expiry, the timer subsystem uses the domain_id to route the
event to the correct driver's IRQ ring, rather than calling the callback directly.
Registration: The Tier 1 module calls timer_register_cross_domain() (defined
in Section 7.8) with:
- timer_id: u64 -- opaque identifier meaningful to the registering module.
- domain_id: DomainId -- the domain to deliver the expiry event to.
- expiry_ns: u64 -- absolute expiry time.
- timer_type: TimerType -- Wheel (coarse-grained) or HrTimer (high-resolution).
Expiry: When the timer fires in Tier 0 softirq context:
/// Called by the timer wheel or hrtimer expiry path when a cross-domain
/// timer fires. Enqueues a `TimerExpiry` event to the target domain's
/// IRQ ring. Runs in softirq context (no sleeping, no heavy allocation).
///
/// # Safety
///
/// Called from softirq context. Must not acquire sleeping locks.
/// The `domain_id` was validated at registration time -- the domain
/// exists and has an IRQ ring. If the domain has crashed since
/// registration, the ring enqueue fails silently (ring is Disconnected)
/// and the event is dropped.
pub fn timer_fire_to_domain(
domain_id: DomainId,
timer_id: u64,
expiry_ns: u64,
) {
// Look up the domain's IRQ ring via the domain registry.
let domain = match DOMAIN_REGISTRY.get(domain_id) {
Some(d) => d,
None => return, // Domain gone (shutdown/crashed) -- discard event.
};
// Check domain is active -- no point delivering to a crashed domain.
if domain.domain_crashed() {
return;
}
// Build timer expiry payload.
let mut payload = [0u8; 32];
let timer_payload = TimerExpiryPayload {
timer_id,
expiry_ns,
_reserved: [0u8; 16],
};
// SAFETY: TimerExpiryPayload is 32 bytes, same as payload.
unsafe {
core::ptr::copy_nonoverlapping(
&timer_payload as *const TimerExpiryPayload as *const u8,
payload.as_mut_ptr(),
32,
);
}
let notif = IrqNotification {
event_type: IrqRingEventType::TimerExpiry,
_pad0: [0u8; 3],
vector: 0, // Not a hardware interrupt.
cpu: arch::current::cpu::cpu_id() as u32,
_pad1: [0u8; 4],
timestamp: arch::current::cpu::read_cycle_counter(),
payload,
_reserved: [0u8; 8],
};
// Find the first active IRQ ring for this domain.
// Timer events are dispatched to the domain's primary consumer thread.
// The domain service thread then dispatches to the correct module
// based on the timer_id.
let irq_rings = domain.irq_rings.lock();
if let Some(desc) = irq_rings.first() {
match desc.ring.ring.try_enqueue_cas(&notif) {
Ok(()) => {}
Err(()) => {
desc.ring.lost_irqs.fetch_add(1, Ordering::Relaxed);
}
}
// Wake consumer if sleeping.
if desc.ring.consumer_sleeping.swap(0, Ordering::AcqRel) != 0 {
arch::current::interrupts::send_ipi(
desc.consumer_cpu,
KABI_DOORBELL_VECTOR,
);
}
}
}
Reverse direction (Tier 1 to Tier 0 completion): When a Tier 1 driver
completes an I/O operation (e.g., NVMe CQE processing), it cannot call
bio_complete() directly (that function is in the Tier 0 block layer -- a
different domain). Instead, the driver enqueues a Completion event on its
outbound KABI ring. The Tier 0 block layer consumer dequeues the completion
and calls bio_complete() in Tier 0 context. See
Section 15.19 for the NVMe-specific application of this
pattern.
12.8.7.5 Crash Behavior¶
If a driver crashes, the Tier 0 IRQ handler continues to ACK and mask interrupts
for the driver's vectors. The IRQ ring transitions to Disconnected (set by the
crash recovery path, step 2' of Section 11.9). No
interrupt storm can occur because vectors remain masked. When the replacement driver
re-initializes:
- A new IRQ ring is allocated (the old ring's memory is freed after drain).
- The IRQ dispatch table is updated to point to the new ring.
- The new consumer loop starts and unmasks vectors as it processes the first batch.
- Hardware interrupt delivery resumes normally.
The window between crash and driver restart (~50-150ms) is an interrupts-masked period for the affected device. For NVMe devices this means I/O completions are delayed but not lost (the device retains completions in its hardware CQ). For network devices, packets are queued at the NIC (RSS queues are not drained by masking the IRQ).
12.8.8 Rebinding on Promotion/Demotion¶
When an operator writes to /ukfs/kernel/drivers/<name>/tier or the FMA subsystem
(Section 20.1) triggers an automatic tier change, the module
must move from one domain to another. All handles pointing to the module's services
must be atomically rebound — callers switch from direct to ring (or vice versa)
without code changes.
12.8.8.1 Rebinding State Machine¶
SERVING → QUIESCING → MIGRATING → REBINDING → SERVING (new domain)

On failure:

MIGRATING → ROLLBACK → SERVING (original domain, unchanged)
| State | Description | Duration |
|---|---|---|
| Quiescing | New ring submissions return EAGAIN. Drain in-flight completions. | ~1-10ms (ring drain) |
| Migrating | Module binary relocated to new domain. Hardware isolation configured. | ~5-50ms (ELF loading, domain setup) |
| Rebinding | Global registry notifies all domain services. Handles atomically switched. | ~100us-1ms (IPI + handle updates) |
12.8.8.2 Rebinding Protocol¶
/// Migrate a module from one domain to another (promotion/demotion).
///
/// This is the complete protocol. It is orchestrated by Tier 0 (the
/// global registry) because it crosses domain boundaries.
///
/// # Arguments
///
/// - `module_id`: Module being migrated.
/// - `source_domain`: Current domain.
/// - `target_tier`: Destination tier (0, 1, or 2).
///
/// # Invariants
///
/// - The module's service availability is preserved: callers observe at
/// most a brief latency spike (ring drain time), never a service outage.
/// - No request is lost: all in-flight requests complete before migration.
/// - No request is duplicated: the module processes each request exactly once.
pub fn migrate_module(
module_id: ModuleId,
source_domain: DomainId,
target_tier: IsolationTier,
) -> Result<(), KabiError> {
// Step 1: Mark module as migrating in the global registry.
// New service lookups for this module return the current binding
// (not the pending migration target) until rebinding is complete.
GLOBAL_SERVICE_REGISTRY.set_migrating(module_id, true);
// Step 2: Quiesce — stop accepting new requests.
// Set all inbound rings' state to Draining. Producers receive
// EAGAIN and must retry after a backoff. Existing in-flight
// requests continue to completion.
let source_service = domain_registry::service_of(source_domain);
let module = source_service.get_module(module_id)?;
for ring in module.inbound_rings() {
ring.ring_header().state.store(
RING_STATE_DRAINING,
Ordering::Release,
);
}
// Step 3: Wait for in-flight completions.
// Bounded wait: if completions don't arrive within MIGRATE_DRAIN_TIMEOUT_NS,
// the remaining in-flight requests are cancelled with -EAGAIN.
//
// Uses schedule_timeout() instead of spin-polling to yield the CPU
// while waiting. The consumer loop posts a wakeup to the drain_waitqueue
// when the last in-flight request completes.
let deadline_ns = arch::current::cpu::ktime_get_ns() + MIGRATE_DRAIN_TIMEOUT_NS;
for ring in module.inbound_rings() {
while ring.has_inflight() {
let remaining_ns = deadline_ns.saturating_sub(
arch::current::cpu::ktime_get_ns(),
);
if remaining_ns == 0 {
ring.cancel_inflight(KabiError::MigrationTimeout);
break;
}
// Sleep until woken by the consumer loop completing the last
// in-flight request, or until the timeout expires. The ring's
// drain_waitqueue is signaled by the consumer loop after
// processing each completion while in DRAINING state.
ring.drain_waitqueue.wait_timeout(remaining_ns);
}
}
// Step 4: Disconnect old rings.
for ring in module.inbound_rings() {
ring.ring_header().state.store(RING_STATE_DISCONNECTED, Ordering::Release);
}
// Step 5: Determine target domain. Reuse an existing compatible
// domain if available, otherwise allocate a new one. For Tier 1,
// modules with the same domain class (e.g., "block", "network")
// share a domain — this matches the domain grouping policy in
// [Section 11.3](11-drivers.md#driver-isolation-tiers--domain-grouping-policy).
let target_domain = match target_tier {
IsolationTier::Tier0 => {
// Moving to Tier 0 (Core domain). No new domain needed.
CORE_DOMAIN_ID
}
IsolationTier::Tier1 => {
// Check if an existing Tier 1 domain for this module's
// domain class has capacity for another module.
let domain_class = module.domain_class();
match domain_registry::find_compatible_domain(
IsolationTier::Tier1,
domain_class,
) {
Some(existing_domain) => existing_domain,
None => {
// No compatible domain with capacity — allocate new.
let domain_id = arch::current::isolation::allocate_isolation_domain()?;
domain_registry::create_domain(
domain_id,
IsolationTier::Tier1,
);
domain_id
}
}
}
IsolationTier::Tier2 => {
// Spawn a new process for the Tier 2 driver. Tier 2 domains
// are always per-process (no sharing).
let proc = process::spawn_driver_process(module_id)?;
proc.domain_id()
}
};
// Step 6: Re-register module in the new domain.
// The module re-executes the Hello protocol (register → resolve → ready),
// but with preserved state: device handles, DMA mappings, and driver-internal
// state are carried over via the Module Binary Store snapshot.
let target_service = domain_registry::service_of(target_domain);
let manifest = module.manifest().clone();
let _reg = match target_service.register(&manifest) {
Ok(reg) => reg,
Err(e) => {
// Rollback: restore original domain, re-activate rings.
rollback_migration(module_id, source_domain, &module)?;
return Err(e);
}
};
// Step 7: Set up new rings and resolve dependencies in the new domain.
match target_service.resolve_all(module_id) {
Ok(_handles) => {}
Err(e) => {
rollback_migration(module_id, source_domain, &module)?;
return Err(e);
}
}
target_service.announce_ready(module_id)?;
// Step 8: Atomic rebind — notify all domain services with handles
// pointing to this module's services. Each affected handle is
// updated to the new transport.
let affected_domains = GLOBAL_SERVICE_REGISTRY.domains_bound_to(module_id);
for domain_id in affected_domains {
let ds = domain_registry::service_of(domain_id);
// For each handle in this domain that points to the migrated module:
// - If now same-domain → switch to Direct
// - If now different-domain → create new ring, switch to Ring
let _ = ds.rebind_handles_for_module(module_id, target_domain);
}
// Step 9: Tear down old domain resources.
for ring in module.inbound_rings() {
teardown_cross_domain_ring(ring);
}
if source_domain != CORE_DOMAIN_ID
&& domain_registry::module_count(source_domain) == 0
{
// Source domain is now empty — release it.
arch::current::isolation::release_isolation_domain(source_domain);
domain_registry::destroy_domain(source_domain);
}
// Step 10: Clear migration flag.
GLOBAL_SERVICE_REGISTRY.set_migrating(module_id, false);
Ok(())
}
/// Drain timeout for module migration: 100ms.
///
/// Covers worst-case I/O completion latency (NVMe: ~10ms, disk: ~50ms).
/// If a request takes longer than 100ms to complete, it is cancelled
/// with -EAGAIN during migration. The application retries after the
/// module is available in its new domain.
pub const MIGRATE_DRAIN_TIMEOUT_NS: u64 = 100_000_000;
/// Ring state constants.
pub const RING_STATE_ACTIVE: u8 = 0;
pub const RING_STATE_DISCONNECTED: u8 = 1;
pub const RING_STATE_DRAINING: u8 = 2;
/// Rollback a failed migration. Restores the module to its source domain.
fn rollback_migration(
module_id: ModuleId,
source_domain: DomainId,
module: &ModuleDescriptor,
) -> Result<(), KabiError> {
// Re-activate all inbound rings.
for ring in module.inbound_rings() {
ring.ring_header().state.store(RING_STATE_ACTIVE, Ordering::Release);
}
// Clear migration flag.
GLOBAL_SERVICE_REGISTRY.set_migrating(module_id, false);
Ok(())
}
12.8.8.3 Atomic Handle Rebinding¶
The rebind_handles_for_module operation updates all handles in a domain that
reference the migrated module. The update is atomic from the caller's perspective:
the generation counter ensures no dispatch uses a half-updated handle.
impl DomainService {
/// Rebind all handles in this domain that point to services provided
/// by `module_id`, now located in `new_domain`.
///
/// For each affected handle:
/// 1. If `new_domain == self.domain_id` → switch to Direct transport.
/// 2. If `new_domain != self.domain_id` → create ring, switch to Ring.
/// 3. Update `cached_generation` to the module's current generation.
///
/// Handles that were Direct and remain Direct (module moved within
/// the same domain) only need a vtable pointer update.
///
/// Returns `Ok(rebind_count)` on success. On ring setup failure for
/// individual handles, invalidates them (setting generation to 0,
/// which forces re-resolution on next kabi_call!).
pub fn rebind_handles_for_module(
&self,
module_id: ModuleId,
new_domain: DomainId,
) -> Result<usize, KabiError> {
let modules = self.modules.lock();
let mut rebind_count: usize = 0;
for module in modules.iter() {
let mut handles = module.handles.lock();
for handle_opt in handles.iter_mut() {
let handle = match handle_opt {
Some(h) => h,
None => continue,
};
// Check if this handle points to a service from the migrated module.
let entry = GLOBAL_SERVICE_REGISTRY.entry_for_handle(handle);
if entry.module_id != module_id {
continue;
}
// Determine new transport.
if new_domain == self.domain_id {
handle.transport = KabiTransport::Direct {
vtable: entry.vtable_ptr,
ctx: entry.ctx_ptr,
};
} else {
match self.setup_or_reuse_ring(
self.domain_id,
new_domain,
&entry.service_id,
) {
Ok(ring) => {
handle.transport = KabiTransport::Ring { ring };
}
Err(_) => {
// Ring setup failed (ENOMEM or domain key exhausted).
// Invalidate this handle so the caller gets
// StaleHandle and re-resolves. This is safe because
// the module is still in a valid domain — re-resolution
// will succeed once memory pressure subsides.
handle.cached_generation = 0;
continue;
}
}
}
// Update the cached generation last, so the handle only compares
// equal to the live counter once the new transport fields are in
// place. The counter is loaded with Acquire to pair with the
// registry's Release increment performed when the module came up
// in its new domain.
let new_gen = entry.generation.load(Ordering::Acquire);
handle.cached_generation = new_gen;
rebind_count += 1;
}
}
Ok(rebind_count)
}
}
12.8.8.4 Generation Counter Hierarchy¶
UmkaOS uses multiple generation counters at different levels of the domain model. This section documents their relationship and authoritative sources to prevent confusion and stale-counter bugs.
| Counter | Location | Incremented when | Authority |
|---|---|---|---|
| Domain generation | `DomainService.generation: AtomicU64` | Every crash recovery cycle for this domain | Primary authority for domain-level staleness. All handles within the domain become stale when this increments. |
| Registry entry generation | `RegistryEntry.generation: Box<AtomicU64>` | Module crash (odd→even), module ready (even→odd) | Per-service instance. Mirrors the domain generation pattern from `DriverDomain.generation` in Section 12.3. The Box allocation ensures a stable address independent of XArray reallocation. |
| Handle generation | `KabiHandle.generation: *const AtomicU64` | Never (pointer, not counter) | Points to the `RegistryEntry.generation` Box. The `cached_generation` field holds the snapshot taken at bind time. Compared on every `kabi_call!` dispatch with Acquire ordering. |
| Ring generation | `CrossDomainRing.generation: AtomicU64` | Ring teardown/recreation | Per-ring. Independent of module generation — a ring may be recreated during migration without the module crashing. |
Invariant: RegistryEntry.generation and DriverDomain.generation (in
Section 12.3) represent the SAME logical counter for the
same module. This section is the canonical definition of the generation
protocol (odd = active, even = inactive, two increments per crash cycle);
the bilateral capability exchange discussion should reference it rather
than restate it. See Section 12.3 for the u64
longevity analysis (584 years at 1 billion reloads/sec).
12.8.9 Per-Domain Service¶
Each isolation domain has a DomainService instance that manages module lifecycle,
binding resolution, IRQ routing, and crash notification within that domain.
12.8.9.1 DomainService Structure¶
/// Per-domain service infrastructure.
///
/// One instance exists for each isolation domain. For Tier 0 (Core domain),
/// this is a static global in umka-core. For Tier 1 domains, it is
/// allocated in kernel memory (accessible from the domain via the shared
/// ring page). For Tier 2 domains, a lightweight version runs as a
/// userspace library linked into the driver process.
///
/// All mutable fields use interior mutability (`SpinLock`, `AtomicU8`)
/// because the domain service is shared between the domain's consumer
/// threads and the Tier 0 management path (crash handler, migration).
// kernel-internal, not KABI
pub struct DomainService {
/// Unique domain identifier ([Section 11.2](11-drivers.md#isolation-mechanisms-and-performance-modes)).
pub domain_id: DomainId,
/// Binding table: maps ServiceId → transport binding.
/// XArray for O(1) lookup by the ServiceId's packed u64 key
/// (first 8 bytes of `ServiceId.name` hashed to u64 via SipHash).
///
/// Hot path: `resolve()` during module init (warm, bounded).
/// Not accessed on per-call dispatch (handles are cached).
pub binding_table: SpinLock<XArray<BindingEntry>>,
/// Modules loaded in this domain. Bounded by the maximum number
/// of modules per domain (hardware-dependent: x86 MPK domains can
/// host ~4-8 modules before memory pressure; Tier 2 domains host
/// exactly 1 module per process).
///
/// MAX_MODULES_PER_DOMAIN = 16: sufficient for the largest domain
/// grouping scenario (e.g., a "block" domain containing NVMe, AHCI,
/// VirtIO-blk, and SCSI drivers).
pub modules: SpinLock<ArrayVec<ModuleDescriptor, MAX_MODULES_PER_DOMAIN>>,
/// IRQ rings for all modules in this domain that claim interrupts.
/// Bounded: MAX_IRQ_RINGS_PER_DOMAIN = 64 (16 modules × 4 avg
/// interrupt vectors per module, with headroom for multi-queue devices).
pub irq_rings: SpinLock<ArrayVec<IrqRingDescriptor, MAX_IRQ_RINGS_PER_DOMAIN>>,
/// Domain health state. Written atomically by the crash handler.
/// 0 = Normal (all modules healthy)
/// 1 = Recovering (crash detected, recovery in progress)
/// 2 = Faulted (recovery failed, domain permanently down)
pub crash_state: AtomicU8,
/// Generation counter for this domain. Incremented on every crash
/// recovery cycle. Used to invalidate stale handles.
pub generation: AtomicU64,
}
pub const MAX_MODULES_PER_DOMAIN: usize = 16;
pub const MAX_IRQ_RINGS_PER_DOMAIN: usize = 64;
/// Binding table entry: maps a service to its transport.
// kernel-internal, not KABI
pub struct BindingEntry {
/// The bound service.
pub service_id: ServiceId,
/// Resolved transport (Direct or Ring).
pub transport: KabiTransport,
/// Generation at bind time. Compared against the target module's
/// live generation to detect stale bindings. AtomicU64 so that the
/// crash handler in `notify_crash()` can iterate binding_table entries
/// via shared reference (`iter()`, not `iter_mut()`) and invalidate
/// generations without requiring `&mut` access to each entry. This
/// avoids the need for `iter_mut()` through SpinLock, which would
/// require exclusive mutable access to the XArray's internal nodes.
pub generation: AtomicU64,
/// Module providing this service.
pub module_id: ModuleId,
}
/// IRQ ring descriptor in the per-domain IRQ ring table.
// kernel-internal, not KABI
pub struct IrqRingDescriptor {
/// Module that owns this IRQ ring.
pub module_id: ModuleId,
/// The IRQ ring itself.
pub ring: IrqRing,
/// CPU where the consumer thread runs.
pub consumer_cpu: u32,
/// Hardware interrupt vectors routed to this ring.
pub vectors: ArrayVec<u32, MAX_IRQS_PER_MODULE>,
}
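The binding-table key packing mentioned in the `binding_table` doc comment (first 8 bytes of the service name, hashed to a u64) can be sketched like this. This is an illustrative assumption, not the kernel's code: the real implementation uses a fixed-key SipHash, while std's `DefaultHasher` (an unspecified SipHash variant) stands in here, and the service names are made up.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative sketch: derive the packed u64 XArray key from the first
/// 8 bytes of a service name. DefaultHasher stands in for the kernel's
/// fixed-key SipHash.
fn service_key(name: &str) -> u64 {
    let bytes = name.as_bytes();
    let prefix = &bytes[..bytes.len().min(8)];
    let mut h = DefaultHasher::new();
    prefix.hash(&mut h);
    h.finish()
}
```

Note that only the first 8 bytes feed the hash, so names sharing a long common prefix (e.g. a hypothetical `umka.block.nvme` vs `umka.block.ahci`) land in the same XArray slot. This is exactly why the global registry chains entries at a slot and disambiguates by full `ServiceId` comparison.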
12.8.9.2 DomainService Operations¶
impl DomainService {
/// Handle a crash notification for a module in this domain.
///
/// Called by the Tier 0 crash handler after domain isolation
/// (step 2 of [Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)).
///
/// # Steps
///
/// 1. Mark domain as Recovering.
/// 2. Invalidate all handles to services provided by the crashed module.
/// 3. Set all rings owned by the crashed module to Disconnected.
/// 4. Increment domain generation (invalidates cached handles globally).
/// 5. Trigger recovery: reload the module from the Module Binary Store.
/// 6. On successful reload: mark domain as Normal.
/// 7. On failure: mark domain as Faulted.
pub fn notify_crash(&self, module_id: ModuleId) {
self.crash_state.store(1, Ordering::Release); // Recovering
// Invalidate bindings.
{
let binding_table = self.binding_table.lock();
// XArray iteration via shared reference — walk all entries,
// mark those from the crashed module as stale by atomically
// incrementing their generation. AtomicU64 allows mutation
// through shared reference without requiring iter_mut().
for entry in binding_table.iter() {
if entry.module_id == module_id {
entry.generation.fetch_add(1, Ordering::Release);
}
}
}
// Set rings to Disconnected.
{
let irq_rings = self.irq_rings.lock();
for desc in irq_rings.iter() {
if desc.module_id == module_id {
desc.ring.ring.state.store(
RING_STATE_DISCONNECTED,
Ordering::Release,
);
}
}
}
// Increment domain generation.
self.generation.fetch_add(1, Ordering::SeqCst);
// Steps 5-7 (module reload and the final Normal/Faulted transition)
// are driven by the recovery path, not by this notification handler.
}
/// Notify the domain service that a module in ANOTHER domain has
/// become available. Wakes ALL local modules that were deferred
/// in RESOLVING state waiting for this service.
pub fn notify_service_available(
&self,
service_id: &ServiceId,
_provider_domain: DomainId,
) {
// Phase 1: Collect IDs of all affected modules while holding the
// lock. We cannot call resolve_all() with the lock held because
// it acquires GLOBAL_SERVICE_REGISTRY (different lock ordering).
let mut affected: ArrayVec<ModuleId, MAX_MODULES_PER_DOMAIN> = ArrayVec::new();
{
let modules = self.modules.lock();
for module in modules.iter() {
if module.state.load(Ordering::Acquire)
!= ModuleState::Resolving as u8
{
continue;
}
for req in &module.requires {
if req.service_id == *service_id {
affected.push(module.module_id);
break; // No need to check remaining reqs for this module.
}
}
}
} // modules lock released
// Phase 2: Attempt re-resolution for each affected module.
// If successful, the module transitions from RESOLVING to READY.
// Failures are silent — the module stays in RESOLVING until the
// next notification.
for mid in &affected {
let _ = self.resolve_all(*mid);
}
}
}
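The two-phase shape of `notify_service_available` (collect affected IDs under the local lock, then act after releasing it) is a general pattern for respecting lock ordering. A minimal sketch, with `std::sync::Mutex` standing in for `SpinLock` and plain tuples for module descriptors:

```rust
use std::sync::Mutex;

/// Collect matching IDs while holding the local lock, release it, then
/// act on each ID. Acting while the lock is held would acquire a
/// differently-ordered lock (the global registry) and risk deadlock.
fn collect_then_act(
    modules: &Mutex<Vec<(u32 /* id */, bool /* resolving */)>>,
    mut act: impl FnMut(u32),
) -> usize {
    // Phase 1: snapshot the affected IDs under the lock.
    let affected: Vec<u32> = {
        let guard = modules.lock().unwrap();
        guard.iter().filter(|(_, r)| *r).map(|(id, _)| *id).collect()
    }; // lock released here
    // Phase 2: act outside the lock; failures here cannot deadlock.
    for id in &affected {
        act(*id);
    }
    affected.len()
}
```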
12.8.9.3 Global Service Registry¶
/// Global service registry. Tier 0 singleton that maps ServiceId to
/// (DomainId, ModuleId, generation, vtable_ptr, ctx_ptr).
///
/// Updated on every module register/unregister/migrate. Read by
/// `DomainService::resolve()` to determine transport at bind time.
///
/// Uses XArray keyed by `ServiceId.hash() -> u64` for O(1) lookup.
/// Collision handling: chained entries at the same XArray slot,
/// disambiguated by full `ServiceId` comparison.
///
/// **Concurrency**: The XArray natively supports RCU-protected reads
/// (`xa_load()` under `rcu_read_lock()`). The SpinLock protects WRITE
/// operations only (publish, unpublish, set_migrating). The read path
/// (`lookup()`) is entirely lock-free — it uses `rcu_read_lock()` +
/// `xa_load()` to safely dereference the XArray entry. Write frequency
/// is low (module load/unload/migrate).
///
/// **Lock ordering** (documented here as the canonical reference):
///
/// ```text
/// GLOBAL_SERVICE_REGISTRY.entries_lock [level 40]
/// └─ DomainService.modules [level 50]
/// └─ ModuleDescriptor.handles [level 60]
/// └─ DomainService.binding_table [level 70]
/// ```
///
/// No path may acquire a lower-level lock while holding a higher-level
/// lock. The `Lock<T, LEVEL>` compile-time ordering enforcement
/// ([Section 3.5](03-concurrency.md#locking-strategy)) makes violations compilation errors.
// kernel-internal, not KABI
pub struct GlobalServiceRegistry {
/// ServiceId hash → RegistryEntry. RCU-protected for read-side,
/// SpinLock-protected for write-side. XArray's internal structure
/// uses RCU for safe concurrent reads without the write lock.
entries: SpinLock<XArray<RegistryEntry>>,
/// Deferred wakeup list: modules waiting for services that
/// are not yet loaded. Keyed by ServiceId hash.
deferred: SpinLock<XArray<ArrayVec<DeferredWakeup, 16>>>,
}
impl GlobalServiceRegistry {
/// Look up a service provider by ServiceId. Lock-free read path.
///
/// Uses RCU read-side protection to safely dereference XArray entries
/// without acquiring the write-side SpinLock. This is the hot path
/// called by `DomainService::resolve()` during module initialization.
///
/// Returns `None` if no provider is registered for this service,
/// or if the only provider is currently migrating.
pub fn lookup(
&self,
service_id: &ServiceId,
min_version: KabiVersion,
) -> Option<&RegistryEntry> {
let _guard = rcu_read_lock();
// xa_load is safe under rcu_read_lock — the returned reference is
// valid until rcu_read_unlock (guard drop). Note that the read side
// deliberately bypasses the write-side SpinLock: xa_load walks the
// RCU-protected XArray nodes directly, per the concurrency notes above.
let key = service_id.hash();
let entry = self.entries.xa_load(key)?;
// Disambiguate hash collisions.
if entry.service_id != *service_id {
return None;
}
// Check version compatibility.
if entry.max_version < min_version {
return None;
}
// Skip migrating entries — the module is between domains.
if entry.migrating.load(Ordering::Acquire) != 0 {
return None;
}
Some(entry)
}
}
/// Entry in the global service registry.
pub struct RegistryEntry {
/// Domain where the provider module is running.
pub domain_id: DomainId,
/// Module providing this service.
pub module_id: ModuleId,
/// Service identifier (for collision disambiguation).
pub service_id: ServiceId,
/// Pointer to the vtable (valid in the provider's domain).
pub vtable_ptr: *const (),
/// Opaque context pointer.
pub ctx_ptr: *mut (),
/// Version range provided.
pub min_version: KabiVersion,
pub max_version: KabiVersion,
/// Live generation counter for stale-handle detection.
///
/// Allocated separately via `Box` to ensure a stable address independent
/// of XArray node reallocation. All `KabiHandle.generation` raw pointers
/// point to this Box-owned AtomicU64. The Box is only freed when the
/// service is permanently removed (after all handles have been
/// invalidated by a generation bump). See "Generation Counter Hierarchy"
/// below for how this relates to domain-level and handle-level generations.
pub generation: Box<AtomicU64>,
/// Migration in progress flag.
pub migrating: AtomicU8, // 0 = stable, 1 = migrating
}
/// Deferred wakeup entry: a module waiting for a service.
struct DeferredWakeup {
module_id: ModuleId,
domain_id: DomainId,
}
/// Singleton instance.
pub static GLOBAL_SERVICE_REGISTRY: GlobalServiceRegistry = GlobalServiceRegistry {
entries: SpinLock::new(XArray::new()),
deferred: SpinLock::new(XArray::new()),
};
// CORE_DOMAIN_ID is defined in "Canonical Type Definitions" above.
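The compile-time lock-ordering enforcement referenced above can be approximated with const generics. This is a hedged sketch of how `Lock<T, LEVEL>` from Section 3.5 *might* work, not its actual definition; `std::sync::Mutex` stands in for the kernel SpinLock.

```rust
use std::ops::Deref;
use std::sync::{Mutex, MutexGuard};

/// Sketch of a level-tagged lock. LEVEL matches the lock-ordering
/// diagram (registry = 40, modules = 50, ...).
pub struct Lock<T, const LEVEL: u8> {
    inner: Mutex<T>,
}

/// Monomorphization-time proof that HELD < NEXT; referencing `OK` with
/// HELD >= NEXT is a compile-time (const-eval) error.
struct AssertLt<const HELD: u8, const NEXT: u8>;
impl<const HELD: u8, const NEXT: u8> AssertLt<HELD, NEXT> {
    const OK: () = assert!(HELD < NEXT, "lock ordering violation");
}

pub struct Guard<'a, T, const LEVEL: u8>(MutexGuard<'a, T>);

impl<T, const LEVEL: u8> Deref for Guard<'_, T, LEVEL> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.0
    }
}

impl<T, const LEVEL: u8> Lock<T, LEVEL> {
    pub fn new(v: T) -> Self {
        Self { inner: Mutex::new(v) }
    }
    /// Acquire with no lock held.
    pub fn lock(&self) -> Guard<'_, T, LEVEL> {
        Guard(self.inner.lock().unwrap())
    }
    /// Acquire while a lower-level guard is held; only compiles when
    /// HELD < LEVEL, so out-of-order acquisition is rejected statically.
    pub fn lock_after<U, const HELD: u8>(
        &self,
        _held: &Guard<'_, U, HELD>,
    ) -> Guard<'_, T, LEVEL> {
        let () = AssertLt::<HELD, LEVEL>::OK;
        Guard(self.inner.lock().unwrap())
    }
}
```

With this shape, acquiring `DomainService.modules` (level 50) while holding the registry lock (level 40) compiles, while the reverse order fails const evaluation.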
12.8.9.4 Tier-Specific Domain Service Deployment¶
| Tier | Domain Service location | Lifecycle | Notes |
|---|---|---|---|
| Tier 0 | Static global in umka-core (`CORE_DOMAIN_SERVICE`) | Lives for kernel lifetime | Manages all Tier 0 loadable modules. Single instance. |
| Tier 1 | Allocated in kernel memory at domain creation | Created with domain, destroyed when domain empties | Accessible from the Tier 1 domain via the shared ring page (the domain service state is in kernel memory, not in the domain's private region). |
| Tier 2 | Userspace library (`libumka_domain.so`) linked into driver process | Created at process spawn, destroyed at process exit | Communicates with the Tier 0 global registry via a dedicated control ring (separate from the service rings). Handles local binding resolution and lifecycle management. |
The Tier 2 domain service library provides the same register(), resolve(),
announce_ready() API as the kernel-side domain service, but translates them into
ring messages to the Tier 0 control plane. From the driver author's perspective,
the API is identical regardless of tier.
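A minimal sketch of that tier-agnostic surface. The method names (`register`, `resolve`, `announce_ready`) come from the text; the trait shape, error type, and `ToyDomainService` are illustrative assumptions, not the real KABI.

```rust
/// The surface a driver author sees, regardless of tier.
trait DomainServiceApi {
    fn register(&mut self, service: &str) -> Result<(), &'static str>;
    fn resolve(&self, service: &str) -> Result<u64, &'static str>; // opaque handle
    fn announce_ready(&mut self);
}

/// In-process stand-in for the kernel-side DomainService. A Tier 2
/// implementation would expose the same trait but translate each call
/// into a message on the control ring to the Tier 0 registry.
struct ToyDomainService {
    services: Vec<String>,
    ready: bool,
}

impl DomainServiceApi for ToyDomainService {
    fn register(&mut self, service: &str) -> Result<(), &'static str> {
        self.services.push(service.to_string());
        Ok(())
    }
    fn resolve(&self, service: &str) -> Result<u64, &'static str> {
        // Return the index as a stand-in for a real transport handle.
        self.services
            .iter()
            .position(|s| s == service)
            .map(|i| i as u64)
            .ok_or("service not registered")
    }
    fn announce_ready(&mut self) {
        self.ready = true;
    }
}
```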
12.8.10 Implementation Phases¶
| Component | Phase | Rationale |
|---|---|---|
| `KabiHandle`, `KabiTransport`, `kabi_call!` (Direct path only) | Phase 1 | Required for any Tier 0 driver to function |
| `DomainRingBuffer`, T1 command/completion flow | Phase 1 | Required for Tier 1 isolation |
| `DomainService` (Tier 0 only), `GlobalServiceRegistry` | Phase 1 | Module lifecycle management for Tier 0 loadable modules |
| `kabi_call!` (Ring path), consumer loop | Phase 2 | Tier 1 driver dispatch |
| `setup_cross_domain_ring`, ring teardown | Phase 2 | Cross-domain ring lifecycle |
| IRQ ring, `generic_irq_handler`, `irq_consumer_loop` | Phase 2 | Two-phase interrupt delivery for Tier 1 |
| `DomainService` (Tier 1), dependency resolution | Phase 2 | Full Tier 1 module lifecycle |
| Module Hello protocol (full state machine) | Phase 2 | Module lifecycle with deferred dependencies |
| `kabi_call_async!`, `KabiCookie` | Phase 2 | Batching optimization |
| `migrate_module`, rebinding protocol | Phase 3 | Runtime promotion/demotion |
| `DomainService` (Tier 2 library) | Phase 3 | Tier 2 driver support |
| T2 ring integration, Tier 2 process lifecycle | Phase 3 | Full three-tier deployment |
See Section 24.2 for the complete phase definitions and dependencies.
12.8.11 Replaceability Classification¶
| Component | Classification | Rationale |
|---|---|---|
| `KabiHandle`, `KabiTransport` struct layouts | Nucleus (data) | Wire format between domains; changing layout requires migration. |
| `DomainRingBuffer` header layout | Nucleus (data) | Shared memory protocol; both sides must agree on field offsets. |
| `kabi_call!` dispatch logic | Evolvable | The macro expansion (branch + dispatch) can be replaced to optimize for new transport mechanisms. |
| Consumer loop body | Evolvable | Dispatch logic, batching strategy, error handling — all replaceable without affecting ring format. |
| `DomainService` binding resolution | Evolvable | Policy decisions (which domain to place a module in, how to resolve multi-provider conflicts) are tuneable. |
| `GlobalServiceRegistry` lookup | Evolvable | The XArray-based registry can be replaced with a more efficient structure if needed. |
| IRQ ring protocol | Nucleus (data) + Evolvable (handlers) | `IrqNotification` layout, `IrqRingEventType` discriminants, `TimerExpiryPayload` and `CompletionPayload` layouts are fixed (shared between Tier 0 producers and Tier 1 consumer). Handler and dispatch logic are replaceable. |
| Timer-to-domain dispatch | Evolvable | `timer_fire_to_domain()` logic (domain lookup, ring selection) is replaceable. The `TimerExpiryPayload` wire format is Nucleus. |
| Migration protocol | Evolvable | The quiesce → migrate → rebind sequence can be improved without changing data formats. |
12.8.12 ML Policy Integration¶
The domain runtime exposes the following observation points to the ML policy framework (Section 23.1):
| Observation | `observe_kernel!` call site | Data |
|---|---|---|
| Ring occupancy | Consumer loop, per-batch | `(domain_id, ring_id, batch_size, published - tail)` |
| Dispatch latency | `kabi_call!`, per-call | `(domain_id, method_index, latency_ns)` |
| IRQ latency | IRQ consumer, per-notification | `(vector, driver_id, current_cycles - timestamp)` |
| Timer delivery latency | `irq_consumer_loop`, per-TimerExpiry | `(domain_id, timer_id, current_cycles - timestamp)` |
| Completion delivery latency | `irq_consumer_loop`, per-Completion | `(domain_id, cookie, current_cycles - timestamp)` |
| Handle rebind | `rebind_handles_for_module` | `(module_id, old_domain, new_domain, rebind_duration_ns)` |
| Module lifecycle | State transitions | `(module_id, old_state, new_state)` |
Tunable parameters registered with the ML framework:
| ParamId | Description | Bounds | Default |
|---|---|---|---|
| `domain.ring_capacity` | Ring entries per cross-domain ring | [64, 4096] | 256 |
| `domain.overflow_chunk_size` | Per-slot overflow chunk size (bytes) | [0, 4096] | 256 |
| `domain.irq_ring_capacity` | IRQ ring entries per driver | [16, 256] | 64 |
| `domain.drain_timeout_ns` | Migration drain timeout | [10_000_000, 1_000_000_000] | 100_000_000 |
| `domain.kabi_timeout_ns` | Default KABI completion timeout | [100_000_000, 60_000_000_000] | 5_000_000_000 |
All parameters are bounded — ML suggestions outside the range are clamped. The invariant checkers (Section 13.18) verify that parameter changes do not violate ring protocol constraints (e.g., ring capacity must be a power of two).
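As one example, a sanitizer for `domain.ring_capacity` might look like this. A hedged sketch, not the kernel's code; it assumes the registered bounds are themselves powers of two, as in the table above.

```rust
/// Sanitize an ML-suggested ring capacity: clamp into the registered
/// bounds, then round down to a power of two so the ring-protocol
/// invariant always holds.
fn sanitize_ring_capacity(suggested: u64, lo: u64, hi: u64) -> u64 {
    debug_assert!(lo.is_power_of_two() && hi.is_power_of_two());
    let clamped = suggested.clamp(lo, hi);
    // Highest set bit = largest power of two <= clamped. Since
    // clamped >= lo >= 1, leading_zeros() < 64 and the shift is
    // well-defined; rounding down cannot fall below lo because lo
    // itself is a power of two.
    1u64 << (63 - clamped.leading_zeros())
}
```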