
Chapter 12: Device Class Frameworks

NIC, GPU, WiFi, Bluetooth, Camera, Printers, Live Kernel Evolution, Watchdog, SPI, rfkill, MTD, IPMI, UIO, NVMEM, SoundWire


12.1 Major Driver Subsystem Interfaces

Complex hardware categories — wireless networking, display, and audio — each require a shared kernel subsystem that multiple hardware-specific drivers plug into. This section defines the authoritative interface contracts for these subsystems. Hardware-specific driver documentation (consumer chipsets in Sections 12.2, 12.3, 20.2, and 20.3; server NICs in Section 14.4; etc.) specifies implementations of these contracts, not independent parallel frameworks.

Any driver implementing a subsystem interface must:

- Follow the tier model (Section 10.4) using the tier specified here.
- Use UmkaOS ring buffers (Section 10.7) for all bulk data flows.
- Implement the crash recovery callbacks defined in Section 10.5.4.
- Register through the device registry (Section 10.5) rather than a subsystem-specific registration API.

12.1.1 Wireless Subsystem

Tier: Tier 1. Wireless I/O latency directly affects user-visible responsiveness (video calls, gaming, SSH). The ~200–500 cycle Tier 2 syscall overhead per packet is unacceptable. WiFi and cellular firmware run on-chip (not on the host CPU), so the attack surface is IOMMU-bounded — same threat model as NVMe (Section 10.7.1).

KABI interface name: wireless_device_v1 (in interfaces/wireless_device.kabi).

// umka-core/src/net/wireless.rs — authoritative wireless driver contract

/// A wireless network device. Implemented by all wireless drivers
/// (WiFi 4/5/6/6E/7, cellular modems, 802.15.4).
pub trait WirelessDriver: Send + Sync {
    // --- Identity and capabilities ---

    /// Hardware address (6-byte MAC or 8-byte EUI-64 for 802.15.4).
    fn mac_addr(&self) -> &[u8];

    /// Supported wireless standards (bitmask).
    fn capabilities(&self) -> WirelessCapabilities;

    // --- Lifecycle ---

    /// Bring the radio up (allocate firmware resources, enable PHY).
    fn up(&self) -> Result<(), WirelessError>;

    /// Take the radio down (quiesce TX/RX, release firmware resources).
    fn down(&self) -> Result<(), WirelessError>;

    // --- Scan and association ---

    /// Request an active or passive scan on the given channels.
    /// Results are delivered via the event ring (see `WirelessEvent`).
    fn scan(&self, req: &ScanRequest) -> Result<(), WirelessError>;

    /// Associate with a network.
    fn connect(&self, params: &ConnectParams) -> Result<(), WirelessError>;

    /// Disassociate from the current network.
    fn disconnect(&self) -> Result<(), WirelessError>;

    // --- Data path ---

    /// Return the TX ring shared with umka-net. The ring is allocated
    /// by the driver during `up()` from DMA-capable memory (Section 11.1.5).
    fn tx_ring(&self) -> &RingBuffer<TxDescriptor>;

    /// Return the RX ring shared with umka-net.
    fn rx_ring(&self) -> &RingBuffer<RxDescriptor>;

    // --- Power ---

    /// Set the power save mode (maps to hardware PSM / DTIM skip).
    fn set_power_save(&self, mode: WirelessPowerSave) -> Result<(), WirelessError>;

    /// Configure Wake-on-WLAN patterns before S3 suspend.
    fn set_wowlan(&self, patterns: &[WowlanPattern]) -> Result<(), WirelessError>;

    // --- Statistics ---

    fn stats(&self) -> WirelessStats;
}

bitflags! {
    pub struct WirelessCapabilities: u32 {
        const WIFI_4    = 1 << 0;  // 802.11n
        const WIFI_5    = 1 << 1;  // 802.11ac
        const WIFI_6    = 1 << 2;  // 802.11ax
        const WIFI_6E   = 1 << 3;  // 802.11ax 6 GHz
        const WIFI_7    = 1 << 4;  // 802.11be
        const BT_5      = 1 << 8;  // Bluetooth 5.x (combo chip)
        const WOWLAN    = 1 << 16; // Wake-on-WLAN support
        const SCAN_OFFLOAD = 1 << 17; // Autonomous background scan in S0ix
    }
}

#[repr(u32)]
pub enum WirelessPowerSave {
    /// Radio always awake (CAM). Lowest latency, highest power.
    Disabled  = 0,
    /// 802.11 PSM (sleep between beacons, wake on DTIM).
    Enabled   = 1,
    /// Aggressive PSM (DTIM skipping, beacon filtering).
    Aggressive = 2,
}

Event delivery: Wireless state changes (scan results, connect/disconnect, roaming) are delivered via a per-device event ring (WirelessEvent enum) that umka-net polls. No callbacks into driver code from the network stack.
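As a non-authoritative sketch of this polling model, the snippet below uses a `VecDeque` as a stand-in for the shared-memory event ring; the `WirelessEvent` variants and the `drain_events` helper are illustrative assumptions, not the KABI definition:

```rust
use std::collections::VecDeque;

// Illustrative event variants (assumed for this sketch).
#[derive(Debug, PartialEq)]
pub enum WirelessEvent {
    ScanResult { bssid: [u8; 6], rssi_dbm: i8 },
    Connected,
    Disconnected { reason: u16 },
}

/// Drain all pending events from a (stand-in) event ring, returning how many
/// were handled. In the real subsystem the ring is a shared-memory
/// RingBuffer<WirelessEvent> polled by umka-net; the driver is never
/// called back by the network stack.
pub fn drain_events(ring: &mut VecDeque<WirelessEvent>) -> usize {
    let mut handled = 0;
    while let Some(ev) = ring.pop_front() {
        match ev {
            WirelessEvent::ScanResult { rssi_dbm, .. } => {
                // umka-net would append this to its scan-result table.
                let _ = rssi_dbm;
            }
            WirelessEvent::Connected => { /* update link state */ }
            WirelessEvent::Disconnected { .. } => { /* tear down routes */ }
        }
        handled += 1;
    }
    handled
}
```

The one-directional flow (driver produces, umka-net consumes) is what keeps driver code out of the network stack's call graph.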

Hardware-specific detail: Section 12.2 (Bluetooth HCI) and Section 12.3 (WiFi — Intel/Realtek/Qualcomm/MediaTek, including server-class 802.11ax access-point mode).

12.1.2 Display Subsystem

Tier: Tier 1 for integrated GPU display engines (Intel Gen12+, AMD DCN, ARM Mali DP). Tier 2 only for fully-offloaded display (USB DisplayLink, network display server) where the display path already crosses a process boundary.

KABI interface name: display_device_v1 (in interfaces/display_device.kabi).

// umka-core/src/display/mod.rs — authoritative display driver contract

/// A display controller device. Implemented by GPU/display drivers.
pub trait DisplayDriver: Send + Sync {
    // --- Connector enumeration ---

    /// Return all physical connectors managed by this display controller.
    fn connectors(&self) -> &[DisplayConnector];

    // KABI methods use caller-supplied buffers — Vec<T> is Rust-specific and not repr(C) stable.

    /// Read EDID from a connected display. Returns the number of bytes written
    /// to `buf`, or an error if no display or no DDC/CI support (driver falls
    /// back to safe-mode resolution).
    fn read_edid(&self, connector_id: u32, buf: &mut [u8]) -> Result<u32, IoError>;

    // --- Atomic modesetting (required; non-atomic paths are not supported) ---

    /// Validate an atomic commit without applying it.
    /// Returns Ok(()) if the hardware can execute the commit, or an error
    /// describing the constraint that is violated.
    fn atomic_check(&self, commit: &AtomicCommit) -> Result<(), DisplayError>;

    /// Apply an atomic commit. Must be preceded by a successful `atomic_check`.
    /// Blocks until the commit takes effect (next vsync, or immediately for
    /// async page flips).
    fn atomic_commit(&self, commit: &AtomicCommit, flags: CommitFlags)
        -> Result<(), DisplayError>;

    // --- Framebuffer management ---

    /// Import a DMA-BUF as a scanout framebuffer. Returns a `FramebufferId`
    /// used in subsequent atomic commits. The driver pins the buffer for the
    /// lifetime of the framebuffer handle.
    fn import_dmabuf(
        &self,
        fd: DmaBufHandle,
        width: u32,
        height: u32,
        format: PixelFormat,
        modifier: u64,
    ) -> Result<FramebufferId, DisplayError>;

    /// Release a framebuffer handle (unpins the DMA-BUF).
    fn destroy_framebuffer(&self, fb: FramebufferId);

    // --- Display power ---

    /// Set DPMS state for a connector (On / Standby / Suspend / Off).
    fn set_dpms(&self, connector_id: u32, state: DpmsState)
        -> Result<(), DisplayError>;

    // --- Vsync events ---

    /// Return the vsync event ring (one entry per completed page flip or
    /// periodic vsync). Consumers: compositors, frame pacing logic.
    fn vsync_ring(&self) -> &RingBuffer<VsyncEvent>;
}

/// Flags for `atomic_commit`.
bitflags! {
    pub struct CommitFlags: u32 {
        /// Apply on next vsync (tear-free). This is the default: `VSYNC` is
        /// the empty flag set (value 0), i.e. `CommitFlags::empty()`.
        const VSYNC      = 0;
        /// Apply immediately without waiting for vsync (for cursor updates).
        const ASYNC      = 1 << 0;
        /// Test-only: validate without applying.
        const TEST_ONLY  = 1 << 1;
        /// Allow modesetting (resolution / refresh rate change).
        const ALLOW_MODESET = 1 << 2;
    }
}

DMA-BUF integration: The display subsystem consumes DMA-BUFs produced by the GPU compute subsystem (Section 21.1) or by CPU-rendered framebuffers. The kernel capability model (Section 8.1) gates import_dmabuf access: a process must hold CAP_DISPLAY to present a framebuffer on a physical connector.
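The check-then-commit discipline can be sketched as follows. The reduced `AtomicPath` trait, `MockDisplay`, and `present` helper are hypothetical stand-ins for demonstration; the authoritative contract is the `DisplayDriver` trait above:

```rust
// Simplified commit description (assumed fields for this sketch).
pub struct AtomicCommit {
    pub connector_id: u32,
    pub fb_id: u64,
    pub mode_change: bool,
}

#[derive(Debug, PartialEq)]
pub enum DisplayError { InvalidMode }

// Reduced stand-in for the atomic-modesetting half of DisplayDriver.
pub trait AtomicPath {
    fn atomic_check(&self, c: &AtomicCommit) -> Result<(), DisplayError>;
    fn atomic_commit(&self, c: &AtomicCommit, allow_modeset: bool)
        -> Result<(), DisplayError>;
}

pub struct MockDisplay;

impl AtomicPath for MockDisplay {
    fn atomic_check(&self, c: &AtomicCommit) -> Result<(), DisplayError> {
        // Pretend connectors 0..=3 exist; a modeset elsewhere is invalid.
        if c.mode_change && c.connector_id > 3 {
            Err(DisplayError::InvalidMode)
        } else {
            Ok(())
        }
    }
    fn atomic_commit(&self, c: &AtomicCommit, allow_modeset: bool)
        -> Result<(), DisplayError>
    {
        // A mode change without ALLOW_MODESET must be rejected.
        if c.mode_change && !allow_modeset {
            return Err(DisplayError::InvalidMode);
        }
        Ok(())
    }
}

/// The compositor pattern: every commit is validated before it is applied.
pub fn present(d: &dyn AtomicPath, c: &AtomicCommit, allow_modeset: bool)
    -> Result<(), DisplayError>
{
    d.atomic_check(c)?;
    d.atomic_commit(c, allow_modeset)
}
```

A plain page flip (no mode change) succeeds without `ALLOW_MODESET`; a resolution change without that flag is rejected at commit time.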

Hardware-specific detail: Section 20.4.3 (display: Intel i915, AMD DCN, Raspberry Pi display pipeline, USB DisplayLink).

12.1.3 Audio Subsystem

Tier: Tier 1 (default). Audio I/O requires strict real-time scheduling to avoid glitches (buffer underruns). The period interrupt (64–2048 frames at 48 kHz = 1.3–42.7 ms) must fire predictably; Tier 2 syscall overhead per interrupt would consistently violate this budget at low-latency settings (< 10 ms periods). For consumer/desktop configurations where crash resilience is prioritized over latency, audio drivers may be optionally demoted to Tier 2 at ≥ 10 ms buffer periods, where the ~20–50 μs syscall overhead is acceptable (see Section 20.3.2 in 20-user-io.md for demotion policy).

KABI interface name: audio_device_v1 (in interfaces/audio_device.kabi).

// umka-core/src/audio/mod.rs — authoritative audio driver contract

/// An audio device. Implemented by HDA controllers, USB audio class drivers,
/// HDMI audio endpoints, Bluetooth A2DP sinks (via umka-compat HCI layer),
/// and virtual audio devices.
pub trait AudioDriver: Send + Sync {
    // --- PCM streams ---

    /// Negotiate a PCM stream. The driver validates that the hardware
    /// supports the requested format, sample rate, channel count, and
    /// period/buffer sizes, and allocates a DMA ring buffer.
    fn open_pcm(&self, params: &PcmParams) -> Result<PcmStream, AudioError>;

    /// Start DMA on an open PCM stream. The hardware begins reading from
    /// (playback) or writing to (capture) the DMA ring buffer. Must be
    /// called after `open_pcm()` and buffer filling (playback) or
    /// application readiness (capture).
    fn start_stream(&self, handle: PcmStreamHandle) -> Result<(), AudioError>;

    /// Stop DMA on a running PCM stream. The hardware stops reading/writing
    /// the DMA ring buffer. The stream remains open and can be restarted
    /// with `start_stream()`. Pending DMA transfers are drained or aborted
    /// depending on the `drain` parameter (true = wait for ring to empty,
    /// false = immediate stop).
    fn stop_stream(&self, handle: PcmStreamHandle, drain: bool) -> Result<(), AudioError>;

    // --- Mixer (hardware volume/routing controls) ---

    /// Enumerate hardware mixer controls into the caller-supplied buffer.
    /// Returns the number of controls written, up to `buf.len()`.
    fn mixer_controls(&self, buf: &mut [MixerControl]) -> Result<u32, IoError>;

    /// Read a mixer control value.
    fn mixer_get(&self, id: u32) -> Result<i32, AudioError>;

    /// Write a mixer control value.
    fn mixer_set(&self, id: u32, value: i32) -> Result<(), AudioError>;

    // --- Jack detection ---

    /// Return the jack event ring (headphone/microphone insert/remove events).
    fn jack_ring(&self) -> &RingBuffer<JackEvent>;

    // --- Power ---

    /// Suspend audio device (silence output, power-gate ADC/DAC). Called
    /// before platform S3/S0ix entry.
    fn suspend(&self) -> Result<(), AudioError>;

    /// Resume audio device. Called after platform resume.
    fn resume(&self) -> Result<(), AudioError>;
}

/// Parameters for opening a PCM stream.
#[repr(C)]
pub struct PcmParams {
    pub direction:     PcmDirection,  // Playback or Capture
    pub format:        PcmFormat,     // S16Le / S24Le / S32Le / F32Le
    pub rate:          u32,           // Hz: 44100, 48000, 96000, 192000
    pub channels:      u8,            // 1 (mono) … 8 (7.1)
    pub period_frames: u32,           // Interrupt granularity (power of 2)
    pub buffer_frames: u32,           // Ring buffer size (multiple of period)
}
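The latency arithmetic behind the tier policy in 12.1.3 falls directly out of these fields. As a sketch (the helper names `period_us` and `tier2_eligible` are illustrative, not part of the KABI):

```rust
/// Period interrupt interval in microseconds for a PCM configuration.
/// 64 frames at 48 kHz ≈ 1.3 ms; 2048 frames at 48 kHz ≈ 42.7 ms,
/// matching the budget quoted in Section 12.1.3.
pub fn period_us(period_frames: u32, rate_hz: u32) -> u64 {
    period_frames as u64 * 1_000_000 / rate_hz as u64
}

/// Whether a configuration qualifies for optional Tier 2 demotion under the
/// >= 10 ms buffer-period policy described in Section 12.1.3.
pub fn tier2_eligible(period_frames: u32, rate_hz: u32) -> bool {
    period_us(period_frames, rate_hz) >= 10_000
}
```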

/// Error type returned by AudioDriver vtable methods.
#[repr(C, u32)]
pub enum AudioError {
    /// The PCM stream parameters are not supported by this device.
    UnsupportedFormat = 1,
    /// The requested sample rate is not supported.
    UnsupportedRate = 2,
    /// No PCM streams available (all in use).
    StreamsExhausted = 3,
    /// DMA buffer allocation failed (ENOMEM).
    NoMemory = 4,
    /// The stream ring buffer underran (playback) — driver consumed data faster
    /// than userspace supplied it. Stream is stopped; caller must restart.
    Underrun = 5,
    /// The stream ring buffer overran (capture) — driver produced data faster
    /// than userspace consumed it. Samples were dropped.
    Overrun = 6,
    /// The hardware is in an unrecoverable error state. Driver must be reloaded.
    HardwareError = 7,
    /// The operation was aborted because the stream is stopping.
    Aborted = 8,
}

ALSA compatibility: umka-compat translates snd_pcm_*, snd_ctl_*, and snd_rawmidi_* ioctls to AudioDriver calls, enabling PipeWire, PulseAudio, and JACK to run unmodified.

Hardware-specific detail: Section 20.3.3 (audio: Intel HDA, USB Audio Class, HDMI/DP audio endpoint).

12.1.4 GPU Compute

Tier: Tier 1. GPU memory management (IOMMU domain assignment, VRAM eviction, TDR recovery) must execute in kernel context. Kernel-bypass command submission is the only path that meets the latency requirements of interactive rendering and GPGPU workloads. A Tier 2 boundary crossing on every submit would add unacceptable per-frame overhead.

KABI interface name: gpu_device_v1 (in interfaces/gpu_device.kabi).

// umka-core/src/gpu/mod.rs — authoritative GPU driver contract

/// A GPU device. Implemented by drivers for discrete and integrated GPUs
/// (Intel Xe, AMD GCN/RDNA, Arm Mali Valhall, NVIDIA GSP, etc.).
pub trait GpuDevice: Send + Sync {
    // --- Context management ---

    /// Allocate a GPU context for the calling process. A context owns a
    /// private GPU virtual address space backed by a dedicated IOMMU domain.
    /// It is the unit of isolation: a fault in one context cannot corrupt
    /// another. The kernel destroys the context (and all buffer objects mapped
    /// into it) when the owning process exits.
    ///
    /// Requires `CAP_GPU_RENDER`.
    fn alloc_ctx(&self) -> Result<GpuContext, GpuError>;

    /// Destroy a GPU context and release all GPU VA space. Any buffer objects
    /// mapped into the context are unmapped but not freed; the caller must
    /// drop the `BufferObject` handles separately.
    fn free_ctx(&self, ctx: GpuContext) -> Result<(), GpuError>;

    // --- Buffer object lifecycle ---

    /// Allocate a buffer object. `size` is in bytes (page-aligned). `placement`
    /// controls where physical backing is sourced (VRAM, GTT, or system
    /// memory). `tiling` sets the hardware tiling modifier; use
    /// `TilingModifier::Linear` unless the caller has negotiated a tiling
    /// format with the display subsystem (Section 12.1.2).
    ///
    /// Requires `CAP_GPU_RENDER`.
    fn alloc_bo(
        &self,
        size: u64,
        placement: BoPlacementFlags,
        tiling: TilingModifier,
    ) -> Result<BufferObject, GpuError>;

    /// Free a buffer object. The BO must have been unmapped from all GPU VA
    /// spaces before calling this. Returns `GpuError::StillMapped` if not.
    fn free_bo(&self, bo: BufferObject) -> Result<(), GpuError>;

    // --- GPU virtual address space ---

    /// Map a buffer object into a GPU context's virtual address space.
    /// `va_hint` is advisory; the driver may choose a different VA if the
    /// hint conflicts with an existing mapping. Returns the actual GPU VA.
    ///
    /// The mapping remains valid until `unmap_bo` is called or the context
    /// is destroyed.
    fn map_bo(
        &self,
        ctx: &GpuContext,
        bo: &BufferObject,
        va_hint: Option<u64>,
        flags: BoMapFlags,
    ) -> Result<u64, GpuError>;

    /// Unmap a buffer object from a GPU context's virtual address space.
    fn unmap_bo(&self, ctx: &GpuContext, va: u64) -> Result<(), GpuError>;

    // --- Command submission ---

    /// Submit a command buffer for execution on the GPU. `exec_queue`
    /// selects the hardware engine (graphics, compute, copy, video).
    /// `wait_fences` is a list of `GpuFence` values that must be signaled
    /// before execution begins. Returns a `GpuFence` that is signaled when
    /// the command buffer completes.
    ///
    /// The command buffer pointer is a GPU VA within `ctx`. The caller is
    /// responsible for ensuring the GPU VA maps to valid, initialized memory.
    ///
    /// Requires `CAP_GPU_RENDER`.
    fn submit(
        &self,
        ctx: &GpuContext,
        exec_queue: ExecQueue,
        cmdbuf_va: u64,
        cmdbuf_size: u64,
        wait_fences: &[GpuFence],
    ) -> Result<GpuFence, GpuError>;

    // --- DMA-BUF export ---

    /// Export a buffer object as a DMA-BUF file descriptor. The returned
    /// handle can be passed to the display subsystem (`DisplayDriver::
    /// import_dmabuf`, Section 12.1.2) or to a video encoder (`MediaDevice::
    /// queue_buf`, Section 12.1.6). The BO reference count is incremented; the BO
    /// remains live until both the `BufferObject` handle and all DMA-BUF
    /// importers are dropped.
    fn export_dmabuf(&self, bo: &BufferObject) -> Result<DmaBufHandle, GpuError>;

    // --- TDR (Timeout Detection and Recovery) ---

    /// Trigger an explicit TDR cycle on a GPU context that the caller has
    /// determined is hung. The kernel also calls this internally if a context
    /// has not produced a progress heartbeat for 2 seconds.
    ///
    /// Behavior:
    /// 1. The driver preempts the hung context.
    /// 2. The hardware engine is reset to a known-good state.
    /// 3. All other active contexts are saved, the engine is reconfigured,
    ///    and those contexts resume from their last checkpoint.
    /// 4. The hung `GpuContext` is marked invalid; any subsequent call on it
    ///    returns `GpuError::ContextLost` (-ENODEV).
    ///
    /// Requires `CAP_GPU_ADMIN`.
    fn tdr_reset(&self, ctx: &GpuContext) -> Result<(), GpuError>;

    // --- Capability queries ---

    /// Return the set of capabilities reported by the GPU hardware (memory
    /// sizes, engine counts, supported tiling modifiers, etc.).
    fn capabilities(&self) -> GpuCapabilities;
}

/// A GPU context: one process's private GPU virtual address space.
/// Isolated from all other contexts by a dedicated IOMMU domain.
pub struct GpuContext {
    /// Opaque kernel handle. Never dereference from outside the GPU subsystem.
    pub handle: u64,
    /// The IOMMU domain ID assigned to this context (for cross-subsystem
    /// DMA-BUF import validation).
    pub iommu_domain_id: u32,
}

/// A GPU memory allocation.
pub struct BufferObject {
    /// Opaque kernel handle.
    pub handle: u64,
    /// Size in bytes (always page-aligned).
    pub size: u64,
    /// Actual placement after allocation (may differ from the requested
    /// placement if VRAM was full and the driver fell back to GTT).
    pub actual_placement: BoPlacementFlags,
    /// Tiling modifier in use (DRM format modifier encoding).
    pub tiling: TilingModifier,
}

/// Timeline semaphore. A fence is signaled when `timeline.seqno >= value`.
/// Shared across Section 12.1.4 (GPU), Section 12.1.6 (Media), Section 12.1.7 (NPU), and Section 12.1.10
/// (Crypto) to allow cross-subsystem dependency chains without conversions.
pub struct GpuFence {
    /// Identifies the hardware timeline (GPU engine, DMA channel, NPU, etc.).
    pub timeline_id: u64,
    /// The sequence number on that timeline that must be reached.
    pub seqno: u64,
}
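The signaling rule stated in the doc comment — a fence is signaled once the timeline's sequence number reaches the fence's target — reduces to a single comparison. A minimal sketch (in the real driver `current_seqno` would be read from the timeline's memory-mapped progress counter):

```rust
/// Timeline fence, mirroring the GpuFence struct above.
pub struct GpuFence {
    pub timeline_id: u64,
    pub seqno: u64,
}

/// A fence is signaled when the timeline has reached the fence's target
/// sequence number (`timeline.seqno >= value`).
pub fn fence_signaled(fence: &GpuFence, current_seqno: u64) -> bool {
    current_seqno >= fence.seqno
}
```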

bitflags! {
    /// Where to source backing memory for a buffer object.
    pub struct BoPlacementFlags: u32 {
        /// GPU-local VRAM. Highest GPU bandwidth, not CPU-accessible without
        /// a GTT mapping.
        const VRAM   = 1 << 0;
        /// Graphics Translation Table (CPU-accessible via BAR2/GGTT aperture).
        const GTT    = 1 << 1;
        /// System (DRAM) memory. Always CPU-accessible; lowest GPU bandwidth.
        const SYSTEM = 1 << 2;
    }
}

bitflags! {
    /// Flags controlling a GPU VA mapping.
    pub struct BoMapFlags: u32 {
        /// GPU may read the buffer.
        const READ       = 1 << 0;
        /// GPU may write the buffer.
        const WRITE      = 1 << 1;
        /// CPU cache is coherent with GPU (requires hardware support; falls
        /// back to uncached if not available).
        const COHERENT   = 1 << 2;
    }
}

#[repr(u32)]
/// Hardware tiling modifier (DRM format modifier encoding, lower 32 bits).
pub enum TilingModifier {
    /// No tiling — linear row-major layout.
    Linear     = 0,
    /// Intel X-tiling (128-byte columns × 8 rows).
    IntelXTile = 1,
    /// Intel Y-tiling (32-byte columns × 32 rows, preferred for render).
    IntelYTile = 2,
    /// AMD DCC (Delta Color Compression — requires matching display engine).
    AmdDcc     = 3,
    /// Arm Afbc (Arm Frame Buffer Compression).
    ArmAfbc    = 4,
}

#[repr(u32)]
/// GPU hardware engine selector for command submission.
pub enum ExecQueue {
    /// 3D rendering and compute shaders (universal queue on most GPUs).
    Graphics  = 0,
    /// Dedicated compute queue (no graphics state, runs in parallel).
    Compute   = 1,
    /// Blitter / copy engine (lower power for buffer-to-buffer transfers).
    Copy      = 2,
    /// Video decode engine.
    VideoDec  = 3,
    /// Video encode engine.
    VideoEnc  = 4,
}

TDR model: The watchdog timer fires every 2 seconds. If a GPU context has not advanced its hardware progress counter since the last tick, the kernel invokes tdr_reset() on that context. The 2-second threshold is adjustable per-device via sysfs (/sys/class/gpu/<dev>/tdr_timeout_ms) by a process holding CAP_GPU_ADMIN. Reducing below 100 ms is not permitted; doing so would produce false positives during legitimate shader compilation stalls.
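The per-tick decision and the 100 ms floor can be sketched as follows; `CtxProgress`, `tdr_needed`, and `validate_tdr_timeout_ms` are illustrative names, not the kernel's actual symbols:

```rust
/// Per-context progress state sampled at each watchdog tick (assumed shape).
pub struct CtxProgress {
    pub last_counter: u64,
}

/// Returns true if the context must be TDR-reset: its hardware progress
/// counter has not advanced since the previous watchdog tick.
pub fn tdr_needed(prev: &mut CtxProgress, current_counter: u64) -> bool {
    let hung = current_counter == prev.last_counter;
    prev.last_counter = current_counter;
    hung
}

/// Validate a tdr_timeout_ms write: values below 100 ms are rejected to
/// avoid false positives during legitimate shader-compilation stalls.
pub fn validate_tdr_timeout_ms(ms: u32) -> Result<u32, ()> {
    if ms < 100 { Err(()) } else { Ok(ms) }
}
```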

Cross-driver synchronization: GpuFence is the universal cross-driver timeline primitive. The display subsystem (Section 12.1.2) accepts a GpuFence in atomic_commit to defer scanout until rendering completes. The NPU subsystem (Section 12.1.7) and the crypto engine (Section 12.1.10) both use the same GpuFence struct so that inference pipelines and encrypted content pipelines can express multi-stage dependency chains in a single data structure.
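A two-stage dependency chain (render, then encode) expressed with these fences might look as follows. `MockQueue` and its `submit` are stand-ins for demonstration; only the wait-list construction mirrors the real `submit()` contract:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct GpuFence {
    pub timeline_id: u64,
    pub seqno: u64,
}

/// Mock hardware queue with its own timeline (assumed for this sketch).
pub struct MockQueue {
    timeline_id: u64,
    next_seqno: u64,
}

impl MockQueue {
    pub fn new(timeline_id: u64) -> Self {
        Self { timeline_id, next_seqno: 1 }
    }

    /// Mock of submit(): the hardware guarantees execution starts only after
    /// every fence in `wait_fences` signals; the returned fence signals on
    /// completion of this submission.
    pub fn submit(&mut self, wait_fences: &[GpuFence]) -> GpuFence {
        let _ = wait_fences; // a real driver programs these into the queue
        let f = GpuFence { timeline_id: self.timeline_id, seqno: self.next_seqno };
        self.next_seqno += 1;
        f
    }
}
```

The encode submission simply lists the render fence in `wait_fences`; no subsystem-specific fence conversion is needed because both stages share the `GpuFence` struct.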

Capability gating: CAP_GPU_RENDER is required for context allocation, BO allocation, and command submission. CAP_GPU_ADMIN is additionally required for clock control, performance counter access, and explicit TDR. Both capabilities are checked in the kernel before any hardware register is touched.

Hardware-specific detail: Per-vendor GPU architecture is documented inline in this section (generic GPU KABI), in Section 21.1–46 (11-accelerators.md) for accelerator scheduling, memory management, and NVIDIA porting (Section 21.5.2), and in Section 20.4 (20-user-io.md) for display/KMS. Vendor-specific register-level programming (Intel Xe/i915, AMD AMDGPU, ARM Mali Valhall, NVIDIA GSP) is covered during per-driver implementation using vendor documentation.

12.1.4.1 DMA Fence Behavior on GPU Crash

When a GPU crashes mid-workload, all pending GpuFence values associated with that GPU will never be signaled by the hardware. Without explicit kernel intervention, every waiter — CPU threads blocked in dma_fence_wait(), display scanout pipelines, video encoders, NPU submission queues — blocks indefinitely. UmkaOS resolves all pending fences during the crash handler to unblock waiters immediately.

Fence error types:

/// Error status signaled to fence waiters when the GPU crashes.
#[repr(u32)]
pub enum DmaFenceError {
    /// The GPU device was lost entirely. Work was not completed and cannot
    /// be retried on this device without a full device reset and driver reload.
    DeviceLost    = 1,
    /// A specific GPU context was killed (by TDR or an unrecoverable fault),
    /// but the GPU device itself remains operational. Other contexts continue.
    /// Work associated with the killed context was not completed.
    ContextKilled = 2,
}

/// Resolution applied to a pending fence during GPU crash handling.
#[repr(u32)]
pub enum FenceCrashResolution {
    /// Signal the fence with an error. Waiters wake up and receive the error.
    /// Used when work is lost and callers must handle the failure.
    SignalError(DmaFenceError),
    /// Signal the fence as completed. Used when the kernel has already rolled
    /// back the associated state and the waiter can safely proceed — for
    /// example, a fence guarding a buffer that has been fully reclaimed.
    SignalComplete,
}

Fence registry: The GPU driver maintains a per-device fence registry — a lock-protected list of (GpuFence, GpuContext, waker) tuples for all outstanding fences. The registry is stored in umka-core memory (not in the Tier 1 driver's isolation domain) so it is accessible during crash recovery after the domain is revoked.

GPU crash handler sequence:

GPU crash detected:
  Source: firmware timeout interrupt, hardware fault interrupt, or TDR watchdog.

1. IOMMU isolation:
   - Revoke the GPU's IOMMU DMA domain (set to fault-on-access).
   - The GPU can no longer read or write system memory via DMA.
   - This is the first action; it happens before any fence signaling.

2. Fence resolution:
   - Acquire the fence registry lock (exclusive).
   - Iterate all pending fences in submission order (FIFO within each context):
     a. Fence associated with a specific GpuContext:
        - Signal with DmaFenceError::ContextKilled.
        - Wake all waiters blocked on this fence.
     b. Fence associated with the device (no context — e.g., inter-GPU dependency):
        - Signal with DmaFenceError::DeviceLost.
        - Wake all waiters blocked on this fence.
   - Release the fence registry lock.
   - Signaling is done under the lock to prevent races with concurrent
     dma_fence_wait() calls that might otherwise block after the lock is
     released but before the fence is signaled.

3. Context invalidation:
   - Iterate all GpuContext objects owned by the crashed GPU.
   - Transition each to GpuContextState::Lost.
   - Any subsequent call on a Lost context returns GpuError::ContextLost.

4. GPU recovery:
   - Attempt TDR reset (single-context kill) if only one context faulted.
   - Escalate to FLR (pcie_flr_with_timeout, Section 10.8.2b) if TDR fails
     or the device itself is unresponsive.
   - After successful reset: transition surviving contexts to Suspended;
     the driver attempts to restore their last checkpoint state.
   - If FLR also fails: proceed with the permanent fault sequence in Section 10.8.2b.
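Step 2 (fence resolution) can be sketched as below. `PendingFence` and `resolve_pending` are simplified stand-ins; the real registry lives in umka-core and is iterated under the exclusive registry lock described above:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum DmaFenceError {
    DeviceLost,
    ContextKilled,
}

/// Simplified registry entry: an optional owning context handle and the
/// resolution applied during crash handling.
pub struct PendingFence {
    pub ctx: Option<u64>,
    pub resolved: Option<DmaFenceError>,
}

/// Resolve every pending fence: context-bound fences are signaled with
/// ContextKilled, device-level fences with DeviceLost. Returns the number
/// of fences resolved (each wake-up unblocks all of that fence's waiters).
pub fn resolve_pending(registry: &mut [PendingFence]) -> usize {
    let mut resolved = 0;
    for f in registry.iter_mut() {
        f.resolved = Some(match f.ctx {
            Some(_) => DmaFenceError::ContextKilled,
            None => DmaFenceError::DeviceLost,
        });
        resolved += 1;
    }
    resolved
}
```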

Waiter contract:

Callers using dma_fence_wait() receive Err(DmaFenceError) instead of waiting indefinitely. The return path is identical to a normal timeout — the caller's blocking state is cleared and control returns to the caller with an error code.

UmkaOS does not silently swallow fence errors. Every waiter that was blocked on a GPU fence at crash time receives an explicit error. There is no "signal-and-hope" behavior.

Userspace propagation:

The GPU driver's userspace interface layer maps fence errors to the appropriate API-level error codes:

- Vulkan: VK_ERROR_DEVICE_LOST (full device crash) or VK_ERROR_INITIALIZATION_FAILED (context kill, if the device recovers).
- OpenGL/EGL: EGL_CONTEXT_LOST (ARB_robustness extension).
- CUDA/HIP: cudaErrorLostDevice / hipErrorLostDevice.
- OpenCL: CL_DEVICE_NOT_AVAILABLE.

The driver maps DmaFenceError::DeviceLost → the device-lost variant and DmaFenceError::ContextKilled → the context-lost/robustness variant. Userspace applications that handle robustness extensions (Vulkan Robust Buffer Access, OpenGL ARB_robustness) can recover from context-killed errors without terminating.

Fence ordering guarantee on crash:

Fences within a single GpuContext are signaled in submission order (FIFO). This preserves the happens-before relationship for recovery code that inspects fence completion order to determine which operations committed before the crash and which did not. Cross-context fences (inter-context dependencies expressed via wait_fences in submit()) are signaled after all fences in the depended-upon context are signaled, maintaining the dependency ordering even in the error path.

12.1.5 RDMA

Tier: Tier 1. RDMA's defining property is that the hot path (posting work requests, ringing doorbells) never enters the kernel. This requires that the kernel map QP doorbell pages and work-request memory regions directly into userspace. Protection domain management, memory region pinning, and IOMMU programming must therefore reside in the kernel.
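The userspace hot path — write a work request into QP memory, then ring the doorbell with a single MMIO store — can be sketched as below. The `ring_doorbell` helper is illustrative; here a plain `u64` stands in for the mapped MMIO page that `map_qp_doorbell()` (defined later in this section) returns:

```rust
/// Ring the QP doorbell. On real hardware `doorbell` points into the MMIO
/// page mapped into the process; the volatile write is the only "kernel
/// bypass" step — no syscall occurs anywhere on this path.
pub fn ring_doorbell(doorbell: &mut u64, wqe_index: u64) {
    // write_volatile prevents the compiler from eliding or reordering the
    // store, which matters for a memory-mapped device register.
    unsafe { core::ptr::write_volatile(doorbell as *mut u64, wqe_index) };
}
```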

KABI interface name: rdma_device_v1 (in interfaces/rdma_device.kabi).

// umka-core/src/rdma/mod.rs — authoritative RDMA driver contract

/// An RDMA-capable network device (InfiniBand HCA, RoCEv2 NIC, iWARP adapter).
/// Implemented by drivers such as Mellanox/NVIDIA mlx5, Broadcom bnxt_re,
/// Intel irdma, and Marvell qedr.
pub trait RdmaDevice: Send + Sync {
    // --- Protection domain ---

    /// Allocate a protection domain. A PD is the unit of authorization:
    /// memory regions, queue pairs, and address handles all belong to exactly
    /// one PD. Objects in different PDs cannot communicate without explicit
    /// cross-registration (which is not currently supported).
    ///
    /// Requires `CAP_RDMA`.
    fn alloc_pd(&self) -> Result<ProtectionDomain, RdmaError>;

    /// Free a protection domain. All child objects (MRs, QPs, AHs) must
    /// have been freed before calling this; returns `RdmaError::PdInUse`
    /// if any remain.
    fn dealloc_pd(&self, pd: ProtectionDomain) -> Result<(), RdmaError>;

    // --- Memory regions ---

    /// Register a memory region. The kernel pins the pages covering
    /// `[addr, addr + length)` in the calling process's address space,
    /// programs the IOMMU to allow DMA from the device, and returns the
    /// local key (`lkey`, for local SGE references) and remote key (`rkey`,
    /// for remote RDMA operations targeting this region). Both keys are
    /// opaque 32-bit values; their encoding is device-specific.
    ///
    /// `access` controls what operations remote peers may perform via
    /// the rkey (read, write, atomic). Local access via lkey always
    /// allows reads and writes.
    ///
    /// Requires `CAP_RDMA`.
    fn alloc_mr(
        &self,
        pd: &ProtectionDomain,
        addr: u64,
        length: u64,
        access: MrAccessFlags,
    ) -> Result<MemoryRegion, RdmaError>;

    /// Deregister a memory region. Pages are unpinned and the IOMMU mapping
    /// is removed. Any in-flight RDMA operation targeting this MR will
    /// complete with a remote access error on the peer side.
    fn dealloc_mr(&self, mr: MemoryRegion) -> Result<(), RdmaError>;

    // --- Completion queues ---

    /// Create a completion queue with capacity for at least `cqe` entries.
    /// The driver may round up to a hardware-convenient size. The actual
    /// capacity is returned in `CompletionQueue::capacity`.
    ///
    /// Requires `CAP_RDMA`.
    fn create_cq(&self, cqe: u32) -> Result<CompletionQueue, RdmaError>;

    /// Destroy a completion queue. All QPs that reference this CQ must be
    /// destroyed first.
    fn destroy_cq(&self, cq: CompletionQueue) -> Result<(), RdmaError>;

    // --- Queue pairs ---

    /// Create a queue pair (send queue + receive queue) associated with the
    /// given protection domain and completion queues. `init_attr` specifies
    /// QP type, initial queue depths, and scatter-gather element counts.
    ///
    /// The QP is created in the RESET state. Call `modify_qp` to transition
    /// it to INIT → RTR → RTS before posting work requests.
    ///
    /// Requires `CAP_RDMA`.
    fn create_qp(
        &self,
        pd: &ProtectionDomain,
        send_cq: &CompletionQueue,
        recv_cq: &CompletionQueue,
        init_attr: &QpInitAttr,
    ) -> Result<QueuePair, RdmaError>;

    /// Transition a queue pair through the state machine (RESET→INIT→RTR→RTS,
    /// or error paths). `attr_mask` indicates which fields in `attr` are valid.
    fn modify_qp(
        &self,
        qp: &mut QueuePair,
        attr: &QpAttr,
        attr_mask: QpAttrMask,
    ) -> Result<(), RdmaError>;

    /// Destroy a queue pair. Any posted work requests are silently discarded.
    fn destroy_qp(&self, qp: QueuePair) -> Result<(), RdmaError>;

    // --- Kernel-bypass doorbell mapping ---

    /// Map the QP doorbell page into the calling process's virtual address
    /// space. Returns the userspace virtual address of the doorbell MMIO page.
    /// The process writes work requests to the QP memory (already mapped via
    /// `mmap` of the QP backing pages) and then writes a 64-bit descriptor to
    /// the doorbell address to ring the hardware. No syscall is needed on the
    /// hot path.
    ///
    /// The mapping is automatically removed when the QP is destroyed or the
    /// process exits.
    ///
    /// Requires `CAP_RDMA`.
    fn map_qp_doorbell(&self, qp: &QueuePair) -> Result<*mut u8, RdmaError>;

    // --- Kernel-side slow path (setup and error recovery only) ---

    /// Post receive work requests to the QP's receive queue. Used only
    /// during initialization and after QP error recovery; the normal path
    /// posts directly from userspace.
    fn post_recv(
        &self,
        qp: &QueuePair,
        wrs: &[RecvWorkRequest],
    ) -> Result<(), RdmaError>;

    /// Post send work requests to the QP's send queue. Used only during
    /// initialization and after QP error recovery.
    fn post_send(
        &self,
        qp: &QueuePair,
        wrs: &[SendWorkRequest],
    ) -> Result<(), RdmaError>;

    // --- Port query ---

    /// Query the state of a physical port. Returns link state, MTU, GID
    /// table entries, port capabilities, and current speed/width.
    fn query_port(&self, port_num: u8) -> Result<PortAttributes, RdmaError>;

    // --- Device query ---

    /// Return static device capabilities (max QPs, max CQEs, max MR size,
    /// supported transport types, atomic operation support, etc.).
    fn query_device(&self) -> DeviceAttributes;
}

/// A protection domain: unit of authorization for RDMA operations.
pub struct ProtectionDomain {
    /// Opaque kernel handle.
    pub handle: u32,
}

/// A pinned, IOMMU-mapped memory region.
pub struct MemoryRegion {
    /// Opaque kernel handle.
    pub handle: u32,
    /// Local key: used in SGE (scatter-gather element) references.
    pub lkey: u32,
    /// Remote key: presented to a remote peer to authorize RDMA operations
    /// targeting this region.
    pub rkey: u32,
    /// Base virtual address of the registered region.
    pub addr: u64,
    /// Length of the registered region in bytes.
    pub length: u64,
}

/// A completion queue.
pub struct CompletionQueue {
    /// Opaque kernel handle.
    pub handle: u32,
    /// Actual CQ capacity (≥ the requested `cqe`).
    pub capacity: u32,
}

/// A queue pair (RC, UC, UD, or SRQ-attached RC).
pub struct QueuePair {
    /// Opaque kernel handle.
    pub handle: u32,
    /// The QP number used by the remote peer for addressing.
    pub qp_num: u32,
    /// Current QP state.
    pub state: QpState,
}

bitflags! {
    /// Access permissions granted on a memory region to remote peers.
    pub struct MrAccessFlags: u32 {
        /// Remote peer may issue RDMA Read targeting this MR.
        const REMOTE_READ   = 1 << 0;
        /// Remote peer may issue RDMA Write targeting this MR.
        const REMOTE_WRITE  = 1 << 1;
        /// Remote peer may issue atomic operations (CAS, FAA) on this MR.
        const REMOTE_ATOMIC = 1 << 2;
        /// Memory window binding is allowed (for dynamic rkey invalidation).
        const MW_BIND       = 1 << 3;
    }
}

#[repr(u32)]
/// QP state machine states (IB Architecture Specification Section 14.4.3).
pub enum QpState {
    /// Hardware-quiesced state. No WRs are processed.
    Reset  = 0,
    /// Initialized. Receive WRs may be posted; sends are not yet enabled.
    Init   = 1,
    /// Ready To Receive. Path information is configured; receives are active.
    Rtr    = 2,
    /// Ready To Send. Both sends and receives are active.
    Rts    = 3,
    /// Send Queue Drain. Sends are draining so that QP attributes can be
    /// modified; no new send WRs are processed until the drain completes.
    Sqd    = 4,
    /// Send Queue Error. A send WR completed in error; subsequent send WRs
    /// are flushed with error completions while receives remain operational.
    Sqe    = 5,
    /// Error. Both queues have been flushed with error completions.
    Err    = 6,
}

Kernel-bypass model: After map_qp_doorbell() and mmap of the QP work queue memory, userspace RDMA libraries (libibverbs, rdma-core) operate entirely without kernel involvement on the send path. The kernel is re-entered only for: QP state transitions, CQ overflow recovery, error handling, and address handle (AH) creation. This model is compatible with the Open MPI and UCX transports used by HPC applications.
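The hot-path sequence described above reduces to a single MMIO write. A minimal sketch follows; the 64-bit descriptor layout (QP number in bits 0..24, producer index in bits 24..40) is an illustrative assumption, since real encodings are device-specific and opaque to this contract:

```rust
/// Pack a doorbell descriptor from a QP number and a work-queue producer
/// index. Hypothetical field layout, for illustration only.
pub fn pack_doorbell(qp_num: u32, producer_idx: u16) -> u64 {
    ((qp_num as u64) & 0x00FF_FFFF) | ((producer_idx as u64) << 24)
}

/// Ring the hardware: one volatile write to the MMIO page returned by
/// `map_qp_doorbell`. No syscall on the hot path.
pub unsafe fn ring_doorbell(db: *mut u64, qp_num: u32, producer_idx: u16) {
    core::ptr::write_volatile(db, pack_doorbell(qp_num, producer_idx));
}
```

Userspace writes the work request into the mmapped QP memory first, then rings the doorbell; the volatile write ensures the store is not elided or reordered by the compiler.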

IOMMU integration: alloc_mr programs the device's IOMMU domain (same model as Section 12.1.4 GpuContext) so that only the registered address range is accessible to the device. A buffer overflow in an RDMA payload cannot reach outside the registered MR.

Multikernel integration: The distributed lock manager (Section 14.6) and the inter-node IPC transport (Section 5.1) both use RDMA as their high-speed fabric. The RDMA protection domain model maps directly to UmkaOS capability domains: each cluster node that participates in the multikernel has one PD per trust domain.

IB verbs compatibility: The RdmaDevice trait is a strict superset of the IB verbs interface exposed by Linux's ib_verbs.h. The umka-compat layer translates ibv_* library calls to the corresponding RdmaDevice methods, allowing unmodified rdma-core, Open MPI, and OpenFabrics applications to run.

Hardware-specific detail: Per-vendor RDMA driver architecture (Mellanox/NVIDIA mlx5, Intel irdma, Broadcom bnxt_re, Marvell qedr) is documented inline in this section and covered during per-driver implementation using vendor documentation and the IB verbs specification.

12.1.6 Video / Media Pipeline

Tier: Tier 1 for hardware codec engines (Intel Quick Sync, AMD VCN, Qualcomm Venus, Mediatek VENC/VDEC, Apple VideoToolbox-equivalent hardware). Tier 2 for pure software codecs: a CPU-based ffmpeg instance is already a userspace process and requires no special KABI beyond ordinary DMA-BUF file descriptor passing and shared memory.

KABI interface name: media_device_v1 (in interfaces/media_device.kabi).

// umka-core/src/media/mod.rs — authoritative media pipeline driver contract

/// A hardware media processing device (codec engine, ISP, or similar).
/// Implemented by drivers for SoC video IP blocks and discrete capture cards.
pub trait MediaDevice: Send + Sync {
    // --- Capability discovery ---

    /// Enumerate the codec configurations supported by the hardware. Writes
    /// up to `buf.len()` entries into the caller-supplied buffer and returns
    /// the number written. Each entry specifies codec type, profile, level,
    /// maximum resolution, maximum frame rate, and whether encode and/or
    /// decode is supported.
    fn query_codecs(
        &self,
        buf: &mut [CodecCapability],
    ) -> Result<u32, MediaError>;

    // --- Session lifecycle ---

    /// Create a codec session. `config` specifies the codec, direction
    /// (encode or decode), input/output pixel formats, and initial encoding
    /// parameters (bitrate, QP, keyframe interval, rate control mode) for
    /// encode sessions or output pixel format (NV12, P010, etc.) for decode.
    ///
    /// Returns a `MediaSession` handle used for subsequent buffer operations.
    fn create_session(
        &self,
        config: &SessionConfig,
    ) -> Result<MediaSession, MediaError>;

    /// Destroy a codec session. All queued buffers are flushed and returned
    /// with `BufferState::Error` before the session handle is invalidated.
    fn destroy_session(&self, session: MediaSession) -> Result<(), MediaError>;

    // --- Buffer queue ---

    /// Submit an input buffer (as a DMA-BUF handle) to the session for
    /// processing. For encode sessions the buffer contains raw video frames;
    /// for decode sessions it contains compressed bitstream data.
    ///
    /// `sequence` is a monotonically increasing, caller-assigned sequence
    /// number echoed back with the corresponding output buffer so the caller
    /// can match outputs to inputs even when they complete out of order.
    fn queue_buf(
        &self,
        session: &MediaSession,
        buf: DmaBufHandle,
        sequence: u64,
        flags: QueueFlags,
    ) -> Result<(), MediaError>;

    /// Retrieve the next completed output buffer. Blocks until a buffer is
    /// available or the session is destroyed. Returns the DMA-BUF handle of
    /// the output, the input sequence number it corresponds to, and a
    /// `GpuFence` (Section 12.1.4) that is signaled when the hardware has finished
    /// writing to the buffer (the caller must wait on this fence before
    /// reading the buffer contents from CPU or passing it to the display).
    fn dequeue_buf(
        &self,
        session: &MediaSession,
    ) -> Result<DequeuedBuffer, MediaError>;

    // --- Media graph topology ---

    /// Return all pads (typed I/O ports) belonging to this device node.
    fn pads(&self) -> &[MediaPad];

    /// Create a directed link between an output pad of this device and an
    /// input pad of another device. Both pads must be compatible (same
    /// pixel format, resolution, and frame rate). Returns a `MediaLink`
    /// handle. Enabling the link causes DMA-BUFs to flow from the source
    /// pad to the sink pad without copying.
    fn create_link(
        &self,
        src_pad: PadId,
        sink_device: &dyn MediaDevice,
        sink_pad: PadId,
        format: LinkFormat,
    ) -> Result<MediaLink, MediaError>;

    /// Destroy a link, stopping buffer flow between the two pads.
    fn destroy_link(&self, link: MediaLink) -> Result<(), MediaError>;

    // --- Dynamic parameter updates ---

    /// Update encoding parameters on a running encode session without
    /// destroying and recreating it. Only encode-direction parameters
    /// (bitrate, QP range, keyframe force) may be updated this way.
    fn update_encode_params(
        &self,
        session: &MediaSession,
        params: &EncodeParams,
    ) -> Result<(), MediaError>;
}

/// A codec session handle.
pub struct MediaSession {
    /// Opaque kernel handle.
    pub handle: u64,
    /// Session direction (Encode or Decode).
    pub direction: CodecDirection,
}

/// A directed link between two media pads. The link transfers ownership of
/// each DMA-BUF from the source pad to the sink pad atomically.
pub struct MediaLink {
    /// Opaque kernel handle.
    pub handle: u32,
    /// Source device pad identifier.
    pub src_pad: PadId,
    /// Sink device pad identifier.
    pub sink_pad: PadId,
    /// Negotiated format carried on this link.
    pub format: LinkFormat,
}

/// A typed I/O port on a media device.
pub struct MediaPad {
    /// Identifier unique within the owning device.
    pub id: PadId,
    /// Whether this pad produces (Source) or consumes (Sink) DMA-BUFs.
    pub direction: PadDirection,
    /// Set of pixel formats and frame sizes this pad can accept or produce.
    pub supported_formats: Vec<PadFormat>,
}

/// A completed output buffer returned by `dequeue_buf`.
pub struct DequeuedBuffer {
    /// DMA-BUF handle of the output data. For encode: compressed bitstream.
    /// For decode: raw frame in the pixel format requested in `SessionConfig`.
    pub buf: DmaBufHandle,
    /// Caller-assigned sequence number from the corresponding `queue_buf`.
    pub sequence: u64,
    /// Fence signaled when hardware has finished writing to `buf`. The
    /// caller MUST wait on this fence before reading or forwarding the buffer.
    pub ready_fence: GpuFence,
}

/// Configuration for a new codec session.
#[repr(C)]
pub struct SessionConfig {
    /// Codec type (H264, H265, AV1, VP9, JPEG, etc.).
    pub codec: CodecType,
    /// Encode or Decode.
    pub direction: CodecDirection,
    /// Input pixel format (for encode) or bitstream container (for decode).
    pub input_format: MediaFormat,
    /// Output pixel format (for decode: NV12, P010, etc.; for encode: N/A).
    pub output_format: MediaFormat,
    /// Initial encode parameters (ignored for decode sessions).
    pub encode_params: EncodeParams,
}

/// Encoding parameters. All fields are writable after session creation via
/// `update_encode_params`.
#[repr(C)]
pub struct EncodeParams {
    /// Target bitrate in bits per second. 0 means CQP (constant QP) mode.
    pub bitrate_bps: u32,
    /// Minimum quantization parameter (lower = better quality, larger frames).
    pub qp_min: u8,
    /// Maximum quantization parameter.
    pub qp_max: u8,
    /// Force a keyframe every N frames. 0 disables periodic keyframes.
    pub keyframe_interval: u32,
    /// Rate control mode (CBR, VBR, CQP, CRF).
    pub rc_mode: RateControlMode,
}

#[repr(u32)]
pub enum CodecDirection {
    /// Hardware encoder: raw frames in, compressed bitstream out.
    Encode = 0,
    /// Hardware decoder: compressed bitstream in, raw frames out.
    Decode = 1,
}

#[repr(u32)]
pub enum RateControlMode {
    /// Constant bitrate. Buffer fullness is maintained; quality varies.
    Cbr = 0,
    /// Variable bitrate. Average bitrate target; quality peaks on I-frames.
    Vbr = 1,
    /// Constant quantization parameter. Bitrate varies; quality is fixed.
    Cqp = 2,
    /// Constant rate factor (quality-based VBR, similar to x264 CRF).
    Crf = 3,
}

bitflags! {
    /// Flags for `queue_buf`.
    pub struct QueueFlags: u32 {
        /// Mark this buffer as the last in a stream (EOS). The session will
        /// flush and return all pending output buffers after processing this
        /// input.
        const END_OF_STREAM = 1 << 0;
        /// Force a keyframe on this input buffer (encode only).
        const FORCE_KEYFRAME = 1 << 1;
    }
}

Buffer graph model: A complete media pipeline is a directed acyclic graph of MediaDevice nodes connected by MediaLink edges. DMA-BUFs flow from source pads to sink pads without copying. A typical pipeline:

[camera sensor] → [ISP] → [encoder] → [network or file]

The ISP and encoder are separate MediaDevice instances. The link between them carries DMA-BUFs whose lifetime is managed by the producing node; the consuming node signals via a GpuFence when it has finished reading the buffer so the producer can reuse it.
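On the caller side of queue_buf / dequeue_buf, the sequence number is what makes out-of-order completion tractable: the kernel only echoes it back, and the caller keeps all per-frame state. A minimal sketch of that bookkeeping (FrameMeta and its field are hypothetical, not part of the KABI):

```rust
use std::collections::HashMap;

/// Hypothetical per-frame record kept entirely in the caller.
struct FrameMeta { pts_ns: u64 }

/// Tracks in-flight frames by caller-assigned sequence number so that
/// out-of-order dequeues can be matched back to their inputs.
struct InFlight {
    next_seq: u64,
    pending: HashMap<u64, FrameMeta>,
}

impl InFlight {
    fn new() -> Self { InFlight { next_seq: 0, pending: HashMap::new() } }

    /// Call alongside `queue_buf`: returns the sequence number to pass in.
    fn on_queue(&mut self, meta: FrameMeta) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.pending.insert(seq, meta);
        seq
    }

    /// Call alongside `dequeue_buf`: recover the metadata for the frame the
    /// hardware just finished, regardless of completion order.
    fn on_dequeue(&mut self, seq: u64) -> Option<FrameMeta> {
        self.pending.remove(&seq)
    }
}
```

Because sequence numbers are caller-assigned and monotonic, the tracker also detects double completions: a second dequeue of the same sequence returns None.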

V4L2 M2M compatibility: umka-compat translates V4L2 memory-to-memory device ioctls (VIDIOC_QBUF, VIDIOC_DQBUF, VIDIOC_STREAMON) on M2M nodes to queue_buf / dequeue_buf / session start. The pixel format negotiation (VIDIOC_S_FMT) maps to SessionConfig field selection. Applications using libv4l2 or GStreamer's v4l2h264enc / v4l2h264dec elements run unmodified.

Hardware-specific detail: Per-vendor media codec and camera ISP driver architecture (Intel Quick Sync/GuC/HuC, AMD VCN, Qualcomm Venus, MediaTek VENC/VDEC, camera ISP — ARM Mali C71, Qualcomm Spectra) is documented inline in this section. Camera/video capture architecture is in Section 12.4.

12.1.7 AI / NPU Accelerator

Tier: Tier 1. Large model weight tensors require physically contiguous DMA allocations that the kernel memory allocator must satisfy. Inference latency requirements (< 1 ms first-token for edge models) preclude the overhead of a Tier 2 boundary crossing on each inference submission.

KABI interface name: accel_device_v1 (in interfaces/accel_device.kabi).

// umka-core/src/accel/mod.rs — authoritative NPU/accelerator driver contract

/// A hardware accelerator device: NPU, DSP, or tensor processor.
/// Implemented by drivers for Qualcomm Hexagon, Intel VPU (Meteor Lake NPU),
/// Apple ANE (via open-source reimplementation), MediaTek APU, and custom
/// ASICs.
pub trait AccelDevice: Send + Sync {
    // --- Buffer object management (shared model with Section 12.1.4 GPU) ---

    /// Allocate a buffer object in accelerator-accessible memory. `size` is
    /// in bytes (page-aligned). `placement` selects between accelerator-local
    /// SRAM/DRAM, coherent system memory, or non-coherent DMA-able system
    /// memory depending on what the hardware supports.
    ///
    /// Requires `CAP_ACCEL_INFERENCE`.
    fn alloc_bo(
        &self,
        size: u64,
        placement: AccelPlacementFlags,
    ) -> Result<BufferObject, AccelError>;

    /// Free a buffer object. Must not be in use by a model or in-flight
    /// inference when freed.
    fn free_bo(&self, bo: BufferObject) -> Result<(), AccelError>;

    // --- Model lifecycle ---

    /// Upload a pre-compiled model blob (produced by the vendor NPU compiler
    /// running in userspace) to the accelerator. The blob format is
    /// device-specific and opaque to the kernel; the kernel validates only
    /// its size and alignment constraints. The kernel does NOT JIT-compile or
    /// interpret the blob; it DMA-copies it to accelerator SRAM/DRAM and
    /// registers it with the firmware.
    ///
    /// Returns a `ModelHandle` used in subsequent `submit_inference` calls.
    ///
    /// Requires `CAP_ACCEL_INFERENCE`.
    fn load_model(
        &self,
        blob: DmaBufHandle,
        blob_size: u64,
    ) -> Result<ModelHandle, AccelError>;

    /// Unload a model, freeing accelerator SRAM and deregistering the model
    /// from firmware. Any in-flight inference using this model must complete
    /// before calling this; returns `AccelError::ModelInUse` if not.
    fn unload_model(&self, model: ModelHandle) -> Result<(), AccelError>;

    // --- Inference submission ---

    /// Submit an inference request. `input` is a DMA-BUF containing the
    /// input tensor data in the layout expected by the model (described in
    /// the model blob metadata). `output` is a DMA-BUF that the accelerator
    /// will write inference results to. Both buffers must be at least as
    /// large as the model's declared input/output tensor sizes.
    ///
    /// `wait_fences` lists `GpuFence` (Section 12.1.4) values that must be signaled
    /// before the inference begins (e.g., a camera frame that is still being
    /// written by the ISP). Returns a `GpuFence` signaled when the output
    /// tensor is complete and `output` is safe to read.
    ///
    /// Requires `CAP_ACCEL_INFERENCE`.
    fn submit_inference(
        &self,
        model: &ModelHandle,
        input: DmaBufHandle,
        output: DmaBufHandle,
        wait_fences: &[GpuFence],
    ) -> Result<GpuFence, AccelError>;

    // --- Capability query ---

    /// Return static device capabilities: supported data types (INT8, FP16,
    /// BF16, FP32), maximum model size in bytes, maximum batch size, list of
    /// supported operator sets (ONNX opset version, TFLite version, etc.),
    /// and hardware performance counters layout.
    fn query_capabilities(&self) -> AccelCapabilities;

    // --- TDR ---

    /// Reset the accelerator after a hung or timed-out inference. The kernel
    /// calls this automatically when an inference does not complete within
    /// the configured TDR timeout (default: 30 s for large models, adjustable
    /// via `/sys/class/accel/<dev>/tdr_timeout_ms` with `CAP_ACCEL_ADMIN`).
    ///
    /// All sessions on the device are reset. In-flight inferences return
    /// `AccelError::Timeout` (-ETIMEDOUT) to their callers. If the hardware
    /// supports per-session context isolation, only the hung session is
    /// terminated; other sessions resume.
    ///
    /// Requires `CAP_ACCEL_ADMIN`.
    fn tdr_reset(&self) -> Result<(), AccelError>;
}

/// A loaded model handle.
pub struct ModelHandle {
    /// Opaque kernel handle.
    pub handle: u64,
    /// Size of the model blob in bytes.
    pub blob_size: u64,
    /// Required input tensor size in bytes.
    pub input_size: u64,
    /// Required output tensor size in bytes.
    pub output_size: u64,
}

/// Static capabilities of an accelerator device.
pub struct AccelCapabilities {
    /// Peak INT8 throughput in tera-operations per second.
    pub tops_int8: u32,
    /// Peak FP16 throughput in tera-operations per second.
    pub tops_fp16: u32,
    /// Accelerator-local memory size in bytes (SRAM + on-package DRAM).
    pub local_memory_bytes: u64,
    /// Maximum single model blob size in bytes.
    pub max_model_size_bytes: u64,
    /// Supported numeric data types.
    pub data_types: AccelDataTypeFlags,
    /// Supported operator sets (bitmask: ONNX, TFLite, QNN, OpenVINO IR).
    pub operator_sets: AccelOpSetFlags,
}

bitflags! {
    /// Numeric data types the accelerator can execute natively.
    pub struct AccelDataTypeFlags: u32 {
        const INT8  = 1 << 0;
        const INT16 = 1 << 1;
        const FP16  = 1 << 2;
        const BF16  = 1 << 3;
        const FP32  = 1 << 4;
    }
}

bitflags! {
    /// Supported operator set languages.
    pub struct AccelOpSetFlags: u32 {
        /// ONNX opset (any version accepted by this device's firmware).
        const ONNX      = 1 << 0;
        /// TensorFlow Lite flatbuffer format.
        const TFLITE    = 1 << 1;
        /// Qualcomm QNN binary format.
        const QNN       = 1 << 2;
        /// Intel OpenVINO IR format.
        const OPENVINO  = 1 << 3;
    }
}

bitflags! {
    /// Where to source backing memory for an accelerator buffer object.
    pub struct AccelPlacementFlags: u32 {
        /// Accelerator-local SRAM or on-package DRAM (highest bandwidth).
        const ACCEL_LOCAL = 1 << 0;
        /// System DRAM, coherent with CPU caches.
        const SYSTEM_COHERENT = 1 << 1;
        /// System DRAM, non-coherent (explicit cache flush/invalidate needed).
        const SYSTEM_NONCOHERENT = 1 << 2;
    }
}

Compiler model: The kernel never compiles or JIT-translates model graphs. Vendor SDKs (Qualcomm QNN SDK, Intel OpenVINO, Google XNNPACK, Arm Ethos toolchain) run entirely in userspace and produce a hardware-specific binary blob. The kernel's role is limited to loading that blob into accelerator memory, managing its lifetime, and scheduling inference jobs. This boundary keeps the attack surface small and avoids incorporating license-encumbered compiler code into the kernel.

Shared synchronization with GPU: AccelDevice uses GpuFence (Section 12.1.4) for all completion signaling. A camera-to-inference pipeline can therefore express its dependencies as:

camera_fence = ISP_submit(frame)
infer_fence  = accel.submit_inference(model, input, output, &[camera_fence])
display_fence = compositor.atomic_commit(plane, &[infer_fence])

No additional synchronization primitive is needed.
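The readiness rule the chain relies on can be modeled in a few lines. The seqno-on-a-timeline representation mirrors the DmaFence/GpuFence shape defined elsewhere in this chapter, but this software mock is illustrative only; real fences are backed by hardware status words:

```rust
/// A fence on a timeline is signaled once the timeline's completed
/// sequence number reaches the fence's seqno.
#[derive(Clone, Copy)]
struct Fence { timeline: usize, seqno: u64 }

/// Per-timeline completed seqno, indexed by `Fence::timeline`.
struct Timelines { completed: Vec<u64> }

impl Timelines {
    /// Non-blocking completion poll.
    fn is_done(&self, f: Fence) -> bool {
        self.completed[f.timeline] >= f.seqno
    }

    /// An inference may be dispatched only when every wait fence has
    /// signaled -- the check the scheduler performs before starting a job.
    fn ready(&self, wait_fences: &[Fence]) -> bool {
        wait_fences.iter().all(|&f| self.is_done(f))
    }
}
```

Each stage in the camera-to-display chain simply appends its output fence to the next stage's wait list, so dependency ordering falls out of the ready check.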

Capability gating: CAP_ACCEL_INFERENCE gates buffer allocation, model loading, and inference submission. CAP_ACCEL_ADMIN additionally gates TDR, thermal policy override, and access to hardware performance counters.

Hardware-specific detail: Per-vendor NPU/DSP driver architecture (Qualcomm Hexagon, Intel Meteor Lake NPU/OpenVINO, MediaTek APU, ONNX Runtime FPGA backend) is documented inline in this section and in Section 21.1–46 (11-accelerators.md) for the unified accelerator scheduling framework.

12.1.8 DMA Engine

Tier: Tier 1. DMA engines are platform infrastructure directly used by other Tier 1 subsystems (audio DMA in Section 12.1.3, display framebuffer DMA in Section 12.1.2, storage DMA in Section 10.7). They must operate in kernel context to program IOMMU tables and to route completion interrupts to the correct waiters.

KABI interface name: dma_engine_v1 (in interfaces/dma_engine.kabi).

// umka-core/src/dma_engine/mod.rs — authoritative DMA engine driver contract

/// A platform DMA engine controller. Implemented by drivers for Intel CBDMA /
/// DSA, ARM PL330, Synopsys eDMA, TI UDMA, and Xilinx AXI DMA.
pub trait DmaEngine: Send + Sync {
    /// Request a DMA channel from this engine. `capabilities` specifies the
    /// minimum set of capabilities the channel must provide (e.g.,
    /// `MEM_TO_MEM | SCATTER_GATHER`). The engine selects the best-matching
    /// channel from its pool; returns `DmaError::NoChannel` if none is
    /// available.
    ///
    /// On ACPI platforms, the channel is cross-referenced to the CSRT entry
    /// describing it. On DT platforms, the channel is cross-referenced to the
    /// `dmas` phandle in the requesting device's DT node. See the ACPI/DT
    /// enumeration note below.
    fn request_channel(
        &self,
        capabilities: DmaChannelCapabilities,
    ) -> Result<DmaChannel, DmaError>;

    /// Release a DMA channel back to the engine's pool. The channel must not
    /// have any in-flight transactions (`DmaFence` values that have not yet
    /// been signaled) when released; returns `DmaError::ChannelBusy` if so.
    fn release_channel(&self, channel: DmaChannel) -> Result<(), DmaError>;
}

/// A DMA channel: a single logical DMA stream backed by one hardware channel.
pub trait DmaChannel: Send + Sync {
    /// Submit a flat memory-to-memory copy of `len` bytes from physical
    /// address `src_pa` to physical address `dst_pa`. Returns a `DmaFence`
    /// signaled when the copy is complete.
    ///
    /// Both addresses must be within IOMMU-mapped regions. The caller is
    /// responsible for cache coherency (flush source, invalidate destination)
    /// on non-coherent platforms before and after the transfer.
    fn memcpy(
        &self,
        dst_pa: u64,
        src_pa: u64,
        len: u64,
    ) -> Result<DmaFence, DmaError>;

    /// Submit a scatter-gather copy. `entries` is a list of
    /// `(src_pa, dst_pa, len)` tuples. The engine processes entries in order.
    /// Returns a single `DmaFence` signaled after all entries complete.
    ///
    /// The maximum number of entries per call is bounded by
    /// `DmaChannelInfo::max_sg_entries`; split across multiple calls if
    /// needed.
    fn sg_copy(
        &self,
        entries: &[(u64, u64, u64)],
    ) -> Result<DmaFence, DmaError>;

    /// Fill `len` bytes starting at physical address `dst_pa` with the
    /// repeating byte pattern `value`. Returns a `DmaFence` signaled on
    /// completion. Used for zeroing newly allocated pages and clearing
    /// framebuffers.
    fn fill(
        &self,
        dst_pa: u64,
        len: u64,
        value: u8,
    ) -> Result<DmaFence, DmaError>;

    /// Return static information about this channel (capabilities,
    /// maximum transfer size, maximum scatter-gather entry count).
    fn channel_info(&self) -> DmaChannelInfo;
}

/// A DMA completion handle. Cheap to copy; backed by a hardware status word.
#[derive(Clone, Copy)]
pub struct DmaFence {
    /// Identifies the DMA engine and channel this fence belongs to.
    pub channel_id: u32,
    /// Sequence number on the channel's completion timeline.
    pub seqno: u64,
}

/// DmaFence operations are provided by the DMA engine driver via the KABI vtable.
/// The methods below document the contract; actual implementation is in the driver.
pub trait DmaFenceOps: Send + Sync {
    /// Poll whether this DMA transfer has completed. Returns immediately
    /// without blocking. Safe to call from interrupt context.
    fn is_done(&self, fence: &DmaFence) -> bool;

    /// Block the current thread until this DMA transfer completes or until
    /// `timeout_ns` nanoseconds elapse. Returns `Ok(())` on completion,
    /// `Err(DmaError::Timeout)` on timeout.
    fn wait(&self, fence: &DmaFence, timeout_ns: u64) -> Result<(), DmaError>;
}

/// Static information about a DMA channel.
pub struct DmaChannelInfo {
    /// Capabilities of this specific channel.
    pub capabilities: DmaChannelCapabilities,
    /// Maximum number of bytes per single `memcpy` or `fill` call.
    pub max_transfer_bytes: u64,
    /// Maximum number of scatter-gather entries per `sg_copy` call.
    pub max_sg_entries: u32,
    /// Whether this channel's transfers are observable by the CPU without
    /// an explicit cache flush (i.e., the DMA path is cache-coherent).
    pub coherent: bool,
}

bitflags! {
    /// Capabilities that a DMA channel may provide.
    pub struct DmaChannelCapabilities: u32 {
        /// Memory-to-memory flat copy.
        const MEM_TO_MEM     = 1 << 0;
        /// Memory-to-device transfers (device is the sink).
        const MEM_TO_DEV     = 1 << 1;
        /// Device-to-memory transfers (device is the source).
        const DEV_TO_MEM     = 1 << 2;
        /// Scatter-gather transfer support.
        const SCATTER_GATHER = 1 << 3;
        /// Memory fill (pattern write, used for zeroing).
        const FILL           = 1 << 4;
        /// Cache-coherent DMA path (no manual flush/invalidate required).
        const COHERENT       = 1 << 5;
    }
}

Shared infrastructure model: DmaChannel is the common abstraction for all bulk-data DMA in UmkaOS. Subsystems that need DMA use it as follows:

  • Audio (Section 12.1.3): the PcmStream DMA ring uses a MEM_TO_DEV or DEV_TO_MEM channel obtained from the audio controller's built-in DMA or from a platform DMA engine channel bound in ACPI/DT.
  • Display (Section 12.1.2): cursor and framebuffer uploads on platforms without a GPU use a MEM_TO_DEV channel.
  • Storage (Section 10.7): on platforms where the storage controller does not have its own scatter-gather engine, DmaChannel::sg_copy is used for PRD tables.

On platforms where the device has its own built-in DMA (NVMe PRPs, AHCI PRDT, PCIe DMA engines on GPUs), the device driver does not use DmaEngine at all; the built-in DMA is programmed directly and the completion is reported via the device's own interrupt.
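Because sg_copy bounds each call at DmaChannelInfo::max_sg_entries, a caller with a larger scatter-gather list must batch it across multiple calls. A minimal sketch of that batching (the helper name is ours, not part of the KABI; one sg_copy would be issued per returned batch):

```rust
/// Split a scatter-gather list of (src_pa, dst_pa, len) tuples into batches
/// that each respect the channel's `max_sg_entries` bound. Pure splitting
/// helper; submission and fence collection are left to the caller.
pub fn split_sg(
    entries: &[(u64, u64, u64)],
    max_sg_entries: usize,
) -> Vec<Vec<(u64, u64, u64)>> {
    entries
        .chunks(max_sg_entries.max(1)) // guard against a zero bound
        .map(|batch| batch.to_vec())
        .collect()
}
```

The caller would wait on (or chain) one DmaFence per batch; entry order within and across batches is preserved, matching sg_copy's in-order processing guarantee.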

ACPI/DT enumeration: On ACPI platforms, DMA engine channels are described in the ACPI CSRT (Core System Resources Table, a Microsoft-defined table whose signature is reserved in the ACPI specification). The kernel's ACPI layer parses the CSRT at boot and registers each channel group as a DmaEngine instance. Consumers reference channels by ACPI _CRS DMA descriptor. On Device Tree platforms, channels are described using the dmas and dma-names properties in the consuming device node, following the DMA controller binding in the Linux kernel DT bindings (used as the authoritative reference for this property format).

Hardware-specific detail: Per-platform DMA engine driver architecture (Intel CBDMA/DSA, ARM PL330, TI UDMA-P on AM65x/J7, Synopsys eDMA on PCIe controllers) is documented inline in this section. Platform-specific channel discovery uses ACPI CSRT or Device Tree dmas/dma-names properties.

12.1.9 GPIO and Pin Control

Tier: Tier 1. GPIO controllers are low-level platform hardware directly used by many other Tier 1 drivers for chip-select lines, reset/enable signals, and interrupt routing. GPIO interrupts must be demultiplexed in the kernel IRQ subsystem (Section 10.2) before they can be delivered to drivers or (via eventfd) to userspace.

KABI interface names: gpio_controller_v1, pinctrl_v1 (in interfaces/gpio.kabi).

// umka-core/src/gpio/mod.rs — authoritative GPIO and pin control contract

/// A GPIO controller. One instance per hardware GPIO IP block (which may
/// expose dozens to hundreds of individual lines). Implemented by drivers for
/// Intel Broxton/Cannon Lake PCH GPIO, ARM PL061, NXP RGPIO, Qualcomm TLMM,
/// and Broadcom BCM2835 GPIO.
pub trait GpioController: Send + Sync {
    // --- Pin configuration ---

    /// Configure a GPIO line's direction, pull resistor, and drive mode.
    /// Must be called before `read` or `write` on the line.
    fn configure(
        &self,
        line: GpioLine,
        direction: GpioDirection,
        pull: GpioPull,
        drive: GpioDrive,
    ) -> Result<(), GpioError>;

    // --- Digital I/O ---

    /// Read the current logic level of an input (or output in read-back mode)
    /// GPIO line. Returns `true` for high, `false` for low. Returns
    /// `GpioError::NotInput` if the line is configured as output-only and the
    /// hardware does not support output read-back.
    fn read(&self, line: GpioLine) -> Result<bool, GpioError>;

    /// Set the output level of an output-configured GPIO line. `high` = true
    /// drives the line high; `high` = false drives it low. Returns
    /// `GpioError::NotOutput` if the line is configured as input.
    fn write(&self, line: GpioLine, high: bool) -> Result<(), GpioError>;

    // --- Interrupt registration ---

    /// Register an interrupt handler for a GPIO line. `mode` selects the
    /// edge or level trigger condition. `handler` is called in a Tier 1
    /// threaded interrupt context (Section 10.2 threaded IRQ model). Returns a
    /// `GpioIrqHandle`; dropping the handle atomically deregisters the
    /// handler and ensures no further invocations occur.
    ///
    /// Only one handler may be registered per line at a time; returns
    /// `GpioError::AlreadyRegistered` if a handler is already registered.
    fn request_irq(
        &self,
        line: GpioLine,
        mode: IrqMode,
        handler: GpioIrqHandler,
    ) -> Result<GpioIrqHandle, GpioError>;

    /// Deregister the interrupt handler associated with `handle`. Equivalent
    /// to dropping the `GpioIrqHandle` but provides an explicit error return.
    fn free_irq(&self, handle: GpioIrqHandle) -> Result<(), GpioError>;

    // --- Controller metadata ---

    /// Return the number of GPIO lines managed by this controller.
    fn line_count(&self) -> u32;

    /// Return the controller's unique identifier (used to construct
    /// `GpioLine` handles for cross-subsystem use).
    fn controller_id(&self) -> u32;
}

/// A pin control block. Manages the per-pin function multiplexer on SoCs
/// where physical pads can be assigned to multiple peripheral signals
/// (GPIO, I2C, SPI, UART, PCIe reference clock, etc.).
///
/// On platforms where pin multiplexing is co-located inside the GPIO
/// controller, both traits are implemented by the same driver struct.
pub trait PinCtrl: Send + Sync {
    /// Query the list of functions available for a given pin index. Writes
    /// `PinFunction` values into the caller-supplied buffer and returns the
    /// number written. Each entry has a name (e.g., "gpio", "i2c_sda",
    /// "spi_clk", "uart_tx") and the peripheral it routes to.
    /// KABI note: uses caller-supplied buffer, not Vec, for C driver compat.
    fn query_functions(
        &self,
        pin: u32,
        buf: &mut [PinFunction],
        max_count: u32,
    ) -> Result<u32, PinCtrlError>;

    /// Select a function for a pin, connecting the physical pad to the
    /// named peripheral signal. Any previously selected function is
    /// deactivated. Returns `PinCtrlError::Conflict` if another driver has
    /// claimed this pin in an incompatible function.
    fn select_function(
        &self,
        pin: u32,
        function: &PinFunction,
    ) -> Result<(), PinCtrlError>;

    /// Release ownership of a pin, returning it to a default high-impedance
    /// state. Safe to call even if no function is currently selected.
    fn release_pin(&self, pin: u32) -> Result<(), PinCtrlError>;
}

/// Handle to a single GPIO line: the combination of a controller and a
/// zero-based pin index within that controller.
///
/// This type is referenced by Section 10.10.3 (I2C-HID interrupt line) and Section 12.1.3
/// (audio jack detection) and is formally defined here. All other subsystems
/// that reference a GPIO line MUST use this type.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct GpioLine {
    /// Identifier of the `GpioController` that owns this line.
    pub controller_id: u32,
    /// Zero-based index of the line within the controller (0 …
    /// `controller.line_count() - 1`).
    pub pin_index: u32,
}

/// RAII handle for a registered GPIO interrupt. Dropping this value
/// deregisters the handler. Implemented as a token that the kernel associates
/// with the registration record; no raw pointers are exposed.
pub struct GpioIrqHandle {
    /// Opaque kernel handle. The kernel uses this to locate and remove the
    /// registration entry on drop.
    pub(crate) handle: u64,
}

impl Drop for GpioIrqHandle {
    /// Deregister the GPIO interrupt handler. Guaranteed to be called even
    /// if the owning driver panics, preventing stale handlers from firing
    /// after the driver struct is freed.
    fn drop(&mut self) { /* kernel deregistration via syscall or direct call */ }
}

/// Type alias for a GPIO interrupt handler function pointer.
/// The handler is called in a threaded interrupt context (Section 10.2). It must not
/// block indefinitely; it may acquire short-duration spinlocks and queue
/// work to a kernel work queue.
pub type GpioIrqHandler = fn(line: GpioLine, mode: IrqMode);

/// Available trigger modes for GPIO interrupts.
#[repr(u32)]
pub enum IrqMode {
    /// Trigger on a low-to-high transition.
    RisingEdge  = 0,
    /// Trigger on a high-to-low transition.
    FallingEdge = 1,
    /// Trigger on both transitions.
    BothEdges   = 2,
    /// Trigger while the line is held high (level-triggered).
    HighLevel   = 3,
    /// Trigger while the line is held low (level-triggered).
    LowLevel    = 4,
}

#[repr(u32)]
/// GPIO line direction.
pub enum GpioDirection {
    /// Line is an input; the driver reads the external logic level.
    Input  = 0,
    /// Line is an output; the driver drives the logic level.
    Output = 1,
}

#[repr(u32)]
/// Internal pull resistor configuration.
pub enum GpioPull {
    /// No pull resistor (high impedance when not driven).
    None     = 0,
    /// Weak pull-up to VCC.
    PullUp   = 1,
    /// Weak pull-down to GND.
    PullDown = 2,
}

#[repr(u32)]
/// Output drive mode.
pub enum GpioDrive {
    /// Totem-pole (push-pull): the driver actively drives both high and low.
    PushPull   = 0,
    /// Open-drain: the driver only pulls low; high is achieved by an external
    /// pull-up. Required for I2C bus lines and wired-AND configurations.
    OpenDrain  = 1,
}

/// A multiplexable function available on a SoC pin.
pub struct PinFunction {
    /// Human-readable function name (e.g., "gpio", "i2c0_sda", "uart2_tx").
    pub name: &'static str,
    /// The peripheral subsystem this function connects to (e.g., I2C
    /// controller index 0, UART controller index 2).
    pub peripheral_id: u32,
}

IRQ model: request_irq() registers the handler with the kernel IRQ subsystem (Section 10.2). The GPIO controller's top-level interrupt line is demuxed by the GPIO driver: on each top-level interrupt, the driver reads the controller's pending interrupt register, identifies which lines are active, and dispatches the registered handlers for those lines in threaded IRQ context. Handlers run at a normal kernel thread priority with preemption enabled unless the handler explicitly raises its priority via the Section 10.2 scheduling API.
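
The per-interrupt demux step described above reduces to scanning the set bits of the pending register. A minimal, self-contained sketch; the register read and the handler table are the controller-specific parts and are omitted here:

```rust
/// Sketch of the demux step: given the value of the controller's pending
/// interrupt register, return the zero-based line indices that have a
/// pending interrupt, lowest first. The caller dispatches the registered
/// handler for each returned line in threaded IRQ context.
pub fn pending_lines(pending: u32) -> Vec<u32> {
    let mut lines = Vec::new();
    let mut bits = pending;
    while bits != 0 {
        let idx = bits.trailing_zeros(); // lowest pending line index
        lines.push(idx);
        bits &= bits - 1; // clear the lowest set bit
    }
    lines
}
```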

ACPI/DT enumeration: On ACPI platforms, GPIO lines are described using GpioInt (interrupt) and GpioIo (I/O) resource descriptors in device _CRS methods, following the ACPI specification Sections 19.6.56–19.6.57. The GPIO subsystem resolves these descriptors to GpioLine handles and registers IRQs automatically during device enumeration. On Device Tree platforms, the gpios phandle-with-args property and the standard GPIO binding (two-cell format: <&gpio_controller pin_index flags>) are parsed to produce GpioLine handles.
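
On the Device Tree side, resolving one two-cell specifier to a GpioLine is mechanical. A hedged sketch, with GpioLine re-declared locally and the flags-cell decoding (bit 0 = active-low, per the common Linux GPIO binding) taken as an assumption:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct GpioLine {
    pub controller_id: u32,
    pub pin_index: u32,
}

/// Sketch: resolve one two-cell `<&gpio_controller pin_index flags>`
/// specifier to a GpioLine. `controller_id` is the result of resolving the
/// phandle; `cells` holds the two argument cells. Bit 0 of the flags cell
/// is assumed to mean active-low, following the common Linux DT binding.
pub fn parse_gpios_specifier(controller_id: u32, cells: [u32; 2]) -> (GpioLine, bool) {
    let line = GpioLine { controller_id, pin_index: cells[0] };
    let active_low = cells[1] & 1 != 0;
    (line, active_low)
}
```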

Fix to Section 10.10.3: The GpioLine type and request_irq() method used by the I2C-HID driver (Section 10.10.3) to register the ATTN interrupt are formally defined by this Section 12.1.9 contract. The Section 10.10.3 description is authoritative on how I2C-HID uses GPIO; this section is authoritative on what GPIO provides.

Hardware-specific detail: Per-platform GPIO/pinctrl driver architecture (Intel PCH GPIO — Broxton/Cannon Lake/Tiger Lake, ARM PL061, Qualcomm TLMM, NXP i.MX IOMUXC, Broadcom BCM2835/2711) is documented inline in this section. Each platform's pin mux register layout is covered during per-driver implementation using vendor datasheets.

12.1.10 Crypto Accelerator

Tier: Tier 1. Hardware crypto engines need DMA access to key material and plaintext/ciphertext buffers. Tier 2 boundary crossing would add one to two microseconds per operation — unacceptable for TLS session establishment (RSA or ECDH operations on the critical path of every connection) and bulk record encryption (AES-GCM on every TCP segment with TLS offload).

KABI interface name: crypto_engine_v1 (in interfaces/crypto_engine.kabi).

// umka-core/src/crypto_engine/mod.rs — authoritative crypto accelerator contract

/// A hardware cryptographic accelerator. Implemented by drivers for:
/// - On-SoC crypto engines (Intel QAT, ARM TrustZone CryptoCell, NXP CAAM)
/// - NIC-integrated TLS offload engines (Mellanox ConnectX-6 TLS)
/// - HSM-adjacent secure enclaves
/// - Software fallback (when no hardware engine is present)
pub trait CryptoEngine: Send + Sync {
    // --- Capability discovery ---

    /// Return the algorithm configurations supported by this engine. Writes
    /// `AlgorithmDescriptor` entries into the caller-supplied buffer and
    /// returns the number written. Each entry specifies the algorithm family,
    /// key sizes, and performance tier (hardware-accelerated or software
    /// fallback). The caller selects an algorithm from this list when
    /// creating a session.
    /// KABI note: uses caller-supplied buffer, not Vec, for C driver compat.
    fn query_algorithms(
        &self,
        buf: &mut [AlgorithmDescriptor],
        max_count: u32,
    ) -> Result<u32, CryptoError>;

    // --- Key management ---

    /// Import raw key material into the engine under a wrapping key (or in
    /// plaintext if `wrapping_key` is None and the engine permits it). The
    /// engine stores the key internally; the caller's buffer is zeroed after
    /// import. Returns an opaque `KeyHandle`. Raw key bytes are never
    /// accessible after this call; all subsequent operations use the handle.
    ///
    /// `flags` controls whether the key may be exported (wrapped) later or
    /// is permanently non-extractable. Non-extractable keys cannot leave the
    /// hardware even if the kernel is fully compromised — the hardware
    /// enforces this at the engine level.
    ///
    /// Requires `CAP_CRYPTO_ADMIN`.
    fn import_key(
        &self,
        algorithm: AlgorithmId,
        key_bytes: &[u8],
        wrapping_key: Option<&KeyHandle>,
        flags: KeyFlags,
    ) -> Result<KeyHandle, CryptoError>;

    /// Export a key that was imported with `EXPORTABLE`. The key material is
    /// encrypted under `wrapping_key` and returned as an opaque blob. Returns
    /// `CryptoError::NonExtractable` if the key was imported with
    /// `NON_EXTRACTABLE`.
    ///
    /// Requires `CAP_CRYPTO_ADMIN`.
    fn export_key(
        &self,
        key: &KeyHandle,
        wrapping_key: &KeyHandle,
    ) -> Result<Vec<u8>, CryptoError>;

    /// Destroy a key handle. The engine erases all key material. After this
    /// call, the `KeyHandle` is invalid and any session using it will return
    /// `CryptoError::InvalidKey` on the next operation.
    ///
    /// Requires `CAP_CRYPTO_ADMIN`.
    fn destroy_key(&self, key: KeyHandle) -> Result<(), CryptoError>;

    // --- Session lifecycle ---

    /// Allocate a session for a specific algorithm and key. A session holds
    /// per-operation state (IV/nonce counters, HMAC state, RSA blinding
    /// factors) and is bound to one `KeyHandle`. Sessions are not thread-safe;
    /// concurrent callers must allocate separate sessions.
    ///
    /// Requires `CAP_CRYPTO_ACCEL`.
    fn alloc_session(
        &self,
        algorithm: AlgorithmId,
        key: &KeyHandle,
    ) -> Result<CryptoSession, CryptoError>;

    /// Free a session. Any in-flight operation on this session must complete
    /// first; returns `CryptoError::SessionBusy` if not.
    fn free_session(&self, session: CryptoSession) -> Result<(), CryptoError>;

    // --- Operation submission ---

    /// Submit a cryptographic operation. The request is placed on the
    /// engine's ring buffer (Section 10.7 ring model). Returns immediately with a
    /// `GpuFence` (Section 12.1.4 timeline semaphore) that is signaled when the
    /// output DMA-BUF has been fully written and is safe to read.
    ///
    /// `request` fully describes the operation: op type (encrypt, decrypt,
    /// sign, verify, hash, key-exchange), input DMA-BUF, output DMA-BUF,
    /// associated data (for AEAD ciphers), nonce/IV, and tag buffer.
    ///
    /// For AEAD operations (AES-GCM, ChaCha20-Poly1305): authentication tag
    /// is appended to ciphertext on encrypt, and verified + stripped on
    /// decrypt. A tag mismatch on decrypt returns `CryptoError::AuthFailed`
    /// via the fence's error status.
    ///
    /// Requires `CAP_CRYPTO_ACCEL`.
    fn submit(
        &self,
        session: &CryptoSession,
        request: &CryptoRequest,
    ) -> Result<GpuFence, CryptoError>;
}

/// An opaque key handle. The raw key material is inaccessible after
/// `import_key`; this handle is the only means to reference the key in
/// subsequent operations.
pub struct KeyHandle {
    /// Opaque engine-assigned key identifier.
    pub(crate) id: u64,
    /// The algorithm this key is bound to.
    pub algorithm: AlgorithmId,
    /// Whether this key is allowed to be exported.
    pub exportable: bool,
}

/// A crypto session: per-operation state bound to one key and algorithm.
pub struct CryptoSession {
    /// Opaque kernel handle.
    pub handle: u64,
    /// Algorithm this session is configured for.
    pub algorithm: AlgorithmId,
}

/// A single cryptographic operation request, placed on the engine's ring.
#[repr(C)]
pub struct CryptoRequest {
    /// The type of operation to perform.
    pub op: CryptoOp,
    /// DMA-BUF containing input data (plaintext for encrypt, ciphertext for
    /// decrypt, message for hash/sign, public value for key agreement).
    pub input: DmaBufHandle,
    /// DMA-BUF that the engine will write output into (ciphertext for
    /// encrypt, plaintext for decrypt, digest for hash, signature for sign,
    /// shared secret for key agreement).
    pub output: DmaBufHandle,
    /// Associated data for AEAD operations (authenticated but not encrypted).
    /// Length zero means no associated data.
    pub aad: DmaBufHandle,
    /// Nonce or IV for symmetric ciphers. Length and format are
    /// algorithm-specific: 12 bytes for AES-GCM, 12 bytes for
    /// ChaCha20-Poly1305. Ignored for hash and asymmetric operations.
    pub nonce: [u8; 16],
    /// Actual nonce/IV length in bytes (0 if not applicable).
    pub nonce_len: u8,
    /// Input data length in bytes.
    pub input_len: u64,
    /// Associated data length in bytes.
    pub aad_len: u64,
}

/// The specific cryptographic operation requested.
#[repr(u32)]
pub enum CryptoOp {
    /// Symmetric encryption (AES-GCM, ChaCha20-Poly1305, AES-CBC, AES-CTR).
    Encrypt          = 0,
    /// Symmetric decryption with authentication tag verification (AEAD) or
    /// plain decryption (non-AEAD).
    Decrypt          = 1,
    /// Compute a message digest (SHA-256, SHA-384, SHA-512, SHA-3-256).
    Hash             = 2,
    /// Compute an HMAC (HMAC-SHA-256, HMAC-SHA-384, HMAC-SHA-512).
    Hmac             = 3,
    /// Asymmetric signing (RSA-PSS, ECDSA P-256, ECDSA P-384, Ed25519).
    Sign             = 4,
    /// Asymmetric signature verification.
    Verify           = 5,
    /// Key agreement / scalar multiplication (ECDH P-256, ECDH P-384,
    /// X25519, X448). Output is the shared secret.
    KeyAgreement     = 6,
    /// TLS record encryption (NIC TLS offload engines only). Input is a
    /// plaintext TLS record; output is the encrypted wire-format record.
    TlsRecordEncrypt = 7,
    /// TLS record decryption (NIC TLS offload engines only).
    TlsRecordDecrypt = 8,
}

/// Identifier for a specific algorithm configuration.
#[repr(u32)]
pub enum AlgorithmId {
    AesGcm128       = 0,
    AesGcm256       = 1,
    ChaCha20Poly1305 = 2,
    AesCbc128       = 3,
    AesCbc256       = 4,
    AesCtr128       = 5,
    AesCtr256       = 6,
    Sha256          = 16,
    Sha384          = 17,
    Sha512          = 18,
    Sha3_256        = 19,
    HmacSha256      = 32,
    HmacSha384      = 33,
    HmacSha512      = 34,
    RsaPss2048Sha256 = 48,
    RsaPss4096Sha384 = 49,
    EcdsaP256Sha256 = 64,
    EcdsaP384Sha384 = 65,
    Ed25519         = 66,
    EcdhP256        = 80,
    EcdhP384        = 81,
    X25519          = 82,
    X448            = 83,
}

bitflags! {
    /// Flags controlling key lifecycle and extractability.
    pub struct KeyFlags: u32 {
        /// Key may be exported (wrapped) by a process holding
        /// `CAP_CRYPTO_ADMIN`. Mutually exclusive with `NON_EXTRACTABLE`.
        const EXPORTABLE       = 1 << 0;
        /// Key material never leaves the hardware security boundary. Once
        /// imported, it cannot be read out even with physical access to DRAM.
        /// Mutually exclusive with `EXPORTABLE`.
        const NON_EXTRACTABLE  = 1 << 1;
        /// Key is persistent across power cycles (stored in hardware key
        /// store, e.g., TPM NV index or TrustZone secure storage). Engines
        /// that do not support persistence return `CryptoError::Unsupported`
        /// if this flag is set.
        const PERSISTENT       = 1 << 2;
    }
}

/// Descriptor of one algorithm configuration supported by an engine.
pub struct AlgorithmDescriptor {
    /// The algorithm this descriptor covers.
    pub id: AlgorithmId,
    /// Whether this algorithm is executed in hardware (true) or software
    /// fallback (false).
    pub hardware_accelerated: bool,
    /// Approximate throughput in MiB/s for bulk operations (encrypt/decrypt/
    /// hash). 0 for asymmetric operations where throughput is not meaningful.
    pub throughput_mibps: u32,
    /// Approximate latency in microseconds per single operation (for
    /// asymmetric operations such as sign/verify/key-agreement).
    pub latency_us: u32,
}

Software fallback: If query_algorithms() returns a descriptor with hardware_accelerated = false for a requested algorithm, the submit() path executes the algorithm in a kernel software implementation (Rust aes-gcm, chacha20poly1305, sha2, p256, x25519-dalek). The API is identical regardless of acceleration. Callers that require hardware acceleration for security reasons (e.g., to achieve constant-time execution or non-extractable keys) must check the hardware_accelerated flag in the descriptor before creating a session.
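
The caller-side check can be sketched as follows; the types are re-declared locally in reduced form, and require_hardware is a hypothetical helper name, not part of the contract:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum AlgorithmId {
    AesGcm256,
    X25519,
}

pub struct AlgorithmDescriptor {
    pub id: AlgorithmId,
    pub hardware_accelerated: bool,
}

/// Sketch of the policy described above: accept the requested algorithm
/// only if the engine executes it in hardware; otherwise report why the
/// caller should not allocate a session on this engine.
pub fn require_hardware(
    descs: &[AlgorithmDescriptor],
    wanted: AlgorithmId,
) -> Result<(), &'static str> {
    match descs.iter().find(|d| d.id == wanted) {
        Some(d) if d.hardware_accelerated => Ok(()),
        Some(_) => Err("software fallback only"),
        None => Err("algorithm unsupported"),
    }
}
```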

TLS offload integration: A NIC with TLS record-layer offload (e.g., Mellanox ConnectX-6, Marvell OcteonTX2) registers itself as a CryptoEngine with TlsRecordEncrypt and TlsRecordDecrypt in its algorithm list. The kernel TLS layer (ktls, Section 15.1 net-tls) queries the CryptoEngine registry and, if a matching engine is found for the session's cipher suite, offloads record encryption there. The TCP send path then bypasses the software TLS layer and passes plaintext records directly to the NIC. The kernel retains ownership of the session key via a KeyHandle; the NIC's shadow copy is invalidated when destroy_key is called.

Capability gating: CAP_CRYPTO_ACCEL is required for session allocation and operation submission. CAP_CRYPTO_ADMIN is additionally required for key import, export, and destruction, and for reading hardware performance counters. Processes without CAP_CRYPTO_ACCEL receive the software fallback path transparently; they do not receive an error.

Hardware-specific detail: Per-vendor crypto accelerator driver architecture (Intel QAT, ARM TrustZone CryptoCell cc712/cc713, NXP CAAM, Mellanox ConnectX TLS offload) is documented inline in this section. TPM 2.0 key storage integration is specified in Section 8.2 (08-security.md).


12.2 Bluetooth HCI Driver

Interface contract: Section 12.1.1 (WirelessDriver trait covers 802.11; BT HCI uses a separate HCI socket interface exposed via umka-compat). Tier decision: Tier 2 for the BT stack (control path not latency-sensitive), Tier 1 for the kernel HCI transport driver.

Stack Decision: BlueZ-compatible via umka-compat HCI socket interface — UmkaOS provides a kernel HCI (Host Controller Interface) driver that exposes /dev/hci0 as a character device implementing the standard Linux HCI socket protocol. The BlueZ userspace daemon (bluetoothd) runs in Tier 2, implementing L2CAP, SDP, RFCOMM, A2DP, HID, and pairing logic. This approach:
  - Reuses the mature BlueZ stack (~200K lines, 15+ years of protocol compatibility testing).
  - Avoids the multi-year effort of a clean-room Bluetooth stack.
  - Maintains compatibility with existing Bluetooth management tools (bluez-utils, bluetoothctl).

12.2.1 Kernel HCI Driver (Tier 1)

The HCI driver is Tier 1 (MPK-isolated) and handles raw HCI packet transport. Common transports:
  - USB HCI: Bulk endpoints (ACL data), interrupt endpoint (events), control endpoint (commands). Most common on laptops (Intel, Realtek, Qualcomm combo modules).
  - UART HCI: Serial port (ttyS, ttyUSB) with H4/H5/BCSP framing. Common on ARM SoCs (RPi, embedded).

// umka-core/src/bluetooth/hci.rs

/// HCI packet type.
#[repr(u8)]
pub enum HciPacketType {
    /// HCI command (host → controller).
    Command = 0x01,
    /// ACL data (bidirectional, L2CAP payload).
    AclData = 0x02,
    /// SCO data (bidirectional, voice payload).
    ScoData = 0x03,
    /// HCI event (controller → host).
    Event = 0x04,
}

/// HCI device handle (opaque to userspace).
#[repr(C)]
pub struct HciDeviceId(u32);

/// HCI command packet (max 259 bytes: 1 byte type + 2 bytes opcode + 1 byte len + 255 bytes data).
#[repr(C)]
pub struct HciCommand {
    /// Packet type (always 0x01 for commands).
    pub packet_type: u8,
    /// Opcode (OCF + OGF encoded as u16).
    pub opcode: u16,
    /// Parameter length (0-255).
    pub param_len: u8,
    /// Parameters (variable length).
    pub params: [u8; 255],
}

/// HCI event packet (max 258 bytes: 1 byte type + 1 byte event code + 1 byte len + 255 bytes data).
#[repr(C)]
pub struct HciEvent {
    /// Packet type (always 0x04 for events).
    pub packet_type: u8,
    /// Event code.
    pub event_code: u8,
    /// Parameter length (0-255).
    pub param_len: u8,
    /// Parameters (variable length).
    pub params: [u8; 255],
}

The HCI driver exposes a ring buffer interface (Section 10.7.2) to the BlueZ daemon:
  - Command ring: BlueZ writes HciCommand structs; the driver sends them to the controller via USB bulk OUT or UART TX.
  - Event ring: The driver receives events from the controller (USB interrupt IN or UART RX) and writes HciEvent structs to the ring.
  - ACL TX ring: BlueZ writes ACL data packets (L2CAP frames); the driver sends them to the controller.
  - ACL RX ring: The driver receives ACL data from the controller and writes it to the ring.
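
As a worked example of the command ring's wire format: HCI opcodes pack the OGF into the top 6 bits and the OCF into the low 10 bits, so HCI_Reset (OGF 0x03, OCF 0x0003) encodes as 0x0C03. A small sketch, with illustrative helper names:

```rust
/// HCI opcode encoding per the Bluetooth Core Specification:
/// OGF in the top 6 bits, OCF in the low 10 bits.
pub fn hci_opcode(ogf: u16, ocf: u16) -> u16 {
    (ogf << 10) | (ocf & 0x03FF)
}

/// Build the 4-byte wire form of a parameterless HCI command:
/// packet type 0x01, little-endian opcode, zero parameter length.
pub fn hci_command_wire(ogf: u16, ocf: u16) -> [u8; 4] {
    let op = hci_opcode(ogf, ocf);
    [0x01, (op & 0xFF) as u8, (op >> 8) as u8, 0x00]
}
```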

12.2.2 BlueZ Daemon (Tier 2)

bluetoothd runs as a Tier 2 process. It opens /dev/hci0 (which is backed by the HCI ring buffer interface via umka-compat), reads/writes HCI packets, and implements all higher-layer protocols:
  - L2CAP: Logical Link Control and Adaptation Protocol (multiplexing, segmentation).
  - SDP: Service Discovery Protocol (enumerate remote device capabilities).
  - RFCOMM: Serial port emulation over Bluetooth (for legacy apps).
  - A2DP: Advanced Audio Distribution Profile (high-quality stereo audio streaming).
  - AVRCP: Audio/Video Remote Control Profile (play/pause, volume control).
  - HID: Human Interface Device (keyboards, mice, game controllers).
  - HSP/HFP: Headset/Hands-Free Profiles (phone call audio).

Pairing: BlueZ userspace daemon (bluetoothd) handles pairing logic and stores the pairing database. Kernel provides HCI transport only.

12.2.3 A2DP Audio Routing to PipeWire

When a Bluetooth headset is paired and A2DP is active:
  1. bluetoothd decodes the SBC/AAC/LDAC A2DP stream from ACL packets (received via the HCI ACL RX ring).
  2. Writes decoded PCM samples to a PipeWire ring buffer (Section 20.3, the same ring buffers used for wired audio).
  3. PipeWire mixes/routes the audio to the audio subsystem (Section 20.3.3, 20-user-io.md).
  4. For playback, the reverse path: PipeWire writes PCM to a ring; bluetoothd encodes to SBC/AAC and sends via ACL TX.

Latency: A2DP adds ~100-200ms latency (codec encoding/decoding, BT scheduling). This is unavoidable (Bluetooth spec limitation). Gaming audio and video calls use SCO (Synchronous Connection-Oriented) links for lower latency at the cost of lower quality (64kbps, 8kHz sample rate).

12.2.4 HID Input Routing

When a Bluetooth keyboard/mouse is paired:
  1. bluetoothd receives HID reports via L2CAP over ACL.
  2. Translates them to standard InputEvent structs (Section 20.1, same format as USB HID).
  3. Writes to the input subsystem ring buffer (Section 20.1).
  4. umka-input (the input multiplexer) routes events to the active Wayland compositor or VT.

Wake-on-Bluetooth: Before S3 suspend, bluetoothd tells the HCI driver to enable "wake on HID activity" (any HID report from a paired device wakes the system). The driver programs the USB controller's PME mask or UART's RTS line to wake on RX. Pressing a key on the Bluetooth keyboard wakes the laptop.

12.2.5 Architectural Decision

Bluetooth: BlueZ-compatible via umka-compat HCI

Decision: Kernel HCI driver (Tier 1) exposes /dev/hci0. BlueZ daemon (Tier 2) implements L2CAP, A2DP, HID, pairing. Reuses mature BlueZ stack (~200K lines, 15+ years of testing) instead of multi-year clean-room effort. UmkaOS maintains Linux HCI ABI compatibility.


12.3 WiFi Driver

Interface contract: Section 12.1.1 (WirelessDriver trait, wireless_device_v1 KABI). This section specifies the Intel/Realtek/Qualcomm/MediaTek/Broadcom implementations of that contract. Tier and ring-buffer design decisions are authoritative in Section 12.1.1.

Tier: Tier 1 (per Section 12.1.1 — latency-sensitive; IOMMU-bounded firmware threat model).

Chipset coverage (minimum for launch):
  - Intel: AX210, AX211, AX411 (WiFi 6E)
  - Realtek: RTL8852AE, RTL8852BE, RTL8852CE (common in consumer laptops)
  - Qualcomm: QCA6390, QCA6391, WCN6855 (Snapdragon-based laptops)
  - MediaTek: MT7921, MT7922 (budget laptops)
  - Broadcom: BCM4350, BCM4352 (older MacBooks, some ThinkPads)

12.3.1 WiFi Driver Architecture

WiFi drivers implement the WirelessDriver trait defined in Section 12.1.1 (wireless_device_v1 KABI). Each chipset driver is Tier 1, MPK-isolated on x86-64 (Section 10.4), and communicates with umka-net via the TX/RX ring buffers specified in Section 12.1.1.

// umka-core/src/net/wireless.rs

/// WiFi device handle. Opaque to userspace, used for ioctl operations.
#[repr(C)]
pub struct WirelessDeviceId(u64);

/// WiFi scan result.
#[repr(C)]
pub struct WifiScanResult {
    /// BSSID (MAC address of the AP).
    pub bssid: [u8; 6],
    /// SSID length (0-32 bytes).
    pub ssid_len: u8,
    /// SSID (variable length, up to 32 bytes; remaining bytes are zero).
    pub ssid: [u8; 32],
    /// RSSI (signal strength in dBm, typically -100 to 0).
    pub rssi: i8,
    /// Channel number (1-14 for 2.4 GHz, 36-165 for 5 GHz, 1-233 for 6 GHz).
    pub channel: u16,
    /// Security type (bitmask: WPA2=0x1, WPA3=0x2, Enterprise=0x4).
    pub security: u32,
    /// BSS load (0-255, indicates AP congestion; 255 = unknown).
    pub bss_load: u8,
    _pad: [u8; 3],
}

/// WiFi connection parameters.
#[repr(C)]
pub struct WifiConnectParams {
    /// SSID length (1-32 bytes).
    pub ssid_len: u8,
    /// SSID.
    pub ssid: [u8; 32],
    /// BSSID (all zeros = any BSSID; specific BSSID = forced roam to that AP).
    pub bssid: [u8; 6],
    /// Security type (WPA2=0x1, WPA3=0x2, Enterprise=0x4).
    pub security: u32,
    /// PSK (pre-shared key) length in bytes (0 for open networks).
    pub psk_len: u8,
    /// PSK (for WPA2-PSK / WPA3-SAE personal).
    pub psk: [u8; 64],
    /// 802.1X parameters (for enterprise; zero if not used).
    pub eap: Eap8021xParams,
}

/// 802.1X / EAP parameters for enterprise WiFi.
#[repr(C)]
pub struct Eap8021xParams {
    /// EAP method (0=none, 1=PEAP, 2=TTLS, 3=TLS).
    pub method: u8,
    /// Identity length.
    pub identity_len: u8,
    /// Identity (username).
    pub identity: [u8; 128],
    /// Password length (0 for certificate-based).
    pub password_len: u8,
    /// Password (for PEAP/TTLS).
    pub password: [u8; 128],
    /// CA certificate handle (for TLS verification; 0 = no pinning).
    pub ca_cert: u64,
    _pad: [u8; 6],
}

/// WiFi power save mode.
#[repr(u32)]
pub enum WifiPowerSaveMode {
    /// No power save (CAM - Constantly Awake Mode). Lowest latency, highest power.
    Disabled = 0,
    /// 802.11 Power Save Mode (PSM). Sleep between beacons, wake for DTIM.
    Enabled = 1,
    /// Aggressive power save (skip DTIMs, rely on TIM). Highest battery savings.
    Aggressive = 2,
}

/// WiFi connection state.
#[repr(u32)]
pub enum WifiState {
    /// Not connected, not scanning.
    Idle = 0,
    /// Scanning for networks.
    Scanning = 1,
    /// Authenticating with AP (4-way handshake in progress).
    Authenticating = 2,
    /// Connected, link up.
    Connected = 3,
    /// Disconnecting (deauth sent, waiting for confirmation).
    Disconnecting = 4,
}

/// WiFi statistics.
#[repr(C)]
pub struct WifiStats {
    /// Current state.
    pub state: WifiState,
    /// Connected SSID length (0 if not connected).
    pub ssid_len: u8,
    /// Connected SSID.
    pub ssid: [u8; 32],
    /// Connected BSSID (all zeros if not connected).
    pub bssid: [u8; 6],
    /// Current channel.
    pub channel: u16,
    /// RSSI (dBm).
    pub rssi: i8,
    /// Link speed (Mbps).
    pub link_speed_mbps: u16,
    /// TX packets.
    pub tx_packets: u64,
    /// RX packets.
    pub rx_packets: u64,
    /// TX bytes.
    pub tx_bytes: u64,
    /// RX bytes.
    pub rx_bytes: u64,
    /// TX errors (failed transmissions).
    pub tx_errors: u32,
    /// RX errors (FCS errors, drops).
    pub rx_errors: u32,
}
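
Because ssid is a fixed 32-byte field paired with an explicit ssid_len, consumers of WifiScanResult and WifiStats must slice by the length rather than look for a NUL terminator (SSIDs are raw bytes and may legally contain zeros). A minimal sketch; the helper name is illustrative:

```rust
/// Extract the valid SSID bytes from a fixed (ssid, ssid_len) pair as used
/// in WifiScanResult / WifiStats. SSIDs are raw bytes, not guaranteed to be
/// UTF-8 or NUL-terminated; the length is defensively clamped to the
/// 32-byte 802.11 maximum in case of a corrupt descriptor.
pub fn ssid_bytes(ssid: &[u8; 32], ssid_len: u8) -> &[u8] {
    let len = (ssid_len as usize).min(32);
    &ssid[..len]
}
```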

12.3.2 Firmware Isolation Model

WiFi firmware runs on the chip (Intel AX210's embedded ARM core, Qualcomm's dedicated DSP), not in host CPU Ring 0. The Tier 1 driver manages:

  1. Firmware upload: Firmware blobs are loaded from /System/Firmware/WiFi/<vendor>/<chip>.bin at driver probe time via the umka_driver_firmware_load() KABI call (which maps the blob DMA-accessible and issues the chip-specific firmware load command).
  2. Control path: Commands (scan, connect, disconnect) sent via MMIO registers or command rings (chip-specific).
  3. Data path: TX/RX ring buffers (see Section 12.3.3) populated by driver, consumed/produced by firmware DMA engine.

IOMMU enforcement: The WiFi chip's DMA is bounded to:
  - TX ring buffer pages (read-only from the chip's perspective)
  - RX ring buffer pages (write-only from the chip's perspective)
  - Firmware upload buffer (read-only, unmapped after upload completes)

The driver cannot access arbitrary physical memory, and the firmware cannot DMA outside its assigned buffers. This matches the NVMe threat model (Section 10.7.1): firmware is untrusted; the IOMMU is the hard boundary.

Firmware blob loading: Firmware is NOT shipped in the kernel binary (bloat, licensing). /System/Firmware/ is a separate partition or directory populated during install. The kernel provides umka_driver_firmware_load(device_id, "iwlwifi-ax210-v71.ucode"), which:
  1. Reads the file from the firmware partition (uses VFS, Tier 1 filesystem driver).
  2. Allocates an IOMMU-fenced DMA buffer.
  3. Copies the firmware blob to the buffer.
  4. Returns a DmaBufferHandle to the driver.
  5. The driver passes the handle to the chip's firmware loader.

12.3.3 TX/RX Ring Buffer Design

WiFi uses the same ring buffer protocol as NVMe (Section 10.7.2). The driver allocates two rings:

  1. TX ring: Host writes packet descriptors (packet buffer address, length, metadata). Firmware DMA engine reads descriptors, fetches packets, transmits over the air.
  2. RX ring: Firmware DMA engine writes received packet descriptors (packet buffer address, length, RSSI, channel, timestamp). Host reads descriptors, processes packets.

// umka-driver-sdk/src/wireless.rs

/// WiFi TX descriptor (64 bytes, cache-line aligned).
#[repr(C, align(64))]
pub struct WifiTxDescriptor {
    /// Physical address of packet buffer (DMA-mapped).
    pub buffer_addr: u64,
    /// Packet length in bytes (14-2304 for 802.11).
    pub length: u16,
    /// TX flags (ACK required, QoS TID, encryption).
    pub flags: u16,
    /// Sequence number (for retransmissions).
    pub seq: u16,
    /// Retry count (0 for first attempt).
    pub retries: u8,
    /// TX power (dBm, or 0xFF for default).
    pub tx_power: i8,
    /// Rate index (driver-specific rate table).
    pub rate_index: u8,
    _pad: [u8; 47],
}

/// WiFi RX descriptor (64 bytes, cache-line aligned).
///
/// Layout (with `#[repr(C, align(64))]`):
///   Offset 0:  buffer_addr (u64)     = 8 bytes
///   Offset 8:  length (u16)          = 2 bytes
///   Offset 10: flags (u16)           = 2 bytes
///   Offset 12: rssi (i8)             = 1 byte
///   Offset 13: noise (i8)            = 1 byte
///   Offset 14: channel (u16)         = 2 bytes
///   Offset 16: timestamp_us (u64)    = 8 bytes  (naturally aligned at 16)
///   Offset 24: _pad                  = 40 bytes
///   Total: 64 bytes (matches align(64), no implicit padding needed).
#[repr(C, align(64))]
pub struct WifiRxDescriptor {
    /// Physical address of packet buffer (firmware wrote packet here).
    pub buffer_addr: u64,
    /// Packet length in bytes.
    pub length: u16,
    /// RX flags (FCS OK, decryption OK, AMPDU).
    pub flags: u16,
    /// RSSI (dBm).
    pub rssi: i8,
    /// Noise floor (dBm).
    pub noise: i8,
    /// Channel number.
    pub channel: u16,
    /// Timestamp (hardware TSF, microseconds).
    pub timestamp_us: u64,
    _pad: [u8; 40],
}
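Because both descriptors must stay exactly 64 bytes across KABI versions, compile-time size checks can guard the layout (the same pattern the evolution framework later uses for `PendingOp`). The struct bodies below are copies of the definitions above, reduced to fields only:

```rust
// Copies of the two descriptor layouts above, with compile-time size checks.
#[repr(C, align(64))]
pub struct WifiTxDescriptor {
    pub buffer_addr: u64,
    pub length: u16,
    pub flags: u16,
    pub seq: u16,
    pub retries: u8,
    pub tx_power: i8,
    pub rate_index: u8,
    _pad: [u8; 47],
}

#[repr(C, align(64))]
pub struct WifiRxDescriptor {
    pub buffer_addr: u64,
    pub length: u16,
    pub flags: u16,
    pub rssi: i8,
    pub noise: i8,
    pub channel: u16,
    pub timestamp_us: u64,
    _pad: [u8; 40],
}

// Any field change that alters the 64-byte size fails the build, not the KABI.
const _: () = assert!(core::mem::size_of::<WifiTxDescriptor>() == 64);
const _: () = assert!(core::mem::size_of::<WifiRxDescriptor>() == 64);
```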

Zero-copy path: When umka-net (the Tier 1 network stack) needs to send a packet over WiFi:

  1. umka-net allocates a packet buffer from the DMA-capable memory pool (Section 11.1.5 umka_driver_dma_alloc).
  2. Writes the 802.11 frame (header + payload) to the buffer.
  3. Writes a WifiTxDescriptor to the TX ring.
  4. Kicks the firmware (MMIO doorbell write).
  5. Firmware DMA-reads the descriptor, DMA-reads the packet, transmits.
  6. Firmware writes a completion entry to the TX completion ring (separate ring, omitted here for brevity; same pattern as NVMe).
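Steps 3-4 (descriptor write plus doorbell kick) can be sketched with a simplified single-producer ring in place of the full Section 10.7.2 protocol; the `doorbell` callback stands in for the MMIO write:

```rust
/// Minimal single-producer TX ring sketch (illustrative only; the real ring
/// protocol is the NVMe-style one from Section 10.7.2). Descriptors are
/// simplified to a bare packet buffer address.
pub struct TxRing {
    descs: Vec<u64>, // simplified descriptor: DMA address only
    tail: usize,     // next slot the host writes
    head: usize,     // next slot the firmware will consume
}

impl TxRing {
    pub fn new(size: usize) -> Self {
        TxRing { descs: vec![0; size], tail: 0, head: 0 }
    }

    /// Steps 3-4 of the zero-copy path: write descriptor, advance tail, kick.
    pub fn enqueue(&mut self, buffer_addr: u64, doorbell: &mut dyn FnMut(usize)) -> bool {
        let next = (self.tail + 1) % self.descs.len();
        if next == self.head {
            return false; // ring full: firmware has not caught up
        }
        self.descs[self.tail] = buffer_addr;
        self.tail = next;
        doorbell(self.tail); // MMIO doorbell write telling firmware the new tail
        true
    }
}
```

One slot stays unused so a full ring is distinguishable from an empty one, which is why a 4-slot ring accepts only three entries.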

12.3.4 Power Management

WiFi power management integrates with Section 6.2 power budgeting and Section 6.2.11 suspend/resume.

Power save modes:

  • WifiPowerSaveMode::Disabled: Driver keeps the radio in CAM (Constantly Awake Mode). Lowest latency, ~1.5W idle power.
  • WifiPowerSaveMode::Enabled: Driver enables 802.11 PSM. Radio sleeps between beacons, wakes for DTIM. ~300mW idle power, ~10-20ms wake latency.
  • WifiPowerSaveMode::Aggressive: Driver enables DTIM skipping (only wake every 3rd DTIM) and beacon filtering (hardware drops beacons not containing a traffic indication). ~150mW idle power, ~50-100ms wake latency.

Mode selection: Controlled by the power profile (Section 6.2.10):

  • Performance profile: Disabled
  • Balanced profile: Enabled
  • BatterySaver profile: Aggressive
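The mapping is a pure function of the profile; a sketch, assuming `PowerProfile` matches the Section 6.2.10 profile names:

```rust
/// Power-save mode selection (sketch of the mapping above).
#[derive(Debug, PartialEq)]
pub enum WifiPowerSaveMode { Disabled, Enabled, Aggressive }

/// Assumed to mirror the Section 6.2.10 power profiles.
pub enum PowerProfile { Performance, Balanced, BatterySaver }

pub fn power_save_for(profile: &PowerProfile) -> WifiPowerSaveMode {
    match profile {
        PowerProfile::Performance => WifiPowerSaveMode::Disabled,
        PowerProfile::Balanced => WifiPowerSaveMode::Enabled,
        PowerProfile::BatterySaver => WifiPowerSaveMode::Aggressive,
    }
}
```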

Fast wake: When the radio is in PSM and an outbound packet arrives, the driver:

  1. Immediately sends a null data frame with PM=0 (telling the AP "I'm awake now").
  2. Queues the outbound packet in the TX ring.
  3. The firmware buffers it until the AP acknowledges the PM=0 frame (~5-10ms).
  4. Then transmits the queued packet.

12.3.5 WoWLAN (Wake-on-WLAN)

Before entering S3 suspend (Section 6.2.11), the driver registers wake patterns with the firmware:

  • Magic Packet: Wake on receiving a packet whose destination MAC matches the WiFi interface.
  • Disconnect: Wake on AP deauth/disassoc (lost connection).
  • GTK Rekey: Wake on WPA2 group key rekey (maintains encryption sync).

The firmware remains powered (in D3hot, not D3cold) during S3. When a wake pattern matches, the firmware asserts the PCIe PME (Power Management Event) signal, waking the system. The driver's resume() callback (Section 6.2.11) re-establishes the connection.

Security consideration: WoWLAN patterns are capability-gated. Only processes with CAP_NET_ADMIN can configure wake patterns, preventing DoS (malicious process sets "wake on any packet" → battery drain).

12.3.6 Scan Offload

The driver supports background scanning while suspended (S0ix Modern Standby, Section 6.2.11):

  1. Before S0ix entry, the driver programs the firmware with a scan schedule (every 30 seconds, channels 1/6/11 only, passive scan).
  2. Firmware performs scans autonomously while the host CPU is in C10 (powered down).
  3. If scan results differ significantly (RSSI drop >20dB, AP disappeared), firmware wakes the host via PME.
  4. Driver's resume handler evaluates the roaming decision.
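The step-3 wake predicate reduces to a small check, sketched here under the thresholds stated above (RSSI drop >20dB, or the associated AP missing from the latest scan):

```rust
/// Firmware-side wake decision from step 3 (sketch).
/// `new_rssi_dbm` is `None` when the associated AP was absent from the scan.
pub fn should_wake_host(prev_rssi_dbm: i32, new_rssi_dbm: Option<i32>) -> bool {
    match new_rssi_dbm {
        None => true,                            // AP disappeared
        Some(rssi) => prev_rssi_dbm - rssi > 20, // significant RSSI drop
    }
}
```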

This enables "instant reconnect" on lid open: the firmware already scanned for APs and selected the best candidate while the laptop was asleep.

12.3.7 Roaming

When the driver detects poor link quality (RSSI < -75dBm, packet loss >5%), it triggers a roam:

  1. Background scan for APs on the same SSID.
  2. Select the best candidate (highest RSSI, lowest BSS load).
  3. Send a reassociation request to the new AP.
  4. If successful, TX/RX rings continue using the same buffers (no data plane disruption).
  5. If failed, stay connected to the current AP and retry the roam in 5 seconds.
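The trigger and candidate-selection logic can be sketched directly from the thresholds and tie-break rule above (highest RSSI wins; lower BSS load breaks ties):

```rust
/// Roam trigger check (sketch): RSSI below -75 dBm or packet loss above 5%.
pub fn should_roam(rssi_dbm: i32, packet_loss_pct: f32) -> bool {
    rssi_dbm < -75 || packet_loss_pct > 5.0
}

/// Candidate selection (step 2): highest RSSI; ties broken by lower BSS load.
/// Candidates are `(rssi_dbm, bss_load)` pairs; returns the winning index.
pub fn best_candidate(candidates: &[(i32, u8)]) -> Option<usize> {
    candidates
        .iter()
        .enumerate()
        // Reverse the load comparison so that LOWER load compares as greater.
        .max_by(|(_, a), (_, b)| a.0.cmp(&b.0).then(b.1.cmp(&a.1)))
        .map(|(i, _)| i)
}
```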

Seamless roaming: The driver batches the last ~10 outbound packets in a shadow buffer during reassociation. If roaming succeeds, retransmits them to the new AP. If roaming fails, discards them (they're already lost). This avoids TCP connection resets during roaming.

12.3.8 Architectural Decision: WiFi Tier Classification

Decision: WiFi drivers are Tier 1 (in-kernel, isolation-domain-sandboxed).

Rationale: Tier 2 (separate process) would add ~200–500 cycles of IPC overhead per packet on the hot RX path. WiFi is latency-sensitive: video calls, SSH sessions, and cloud gaming are all affected by millisecond-scale jitter. WiFi firmware runs on-chip (AX210's embedded ARM core, Qualcomm's DSP) — not on the host CPU — so Tier 1 does not mean "trust the firmware"; IOMMU enforcement is the hard boundary, matching the NVMe threat model (Section 10.5). Tier 2 would add latency without improving isolation.


12.3.9 nl80211 — Linux Wireless Configuration Interface

nl80211 is the Linux Generic Netlink-based wireless configuration interface. Userspace tools — wpa_supplicant, iw, hostapd, iwd, NetworkManager — use nl80211 to scan for networks, configure connections, and manage access point (AP) mode. Without nl80211, WiFi is invisible to standard Linux userspace.

UmkaOS implements nl80211 in the umka-wireless module (Tier 1, inside umka-net). The module registers a Generic Netlink family named "nl80211" and translates nl80211 commands into WirelessDriver KABI calls (Section 12.1.1). No cfg80211 or mac80211 kernel modules are needed — UmkaOS's implementation is a direct translation layer.

Architecture

wpa_supplicant / iw / hostapd / NetworkManager
    │  AF_NETLINK socket, NETLINK_GENERIC, family "nl80211"
    │  NL80211_CMD_* commands, NL80211_ATTR_* attributes
    ▼
umka-wireless: nl80211 Generic Netlink handler
    │  Translates nl80211 → WirelessDriver KABI calls
    │  Delivers WirelessEvent → nl80211 multicast notifications
    ▼
WirelessDriver (KABI, §12.1.1)
    │  WirelessDriver::scan(), connect(), disconnect(), ...
    ▼
WiFi chip driver (Tier 1: Intel AX210, Realtek RTL8852AE, ...)
/// nl80211 Generic Netlink family registration.
pub struct Nl80211Family {
    /// Family ID (auto-assigned at registration; user queries via
    /// `CTRL_CMD_GETFAMILY` on the "nlctrl" family).
    pub family_id: u16,
    /// Multicast groups for unsolicited event delivery.
    pub mcast_groups: [Nl80211McastGroup; 5],
}

pub enum Nl80211McastGroup {
    /// Scan notifications (scan started, scan results ready).
    Scan,
    /// Regulatory domain notifications.
    Regulatory,
    /// MLME events (auth, assoc, disassoc, connect, disconnect).
    Mlme,
    /// Vendor-specific events.
    Vendor,
    /// NAN (Neighbor Awareness Networking) events.
    Nan,
}

Key NL80211 Commands

The following NL80211 commands are implemented. All commands use NETLINK_GENERIC with family "nl80211". Requests carry NL80211_ATTR_IFINDEX to identify the wireless interface.

| NL80211 Command | wpa_supplicant use | UmkaOS implementation |
|---|---|---|
| NL80211_CMD_GET_WIPHY | Query hardware capabilities (bands, rates, features) | WirelessDriver::capabilities() + hardware query |
| NL80211_CMD_GET_INTERFACE | Get interface mode (station/AP/monitor) | per-interface state |
| NL80211_CMD_SET_INTERFACE | Change interface mode | WirelessDriver::set_interface_type() |
| NL80211_CMD_TRIGGER_SCAN | Start a scan (SSIDs, channels, IEs) | WirelessDriver::scan() |
| NL80211_CMD_GET_SCAN | Dump scan results (BSS list) | WirelessDriver::get_scan_results() |
| NL80211_CMD_AUTHENTICATE | Send 802.11 authentication frame | WirelessDriver::authenticate() |
| NL80211_CMD_ASSOCIATE | Send 802.11 association request | WirelessDriver::associate() |
| NL80211_CMD_DEAUTHENTICATE | Send deauthentication frame | WirelessDriver::disconnect() |
| NL80211_CMD_DISASSOCIATE | Send disassociation frame | WirelessDriver::disconnect() |
| NL80211_CMD_CONNECT | SME-controlled connect (full auth+assoc) | WirelessDriver::connect() |
| NL80211_CMD_DISCONNECT | SME-controlled disconnect | WirelessDriver::disconnect() |
| NL80211_CMD_GET_STATION | Per-station info (RSSI, TX rate, etc.) | WirelessDriver::stats() |
| NL80211_CMD_SET_STATION | Set per-station flags | WirelessDriver::set_station() |
| NL80211_CMD_NEW_KEY / DEL_KEY | Install/remove pairwise/group keys (WPA2/WPA3) | WirelessDriver::set_key() |
| NL80211_CMD_SET_BSS | Configure AP parameters (beacon interval, DTIM, HT) | WirelessDriver::set_bss_params() |
| NL80211_CMD_START_AP | Start access point mode | WirelessDriver::start_ap() |
| NL80211_CMD_STOP_AP | Stop access point mode | WirelessDriver::stop_ap() |
| NL80211_CMD_REGISTER_FRAME | Register for specific management frames | WirelessDriver::register_mgmt_frame() |
| NL80211_CMD_FRAME | Send a management frame (probe req, auth, etc.) | WirelessDriver::send_mgmt_frame() |
| NL80211_CMD_SET_POWER_SAVE | Enable/disable power save mode | WirelessDriver::set_power_save() |
| NL80211_CMD_SET_CHANNEL | Set monitor channel | WirelessDriver::set_channel() |
| NL80211_CMD_NEW_INTERFACE | Create secondary virtual interface (p2p, monitor) | WirelessDriver::add_interface() |
| NL80211_CMD_DEL_INTERFACE | Delete secondary interface | WirelessDriver::del_interface() |
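All of these commands carry their parameters as netlink attributes: 4-byte-aligned TLVs whose u16 length includes the 4-byte header. A minimal attribute walker, as a sketch of what the umka-wireless handler's parsing layer might look like (real code validates attribute types against a per-command policy):

```rust
/// Minimal netlink attribute (TLV) walker, native-endian.
/// Returns (nla_type, payload) pairs; stops at the first malformed attribute.
pub fn parse_attrs(mut buf: &[u8]) -> Vec<(u16, Vec<u8>)> {
    let mut out = Vec::new();
    while buf.len() >= 4 {
        // struct nlattr: u16 nla_len (includes 4-byte header), u16 nla_type.
        let nla_len = u16::from_ne_bytes([buf[0], buf[1]]) as usize;
        let nla_type = u16::from_ne_bytes([buf[2], buf[3]]);
        if nla_len < 4 || nla_len > buf.len() {
            break; // malformed attribute: stop parsing
        }
        out.push((nla_type, buf[4..nla_len].to_vec()));
        // Attributes are padded to a 4-byte boundary (NLA_ALIGN).
        let aligned = (nla_len + 3) & !3;
        buf = if aligned >= buf.len() { &[] } else { &buf[aligned..] };
    }
    out
}
```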

Asynchronous Events (Multicast Notifications)

UmkaOS delivers wireless events as nl80211 multicast notifications to registered listeners (wpa_supplicant subscribes via NL80211_MCGRP_MLME):

/// Events delivered asynchronously to nl80211 multicast subscribers.
/// Each event is a Netlink message with an NL80211_CMD_* code.
pub enum Nl80211Event {
    /// NL80211_CMD_NEW_SCAN_RESULTS: scan completed, results available.
    /// Attributes: NL80211_ATTR_GENERATION (scan cookie).
    ScanDone { aborted: bool },

    /// NL80211_CMD_AUTHENTICATE: authentication result.
    /// Attributes: NL80211_ATTR_FRAME (the auth frame), NL80211_ATTR_STATUS_CODE.
    Authenticate { bssid: [u8; 6], status_code: u16 },

    /// NL80211_CMD_ASSOCIATE: association result.
    /// Attributes: NL80211_ATTR_FRAME, NL80211_ATTR_STATUS_CODE, NL80211_ATTR_RESP_IE.
    /// `resp_ies` holds Information Elements from the ASSOC_RESP frame body.
    /// Fixed-size inline buffer (350 bytes) covers all IEs in production APs;
    /// `resp_ies_len` is the valid byte count. `Vec<u8>` is prohibited here —
    /// `WirelessEvent` entries must be fixed-size for the event ring buffer.
    Associate { bssid: [u8; 6], status_code: u16, resp_ies: [u8; 350], resp_ies_len: u16 },

    /// NL80211_CMD_CONNECT: connection result (when using SME mode).
    /// Attributes: NL80211_ATTR_STATUS_CODE, NL80211_ATTR_RESP_IE.
    Connect { bssid: [u8; 6], status_code: u16 },

    /// NL80211_CMD_DISCONNECT: disconnection notification.
    /// Attributes: NL80211_ATTR_REASON_CODE, NL80211_ATTR_DISCONNECTED_BY_AP.
    Disconnect { reason_code: u16, by_ap: bool },

    /// NL80211_CMD_NOTIFY_CQM: Connection Quality Monitor event.
    /// Attributes: NL80211_ATTR_CQM (rssi threshold, beacon loss count).
    CqmRssiAlert { rssi: i32, threshold_event: CqmThresholdEvent },

    /// NL80211_CMD_ROAM: driver initiated roam to new AP.
    /// `req_ies`: IEs from the ASSOC_REQ sent to the new AP (RSN, HT/VHT/HE caps).
    /// `resp_ies`: IEs from the ASSOC_RESP received from the new AP.
    /// Fixed-size inline buffers; `_len` fields carry valid byte counts.
    Roam {
        bssid: [u8; 6],
        req_ies: [u8; 256],
        req_ies_len: u16,
        resp_ies: [u8; 350],
        resp_ies_len: u16,
    },

    /// NL80211_CMD_DEAUTHENTICATE: received deauthentication frame from AP.
    Deauthenticate { bssid: [u8; 6], reason_code: u16 },

    /// NL80211_CMD_DISASSOCIATE: received disassociation frame.
    Disassociate { bssid: [u8; 6], reason_code: u16 },

    /// NL80211_CMD_MICHAEL_MIC_FAILURE: TKIP MIC failure (TKIP attack detection).
    MicFailure { bssid: [u8; 6], key_type: u32, key_id: u8 },

    /// NL80211_CMD_FRAME: management frame received (for registered frame types).
    Frame { freq: u32, data: Vec<u8> },

    /// AP mode: client associated.
    NewStation { mac: [u8; 6] },

    /// AP mode: client disassociated.
    DelStation { mac: [u8; 6] },
}

WiPhy and Band Information

NL80211_CMD_GET_WIPHY returns a comprehensive capabilities structure that wpa_supplicant and iw use to configure connections. Key nested attributes:

/// Wireless phy capabilities (nested in NL80211_ATTR_WIPHY_BANDS).
pub struct Nl80211Band {
    /// Frequency range.
    pub band: Nl80211BandId,
    /// Supported channels (frequency in MHz + channel flags).
    pub channels: Vec<Nl80211Channel>,
    /// Supported TX bit rates (HT/VHT/HE MCS tables).
    pub rates: Vec<Nl80211Rate>,
    /// HT capabilities (NL80211_BAND_ATTR_HT_CAPA): MIMO streams, channel width, etc.
    pub ht_cap: Option<HtCapabilities>,
    /// VHT capabilities (NL80211_BAND_ATTR_VHT_CAPA): 80/160 MHz, MU-MIMO.
    pub vht_cap: Option<VhtCapabilities>,
    /// HE capabilities (NL80211_BAND_ATTR_IFTYPE_DATA): WiFi 6/6E rates.
    pub he_cap: Option<HeCapabilities>,
}

pub enum Nl80211BandId {
    Ghz2_4 = 0,
    Ghz5   = 1,
    Ghz60  = 2,
    Ghz6   = 3,
}

pub struct Nl80211Channel {
    /// Channel center frequency in MHz.
    pub freq_mhz:  u32,
    /// Channel flags (NL80211_FREQUENCY_ATTR_*).
    pub flags:     ChannelFlags,
    /// Maximum TX power in tenths of a dBm (200 = 20.0 dBm).
    pub max_power: u32,
}

bitflags! {
    pub struct ChannelFlags: u32 {
        /// Passive scan only (no probe requests transmitted).
        const PASSIVE_SCAN    = 1 << 0;
        /// Beaconing not allowed.
        const NO_IBSS         = 1 << 1;
        /// Radar detection required (DFS channel).
        const RADAR           = 1 << 2;
        /// No HT40- operation.
        const NO_HT40_MINUS   = 1 << 3;
        /// No HT40+ operation.
        const NO_HT40_PLUS    = 1 << 4;
        /// No 80 MHz operation.
        const NO_80MHZ        = 1 << 5;
        /// No 160 MHz operation.
        const NO_160MHZ       = 1 << 6;
        /// Indoor only.
        const INDOOR_ONLY     = 1 << 7;
        /// Go concurrent (can be used simultaneously with another channel).
        const GO_CONCURRENT   = 1 << 8;
        /// No 20 MHz operation (only available in wider modes).
        const NO_20MHZ        = 1 << 9;
        /// No HE operation.
        const NO_HE           = 1 << 10;
        /// Disabled (regulatory constraint).
        const DISABLED        = 1 << 11;
    }
}

Regulatory Domain

UmkaOS enforces regulatory channel restrictions via the CRDA (Central Regulatory Domain Agent) or a compiled-in regulatory database (wireless-regdb):

/// Regulatory domain (ISO 3166-1 alpha-2 country code).
pub struct RegDomain {
    /// Country code (e.g., "US", "DE", "JP", "00" = world regulatory domain).
    pub alpha2: [u8; 2],
    /// DFS region (for radar detection requirements).
    pub dfs_region: DfsRegion,
    /// Frequency rules.
    pub rules: Vec<RegRule>,
}

pub struct RegRule {
    /// Frequency range (MHz).
    pub freq_range:  core::ops::RangeInclusive<u32>,
    /// Maximum bandwidth allowed (MHz).
    pub max_bw_mhz:  u32,
    /// Maximum EIRP (equivalent isotropically radiated power), in dBm.
    pub max_eirp_dbm: u32,
    /// Rule flags.
    pub flags:       RegRuleFlags,
}

bitflags! {
    pub struct RegRuleFlags: u32 {
        const NO_OFDM     = 1 << 0; // OFDM not allowed
        const NO_CCK      = 1 << 1; // CCK not allowed
        const NO_INDOOR   = 1 << 2; // Indoor operation prohibited
        const NO_OUTDOOR  = 1 << 3; // Outdoor operation prohibited
        const DFS         = 1 << 4; // DFS required
        const PTP_ONLY    = 1 << 5; // Point-to-point links only
        const PTMP_ONLY   = 1 << 6; // Point-to-multipoint only
        const NO_IR       = 1 << 7; // No initiating radiation (passive listen only)
        const AUTO_BW     = 1 << 11; // Auto-select bandwidth based on local conditions
        const IR_CONCURRENT = 1 << 12; // IR even if not associated
        const NO_HT40MINUS  = 1 << 13;
        const NO_HT40PLUS   = 1 << 14;
        const NO_80MHZ      = 1 << 15;
        const NO_160MHZ     = 1 << 16;
    }
}
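A sketch of how these rules gate transmission: a channel is admissible only if some rule covers the whole occupied bandwidth and permits that width. `Rule` here is a reduced `RegRule` with flags handling omitted:

```rust
use core::ops::RangeInclusive;

/// Reduced RegRule (sketch): frequency range plus maximum bandwidth only.
pub struct Rule {
    pub freq_range: RangeInclusive<u32>, // MHz
    pub max_bw_mhz: u32,
}

/// Regulatory admission check: the whole channel (center ± bw/2) must fall
/// inside a single rule's range, and the rule must allow that bandwidth.
pub fn tx_allowed(rules: &[Rule], center_mhz: u32, bw_mhz: u32) -> bool {
    let half = bw_mhz / 2;
    rules.iter().any(|r| {
        bw_mhz <= r.max_bw_mhz
            && r.freq_range.contains(&(center_mhz - half))
            && r.freq_range.contains(&(center_mhz + half))
    })
}
```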

Regulatory domain changes are broadcast via NL80211_CMD_REG_CHANGE multicast to the NL80211_MCGRP_REGULATORY group.

P2P (Wi-Fi Direct)

Wi-Fi P2P (peer-to-peer, used by Miracast/screen mirroring and Android Beam) is implemented as a virtual interface mode:

pub enum Nl80211IfType {
    Unspecified = 0,
    Adhoc       = 1,  // IBSS (independent BSS)
    Station     = 2,  // Client station (default)
    Ap          = 3,  // Access Point
    ApVlan      = 4,  // AP VLAN (virtual interface per client)
    Wds         = 5,  // WDS (4-address mode)
    Monitor     = 6,  // Monitor (receive all frames, no TX)
    MeshPoint   = 7,  // 802.11s mesh
    P2pClient   = 8,  // P2P client
    P2pGo       = 9,  // P2P Group Owner (acts as AP for P2P group)
    P2pDevice   = 10, // P2P device (for discovery; not an AP or station)
    Ocb         = 11, // Outside Context of BSS (802.11p, V2X)
    Nan         = 12, // NAN (Neighbor Awareness Networking)
}

NL80211_CMD_NEW_INTERFACE with NL80211_IFTYPE_P2P_DEVICE creates the P2P discovery interface. wpa_supplicant handles P2P negotiation in userspace; the kernel provides the management frame exchange mechanism via NL80211_CMD_FRAME / NL80211_CMD_REGISTER_FRAME.

Linux Compatibility

  • Same "nl80211" Generic Netlink family name
  • Same NL80211_CMD_* command codes (compatible with kernel 5.15+ nl80211.h)
  • Same NL80211_ATTR_* attribute IDs
  • Same multicast group names (scan, regulatory, mlme, vendor, nan)
  • Same NL80211_BAND_* band descriptors
  • iw(8): station and AP management works without modification
  • wpa_supplicant 2.10+: full WPA2-Personal, WPA2-Enterprise (EAP-PEAP/TTLS/TLS), WPA3-SAE
  • hostapd 2.10+: AP mode, 802.11r fast roaming, 802.11w management frame protection
  • iwd 2.x: full station mode, systemd-iwd integration
  • NetworkManager 1.44+: uses wpa_supplicant or iwd, both work
  • rfkill integration: blocking the WiFi rfkill device disables the radio; all channels are reported as disabled in subsequent wiphy dumps, and listeners are notified via nl80211 multicast

12.4 Camera and Video Capture

Current state: Not covered.

Requirements:

  1. Webcam drivers:
     • UVC (USB Video Class) — most common
     • MIPI CSI-2 (for ARM SoCs, integrated cameras)
     • Vendor-specific protocols (some laptops have custom cameras)

  2. Video capture API:
     • V4L2 (Video4Linux2) compatibility OR clean UmkaOS API?
     • Pixel formats (YUYV, NV12, MJPEG, H.264)
     • Resolution enumeration
     • Frame rate control

  3. Privacy:
     • Camera privacy shutter (physical or electronic)
     • Indicator LED control (show when camera is active)
     • Per-app camera access control (capability-based)

Tier classification: Tier 1 with strict isolation (webcam compromise must not escalate)


12.5 Printers and Scanners

Current state: Not covered.

Requirements:

  1. Printing:
     • CUPS (Common Unix Printing System) compatibility
     • IPP (Internet Printing Protocol)
     • Driverless printing (IPP Everywhere, AirPrint)
     • Legacy printer drivers (HPLIP, Gutenprint)

  2. Scanning:
     • SANE (Scanner Access Now Easy) compatibility
     • Network scanners (eSCL, WSD)
Priority: Low (many users rely on network printing, driverless IPP)

12.6 Live Kernel Evolution

12.6.1 The Theseus Model

Theseus OS (Rice University, 2020) demonstrated that kernel components can be individually replaced at runtime without rebooting, by making state ownership explicit and granular.

UmkaOS already does this for drivers (Section 10.8 crash recovery). This section extends it to core kernel components.

12.6.2 Design: Explicit State Ownership Graph

// umka-core/src/evolution/mod.rs

/// Every kernel component declares its state explicitly.
/// This enables:
///   1. Live replacement: old component's state is migrated to new component.
///   2. Crash recovery: component's state can be reconstructed from invariants.
///   3. State inspection: debugging and observability.

/// Trait that every replaceable kernel component implements.
pub trait EvolvableComponent {
    /// Component's serializable state.
    /// Must capture ALL mutable state that persists across calls.
    type State: Serialize + Deserialize;

    /// Export current state for migration to a new version.
    fn export_state(&self) -> Self::State;

    /// Initialize from migrated state (for live replacement).
    fn import_state(state: Self::State) -> Result<Self, MigrationError>
    where Self: Sized;

    /// Initialize fresh (for first boot or after state loss).
    fn initialize_fresh(config: &KernelConfig) -> Self
    where Self: Sized;

    /// Version of this component's state format.
    /// Migration rule: v(N) can import v(N-1) state ONLY.
    /// For larger jumps (v1 → v5): chained migration through intermediates
    /// (v1 → v2 → v3 → v4 → v5). Each version carries ONE migration
    /// function from the immediately prior version. The chain runs
    /// during import_state() before the atomic swap.
    fn state_version(&self) -> u32;
}

Chain length bound: To prevent unbounded migration chains, the maximum chain length is 8 intermediate versions. A component at version v(K) can be live-evolved to at most version v(K+8) in a single operation. Larger version jumps require either:

  (a) a direct v(K)→v(K+N) migration function registered by the new component (the component author provides a migration path that skips intermediates), or
  (b) multiple sequential live evolutions (v(K)→v(K+8)→v(K+16)→...), each of which is a separate atomic operation with its own rollback capability.

The 8-version limit bounds the worst-case migration time to ~8× the single-step migration cost. If a chained migration exceeds 500 ms total elapsed time, the evolution is aborted and the old component continues running. This timeout is configurable via evolution.max_chain_time_ms.
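The chained-migration loop can be sketched as follows. `StepFn` and the `(source_version, fn)` registry shape are illustrative assumptions, not the actual framework API; `Vec<u8>` stands in for the serialized state blob:

```rust
/// Chained state migration sketch: each step function migrates v(N) → v(N+1)
/// state; the chain is bounded at 8 steps as specified above.
pub const MAX_CHAIN_LEN: u32 = 8;

#[derive(Debug, PartialEq)]
pub enum MigrationError { ChainTooLong, MissingStep(u32) }

/// Hypothetical migration step: consumes v(N) state, produces v(N+1) state.
pub type StepFn = fn(Vec<u8>) -> Vec<u8>;

/// Runs the migration chain from version `from` to version `to`.
/// `steps` maps a source version to the function migrating it one step up.
pub fn migrate_chain(
    mut state: Vec<u8>,
    from: u32,
    to: u32,
    steps: &[(u32, StepFn)],
) -> Result<Vec<u8>, MigrationError> {
    if to - from > MAX_CHAIN_LEN {
        return Err(MigrationError::ChainTooLong);
    }
    for v in from..to {
        let (_, f) = steps
            .iter()
            .find(|(src, _)| *src == v)
            .ok_or(MigrationError::MissingStep(v))?;
        state = f(state);
    }
    Ok(state)
}
```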

State Serialization Format:

/// Serialized component state for live replacement.
pub struct ComponentState {
    /// Component identifier (e.g., "scheduler", "page_replacement").
    /// Fixed-size string to ensure validity across live-replacement boundaries
    /// (heap/static pointers from the replaced component are invalid after replacement).
    pub component_id: ArrayString<64>,
    /// State format version (matches EvolvableComponent::state_version).
    pub version: u32,
    /// Serialized state data (component-owned schema).
    /// Allocated from the kernel heap via `alloc::vec::Vec` — this is acceptable
    /// because state export/import runs only during live replacement (rare, cold
    /// path, well after the heap allocator is initialized). State sizes are
    /// bounded per component (see Section 12.6.5 table).
    pub data: Vec<u8>,
    /// CRC32C of all preceding fields, using hardware acceleration
    /// (SSE4.2 `crc32` on x86, ARMv8 CRC instructions).
    ///
    /// **Checksum**: CRC32C provides adequate 32-bit error detection for this
    /// small, cold-path structure. A cryptographic hash is unnecessary here —
    /// state integrity against malicious tampering is enforced by the evolution
    /// framework's capability checks and signature verification (Section 12.6.5),
    /// not by this checksum.
    pub checksum: u32,
}

Each component owns its serialization schema. The kernel provides StateSerializer helpers for common patterns (serialize BTreeMap, serialize per-CPU arrays, serialize LRU lists) but does not impose a format. Components choose what to serialize and how — the contract is that import_state(export_state()) produces an equivalent component.
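For illustration, here is a bitwise software CRC32C over the checksummed fields, using the same Castagnoli polynomial (reflected form 0x82F63B78) that the hardware instructions implement; a real kernel would use the SSE4.2/ARMv8 instructions mentioned above, which compute the identical value:

```rust
/// Bitwise CRC32C (Castagnoli), software fallback sketch for the
/// `ComponentState::checksum` field. Matches SSE4.2 `crc32` / ARMv8 CRC32C.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // If the low bit is set, shift and xor in the reflected polynomial.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}
```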

12.6.3 Component Replacement Flow

Live kernel component replacement (e.g., new scheduler algorithm):

Phase A — Preparation (runs concurrently with normal operation, NOT stop-the-world):
  1. New component binary loaded (same mechanism as policy module, Section 18.7).
  2. Old component: export_state() → serialized state.
     This may walk large data structures (all run queues, LRU lists, etc.).
     Time: potentially milliseconds for complex components.
     Normal operation continues during this phase — the old component
     is still active and handling requests.
  3. New component: import_state(serialized_state) → initialized.

Phase A' — Quiescence (bounded, runs before the atomic swap):
  Before the atomic swap, the old component enters a **quiescence phase**: all
  in-flight operations are allowed to complete (with a bounded deadline), and new
  operations are queued. The quiescence deadline is configurable per component type
  (default: 10ms for scheduler, 50ms for page replacement). If the deadline expires
  before all in-flight operations drain, the replacement is aborted and the old
  component resumes normal operation without disruption.

  **Scheduler-specific quiescence note**: For the scheduler, `pick_next_task` is
  called from the timer interrupt on every tick on every CPU. During quiescence,
  these calls are intercepted by the trampoline and queued. This means **no new
  scheduling decisions are made** during the quiescence window — CPUs continue
  executing their current task. The 10ms quiescence window is chosen to be ≤ one
  scheduler tick (typically 4ms on HZ=250), so at most 2-3 timer ticks are queued
  per CPU. The queued `pick_next_task` calls are replayed by the new scheduler
  immediately after the Phase B atomic swap. Worst-case scheduling latency impact:
  ~10ms for a single task that should have been preempted during quiescence. This
  is comparable to a long spinlock hold and acceptable for a live-evolution operation
  that occurs at most a few times per kernel lifetime.

```rust
/// Maximum serialized argument size for a deferred vtable call.
/// Total struct size = 8 (header) + 248 (payload) = 256 bytes = 4 cache lines.
pub const PENDING_OP_MAX_ARG_SIZE: usize = 248;

/// A vtable call deferred during component quiescence (live driver evolution).
/// Fixed-size layout enables a statically-allocated ring buffer — no heap allocation
/// during the quiescence window when memory operations may be restricted.
///
/// `method_id = 0` is a sentinel for an empty/invalid slot.
#[repr(C, align(64))]
pub struct PendingOp {
    /// Vtable method index (matches the `KernelServicesVTable` or `DriverVTable` ordinal).
    pub method_id: u32,
    /// Number of valid bytes in `args` (0 if method takes no arguments).
    pub arg_len: u32,
    /// Serialized method arguments. Encoding is method-specific (documented per method).
    pub args: [u8; PENDING_OP_MAX_ARG_SIZE],
}

// Compile-time assertion: struct must be exactly 256 bytes (4 × 64-byte cache lines).
const _PENDING_OP_SIZE_CHECK: () = assert!(
    core::mem::size_of::<PendingOp>() == 256,
    "PendingOp must be exactly 256 bytes"
);

/// Maximum number of ops that can be queued during a single quiescence window.
/// At 1000 calls/sec, PENDING_OPS_QUEUE_CAPACITY provides ~64ms of buffering.
pub const PENDING_OPS_QUEUE_CAPACITY: usize = 64;

/// The pending-op ring buffer for a quiescing driver instance.
/// Statically allocated — no heap allocation during quiescence.
pub struct PendingOpsRing {
    buf: [PendingOp; PENDING_OPS_QUEUE_CAPACITY],
    head: AtomicU32,
    tail: AtomicU32,
}
```

Operation interception mechanism: At Phase A' entry, a per-component quiescing: AtomicBool flag is set to true. The vtable entry trampoline checks this flag before dispatching each call. When quiescing is true, the trampoline appends the operation descriptor (a serialized PendingOp containing the method ID and argument blob) to a bounded pending_ops queue instead of invoking the old component. This interception is lock-free (the queue is a pre-allocated MPSC ring buffer, see PendingOpsRing above). The vtable pointer itself is not yet swapped — interception happens at the trampoline level, not the pointer level.

Queued operation handling: Operations that arrive during Phase A' are appended to pending_ops via the interception mechanism above. If pending_ops reaches capacity (PENDING_OPS_QUEUE_CAPACITY, 64 entries), the quiescence deadline is extended by up to 100ms. If the deadline expires and in-flight operations have still not drained, the evolution is aborted: quiescing is set to false, the trampoline resumes normal dispatch, and the old component resumes without disruption.

State re-export: After in-flight operations drain, the old component's state is re-exported (export_state() on the now-quiesced component). This re-export does NOT capture pending_ops — the queue is transferred separately in Phase B.

Phase B — Atomic swap (stop-the-world, ~1-10 μs):
  4. All CPUs briefly hold (IPI to stop-the-world).
  5. The pending_ops queue is transferred to the new component by copying the ring buffer head/tail pointers. This is O(1) — no data copying, just pointer assignment. Operations that arrived between the Phase A' re-export and the IPI are captured because the interception trampoline continues appending to pending_ops until the IPI fires.
  6. Old component's vtable pointer is replaced with the new component's vtable.
  7. Interrupt handlers redirected. quiescing flag cleared.
  8. CPUs released. New component is now active.

Only the pointer swap + queue transfer is stop-the-world. No data structure walking. The queue transfer (step 5) adds ~100ns to the stop-the-world window.

Phase C — Activation and cleanup:

Phase C1 — New component activation:
  9. New component drains the pending_ops queue before accepting new operations. Each pending op is replayed through the new component's vtable.

Phase C2 — Deferred cleanup (after watchdog window):
  10. The old component is NOT immediately freed. It is frozen (no new calls) but its memory is retained for the Post-Swap Watchdog window (5 seconds, see below). If the watchdog triggers a revert, the old component is reactivated from this frozen state.
  11. After the watchdog window expires without revert, the old component is unloaded and its memory freed.

Total disruption: ~1-10 μs (the Phase B stop-the-world window only).
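The Phase C1 drain replays queued ops in FIFO order through the new component's dispatch before live traffic resumes. A sketch, with a heap-backed `Ring` as a simplified stand-in for the statically allocated `PendingOpsRing`:

```rust
/// Simplified pending-op ring for the Phase C1 drain sketch.
/// Each entry is (method_id, serialized args), mirroring `PendingOp`.
pub struct Ring {
    pub buf: Vec<(u32, Vec<u8>)>,
    pub head: usize, // consumer index (new component)
    pub tail: usize, // producer index (trampoline, frozen after Phase B)
}

impl Ring {
    /// Replays all queued ops in FIFO order through `dispatch`
    /// (stand-in for the new component's vtable). Returns the replay count.
    pub fn drain(&mut self, dispatch: &mut dyn FnMut(u32, &[u8])) -> usize {
        let mut replayed = 0;
        while self.head != self.tail {
            let (method_id, args) = &self.buf[self.head % self.buf.len()];
            dispatch(*method_id, args);
            self.head += 1;
            replayed += 1;
        }
        replayed
    }
}
```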

If import_state fails (incompatible version): → Abort replacement. Old component continues. No disruption.

If new component crashes after replacement: → Crash recovery (Section 10.8). Reload old component with initialize_fresh(). → Component state lost, but system continues.



**Post-Swap Watchdog:**

After the atomic swap (Phase B), a 5-second watchdog timer starts. If the new
component crashes or triggers a fault within this window, the kernel reverts to the
old component using the RETAINED serialized state (from `export_state()` in Phase A),
not `initialize_fresh()`. This preserves accumulated state (run queue weights, LRU
ordering, learned parameters) across a failed swap attempt. Only if the retained
state itself is corrupted does the kernel fall back to `initialize_fresh()`.

**Memory During Swap:**

The dual-load approach (old + new component coexist during Phase A) requires
sufficient memory for both. Typical component state sizes: scheduler ~64KB, page
replacement ~128KB, I/O scheduler ~8KB per device. If insufficient memory is
available for the new component's state, the swap returns `ENOMEM` and the old
component continues unchanged. Maximum expected dual-load overhead: ~128KB for the
scheduler (the largest replaceable component).


### 12.6.4 Export Symbol Contract

When a component is live-replaced, other components may depend on its exported
symbols (vtable entries, public functions, constants). The following rules govern
export compatibility during live evolution:

1. **Compatible exports required.** The new version MUST export the same KABI vtable
   entries at compatible types (same layout, same semantics). If the new version changes
   an export's signature (different parameter types, different return type, different
   struct layout), the live evolution is **rejected at load time** during Phase A. The
   loader compares vtable sizes and entry signatures before proceeding to state export.

2. **Indirection-based resolution.** Export addresses are resolved through the KABI
   vtable indirection table, not direct pointers. When the new version loads, the
   vtable pointer is atomically updated during Phase B (step 5). Dependent components
   never hold raw function pointers to the old version's code -- they dispatch through
   the vtable pointer, which is updated in the stop-the-world window. This is the same
   mechanism used for policy module vtable dispatch ([Section 18.7](18-compat.md#187-safe-kernel-extensibility)).

3. **Removed exports rejected.** If the new version removes a vtable entry (reduces
   `vtable_size`), the evolution is rejected unless no loaded component references the
   removed entry. The loader scans the dependency graph during Phase A to verify this.
   Adding new entries (increasing `vtable_size`) is always safe -- existing callers
   never reference entries beyond the size they were compiled against.
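Rules 1 and 3 reduce to a slot-by-slot comparison at load time. A minimal sketch of that check, assuming a per-slot signature hash (the `VtableDesc` and `EvolveCheck` types and the hash representation are illustrative, not the loader's actual structures):

```rust
/// Illustrative vtable descriptor: one signature hash per vtable slot,
/// in slot order. The real loader compares KABI type signatures;
/// hashes stand in for them here.
struct VtableDesc {
    entry_sigs: Vec<u64>,
}

#[derive(Debug, PartialEq)]
enum EvolveCheck {
    Ok,
    SignatureChanged { slot: usize },
    EntryRemoved { slot: usize },
}

/// Rule 1: every slot present in both versions must keep its signature.
/// Rule 3: shrinking the vtable is rejected if any removed slot is still
/// referenced (`referenced_slots` comes from the dependency-graph scan).
/// Growing the vtable (appending new entries) is always accepted.
fn check_compatible(
    old: &VtableDesc,
    new: &VtableDesc,
    referenced_slots: &[usize],
) -> EvolveCheck {
    for (slot, (&o, &n)) in old.entry_sigs.iter()
        .zip(new.entry_sigs.iter()).enumerate()
    {
        if o != n {
            return EvolveCheck::SignatureChanged { slot };
        }
    }
    if new.entry_sigs.len() < old.entry_sigs.len() {
        for &slot in referenced_slots {
            if slot >= new.entry_sigs.len() {
                return EvolveCheck::EntryRemoved { slot };
            }
        }
    }
    EvolveCheck::Ok
}
```

Any `SignatureChanged` or `EntryRemoved` result aborts the evolution during Phase A, before state export begins.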

### 12.6.5 What Can Be Live-Replaced

| Component | Replaceable? | State Size | Notes |
|-----------|-------------|-----------|-------|
| CPU scheduler | Yes | Per-CPU run queues, CBS servers (~64KB total) | Policy module swap ([Section 18.7](18-compat.md#187-safe-kernel-extensibility)) covers most cases |
| Page replacement | Yes | LRU lists, access counters (~128KB) | Hot-swap eviction algorithm |
| I/O scheduler | Yes | Per-device queues (~8KB per device) | Hot-swap I/O algorithm |
| Network classifier | Yes | Classification rules, flow tables (~256KB) | Hot-swap QoS policy |
| Memory allocator | **No** | Buddy allocator state is the physical memory map | Too fundamental to swap. Bugs caught by verification ([Section 23.10](23-roadmap.md#2310-formal-verification-readiness)). |
| Page table manager | **No** | Active page tables for all processes | Same — too fundamental. |
| Capability system | **No** | Global capability table | Security-critical — verified ([Section 23.10](23-roadmap.md#2310-formal-verification-readiness)), never replaced. |
| KABI dispatch | **No** | Vtable registry | Infrastructure — stable by design. |
| Tier 1 drivers | Yes (existing) | Driver-internal state | Crash recovery already handles this ([Section 10.8](10-drivers.md#108-crash-recovery-and-state-preservation)). |

The non-replaceable components (listed above) are verified via the techniques in
[Section 23.10](23-roadmap.md#2310-formal-verification-readiness). The replaceable components can be evolved independently via [Section 12.6](#126-live-kernel-evolution).
Together: verified core + evolvable policy.

### 12.6.6 Performance Impact

**Steady-state: zero.** Between replacements, code paths are identical to a
monolithic kernel. The `EvolvableComponent` trait adds no runtime code — it's
a development contract.

**During replacement**: ~1-10 μs stop-the-world. Happens at most once per kernel
update. Amortized over months of uptime: unmeasurable.

---

## 12.7 Hardware Watchdog Framework

The hardware watchdog framework exposes `/dev/watchdog` and `/dev/watchdogN` character
devices to userspace. Its purpose is production system health monitoring: if the
privileged userspace daemon (typically systemd) stops petting the watchdog before its
timeout expires, the hardware unconditionally resets the machine. This guarantees
recovery from hung kernels, deadlocked daemons, and runaway processes — scenarios
where orderly shutdown is impossible.

This section describes the **system-level hardware watchdog** (WDOG). It is distinct
from:
- The **clocksource watchdog** ([Section 6.5.5](06-scheduling.md#655-clocksource-watchdog)):
  detects unstable TSC and switches clocksources. Internal kernel mechanism, no userspace interface.
- The **driver crash watchdog** ([Section 10.5.5.3](10-drivers.md#10553-timeouts)):
  per-driver health timer used by the crash recovery subsystem. Also internal.

The WDOG is the final line of defense: it runs in hardware and is guaranteed to fire
even if the kernel itself is completely hung.

### 12.7.1 WatchdogOps KABI Vtable

Tier 1 watchdog drivers implement the `WatchdogOps` vtable. The vtable follows the
standard KABI conventions ([Section 11.1](11-kabi.md#111-driver-model-and-stable-abi-kabi)): `vtable_size` for forward
compatibility, `KabiResult` for error propagation, `unsafe extern "C"` for ABI stability.

```rust
// umka-core/src/watchdog/ops.rs

/// KABI vtable for hardware watchdog drivers (Tier 1).
/// All function pointers are called from the watchdog core with IRQs enabled
/// and no spinlocks held, unless noted otherwise.
#[repr(C)]
pub struct WatchdogOps {
    /// Must be set to `size_of::<WatchdogOps>()` by the driver.
    /// The watchdog core uses this to detect older drivers that do not
    /// implement fields added in later versions.
    pub vtable_size: u64,
    /// Driver API version. Currently 1.
    pub version: u32,
    _padding: u32,

    /// Start the watchdog countdown. After this call returns `KabiResult::Ok`,
    /// the kernel MUST call `keepalive` before `timeout_s` seconds elapse.
    /// Called once when `/dev/watchdog` is first opened.
    /// Returns: KabiResult::Ok, KabiResult::Err(EBUSY) if already running.
    pub start: unsafe extern "C" fn(wdd: *mut WatchdogDev) -> KabiResult,

    /// Stop the watchdog. Not all hardware supports stopping once started.
    /// If `None`, the watchdog cannot be stopped after `start()`.
    /// Production systems should set `nowayout` to prevent calling `stop`.
    pub stop: Option<unsafe extern "C" fn(wdd: *mut WatchdogDev) -> KabiResult>,

    /// Pet the watchdog: reset the hardware countdown to `timeout_s`.
    /// This is the hot path — called on every write to `/dev/watchdog`
    /// and on every `WDIOC_KEEPALIVE` ioctl. Must be fast and non-sleeping.
    pub keepalive: unsafe extern "C" fn(wdd: *mut WatchdogDev) -> KabiResult,

    /// Set the watchdog timeout. `timeout_s` is the requested timeout in
    /// seconds. Returns the actual timeout the hardware was set to (hardware
    /// may round to the nearest supported granularity). If `None`, the timeout
    /// is fixed at the hardware default. Called before `start()` if the user
    /// sets a timeout via `WDIOC_SETTIMEOUT`.
    pub set_timeout: Option<unsafe extern "C" fn(
        wdd: *mut WatchdogDev,
        timeout_s: u32,
    ) -> u32>,

    /// Return remaining time before hardware reset, in seconds.
    /// Optional: returns 0 if not implemented (field is `None`).
    pub get_timeleft: Option<unsafe extern "C" fn(wdd: *mut WatchdogDev) -> u32>,

    /// Return the current hardware status word (bitfield of `WatchdogStatus`).
    /// Optional: returns 0 if not implemented.
    pub status: Option<unsafe extern "C" fn(wdd: *mut WatchdogDev) -> WatchdogStatus>,
}

```

### 12.7.2 WatchdogDev — The Watchdog Device Descriptor

Each registered watchdog has one `WatchdogDev`. The watchdog core allocates and owns this struct; drivers receive a raw pointer to it in every vtable call.

```rust
// umka-core/src/watchdog/dev.rs

use core::ffi::c_void;
use crate::notify::NotifierBlock;

/// A hardware (or software) watchdog device.
/// Owned by the watchdog core after `watchdog_register_device()` succeeds.
pub struct WatchdogDev {
    /// Vtable of driver-implemented operations.
    pub ops: &'static WatchdogOps,
    /// Static device identity (filled in by the driver before registration).
    pub info: WatchdogInfo,
    /// Current timeout in seconds. Updated by `set_timeout`.
    pub timeout_s: u32,
    /// Minimum timeout supported by the hardware. Validated on `WDIOC_SETTIMEOUT`.
    pub min_timeout_s: u32,
    /// Maximum timeout supported by the hardware.
    pub max_timeout_s: u32,
    /// Pretimeout in seconds: fire the pretimeout notifier this many seconds
    /// before the hardware reset. `0` means pretimeout is disabled.
    pub pretimeout_s: u32,
    /// Operational flags.
    pub flags: WatchdogFlags,
    /// Status at last boot: why the previous session ended.
    /// Checked at driver registration time and cached here.
    pub bootstatus: WatchdogStatus,
    /// Notifier block used to cleanly stop the watchdog on orderly reboot.
    /// Registered with the kernel reboot notifier chain at device registration.
    pub reboot_nb: NotifierBlock,
    /// Device index (0 = /dev/watchdog0). Assigned by the core.
    pub index: u32,
    /// Driver-private data pointer. Opaque to the watchdog core.
    pub priv_data: *mut c_void,
    /// Exclusive-open mutex: only one userspace process may open at a time.
    pub open_mutex: Mutex<()>,
    /// `MagicClose` state: whether a `'V'` character has been written.
    /// Used to distinguish orderly close from userspace crash.
    pub magic_close_armed: bool,
}

/// Static identity information filled in by the driver.
#[repr(C)]
pub struct WatchdogInfo {
    /// WDIOF_* option flags — bitmask of `WatchdogStatus` capability bits.
    pub options: u32,
    /// Driver/firmware version (driver-defined, often the hardware revision).
    pub firmware_version: u32,
    /// Human-readable identity string (e.g., "Intel TCO Watchdog").
    pub identity: [u8; 32],
}
```

Flag types:

```rust
// umka-core/src/watchdog/flags.rs

bitflags! {
    /// Operational state flags for WatchdogDev.
    pub struct WatchdogFlags: u32 {
        /// The watchdog hardware is currently running (counting down).
        const ACTIVE           = 1 << 0;
        /// The watchdog was kept alive at least once (first ping received).
        const ALIVE            = 1 << 1;
        /// Userspace wrote 'V' — safe to stop on close.
        const MAGIC_CLOSE      = 1 << 2;
        /// The watchdog was started when /dev/watchdog was opened.
        const RUNNING          = 1 << 3;
        /// nowayout: watchdog cannot be stopped once started.
        /// Set by `umka.watchdog.nowayout=1` boot parameter.
        const NOWAYOUT         = 1 << 4;
        /// Pretimeout interrupt was delivered (informational, cleared on keepalive).
        const PRETIMEOUT_FIRED = 1 << 5;
        /// Handshake timeout mode: driver requires 2-phase keepalive (rare).
        const HANDSHAKE        = 1 << 6;
    }
}

bitflags! {
    /// Hardware status bits: device capabilities (in WatchdogInfo::options)
    /// and runtime status (returned by WatchdogOps::status and boot status).
    pub struct WatchdogStatus: u32 {
        /// Temperature exceeded threshold.
        const OVERHEAT        = 0x0001;
        /// Fan fault detected.
        const FANFAULT        = 0x0002;
        /// External fault input 1 asserted.
        const EXTERN1         = 0x0004;
        /// External fault input 2 asserted.
        const EXTERN2         = 0x0008;
        /// Power supply undervoltage detected.
        const POWERUNDER      = 0x0010;
        /// Last reset was caused by the watchdog (this boot).
        const CARDRESET       = 0x0020;
        /// Power supply overvoltage detected.
        const POWEROVER       = 0x0040;
        /// A keepalive ping was successfully received.
        const KEEPALIVEPING   = 0x0080;
        /// Timeout is settable at runtime.
        const SETTIMEOUT      = 0x0100;
        /// Magic-close ('V') is required to stop the watchdog on close.
        const MAGICCLOSE      = 0x0200;
        /// Pretimeout notification before reset is supported.
        const PRETIMEOUT      = 0x0400;
        /// Alarm notification (driver-specific) supported.
        const ALARM           = 0x0800;
    }
}
```

### 12.7.3 Character Device Interface — /dev/watchdog

The watchdog core registers a cdev at `/dev/watchdogN` for each registered watchdog (N = device index). `/dev/watchdog` is a symlink to `/dev/watchdog0`. The cdev provides the standard Linux WDOG interface; systemd, daemon supervisors, and any POSIX-compliant watchdog client work without modification.

**open()**

```rust
// umka-core/src/watchdog/cdev.rs

fn watchdog_open(dev: &Arc<WatchdogDev>) -> Result<WatchdogFile, KernelError> {
    // Exclusive open: only one process may have the device open at a time.
    // The guard is moved into the returned WatchdogFile so the lock is
    // held until release(), not just until this function returns.
    let open_guard = dev.open_mutex.try_lock()
        .map_err(|_| KernelError::EBUSY)?;

    // Start the watchdog if it is not already running.
    if !dev.flags.contains(WatchdogFlags::RUNNING) {
        // SAFETY: ops pointer is valid for the lifetime of WatchdogDev.
        // start() is called with no spinlocks held, IRQs enabled.
        let result = unsafe { (dev.ops.start)(dev.as_ptr()) };
        result.into_result()?;
        dev.flags.insert(WatchdogFlags::ACTIVE | WatchdogFlags::RUNNING);
    }

    // Clear magic-close state left over from any previous open.
    dev.flags.remove(WatchdogFlags::MAGIC_CLOSE);

    Ok(WatchdogFile { dev: dev.clone(), _open_guard: open_guard })
}
```

**write(buf, len)**

A write of any bytes pets the watchdog. If the buffer contains the ASCII character `'V'` (0x56), the `MAGIC_CLOSE` flag is set, signalling that the next `close()` is an orderly shutdown (safe to stop the watchdog). This matches Linux's magic-close convention: the watchdog client writes `"V"` immediately before closing to signal intentional teardown, as opposed to crashing.

```rust
// umka-core/src/watchdog/cdev.rs

fn watchdog_write(
    file: &mut WatchdogFile,
    buf: &[u8],
) -> Result<usize, KernelError> {
    let dev = &file.dev;

    // Check for magic-close character.
    if buf.iter().any(|&b| b == b'V') {
        dev.flags.insert(WatchdogFlags::MAGIC_CLOSE);
    }

    // Pet the watchdog.
    // SAFETY: keepalive() is designed for hot-path use; it is non-sleeping
    // and safe to call from any context where IRQs are enabled.
    let result = unsafe { (dev.ops.keepalive)(dev.as_ptr()) };
    result.into_result()?;

    dev.flags.insert(WatchdogFlags::ALIVE | WatchdogFlags::KEEPALIVEPING);
    dev.flags.remove(WatchdogFlags::PRETIMEOUT_FIRED);

    Ok(buf.len())
}
```

**ioctl handlers:**

```rust
// umka-core/src/watchdog/cdev.rs

fn watchdog_ioctl(
    file: &mut WatchdogFile,
    cmd: u32,
    arg: usize,
) -> Result<i32, KernelError> {
    let dev = &file.dev;

    match cmd {
        WDIOC_GETSUPPORT => {
            // Copy WatchdogInfo to userspace.
            copy_to_user(arg as *mut WatchdogInfo, &dev.info)?;
            Ok(0)
        }
        WDIOC_GETSTATUS => {
            let status = if let Some(status_fn) = dev.ops.status {
                // SAFETY: status() is read-only, non-sleeping.
                unsafe { status_fn(dev.as_ptr()) }
            } else {
                WatchdogStatus::empty()
            };
            copy_to_user(arg as *mut u32, &status.bits())?;
            Ok(0)
        }
        WDIOC_GETBOOTSTATUS => {
            copy_to_user(arg as *mut u32, &dev.bootstatus.bits())?;
            Ok(0)
        }
        WDIOC_SETTIMEOUT => {
            let mut timeout_s: u32 = 0;
            copy_from_user(&mut timeout_s, arg as *const u32)?;

            if timeout_s < dev.min_timeout_s || timeout_s > dev.max_timeout_s {
                return Err(KernelError::EINVAL);
            }

            let actual = match dev.ops.set_timeout {
                Some(set_fn) => {
                    // SAFETY: set_timeout() updates hardware registers; non-sleeping.
                    unsafe { set_fn(dev.as_ptr(), timeout_s) }
                }
                None => return Err(KernelError::EOPNOTSUPP),
            };
            dev.timeout_s = actual;
            copy_to_user(arg as *mut u32, &actual)?;
            Ok(0)
        }
        WDIOC_GETTIMEOUT => {
            copy_to_user(arg as *mut u32, &dev.timeout_s)?;
            Ok(0)
        }
        WDIOC_GETTIMELEFT => {
            let left = match dev.ops.get_timeleft {
                Some(f) => {
                    // SAFETY: get_timeleft() reads a hardware counter; non-sleeping.
                    unsafe { f(dev.as_ptr()) }
                }
                None => 0,
            };
            copy_to_user(arg as *mut u32, &left)?;
            Ok(0)
        }
        WDIOC_KEEPALIVE => {
            // Explicit keepalive ioctl — same effect as writing to the device.
            // SAFETY: keepalive() is non-sleeping.
            let result = unsafe { (dev.ops.keepalive)(dev.as_ptr()) };
            result.into_result()?;
            dev.flags.insert(WatchdogFlags::ALIVE);
            Ok(0)
        }
        _ => Err(KernelError::ENOTTY),
    }
}
```

**close()**

```rust
// umka-core/src/watchdog/cdev.rs

fn watchdog_release(file: WatchdogFile) {
    let dev = &file.dev;

    if dev.flags.contains(WatchdogFlags::NOWAYOUT) {
        // nowayout: never stop. Pet once to extend lifetime (avoid
        // accidental expiry during close processing).
        // SAFETY: keepalive() is non-sleeping.
        let _ = unsafe { (dev.ops.keepalive)(dev.as_ptr()) };
        log::warn!(
            "watchdog{}: nowayout set — watchdog stays active",
            dev.index
        );
        return;
    }

    if dev.flags.contains(WatchdogFlags::MAGIC_CLOSE) {
        // Orderly close: stop the watchdog if the driver supports it.
        if let Some(stop_fn) = dev.ops.stop {
            // SAFETY: stop() is non-sleeping; called with no spinlocks held.
            let result = unsafe { stop_fn(dev.as_ptr()) };
            if result.into_result().is_ok() {
                dev.flags.remove(WatchdogFlags::ACTIVE | WatchdogFlags::RUNNING);
                log::info!("watchdog{}: stopped (magic close)", dev.index);
                return;
            }
        }
    }

    // No magic close, or stop() failed or unavailable.
    // Log a warning. The watchdog continues counting and will reset
    // the machine unless another process opens and pets it in time.
    log::warn!(
        "watchdog{}: closed without magic character ('V') — watchdog NOT stopped",
        dev.index
    );
}
```
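From userspace, the complete open → pet → magic-close lifecycle is a few lines. A minimal client sketch — the device path is taken as a parameter rather than hard-coded to `/dev/watchdog`, and `watchdog_session` is a hypothetical helper, not part of any shipped tool:

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Minimal watchdog client: pet the device `pets` times, then arm
/// magic-close with 'V' so close() stops the countdown (unless the
/// kernel was booted with nowayout).
fn watchdog_session(dev_path: &Path, pets: u32) -> std::io::Result<()> {
    let mut wd = OpenOptions::new().write(true).open(dev_path)?;
    for _ in 0..pets {
        wd.write_all(b"1")?; // any byte pets the watchdog
        // a real client sleeps ~timeout/2 between pets here
    }
    wd.write_all(b"V")?;     // magic-close: orderly teardown
    Ok(())                   // dropping `wd` closes the fd
}
```

If the process is killed before the `'V'` write, the watchdog keeps counting and the machine resets — which is exactly the crash-detection behavior the framework exists to provide.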

### 12.7.4 Nowayout Boot Option

`umka.watchdog.nowayout=1` is a kernel boot parameter that permanently enables `WatchdogFlags::NOWAYOUT` for all watchdog devices at registration time. Once set, no `stop()` call is ever issued, regardless of magic-close. On hardware that supports it (e.g., the Intel TCO watchdog with the NO_REBOOT bit cleared), the nowayout state is also committed to hardware so that even a compromised kernel cannot disable it.

Nowayout is the correct default for production: a malicious or buggy userspace process that manages to open `/dev/watchdog` and write `'V'` should not be able to disable the last-resort reset mechanism. `nowayout=0` is provided for development environments where a watchdog-triggered reboot during testing is disruptive.

### 12.7.5 Pretimeout Notifier

If `pretimeout_s > 0`, the hardware (or the watchdog core, if the hardware does not support pretimeout interrupts natively) fires the pretimeout notifier chain `WATCHDOG_PRETIMEOUT_GOVERNOR` at `pretimeout_s` seconds before the expiry deadline. This gives the system a final window to collect a crash dump or trigger a controlled panic before the hard reset occurs.

```rust
// umka-core/src/watchdog/pretimeout.rs

/// Pretimeout event delivered to the governor.
pub struct WatchdogPretimeoutEvent {
    pub dev: Arc<WatchdogDev>,
    /// Seconds remaining until hard reset at the moment of this event.
    pub timeleft_s: u32,
}

/// A pretimeout governor: decides what to do when pretimeout fires.
pub trait PretimeoutGovernor: Send + Sync {
    fn name(&self) -> &'static str;
    fn pretimeout(&self, event: &WatchdogPretimeoutEvent);
}

/// Built-in governor: log and do nothing. Default.
pub struct NoopGovernor;
impl PretimeoutGovernor for NoopGovernor {
    fn name(&self) -> &'static str { "noop" }
    fn pretimeout(&self, event: &WatchdogPretimeoutEvent) {
        log::warn!(
            "watchdog{}: pretimeout — hard reset in {}s",
            event.dev.index, event.timeleft_s
        );
    }
}

/// Built-in governor: trigger kernel panic to generate a crash dump.
/// The panic handler writes a minidump via the crash dump subsystem
/// before the hard reset occurs.
pub struct PanicGovernor;
impl PretimeoutGovernor for PanicGovernor {
    fn name(&self) -> &'static str { "panic" }
    fn pretimeout(&self, event: &WatchdogPretimeoutEvent) {
        panic!(
            "watchdog{}: pretimeout governor triggered panic for crash dump \
             ({} seconds before hard reset)",
            event.dev.index, event.timeleft_s
        );
    }
}
```

The active governor is selectable per-device via sysfs:

```
/sys/bus/watchdog/devices/watchdog0/pretimeout_governor
```

Writing `"noop"` or `"panic"` to this file changes the active governor atomically. The list of available governors is read from `/sys/bus/watchdog/devices/watchdog0/pretimeout_available_governors`.
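Internally, a write to `pretimeout_governor` is a name lookup over the registered governors followed by a swap of the active handle. A sketch of that selection logic (the `GovernorRegistry` type and its methods are illustrative stand-ins, not the kernel's sysfs plumbing):

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

trait PretimeoutGovernor: Send + Sync {
    fn name(&self) -> &'static str;
}

struct NoopGovernor;
impl PretimeoutGovernor for NoopGovernor {
    fn name(&self) -> &'static str { "noop" }
}

struct PanicGovernor;
impl PretimeoutGovernor for PanicGovernor {
    fn name(&self) -> &'static str { "panic" }
}

/// Per-device governor registry: the sysfs store handler becomes a
/// lookup plus a swap of the active Arc.
struct GovernorRegistry {
    governors: BTreeMap<&'static str, Arc<dyn PretimeoutGovernor>>,
    active: Arc<dyn PretimeoutGovernor>,
}

impl GovernorRegistry {
    /// Handle a write to `pretimeout_governor`. An unknown name
    /// corresponds to the sysfs write failing with EINVAL.
    fn select(&mut self, name: &str) -> Result<(), ()> {
        match self.governors.get(name) {
            Some(gov) => { self.active = gov.clone(); Ok(()) }
            None => Err(()),
        }
    }

    /// Contents of `pretimeout_available_governors`.
    fn available(&self) -> String {
        self.governors.keys().copied().collect::<Vec<_>>().join(" ")
    }
}
```

Because dispatch goes through the single `active` handle, swapping governors never requires quiescing the pretimeout path.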

### 12.7.6 Software Watchdog (softdog)

When the system has no hardware watchdog (embedded platforms, VMs without virtio-wdt), `softdog` provides a kernel-timer-based fallback. Its `WatchdogOps` implementation:

```rust
// umka-core/src/watchdog/softdog.rs

static SOFTDOG_TIMER: OnceLock<KernelTimer> = OnceLock::new();
static SOFTDOG_DEV: OnceLock<WatchdogDev> = OnceLock::new();

const SOFTDOG_DEFAULT_TIMEOUT_S: u32 = 60;
const SOFTDOG_MIN_TIMEOUT_S: u32     = 1;
const SOFTDOG_MAX_TIMEOUT_S: u32     = 65535;

unsafe extern "C" fn softdog_start(wdd: *mut WatchdogDev) -> KabiResult {
    let wdd = unsafe { &*wdd };
    let timeout_jiffies = secs_to_jiffies(wdd.timeout_s);
    // SAFETY: timer is initialized before registration; mod_timer is safe
    // with a valid timer pointer and non-zero expiry.
    unsafe {
        mod_timer(
            SOFTDOG_TIMER.get().unwrap(),
            jiffies_add(jiffies(), timeout_jiffies),
        );
    }
    KabiResult::Ok
}

unsafe extern "C" fn softdog_stop(_wdd: *mut WatchdogDev) -> KabiResult {
    // SAFETY: del_timer_sync blocks until any running timer callback completes.
    unsafe { del_timer_sync(SOFTDOG_TIMER.get().unwrap()) };
    KabiResult::Ok
}

unsafe extern "C" fn softdog_keepalive(wdd: *mut WatchdogDev) -> KabiResult {
    let wdd = unsafe { &*wdd };
    let timeout_jiffies = secs_to_jiffies(wdd.timeout_s);
    // SAFETY: same as softdog_start; mod_timer is idempotent if already pending.
    unsafe {
        mod_timer(
            SOFTDOG_TIMER.get().unwrap(),
            jiffies_add(jiffies(), timeout_jiffies),
        );
    }
    KabiResult::Ok
}

unsafe extern "C" fn softdog_set_timeout(_wdd: *mut WatchdogDev, timeout_s: u32) -> u32 {
    // The software timer has whole-second granularity, so every requested
    // value in range is representable: return it unchanged.
    timeout_s
}

/// Called when the kernel timer fires (keepalive not received in time).
fn softdog_fire(_timer: &KernelTimer) {
    let wdd = SOFTDOG_DEV.get().expect("softdog timer fired before device init");

    if wdd.flags.contains(WatchdogFlags::NOWAYOUT)
        || wdd.flags.contains(WatchdogFlags::ACTIVE)
    {
        log::crit!("softdog: watchdog timer expired — initiating emergency reboot");
        // kernel_restart() triggers the reboot path (orderly if possible).
        // SAFETY: we are in a safe call context; no locks are held.
        unsafe { kernel_restart(core::ptr::null()) };
    }
}

static SOFTDOG_OPS: WatchdogOps = WatchdogOps {
    vtable_size: core::mem::size_of::<WatchdogOps>() as u64,
    version: 1,
    _padding: 0,
    start:       softdog_start,
    stop:        Some(softdog_stop),
    keepalive:   softdog_keepalive,
    set_timeout: Some(softdog_set_timeout),
    get_timeleft: None,
    status:      None,
};
```

`softdog` is registered during kernel init via `watchdog_register_device()` if and only if no hardware watchdog driver has claimed device index 0. It is not compiled out — having a software fallback is always safer than no watchdog at all.

### 12.7.7 systemd Integration

systemd opens `/dev/watchdog` at startup and uses it as its primary liveness signal:

- The `WATCHDOG_USEC=N` environment variable (set by the service manager) informs systemd of the current watchdog timeout in microseconds. systemd derives its keepalive interval as `WATCHDOG_USEC / 2` — it pets the watchdog twice per timeout period.
- `sd_notify(0, "WATCHDOG=1")` is the keepalive call; it maps to a `write()` of `"1"` to `/dev/watchdog`. The `'1'` character does not trigger magic-close (only `'V'` does).
- On clean shutdown, systemd writes `'V'` to `/dev/watchdog` before closing, enabling the watchdog to be stopped (unless nowayout is set).
- `WDIOC_GETTIMEOUT` is called at startup so systemd can populate `WATCHDOG_USEC` in the environment for spawned services.

No UmkaOS-specific changes to systemd are needed — the standard Linux WDOG interface is fully compatible.
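The pet-twice-per-period rule in the first bullet pins down the client's timing: parse `WATCHDOG_USEC`, halve it, pet on that interval. A sketch (the `pet_interval` helper is illustrative, not a systemd API):

```rust
use std::time::Duration;

/// Derive the pet interval from WATCHDOG_USEC as a systemd-managed
/// service sees it: half the timeout, so two pets per period.
/// Returns None if the variable is unset, zero, or malformed
/// (i.e., watchdog supervision is disabled).
fn pet_interval(watchdog_usec: Option<&str>) -> Option<Duration> {
    let usec: u64 = watchdog_usec?.parse().ok()?;
    if usec == 0 {
        return None;
    }
    Some(Duration::from_micros(usec / 2))
}
```

For a 60 s hardware timeout (`WATCHDOG_USEC=60000000`) this yields a 30 s pet interval, leaving a full half-period of slack before the hardware resets.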

### 12.7.8 Device Registration

Drivers call `watchdog_register_device()` to install a watchdog. The function validates the vtable, assigns a device index, creates the cdev and sysfs entries, and queries the hardware for the boot status (cause of last reset).

```rust
// umka-core/src/watchdog/register.rs

/// Maximum number of concurrently registered watchdog devices.
const WATCHDOG_MAX_DEVICES: u32 = 32;

/// Register a watchdog device. `wdd` must be fully initialized before calling.
/// On success, wdd.index is set and /dev/watchdogN + sysfs entries are created.
pub fn watchdog_register_device(wdd: &mut WatchdogDev) -> Result<(), KernelError> {
    // Validate vtable.
    if wdd.ops.vtable_size < core::mem::size_of::<WatchdogOps>() as u64 {
        return Err(KernelError::EINVAL);
    }

    // Validate timeout bounds.
    if wdd.min_timeout_s == 0
        || wdd.max_timeout_s < wdd.min_timeout_s
        || wdd.timeout_s < wdd.min_timeout_s
        || wdd.timeout_s > wdd.max_timeout_s
    {
        return Err(KernelError::EINVAL);
    }

    // Assign device index. The registry enforces the WATCHDOG_MAX_DEVICES
    // cap and returns None when the table is full, so no index can leak
    // on a failed registration.
    let index = WATCHDOG_REGISTRY.lock().allocate_index()
        .ok_or(KernelError::ENOSPC)?;
    debug_assert!(index < WATCHDOG_MAX_DEVICES);
    wdd.index = index;

    // Query boot status (why did the system last reset?).
    wdd.bootstatus = if let Some(status_fn) = wdd.ops.status {
        // SAFETY: device is not yet started; status() is read-only.
        unsafe { status_fn(wdd as *mut WatchdogDev) }
    } else {
        WatchdogStatus::empty()
    };

    // Apply nowayout boot parameter.
    if WATCHDOG_NOWAYOUT.load(Ordering::Relaxed) {
        wdd.flags.insert(WatchdogFlags::NOWAYOUT);
    }

    // Create character device at /dev/watchdogN.
    cdev_register(wdd)?;

    // Register sysfs entries under /sys/bus/watchdog/devices/watchdogN/.
    sysfs_watchdog_register(wdd)?;

    // Register reboot notifier: stop watchdog on orderly reboot
    // (unless nowayout is set).
    register_reboot_notifier(&mut wdd.reboot_nb, watchdog_reboot_handler)?;

    // If this is watchdog0, create /dev/watchdog symlink.
    if index == 0 {
        devfs_symlink("watchdog", "watchdog0")?;
    }

    log::info!(
        "watchdog{}: registered '{}' (timeout: {}s, min: {}s, max: {}s, nowayout: {})",
        wdd.index,
        core::str::from_utf8(&wdd.info.identity).unwrap_or("?").trim_end_matches('\0'),
        wdd.timeout_s,
        wdd.min_timeout_s,
        wdd.max_timeout_s,
        wdd.flags.contains(WatchdogFlags::NOWAYOUT),
    );

    Ok(())
}

/// Reboot notifier callback: stop the watchdog on orderly system shutdown.
/// Not called if nowayout is set.
fn watchdog_reboot_handler(wdd: &mut WatchdogDev) {
    if wdd.flags.contains(WatchdogFlags::NOWAYOUT)
        || !wdd.flags.contains(WatchdogFlags::ACTIVE)
    {
        return;
    }
    if let Some(stop_fn) = wdd.ops.stop {
        // SAFETY: reboot notifier runs with IRQs enabled, no spinlocks held.
        let _ = unsafe { stop_fn(wdd as *mut WatchdogDev) };
        log::info!("watchdog{}: stopped for orderly reboot", wdd.index);
    }
}
```

sysfs entries under `/sys/bus/watchdog/devices/watchdogN/`:

| File | Access | Description |
|------|--------|-------------|
| `identity` | ro | `WatchdogInfo::identity` string |
| `timeout` | rw | Current timeout in seconds |
| `min_timeout` | ro | Hardware minimum |
| `max_timeout` | ro | Hardware maximum |
| `pretimeout` | rw | Pretimeout in seconds (0 = disabled) |
| `pretimeout_governor` | rw | Active governor name |
| `pretimeout_available_governors` | ro | Space-separated list of available governors |
| `timeleft` | ro | Remaining time (calls `get_timeleft`, or 0) |
| `bootstatus` | ro | `WatchdogStatus` bits from last boot |
| `nowayout` | ro | 1 if nowayout is in effect |
| `status` | ro | Current `WatchdogStatus` bits |

## 12.8 SPI Bus Framework

SPI (Serial Peripheral Interface) is a synchronous full-duplex serial bus connecting a master controller to one or more peripheral devices (ADCs, DACs, flash memory, display controllers, RF transceivers, SD cards in SPI mode, and sensor modules). Unlike I2C, SPI transfers are full-duplex: MOSI (Master Out Slave In) and MISO (Master In Slave Out) operate simultaneously on every clock edge. The UmkaOS SPI framework (`umka-core/src/bus/spi.rs`) provides a KABI trait for platform SPI controller drivers, a higher-level `SpiDevice` handle for peripheral drivers, and a `spidev` character device for userspace access.

### 12.8.1 SpiController KABI Trait

Platform SPI controller drivers implement `SpiController`. The trait is in `umka-core/src/bus/spi.rs`.

/// SPI bus mode (polarity + phase combination).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum SpiMode {
    /// CPOL=0, CPHA=0: clock idle low, data captured on rising edge.
    Mode0 = 0,
    /// CPOL=0, CPHA=1: clock idle low, data captured on falling edge.
    Mode1 = 1,
    /// CPOL=1, CPHA=0: clock idle high, data captured on falling edge.
    Mode2 = 2,
    /// CPOL=1, CPHA=1: clock idle high, data captured on rising edge.
    Mode3 = 3,
}

/// A single SPI transfer (one segment of a complete SPI message).
pub struct SpiTransfer<'a> {
    /// Data to transmit (MOSI). None → send zeros.
    pub tx_buf:           Option<&'a [u8]>,
    /// Buffer to receive data into (MISO). None → discard received data.
    pub rx_buf:           Option<&'a mut [u8]>,
    /// Transfer length in bytes (max of tx_buf.len() and rx_buf.len()).
    pub len:              usize,
    /// Override clock speed for this transfer (Hz). 0 = use device default.
    pub speed_hz:         u32,
    /// Bits per word for this transfer (4–32). 0 = use device default (typically 8).
    pub bits_per_word:    u8,
    /// Delay in microseconds after this transfer before CS is deasserted or
    /// the next transfer starts.
    pub delay_us:         u16,
    /// If true, deassert CS between this transfer and the next.
    /// If false, hold CS asserted for the following transfer (typical for
    /// multi-segment reads).
    pub cs_change:        bool,
    /// Word delay in nanoseconds between each word within this transfer
    /// (for slow devices).
    pub word_delay_ns:    u8,
}

/// A complete SPI message: one or more transfers sharing a single CS assertion.
pub struct SpiMessage<'a> {
    /// Ordered list of transfers.
    pub transfers: &'a mut [SpiTransfer<'a>],
    /// Completion status (Ok or error code). Set by the controller after transfer.
    pub status:    Option<Result<(), KernelError>>,
}

/// SPI controller KABI trait. Implemented by platform SPI controller drivers.
pub trait SpiController: Send + Sync {
    /// Execute a complete SPI message synchronously.
    ///
    /// The controller asserts CS for the peripheral at `cs_index` for the entire
    /// duration of the message, deasserts between transfers only if
    /// `SpiTransfer::cs_change` is set, and deasserts after the last transfer.
    fn transfer_one_message(
        &self,
        cs_index: u8,
        mode:     SpiMode,
        speed_hz: u32,
        msg:      &mut SpiMessage<'_>,
    ) -> Result<(), KernelError>;

    /// Maximum supported clock speed in Hz (hardware limit).
    fn max_speed_hz(&self) -> u32;

    /// Number of native chip selects (additional CS via GPIO is handled above
    /// this layer).
    fn num_chipselect(&self) -> u8;

    /// Bitmask of supported bits-per-word values. Bit N is set if
    /// `bits_per_word = N+1` is supported. Bit 7 set means 8-bit words are
    /// supported.
    fn bits_per_word_mask(&self) -> u32;
}
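A peripheral driver selecting a word length can test the mask like this (illustrative free functions, not part of the trait):

```rust
/// True if the controller mask advertises support for `bpw`-bit words.
/// Bit N set means word length N+1 is supported, so 8-bit support is bit 7.
pub fn supports_bits_per_word(mask: u32, bpw: u8) -> bool {
    (1..=32u8).contains(&bpw) && mask & (1u32 << (bpw - 1)) != 0
}

/// Mask for a hypothetical controller supporting every length in a range,
/// e.g. 4..=16 bits.
pub fn bpw_range_mask(min: u8, max: u8) -> u32 {
    (min..=max).fold(0u32, |m, b| m | (1u32 << (b - 1)))
}
```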

/// Handle to an SPI peripheral at a specific CS on a specific controller.
pub struct SpiDevice {
    /// Underlying controller.
    pub controller:    Arc<dyn SpiController>,
    /// Chip select index on the controller.
    pub cs_index:      u8,
    /// SPI mode (polarity + phase).
    pub mode:          SpiMode,
    /// Maximum clock speed for this device in Hz.
    pub max_speed_hz:  u32,
    /// Bits per word (usually 8).
    pub bits_per_word: u8,
}

impl SpiDevice {
    /// Full-duplex transfer: send `tx`, receive into `rx` simultaneously.
    pub fn transfer(&self, tx: &[u8], rx: &mut [u8]) -> Result<(), KernelError> {
        assert_eq!(tx.len(), rx.len());
        let mut transfer = SpiTransfer {
            tx_buf:        Some(tx),
            rx_buf:        Some(rx),
            len:           tx.len(),
            speed_hz:      self.max_speed_hz,
            bits_per_word: self.bits_per_word,
            delay_us:      0,
            cs_change:     false,
            word_delay_ns: 0,
        };
        let mut msg = SpiMessage {
            transfers: core::slice::from_mut(&mut transfer),
            status:    None,
        };
        self.controller.transfer_one_message(
            self.cs_index, self.mode, self.max_speed_hz, &mut msg,
        )
    }

    /// Write only (discard MISO).
    pub fn write(&self, data: &[u8]) -> Result<(), KernelError> {
        let mut transfer = SpiTransfer {
            tx_buf:        Some(data),
            rx_buf:        None,
            len:           data.len(),
            speed_hz:      self.max_speed_hz,
            bits_per_word: self.bits_per_word,
            delay_us:      0,
            cs_change:     false,
            word_delay_ns: 0,
        };
        let mut msg = SpiMessage {
            transfers: core::slice::from_mut(&mut transfer),
            status:    None,
        };
        self.controller.transfer_one_message(
            self.cs_index, self.mode, self.max_speed_hz, &mut msg,
        )
    }

    /// Write a register address, then read response (2-segment, CS held asserted
    /// between segments).
    pub fn write_then_read(&self, cmd: &[u8], rx: &mut [u8]) -> Result<(), KernelError> {
        let t0 = SpiTransfer {
            tx_buf:        Some(cmd),
            rx_buf:        None,
            len:           cmd.len(),
            speed_hz:      self.max_speed_hz,
            bits_per_word: self.bits_per_word,
            delay_us:      0,
            cs_change:     false,
            word_delay_ns: 0,
        };
        let t1 = SpiTransfer {
            tx_buf:        None,
            rx_buf:        Some(rx),
            len:           rx.len(),
            speed_hz:      self.max_speed_hz,
            bits_per_word: self.bits_per_word,
            delay_us:      0,
            cs_change:     false,
            word_delay_ns: 0,
        };
        let mut msg = SpiMessage { transfers: &mut [t0, t1], status: None };
        self.controller.transfer_one_message(
            self.cs_index, self.mode, self.max_speed_hz, &mut msg,
        )
    }
}

Tier classification: SPI controller drivers are Tier 1. SPI peripheral drivers (sensors, transceivers, display controllers) follow their own tier classification based on function. spidev (Section 12.8.2) is Tier 2.

Device enumeration: SPI devices are enumerated from ACPI (SPISerialBus resource) or device-tree (a child node of the SPI controller node, whose reg property gives the CS index). The bus manager matches each ACPI/DT node to a registered SPI peripheral driver by compatible string or ACPI HID.

CS GPIO: Many boards use GPIO pins as additional chip selects beyond what the hardware SPI controller provides natively. GPIO CS abstraction is handled in the bus manager layer: SpiController::transfer_one_message receives the already-resolved hardware CS index; the bus manager handles GPIO assertion and deassertion for GPIO-based CS lines before and after each controller call.

12.8.2 spidev — Userspace SPI Access

spidev exposes SPI devices to userspace via /dev/spidev<bus>.<cs> (e.g., /dev/spidev0.0). This allows userspace drivers and test tools to communicate with SPI peripherals without a kernel driver, using the same ioctl interface as Linux.

/// SPI transfer descriptor for the SPI_IOC_MESSAGE ioctl.
/// Layout matches Linux `struct spi_ioc_transfer` for ABI compatibility.
#[repr(C)]
pub struct SpiIocTransfer {
    /// Userspace pointer to TX data buffer (0 for RX-only transfers).
    pub tx_buf:           u64,
    /// Userspace pointer to RX data buffer (0 for TX-only transfers).
    pub rx_buf:           u64,
    /// Transfer length in bytes.
    pub len:              u32,
    /// Transfer clock speed override in Hz (0 = device default).
    pub speed_hz:         u32,
    /// Inter-transfer delay in microseconds.
    pub delay_usecs:      u16,
    /// Bits per word override (0 = device default).
    pub bits_per_word:    u8,
    /// If non-zero, deassert CS after this transfer before the next.
    pub cs_change:        u8,
    /// Dual/quad SPI TX mode (0 = standard).
    pub tx_nbits:         u8,
    /// Dual/quad SPI RX mode (0 = standard).
    pub rx_nbits:         u8,
    /// Inter-word delay in microseconds.
    pub word_delay_usecs: u8,
    /// Reserved; must be zero.
    pub _pad:             u8,
}

ioctls on /dev/spidevN.M:

| ioctl | Direction | Description |
|---|---|---|
| SPI_IOC_RD_MODE | Read | Get SPI mode byte (SPI_MODE_0…3 + flags) |
| SPI_IOC_WR_MODE | Write | Set SPI mode byte |
| SPI_IOC_RD_MODE32 | Read | Get mode with extended flags (32-bit) |
| SPI_IOC_WR_MODE32 | Write | Set mode with extended flags |
| SPI_IOC_RD_LSB_FIRST | Read | Get bit order (0 = MSB first) |
| SPI_IOC_WR_LSB_FIRST | Write | Set bit order |
| SPI_IOC_RD_BITS_PER_WORD | Read | Get bits per word |
| SPI_IOC_WR_BITS_PER_WORD | Write | Set bits per word |
| SPI_IOC_RD_MAX_SPEED_HZ | Read | Get maximum speed in Hz |
| SPI_IOC_WR_MAX_SPEED_HZ | Write | Set maximum speed in Hz |
| SPI_IOC_MESSAGE(n) | Write | Transfer n SpiIocTransfer structs in one CS assertion |

Linux compatibility: identical ioctl codes and SpiIocTransfer struct layout to Linux spidev. Userspace programs using <linux/spi/spidev.h> compile and run without modification.
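SPI_IOC_MESSAGE(n) is the only variable-length code in the table: as in Linux, it encodes n × sizeof(SpiIocTransfer) in the _IOC size field (direction in bits 31:30, size in 29:16, type 'k' in 15:8, nr 0 in 7:0). A sketch of the computation:

```rust
const IOC_WRITE: u32 = 1;                 // _IOC_WRITE direction
const SPI_IOC_MAGIC: u32 = b'k' as u32;   // spidev ioctl type byte
const SPI_IOC_TRANSFER_SIZE: u32 = 32;    // sizeof(SpiIocTransfer)

/// Compute SPI_IOC_MESSAGE(n) per the Linux _IOC encoding.
pub fn spi_ioc_message(n: u32) -> u32 {
    // Sizes that overflow the 14-bit size field encode as 0, as in Linux.
    let bytes = n * SPI_IOC_TRANSFER_SIZE;
    let size = if bytes < (1 << 14) { bytes } else { 0 };
    // nr is 0, so the low byte is just the magic shifted into place.
    (IOC_WRITE << 30) | (size << 16) | (SPI_IOC_MAGIC << 8)
}
```

For example, a single-transfer message uses code 0x40206B00, matching the value produced by the Linux &lt;linux/spi/spidev.h&gt; macro.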


12.9 rfkill — RF Kill Switch Framework

rfkill manages radio transmitters across all wireless technologies (WiFi, Bluetooth, UWB, WWAN, NFC, GPS). A "kill" can be hardware-initiated (a physical slide switch or button) or software-initiated (NetworkManager, airplane mode toggle, userspace rfkill tool). The framework tracks per-device block state, enforces the invariant that hard-blocked radios cannot be unblocked by software, and exposes current state to userspace via /dev/rfkill and sysfs.

12.9.1 Data Structures

/// Radio technology type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
pub enum RfkillType {
    /// Meta-type: affects all radios when used in RFKILL_OP_CHANGE_ALL.
    All       = 0,
    /// IEEE 802.11 WiFi.
    Wlan      = 1,
    /// Bluetooth (BR/EDR and LE).
    Bluetooth = 2,
    /// Ultra-Wideband (deprecated in new hardware).
    Uwb       = 3,
    /// WiMAX.
    Wimax     = 4,
    /// WWAN / cellular modem (LTE, 5G NR).
    Wwan      = 5,
    /// GPS receiver.
    Gps       = 6,
    /// FM radio.
    Fm        = 7,
    /// Near-Field Communication.
    Nfc       = 8,
}

/// Aggregated block state for a single rfkill device.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RfkillState {
    /// Radio may transmit.
    Unblocked,
    /// Blocked by software (rfkill_block_soft). Overridable.
    SoftBlocked,
    /// Blocked by hardware kill switch. Cannot be overridden by software.
    HardBlocked,
}

/// An rfkill device registered by a wireless driver.
pub struct RfkillDevice {
    /// Unique index (auto-assigned at registration, 0-based, never reused).
    pub idx:          u32,
    /// Radio technology.
    pub type_:        RfkillType,
    /// Human-readable name (e.g., "phy0", "hci0", "wwan0"). NUL-terminated.
    pub name:         [u8; 32],
    /// Current software block state. True = soft-blocked.
    pub soft_blocked: AtomicBool,
    /// Driver operations table.
    pub ops:          Arc<dyn RfkillOps>,
    /// Handle for sysfs uevent and netlink event notification.
    pub event_handle: RfkillEventHandle,
}

/// Operations implemented by the wireless driver.
pub trait RfkillOps: Send + Sync {
    /// Apply the soft block state to the hardware.
    ///
    /// `blocked = true` → shut down transmitter; `blocked = false` → enable
    /// transmitter. The driver must not transmit while blocked and may power-gate
    /// the radio hardware.
    fn set_block(&self, blocked: bool);

    /// Query the hardware kill switch state.
    ///
    /// Returns true if the hardware kill switch is asserted (hard-blocked).
    /// Called periodically and on GPIO interrupt to refresh hard-block state.
    /// Default: no hardware kill switch present.
    fn query_hardware(&self) -> bool {
        false
    }
}
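The aggregate RfkillState reported to userspace follows directly from the two block sources: hard block always wins, so a software unblock while the switch is asserted leaves the radio off. A minimal sketch:

```rust
// Local copy of RfkillState for illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RfkillState { Unblocked, SoftBlocked, HardBlocked }

/// Combine the soft-block flag and the hardware kill switch state.
pub fn aggregate_state(soft_blocked: bool, hard_blocked: bool) -> RfkillState {
    if hard_blocked {
        RfkillState::HardBlocked      // cannot be overridden by software
    } else if soft_blocked {
        RfkillState::SoftBlocked      // clearable via /dev/rfkill or sysfs
    } else {
        RfkillState::Unblocked
    }
}
```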

12.9.2 /dev/rfkill — Userspace Interface

The /dev/rfkill character device (major 10, minor 242) provides a unified interface for monitoring and controlling all registered rfkill devices. It is the interface used by NetworkManager, ConnMan, iwd, and the rfkill(8) userspace tool.

/// rfkill event structure exchanged with userspace via /dev/rfkill.
/// Layout matches Linux `struct rfkill_event` (8 bytes) for ABI compatibility.
#[repr(C)]
pub struct RfkillEvent {
    /// Device index (matches RfkillDevice::idx).
    pub idx:   u32,
    /// Radio type (RfkillType as u8).
    pub type_: u8,
    /// Operation code: one of the RFKILL_OP_* constants below.
    pub op:    u8,
    /// Software block state: 1 = soft-blocked, 0 = not soft-blocked.
    pub soft:  u8,
    /// Hardware block state: 1 = hard-blocked, 0 = not hard-blocked.
    pub hard:  u8,
}

/// New device registered; also replayed once per existing device each time
/// /dev/rfkill is opened (initial enumeration).
pub const RFKILL_OP_ADD:        u8 = 0;
/// Device unregistered (driver unloaded or hardware removed).
pub const RFKILL_OP_DEL:        u8 = 1;
/// Block state changed for a specific device.
pub const RFKILL_OP_CHANGE:     u8 = 2;
/// Block state changed for all devices of the given type.
pub const RFKILL_OP_CHANGE_ALL: u8 = 3;

Read: Returns one RfkillEvent (8 bytes) per call. The first read after open() returns one RFKILL_OP_ADD event per currently registered rfkill device (device enumeration), then subsequent reads block until a state change occurs. Supports O_NONBLOCK + poll()/epoll().

Write: Write an RfkillEvent with op = RFKILL_OP_CHANGE to block (soft=1) or unblock (soft=0) a specific device identified by idx. Write with op = RFKILL_OP_CHANGE_ALL to block or unblock all devices of the given type_.
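The write path can be exercised by serializing the 8-byte event manually (field layout per RfkillEvent above: native-endian u32 idx, then four u8 fields). A sketch that builds the airplane-mode-style "block everything of this type" event:

```rust
/// Serialize an RFKILL_OP_CHANGE_ALL event for write(2) on /dev/rfkill.
/// `radio_type` is an RfkillType value as u8 (0 = All, 1 = Wlan, ...).
pub fn change_all_event(radio_type: u8, block: bool) -> [u8; 8] {
    let mut buf = [0u8; 8];
    buf[..4].copy_from_slice(&0u32.to_ne_bytes()); // idx: ignored for CHANGE_ALL
    buf[4] = radio_type;  // type_
    buf[5] = 3;           // op = RFKILL_OP_CHANGE_ALL
    buf[6] = block as u8; // soft block state to apply
    buf[7] = 0;           // hard: ignored on write
    buf
}
```

Writing `change_all_event(0, true)` soft-blocks every registered radio, which is exactly what an airplane-mode toggle does.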

sysfs: Each rfkill device is exposed under /sys/class/rfkill/rfkill<N>/:

| File | Access | Description |
|---|---|---|
| name | ro | Device name string |
| type | ro | Technology name ("wlan", "bluetooth", "wwan", etc.) |
| state | ro | Aggregate state: "0" = blocked, "1" = unblocked |
| hard | ro | Hardware block: "0" or "1" |
| soft | rw | Software block: "0" or "1" (write to change) |
| uevent | rw | Generates uevent on any state change |

12.9.3 rfkill-input: Hardware Kill Switch

When a hardware kill switch (GPIO or ACPI button device) changes state, it notifies rfkill-input, which calls rfkill_set_hw_state() on all rfkill devices of the associated type (typically RfkillType::All for a physical airplane-mode switch). The WiFi driver and Bluetooth driver each register their own rfkill devices; the single switch event propagates to all of them simultaneously through the framework.

Linux compatibility: same /dev/rfkill ABI; same sysfs layout; same ioctl codes. rfkill(8) from util-linux, NetworkManager, ConnMan, and iwd all work without modification.


12.10 MTD — Memory Technology Device Framework

MTD (Memory Technology Device) provides a uniform kernel interface to flash memory: NOR flash (bit-erasable, supports random read and byte-granular 1→0 bit writes, erases entire sectors to all-ones), NAND flash (page-write with ECC, block erase, sequential access pattern required), and eMMC in raw partition mode. MTD is used for bootloader storage, firmware update partitions, and embedded root filesystems (UBIFS on NAND, JFFS2 on NOR). The MTD layer sits below filesystems and above hardware flash controller drivers.

12.10.1 MtdInfo and MtdDevice

/// MTD device type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
pub enum MtdType {
    /// No device / slot unused.
    Absent    = 0,
    /// RAM (volatile, no erase needed).
    Ram       = 1,
    /// Read-only NOR (firmware ROM).
    Rom       = 2,
    /// NOR flash: bit-alterable (can flip 1→0 without erase), sector-erase.
    NorFlash  = 3,
    /// NAND flash: page write, block erase, ECC required.
    NandFlash = 4,
    /// Atmel DataFlash (SPI, power-of-two pages).
    DataFlash = 6,
    /// UBI logical volume over NAND.
    UbiVolume = 7,
    /// Multi-level cell NAND.
    MlcNand   = 8,
}

bitflags! {
    /// MTD device capability flags.
    pub struct MtdFlags: u32 {
        /// Device supports write operations.
        const WRITEABLE     = 0x400;
        /// Individual bits may be set (NOR: can flip 1→0 without erase).
        const BIT_WRITEABLE = 0x800;
        /// No erase needed (RAM, ROM).
        const NO_ERASE      = 0x1000;
        /// OOB (Out-Of-Band / spare) area accessible via read_oob/write_oob.
        const OOB           = 0x2000;
        /// Hardware ECC engine present and active.
        const ECC           = 0x4000;
        /// Continuous (linearly-addressed) memory space (NOR).
        const MAPPED        = 0x8000;
    }
}

/// Static MTD device descriptor returned by MtdDevice::info().
pub struct MtdInfo {
    /// Device type.
    pub type_:          MtdType,
    /// Capability flags.
    pub flags:          MtdFlags,
    /// Total device size in bytes.
    pub size:           u64,
    /// Minimum erase unit in bytes (erase block size).
    /// NOR: typically 64 KiB or 128 KiB.
    /// NAND: typically 128 KiB or 256 KiB.
    pub erasesize:      u32,
    /// Minimum write unit in bytes.
    /// NAND: page size (512 B, 2 KiB, 4 KiB). NOR: 1 (bit-alterable).
    pub writesize:      u32,
    /// OOB (spare) bytes per page (NAND only; typically 64 or 128).
    pub oobsize:        u32,
    /// OOB bytes per page available for filesystem use (after ECC overhead).
    pub oobavail:       u32,
    /// Device model name (e.g., "mx25l25635f"). NUL-terminated.
    pub name:           [u8; 64],
    /// MTD device index N (the /dev/mtdN character device has minor 2N).
    pub index:          u32,
    /// ECC strength: correctable bits per ecc_step_size bytes.
    pub ecc_strength:   u32,
    /// ECC step size in bytes (typically 512 or 1024).
    pub ecc_step_size:  u32,
}
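Several derived quantities follow from these fields. Illustrative helpers (free functions over the raw values, not part of the KABI):

```rust
/// Number of erase blocks on the device.
pub fn erase_block_count(size: u64, erasesize: u32) -> u64 {
    size / erasesize as u64
}

/// Pages per erase block (NAND; writesize is the page size).
pub fn pages_per_block(erasesize: u32, writesize: u32) -> u32 {
    erasesize / writesize
}

/// An offset is a valid erase target only if it is block-aligned and the
/// whole block fits inside the device.
pub fn is_valid_erase_addr(addr: u64, size: u64, erasesize: u32) -> bool {
    addr % erasesize as u64 == 0 && addr + erasesize as u64 <= size
}
```

For a typical 256 MiB SLC NAND with 128 KiB blocks and 2 KiB pages, this gives 2048 blocks of 64 pages each.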

/// MTD device KABI trait. Implemented by flash controller drivers.
pub trait MtdDevice: Send + Sync {
    /// Return static MTD info for this device.
    fn info(&self) -> &MtdInfo;

    /// Read `buf.len()` bytes from device offset `from` into `buf`.
    ///
    /// Returns `(bytes_read, max_bit_flips)` where `max_bit_flips` is the
    /// maximum number of bit errors corrected in any single ECC step during
    /// the read (0 if no ECC or no errors).
    fn read(&self, from: u64, buf: &mut [u8]) -> Result<(usize, u32), MtdError>;

    /// Write `data` to device offset `to`. Must be aligned to writesize.
    ///
    /// NOR: can only write 0-bits (1→0 only); cannot flip 0→1 without erase.
    /// NAND: must write entire pages; partial-page writes are not supported.
    fn write(&self, to: u64, data: &[u8]) -> Result<usize, MtdError>;

    /// Erase one erase block starting at `addr`. Must be aligned to erasesize.
    fn erase(&self, addr: u64) -> Result<(), MtdError>;

    /// Read page data and OOB area simultaneously (NAND only).
    fn read_oob(
        &self,
        from:     u64,
        data_buf: &mut [u8],
        oob_buf:  &mut [u8],
    ) -> Result<(), MtdError>;

    /// Write page data and OOB area simultaneously (NAND only).
    fn write_oob(
        &self,
        to:   u64,
        data: &[u8],
        oob:  &[u8],
    ) -> Result<(), MtdError>;

    /// Check whether a NAND block is bad (factory-marked or runtime-marked).
    fn block_isbad(&self, ofs: u64) -> Result<bool, MtdError>;

    /// Mark a NAND block as bad after an unrecoverable ECC error.
    fn block_markbad(&self, ofs: u64) -> Result<(), MtdError>;
}

/// MTD-specific error codes.
#[derive(Debug)]
pub enum MtdError {
    /// Generic I/O error.
    Io(KernelError),
    /// Uncorrectable ECC error: bit flip count exceeded ECC strength.
    EccError,
    /// Byte offset or length exceeds device bounds.
    OutOfBounds,
    /// Write or erase targeting a bad block (NAND).
    BadBlock,
    /// Device is write-protected (WP# pin asserted or software lock active).
    WriteProtected,
}
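The NOR write rule (bits may only go 1→0; returning to 1 requires an erase) is worth making concrete. A minimal RAM-backed model of the erase/write contract, for illustration only:

```rust
/// Toy NOR model: erase sets a block to all ones; write can only clear bits.
pub struct NorModel { mem: Vec<u8>, erasesize: usize }

impl NorModel {
    pub fn new(size: usize, erasesize: usize) -> Self {
        Self { mem: vec![0xFF; size], erasesize } // fresh flash reads all-ones
    }

    /// Erase one block: every byte in the block returns to 0xFF.
    pub fn erase(&mut self, addr: usize) -> Result<(), &'static str> {
        if addr % self.erasesize != 0 { return Err("unaligned erase"); }
        self.mem[addr..addr + self.erasesize].fill(0xFF);
        Ok(())
    }

    /// Write: every 1-bit in `data` must already be 1 in the array,
    /// otherwise the write would need an intervening erase and is rejected.
    pub fn write(&mut self, to: usize, data: &[u8]) -> Result<(), &'static str> {
        for (i, &b) in data.iter().enumerate() {
            let cur = self.mem[to + i];
            if b & !cur != 0 { return Err("would set 0 -> 1 without erase"); }
            self.mem[to + i] = cur & b; // only clears bits
        }
        Ok(())
    }

    pub fn read(&self, from: usize, len: usize) -> &[u8] {
        &self.mem[from..from + len]
    }
}
```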

12.10.2 MTD Partitions

Raw flash devices are divided into named partitions analogous to disk partitions. Each partition is a contiguous subrange of the parent MTD device and appears as its own /dev/mtdN node.

/// An MTD partition: a named subrange of a parent MTD device.
pub struct MtdPartition {
    /// Partition name (e.g., "bootloader", "kernel", "rootfs"). NUL-terminated.
    pub name:   [u8; 64],
    /// Byte offset within the parent MTD device. Must be erasesize-aligned.
    pub offset: u64,
    /// Partition size in bytes. Must be a multiple of erasesize.
    pub size:   u64,
    /// True if this partition is read-only (erase and write are rejected).
    pub ro:     bool,
}

Partition source priority (highest first):

  1. Kernel command line (mtdparts= parameter via cmdlinepart driver): mtdparts=spi0.0:512k(bootloader),1m(kernel),-(rootfs)
  2. Device tree (partitions subnode with compatible = "fixed-partitions", child nodes with reg and label properties)
  3. RedBoot FIS table (self-describing NOR flash partition table at a known offset)

Each partition appears as its own MTD device: the parent is /dev/mtd0; partitions are /dev/mtd1, /dev/mtd2, etc., in registration order.
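A sketch of the alignment checks a partition parser must apply, per the rules above (illustrative helper, not the actual parser):

```rust
/// Validate one partition entry against the parent MTD device.
pub fn validate_partition(
    offset: u64, size: u64, parent_size: u64, erasesize: u32,
) -> Result<(), &'static str> {
    let eb = erasesize as u64;
    if offset % eb != 0 { return Err("offset not erasesize-aligned"); }
    if size == 0 || size % eb != 0 { return Err("size not a multiple of erasesize"); }
    // Reject partitions running past the end of the parent device.
    if offset.checked_add(size).map_or(true, |end| end > parent_size) {
        return Err("partition exceeds parent device");
    }
    Ok(())
}
```

The mtdparts example above (512k bootloader, 1m kernel) passes these checks on any device with 64 KiB erase blocks.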

12.10.3 Character Devices: /dev/mtdN and /dev/mtdblockN

/dev/mtdN (major 90, minor 2N): raw MTD character device. Supports sequential read()/write() with lseek(). ioctls:

| ioctl | Description |
|---|---|
| MEMGETINFO | Returns MtdInfo for this device |
| MEMERASE | Erase blocks (struct erase_info_user { start, length }) |
| MEMREAD | Read with OOB data |
| MEMWRITE | Write with OOB data |
| MEMGETBADBLOCK | Query bad block at given offset |
| MEMSETBADBLOCK | Mark block at given offset as bad |
| MEMGETOOBSEL | Get OOB layout (ECC byte positions, free byte positions) |
| MEMLOCK | Write-lock sectors (NOR flash with hardware lock bits) |
| MEMUNLOCK | Write-unlock sectors |

/dev/mtdblockN (major 31, minor N): block device interface over MTD. Translates block layer read/write requests into MTD read/erase/write sequences. Suitable for FAT filesystems on NOR flash. Not suitable for NAND (use UBI + UBIFS instead — the block interface performs destructive random writes that destroy NAND without wear leveling).

12.10.4 UBI (Unsorted Block Images)

UBI is a wear-leveling and bad-block management layer that sits between raw NAND and UBIFS. It maintains a volume table, distributes erases evenly across all physical erase blocks, and transparently remaps bad blocks.

/// UBI volume type.
pub enum UbiVolumeType {
    /// Writable and erasable (standard data partition).
    Dynamic,
    /// Read-only after finalization; integrity verified by ECC on every read.
    Static,
}

/// A UBI logical volume.
pub struct UbiVolume {
    /// Volume ID (0 to UBI_MAX_VOLUMES-1; typically up to 128 volumes).
    pub vol_id:    u32,
    /// Volume type.
    pub type_:     UbiVolumeType,
    /// Volume name. NUL-terminated.
    pub name:      [u8; 128],
    /// Volume size in bytes (multiple of leb_size).
    pub size:      u64,
    /// Logical erase block size = MTD erasesize − UBI per-block overhead
    /// (EC and VID headers, 64 B each, each occupying a full min-I/O unit,
    /// so two full pages on NAND).
    pub leb_size:  u32,
    /// Logical erase block data alignment in bytes (usually 1).
    pub alignment: u32,
}

UBI exposes volumes as /dev/ubiN_M (UBI device N, volume M). UBIFS mounts directly on a UBI volume (mount -t ubifs ubi0:rootfs /).

Linux compatibility: same mtd-utils commands (flash_erase, flashcp, nandwrite, nanddump, ubiformat, ubimkvol, ubinfo, ubirename) work without modification. Identical ioctl codes, same /dev/mtdN and /dev/mtdblockN node layout.


12.11 IPMI — Intelligent Platform Management Interface

IPMI (Intelligent Platform Management Interface, version 2.0) enables out-of-band system monitoring and management via the Baseboard Management Controller (BMC). The kernel communicates with the BMC through one of four system interfaces: KCS (Keyboard Controller Style), SMIC, BT (Block Transfer), or SSIF (SMBus System Interface over I2C). Capabilities provided include: temperature, voltage, and fan telemetry via Sensor Data Records (SDR); remote power control; system event log (SEL) access; serial-over-LAN (SOL); and hardware watchdog. IPMI is universally present on server-class hardware and is required for IPMI-aware management frameworks (ipmitool, freeipmi, Redfish BMC integration).

12.11.1 IPMI Message

/// IPMI message: a request sent to or a response received from the BMC.
pub struct IpmiMsg {
    /// Network Function (NetFn). Even = request, odd = response.
    /// Common values: 0x04 Sensor/Event, 0x06 Application, 0x0A Storage,
    /// 0x2C Group Extension, 0x30–0x3F OEM/Site-specific.
    pub netfn:    u8,
    /// Command code within the NetFn.
    pub cmd:      u8,
    /// Completion code: 0x00 = success; non-zero = error (in responses).
    pub ccode:    u8,
    /// Valid bytes in `data`.
    pub data_len: u8,
    /// Message payload (max 64 bytes for KCS; 32 bytes for SSIF due to SMBus
    /// block transfer limit).
    pub data:     [u8; 64],
}

impl Default for IpmiMsg {
    fn default() -> Self {
        Self { netfn: 0, cmd: 0, ccode: 0, data_len: 0, data: [0u8; 64] }
    }
}
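NetFn pairing is mechanical: a response NetFn is the request NetFn with the low bit set. A sketch, using Get Device ID (NetFn 0x06 Application, Cmd 0x01), the standard BMC probe request, as the example:

```rust
/// NetFn the BMC will use when responding to a request NetFn.
pub fn response_netfn(request_netfn: u8) -> u8 {
    request_netfn | 1 // odd NetFn = response
}

/// True if this NetFn denotes a response message.
pub fn is_response(netfn: u8) -> bool {
    netfn & 1 == 1
}

/// (netfn, cmd) for Get Device ID, the canonical "is the BMC alive" probe.
pub fn get_device_id_request() -> (u8, u8) {
    (0x06, 0x01)
}
```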

/// IPMI Logical Unit Number (sub-channel within a network function).
#[repr(u8)]
pub enum IpmiLun {
    /// BMC hardware.
    Bmc      = 0,
    /// OEM channel 1.
    Oem1     = 1,
    /// SMS message LUN (BMC receive message queue for system software).
    IpmbChan = 2,
    /// OEM channel 2.
    Oem2     = 3,
}

12.11.2 System Interface Drivers

/// IPMI system interface trait. Implemented by KCS, SMIC, BT, and SSIF drivers.
pub trait IpmiSi: Send + Sync {
    /// Send an IPMI request and receive the BMC's response synchronously.
    ///
    /// Blocks until the BMC response is ready or `timeout_ms` elapses.
    fn send_recv(
        &self,
        request:    &IpmiMsg,
        response:   &mut IpmiMsg,
        timeout_ms: u32,
    ) -> Result<(), IpmiError>;

    /// Short name identifying this interface type (e.g., "kcs", "ssif", "bt").
    fn interface_type(&self) -> &'static str;
}

/// IPMI system interface error codes.
pub enum IpmiError {
    /// BMC did not respond within the timeout.
    Timeout,
    /// BMC returned a NACK (SMBus) or error completion code.
    Nack,
    /// Malformed response data.
    InvalidData,
    /// BMC busy; retry.
    DeviceBusy,
    /// Underlying I/O error.
    Io(KernelError),
}

KCS (Keyboard Controller Style): The most common system interface. Uses two I/O-port register pairs: DATA_IN/DATA_OUT and STATUS/CMD. The driver implements the KCS state machine (KCS_IDLE → KCS_WRITE_START → KCS_WRITE_DATA → KCS_READ → KCS_IDLE), polling at 100 µs intervals. Switches to interrupt-driven operation if the BMC asserts a system IRQ.
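The status register the state machine polls packs the handshake flags and the BMC state into one byte. A decode sketch, with bit positions as defined by the IPMI KCS specification (OBF in bit 0, IBF in bit 1, state in bits 7:6):

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum KcsState { Idle, Read, Write, Error }

/// Decode the BMC state field (status bits 7:6).
pub fn kcs_state(status: u8) -> KcsState {
    match status >> 6 {
        0b00 => KcsState::Idle,
        0b01 => KcsState::Read,
        0b10 => KcsState::Write,
        _    => KcsState::Error,
    }
}

/// Output Buffer Full: a byte from the BMC is ready for the host.
pub fn obf(status: u8) -> bool { status & 0x01 != 0 }

/// Input Buffer Full: the host's last byte has not yet been consumed.
pub fn ibf(status: u8) -> bool { status & 0x02 != 0 }
```

The driver's polling loop waits for IBF to clear before writing and for OBF to set before reading, branching on the state field.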

SSIF (SMBus System Interface): IPMI over SMBus (I2C). Maximum payload 32 bytes per SMBus block transfer. Messages exceeding 32 bytes use multi-part transactions. Implemented by IpmiSsif on top of the I2cBus trait (Section 10.10.1).

BT (Block Transfer): Three I/O-port registers; supports BMC-initiated interrupts to the host. Deprecated in new platform designs.

12.11.3 /dev/ipmiN Character Device

Each IPMI interface creates a /dev/ipmiN node (major 239, minor assigned dynamically by the kernel):

/// Userspace request structure for IPMICTL_SEND_COMMAND.
/// Layout matches Linux `struct ipmi_req` for ABI compatibility.
#[repr(C)]
pub struct IpmiReq {
    /// Pointer to IPMI address structure (kernel reads via copy_from_user).
    pub addr:     *const IpmiAddrT,
    /// Size of the address structure.
    pub addr_len: u32,
    /// Caller-assigned message ID; returned unchanged with the response.
    pub msgid:    i64,
    /// Message header: netfn, cmd, data length, and pointer to data buffer.
    pub msg:      IpmiMsgHdr,
}

// ioctl command codes (matches Linux openipmi ABI)
pub const IPMICTL_SEND_COMMAND:           u32 = 0x8028690D;
pub const IPMICTL_RECEIVE_MSG_TRUNC:      u32 = 0xC030690B;
pub const IPMICTL_RECEIVE_MSG:            u32 = 0xC030690C;
pub const IPMICTL_REGISTER_FOR_CMD:       u32 = 0x8008690A;
pub const IPMICTL_UNREGISTER_FOR_CMD:     u32 = 0x80086909;
pub const IPMICTL_SET_MY_CHANNEL_ADDRESS: u32 = 0x80046983;
pub const IPMICTL_GET_MY_CHANNEL_ADDRESS: u32 = 0x40046984;
pub const IPMICTL_SET_TIMING_PARMS:       u32 = 0x80106985;
pub const IPMICTL_GET_TIMING_PARMS:       u32 = 0x40106986;
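These codes follow the Linux _IOC encoding (nr in bits 7:0, type in 15:8, size in 29:16, direction in 31:30 with 1 = write, 2 = read). Decoding one back into its fields is a useful sanity check; a sketch:

```rust
/// Decoded _IOC ioctl fields.
pub struct IocFields { pub dir: u32, pub size: u32, pub type_: u32, pub nr: u32 }

/// Split an ioctl code into its _IOC fields.
pub fn ioc_decode(code: u32) -> IocFields {
    IocFields {
        dir:   (code >> 30) & 0x3,     // 1 = write, 2 = read, 3 = read/write
        size:  (code >> 16) & 0x3FFF,  // argument struct size in bytes
        type_: (code >> 8) & 0xFF,     // magic byte (0x69 = 'i' for IPMI)
        nr:    code & 0xFF,            // command number
    }
}
```

IPMICTL_SEND_COMMAND decodes to magic 'i' (0x69), command 13, a 40-byte argument (the 64-bit layout of IpmiReq), direction read.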

select()/poll()/epoll() on /dev/ipmiN: the descriptor becomes readable when a response or an asynchronous event message from the BMC is available. Multiple processes may open /dev/ipmiN simultaneously; responses are demultiplexed by the msgid field.

12.11.4 Platform Event / Panic Notifier

On kernel panic, UmkaOS sends a Platform Event Message to the BMC so it can log the event, alert the management network, or trigger an automatic power cycle after a configurable delay.

/// Sends an IPMI OS Critical Stop event to the BMC on kernel panic.
pub struct IpmiPanicNotifier {
    /// IPMI system interface to use.
    pub si: Arc<dyn IpmiSi>,
}

impl PanicNotifier for IpmiPanicNotifier {
    fn notify_panic(&self, _msg: &str) {
        // Platform Event Message: NetFn=0x04 (Sensor/Event), Cmd=0x02
        let mut req = IpmiMsg::default();
        req.netfn    = 0x04;
        req.cmd      = 0x02;
        req.data_len = 8;
        req.data[0]  = 0x41; // Generator ID: OS kernel software
        req.data[1]  = 0x04; // EvMRev: IPMI 1.5+ event message revision
        req.data[2]  = 0x20; // Sensor Type: OS Critical Stop
        req.data[3]  = 0xFF; // Sensor Number: unspecified
        req.data[4]  = 0x6F; // Event Dir: assertion; Event Type: sensor-specific
        req.data[5]  = 0x01; // Event Data 1: run-time critical stop (offset 01h)
        req.data[6]  = 0xFF; // Event Data 2: unspecified
        req.data[7]  = 0xFF; // Event Data 3: unspecified
        // Ignore errors: BMC may not respond during a panic.
        let _ = self.si.send_recv(&req, &mut IpmiMsg::default(), 500);
    }
}

Linux compatibility: identical /dev/ipmiN ioctl interface; ipmitool, freeipmi, and OpenIPMI userspace libraries work without modification. ACPI _HID = "IPI0001" device detection and ipmi_si PnP IDs are supported.


12.12 UIO — Userspace I/O

UIO (Userspace I/O) allows complete device drivers to be implemented in userspace. A minimal kernel stub registers the device, maps device memory regions (MMIO BARs, reserved RAM) into the process address space via mmap() on /dev/uioN, and delivers hardware interrupts to userspace via a blocking read(). This is appropriate for FPGAs, industrial I/O cards, custom hardware with no existing kernel driver, and legacy proprietary hardware where a vendor supplies a userspace driver binary. Kernel code for a UIO device is minimal: it only implements the UioDevice trait; the rest of the driver lives in userspace.

12.12.1 UioDevice Trait

/// Kernel stub trait for a UIO device. Implemented once per device type.
pub trait UioDevice: Send + Sync {
    /// Device name shown in /sys/class/uio/uioN/name.
    fn name(&self) -> &str;

    /// Driver version string shown in /sys/class/uio/uioN/version.
    fn version(&self) -> &str;

    /// Memory regions to expose via mmap. Maximum UIO_MAX_MAPS (5) regions.
    fn mem_regions(&self) -> &[UioMem];

    /// Called when userspace writes 1 to /dev/uioN to re-enable the interrupt
    /// after it has been delivered. Prevents interrupt storms before userspace
    /// has finished processing.
    fn irq_control(&self, enable: bool);

    /// Called in interrupt context when the hardware asserts the IRQ.
    ///
    /// The implementation must disable the interrupt at the hardware level
    /// (to prevent re-entry) and return true to wake all blocked readers on
    /// /dev/uioN.
    fn irq_handler(&self) -> bool;
}

/// A physical or virtual memory region exposed via mmap on /dev/uioN.
pub struct UioMem {
    /// Physical base address of the region (for MMIO BARs or reserved RAM).
    pub addr:  u64,
    /// Size of the region in bytes. Must be a multiple of PAGE_SIZE.
    pub size:  usize,
    /// Memory type: determines how the mmap mapping is established.
    pub type_: UioMemType,
    /// Region name shown in sysfs maps/mapN/name. NUL-terminated.
    pub name:  [u8; 32],
}

/// How a UIO memory region is physically mapped into userspace.
pub enum UioMemType {
    /// Slot is unused (padding to preserve index of later slots).
    None,
    /// Physically contiguous memory (device MMIO or reserved RAM).
    /// mmap returns an uncached (write-combining or device) mapping.
    PhysContiguous,
    /// Kernel virtual memory (vmalloc area).
    /// mmap remaps the kernel virtual pages into the user VMA.
    Virtual,
    /// Kernel logical memory (struct page array, contiguous in kernel VA).
    /// mmap uses remap_pfn_range over the page frames.
    Logical,
}

12.12.2 /dev/uioN Character Device

mmap(): Each UioMem region is mapped at a fixed file offset: region 0 at offset 0, region 1 at offset 1 * PAGE_SIZE, region N at offset N * PAGE_SIZE (where PAGE_SIZE equals getpagesize() and UIO_MAX_MAPS = 5). Example: mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, region_index * getpagesize()). Userspace accesses hardware registers directly via the returned virtual address.

read(): Blocks until the UIO interrupt fires. Returns a u32 (4 bytes) containing the cumulative interrupt count since the device was opened. Supports O_NONBLOCK + select()/poll()/epoll().

write(): Write the value 1u32 (4 bytes) to re-enable the hardware interrupt after processing. This is required before the next read() will block again on a new interrupt edge.
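The pure parts of this protocol (the fixed mmap offset for a region index, the 4-byte count a read() returns, and the 4-byte re-enable value) can be sketched as below; the actual open/mmap/read/write syscalls are elided:

```rust
/// File offset at which region `region` must be mmap'ed (region N lives at
/// offset N * PAGE_SIZE; valid region indices are 0..UIO_MAX_MAPS).
pub fn uio_mmap_offset(region: usize, page_size: usize) -> usize {
    region * page_size
}

/// Decode the cumulative interrupt count returned by read() on /dev/uioN.
pub fn irq_count(buf: [u8; 4]) -> u32 {
    u32::from_ne_bytes(buf)
}

/// The 4-byte value written back to re-enable the interrupt.
pub fn irq_enable_bytes() -> [u8; 4] {
    1u32.to_ne_bytes()
}
```

A userspace driver loops: read() 4 bytes (blocks for the IRQ), service the device through the mmap'ed registers, then write irq_enable_bytes() to re-arm.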

sysfs under /sys/class/uio/uioN/:

| Path | Description |
|---|---|
| name | Device name (from UioDevice::name()) |
| version | Driver version (from UioDevice::version()) |
| event | Current interrupt count (mirrors read()) |
| maps/map0/addr | Physical address of region 0 (hex) |
| maps/map0/size | Size of region 0 (hex) |
| maps/map0/name | Name of region 0 |
| maps/map0/offset | Offset of region 0 data within its first mapped page (addr modulo PAGE_SIZE) |
| maps/mapN/… | Same fields for regions 1–4 |

12.12.3 uio_pdrv_genirq

uio_pdrv_genirq is a generic kernel component that turns any platform device with an IRQ into a UIO device without any device-specific kernel code. The interrupt handler disables the IRQ line and wakes userspace readers; the userspace driver re-enables the IRQ via write(1). This is the primary mechanism for FPGA and custom I/O card support in UmkaOS.

Linux compatibility: same /dev/uioN ABI; same sysfs layout; Linux UIO userspace libraries (libuio) and drivers written for Linux UIO work without modification.


12.13 NVMEM — Non-Volatile Memory Framework

The NVMEM (Non-Volatile Memory) framework provides a unified kernel interface to small non-volatile storage cells: I2C EEPROMs (AT24C series), SPI EEPROMs (AT25 series), OTP fuses (silicon eFuse banks), NVRAM cells inside RTC chips, battery-backed SRAM, and U-Boot environment variable storage. The framework decouples providers (EEPROM/fuse drivers that know how to read and write bytes) from consumers (e.g., Ethernet drivers needing a programmed MAC address, audio drivers needing factory calibration constants, clock drivers needing trim values). Consumers reference cells by symbolic name via device-tree or ACPI declarations rather than by raw byte offset.

12.13.1 Data Structures

/// An NVMEM provider device (one EEPROM chip, one OTP bank, etc.).
pub struct NvmemDevice {
    /// Device model name (e.g., "at24c256", "imx-ocotp"). NUL-terminated.
    pub name:      [u8; 64],
    /// Total addressable size in bytes.
    pub size:      usize,
    /// True if the device or the current software policy prohibits writes.
    pub read_only: bool,
    /// True if this is a One-Time-Programmable device (bits can only be
    /// written once and cannot be erased).
    pub otp:       bool,
    /// Named cells registered for this device.
    pub cells:     Vec<NvmemCell>,
    /// Read/write operations implemented by the provider driver.
    pub ops:       Arc<dyn NvmemOps>,
}

/// A named data cell within an NVMEM device.
pub struct NvmemCell {
    /// Cell name as declared in device-tree or ACPI (e.g., "mac-address",
    /// "calibration-data", "serial-number"). NUL-terminated.
    pub name:       [u8; 64],
    /// Byte offset of the cell's first byte within the NVMEM device.
    pub offset:     u32,
    /// Cell size in bits. Cells smaller than 8 bits use `bit_offset`.
    pub nbits:      u32,
    /// Bit offset within the byte at `offset` for sub-byte cells
    /// (e.g., a 4-bit trim value packed into the upper nibble of a byte).
    pub bit_offset: u8,
    /// True if the cell may be written. False for OTP cells already programmed
    /// or cells in read-only regions.
    pub writable:   bool,
}

/// NVMEM provider operations trait.
pub trait NvmemOps: Send + Sync {
    /// Read `buf.len()` bytes starting at byte `offset` within the NVMEM device.
    fn read(&self, offset: u32, buf: &mut [u8]) -> Result<(), KernelError>;

    /// Write `data.len()` bytes starting at byte `offset`.
    ///
    /// Returns `Err(EROFS)` if `NvmemDevice::read_only` is true.
    /// Returns `Err(EPERM)` if an OTP cell is already programmed.
    fn write(&self, offset: u32, data: &[u8]) -> Result<(), KernelError>;
}
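The nbits/bit_offset packing in NvmemCell implies some shifting and masking on the consumer side. The helper below is a sketch (not part of the documented API) of extracting a sub-byte or multi-byte cell value from the raw bytes read at the cell's byte offset, assuming bit_offset counts from the least significant bit of the first byte:

```rust
/// Extract an up-to-32-bit cell value from raw bytes read at the cell's
/// byte offset. `bit_offset` is assumed to count from the least significant
/// bit of the first byte (so the upper-nibble trim value from the example
/// has bit_offset = 4, nbits = 4). Illustrative helper, not a documented API.
fn extract_cell(raw: &[u8], nbits: u32, bit_offset: u8) -> u32 {
    assert!((1..=32).contains(&nbits));
    // Assemble up to 5 bytes little-endian so a shifted 32-bit field fits.
    let mut word: u64 = 0;
    for (i, b) in raw.iter().take(5).enumerate() {
        word |= (*b as u64) << (8 * i);
    }
    let mask: u64 = if nbits == 32 {
        u32::MAX as u64
    } else {
        (1u64 << nbits) - 1
    };
    ((word >> bit_offset) & mask) as u32
}

fn main() {
    // Upper nibble of 0xA5: a 4-bit trim value as in the NvmemCell example.
    println!("trim = {:#x}", extract_cell(&[0xA5], 4, 4));
}
```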

12.13.2 Consumer API

Consumer drivers call these functions from umka-core/src/nvmem/consumer.rs:

/// Look up an NVMEM cell handle by consumer device node and cell name.
///
/// `consumer` is the device node of the driver consuming the cell (used to
/// resolve the `nvmem-cells` + `nvmem-cell-names` DT properties).
/// `cell_name` is the symbolic cell name (e.g., "mac-address").
pub fn nvmem_cell_get(
    consumer:  &DeviceNode,
    cell_name: &str,
) -> Result<NvmemCellHandle, KernelError>;

/// Read the entire contents of `handle`'s cell into `buf`.
///
/// Returns the number of bytes written to `buf`.
pub fn nvmem_cell_read(
    handle: &NvmemCellHandle,
    buf:    &mut [u8],
) -> Result<usize, KernelError>;

/// Convenience wrapper: read a 6-byte MAC address from the cell named
/// "mac-address" on `consumer`. Handles big-endian byte order if the cell
/// is stored MSB-first (as is conventional in EEPROM MAC storage).
pub fn nvmem_cell_read_mac_address(
    consumer: &DeviceNode,
) -> Result<[u8; 6], KernelError>;

/// Write `data` to `handle`'s cell.
///
/// Returns `Err(EROFS)` if the device or cell is read-only.
/// Returns `Err(EPERM)` if the OTP bit is already set.
pub fn nvmem_cell_write(
    handle: &NvmemCellHandle,
    data:   &[u8],
) -> Result<(), KernelError>;
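The byte-order handling behind nvmem_cell_read_mac_address can be shown concretely. The sketch below is a hypothetical standalone helper (not the kernel function itself): it interprets a raw 6-byte cell as a MAC address, reversing the bytes for the occasional part that stores the address LSB-first rather than the conventional MSB-first order:

```rust
#[derive(Debug, PartialEq)]
enum MacError {
    BadLength(usize),
}

/// Interpret a raw NVMEM cell as a MAC address. EEPROMs conventionally
/// store the MAC MSB-first; pass `lsb_first = true` to reverse the bytes
/// for parts that store it LSB-first. Illustrative helper only.
fn mac_from_cell(raw: &[u8], lsb_first: bool) -> Result<[u8; 6], MacError> {
    if raw.len() != 6 {
        return Err(MacError::BadLength(raw.len()));
    }
    let mut mac = [0u8; 6];
    mac.copy_from_slice(raw);
    if lsb_first {
        mac.reverse();
    }
    Ok(mac)
}

fn main() {
    let raw = [0x00, 0x1B, 0x44, 0x11, 0x3A, 0xB7];
    println!("{:02x?}", mac_from_cell(&raw, false).unwrap());
}
```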

12.13.3 sysfs Interface

/sys/bus/nvmem/devices/<name>/
├── nvmem              rw if not read_only; r if read_only
│                      Raw byte access: supports lseek() + read()/write()
│                      with byte offset mapping directly to NVMEM address space
└── cells/
    └── <cell_name>    r--  Raw bytes of the named cell

Linux compatibility: same device-tree bindings (nvmem-cells, nvmem-cell-names, #nvmem-cell-cells); same sysfs layout; same consumer API function names. nvmem-tools userspace utilities work without modification.


12.14 SoundWire Bus Framework

SoundWire (MIPI Alliance SoundWire Specification version 1.2) is a two-wire (clock + data) serial audio bus used on Intel Tiger Lake, Alder Lake, Meteor Lake, and Raptor Lake SoCs to connect digital audio peripherals (codecs, amplifiers, DMIC arrays) via the PCH High-Definition Audio Multi-Link (HDAML) controller. SoundWire replaces the parallel HDA pin connections used by previous generations of external codecs. The UmkaOS SoundWire framework lives in umka-kernel/src/drivers/soundwire/ and integrates with the ASoC framework (Section 20.4) for stream management.

12.14.1 Bus Architecture

SoC PCH
├── Intel HDAML controller  (soundwire-intel driver)
│   ├── SoundWire link 0    (master, 48 MHz ref clock, 12.288 Mbit/s)
│   │   ├── Peripheral 0: RT712 codec      (Realtek, dev_num 1)
│   │   └── Peripheral 1: RT715 DMIC array (Realtek, dev_num 2)
│   └── SoundWire link 1    (master, second codec pair)
│       └── Peripheral 0: CS35L45 amplifier (Cirrus Logic, dev_num 1)
└── Legacy HDA controller   (for internal speakers / HDA codecs)

Each SoundWire link is a separate logical bus. Peripherals are automatically enumerated by the master during link startup: each peripheral responds with its MIPI manufacturer ID, part ID, class code, and firmware version.

12.14.2 Data Structures

/// A discovered SoundWire peripheral (codec, amplifier, or microphone array).
pub struct SdwPeripheral {
    /// SoundWire unique address assigned during enumeration (1–14; 0 and 15
    /// are reserved).
    pub dev_num:    u8,
    /// MIPI-registered manufacturer ID (e.g., 0x025D = Realtek).
    pub mfr_id:     u16,
    /// Manufacturer-assigned part identifier.
    pub part_id:    u16,
    /// MIPI device class code (0x01 = audio codec, 0x02 = amplifier,
    /// 0x03 = microphone).
    pub class_code: u8,
    /// Peripheral firmware revision number.
    pub version:    u8,
}

/// PCM audio stream configuration for a SoundWire link.
pub struct SdwStream {
    /// Human-readable name for debug output (e.g., "playback", "capture").
    /// NUL-terminated.
    pub name:            [u8; 32],
    /// Number of audio channels (e.g., 2 for stereo, 8 for surround).
    pub num_channels:    u8,
    /// PCM sample rate in Hz (e.g., 44100, 48000, 96000, 192000).
    pub sample_rate:     u32,
    /// Bit depth per sample (16, 20, 24, or 32).
    pub bits_per_sample: u8,
    /// SoundWire frame shape: number of rows per audio frame.
    /// Valid values: 48, 50, 60, 64, 72, 75, 80, 125, 147, 192, 250.
    pub frame_rows:      u8,
    /// SoundWire frame shape: number of columns per audio frame (2–16).
    pub frame_cols:      u8,
    /// Data port assignments: which SoundWire data ports carry this stream.
    pub ports:           Vec<SdwPortConfig>,
}

/// Mapping of a stream channel group to a SoundWire data port.
pub struct SdwPortConfig {
    /// Data port number on the peripheral (1–14).
    pub port_num:  u8,
    /// Bitmask of channels assigned to this port within the stream.
    pub ch_mask:   u32,
    /// Data port mode: 0 = isochronous (default), 1 = tx controlled,
    /// 2 = rx controlled, 3 = simplified.
    pub port_mode: u8,
}
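The frame-shape fields determine how much audio payload a link can carry. As a sketch of the arithmetic (using the 12.288 Mbit/s link rate from the topology diagram above; the function names are illustrative), the frame rate is the link bit rate divided by frame_rows × frame_cols, and a stream can only fit if its raw PCM bandwidth stays below the link rate:

```rust
/// Frames per second for a given link bit rate and frame shape.
fn frame_rate(link_bps: u64, rows: u32, cols: u32) -> u64 {
    link_bps / (rows as u64 * cols as u64)
}

/// Raw PCM bandwidth of a stream in bits per second.
fn stream_bps(channels: u32, sample_rate: u32, bits: u32) -> u64 {
    channels as u64 * sample_rate as u64 * bits as u64
}

fn main() {
    let link = 12_288_000u64; // link rate from the topology diagram
    // 48 rows x 2 cols -> 96 bits per frame
    println!("frame rate = {} frames/s", frame_rate(link, 48, 2));
    // Stereo 48 kHz / 24-bit PCM
    let s = stream_bps(2, 48_000, 24);
    println!("stream needs {} bit/s; fits: {}", s, s <= link);
}
```

Note this is an upper-bound check only: the real bus manager must also reserve control-word bits within each frame, so the usable payload is somewhat less than the raw link rate.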

/// KABI vtable for a SoundWire peripheral driver.
///
/// Peripheral drivers (codec, amplifier) implement this vtable. The
/// SoundWire bus manager calls into it for register access, stream lifecycle,
/// and interrupt handling.
#[repr(C)]
pub struct SdwPeripheralOps {
    /// Size of this vtable in bytes (for versioned ABI compatibility).
    /// Must be `u64` (not `usize`) per Section 11.1.3 Rule 3: vtable_size is part
    /// of the stable KABI and must have the same width on 32-bit and 64-bit targets.
    pub vtable_size: u64,
    /// Read a SoundWire register at `addr` (32-bit address, 8-bit value
    /// per SoundWire spec section 10). Returns value in low 8 bits; high
    /// bits are zero on success, 0xFFFF_FFFF on bus error.
    pub read_reg: unsafe extern "C" fn(
        ctx:  *mut c_void,
        addr: u32,
    ) -> u32,
    /// Write `value` (8 bits) to SoundWire register at `addr`.
    pub write_reg: unsafe extern "C" fn(
        ctx:   *mut c_void,
        addr:  u32,
        value: u32,
    ),
    /// Prepare and enable a PCM stream on this peripheral.
    ///
    /// `dir`: 0 = capture (peripheral → host), 1 = playback (host → peripheral).
    /// Returns 0 on success, negative errno on error.
    pub stream_enable: unsafe extern "C" fn(
        ctx:    *mut c_void,
        stream: *const SdwStream,
        dir:    u8,
    ) -> i32,
    /// Disable and release the stream identified by `stream_id`.
    pub stream_disable: unsafe extern "C" fn(
        ctx:       *mut c_void,
        stream_id: u32,
    ),
    /// Handle a SoundWire interrupt delivered to this peripheral.
    ///
    /// `status` is the INTSTAT register value. The driver clears the
    /// interrupt source and returns.
    pub interrupt: unsafe extern "C" fn(
        ctx:    *mut c_void,
        status: u32,
    ),
}
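To make the vtable contract concrete, here is a sketch of a mock peripheral backing the register callbacks with an in-memory register file. The struct is deliberately simplified to the two register entries (the real SdwPeripheralOps also carries vtable_size and the stream/interrupt callbacks); the 0xFFFF_FFFF bus-error sentinel and the "value in low 8 bits" rule follow the contract above.

```rust
use std::ffi::c_void;

/// Simplified two-entry slice of the vtable from the text. Illustrative
/// mock only, not a shipped driver.
#[repr(C)]
struct RegOps {
    read_reg: unsafe extern "C" fn(ctx: *mut c_void, addr: u32) -> u32,
    write_reg: unsafe extern "C" fn(ctx: *mut c_void, addr: u32, value: u32),
}

/// Mock peripheral: a tiny in-memory register file.
struct MockPeripheral {
    regs: [u8; 16],
}

unsafe extern "C" fn mock_read(ctx: *mut c_void, addr: u32) -> u32 {
    let p = unsafe { &*(ctx as *const MockPeripheral) };
    match p.regs.get(addr as usize) {
        Some(v) => *v as u32, // value in low 8 bits, high bits zero
        None => 0xFFFF_FFFF,  // bus-error sentinel from the contract
    }
}

unsafe extern "C" fn mock_write(ctx: *mut c_void, addr: u32, value: u32) {
    let p = unsafe { &mut *(ctx as *mut MockPeripheral) };
    if let Some(slot) = p.regs.get_mut(addr as usize) {
        *slot = value as u8; // only the low 8 bits are significant
    }
}

fn main() {
    let ops = RegOps { read_reg: mock_read, write_reg: mock_write };
    let mut dev = MockPeripheral { regs: [0; 16] };
    let ctx = &mut dev as *mut MockPeripheral as *mut c_void;
    unsafe {
        (ops.write_reg)(ctx, 3, 0xAB);
        println!("reg3 = {:#x}", (ops.read_reg)(ctx, 3));
        println!("oob  = {:#x}", (ops.read_reg)(ctx, 999));
    }
}
```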

12.14.3 Power States

SoundWire defines three standardized power management states:

| State | Description | Clock |
| --- | --- | --- |
| D0 | Active; streams running normally. | Full speed |
| ClockStop | Inactive; all peripherals maintain register state across the stop. Master gates the clock pin after the ClockStop Prepare handshake. | Stopped (gated) |
| ClockStop2 | Deepest sleep; peripherals may discard volatile register state. Non-volatile configuration (e.g., OTP-based defaults) is preserved. | Stopped (gated) |

Transition into ClockStop: master broadcasts ClockStop Prepare command → all peripherals ACK → master asserts ClockStop status → master gates clock. Wake: master restarts clock → peripherals detect clock activity → bus enumeration runs again → streams re-established from saved state.
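The entry/exit sequence above is effectively a small state machine on the master side. A sketch under obvious simplifications (single link, no timeout or NAK handling; state and event names are illustrative):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum LinkState {
    D0,               // active, clock running
    ClockStopPrepare, // Prepare broadcast sent, awaiting peripheral ACKs
    ClockStopped,     // clock gated
}

/// Master-side events driving the ClockStop sequence from the text.
enum Event {
    PrepareBroadcast,  // master broadcasts ClockStop Prepare
    AllPeripheralsAck, // every peripheral ACKed; master gates the clock
    ClockRestart,      // wake: clock restarts, re-enumeration follows
}

fn step(state: LinkState, ev: Event) -> LinkState {
    match (state, ev) {
        (LinkState::D0, Event::PrepareBroadcast) => LinkState::ClockStopPrepare,
        (LinkState::ClockStopPrepare, Event::AllPeripheralsAck) => LinkState::ClockStopped,
        (LinkState::ClockStopped, Event::ClockRestart) => LinkState::D0,
        (s, _) => s, // ignore events invalid in the current state
    }
}

fn main() {
    let mut s = LinkState::D0;
    for ev in [Event::PrepareBroadcast, Event::AllPeripheralsAck, Event::ClockRestart] {
        s = step(s, ev);
        println!("{s:?}");
    }
}
```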

12.14.4 Integration with ASoC (ALSA SoC)

SoundWire peripherals register as ASoC codec components. The sdw_master_device (the Intel HDAML controller) binds each SoundWire link to the ASoC machine driver. Stream bring-up sequence:

  1. ASoC DAPM (Dynamic Audio Power Management) resolves the active audio route.
  2. The machine driver identifies the SoundWire data ports involved.
  3. SdwPeripheralOps::stream_enable is called on each peripheral on the path.
  4. The Intel HDAML hardware programs the SoundWire frame shape (frame_rows × frame_cols) and asserts the SoundWire clock to start isochronous data transfer.

Linux compatibility: UmkaOS's SoundWire implementation follows the MIPI SoundWire 1.2 specification. Peripheral devices supported by the Linux soundwire-intel driver (Realtek RT711, RT712, RT715; Cirrus Logic CS35L45; Maxim MAX98373) work on UmkaOS using the same ACPI firmware tables. The sdw_stream_config register layout and MIPI frame-shape encoding are spec-compliant and identical to Linux. ASoC machine driver DT/ACPI bindings are compatible.