Chapter 13: Device Class Frameworks

NIC, GPU, USB, I2C/SMBus, WiFi, Bluetooth, Camera, Printers, Live Kernel Evolution, Watchdog, SPI, rfkill, MTD, IPMI, UIO, NVMEM, SoundWire


Per-device-class frameworks define the vtable contracts, state machines, and data structures for each hardware category. Each framework sits above the driver isolation layer (Chapter 11) and below userspace APIs. This chapter covers 26 device subsystems from NIC and GPU to I2C, USB, watchdog, and RTC.

Live kernel evolution (hot-swapping driver components at runtime) is specified here as a cross-cutting device-class concern. The Evolvable Module SDK provides three layers of state-spill prevention (link-time ELF section check, compile-time proc macro enforcement, PersistentState derive macro with versioned migration). Drivers written in C or Rust use the same KABI IDL for state definitions — a C driver can be hot-swapped for a Rust implementation with binary-compatible persistent state.

13.1 Device Class Overview

Complex hardware categories — wireless networking, display, and audio — each require a shared kernel subsystem that multiple hardware-specific drivers plug into. This section defines the authoritative interface contracts for these subsystems. Hardware-specific driver documentation (consumer chipsets in Sections 13.14, 13.15, 21.3, and 21.4; server NICs in Section 15.13; etc.) specifies implementations of these contracts, not independent parallel frameworks.

Any driver implementing a subsystem interface must:

- Follow the tier model (Section 11.3) using the tier specified here.
- Use UmkaOS ring buffers (Section 11.8) for all bulk data flows.
- Implement the crash recovery callbacks defined in Section 11.4.
- Register through the device registry (Section 11.4) rather than a subsystem-specific registration API.

Note on trait objects: Arc<dyn T> references (e.g., Arc<dyn SpiController>, Arc<dyn NvmemOps>) in device framework structs are kernel-internal references within the same address space. They do NOT cross the KABI boundary. Cross-domain communication uses C-ABI vtable pointers (Section 12.5), not Rust trait objects.
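The `&self` plus internal-locking convention that the `WirelessDriver` trait below follows can be sketched in miniature. This is an illustrative standalone example, not part of the KABI: the `Driver`/`DemoDriver`/`DriverState` names are hypothetical, and `std::sync::Mutex` stands in for whatever lock primitive the kernel actually uses.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical minimal trait mirroring the &self convention: the kernel
// holds Arc<dyn Driver>, so all mutation goes through interior locking.
trait Driver: Send + Sync {
    fn up(&self) -> Result<(), &'static str>;
    fn is_up(&self) -> bool;
}

struct DriverState {
    up: bool,
}

struct DemoDriver {
    // Mutable state lives behind a lock inside the driver, not behind
    // &mut self at the trait boundary.
    state: Mutex<DriverState>,
}

impl Driver for DemoDriver {
    fn up(&self) -> Result<(), &'static str> {
        let mut st = self.state.lock().map_err(|_| "poisoned")?;
        st.up = true;
        Ok(())
    }

    fn is_up(&self) -> bool {
        self.state.lock().map(|st| st.up).unwrap_or(false)
    }
}

fn main() {
    let drv: Arc<dyn Driver> = Arc::new(DemoDriver {
        state: Mutex::new(DriverState { up: false }),
    });
    drv.up().unwrap(); // mutation works through a shared reference
    assert!(drv.is_up());
}
```

Because the lock is private to the implementation, each driver can choose its own granularity (one coarse mutex, per-subsystem spinlocks) without changing the trait boundary.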

13.2 Wireless Subsystem

Tier: Tier 1. Wireless I/O latency directly affects user-visible responsiveness (video calls, gaming, SSH). The ~200–500 cycle Tier 2 syscall overhead per packet is unacceptable. WiFi and cellular firmware run on-chip (not on the host CPU), so the attack surface is IOMMU-bounded — same threat model as NVMe (Section 11.8).

KABI interface name: wireless_device_v1 (in interfaces/wireless_device.kabi).

// umka-core/src/net/wireless.rs — authoritative wireless driver contract

/// A wireless network device. Implemented by all wireless drivers
/// (WiFi 4/5/6/6E/7, cellular modems, 802.15.4).
///
/// **All methods take `&self`**, consistent with `AudioDriver` and `DisplayDriver`.
/// The kernel holds `Arc<dyn WirelessDriver>`, so `&mut self` would require an
/// external Mutex. Instead, drivers use internal locking (e.g.,
/// `Mutex<DriverState>` or per-subsystem SpinLocks) to protect mutable state.
/// This matches the established KABI pattern: the trait boundary uses `&self`
/// for ergonomics; the driver implementation manages its own synchronization.
pub trait WirelessDriver: Send + Sync {
    // --- Identity and capabilities ---

    /// Hardware address (6-byte MAC or 8-byte EUI-64 for 802.15.4).
    fn mac_addr(&self) -> &[u8];

    /// Supported wireless standards (bitmask).
    fn capabilities(&self) -> WirelessCapabilities;

    // --- Lifecycle ---

    /// Bring the radio up (allocate firmware resources, enable PHY).
    fn up(&self) -> Result<(), WirelessError>;

    /// Take the radio down (quiesce TX/RX, release firmware resources).
    fn down(&self) -> Result<(), WirelessError>;

    // --- Scan and association ---

    /// Request an active or passive scan on the given channels.
    /// Results are delivered via the event ring (see `WirelessEvent`).
    fn scan(&self, req: &ScanRequest) -> Result<(), WirelessError>;

    /// Dump cached scan results (BSS list). Called by `NL80211_CMD_GET_SCAN`.
    /// Invokes `cb` for each cached BSS entry from the most recent scan.
    fn get_scan_results(&self, cb: &mut dyn FnMut(&BssEntry)) -> Result<(), WirelessError>;

    /// Send an 802.11 authentication frame (open, SAE, FT).
    /// Used by `NL80211_CMD_AUTHENTICATE` (SME in userspace mode).
    fn authenticate(&self, params: &AuthParams) -> Result<(), WirelessError>;

    /// Send an 802.11 association request frame.
    /// Used by `NL80211_CMD_ASSOCIATE` (SME in userspace mode).
    fn associate(&self, params: &AssocParams) -> Result<(), WirelessError>;

    /// Associate with a network (SME in driver/firmware mode).
    fn connect(&self, params: &ConnectParams) -> Result<(), WirelessError>;

    /// Disassociate from the current network.
    fn disconnect(&self) -> Result<(), WirelessError>;

    // --- Interface management ---

    /// Change the interface type (station, AP, monitor, P2P-client, etc.).
    /// Used by `NL80211_CMD_SET_INTERFACE`.
    fn set_interface_type(&self, iftype: Nl80211IfType) -> Result<(), WirelessError>;

    /// Create a secondary virtual interface (P2P, monitor, AP).
    /// Used by `NL80211_CMD_ADD_VIRTUAL_INTERFACE`. Returns the new interface index.
    fn add_interface(&self, name: &[u8], iftype: Nl80211IfType) -> Result<u32, WirelessError>;

    /// Delete a secondary virtual interface.
    /// Used by `NL80211_CMD_DEL_VIRTUAL_INTERFACE`.
    fn del_interface(&self, ifindex: u32) -> Result<(), WirelessError>;

    // --- AP mode ---

    /// Start access point mode with the given configuration (SSID, channel,
    /// security, beacon interval). Used by `NL80211_CMD_START_AP`.
    fn start_ap(&self, params: &ApParams) -> Result<(), WirelessError>;

    /// Stop access point mode. Used by `NL80211_CMD_STOP_AP`.
    fn stop_ap(&self) -> Result<(), WirelessError>;

    /// Configure AP BSS parameters (beacon interval, DTIM period, HT/VHT
    /// operation IEs). Used by `NL80211_CMD_SET_BSS`.
    fn set_bss_params(&self, params: &BssParams) -> Result<(), WirelessError>;

    // --- Channel and frame management ---

    /// Set the operating channel (monitor mode or CSA target).
    /// Used by `NL80211_CMD_SET_CHANNEL`.
    fn set_channel(&self, channel: &ChannelSpec) -> Result<(), WirelessError>;

    /// Register to receive specific management frame subtypes.
    /// Used by `NL80211_CMD_REGISTER_FRAME`. `frame_type` is the IEEE 802.11
    /// frame type/subtype bitmask; `match_data` is an optional prefix filter.
    fn register_mgmt_frame(&self, frame_type: u16, match_data: &[u8]) -> Result<(), WirelessError>;

    // --- Data path ---

    /// Return the TX ring shared with umka-net. The ring is allocated
    /// by the driver during `up()` from DMA-capable memory (Section 12.1.5).
    fn tx_ring(&self) -> &RingBuffer<TxDescriptor>;

    /// Return the RX ring shared with umka-net.
    fn rx_ring(&self) -> &RingBuffer<RxDescriptor>;

    // --- Power ---

    /// Set the power save mode (maps to hardware PSM / DTIM skip).
    fn set_power_save(&self, mode: WirelessPowerSave) -> Result<(), WirelessError>;

    /// Configure Wake-on-WLAN patterns before S3 suspend.
    /// Hardware typically supports 8-32 patterns. If `patterns.len()` exceeds
    /// the device's capacity (reported via `WirelessDeviceInfo.max_wowlan_patterns`),
    /// returns `WirelessError::InvalidParam`.
    fn set_wowlan(&self, patterns: &[WowlanPattern]) -> Result<(), WirelessError>;

    // --- Key management (WPA/WPA2/WPA3) ---

    /// Install a pairwise or group key. `key_index` identifies the key slot
    /// (0-3 for group keys, 0 for pairwise). `key_type` distinguishes
    /// pairwise, group, and IGTK (integrity group temporal key).
    fn add_key(&self, params: &KeyParams) -> Result<(), WirelessError>;

    /// Remove an installed key by index and type.
    fn del_key(&self, key_index: u8, key_type: KeyType) -> Result<(), WirelessError>;

    /// Read key sequence counter (for replay detection). Returns the current
    /// TX/RX sequence counter for the specified key.
    fn get_key(&self, key_index: u8, key_type: KeyType) -> Result<KeyInfo, WirelessError>;

    /// Set the default TX key index for unicast frames.
    fn set_default_key(&self, key_index: u8) -> Result<(), WirelessError>;

    /// Set the default management frame protection key index (IGTK).
    fn set_default_mgmt_key(&self, key_index: u8) -> Result<(), WirelessError>;

    // --- Station management (AP mode and mesh) ---

    /// Add a station entry (AP mode: new client association).
    fn add_station(&self, params: &StationParams) -> Result<(), WirelessError>;

    /// Remove a station entry (AP mode: client disassociation/deauth).
    fn del_station(&self, mac: &MacAddr) -> Result<(), WirelessError>;

    /// Modify station parameters (flags, supported rates, HT/VHT/HE caps).
    fn change_station(&self, mac: &MacAddr, params: &StationParams) -> Result<(), WirelessError>;

    /// Query per-station information (RSSI, TX/RX rates, bytes, etc.).
    fn get_station(&self, mac: &MacAddr) -> Result<StationInfo, WirelessError>;

    /// Iterate all known stations. Calls `cb` for each station entry.
    /// Used by `NL80211_CMD_DUMP_STATION`.
    fn dump_station(&self, cb: &mut dyn FnMut(&StationInfo)) -> Result<(), WirelessError>;

    // --- PMKSA cache (802.11r/SAE fast roaming) ---

    /// Add a PMKSA (Pairwise Master Key Security Association) cache entry.
    /// Used for fast BSS transition (802.11r) and SAE fast reconnect.
    fn set_pmksa(&self, pmksa: &PmksaEntry) -> Result<(), WirelessError>;

    /// Remove a PMKSA cache entry by BSSID.
    fn del_pmksa(&self, bssid: &MacAddr) -> Result<(), WirelessError>;

    /// Flush all PMKSA cache entries (e.g., on network profile removal).
    fn flush_pmksa(&self) -> Result<(), WirelessError>;

    // --- Off-channel operations ---

    /// Remain on a specific channel for `duration_ms` milliseconds.
    /// Used for off-channel TX (P2P discovery, DFS). Returns a cookie
    /// identifying this remain-on-channel session.
    fn remain_on_channel(&self, channel: &ChannelSpec, duration_ms: u32)
        -> Result<u64, WirelessError>;

    /// Cancel an active remain-on-channel session by cookie.
    fn cancel_remain_on_channel(&self, cookie: u64) -> Result<(), WirelessError>;

    // --- Management frame TX ---

    /// Transmit a raw 802.11 management frame. `buf` contains the complete
    /// frame starting from the 802.11 header. Returns a cookie for TX status.
    /// Used for authentication, deauth, action frames, and P2P negotiation.
    fn mgmt_tx(&self, channel: &ChannelSpec, buf: &[u8], no_ack: bool)
        -> Result<u64, WirelessError>;

    /// Cancel a pending management frame TX wait (by cookie from `mgmt_tx`).
    fn mgmt_tx_cancel_wait(&self, cookie: u64) -> Result<(), WirelessError>;

    // --- Connection quality monitoring ---

    /// Configure RSSI-based connection quality monitoring thresholds.
    /// Firmware generates `WirelessEvent::CqmRssi` when RSSI crosses
    /// `threshold_dbm ± hysteresis_db`.
    fn set_cqm_rssi_config(&self, threshold_dbm: i32, hysteresis_db: u32)
        -> Result<(), WirelessError>;

    // --- Channel switch (CSA) ---

    /// Initiate a Channel Switch Announcement (AP mode). The firmware
    /// transmits CSA elements in beacons/probe responses for `count` beacons,
    /// then switches to `target_channel`.
    fn channel_switch(&self, target_channel: &ChannelSpec, count: u8, block_tx: bool)
        -> Result<(), WirelessError>;

    // --- GTK rekey offload ---

    /// Offload GTK rekeying to firmware (for S0ix/suspend). The firmware
    /// performs 4-way handshake group key updates without waking the host CPU.
    ///
    /// Key sizes vary by AKM suite:
    /// - kek: 16 bytes (WPA2 CCMP-128) or 32 bytes (WPA3 GCMP-256).
    /// - kck: 16 bytes (WPA2), 24 bytes (WPA3-SAE), or 32 bytes (WPA3-Enterprise 192-bit).
    /// - akm: AKM suite selector determining expected key lengths.
    /// Returns `WirelessError::InvalidKeyLength` if sizes do not match the AKM suite.
    fn set_rekey_offload(&self, kek: &[u8], kck: &[u8], replay_ctr: &[u8; 8], akm: u32)
        -> Result<(), WirelessError>;

    // --- DFS radar detection ---

    /// Start Channel Availability Check (CAC) on a DFS channel.
    /// The driver must monitor for radar pulses for `duration_ms` milliseconds
    /// (typically 60000 ms, or 600000 ms for weather radar channels).
    /// On completion: if no radar is detected, the driver emits
    /// `WirelessEvent::CacFinished { success: true }` and the channel
    /// becomes available for use. If radar is detected during CAC, the
    /// driver emits `WirelessEvent::RadarDetected` and the channel is
    /// marked as unavailable for the Non-Occupancy Period (NOP = 30 min).
    ///
    /// Equivalent to Linux `NL80211_CMD_RADAR_DETECT` / cfg80211_ops.start_radar_detection.
    fn start_cac(&self, channel: &ChannelSpec, duration_ms: u32) -> Result<(), WirelessError>;

    /// Abort an in-progress CAC. Used when the interface is being torn down
    /// or the regulatory domain changes during CAC.
    fn abort_cac(&self) -> Result<(), WirelessError>;

    // --- Statistics ---

    fn stats(&self) -> WirelessStats;
}
// Total: 45 methods covering all nl80211 operations listed in Section 13.3.3.

bitflags! {
    pub struct WirelessCapabilities: u32 {
        const WIFI_4    = 1 << 0;  // 802.11n
        const WIFI_5    = 1 << 1;  // 802.11ac
        const WIFI_6    = 1 << 2;  // 802.11ax
        const WIFI_6E   = 1 << 3;  // 802.11ax 6 GHz
        const WIFI_7    = 1 << 4;  // 802.11be
        const BT_5      = 1 << 8;  // Bluetooth 5.x (combo chip)
        const WOWLAN    = 1 << 16; // Wake-on-WLAN support
        const SCAN_OFFLOAD = 1 << 17; // Autonomous background scan in S0ix
    }
}

#[repr(u32)]
pub enum WirelessPowerSave {
    /// Radio always awake (CAM). Lowest latency, highest power.
    Disabled  = 0,
    /// 802.11 PSM (sleep between beacons, wake on DTIM).
    Enabled   = 1,
    /// Aggressive PSM (DTIM skipping, beacon filtering).
    Aggressive = 2,
}

/// 6-byte IEEE 802 MAC address.
/// `#[repr(transparent)]` ensures the same layout as `[u8; 6]` for
/// embedding in KABI `#[repr(C)]` structs (BssEntry, AuthParams, etc.).
#[repr(transparent)]
pub struct MacAddr(pub [u8; 6]);

/// SSID (0-32 bytes, not null-terminated per IEEE 802.11-2020 §9.4.2.2).
/// `#[repr(C)]` for stable layout in KABI structs. Uses fixed `[u8; 32]` + `len`
/// instead of `ArrayVec` (which has no stable repr and cannot cross KABI boundaries).
#[repr(C)]
pub struct Ssid {
    /// SSID bytes. Valid content is `data[..len]`. Remainder is zero-padded.
    pub data: [u8; 32],
    /// Valid byte count (0-32).
    pub len: u8,
}
const_assert!(core::mem::size_of::<Ssid>() == 33);
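As a sketch, an `Ssid` can be built from a byte slice while preserving the documented zero-padding invariant. The `from_bytes`/`as_bytes` helpers here are hypothetical, added for illustration only:

```rust
// Standalone mirror of the Ssid struct above, for illustration.
#[repr(C)]
pub struct Ssid {
    pub data: [u8; 32],
    pub len: u8,
}

impl Ssid {
    // Hypothetical constructor: rejects SSIDs longer than 32 bytes and
    // zero-pads the remainder, matching the documented invariant.
    pub fn from_bytes(s: &[u8]) -> Option<Ssid> {
        if s.len() > 32 {
            return None;
        }
        let mut data = [0u8; 32];
        data[..s.len()].copy_from_slice(s);
        Some(Ssid { data, len: s.len() as u8 })
    }

    // Hypothetical accessor: only data[..len] is valid content.
    pub fn as_bytes(&self) -> &[u8] {
        &self.data[..self.len as usize]
    }
}

fn main() {
    assert_eq!(core::mem::size_of::<Ssid>(), 33); // matches the const_assert
    let ssid = Ssid::from_bytes(b"UmkaNet").unwrap();
    assert_eq!(ssid.as_bytes(), &b"UmkaNet"[..]);
    assert!(Ssid::from_bytes(&[0u8; 33]).is_none());
}
```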

/// Channel specification: frequency + width.
/// `#[repr(C)]` for stable layout in KABI structs (BssEntry, ScanRequest, etc.).
#[repr(C)]
pub struct ChannelSpec {
    /// Center frequency in MHz (e.g., 2412 for channel 1, 5180 for channel 36).
    pub center_freq_mhz: u32,
    /// Channel width.
    pub width: ChannelWidth,
    /// Explicit padding: repr(C) aligns struct to max field alignment (4 bytes
    /// from center_freq_mhz). ChannelWidth is 1 byte, so 3 bytes padding.
    pub _pad: [u8; 3],
}
// Size: 4 (center_freq_mhz) + 1 (width) + 3 (pad) = 8 bytes.
const_assert!(core::mem::size_of::<ChannelSpec>() == 8);

#[repr(u8)]
pub enum ChannelWidth {
    Mhz20    = 0,
    Mhz40    = 1,
    Mhz80    = 2,
    Mhz160   = 3,
    Mhz80P80 = 4,  // 80+80 MHz (VHT)
    Mhz320   = 5,  // Wi-Fi 7 (EHT)
}

/// Scan request parameters. Matches NL80211_CMD_TRIGGER_SCAN attributes.
///
/// `#[repr(C)]` for KABI stability — this struct is passed to
/// `WirelessDriver::scan()` across the KABI boundary.
#[repr(C)]
pub struct ScanRequest {
    /// SSIDs to probe for (active scan). Empty = broadcast probe.
    /// nl80211 limit: NL80211_MAX_SCAN_SSIDS = 20.
    pub ssids: KabiArray<Ssid, 20>,
    /// Explicit padding after ssids (KabiArray align 2, ends at offset 662).
    /// 662 % 4 = 2, need 2 bytes to align channels (KabiArray<ChannelSpec> align 4).
    /// CLAUDE.md rule 11.
    pub _pad_ssids: [u8; 2],
    /// Specific channels to scan. Empty = all supported channels.
    pub channels: KabiArray<ChannelSpec, 64>,
    /// Active (send probes) or passive (listen only).
    pub scan_type: ScanType,
    /// Explicit padding after scan_type (u8) to align ie (next KabiArray).
    pub _pad0: [u8; 1],
    /// Extra information elements to include in probe request frames.
    /// Used by wpa_supplicant for Interworking/Hotspot 2.0.
    pub ie: KabiArray<u8, 512>,
    /// NL80211_SCAN_FLAG_* bitmask.
    pub flags: ScanFlags,
    /// Per-channel dwell time in milliseconds. 0 = driver default.
    pub duration_ms: u32,
}
// ScanRequest layout (all padding explicit):
//   ssids(KabiArray<Ssid,20>=662) + _pad_ssids([u8;2]=2) +
//   channels(KabiArray<ChannelSpec,64>=516) + scan_type(u8=1) + _pad0([u8;1]=1) +
//   ie(KabiArray<u8,512>=514) + flags(ScanFlags u32=4) + duration_ms(u32=4)
//   = 662+2+516+1+1+514+4+4 = 1704 bytes. Struct align 4.
const_assert!(core::mem::size_of::<ScanRequest>() == 1704);
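The element sizes in these layout comments are consistent with `KabiArray<T, N>` being inline storage plus a `u16` count. The stand-in definition below is an assumption for illustration (the real `KabiArray` is defined elsewhere in the spec); it only checks that the arithmetic in the comments holds:

```rust
// Illustrative stand-in for KabiArray<T, N>: fixed inline storage plus a
// u16 element count, repr(C) so the layout is stable across the KABI.
#[repr(C)]
pub struct KabiArray<T, const N: usize> {
    pub items: [T; N],
    pub len: u16,
}

// Mirrors of the structs above, sized as documented.
#[repr(C)]
pub struct Ssid {
    pub data: [u8; 32],
    pub len: u8,
} // 33 bytes, align 1

#[repr(C)]
pub struct ChannelSpec {
    pub center_freq_mhz: u32,
    pub width: u8, // stands in for ChannelWidth (repr(u8))
    pub _pad: [u8; 3],
} // 8 bytes, align 4

fn main() {
    use core::mem::size_of;
    // Matches the ScanRequest layout comment:
    assert_eq!(size_of::<KabiArray<Ssid, 20>>(), 662); // 20*33 + 2
    assert_eq!(size_of::<KabiArray<u8, 512>>(), 514);  // 512 + 2
    // With a 4-byte-aligned element, the trailing u16 len is padded out:
    assert_eq!(size_of::<KabiArray<ChannelSpec, 64>>(), 516); // 512 + 2 + 2 pad
}
```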

#[repr(u8)]
pub enum ScanType {
    /// Send probe requests on each channel.
    Active  = 0,
    /// Listen only (required on DFS channels before CAC).
    Passive = 1,
}

bitflags! {
    pub struct ScanFlags: u32 {
        /// NL80211_SCAN_FLAG_LOW_PRIORITY — yield to data traffic.
        const LOW_PRIORITY   = 1 << 0;
        /// NL80211_SCAN_FLAG_FLUSH — flush old BSS entries before scan.
        const FLUSH          = 1 << 1;
        /// NL80211_SCAN_FLAG_AP — scan while in AP mode.
        const AP             = 1 << 2;
        /// NL80211_SCAN_FLAG_RANDOM_ADDR — randomize source MAC per probe.
        const RANDOM_ADDR    = 1 << 3;
    }
}

/// Connection parameters for NL80211_CMD_CONNECT.
///
/// `#[repr(C)]` for KABI stability — passed to `WirelessDriver::connect()`.
#[repr(C)]
pub struct ConnectParams {
    /// Target network SSID.
    pub ssid: Ssid,
    /// Target BSSID (None = driver/firmware selects best AP).
    pub bssid: KabiOption<MacAddr>,
    /// Target channel (None = scan all).
    pub channel: KabiOption<ChannelSpec>,
    /// Authentication algorithm.
    pub auth_type: AuthType,
    /// Explicit padding after auth_type (u8, offset 53) to align crypto (align 4).
    /// 53 % 4 = 1, need 3 bytes. CLAUDE.md rule 11.
    pub _pad0: [u8; 3],
    /// Cipher and AKM suite selection.
    pub crypto: CryptoSettings,
    /// Extra IEs for association request frame.
    pub ie: KabiArray<u8, 512>,
    /// Previous BSSID for reassociation (roaming). None for initial connect.
    pub prev_bssid: KabiOption<MacAddr>,
    /// Explicit padding after prev_bssid (offset 621) to align flags (align 4).
    /// 621 % 4 = 1, need 3 bytes. CLAUDE.md rule 11.
    pub _pad1: [u8; 3],
    /// Connection flags.
    pub flags: ConnectFlags,
}
// ConnectParams layout (all padding explicit):
//   ssid(Ssid=33) + bssid(KabiOption<MacAddr>=7) → offset 40 (40%4=0, no pad) +
//   channel(KabiOption<ChannelSpec>=12) → 52 + auth_type(u8=1) → 53 +
//   _pad0([u8;3]=3) → 56 + crypto(CryptoSettings=44) → 100 +
//   ie(KabiArray<u8,512>=514) → 614 + prev_bssid(KabiOption<MacAddr>=7) → 621 +
//   _pad1([u8;3]=3) → 624 + flags(ConnectFlags u32=4) → 628.
//   Total: 628 bytes. Struct align 4.
const_assert!(core::mem::size_of::<ConnectParams>() == 628);
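Likewise, the `KabiOption<MacAddr>=7` and `KabiOption<ChannelSpec>=12` figures follow from a C-compatible option with an explicit validity byte. The definition below is a stand-in for illustration (the real `KabiOption` is defined elsewhere in the spec):

```rust
// Illustrative stand-in for KabiOption<T>: a repr(C) Option with an
// explicit validity flag instead of Rust's niche-optimized layout.
#[repr(C)]
pub struct KabiOption<T> {
    pub valid: u8, // 0 = None, 1 = Some
    pub value: T,  // only meaningful when valid == 1
}

#[repr(transparent)]
pub struct MacAddr(pub [u8; 6]); // 6 bytes, align 1

#[repr(C)]
pub struct ChannelSpec {
    pub center_freq_mhz: u32,
    pub width: u8,
    pub _pad: [u8; 3],
} // 8 bytes, align 4

fn main() {
    use core::mem::size_of;
    // Matches the figures in the ConnectParams layout comment:
    assert_eq!(size_of::<KabiOption<MacAddr>>(), 7);      // 1 + 6, align 1
    assert_eq!(size_of::<KabiOption<ChannelSpec>>(), 12); // 1 + 3 pad + 8, align 4
}
```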

#[repr(u8)]
pub enum AuthType {
    Open      = 0,
    SharedKey = 1,
    Ft        = 2,  // 802.11r Fast BSS Transition
    Sae       = 3,  // WPA3 Simultaneous Authentication of Equals
    Fils      = 4,  // Fast Initial Link Setup (802.11ai)
}

/// Cipher suite and AKM configuration for connect/AP.
/// `#[repr(C)]` because this is embedded in `AssocParams` (KABI struct).
#[repr(C)]
pub struct CryptoSettings {
    /// Pairwise cipher suites (e.g., CCMP-128, GCMP-256).
    pub ciphers_pairwise: KabiArray<CipherSuite, 4>,
    /// Group cipher suite.
    pub cipher_group: CipherSuite,
    /// Authentication and key management suites (e.g., PSK, SAE, 802.1X).
    pub akm_suites: KabiArray<AkmSuite, 4>,
}
// CryptoSettings: repr(C).
//   KabiArray<CipherSuite,4>: [CipherSuite(u32);4](16) + len(u16)(2) = 18, padded to 20 (align 4).
//   CipherSuite(u32): 4 bytes at offset 20.
//   KabiArray<AkmSuite,4>: same as ciphers_pairwise = 20 bytes at offset 24.
//   Total: 44 bytes, align 4.
const_assert!(core::mem::size_of::<CryptoSettings>() == 44);

/// IEEE 802.11 cipher suite selector (OUI + suite type, 4 bytes).
/// Values match those in nl80211.h (e.g., WLAN_CIPHER_SUITE_CCMP = 0x000FAC04).
/// `#[repr(transparent)]` ensures the same layout as `u32` for embedding
/// in KABI `#[repr(C)]` structs (CryptoSettings, KeyParams, KeyInfo).
#[repr(transparent)]
pub struct CipherSuite(pub u32);

/// Authentication and Key Management suite selector.
/// Values match nl80211 (e.g., WLAN_AKM_SUITE_PSK = 0x000FAC02).
/// `#[repr(transparent)]` for the same reason as `CipherSuite`.
#[repr(transparent)]
pub struct AkmSuite(pub u32);
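These selector values pack a 3-byte OUI and a 1-byte suite type into one u32, which is why the nl80211 constants cited above all share the 00-0F-AC prefix. A small sketch (the `suite` helper is illustrative, not part of the contract):

```rust
// IEEE 802.11 OUI used by standard cipher and AKM suites: 00-0F-AC.
const OUI_IEEE_80211: u32 = 0x000FAC;

// Hypothetical helper: compose a suite selector from OUI + suite type.
fn suite(oui: u32, suite_type: u8) -> u32 {
    (oui << 8) | suite_type as u32
}

fn main() {
    assert_eq!(suite(OUI_IEEE_80211, 4), 0x000F_AC04); // CCMP-128
    assert_eq!(suite(OUI_IEEE_80211, 2), 0x000F_AC02); // PSK
    assert_eq!(suite(OUI_IEEE_80211, 8), 0x000F_AC08); // SAE
}
```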

bitflags! {
    pub struct ConnectFlags: u32 {
        /// Prefer using the SME in firmware rather than host.
        const OFFLOAD_AUTH = 1 << 0;
    }
}

/// Wake-on-WLAN pattern for S3/S0ix suspend.
///
/// `#[repr(C)]` for KABI stability — passed to `WirelessDriver::set_wowlan()`.
#[repr(C)]
pub struct WowlanPattern {
    /// Pattern bytes to match against incoming frames.
    /// Maximum length is hardware-dependent; 128 bytes covers all
    /// common patterns (magic packet, EAP identity, ARP).
    pub pattern: KabiArray<u8, 128>,
    /// Bitmask: 1 bit per pattern byte. Bit N=1 means pattern[N] must match.
    /// Length = ceil(pattern.len() / 8).
    pub mask: KabiArray<u8, 16>,
    /// Byte offset in the received packet where matching starts.
    pub pkt_offset: u32,
}
// WowlanPattern: pattern(KabiArray<u8,128>=130) + mask(KabiArray<u8,16>=18) +
//   pkt_offset(u32=4) = 152 bytes.
const_assert!(core::mem::size_of::<WowlanPattern>() == 152);
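The mask semantics above (bit N gates pattern byte N, mask length = ceil(len / 8)) can be sketched with plain slices standing in for the `KabiArray` fields. The `wowlan_matches` function is illustrative, not the firmware's actual matcher:

```rust
// Sketch of the WoWLAN matching rule: bit N of the mask selects whether
// pattern byte N must equal the packet byte at pkt_offset + N.
fn wowlan_matches(pattern: &[u8], mask: &[u8], pkt: &[u8], pkt_offset: usize) -> bool {
    // The mask must cover every pattern bit: ceil(len / 8) bytes.
    if mask.len() < (pattern.len() + 7) / 8 {
        return false;
    }
    pattern.iter().enumerate().all(|(i, &pb)| {
        let bit_set = (mask[i / 8] & (1 << (i % 8))) != 0;
        // Unmasked bytes are "don't care"; masked bytes must match.
        !bit_set || pkt.get(pkt_offset + i) == Some(&pb)
    })
}

fn main() {
    let pattern = [0xAA, 0x00, 0xCC];
    let mask = [0b0000_0101]; // match bytes 0 and 2, ignore byte 1
    let pkt = [0x00, 0xAA, 0xFF, 0xCC];
    assert!(wowlan_matches(&pattern, &mask, &pkt, 1));  // 0xAA/_/0xCC match at +1
    assert!(!wowlan_matches(&pattern, &mask, &pkt, 0)); // 0xAA != 0x00 at +0
}
```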

/// Wireless link statistics, returned by `WirelessDriver::stats()`.
///
/// **Distinction from `WifiStats`** ([Section 13.15](#wifi-driver)): `WirelessStats` is
/// the cfg80211-level per-station statistics (signal, bitrate, packet
/// counters). `WifiStats` in the WiFi driver section is the mac80211-level
/// driver aggregation (per-interface totals, firmware state). The two
/// structs serve different consumers: `WirelessStats` for nl80211 station
/// dump responses, `WifiStats` for ethtool-style driver diagnostics.
/// `#[repr(C)]` for KABI stability — returned by `WirelessDriver::stats()`.
#[repr(C)]
pub struct WirelessStats {
    /// Current signal strength in dBm (e.g., -65).
    pub signal_dbm: i8,
    /// Explicit padding after signal_dbm (align 1) to align tx_bitrate (align 4).
    /// Offset 1, 1 % 4 = 1, need 3 bytes. CLAUDE.md rule 11.
    pub _pad0: [u8; 3],
    /// Current TX bitrate in units of 100 kbps (e.g., 8660 = 866.0 Mbps).
    pub tx_bitrate: u32,
    /// Current RX bitrate in units of 100 kbps.
    pub rx_bitrate: u32,
    /// Explicit padding after rx_bitrate to align tx_packets (align 8).
    /// Offset 12, 12 % 8 = 4, need 4 bytes. CLAUDE.md rule 11.
    pub _pad1: [u8; 4],
    /// Total TX packets since association.
    pub tx_packets: u64,
    /// Total RX packets since association.
    pub rx_packets: u64,
    /// Total TX bytes since association.
    pub tx_bytes: u64,
    /// Total RX bytes since association.
    pub rx_bytes: u64,
    /// TX failures (no ACK received).
    /// u64: kernel-internal counter. At WiFi 7 rates (~46 Gbps, millions of
    /// frames per second), a u32 could wrap within hours under sustained
    /// failure rates. u64 avoids wrap for the 50-year uptime target.
    pub tx_failed: u64,
    /// TX retry count (frames retransmitted).
    pub tx_retries: u64,
    /// Beacon loss events since association.
    /// u64 for consistency; practical values are small (resets on re-association).
    pub beacon_loss_count: u64,
}
// WirelessStats layout (all padding explicit):
//   signal_dbm(i8=1) + _pad0([u8;3]=3) + tx_bitrate(u32=4) + rx_bitrate(u32=4) +
//   _pad1([u8;4]=4) + tx_packets(u64=8) + rx_packets(u64=8) + tx_bytes(u64=8) +
//   rx_bytes(u64=8) + tx_failed(u64=8) + tx_retries(u64=8) + beacon_loss_count(u64=8)
//   = 1+3+4+4+4+8+8+8+8+8+8+8 = 72 bytes. Struct align 8 (from u64). 72 % 8 = 0.
const_assert!(core::mem::size_of::<WirelessStats>() == 72);
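The offsets in the layout comment above can be checked mechanically with `core::mem::offset_of!` (stable since Rust 1.77). This mirrors `WirelessStats` verbatim (it uses only fixed-size fields, so no KABI helper types are needed):

```rust
// Standalone mirror of WirelessStats, used to verify the documented layout.
#[repr(C)]
pub struct WirelessStats {
    pub signal_dbm: i8,
    pub _pad0: [u8; 3],
    pub tx_bitrate: u32,
    pub rx_bitrate: u32,
    pub _pad1: [u8; 4],
    pub tx_packets: u64,
    pub rx_packets: u64,
    pub tx_bytes: u64,
    pub rx_bytes: u64,
    pub tx_failed: u64,
    pub tx_retries: u64,
    pub beacon_loss_count: u64,
}

fn main() {
    use core::mem::size_of;
    // Matches the layout comment: 72 bytes total, align 8.
    assert_eq!(size_of::<WirelessStats>(), 72);
    assert_eq!(core::mem::offset_of!(WirelessStats, tx_bitrate), 4);
    assert_eq!(core::mem::offset_of!(WirelessStats, tx_packets), 16);
    assert_eq!(core::mem::offset_of!(WirelessStats, beacon_loss_count), 64);
}
```

The same technique generalizes to the `KabiArray`/`KabiOption`-bearing structs once those types are in scope, turning every layout comment in this section into a compile-time check.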

/// Key installation parameters for `add_key`.
///
/// `#[repr(C)]` for KABI stability — passed to `WirelessDriver::add_key()`.
#[repr(C)]
pub struct KeyParams {
    /// Key material (16 bytes for CCMP-128, 32 bytes for GCMP-256, etc.).
    pub key: KabiArray<u8, 32>,
    /// Key index (0-3 for group keys, 0 for pairwise).
    pub key_index: u8,
    /// Key type (pairwise, group, or IGTK).
    pub key_type: KeyType,
    /// Cipher suite this key is used with.
    pub cipher: CipherSuite,
    /// Peer MAC address (for pairwise keys). None for group keys.
    pub mac_addr: KabiOption<MacAddr>,
    /// Explicit padding after mac_addr (offset 47) to align seq (KabiArray align 2).
    /// 47 % 2 = 1, need 1 byte. CLAUDE.md rule 11.
    pub _pad0: u8,
    /// Initial TX/RX sequence counter (for CCMP/GCMP replay protection).
    pub seq: KabiArray<u8, 16>,
    /// Explicit tail padding for struct alignment.
    /// Content ends at offset 66. Struct align 4 (from CipherSuite u32).
    /// 66 % 4 = 2, need 2 bytes. CLAUDE.md rule 11.
    pub _pad_tail: [u8; 2],
}
// KeyParams layout (all padding explicit):
//   key(KabiArray<u8,32>=34) + key_index(u8=1) + key_type(u8=1) +
//   cipher(CipherSuite u32=4) → offset 40 + mac_addr(KabiOption<MacAddr>=7) → 47 +
//   _pad0(u8=1) → 48 + seq(KabiArray<u8,16>=18) → 66 + _pad_tail([u8;2]=2) → 68.
//   Total: 68 bytes. Struct align 4.
const_assert!(core::mem::size_of::<KeyParams>() == 68);

#[repr(u8)]
pub enum KeyType {
    /// Pairwise (unicast) key.
    Pairwise = 0,
    /// Group (multicast/broadcast) key.
    Group    = 1,
    /// Integrity Group Temporal Key (management frame protection, 802.11w).
    Igtk     = 2,
    /// Beacon Integrity Group Temporal Key (Wi-Fi 7).
    Bigtk    = 3,
}

/// Key information returned by `get_key`.
///
/// `#[repr(C)]` for KABI stability — returned from `WirelessDriver::get_key()`.
/// Uses `KabiArray<u8, 16>` instead of `ArrayVec<u8, 16>` per CLAUDE.md rule 9.
/// Matches the `KeyParams` input counterpart pattern.
#[repr(C)]
pub struct KeyInfo {
    /// Current TX sequence counter (replay counter).
    pub tx_seq: KabiArray<u8, 16>,          // 18 bytes, offset 0
    /// Current RX sequence counter.
    pub rx_seq: KabiArray<u8, 16>,          // 18 bytes, offset 18
    /// Cipher suite of this key.
    pub cipher: CipherSuite,                // 4 bytes, offset 36
    // Explicit tail padding: content ends at offset 40. Struct align 4
    // (from CipherSuite u32). 40 % 4 = 0. No trailing padding needed.
    // (Regular comment, not `///`: a doc comment before the closing brace
    // attaches to nothing and is a compile error.)
}
// KeyInfo layout (all padding explicit):
//   tx_seq(KabiArray<u8,16>=18) + rx_seq(KabiArray<u8,16>=18) +
//   cipher(CipherSuite u32=4) = 40 bytes. Struct align 4. 40 % 4 = 0.
const_assert!(core::mem::size_of::<KeyInfo>() == 40);

/// Station parameters for AP mode add/change operations.
///
/// `#[repr(C)]` for KABI stability — passed to `WirelessDriver::add_station()`
/// and `WirelessDriver::change_station()`. Uses `KabiArray` and `KabiOption`
/// instead of `ArrayVec` and `Option` per CLAUDE.md rules 9/11.
#[repr(C)]
pub struct StationParams {
    /// Station MAC address.
    pub mac: MacAddr,                                   // 6 bytes, offset 0
    /// Explicit padding: mac ends at offset 6, sta_flags (u32) needs align 4.
    /// 6 % 4 = 2, need 2 bytes. CLAUDE.md rule 11.
    pub _pad0: [u8; 2],                                 // 2 bytes, offset 6
    /// Station capability flags.
    pub sta_flags: StationFlags,                        // 4 bytes, offset 8
    /// Association ID (1-2007).
    pub aid: u16,                                       // 2 bytes, offset 12
    /// Supported rates (each byte = rate in 500 kbps units).
    pub supported_rates: KabiArray<u8, 32>,             // 34 bytes, offset 14
    /// HT capabilities (if the station supports 802.11n).
    pub ht_cap: KabiOption<[u8; 26]>,                   // 27 bytes, offset 48
    /// VHT capabilities (if the station supports 802.11ac).
    /// KabiOption<VhtCapabilities> where VhtCapabilities is [u8; 12] align 1.
    pub vht_cap: KabiOption<[u8; 12]>,                  // 13 bytes, offset 75
    /// HE capabilities (if the station supports 802.11ax).
    /// KabiOption<KabiArray<u8, 54>>: valid(u8=1) + pad(1) + KabiArray<u8,54>(56) = 58.
    pub he_cap: KabiOption<KabiArray<u8, 54>>,          // 58 bytes, offset 88
    /// Explicit tail padding: content ends at offset 146. Struct align = 4
    /// (from sta_flags: u32). 146 % 4 = 2, need 2 bytes. CLAUDE.md rule 11.
    pub _pad_tail: [u8; 2],                             // 2 bytes, offset 146
}
// StationParams layout (all padding explicit):
//   mac([u8;6]=6) + _pad0([u8;2]=2) + sta_flags(u32=4) + aid(u16=2) +
//   supported_rates(KabiArray<u8,32>=34) + ht_cap(KabiOption<[u8;26]>=27) +
//   vht_cap(KabiOption<[u8;12]>=13) + he_cap(KabiOption<KabiArray<u8,54>>=58) +
//   _pad_tail([u8;2]=2) = 148 bytes. Struct align 4. 148 % 4 = 0. No implicit padding.
const_assert!(core::mem::size_of::<StationParams>() == 148);

bitflags! {
    pub struct StationFlags: u32 {
        const AUTHORIZED     = 1 << 0;
        const SHORT_PREAMBLE = 1 << 1;
        const WME            = 1 << 2;
        const MFP            = 1 << 3;  // Management Frame Protection
        const AUTHENTICATED  = 1 << 4;
        const ASSOCIATED     = 1 << 5;
    }
}

/// Per-station information returned by `get_station` / `dump_station`.
///
/// `#[repr(C)]` for KABI stability — returned from `WirelessDriver::get_station()`
/// and passed to `WirelessDriver::dump_station()` callback.
/// Uses `KabiOption<i8>` instead of `Option<i8>` per CLAUDE.md rule 9.
#[repr(C)]
pub struct StationInfo {
    /// Station MAC address.
    pub mac: MacAddr,                       // 6 bytes, offset 0
    /// Signal strength (dBm). valid=0 if not associated.
    pub signal_dbm: KabiOption<i8>,         // 2 bytes, offset 6
    /// TX/RX packet and byte counters.
    pub tx_packets: u64,                    // 8 bytes, offset 8
    pub rx_packets: u64,                    // 8 bytes, offset 16
    pub tx_bytes: u64,                      // 8 bytes, offset 24
    pub rx_bytes: u64,                      // 8 bytes, offset 32
    /// Current TX bitrate (100 kbps units).
    pub tx_bitrate: u32,                    // 4 bytes, offset 40
    /// Current RX bitrate (100 kbps units).
    pub rx_bitrate: u32,                    // 4 bytes, offset 44
    /// Time since last activity (milliseconds).
    pub inactive_time_ms: u32,              // 4 bytes, offset 48
    /// Connected time (seconds). u32 wraps only after ~136 years, which
    /// exceeds the 50-year uptime target, and matches the nl80211 ABI
    /// (NL80211_STA_INFO_CONNECTED_TIME uses __u32).
    pub connected_time_secs: u32,           // 4 bytes, offset 52
}
// StationInfo layout (all padding explicit):
//   mac([u8;6]=6) + signal_dbm(KabiOption<i8>=2) + tx_packets(u64=8) +
//   rx_packets(u64=8) + tx_bytes(u64=8) + rx_bytes(u64=8) +
//   tx_bitrate(u32=4) + rx_bitrate(u32=4) + inactive_time_ms(u32=4) +
//   connected_time_secs(u32=4) = 56 bytes. Struct align 8 (from u64 fields).
//   56 % 8 = 0. No implicit padding.
const_assert!(core::mem::size_of::<StationInfo>() == 56);

/// PMKSA cache entry for 802.11r/SAE fast roaming.
///
/// `#[repr(C)]` for KABI stability — passed to `WirelessDriver::set_pmksa()`.
/// Uses `KabiArray<u8, 48>` instead of `ArrayVec<u8, 48>` per CLAUDE.md rule 9.
#[repr(C)]
pub struct PmksaEntry {
    /// BSSID of the AP this PMKSA is associated with.
    pub bssid: MacAddr,                     // 6 bytes, offset 0
    /// PMKID (16 bytes, derived from PMK per IEEE 802.11-2020 §13.7.1.3).
    pub pmkid: [u8; 16],                    // 16 bytes, offset 6
    /// PMK (Pairwise Master Key, 32 bytes for WPA2, 48 bytes for WPA3-SAE).
    pub pmk: KabiArray<u8, 48>,             // 50 bytes, offset 22
}
// PmksaEntry layout (all padding explicit):
//   bssid([u8;6]=6) + pmkid([u8;16]=16) + pmk(KabiArray<u8,48>=50) = 72 bytes.
//   All fields are byte-aligned (align 1 or 2 from KabiArray.len u16).
//   Struct align 2 (from KabiArray<u8,48>.len: u16). 72 % 2 = 0. No implicit padding.
const_assert!(core::mem::size_of::<PmksaEntry>() == 72);

13.2.1.1 BSS Entry — Scan Result Descriptor

/// A BSS (Basic Service Set) entry representing a single AP discovered
/// during scanning. Returned via `get_scan_results()`.
///
/// Fixed-size (no heap allocation) for use in ring buffers and scan result
/// caches. The `ies` field carries the raw Information Elements from the
/// beacon/probe response frame — wpa_supplicant parses these for RSN, WPA,
/// WPS, Interworking, and vendor-specific capabilities.
#[repr(C)]
pub struct BssEntry {
    /// BSSID (MAC address of the AP).
    pub bssid: MacAddr,               // 6 bytes, offset 0
    /// Network SSID.
    pub ssid: Ssid,                    // 33 bytes, offset 6
    /// Padding: align ChannelSpec (alignment 4) after offset 39.
    pub _pad0: [u8; 1],               // offset 39, 1 byte
    /// Channel specification (center frequency + width).
    pub channel: ChannelSpec,          // 8 bytes, offset 40
    /// Beacon interval in TU (Time Units, 1 TU = 1024 microseconds).
    /// Typical: 100 TU (~102.4 ms). Range: 1-65535 TU.
    pub beacon_interval_tu: u16,       // offset 48
    /// Capability information field from the beacon/probe response
    /// (IEEE 802.11-2020 §10.3.1.4). Bitmask: bit 0 = ESS, bit 1 = IBSS,
    /// bit 4 = privacy (encryption required), bit 5 = short preamble, etc.
    pub capability_info: u16,          // offset 50
    /// Signal strength in dBm (typically -100 to 0). More negative = weaker.
    pub signal_dbm: i8,                // offset 52
    /// Padding: align last_seen_ns (u64, alignment 8) after offset 53.
    pub _pad1: [u8; 3],               // offset 53, 3 bytes
    /// Timestamp of the last beacon or probe response from this AP.
    /// Nanoseconds since boot (monotonic clock).
    pub last_seen_ns: u64,             // offset 56
    /// BSS load element (IEEE 802.11-2020 §10.3.14): indicates AP congestion.
    /// `station_count`: number of associated stations. `channel_utilization`:
    /// 0-255 (255 = fully utilized). Use `bss_load_valid` to check presence.
    pub bss_load: BssLoad,             // 6 bytes, offset 64
    /// 0 = BSS Load absent (AP does not advertise), 1 = present.
    pub bss_load_valid: u8,            // offset 70
    /// Raw Information Elements from the beacon/probe response frame body.
    /// Contains RSN IE, WPA IE, HT/VHT/HE capabilities IEs, vendor IEs, etc.
    /// wpa_supplicant parses these to determine security type, roaming support
    /// (802.11r), and Hotspot 2.0 capabilities.
    /// 768 bytes covers all IEs observed in production APs. Longer IE sets
    /// are truncated; `ies_len` gives the valid byte count.
    pub ies: [u8; 768],               // offset 71
    /// Padding: align ies_len (u16, alignment 2) after offset 839.
    pub _pad2: [u8; 1],               // offset 839, 1 byte
    /// Valid byte count within `ies`. May be less than 768.
    pub ies_len: u16,                  // offset 840
    /// Padding: struct alignment is 8 (from last_seen_ns u64).
    /// 842 % 8 = 2, so 6 bytes final padding.
    pub _pad3: [u8; 6],               // offset 842, 6 bytes
}
// Size: 6+33+1+8+2+2+1+3+8+6+1+768+1+2+6 = 848 bytes.
const_assert!(core::mem::size_of::<BssEntry>() == 848);
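The `ies` field is a flat TLV sequence ([element ID][length][payload] per IE). A minimal walker over `ies[..ies_len]` might look like the following sketch — `for_each_ie` is illustrative, not a spec interface:

```rust
/// Illustrative IE walker (not spec code): invoke `f(element_id, payload)`
/// for each well-formed Information Element in the buffer. A truncated
/// trailing element terminates the walk instead of reading past the buffer.
pub fn for_each_ie(ies: &[u8], mut f: impl FnMut(u8, &[u8])) {
    let mut i = 0;
    while i + 2 <= ies.len() {
        let (id, len) = (ies[i], ies[i + 1] as usize);
        if i + 2 + len > ies.len() {
            break; // truncated IE: stop rather than overrun
        }
        f(id, &ies[i + 2..i + 2 + len]);
        i += 2 + len;
    }
}
```

wpa_supplicant performs an equivalent walk in userspace to locate the RSN, WPA, and vendor IEs within the same buffer.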

/// BSS Load element from the beacon frame (IEEE 802.11-2020 §10.3.14).
#[repr(C)]
pub struct BssLoad {
    /// Number of associated stations.
    pub station_count: u16,                // offset 0
    /// Channel utilization (0-255, 255 = fully utilized).
    pub channel_utilization: u8,           // offset 2
    /// Padding: align available_admission_capacity (u16) to 2-byte boundary.
    pub _pad: u8,                          // offset 3
    /// Available admission capacity in units of 32 microseconds per second.
    pub available_admission_capacity: u16, // offset 4
}
// Size: 2+1+1+2 = 6 bytes, alignment 2.
const_assert!(core::mem::size_of::<BssLoad>() == 6);
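The on-air BSS Load element body is 5 bytes (no padding); the struct above inserts one pad byte for alignment. A hedged decode sketch, assuming little-endian multi-byte fields per IEEE 802.11 convention (the `parse_bss_load` helper and the duplicated struct are illustrative, not spec interfaces):

```rust
/// Padded in-memory form, duplicated from the spec definition above so this
/// sketch is self-contained.
#[repr(C)]
pub struct BssLoad {
    pub station_count: u16,
    pub channel_utilization: u8,
    pub _pad: u8,
    pub available_admission_capacity: u16,
}

/// Decode the 5-byte wire body of a BSS Load element.
/// Returns `None` for a malformed (wrong-length) element.
pub fn parse_bss_load(body: &[u8]) -> Option<BssLoad> {
    if body.len() != 5 {
        return None;
    }
    Some(BssLoad {
        station_count: u16::from_le_bytes([body[0], body[1]]),
        channel_utilization: body[2],
        _pad: 0, // explicit padding is always zeroed before crossing KABI
        available_admission_capacity: u16::from_le_bytes([body[3], body[4]]),
    })
}
```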

13.2.1.2 KABI-Safe Wrapper Types

The following wrapper types provide #[repr(C)]-safe alternatives to Option<T> and ArrayVec<T, N>, which have no stable C representation and must never appear in KABI structs that cross isolation domain boundaries. They are defined here at their first point of use in the wireless subsystem; the canonical definition belongs in umka-driver-sdk (the shared KABI types module), and every KABI struct across the spec uses these wrappers in place of Option<T> or ArrayVec.

/// KABI-safe optional value. Replaces `Option<T>` in `#[repr(C)]` structs
/// that cross KABI/wire boundaries.
///
/// `Option<T>` has no stable C representation (Rust's niche optimization
/// produces layout that depends on `T`'s internal structure). This wrapper
/// uses an explicit `valid` discriminant.
///
/// `T` must be `Copy` to allow safe bitwise read of the value field even
/// when `valid == 0` (the value is meaningless but reading it is not UB).
// Canonical definition: umka-driver-sdk shared KABI types module (see note above).
#[repr(C)]
pub struct KabiOption<T: Copy> {
    /// 0 = absent (value field is meaningless), 1 = present.
    pub valid: u8,
    /// The wrapped value. Only meaningful when `valid == 1`.
    /// When `valid == 0`, this field is zero-initialized.
    pub value: T,
}

impl<T: Copy> KabiOption<T> {
    pub const fn none() -> Self {
        // SAFETY: `valid == 0`, so `value` is never exposed through the
        // public API. Zeroing also covers any padding bytes, so no
        // uninitialized memory can leak across the KABI boundary. Note that
        // `T: Copy` does NOT by itself make all-zero a valid bit pattern
        // (references are Copy); KABI structs restrict `T` to plain-old-data
        // types for which zero-initialization is valid.
        unsafe { core::mem::zeroed() }
    }
    pub const fn some(val: T) -> Self {
        // SAFETY: zeroed() initializes all bytes (including padding between
        // `valid` and `value` due to T's alignment) to 0. Then we overwrite
        // the meaningful fields. This prevents leaking uninitialized kernel
        // memory across KABI boundaries when T has alignment > 1.
        // For example, KabiOption<VhtCapabilities> has 3 bytes of repr(C)
        // padding between valid (u8, offset 0) and cap_info (u32, offset 4).
        // Without zeroing, those 3 bytes would contain stack garbage.
        let mut s: Self = unsafe { core::mem::zeroed() };
        s.valid = 1;
        s.value = val;
        s
    }
    pub fn as_option(&self) -> Option<&T> {
        if self.valid != 0 { Some(&self.value) } else { None }
    }
    pub fn is_some(&self) -> bool { self.valid != 0 }
    pub fn is_none(&self) -> bool { self.valid == 0 }
}
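The niche-optimization hazard described above can be observed directly. This standalone check (demonstration only, not spec code) shows that `Option<T>`'s size depends on `T`'s internal structure, which is exactly why it cannot appear in a `#[repr(C)]` KABI struct:

```rust
// Demonstration: Option<T>'s layout varies with T's niches, so it has no
// single stable C representation. KabiOption<T>'s explicit `valid: u8`
// discriminant avoids this.
use core::mem::size_of;
use core::num::NonZeroU32;

/// Sizes of three `Option` instantiations.
pub fn option_sizes() -> (usize, usize, usize) {
    (
        size_of::<Option<i8>>(),         // no niche: separate discriminant byte
        size_of::<Option<NonZeroU32>>(), // zero is the niche: same size as u32
        size_of::<Option<&u8>>(),        // null is the niche: pointer-sized
    )
}
```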

/// KABI-safe fixed-capacity array. Replaces `ArrayVec<T, N>` in `#[repr(C)]`
/// structs that cross KABI/wire boundaries.
///
/// `ArrayVec` has no stable C representation. This wrapper uses a fixed
/// `[T; N]` array with an explicit `len` field. `N` must fit in `u16`
/// (max 65535 elements); a compile-time `assert!` in the impl block enforces this.
///
/// `data` is not public: elements beyond `len` are zero-initialized but may
/// not satisfy `T`'s validity invariants for all types. Access only through
/// `as_slice()` which returns `&data[..len]`.
// Canonical definition: umka-driver-sdk shared KABI types module (see note above).
#[repr(C)]
pub struct KabiArray<T: Copy, const N: usize> {
    /// Fixed-size storage. Valid elements are `data[..len]`.
    /// Elements beyond `len` are zero-initialized.
    /// Not public: prevents reading zero-initialized elements that may
    /// violate T's validity invariants (e.g., NonZeroU32).
    data: [T; N],
    /// Number of valid elements (0..=N). Validated on construction.
    pub len: u16,
}

impl<T: Copy, const N: usize> KabiArray<T, N> {
    // Compile-time guard: N must fit in u16 (the len field's type).
    const _N_FITS_U16: () = assert!(N <= u16::MAX as usize,
        "KabiArray<T, N>: N exceeds u16::MAX (65535)");

    /// Create an empty array.
    pub fn new() -> Self {
        // Trigger the const assertion.
        let _ = Self::_N_FITS_U16;
        // SAFETY: T: Copy guarantees no drop glue. All bytes are zeroed,
        // producing valid zero-initialized T values. The `len` field is 0,
        // so no element is considered "valid" — as_slice() returns an empty
        // slice. The zeroed data is never exposed through the public API
        // unless explicitly added via push/set operations (not shown here;
        // the driver SDK provides them).
        Self {
            data: unsafe { core::mem::zeroed() },
            len: 0,
        }
    }
    /// Number of valid elements.
    pub fn len(&self) -> usize { self.len as usize }
    /// Whether the array is empty.
    pub fn is_empty(&self) -> bool { self.len == 0 }
    /// Slice of valid elements.
    pub fn as_slice(&self) -> &[T] { &self.data[..self.len as usize] }
}
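The driver SDK's mutation API is not shown in this section; a hypothetical `from_slice` constructor might look like the following condensed sketch. `T: Default` stands in here for the SDK's zero-initialization so the sketch stays in safe Rust; the struct is duplicated from the definition above for self-containment:

```rust
/// Condensed KabiArray (duplicated from the spec definition above).
#[repr(C)]
pub struct KabiArray<T: Copy, const N: usize> {
    data: [T; N],
    pub len: u16,
}

impl<T: Copy + Default, const N: usize> KabiArray<T, N> {
    /// Hypothetical constructor: copy up to N elements from `src`.
    /// Returns `None` if `src` does not fit (or N exceeds the len field).
    pub fn from_slice(src: &[T]) -> Option<Self> {
        if src.len() > N || N > u16::MAX as usize {
            return None;
        }
        // Trailing slots get a zero-like fill; they are never exposed
        // because as_slice() stops at `len`.
        let mut data = [T::default(); N];
        data[..src.len()].copy_from_slice(src);
        Some(Self { data, len: src.len() as u16 })
    }

    /// Slice of valid elements, mirroring the spec's `as_slice()`.
    pub fn as_slice(&self) -> &[T] {
        &self.data[..self.len as usize]
    }
}
```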

// const_asserts for KabiOption/KabiArray instantiations used in this file.
// Generic wrappers: size depends on T's size and alignment.
//
// KabiOption<T>: valid(u8) + padding-to-align(T) + T.
// KabiArray<T, N>: [T; N] + len(u16) + padding-to-struct-align.

// KabiOption instantiations:
// KabiOption<MacAddr>: u8(1) + [u8;6](6) = 7, align 1. No padding.
const_assert!(core::mem::size_of::<KabiOption<MacAddr>>() == 7);
// KabiOption<HtCapabilities>: u8(1) + HtCap(26, packed, align 1) = 27, align 1.
const_assert!(core::mem::size_of::<KabiOption<HtCapabilities>>() == 27);
// KabiOption<VhtCapabilities>: u8(1) + 3(pad) + VhtCap(12, align 4) = 16, align 4.
const_assert!(core::mem::size_of::<KabiOption<VhtCapabilities>>() == 16);
// KabiOption<HeCapabilities>: u8(1) + HeCap(56, align 1) = 57, align 1.
const_assert!(core::mem::size_of::<KabiOption<HeCapabilities>>() == 57);
// KabiOption<i8>: u8(1) + i8(1) = 2, align 1. No padding.
const_assert!(core::mem::size_of::<KabiOption<i8>>() == 2);
// KabiOption<[u8; 26]>: u8(1) + [u8;26](26) = 27, align 1. No padding.
const_assert!(core::mem::size_of::<KabiOption<[u8; 26]>>() == 27);
// KabiOption<[u8; 12]>: u8(1) + [u8;12](12) = 13, align 1. No padding.
const_assert!(core::mem::size_of::<KabiOption<[u8; 12]>>() == 13);
// KabiOption<KabiArray<u8, 54>>: valid(u8=1) + pad(1, for KabiArray align 2) +
//   KabiArray<u8,54>(56) = 58. Align 2. 58 % 2 = 0.
const_assert!(core::mem::size_of::<KabiOption<KabiArray<u8, 54>>>() == 58);

// KabiArray instantiations:
// KabiArray<u8, 512>: [u8;512](512) + len(u16)(2) = 514, align 2.
const_assert!(core::mem::size_of::<KabiArray<u8, 512>>() == 514);
// KabiArray<u8, 256>: [u8;256](256) + len(u16)(2) = 258, align 2.
const_assert!(core::mem::size_of::<KabiArray<u8, 256>>() == 258);
// KabiArray<CipherSuite, 4>: [CipherSuite(u32);4](16) + len(u16)(2) + 2(pad) = 20, align 4.
const_assert!(core::mem::size_of::<KabiArray<CipherSuite, 4>>() == 20);
// KabiArray<AkmSuite, 4>: same layout as CipherSuite variant = 20, align 4.
const_assert!(core::mem::size_of::<KabiArray<AkmSuite, 4>>() == 20);
// KabiArray<u8, 32>: [u8;32](32) + len(u16)(2) = 34, align 2.
const_assert!(core::mem::size_of::<KabiArray<u8, 32>>() == 34);
// KabiArray<u8, 16>: [u8;16](16) + len(u16)(2) = 18, align 2.
const_assert!(core::mem::size_of::<KabiArray<u8, 16>>() == 18);
// KabiArray<u8, 48>: [u8;48](48) + len(u16)(2) = 50, align 2.
const_assert!(core::mem::size_of::<KabiArray<u8, 48>>() == 50);
// KabiArray<u8, 54>: [u8;54](54) + len(u16)(2) = 56, align 2.
const_assert!(core::mem::size_of::<KabiArray<u8, 54>>() == 56);

13.2.1.3 Authentication and Association Parameters

/// 802.11 authentication frame parameters. Used by `authenticate()` for
/// SME-in-userspace mode (`NL80211_CMD_AUTHENTICATE`). wpa_supplicant drives
/// the authentication state machine and sends individual auth/assoc frames.
#[repr(C)]
pub struct AuthParams {
    /// BSSID of the target AP.
    pub bssid: MacAddr,                    // offset 0, 6 bytes
    /// Authentication algorithm.
    pub auth_type: AuthType,               // offset 6, 1 byte
    /// Explicit padding: KabiArray<u8,512> has align 2 (from len: u16).
    /// offset 7 → next 2-aligned = 8.
    pub _pad: u8,                          // offset 7, 1 byte
    /// SAE (WPA3) commit/confirm element. Empty for Open/SharedKey/FT.
    /// Carries the SAE authentication frame body (scalar, element, confirm).
    /// Maximum SAE frame body is 512 bytes (group 19/20/21).
    pub sae_data: KabiArray<u8, 512>,      // offset 8, 514 bytes
    /// Extra IEs to include in the authentication frame.
    pub ie: KabiArray<u8, 256>,            // offset 522, 258 bytes
}
// AuthParams: 6+1+1+514+258 = 780 bytes, align 2.
const_assert!(core::mem::size_of::<AuthParams>() == 780);

/// 802.11 association request parameters. Used by `associate()` for
/// SME-in-userspace mode (`NL80211_CMD_ASSOCIATE`). Sent after successful
/// authentication.
#[repr(C)]
pub struct AssocParams {
    /// BSSID of the target AP (must match the AP from `authenticate()`).
    pub bssid: MacAddr,                            // offset 0, 6 bytes, align 1
    /// Network SSID (must match the AP's SSID).
    pub ssid: Ssid,                                // offset 6, 33 bytes, align 1
    /// Previous BSSID for reassociation (roaming). Absent for initial association.
    /// KabiOption<MacAddr>: valid(1) + MacAddr(6) = 7 bytes, align 1.
    pub prev_bssid: KabiOption<MacAddr>,           // offset 39, 7 bytes
    /// HT (802.11n) capabilities to advertise. Absent if not supported.
    /// KabiOption<HtCapabilities>: valid(1) + HtCap(26,packed) = 27 bytes, align 1.
    pub ht_cap: KabiOption<HtCapabilities>,        // offset 46, 27 bytes
    /// Explicit padding: VhtCapabilities has align 4 (from cap_info: u32), so
    /// KabiOption<VhtCapabilities> has align 4. Offset 73 → next 4-aligned = 76.
    pub _pad1: [u8; 3],                            // offset 73, 3 bytes
    /// VHT (802.11ac) capabilities to advertise. Absent if not supported.
    /// KabiOption<VhtCapabilities>: valid(1) + 3(pad) + VhtCap(12) = 16 bytes, align 4.
    pub vht_cap: KabiOption<VhtCapabilities>,      // offset 76, 16 bytes
    /// HE (802.11ax) capabilities to advertise. Absent if not supported.
    /// KabiOption<HeCapabilities>: valid(1) + HeCap(56) = 57 bytes, align 1.
    pub he_cap: KabiOption<HeCapabilities>,        // offset 92, 57 bytes
    /// Explicit padding: CryptoSettings has align 4 (from CipherSuite(u32)).
    /// Offset 149 → next 4-aligned = 152.
    pub _pad2: [u8; 3],                            // offset 149, 3 bytes
    /// Cipher and AKM suite selection.
    pub crypto: CryptoSettings,                    // offset 152, 44 bytes
    /// Extra IEs to include in the association request frame (RSN IE,
    /// vendor-specific IEs, Interworking IEs).
    pub ie: KabiArray<u8, 512>,                    // offset 196, 514 bytes
    /// Explicit padding: struct align = 4 (from VhtCapabilities). 710 → 712.
    pub _pad3: [u8; 2],                            // offset 710, 2 bytes
}
// AssocParams: 6+33+7+27+3+16+57+3+44+514+2 = 712 bytes, align 4.
const_assert!(core::mem::size_of::<AssocParams>() == 712);

13.2.1.4 HT/VHT/HE Capabilities

/// HT (High Throughput, 802.11n) capabilities. Wire format matches
/// IEEE 802.11-2020 §10.6.5.1 (26 bytes).
///
/// wpa_supplicant sends these as NL80211_ATTR_HT_CAPABILITY (26-byte blob).
/// UmkaOS stores them structured for validation and driver use.
/// Packed to match the 26-byte IEEE 802.11 wire format exactly.
#[repr(C, packed)]
pub struct HtCapabilities {
    /// HT Capability Information field (2 bytes).
    /// Bit 0: LDPC coding capability. Bit 1: supported channel width set
    /// (0 = 20 MHz only, 1 = 20/40 MHz). Bits 2-3: SM Power Save mode.
    /// Bit 5: HT-Greenfield. Bit 6: short GI for 20 MHz.
    /// Bit 7: short GI for 40 MHz. Bits 8-9: TX STBC. Bits 10-11: RX STBC.
    pub cap_info: u16,
    /// A-MPDU Parameters (1 byte). Bits 0-1: maximum A-MPDU length exponent
    /// (0-3 → 8191/16383/32767/65535 bytes). Bits 2-4: minimum MPDU start spacing.
    pub ampdu_params: u8,
    /// Supported MCS Set (16 bytes). Bitmask of supported MCS indices (0-76).
    /// Bytes 0-9: RX MCS bitmask. Bytes 10-11: RX highest supported data rate.
    /// Bytes 12-15: TX MCS set defined, TX/RX MCS not equal, etc.
    pub mcs_set: [u8; 16],
    /// HT Extended Capabilities (2 bytes).
    pub ext_cap: u16,
    /// Transmit Beamforming Capabilities (4 bytes).
    pub txbf_cap: u32,
    /// ASEL (Antenna Selection) Capability (1 byte).
    pub asel_cap: u8,
}
// HtCapabilities: packed, no padding. 2+1+16+2+4+1 = 26 bytes.
const_assert!(core::mem::size_of::<HtCapabilities>() == 26);
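Because the packed struct matches the 26-byte wire format exactly, decoding the nl80211 blob is a straight field-by-field copy. A hedged sketch (not the spec's parser; the struct is duplicated for self-containment), assuming little-endian multi-byte fields per IEEE 802.11 convention:

```rust
/// Duplicated from the spec definition above.
#[repr(C, packed)]
pub struct HtCapabilities {
    pub cap_info: u16,
    pub ampdu_params: u8,
    pub mcs_set: [u8; 16],
    pub ext_cap: u16,
    pub txbf_cap: u32,
    pub asel_cap: u8,
}

/// Decode the 26-byte NL80211_ATTR_HT_CAPABILITY blob.
pub fn parse_ht_capability(blob: &[u8; 26]) -> HtCapabilities {
    let mut mcs_set = [0u8; 16];
    mcs_set.copy_from_slice(&blob[3..19]);
    HtCapabilities {
        cap_info: u16::from_le_bytes([blob[0], blob[1]]),
        ampdu_params: blob[2],
        mcs_set,
        ext_cap: u16::from_le_bytes([blob[19], blob[20]]),
        txbf_cap: u32::from_le_bytes([blob[21], blob[22], blob[23], blob[24]]),
        asel_cap: blob[25],
    }
}
```

Note that multi-byte fields of a packed struct must be copied out (not referenced) before use, since they may be unaligned.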

/// VHT (Very High Throughput, 802.11ac) capabilities. Wire format matches
/// IEEE 802.11-2020 §10.39.2 (12 bytes).
///
/// wpa_supplicant sends these as NL80211_ATTR_VHT_CAPABILITY (12-byte blob).
#[repr(C)]
pub struct VhtCapabilities {
    /// VHT Capability Information field (4 bytes).
    /// Bits 0-1: max MPDU length (0=3895, 1=7991, 2=11454).
    /// Bits 2-3: supported channel width set (0=80MHz, 1=160MHz, 2=80+80MHz).
    /// Bit 4: RX LDPC. Bit 5: short GI for 80 MHz. Bit 6: short GI for 160/80+80.
    /// Bits 7-9: RX STBC. Bit 11: SU beamformer capable.
    /// Bit 12: SU beamformee capable. Bits 13-15: beamformee STS capability.
    /// Bits 16-18: sounding dimensions. Bit 19: MU beamformer capable.
    pub cap_info: u32,
    /// Supported VHT-MCS and NSS Set (8 bytes).
    /// Bytes 0-1: RX VHT-MCS Map (2 bits per spatial stream, 1-8 SS).
    /// Bytes 2-3: RX Highest Supported Long GI Data Rate.
    /// Bytes 4-5: TX VHT-MCS Map.
    /// Bytes 6-7: TX Highest Supported Long GI Data Rate.
    pub mcs_nss: [u8; 8],
}
// VhtCapabilities: repr(C). u32(4) + [u8;8](8) = 12 bytes, align 4.
const_assert!(core::mem::size_of::<VhtCapabilities>() == 12);

/// HE (High Efficiency, 802.11ax) capabilities. Wire format matches
/// IEEE 802.11ax-2021 §10.82.3. Variable length: 21-54 bytes (6 MAC +
/// 11 PHY + 4/8/12 MCS-NSS + 0-25 PPE) depending on the number of
/// supported spatial streams and optional fields.
///
/// wpa_supplicant sends these as NL80211_ATTR_HE_CAPABILITY (variable blob).
/// Maximum 54 bytes covers all production AP configurations.
#[repr(C)]
pub struct HeCapabilities {
    /// HE MAC Capabilities Information (6 bytes).
    /// Encodes HTC-HE support, TWT requester/responder, fragmentation level,
    /// max number of fragmented MSDUs, trigger frame MAC padding duration,
    /// multi-TID aggregation support, and QoS data + A-MSDU.
    pub mac_cap: [u8; 6],
    /// HE PHY Capabilities Information (11 bytes).
    /// Encodes channel width set (20/40/80/160/80+80), preamble puncturing,
    /// device class, LDPC coding, HE-SU PPDU with 1x HE-LTF + 0.8us GI,
    /// midamble Rx Max NSTS, NDP with 4x HE-LTF + 3.2us GI,
    /// STBC Tx/Rx, Doppler Tx/Rx, UL MU-MIMO, DCM max constellation,
    /// and beamforming capabilities.
    pub phy_cap: [u8; 11],
    /// Supported HE-MCS and NSS Set (variable: 4, 8, or 12 bytes).
    /// Contains Rx/Tx HE-MCS Map for each supported channel width.
    /// `mcs_nss_len` gives the valid byte count.
    pub mcs_nss: [u8; 12],
    /// Valid byte count within `mcs_nss` (4, 8, or 12).
    pub mcs_nss_len: u8,
    /// PPE Thresholds (variable, 0-25 bytes). Optional field present when
    /// bit 27 of phy_cap is set. Contains per-NSS per-RU PPE thresholds.
    pub ppe_thresholds: [u8; 25],
    /// Valid byte count within `ppe_thresholds`.
    pub ppe_thresholds_len: u8,
}
// HeCapabilities: repr(C), all fields are u8/[u8;N] → align 1, no padding.
// 6+11+12+1+25+1 = 56 bytes.
const_assert!(core::mem::size_of::<HeCapabilities>() == 56);
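Re-packing the structured form into the variable-length wire layout concatenates the fixed MAC/PHY fields with only the valid prefixes of the two variable fields. A hedged sketch (`serialize_he_caps` is illustrative, not a spec interface; the struct is duplicated for self-containment):

```rust
/// Duplicated from the spec definition above.
pub struct HeCapabilities {
    pub mac_cap: [u8; 6],
    pub phy_cap: [u8; 11],
    pub mcs_nss: [u8; 12],
    pub mcs_nss_len: u8,
    pub ppe_thresholds: [u8; 25],
    pub ppe_thresholds_len: u8,
}

/// Pack into the wire form; returns the number of valid bytes written
/// (6 + 11 + mcs_nss_len + ppe_thresholds_len, at most 54).
pub fn serialize_he_caps(caps: &HeCapabilities, out: &mut [u8; 54]) -> usize {
    let mut n = 0;
    out[n..n + 6].copy_from_slice(&caps.mac_cap);
    n += 6;
    out[n..n + 11].copy_from_slice(&caps.phy_cap);
    n += 11;
    let m = caps.mcs_nss_len as usize;
    out[n..n + m].copy_from_slice(&caps.mcs_nss[..m]);
    n += m;
    let p = caps.ppe_thresholds_len as usize;
    out[n..n + p].copy_from_slice(&caps.ppe_thresholds[..p]);
    n += p;
    n
}
```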

13.2.1.5 AP Mode and BSS Parameters

/// Access point mode configuration. Used by `start_ap()` (NL80211_CMD_START_AP).
/// Defines the AP's network identity, channel, security, and beacon parameters.
#[repr(C)]
pub struct ApParams {
    /// Network SSID.
    pub ssid: Ssid,                            // offset 0, 33 bytes, align 1
    /// Whether the SSID is hidden (not broadcast in beacons). Hidden SSIDs
    /// are only revealed in probe responses to targeted probe requests.
    pub hidden_ssid: u8,                       // offset 33, 1 byte. 0 = visible, 1 = hidden.
    /// Explicit padding: ChannelSpec has align 4 (from center_freq_mhz: u32).
    /// Offset 34 → next 4-aligned = 36.
    pub _pad1: [u8; 2],                        // offset 34, 2 bytes
    /// Operating channel.
    pub channel: ChannelSpec,                  // offset 36, 8 bytes, align 4
    /// Beacon interval in TU (1024 microseconds). Default: 100 TU.
    pub beacon_interval_tu: u16,               // offset 44, 2 bytes
    /// DTIM period: number of beacons between DTIM (Delivery Traffic Indication
    /// Map) frames. Default: 2. Higher values improve client power savings
    /// at the cost of multicast latency.
    pub dtim_period: u8,                       // offset 46, 1 byte
    /// Authentication type the AP will accept from clients.
    pub auth_type: AuthType,                   // offset 47, 1 byte
    /// Cipher and AKM suite configuration for the AP.
    pub crypto: CryptoSettings,                // offset 48, 44 bytes, align 4
    /// Beacon head (before TIM IE): includes fixed fields (timestamp, beacon
    /// interval, capability info) and IEs that precede TIM. The kernel
    /// constructs the beacon template from this head + TIM + beacon tail.
    pub beacon_head: KabiArray<u8, 256>,       // offset 92, 258 bytes, align 2
    /// Beacon tail (after TIM IE): IEs that follow TIM (RSN IE, HT/VHT/HE
    /// Operation IEs, vendor-specific IEs). Updated on `set_beacon()`.
    pub beacon_tail: KabiArray<u8, 512>,       // offset 350, 514 bytes, align 2
    /// Probe response template IEs. Sent in response to probe requests.
    pub probe_resp_ies: KabiArray<u8, 512>,    // offset 864, 514 bytes, align 2
    /// Explicit padding: struct align = 4 (from ChannelSpec/CryptoSettings).
    /// Offset 1378 → next 4-aligned = 1380.
    pub _pad2: [u8; 2],                        // offset 1378, 2 bytes
}
// ApParams: 33+1+2+8+2+1+1+44+258+514+514+2 = 1380 bytes, align 4.
const_assert!(core::mem::size_of::<ApParams>() == 1380);

bitflags! {
    /// Bitmask indicating which fields in `BssParams` contain valid values.
    /// Fields whose corresponding bit is NOT set are ignored (no change).
    /// This replaces sentinel-value checking (0xFF / 0xFFFF) with type-safe
    /// presence tracking, preventing subtle bugs from accidentally treating
    /// a valid value as "no change" or vice versa.
    pub struct BssParamsFlags: u16 {
        /// `cts_protection` field is valid.
        const CTS_PROTECTION = 1 << 0;
        /// `short_preamble` field is valid.
        const SHORT_PREAMBLE = 1 << 1;
        /// `short_slot_time` field is valid.
        const SHORT_SLOT_TIME = 1 << 2;
        /// `ht_opmode` field is valid.
        const HT_OPMODE = 1 << 3;
        /// `ap_isolate` field is valid.
        const AP_ISOLATE = 1 << 4;
    }
}

/// BSS parameter update for an existing AP. Used by `set_bss_params()`
/// (NL80211_CMD_SET_BSS). Allows tuning AP parameters without a full restart.
///
/// Only fields whose corresponding bit is set in `valid_fields` are applied.
/// All other fields are ignored regardless of their value. This replaces the
/// previous sentinel-value convention (0xFF = no change) with explicit presence
/// tracking via `BssParamsFlags`.
#[repr(C)]
pub struct BssParams {
    /// Bitmask of which fields below are valid. Fields not marked in this
    /// bitmask are ignored by the driver.
    pub valid_fields: BssParamsFlags,
    /// Use CTS-to-self protection for non-HT stations. Only applied if
    /// `valid_fields` contains `CTS_PROTECTION`.
    pub cts_protection: u8,
    /// Use short preamble. Only applied if `valid_fields` contains
    /// `SHORT_PREAMBLE`.
    pub short_preamble: u8,
    /// Use short slot time. Only applied if `valid_fields` contains
    /// `SHORT_SLOT_TIME`.
    pub short_slot_time: u8,
    /// Explicit padding (offset 5 after short_slot_time, ht_opmode needs u16 alignment).
    pub _pad0: u8,
    /// HT operation mode (NL80211_BSS_HT_OPMODE). Controls whether the AP
    /// operates in mixed mode (HT + non-HT stations), greenfield mode, etc.
    /// Only applied if `valid_fields` contains `HT_OPMODE`.
    pub ht_opmode: u16,
    /// Enable/disable AP isolation (prevent clients from communicating with
    /// each other directly through the AP). Only applied if `valid_fields`
    /// contains `AP_ISOLATE`.
    pub ap_isolate: u8,
    /// Explicit trailing padding (struct alignment = 2 from u16 fields).
    pub _pad1: u8,
}
// BssParams: valid_fields(2) + cts_protection(1) + short_preamble(1) +
//   short_slot_time(1) + _pad0(1) + ht_opmode(2) + ap_isolate(1) + _pad1(1) = 10 bytes.
const_assert!(core::mem::size_of::<BssParams>() == 10);
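The driver-side application logic follows directly from the presence bitmask: a field is applied only when its bit is set. A hedged sketch, with plain u16 constants standing in for the `BssParamsFlags` type and a hypothetical `BssState` as the driver's live configuration shadow:

```rust
// Bit positions follow BssParamsFlags above.
const CTS_PROTECTION: u16 = 1 << 0;
const SHORT_PREAMBLE: u16 = 1 << 1;
const HT_OPMODE: u16 = 1 << 3;

/// Hypothetical driver-side shadow of the live BSS configuration.
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct BssState {
    pub cts_protection: bool,
    pub short_preamble: bool,
    pub ht_opmode: u16,
}

/// Apply only the fields whose presence bit is set; everything else is
/// ignored regardless of value — no sentinel comparison anywhere.
pub fn apply_bss_params(
    state: &mut BssState,
    valid_fields: u16,
    cts_protection: u8,
    short_preamble: u8,
    ht_opmode: u16,
) {
    if valid_fields & CTS_PROTECTION != 0 {
        state.cts_protection = cts_protection != 0;
    }
    if valid_fields & SHORT_PREAMBLE != 0 {
        state.short_preamble = short_preamble != 0;
    }
    if valid_fields & HT_OPMODE != 0 {
        state.ht_opmode = ht_opmode;
    }
}
```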

13.2.1.6 Wireless Error and Event Types

/// Error type returned by WirelessDriver methods.
#[repr(C, u32)]
pub enum WirelessError {
    /// The radio is not powered on (call `up()` first).
    RadioOff               = 1,
    /// A scan is already in progress.
    ScanBusy               = 2,
    /// Not connected to any network.
    NotConnected           = 3,
    /// The requested channel is not available (regulatory restriction or
    /// DFS requiring CAC).
    ChannelNotAvailable    = 4,
    /// Authentication failed (AP rejected, timeout, or SAE failure).
    AuthFailed             = 5,
    /// Association failed (AP rejected or timeout).
    AssocFailed            = 6,
    /// The firmware did not respond within the expected timeout.
    FirmwareTimeout        = 7,
    /// A hardware-level error occurred (MMIO failure, DMA stall).
    HardwareError          = 8,
    /// The requested operation is not supported by the hardware/firmware.
    Unsupported            = 9,
    /// The device was disconnected or the Tier 1 driver crashed.
    DeviceLost             = 10,
    /// An invalid parameter was passed (bad BSSID, unsupported cipher, etc.).
    InvalidParam           = 11,
    /// All hardware interface slots are in use.
    InterfacesExhausted    = 12,
    /// The requested key index is out of range (0-3 for group, 0 for pairwise).
    InvalidKeyIndex        = 13,
    /// Regulatory domain forbids the requested operation.
    RegulatoryRestriction  = 14,
    /// Key material length does not match what the AKM suite requires
    /// (e.g., KEK/KCK sizes for WPA2 vs WPA3-SAE vs WPA3-Enterprise 192-bit).
    /// Returned by `set_rekey_offload` when the provided key buffers are the
    /// wrong size for the specified AKM suite selector.
    InvalidKeyLength       = 15,
}

13.2.1.7 WirelessDeviceShadow

The kernel (Tier 0) maintains a shadow copy of essential wireless device configuration so that a Tier 1 driver crash can be recovered without userspace re-supplying configuration.

/// Shadow copy of wireless device state maintained by umka-core (Tier 0).
/// Populated during normal operation by intercepting cfg80211 configuration
/// commands. Used to restore the device after a Tier 1 driver crash.
pub struct WirelessDeviceShadow {
    /// Current regulatory domain (alpha2 country code).
    pub regdomain: [u8; 2],
    /// Configured interfaces (station, AP, monitor, etc.).
    /// 16 matches maximum concurrent interfaces on current enterprise WiFi hardware.
    pub interfaces: ArrayVec<WirelessIfaceShadow, 16>,
    /// Current channel configuration (center frequency, width).
    pub channel: Option<ChannelDef>,
    /// Power save mode in effect before crash.
    pub power_save: WirelessPowerSave,
}

/// Shadow of a single wireless interface's configuration.
pub struct WirelessIfaceShadow {
    /// Interface type (station, AP, etc.).
    pub iftype: Nl80211IfType,
    /// Interface name (e.g., "wlan0").
    pub name: ArrayString<16>,
    /// MAC address.
    pub mac_addr: [u8; 6],
}

13.2.1.8 DeviceLost Recovery Protocol

When WirelessError::DeviceLost is raised, the recovery path depends on whether the event was caused by a Tier 1 driver crash (recoverable) or a physical device disconnection (permanent):

Tier 1 driver crash recovery (detected by umka-core's crash monitor via Section 11.9):

  1. umka-core marks the wireless device as DeviceState::Recovering. All in-flight TX descriptors in the KABI ring are completed with WirelessError::DeviceLost.
  2. The Tier 1 driver is reloaded (~50-150 ms, Section 11.9).
  3. umka-core calls WirelessDriver::init() on the reloaded driver to reinitialize hardware state (firmware upload, ring buffer re-allocation, interrupt re-registration).
  4. The cfg80211 state machine in umka-net issues an NL80211_CMD_DISCONNECT netlink event to wpa_supplicant, with reason code WLAN_REASON_DEAUTH_LEAVING (3). This tells wpa_supplicant the connection was lost. wpa_supplicant will automatically attempt reassociation via its standard reconnection logic (scan → authenticate → associate → 4-way handshake).
  5. Device state transitions: Recovering → Ready (after init() completes). The regulatory domain, configured interfaces, and channel settings are restored from the WirelessDeviceShadow state cache maintained by umka-core (Tier 0).
  6. FMA event WifiDriverCrash is logged with driver name, crash count, and recovery duration (Section 20.1).

Physical device disconnection (USB unplug, PCIe hot-remove):

  1. umka-core marks the wireless device as DeviceState::Removed (permanent, no recovery).
  2. All in-flight TX descriptors are completed with WirelessError::DeviceLost.
  3. NL80211_CMD_DEL_WIPHY netlink event is sent — wpa_supplicant removes the interface and stops reconnection attempts.
  4. All open sockets bound to the device receive ENODEV on subsequent operations.
  5. The NetDevice is unregistered (RTM_DELLINK netlink event).

Application-visible difference: Applications (including wpa_supplicant) distinguish the two cases by the netlink event type: NL80211_CMD_DISCONNECT (crash, will recover) vs NL80211_CMD_DEL_WIPHY (permanent removal). Applications using the WirelessError::DeviceLost error code directly can query DeviceState via the NL80211_CMD_GET_WIPHY command to determine recoverability.
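The two recovery paths above reduce to a single classification decision. A simplified sketch (illustrative only — the real umka-core state machine has more states and also completes in-flight descriptors and restores shadow state as side effects):

```rust
/// Reduced device-state outcome of a DeviceLost event.
#[derive(PartialEq, Debug)]
pub enum DeviceState {
    Recovering, // Tier 1 crash: driver reload + shadow restore follow
    Removed,    // physical disconnect: permanent, no recovery
}

/// Netlink notification userspace receives for each outcome.
#[derive(PartialEq, Debug)]
pub enum NlEvent {
    Disconnect, // NL80211_CMD_DISCONNECT: wpa_supplicant will reconnect
    DelWiphy,   // NL80211_CMD_DEL_WIPHY: interface torn down for good
}

/// Map a DeviceLost cause to the resulting state and netlink event.
pub fn classify_device_lost(physically_removed: bool) -> (DeviceState, NlEvent) {
    if physically_removed {
        (DeviceState::Removed, NlEvent::DelWiphy)
    } else {
        (DeviceState::Recovering, NlEvent::Disconnect)
    }
}
```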

/// TX descriptor for the **firmware/hardware ring buffer** (Tier 0 core → NIC hardware).
/// This is the low-level descriptor that the kernel's wireless infrastructure (running
/// in Tier 0) places on the actual hardware DMA ring. It is distinct from the KABI-level
/// `WifiTxDescriptor` in [Section 13.15](#wifi-driver) which the Tier 1 driver writes to the
/// KABI ring buffer — the kernel copies from `WifiTxDescriptor` to `TxDescriptor`
/// during the Tier 0 TX path. Layout matches the NIC TX descriptor pattern
/// from [Section 11.8](11-drivers.md#ipc-architecture-and-message-passing--domain-ring-buffer-design).
#[repr(C, align(64))]
pub struct TxDescriptor {
    /// Physical address of the packet buffer (DMA-mapped by the caller via
    /// `umka_driver_dma_alloc`). The firmware/hardware reads the packet from
    /// this address.
    pub buffer_addr: u64,
    /// Packet length in bytes (total 802.11 frame including header).
    pub length: u16,
    /// TX flags: bit 0 = ACK requested, bits 1-3 = QoS TID (0-7),
    /// bit 4 = do not encrypt, bit 5 = no sequence number assignment.
    pub flags: u16,
    /// Explicit alignment padding (cookie requires 8-byte alignment).
    pub _pad0: [u8; 4],
    /// Completion cookie: opaque value returned in the TX completion ring
    /// so the driver can match completions to submitted packets.
    pub cookie: u64,
    _pad: [u8; 40],
    // buffer_addr(8) + length(2) + flags(2) + _pad0(4) + cookie(8) = 24 data bytes.
    // _pad(40) trailing pad = 64 total. _pad0(4) is intra-field alignment between
    // flags(u16) and cookie(u64), not counted as trailing pad.
}
const_assert!(core::mem::size_of::<TxDescriptor>() == 64);

/// RX descriptor for the **firmware/hardware ring buffer** (NIC hardware → Tier 0 core).
/// Describes a received packet written by the firmware/hardware into a pre-allocated
/// DMA buffer. This is distinct from the KABI-level `WifiRxDescriptor` in [Section 13.15](#wifi-driver)
/// which the kernel writes to the KABI ring for Tier 1 driver consumption.
#[repr(C, align(64))]
pub struct RxDescriptor {
    /// Physical address of the packet buffer (pre-allocated by the driver).
    pub buffer_addr: u64,
    /// Received packet length in bytes.
    pub length: u16,
    /// RX status flags: bit 0 = FCS OK, bit 1 = decrypted OK,
    /// bit 2 = part of A-MPDU, bit 3 = AMSDU subframe.
    pub flags: u16,
    /// RSSI (dBm).
    pub rssi: i8,
    /// Noise floor (dBm).
    pub noise: i8,
    /// Channel on which the packet was received (center frequency MHz).
    pub channel_freq_mhz: u16,
    /// Hardware timestamp (TSF microseconds).
    pub timestamp_us: u64,
    _pad: [u8; 40],
    // buffer_addr(8) + length(2) + flags(2) + rssi(1) + noise(1) +
    // channel_freq_mhz(2) + timestamp_us(8) = 24 bytes. Pad: 64 - 24 = 40.
}
const_assert!(core::mem::size_of::<RxDescriptor>() == 64);
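The TX `flags` bit layout above is easy to get wrong by hand; small pack/unpack helpers make the encoding explicit. A minimal sketch (the helper names are illustrative, not part of the contract):

```rust
/// Pack the documented TxDescriptor flag bits:
/// bit 0 = ACK requested, bits 1-3 = QoS TID (0-7),
/// bit 4 = do not encrypt, bit 5 = no sequence number assignment.
fn tx_flags(ack: bool, tid: u8, no_encrypt: bool, no_seqno: bool) -> u16 {
    (ack as u16)
        | (((tid & 0x7) as u16) << 1)
        | ((no_encrypt as u16) << 4)
        | ((no_seqno as u16) << 5)
}

/// Extract the QoS TID from bits 1-3.
fn tx_flags_tid(flags: u16) -> u8 {
    ((flags >> 1) & 0x7) as u8
}

fn main() {
    let f = tx_flags(true, 5, false, true);
    assert_eq!(f, 0b10_1011); // ACK + TID 5 + no-seqno
    assert_eq!(tx_flags_tid(f), 5);
    println!("ok");
}
```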

/// Asynchronous wireless events delivered via the per-device event ring.
/// `umka-net` polls this ring and translates events to nl80211 multicast
/// notifications ([Section 13.15](#wifi-driver--nl80211--linux-wireless-configuration-interface)).
///
/// All variants are fixed-size for the ring buffer. No heap allocation.
/// `repr(C, u32)` ensures stable discriminant layout across driver reloads
/// and compiler versions. Per-slot size is ~2.3 KB (dominated by MgmtFrame
/// variant's 2304-byte data field); a 256-entry ring is ~580 KB per device.
#[repr(C, u32)]
pub enum WirelessEvent {
    /// Scan completed (all requested channels scanned).
    /// Triggers `NL80211_CMD_NEW_SCAN_RESULTS` multicast.
    ScanDone {
        /// True if the scan was aborted before completion.
        aborted: u8, // 0 = completed, 1 = aborted
    },
    /// Authentication result (SME-in-userspace mode).
    /// Triggers `NL80211_CMD_AUTHENTICATE` multicast.
    AuthResult {
        /// AP BSSID.
        bssid: MacAddr,
        /// IEEE 802.11 status code (0 = success).
        status_code: u16,
        /// Authentication frame body (for SAE confirm exchange).
        frame_body: [u8; 512],
        /// Valid byte count within `frame_body`.
        frame_body_len: u16,
    },
    /// Association result (SME-in-userspace mode).
    /// Triggers `NL80211_CMD_ASSOCIATE` multicast.
    AssocResult {
        /// AP BSSID.
        bssid: MacAddr,
        /// IEEE 802.11 status code (0 = success).
        status_code: u16,
        /// Response IEs from the association response frame.
        resp_ies: [u8; 350],
        /// Valid byte count within `resp_ies`.
        resp_ies_len: u16,
    },
    /// Connection completed (SME-in-firmware mode).
    /// Triggers `NL80211_CMD_CONNECT` multicast.
    Connected {
        /// AP BSSID.
        bssid: MacAddr,
        /// IEEE 802.11 status code (0 = success).
        status_code: u16,
    },
    /// Disconnected from the network.
    /// Triggers `NL80211_CMD_DISCONNECT` multicast.
    Disconnected {
        /// IEEE 802.11 reason code.
        reason_code: u16,
        /// True if the AP initiated the disconnection (deauth/disassoc from AP).
        by_ap: u8, // 0 = locally initiated, 1 = by AP
    },
    /// Connection quality monitor RSSI threshold crossed.
    /// Triggers `NL80211_CMD_NOTIFY_CQM` multicast.
    CqmRssi {
        /// Current RSSI (dBm).
        rssi: i32,
        /// Which threshold was crossed.
        event: CqmThresholdEvent,
    },
    /// Driver-initiated roam to a new AP (same SSID, different BSSID).
    /// Triggers `NL80211_CMD_ROAM` multicast.
    Roamed {
        /// New AP BSSID.
        bssid: MacAddr,
        /// Request IEs sent to the new AP.
        req_ies: [u8; 256],
        req_ies_len: u16,
        /// Response IEs from the new AP.
        resp_ies: [u8; 350],
        resp_ies_len: u16,
    },
    /// Received a management frame registered via `register_mgmt_frame()`.
    /// Triggers `NL80211_CMD_FRAME` multicast.
    MgmtFrame {
        /// Channel frequency (MHz) on which the frame was received.
        freq: u32,
        /// Frame body (from the 802.11 header onward).
        data: [u8; 2304],
        /// Valid byte count within `data`.
        data_len: u16,
    },
    /// Michael MIC failure detected (TKIP countermeasures).
    /// Triggers `NL80211_CMD_MICHAEL_MIC_FAILURE` multicast.
    MicFailure {
        /// AP BSSID.
        bssid: MacAddr,
        /// Key type: 0 = pairwise, 1 = group.
        key_type: u32,
        /// Key index.
        key_id: u8,
    },
    /// AP mode: a new station has associated.
    NewStation { mac: MacAddr },
    /// AP mode: a station has disassociated or was deauthenticated.
    DelStation { mac: MacAddr },
    /// Remain-on-channel session expired.
    RemainOnChannelExpired { cookie: u64 },
    /// TX status for a management frame sent via `mgmt_tx()`.
    MgmtTxStatus {
        /// Cookie from the `mgmt_tx()` call.
        cookie: u64,
        /// True if the frame was ACKed by the remote station.
        acked: u8, // 0 = not acked, 1 = acked
    },
    /// Radar detected on the operating channel (DFS).
    /// Triggers `NL80211_CMD_RADAR_DETECT` multicast.
    /// The kernel must immediately vacate the channel (switch to a
    /// non-DFS channel or a different DFS channel that has passed CAC).
    /// The channel is marked as unavailable for the Non-Occupancy Period
    /// (NOP, typically 30 minutes per regulatory domain).
    RadarDetected {
        /// Channel on which radar was detected (center frequency MHz).
        freq: u32,
    },
    /// Channel Availability Check (CAC) completed on a DFS channel.
    /// `success = 1`: no radar detected, channel is now available for use.
    /// `success = 0`: CAC was aborted (e.g., interface teardown, regulatory change).
    CacFinished {
        /// Channel on which CAC was performed (center frequency MHz).
        freq: u32,
        /// 0 = aborted, 1 = completed successfully (no radar detected).
        success: u8,
    },
}

// WirelessEvent layout proof (#[repr(C, u32)] = u32 tag + union of variant structs):
//   Largest variant: MgmtFrame { freq: u32, data: [u8; 2304], data_len: u16 }
//     freq(4) + data(2304) + data_len(2) = 2310 bytes, align 4 → padded to 2312.
//   The payload union has 8-byte alignment because RemainOnChannelExpired and
//   MgmtTxStatus carry a u64 cookie, so the union starts at offset 8:
//     tag(4) + alignment pad(4) + union(2312) = 2320 bytes, align 8.
// All ring slots use this size regardless of variant.
const_assert!(core::mem::size_of::<WirelessEvent>() == 2320);
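The asserted 2320-byte size can be cross-checked with a standalone reduction of the enum. A two-variant subset suffices, since `MgmtFrame` dominates the payload size and the `u64`-cookie variants force the payload union to 8-byte alignment (`EventSubset` is a stand-in, not a real type):

```rust
// Reduced stand-in for WirelessEvent: the size-dominating variant plus one
// u64-carrying variant that forces the payload union to 8-byte alignment.
#[repr(C, u32)]
#[allow(dead_code)]
enum EventSubset {
    MgmtFrame { freq: u32, data: [u8; 2304], data_len: u16 },
    RemainOnChannelExpired { cookie: u64 },
}

fn main() {
    // tag(4) + alignment pad(4) + union(2312) = 2320
    assert_eq!(core::mem::size_of::<EventSubset>(), 2320);
    println!("ok");
}
```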

/// Connection quality monitor threshold event direction.
#[repr(u8)]
pub enum CqmThresholdEvent {
    /// RSSI dropped below the configured threshold.
    Low  = 0,
    /// RSSI rose above the configured threshold.
    High = 1,
}
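As a usage sketch, the umka-net poll loop reduces to a match over the event variant that selects which nl80211 multicast to emit. The mapping below follows the doc comments on each variant; the enum is a trimmed illustrative subset:

```rust
// Trimmed illustrative subset of WirelessEvent (field layout as documented above).
#[repr(C, u32)]
#[allow(dead_code)]
enum WirelessEvent {
    ScanDone { aborted: u8 },
    Disconnected { reason_code: u16, by_ap: u8 },
    RadarDetected { freq: u32 },
}

/// Which nl80211 multicast notification each event triggers.
fn nl80211_cmd(ev: &WirelessEvent) -> &'static str {
    match ev {
        WirelessEvent::ScanDone { .. } => "NL80211_CMD_NEW_SCAN_RESULTS",
        WirelessEvent::Disconnected { .. } => "NL80211_CMD_DISCONNECT",
        WirelessEvent::RadarDetected { .. } => "NL80211_CMD_RADAR_DETECT",
    }
}

fn main() {
    let ev = WirelessEvent::RadarDetected { freq: 5300 };
    assert_eq!(nl80211_cmd(&ev), "NL80211_CMD_RADAR_DETECT");
    println!("ok");
}
```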

Event delivery: Wireless state changes (scan results, connect/disconnect, roaming) are delivered via a per-device event ring (WirelessEvent enum) that umka-net polls. No callbacks into driver code from the network stack.

Hardware-specific detail: Section 13.15 (WiFi — Intel/Realtek/Qualcomm/MediaTek, including server-class 802.11ax access-point mode), Section 13.14 (Bluetooth HCI).

13.3 Display Subsystem

Tier: Tier 1 for integrated GPU display engines (Intel Gen12+, AMD DCN, ARM Mali DP). Tier 2 only for fully-offloaded display (USB DisplayLink, network display server) where the display path already crosses a process boundary.

KABI interface name: display_device_v1 (in interfaces/display_device.kabi).

// umka-core/src/display/mod.rs — authoritative display driver contract

/// A display controller device. Implemented by GPU/display drivers.
// Rust-internal trait. KABI vtable generated separately by kabi-gen
// from display_device.kabi.
pub trait DisplayDriver: Send + Sync {
    // --- Connector enumeration ---

    /// Return all physical connectors managed by this display controller.
    /// Caller-supplied buffer pattern for KABI compatibility.
    /// Returns the number of connectors written to `buf`.
    /// `DisplayConnector` is defined in [Section 21.5](21-user-io.md#display-and-graphics).
    fn connectors(&self, buf: &mut [DisplayConnector], max_count: u32) -> Result<u32, IoError>;

    // KABI methods use caller-supplied buffers — Vec<T> is Rust-specific and not repr(C) stable.

    /// Read EDID from a connected display. Returns the number of bytes written
    /// to `buf`, or an error if no display or no DDC/CI support (driver falls
    /// back to safe-mode resolution).
    fn read_edid(&self, connector_id: u32, buf: &mut [u8]) -> Result<u32, IoError>;

    // --- Atomic modesetting (required; non-atomic paths are not supported) ---

    /// Validate an atomic commit without applying it.
    /// Returns Ok(()) if the hardware can execute the commit, or an error
    /// describing the constraint that is violated.
    fn atomic_check(&self, commit: &AtomicCommit) -> Result<(), DisplayError>;

    /// Apply an atomic commit. Must be preceded by a successful `atomic_check`.
    /// Blocks until the commit takes effect (next vsync, or immediately for
    /// async page flips).
    fn atomic_commit(&self, commit: &AtomicCommit, flags: CommitFlags)
        -> Result<(), DisplayError>;

    // --- Framebuffer management ---

    /// Import a DMA-BUF as a scanout framebuffer. Returns a `FramebufferId`
    /// used in subsequent atomic commits. The driver pins the buffer for the
    /// lifetime of the framebuffer handle.
    /// `format` uses `FramebufferFormat` (defined in [Section 21.5](21-user-io.md#display-and-graphics)).
    fn import_dmabuf(
        &self,
        fd: DmaBufHandle,
        width: u32,
        height: u32,
        format: FramebufferFormat,
        modifier: u64,
    ) -> Result<FramebufferId, DisplayError>;

    /// Release a framebuffer handle (unpins the DMA-BUF).
    fn destroy_framebuffer(&self, fb: FramebufferId);

    // --- Display power ---

    /// Set DPMS state for a connector (On / Standby / Suspend / Off).
    fn set_dpms(&self, connector_id: u32, state: DpmsState)
        -> Result<(), DisplayError>;

    // --- Vsync events ---

    /// Return the vsync event ring (one entry per completed page flip or
    /// periodic vsync). Consumers: compositors, frame pacing logic.
    fn vsync_ring(&self) -> &RingBuffer<VsyncEvent>;
}

/// Vsync event delivered via the per-CRTC ring buffer.
/// One entry per completed page flip or periodic vsync interrupt.
/// See [Section 21.5](21-user-io.md#display-and-graphics) for the companion userspace types.
// kernel-internal, not KABI
#[repr(C)]
pub struct VsyncEvent {
    /// Monotonic timestamp (nanoseconds) when the vsync/page flip completed.
    pub timestamp_ns: u64,
    /// Sequence number (monotonically increasing per CRTC).
    pub sequence: u64,
    /// CRTC index that generated this event.
    pub crtc_index: u32,
    /// Flags: bit 0 = page_flip_complete, bit 1 = vblank_event.
    pub flags: u32,
}
const_assert!(core::mem::size_of::<VsyncEvent>() == 24);
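A frame-pacing consumer (compositor or game loop) can derive the effective refresh rate and missed-vblank count from successive ring entries. A sketch using the `VsyncEvent` fields above (the helper functions are illustrative):

```rust
#[repr(C)]
#[allow(dead_code)]
struct VsyncEvent {
    timestamp_ns: u64,
    sequence: u64,
    crtc_index: u32,
    flags: u32,
}

/// Effective refresh rate (Hz) between two events on the same CRTC.
fn estimated_hz(prev: &VsyncEvent, cur: &VsyncEvent) -> f64 {
    let frames = (cur.sequence - prev.sequence) as f64;
    frames * 1e9 / (cur.timestamp_ns - prev.timestamp_ns) as f64
}

/// Whole vblank periods skipped between two consecutive deliveries.
fn missed_vblanks(prev: &VsyncEvent, cur: &VsyncEvent) -> u64 {
    cur.sequence - prev.sequence - 1
}

fn main() {
    let a = VsyncEvent { timestamp_ns: 0, sequence: 100, crtc_index: 0, flags: 0b10 };
    // Two periods elapsed in ~33.3 ms → one vblank was missed at 60 Hz.
    let b = VsyncEvent { timestamp_ns: 33_333_334, sequence: 102, crtc_index: 0, flags: 0b10 };
    assert!((estimated_hz(&a, &b) - 60.0).abs() < 0.01);
    assert_eq!(missed_vblanks(&a, &b), 1);
    println!("ok");
}
```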

/// Framebuffer ID. Identifies a framebuffer object within the DRM subsystem.
/// Allocated by `DRM_IOCTL_MODE_ADDFB2` and referenced in atomic commit
/// requests. See [Section 21.5](21-user-io.md#display-and-graphics) for the userspace-facing types.
pub type FramebufferId = u32;

/// Kernel-internal atomic commit request. Built from the userspace
/// `AtomicModeset` (defined in [Section 21.5](21-user-io.md#display-and-graphics)) by the DRM/KMS
/// core: validates connector/plane IDs, resolves CRTC assignments, and
/// converts `FramebufferFormat` fourcc codes to driver-internal pixel layout.
/// This is the type consumed by `DisplayDriver::atomic_check()` and
/// `atomic_commit()`.
/// Types and constants referenced below are defined in [Section 21.5](21-user-io.md#display-and-graphics):
/// `ConnectorState`, `PlaneState`, `CrtcState`, `DpmsState`,
/// `MAX_CONNECTORS`, `MAX_PLANES`, `MAX_CRTCS`.
pub struct AtomicCommit {
    /// Per-connector state changes (DPMS, CRTC assignment, mode).
    pub connectors: ArrayVec<ConnectorState, MAX_CONNECTORS>,
    /// Per-plane state changes (framebuffer, position, scaling, rotation).
    pub planes: ArrayVec<PlaneState, MAX_PLANES>,
    /// Per-CRTC state (active, mode_blob). Derived from connector assignments.
    pub crtcs: ArrayVec<CrtcState, MAX_CRTCS>,
}

/// Flags for `atomic_commit`.
/// A value of `CommitFlags(0)` (no flags set) means the default behavior:
/// apply on the next vsync (tear-free). There is no explicit VSYNC flag
/// because vsync-synchronous commits are the implicit default.
bitflags! {
    pub struct CommitFlags: u32 {
        /// Apply immediately without waiting for vsync (for cursor updates).
        const ASYNC      = 1 << 0;
        /// Test-only: validate without applying.
        const TEST_ONLY  = 1 << 1;
        /// Allow modesetting (resolution / refresh rate change).
        const ALLOW_MODESET = 1 << 2;
    }
}
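A compositor typically probes a commit with `TEST_ONLY` before applying it, so a rejected configuration never disturbs hardware state. A sketch of that two-phase pattern using the flag values above (the mock driver closure and `try_modeset` helper are hypothetical):

```rust
// Flag values mirror the CommitFlags bitflags above.
const ASYNC: u32 = 1 << 0;
const TEST_ONLY: u32 = 1 << 1;
const ALLOW_MODESET: u32 = 1 << 2;

/// Probe, then apply on the next vsync (no ASYNC bit = tear-free default).
fn try_modeset(commit: impl Fn(u32) -> Result<(), &'static str>) -> Result<(), &'static str> {
    commit(TEST_ONLY | ALLOW_MODESET)?; // validate only; hardware state untouched
    commit(ALLOW_MODESET)               // apply; takes effect at next vsync
}

fn main() {
    // Mock driver: rejects any commit that omits ALLOW_MODESET.
    let driver = |flags: u32| {
        if flags & ALLOW_MODESET == 0 { Err("modeset not allowed") } else { Ok(()) }
    };
    assert!(try_modeset(driver).is_ok());
    assert_eq!(driver(ASYNC), Err("modeset not allowed"));
    println!("ok");
}
```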

DMA-BUF integration: The display subsystem consumes DMA-BUFs produced by the GPU compute subsystem (Section 22.1) or by CPU-rendered framebuffers. The kernel capability model (Section 9.1) gates import_dmabuf access: a process must hold CAP_DISPLAY to present a framebuffer on a physical connector.

Hardware-specific detail: Section 21.5 (display: Intel i915, AMD DCN, Raspberry Pi display pipeline, USB DisplayLink).

13.4 Audio Subsystem

Tier: Tier 1 (default). Audio I/O requires strict real-time scheduling to avoid glitches (buffer underruns). The period interrupt (64–2048 frames at 48 kHz = 1.3–42.7 ms) must fire predictably; Tier 2 syscall overhead per interrupt would consistently violate this budget at low-latency settings (< 10 ms periods). For consumer/desktop configurations where crash resilience is prioritized over latency, audio drivers may optionally be demoted to Tier 2 at ≥ 10 ms buffer periods, where the ~20–50 μs syscall overhead is acceptable (see Section 21.4 in 21-user-io.md for demotion policy).
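The latency arithmetic behind the tier rationale is simple to reproduce: a period of N frames at rate R lasts N/R seconds. A sketch (helper name is illustrative):

```rust
/// Period interrupt interval in milliseconds for a given period size and rate.
fn period_ms(period_frames: u32, rate_hz: u32) -> f64 {
    period_frames as f64 * 1000.0 / rate_hz as f64
}

fn main() {
    // The 64–2048 frame range at 48 kHz quoted above:
    assert!((period_ms(64, 48_000) - 1.333).abs() < 0.001);
    assert!((period_ms(2048, 48_000) - 42.667).abs() < 0.001);
    // Tier 2 demotion threshold: >= 10 ms, i.e. >= 480 frames at 48 kHz.
    assert!(period_ms(480, 48_000) >= 10.0);
    println!("ok");
}
```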

KABI interface name: audio_device_v1 (in interfaces/audio_device.kabi).

// umka-core/src/audio/mod.rs — authoritative audio driver contract

/// An audio device. Implemented by HDA controllers, USB audio class drivers,
/// HDMI audio endpoints, Bluetooth A2DP sinks (via umka-sysapi HCI layer),
/// and virtual audio devices.
pub trait AudioDriver: Send + Sync {
    // --- PCM streams ---

    /// Negotiate a PCM stream. The driver validates that the hardware
    /// supports the requested format, sample rate, channel count, and
    /// period/buffer sizes, and allocates a DMA ring buffer.
    fn open_pcm(&self, params: &PcmParams) -> Result<PcmStream, AudioError>;

    /// Start DMA on an open PCM stream. The hardware begins reading from
    /// (playback) or writing to (capture) the DMA ring buffer. Must be
    /// called after `open_pcm()` and buffer filling (playback) or
    /// application readiness (capture).
    fn start_stream(&self, handle: PcmStreamHandle) -> Result<(), AudioError>;

    /// Stop DMA on a running PCM stream. The hardware stops reading/writing
    /// the DMA ring buffer. The stream remains open and can be restarted
    /// with `start_stream()`. Pending DMA transfers are drained or aborted
    /// depending on the `drain` parameter (true = wait for ring to empty,
    /// false = immediate stop).
    fn stop_stream(&self, handle: PcmStreamHandle, drain: bool) -> Result<(), AudioError>;

    // --- Mixer (hardware volume/routing controls) ---

    /// Enumerate hardware mixer controls into the caller-supplied buffer.
    /// Returns the number of controls written, up to `max_count`.
    fn mixer_controls(&self, buf: &mut [MixerControl], max_count: u32) -> Result<u32, IoError>;

    /// Read a mixer control value.
    fn mixer_get(&self, id: u32) -> Result<i32, AudioError>;

    /// Write a mixer control value.
    fn mixer_set(&self, id: u32, value: i32) -> Result<(), AudioError>;

    // --- Jack detection ---

    /// Return the jack event ring (headphone/microphone insert/remove events).
    fn jack_ring(&self) -> &RingBuffer<JackEvent>;

    // --- Power ---

    /// Suspend audio device (silence output, power-gate ADC/DAC). Called
    /// before platform S3/S0ix entry.
    fn suspend(&self) -> Result<(), AudioError>;

    /// Resume audio device. Called after platform resume.
    fn resume(&self) -> Result<(), AudioError>;
}

/// Parameters for opening a PCM stream.
/// Layout (24 bytes): direction(4) + format(4) + rate(4) + channels(1) +
/// _pad(3) + period_frames(4) + buffer_frames(4) = 24 bytes.
#[repr(C)]
pub struct PcmParams {
    pub direction:     PcmDirection,  // Playback or Capture
    pub format:        PcmFormat,     // S16Le / S24Le / S32Le / F32Le
    pub rate:          u32,           // Hz: 44100, 48000, 96000, 192000
    pub channels:      u8,            // 1 (mono) … 8 (7.1)
    /// Explicit padding for u32 alignment of `period_frames`. Must be zeroed.
    pub _pad: [u8; 3],
    pub period_frames: u32,           // Interrupt granularity (power of 2)
    pub buffer_frames: u32,           // Ring buffer size (multiple of period)
}
const_assert!(core::mem::size_of::<PcmParams>() == 24);
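The documented constraints — `period_frames` a power of two, `buffer_frames` a whole multiple of the period — can be checked client-side before calling `open_pcm`. A sketch (the validation helper is illustrative, not part of the contract):

```rust
/// Pre-flight check for the documented PcmParams size constraints:
/// period_frames must be a power of two, buffer_frames a multiple of it.
fn pcm_sizes_valid(period_frames: u32, buffer_frames: u32) -> bool {
    period_frames.is_power_of_two() && buffer_frames % period_frames == 0
}

fn main() {
    assert!(pcm_sizes_valid(256, 1024));   // 4 periods per buffer
    assert!(!pcm_sizes_valid(300, 1200));  // 300 is not a power of two
    assert!(!pcm_sizes_valid(256, 1000));  // not a whole number of periods
    println!("ok");
}
```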

/// Error type returned by AudioDriver vtable methods.
#[repr(C, u32)]
pub enum AudioError {
    /// The PCM stream parameters are not supported by this device.
    UnsupportedFormat = 1,
    /// The requested sample rate is not supported.
    UnsupportedRate = 2,
    /// No PCM streams available (all in use).
    StreamsExhausted = 3,
    /// DMA buffer allocation failed (ENOMEM).
    NoMemory = 4,
    /// The stream ring buffer underran (playback) — driver consumed data faster
    /// than userspace supplied it. Stream is stopped; caller must restart.
    Underrun = 5,
    /// The stream ring buffer overran (capture) — driver produced data faster
    /// than userspace consumed it. Samples were dropped.
    Overrun = 6,
    /// The hardware is in an unrecoverable error state. Driver must be reloaded.
    HardwareError = 7,
    /// The operation was aborted because the stream is stopping.
    Aborted = 8,
}

ALSA compatibility: umka-sysapi translates snd_pcm_*, snd_ctl_*, and snd_rawmidi_* ioctls to AudioDriver calls, enabling PipeWire, PulseAudio, and JACK to run unmodified.

Hardware-specific detail: Section 21.4 (audio: Intel HDA, USB Audio Class, HDMI/DP audio endpoint).

13.5 GPU Compute

Tier: Tier 1. GPU memory management (IOMMU domain assignment, VRAM eviction, TDR recovery) must execute in kernel context. Kernel-bypass command submission is the only path that meets the latency requirements of interactive rendering and GPGPU workloads. A Tier 2 boundary crossing on every submit would add unacceptable per-frame overhead.

KABI interface name: gpu_device_v1 (in interfaces/gpu_device.kabi).

// umka-core/src/gpu/mod.rs — authoritative GPU driver contract

/// A GPU device. Implemented by drivers for discrete and integrated GPUs
/// (Intel Xe, AMD GCN/RDNA, Arm Mali Valhall, NVIDIA GSP, etc.).
pub trait GpuDevice: Send + Sync {
    // --- Context management ---

    /// Allocate a GPU context for the calling process. A context owns a
    /// private GPU virtual address space backed by a dedicated IOMMU domain.
    /// It is the unit of isolation: a fault in one context cannot corrupt
    /// another. The kernel destroys the context (and all buffer objects mapped
    /// into it) when the owning process exits.
    ///
    /// Requires `CAP_GPU_RENDER`.
    fn alloc_ctx(&self) -> Result<GpuContext, GpuError>;

    /// Destroy a GPU context and release all GPU VA space. Any buffer objects
    /// mapped into the context are unmapped but not freed; the caller must
    /// drop the `BufferObject` handles separately.
    fn free_ctx(&self, ctx: GpuContext) -> Result<(), GpuError>;

    // --- Buffer object lifecycle ---

    /// Allocate a buffer object. `size` is in bytes (page-aligned). `placement`
    /// controls where physical backing is sourced (VRAM, GTT, or system
    /// memory). `tiling` sets the hardware tiling modifier; use
    /// `DrmFormatModifier::LINEAR` unless the caller has negotiated a tiling
    /// format with the display subsystem ([Section 13.3](#display-subsystem)).
    ///
    /// Requires `CAP_GPU_RENDER`.
    fn alloc_bo(
        &self,
        size: u64,
        placement: BoPlacementFlags,
        tiling: TilingModifier,
    ) -> Result<BufferObject, GpuError>;

    /// Free a buffer object. The BO must have been unmapped from all GPU VA
    /// spaces before calling this. Returns `GpuError::StillMapped` if not.
    fn free_bo(&self, bo: BufferObject) -> Result<(), GpuError>;

    // --- GPU virtual address space ---

    /// Map a buffer object into a GPU context's virtual address space.
    /// `va_hint` is advisory; the driver may choose a different VA if the
    /// hint conflicts with an existing mapping. Returns the actual GPU VA.
    ///
    /// The mapping remains valid until `unmap_bo` is called or the context
    /// is destroyed.
    fn map_bo(
        &self,
        ctx: &GpuContext,
        bo: &BufferObject,
        va_hint: Option<u64>,
        flags: BoMapFlags,
    ) -> Result<u64, GpuError>;

    /// Unmap a buffer object from a GPU context's virtual address space.
    fn unmap_bo(&self, ctx: &GpuContext, va: u64) -> Result<(), GpuError>;

    // --- Command submission ---

    /// Submit a command buffer for execution on the GPU. `exec_queue`
    /// selects the hardware engine (graphics, compute, copy, video).
    /// `wait_fences` is a list of `DmaFence` values that must be signaled
    /// before execution begins. Returns a `DmaFence` that is signaled when
    /// the command buffer completes.
    ///
    /// The command buffer pointer is a GPU VA within `ctx`. The caller is
    /// responsible for ensuring the GPU VA maps to valid, initialized memory.
    ///
    /// Requires `CAP_GPU_RENDER`.
    fn submit(
        &self,
        ctx: &GpuContext,
        exec_queue: ExecQueue,
        cmdbuf_va: u64,
        cmdbuf_size: u64,
        wait_fences: &[DmaFence],
    ) -> Result<DmaFence, GpuError>;

    // --- DMA-BUF export ---

    /// Export a buffer object as a DMA-BUF file descriptor. The returned
    /// handle can be passed to the display subsystem
    /// (`DisplayDriver::import_dmabuf`, [Section 13.3](#display-subsystem)) or to
    /// a video encoder (`MediaDevice::queue_buf`, [Section 13.7](#video-media-pipeline)).
    /// The BO reference count is incremented; the BO remains live until both
    /// the `BufferObject` handle and all DMA-BUF importers are dropped.
    fn export_dmabuf(&self, bo: &BufferObject) -> Result<DmaBufHandle, GpuError>;

    // --- TDR (Timeout Detection and Recovery) ---

    /// Trigger an explicit TDR cycle on a GPU context that the caller has
    /// determined is hung. The kernel also calls this internally if a context
    /// has not produced a progress heartbeat for 2 seconds.
    ///
    /// Behavior:
    /// 1. The driver preempts the hung context.
    /// 2. The hardware engine is reset to a known-good state.
    /// 3. All other active contexts are saved, the engine is reconfigured,
    ///    and those contexts resume from their last checkpoint.
    /// 4. The hung `GpuContext` is marked invalid; any subsequent call on it
    ///    returns `GpuError::ContextLost` (-ENODEV).
    ///
    /// Requires `CAP_GPU_ADMIN`.
    fn tdr_reset(&self, ctx: &GpuContext) -> Result<(), GpuError>;

    // --- Memory pressure integration ---

    /// Evict a buffer object from its current placement to `target` domain.
    /// Called by the OOM killer or memory reclaim subsystem when system memory
    /// is under pressure and the BO resides in system-memory-backed placement
    /// (GTT, system RAM). For discrete GPUs with dedicated VRAM, `target` may
    /// be `BoPlacementFlags::VRAM` to move from system to device memory.
    ///
    /// Returns `Ok(())` on successful eviction, `GpuError::Busy` if the BO is
    /// currently referenced by in-flight GPU commands (caller should retry after
    /// fence completion), or `GpuError::NotSupported` if the driver does not
    /// support eviction for this BO type.
    fn evict_bo(&self, bo: &BufferObject, target: BoPlacementFlags) -> Result<(), GpuError>;

    /// Shrink GPU-held reclaimable memory under memory pressure.
    /// Called by the kernel's shrinker infrastructure. The driver should release
    /// idle cached resources (purgeable BOs, shader cache entries, precompiled
    /// pipeline state) up to `target_bytes`. Returns the number of bytes
    /// actually freed. The driver MUST NOT block waiting for GPU fences — only
    /// resources that are immediately reclaimable should be freed.
    fn shrink(&self, target_bytes: u64) -> u64;

    // --- Capability queries ---

    /// Return the set of capabilities reported by the GPU hardware (memory
    /// sizes, engine counts, supported tiling modifiers, etc.).
    fn capabilities(&self) -> GpuCapabilities;
}

/// A GPU context: one process's private GPU virtual address space.
/// Isolated from all other contexts by a dedicated IOMMU domain.
pub struct GpuContext {
    /// Opaque kernel handle. Never dereference from outside the GPU subsystem.
    pub handle: u64,
    /// The IOMMU domain ID assigned to this context (for cross-subsystem
    /// DMA-BUF import validation).
    pub iommu_domain_id: u64,
}

/// A GPU memory allocation.
pub struct BufferObject {
    /// Opaque kernel handle.
    pub handle: u64,
    /// Size in bytes (always page-aligned).
    pub size: u64,
    /// Actual placement after allocation (may differ from the requested
    /// placement if VRAM was full and the driver fell back to GTT).
    pub actual_placement: BoPlacementFlags,
    /// Tiling modifier in use (DRM format modifier encoding).
    pub tiling: DrmFormatModifier,
}

// Timeline semaphore / DMA fence — the canonical definition is `DmaFence` in
// [Section 22.1](22-accelerators.md#unified-accelerator-framework--dmafence). That struct uses (device_id,
// context_id, value) tuples with a `DmaFenceType` discriminant, supporting
// device-local, cross-device, and CPU-signalable fences.
//
// GPU subsystem aliases:
// - `timeline_id` maps to `(DmaFence.device_id, DmaFence.context_id)`
// - `seqno` maps to `DmaFence.value`
// - All GPU fences use `DmaFenceType::DeviceLocal` unless P2P is involved.
//
// Shared across GPU, Media, NPU, Crypto, and DMA Engine subsystems to allow
// cross-subsystem dependency chains without conversions.
// See [Section 22.1](22-accelerators.md#unified-accelerator-framework) for the canonical DmaFence struct.

bitflags! {
    /// Where to source backing memory for a buffer object.
    pub struct BoPlacementFlags: u32 {
        /// GPU-local VRAM. Highest GPU bandwidth, not CPU-accessible without
        /// a GTT mapping.
        const VRAM   = 1 << 0;
        /// Graphics Translation Table (CPU-accessible via BAR2/GGTT aperture).
        const GTT    = 1 << 1;
        /// System (DRAM) memory. Always CPU-accessible; lowest GPU bandwidth.
        const SYSTEM = 1 << 2;
    }
}

bitflags! {
    /// Flags controlling a GPU VA mapping.
    pub struct BoMapFlags: u32 {
        /// GPU may read the buffer.
        const READ       = 1 << 0;
        /// GPU may write the buffer.
        const WRITE      = 1 << 1;
        /// CPU cache is coherent with GPU (requires hardware support; falls
        /// back to uncached if not available).
        const COHERENT   = 1 << 2;
    }
}

/// DRM format modifier — 64-bit value matching Linux's `drm_fourcc.h` encoding.
/// `fourcc_mod_code(vendor, val) = ((vendor as u64) << 56) | (val & 0x00FFFFFFFFFFFFFF)`.
/// Example: `I915_FORMAT_MOD_X_TILED = 0x0100000000000001`.
#[repr(transparent)]
pub struct DrmFormatModifier(pub u64);

impl DrmFormatModifier {
    pub const LINEAR: Self      = Self(0x0000000000000000);
    pub const I915_X_TILED: Self = Self(0x0100000000000001);
    pub const I915_Y_TILED: Self = Self(0x0100000000000002);
    // AMD DCC (Display Compression Control) modifiers are **not** single constants.
    // They are composed dynamically via the `AMD_FMT_MOD` macro bitfield system in
    // Linux `include/uapi/drm/drm_fourcc.h`:
    //   - Vendor base: `fourcc_mod_code(AMD, 0) = 0x0200000000000000`
    //   - DCC: bit 13, DCC_RETILE: bit 14, TILE: bits 8-12,
    //     TILE_VERSION: bits 0-7, PIPE_XOR_BITS: bits 19-21, etc.
    // Mesa/Vulkan negotiate modifiers with KMS; there is no single `AMD_DCC` value.
    // Drivers construct per-generation modifiers at runtime.
    // Example GFX9+DCC+PIPE_ALIGNED: `Self(0x0200000000062002)`.
    // (Ordinary comments, not doc comments, so they do not attach to ARM_AFBC.)
    pub const ARM_AFBC: Self    = Self(0x0800000000000001);
}
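The modifier encoding documented above can be reproduced directly from the vendor/value formula; the vendor codes (INTEL = 0x01, AMD = 0x02, ARM = 0x08) come from Linux `drm_fourcc.h`. A sketch verifying the constants:

```rust
/// fourcc_mod_code(vendor, val) per the encoding in the doc comment above:
/// (vendor << 56) | (val & 0x00FF_FFFF_FFFF_FFFF).
const fn fourcc_mod_code(vendor: u64, val: u64) -> u64 {
    (vendor << 56) | (val & 0x00FF_FFFF_FFFF_FFFF)
}

fn main() {
    assert_eq!(fourcc_mod_code(0x01, 1), 0x0100_0000_0000_0001); // I915_X_TILED
    assert_eq!(fourcc_mod_code(0x01, 2), 0x0100_0000_0000_0002); // I915_Y_TILED
    assert_eq!(fourcc_mod_code(0x02, 0), 0x0200_0000_0000_0000); // AMD vendor base
    assert_eq!(fourcc_mod_code(0x08, 1), 0x0800_0000_0000_0001); // ARM_AFBC
    println!("ok");
}
```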

/// GPU hardware engine selector for command submission.
#[repr(u32)]
pub enum ExecQueue {
    /// 3D rendering and compute shaders (universal queue on most GPUs).
    Graphics  = 0,
    /// Dedicated compute queue (no graphics state, runs in parallel).
    Compute   = 1,
    /// Blitter / copy engine (lower power for buffer-to-buffer transfers).
    Copy      = 2,
    /// Video decode engine.
    VideoDec  = 3,
    /// Video encode engine.
    VideoEnc  = 4,
}

/// GPU hardware capabilities, returned by `GpuDevice::capabilities()`.
///
/// Describes the hardware resources and feature support. Used by the DRM
/// driver interface to populate `DRM_IOCTL_GET_CAP` responses and by the
/// scheduler subsystem ([Section 22.1](22-accelerators.md#unified-accelerator-framework)) for workload placement.
///
/// `#[repr(C)]` is required because this struct crosses the KABI boundary
/// (returned by `GpuDevice::capabilities()`) and is used for DRM ioctl
/// responses.
#[repr(C)]
pub struct GpuCapabilities {
    /// Total dedicated VRAM in bytes (0 for integrated GPUs sharing system memory).
    pub vram_size: u64,                         // offset 0,   8 bytes
    /// Maximum visible (CPU-mappable) VRAM aperture in bytes.
    pub visible_vram_size: u64,                 // offset 8,   8 bytes
    /// Number of graphics/universal engines (shader arrays).
    pub num_graphics_engines: u32,              // offset 16,  4 bytes
    /// Number of dedicated compute queues (0 if shared with graphics).
    pub num_compute_queues: u32,                // offset 20,  4 bytes
    /// Number of copy/DMA engines.
    pub num_copy_engines: u32,                  // offset 24,  4 bytes
    /// Number of video decode engines.
    pub num_video_dec_engines: u32,             // offset 28,  4 bytes
    /// Number of video encode engines.
    pub num_video_enc_engines: u32,             // offset 32,  4 bytes
    /// Explicit padding for u64 alignment of `max_bo_size`. Must be zeroed.
    pub _pad0: [u8; 4],                         // offset 36,  4 bytes
    /// Maximum buffer object size in bytes.
    pub max_bo_size: u64,                       // offset 40,  8 bytes
    /// Supported DRM format modifiers (64-bit values). Fixed-size array with
    /// count field for ABI stability (`ArrayVec` is not `#[repr(C)]`).
    /// Queried by userspace via `DRM_IOCTL_MODE_GETFB2`.
    pub supported_modifiers: [DrmFormatModifier; 32], // offset 48, 256 bytes
    /// Number of valid entries in `supported_modifiers` (0..=32).
    pub num_supported_modifiers: u32,           // offset 304, 4 bytes
    /// Whether the hardware supports hardware-accelerated GPU virtual memory
    /// (page tables managed by the GPU MMU, not the host IOMMU).
    pub has_gpu_vm: u8,                         // offset 308, 1 byte  (0=no, 1=yes)
    /// Whether the hardware supports preemptible compute contexts
    /// (mid-shader preemption for real-time scheduling).
    pub has_compute_preempt: u8,                // offset 309, 1 byte  (0=no, 1=yes)
    /// Explicit padding for u32 alignment of `shader_isa_version`.
    pub _pad: [u8; 2],                          // offset 310, 2 bytes
    /// Shader ISA version (vendor-specific encoding; exposed via
    /// `DRM_IOCTL_AMDGPU_INFO` or `DRM_IOCTL_I915_GETPARAM`).
    pub shader_isa_version: u32,                // offset 312, 4 bytes
    /// Maximum GPU clock frequency in MHz (0 if not queryable).
    pub max_clock_mhz: u32,                     // offset 316, 4 bytes
    /// VRAM type and width (e.g., "GDDR6 256-bit"). For diagnostic display.
    pub vram_type: [u8; 32],                    // offset 320, 32 bytes
}                                               // Total: 352 bytes
// Layout: vram_size(8) + visible_vram_size(8) + 5×engines(20) + _pad0(4) +
// max_bo_size(8) + modifiers(256) + num_modifiers(4) + vm(1) + preempt(1) +
// _pad(2) + isa(4) + clock(4) + vram_type(32) = 352 bytes.
const_assert!(core::mem::size_of::<GpuCapabilities>() == 352);

TDR model: The watchdog timer fires every 2 seconds. If a GPU context has not advanced its hardware progress counter since the last tick, the kernel invokes tdr_reset() on that context. The 2-second threshold is adjustable per-device via sysfs (/sys/class/gpu/<dev>/tdr_timeout_ms) by a process holding CAP_GPU_ADMIN. Reducing below 100 ms is not permitted; doing so would produce false positives during legitimate shader compilation stalls.
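The per-tick stall check can be sketched as follows. `GpuCtx`, its fields, and `tdr_tick` are hypothetical stand-ins for the driver's real bookkeeping, shown only to illustrate the progress-counter comparison described above:

```rust
// Hypothetical per-tick TDR check (illustrative names, not the KABI).
struct GpuCtx {
    id: u32,
    progress: u64,  // hardware progress counter, sampled each tick
    last_seen: u64, // value observed at the previous tick
}

/// Returns the ids of contexts that made no progress since the last tick
/// (the candidates for tdr_reset()). Advancing contexts have their
/// last_seen snapshot updated for the next tick.
fn tdr_tick(ctxs: &mut [GpuCtx]) -> Vec<u32> {
    let mut stalled = Vec::new();
    for ctx in ctxs.iter_mut() {
        if ctx.progress == ctx.last_seen {
            stalled.push(ctx.id);         // no forward progress: TDR candidate
        } else {
            ctx.last_seen = ctx.progress; // record the new snapshot
        }
    }
    stalled
}

fn main() {
    let mut ctxs = vec![
        GpuCtx { id: 1, progress: 10, last_seen: 10 }, // stalled
        GpuCtx { id: 2, progress: 7,  last_seen: 5 },  // advancing
    ];
    assert_eq!(tdr_tick(&mut ctxs), vec![1]);
}
```

Raising `tdr_timeout_ms` simply lengthens the interval between calls to this check; the stall criterion itself is unchanged.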

Cross-driver synchronization: DmaFence is the universal cross-driver timeline primitive. The display subsystem (Section 13.3) accepts a DmaFence in atomic_commit to defer scanout until rendering completes. The NPU subsystem (Section 13.8) and the crypto engine (Section 13.11) both use the same DmaFence struct so that inference pipelines and encrypted content pipelines can express multi-stage dependency chains in a single data structure.

Capability gating: CAP_GPU_RENDER is required for context allocation, BO allocation, and command submission. CAP_GPU_ADMIN is additionally required for clock control, performance counter access, and explicit TDR. Both capabilities are checked in the kernel before any hardware register is touched.

Hardware-specific detail: Per-vendor GPU architecture is documented inline in this section (generic GPU KABI), in Chapter 22 (22-accelerators.md) for accelerator scheduling, memory management, and NVIDIA porting (Section 22.7), and in Section 21.5 (21-user-io.md) for display/KMS. Vendor-specific register-level programming (Intel Xe/i915, AMD AMDGPU, ARM Mali Valhall, NVIDIA GSP) is covered during per-driver implementation using vendor documentation.

13.5.1 DMA Fence Behavior on GPU Crash

When a GPU crashes mid-workload, all pending DmaFence values associated with that GPU will never be signaled by the hardware. Without explicit kernel intervention, every waiter — CPU threads blocked in dma_fence_wait(), display scanout pipelines, video encoders, NPU submission queues — blocks indefinitely. UmkaOS resolves all pending fences during the crash handler to unblock waiters immediately.

Fence error types:

/// Error status signaled to fence waiters when the GPU crashes.
#[repr(C, u32)]
pub enum DmaFenceError {
    /// The GPU device was lost entirely. Work was not completed and cannot
    /// be retried on this device without a full device reset and driver reload.
    DeviceLost    = 1,
    /// A specific GPU context was killed (by TDR or an unrecoverable fault),
    /// but the GPU device itself remains operational. Other contexts continue.
    /// Work associated with the killed context was not completed.
    ContextKilled = 2,
}

/// Resolution applied to a pending fence during GPU crash handling.
#[repr(C, u32)]
pub enum FenceCrashResolution {
    /// Signal the fence with an error. Waiters wake up and receive the error.
    /// Used when work is lost and callers must handle the failure.
    SignalError(DmaFenceError),
    /// Signal the fence as completed. Used when the kernel has already rolled
    /// back the associated state and the waiter can safely proceed — for
    /// example, a fence guarding a buffer that has been fully reclaimed.
    SignalComplete,
}
// FenceCrashResolution: discriminant(u32=4) + payload(DmaFenceError=u32=4) = 8 bytes.
const_assert!(core::mem::size_of::<FenceCrashResolution>() == 8);

Fence registry: The GPU driver maintains a per-device fence registry — a SpinLock-protected XArray<u64, FenceRegistryEntry> keyed by DmaFence.value (u64 sequence number), storing FenceRegistryEntry { fence: DmaFence, ctx: GpuContextHandle, waker: Waker } tuples for all outstanding fences. The registry lives in umka-core memory (not in the Tier 1 driver's isolation domain) so it remains accessible during crash recovery after the domain is revoked. The SpinLock is held exclusively during crash-handler fence resolution; the worst-case hold time is O(max_inflight) iterations at ~1 us per fence wake, where max_inflight is bounded by the hardware queue depth (typically 64-4096).

GPU crash handler sequence:

GPU crash detected:
  Source: firmware timeout interrupt, hardware fault interrupt, or TDR watchdog.

1. IOMMU isolation:
   - Revoke the GPU's IOMMU DMA domain (set to fault-on-access).
   - The GPU can no longer read or write system memory via DMA.
   - This is the first action; it happens before any fence signaling.

2. Fence resolution:
   - Acquire the fence registry lock (exclusive).
   - Iterate all pending fences in submission order (FIFO within each context):
     a. Fence associated with a specific GpuContext:
        - Signal with DmaFenceError::ContextKilled.
        - Wake all waiters blocked on this fence.
     b. Fence associated with the device (no context — e.g., inter-GPU dependency):
        - Signal with DmaFenceError::DeviceLost.
        - Wake all waiters blocked on this fence.
   - Release the fence registry lock.
   - Signaling is done under the lock to prevent races with concurrent
     dma_fence_wait() calls that might otherwise block after the lock is
     released but before the fence is signaled.

3. Context invalidation:
   - Iterate all GpuContext objects owned by the crashed GPU.
   - Transition each to GpuContextState::Lost.
   - Any subsequent call on a Lost context returns GpuError::ContextLost.

4. GPU recovery:
   - Attempt TDR reset (single-context kill) if only one context faulted.
   - Escalate to FLR (pcie_flr_with_timeout, Section 11.7.4) if TDR fails
     or the device itself is unresponsive.
   - After successful reset: transition surviving contexts to Suspended;
     the driver attempts to restore their last checkpoint state.
   - If FLR also fails: proceed with the permanent fault sequence in Section 11.7.4.
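Step 2 of the sequence can be sketched in miniature. The types here (`Entry`, a `BTreeMap` registry) are simplified stand-ins for the SpinLock-protected XArray described above; only the context-vs-device resolution rule is illustrated:

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for the real registry types.
#[derive(Debug, PartialEq, Clone, Copy)]
enum DmaFenceError { DeviceLost, ContextKilled }

/// Registry entry: which context (if any) owns the fence, and the error
/// it was resolved with during crash handling.
struct Entry {
    ctx: Option<u32>,                // None = device-level fence
    resolved: Option<DmaFenceError>,
}

/// Crash-handler step 2: resolve every pending fence. BTreeMap iterates
/// in ascending key order, which models FIFO signaling by sequence number.
fn resolve_all(registry: &mut BTreeMap<u64, Entry>) {
    for entry in registry.values_mut() {
        entry.resolved = Some(match entry.ctx {
            Some(_) => DmaFenceError::ContextKilled, // context-scoped fence
            None => DmaFenceError::DeviceLost,       // device-scoped fence
        });
        // Real code wakes all waiters parked on the fence here, while the
        // registry lock is still held (see the race note in step 2).
    }
}

fn main() {
    let mut reg = BTreeMap::new();
    reg.insert(1, Entry { ctx: Some(7), resolved: None });
    reg.insert(2, Entry { ctx: None, resolved: None });
    resolve_all(&mut reg);
    assert_eq!(reg[&1].resolved, Some(DmaFenceError::ContextKilled));
    assert_eq!(reg[&2].resolved, Some(DmaFenceError::DeviceLost));
}
```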

Waiter contract:

Callers using dma_fence_wait() receive Err(DmaFenceError) instead of waiting indefinitely. The return path is identical to a normal timeout — the caller's blocking state is cleared and control returns to the caller with an error code.

UmkaOS does not silently swallow fence errors. Every waiter that was blocked on a GPU fence at crash time receives an explicit error. There is no "signal-and-hope" behavior.

Userspace propagation:

The GPU driver's userspace interface layer maps fence errors to the appropriate API-level error codes:

- Vulkan: VK_ERROR_DEVICE_LOST (full device crash) or VK_ERROR_INITIALIZATION_FAILED (context kill, if the device recovers).
- OpenGL/EGL: EGL_CONTEXT_LOST (ARB_robustness extension).
- CUDA/HIP: cudaErrorLostDevice / hipErrorLostDevice.
- OpenCL: CL_DEVICE_NOT_AVAILABLE.

The driver maps DmaFenceError::DeviceLost → the device-lost variant and DmaFenceError::ContextKilled → the context-lost/robustness variant. Userspace applications that handle robustness extensions (Vulkan Robust Buffer Access, OpenGL ARB_robustness) can recover from context-killed errors without terminating.
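As a sketch, the Vulkan-facing half of this mapping reduces to a two-arm match. The enum is restated locally and `map_to_vk` is an illustrative helper name, not part of the KABI; result codes are rendered as strings for clarity:

```rust
#[derive(Clone, Copy)]
enum DmaFenceError { DeviceLost = 1, ContextKilled = 2 }

/// Illustrative mapping from fence crash errors to Vulkan result codes.
fn map_to_vk(err: DmaFenceError) -> &'static str {
    match err {
        // Full device crash: the application must recreate the VkDevice.
        DmaFenceError::DeviceLost => "VK_ERROR_DEVICE_LOST",
        // Context kill with a surviving device: robustness-aware
        // applications can recreate just the context/pipeline state.
        DmaFenceError::ContextKilled => "VK_ERROR_INITIALIZATION_FAILED",
    }
}

fn main() {
    assert_eq!(map_to_vk(DmaFenceError::DeviceLost), "VK_ERROR_DEVICE_LOST");
    assert_eq!(
        map_to_vk(DmaFenceError::ContextKilled),
        "VK_ERROR_INITIALIZATION_FAILED"
    );
}
```

The OpenGL, CUDA/HIP, and OpenCL translations listed above follow the same two-arm shape, differing only in the constants returned.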

Fence ordering guarantee on crash:

Fences within a single GpuContext are signaled in submission order (FIFO). This preserves the happens-before relationship for recovery code that inspects fence completion order to determine which operations committed before the crash and which did not. Cross-context fences (inter-context dependencies expressed via wait_fences in submit()) are signaled after all fences in the depended-upon context are signaled, maintaining the dependency ordering even in the error path.

13.6 RDMA

Tier: Tier 1. RDMA's defining property is that the hot path (posting work requests, ringing doorbells) never enters the kernel. This requires that the kernel map QP doorbell pages and work-request memory regions directly into userspace. Protection domain management, memory region pinning, and IOMMU programming must therefore reside in the kernel.

KABI interface name: rdma_device_v1 (in interfaces/rdma_device.kabi).

// umka-core/src/rdma/mod.rs — authoritative RDMA driver contract

/// An RDMA-capable network device (InfiniBand HCA, RoCEv2 NIC, iWARP adapter).
/// Implemented by drivers such as Mellanox/NVIDIA mlx5, Broadcom bnxt_re,
/// Intel irdma, and Marvell qedr.
pub trait RdmaDevice: Send + Sync {
    // --- Protection domain ---

    /// Allocate a protection domain. A PD is the unit of authorization:
    /// memory regions, queue pairs, and address handles all belong to exactly
    /// one PD. Objects in different PDs cannot communicate without explicit
    /// cross-registration (which is not currently supported).
    ///
    /// Requires `CAP_RDMA`.
    fn alloc_pd(&self) -> Result<ProtectionDomain, RdmaError>;

    /// Free a protection domain. All child objects (MRs, QPs, AHs) must
    /// have been freed before calling this; returns `RdmaError::PdInUse`
    /// if any remain.
    fn dealloc_pd(&self, pd: ProtectionDomain) -> Result<(), RdmaError>;

    // --- Memory regions ---

    /// Register a memory region. The kernel pins the pages covering
    /// `[addr, addr + length)` in the calling process's address space,
    /// programs the IOMMU to allow DMA from the device, and returns the
    /// local key (`lkey`, for local SGE references) and remote key (`rkey`,
    /// for remote RDMA operations targeting this region). Both keys are
    /// opaque 32-bit values; their encoding is device-specific.
    ///
    /// `access` controls what operations remote peers may perform via
    /// the rkey (read, write, atomic). Local access via lkey always
    /// allows reads and writes.
    ///
    /// Requires `CAP_RDMA`.
    fn alloc_mr(
        &self,
        pd: &ProtectionDomain,
        addr: u64,
        length: u64,
        access: MrAccessFlags,
    ) -> Result<MemoryRegion, RdmaError>;

    /// Deregister a memory region. Pages are unpinned and the IOMMU mapping
    /// is removed. Any in-flight RDMA operation targeting this MR will
    /// complete with a remote access error on the peer side.
    fn dealloc_mr(&self, mr: MemoryRegion) -> Result<(), RdmaError>;

    // --- Completion queues ---

    /// Create a completion queue with capacity for at least `cqe` entries.
    /// The driver may round up to a hardware-convenient size. The actual
    /// capacity is returned in `CompletionQueue::capacity`.
    ///
    /// Requires `CAP_RDMA`.
    fn create_cq(&self, cqe: u32) -> Result<CompletionQueue, RdmaError>;

    /// Destroy a completion queue. All QPs that reference this CQ must be
    /// destroyed first.
    fn destroy_cq(&self, cq: CompletionQueue) -> Result<(), RdmaError>;

    // --- Queue pairs ---

    /// Create a queue pair (send queue + receive queue) associated with the
    /// given protection domain and completion queues. `init_attr` specifies
    /// QP type, initial queue depths, and scatter-gather element counts.
    ///
    /// The QP is created in the RESET state. Call `modify_qp` to transition
    /// it to INIT → RTR → RTS before posting work requests.
    ///
    /// Requires `CAP_RDMA`.
    fn create_qp(
        &self,
        pd: &ProtectionDomain,
        send_cq: &CompletionQueue,
        recv_cq: &CompletionQueue,
        init_attr: &QpInitAttr,
    ) -> Result<QueuePair, RdmaError>;

    /// Transition a queue pair through the state machine (RESET→INIT→RTR→RTS,
    /// or error paths). `attr_mask` indicates which fields in `attr` are valid.
    fn modify_qp(
        &self,
        qp: &mut QueuePair,
        attr: &QpAttr,
        attr_mask: QpAttrMask,
    ) -> Result<(), RdmaError>;

    /// Destroy a queue pair. Any posted work requests are silently discarded.
    fn destroy_qp(&self, qp: QueuePair) -> Result<(), RdmaError>;

    // --- Kernel-bypass doorbell mapping ---

    /// Map the QP doorbell page into the calling process's virtual address
    /// space. Returns the userspace virtual address of the doorbell MMIO page.
    /// The process writes work requests to the QP memory (already mapped via
    /// `mmap` of the QP backing pages) and then writes a 64-bit descriptor to
    /// the doorbell address to ring the hardware. No syscall is needed on the
    /// hot path.
    ///
    /// The mapping is automatically removed when the QP is destroyed or the
    /// process exits.
    ///
    /// Requires `CAP_RDMA`.
    fn map_qp_doorbell(&self, qp: &QueuePair) -> Result<*mut u8, RdmaError>;

    // --- Kernel-side slow path (setup and error recovery only) ---

    /// Post receive work requests to the QP's receive queue. Used only
    /// during initialization and after QP error recovery; the normal path
    /// posts directly from userspace.
    fn post_recv(
        &self,
        qp: &QueuePair,
        wrs: &[RecvWorkRequest],
    ) -> Result<(), RdmaError>;

    /// Post send work requests to the QP's send queue. Used only during
    /// initialization and after QP error recovery.
    fn post_send(
        &self,
        qp: &QueuePair,
        wrs: &[SendWorkRequest],
    ) -> Result<(), RdmaError>;

    // --- Port query ---

    /// Query the state of a physical port. Returns link state, MTU, GID
    /// table entries, port capabilities, and current speed/width.
    fn query_port(&self, port_num: u8) -> Result<PortAttributes, RdmaError>;

    // --- Device query ---

    /// Return static device capabilities (max QPs, max CQEs, max MR size,
    /// supported transport types, atomic operation support, etc.).
    fn query_device(&self) -> DeviceAttributes;

    // --- Async event notification ---

    /// Return a reference to the device's async event ring buffer.
    /// The IB Architecture mandates async event delivery for: QP error
    /// transitions (fatal/access/communication errors), CQ overrun,
    /// port state changes (link up/down), and device-level fatal errors.
    /// The kernel consumes events from this ring in a workqueue to dispatch
    /// to registered event handlers (uverbs async event file, in-kernel
    /// consumers like NFS-RDMA).
    ///
    /// The ring is allocated in umka-core memory (Tier 0), not in the
    /// driver's isolation domain, so events remain accessible after a
    /// Tier 1 driver crash.
    /// `SpscRing` is defined in [Section 3.6](03-concurrency.md#lock-free-data-structures--spsc-ring).
    fn event_ring(&self) -> &SpscRing<RdmaEvent>;
}

/// Asynchronous event delivered by the RDMA device.
/// These events are mandatory per IB Architecture (Section 11.6.2).
#[repr(C, u32)]
pub enum RdmaEvent {
    /// QP transitioned to Error state (fatal, access, or communication error).
    /// `qp_handle` identifies the affected queue pair.
    QpError { qp_handle: u32, error: QpError },
    /// Completion queue overrun — CQEs were lost because the CQ was full.
    CqOverrun { cq_handle: u32 },
    /// Physical port state changed (link up, link down, active).
    PortStateChange { port_num: u8, new_state: PortState },
    /// Device-level fatal error — all QPs and CQs are invalidated.
    DeviceFatal,
}

/// Queue pair error codes (IB Architecture Section 11.6.2).
/// Reported in `RdmaEvent::QpError` when a QP transitions to the Error state.
#[repr(u32)]
pub enum QpError {
    /// Local length error: SG list too short or too long for the posted WR.
    LocalLenError       = 0,
    /// Local QP operation error: internal QP consistency check failed.
    LocalQpOpError      = 1,
    /// Local protection error: MR access violation (lkey invalid or access flags insufficient).
    LocalProtError      = 2,
    /// Work request flushed: QP transitioned to Error state; all outstanding WRs are flushed.
    WrFlushError        = 3,
    /// Memory window bind error: MW bind request violated access permissions.
    MwBindError         = 4,
    /// Remote access error: remote peer's MR access flags do not permit the operation.
    RemoteAccessError   = 5,
    /// Remote operation error: remote peer detected an unrecoverable error.
    RemoteOpError       = 6,
    /// Transport retry counter exceeded: retransmissions exhausted without ACK.
    TransportRetryExceeded = 7,
    /// RNR retry counter exceeded: receiver not ready retries exhausted.
    RnrRetryExceeded    = 8,
}

/// Physical port state (IB Architecture Section 14.2.5.6).
/// Reported in `RdmaEvent::PortStateChange` when a port transitions.
#[repr(u32)]
pub enum PortState {
    /// Port is not operational (link layer not initialized).
    Down       = 0,
    /// Port is initializing (subnet manager configuring).
    Initialize = 1,
    /// Port is armed (awaiting activation by subnet manager).
    Armed      = 2,
    /// Port is fully operational (can send and receive data).
    Active     = 3,
    /// Port is active but deferring sends (congestion management).
    ActiveDefer = 4,
}

/// A protection domain: unit of authorization for RDMA operations.
pub struct ProtectionDomain {
    /// Opaque kernel handle. Constrained to u32 by Linux uverbs ABI
    /// (`ib_uverbs_alloc_pd_resp` uses `__u32` handle). The kernel recycles
    /// freed handles, so values wrap only under sustained allocation with no
    /// intervening deallocation. The same constraint applies to MemoryRegion,
    /// CompletionQueue, and QueuePair handles.
    pub handle: u32,
}

/// A pinned, IOMMU-mapped memory region.
pub struct MemoryRegion {
    /// Opaque kernel handle.
    pub handle: u32,
    /// Local key: used in SGE (scatter-gather element) references.
    pub lkey: u32,
    /// Remote key: presented to a remote peer to authorize RDMA operations
    /// targeting this region.
    pub rkey: u32,
    /// Base virtual address of the registered region.
    pub addr: u64,
    /// Length of the registered region in bytes.
    pub length: u64,
}

/// A completion queue.
pub struct CompletionQueue {
    /// Opaque kernel handle.
    pub handle: u32,
    /// Actual CQ capacity (≥ the requested `cqe`).
    pub capacity: u32,
}

/// A queue pair (RC, UC, UD, or SRQ-attached RC).
pub struct QueuePair {
    /// Opaque kernel handle.
    pub handle: u32,
    /// The QP number used by the remote peer for addressing.
    pub qp_num: u32,
    /// Current QP state.
    pub state: QpState,
}

bitflags! {
    /// Access permissions granted on a memory region to remote peers.
    pub struct MrAccessFlags: u32 {
        /// Remote peer may issue RDMA Read targeting this MR.
        const REMOTE_READ   = 1 << 0;
        /// Remote peer may issue RDMA Write targeting this MR.
        const REMOTE_WRITE  = 1 << 1;
        /// Remote peer may issue atomic operations (CAS, FAA) on this MR.
        const REMOTE_ATOMIC = 1 << 2;
        /// Memory window binding is allowed (for dynamic rkey invalidation).
        const MW_BIND       = 1 << 3;
    }
}

#[repr(u32)]
/// QP state machine states (IB Architecture Specification Section 15.4.3).
pub enum QpState {
    /// Hardware-quiesced state. No WRs are processed.
    Reset  = 0,
    /// Initialized. Receive WRs may be posted; sends are not yet enabled.
    Init   = 1,
    /// Ready To Receive. Path information is configured; receives are active.
    Rtr    = 2,
    /// Ready To Send. Both sends and receives are active.
    Rts    = 3,
    /// Send Queue Drain. Outstanding sends complete, then the send queue
    /// suspends; used before attribute changes or path migration (not an error state).
    Sqd    = 4,
    /// Send Queue Error. The send queue flushed with error completions;
    /// the receive queue remains operational.
    Sqe    = 5,
    /// Error. Both queues have been flushed with error completions.
    Err    = 6,
}

/// Queue pair initialization attributes. Passed to `create_qp()`.
///
/// These attributes are fixed for the lifetime of the QP and cannot be
/// modified after creation.
#[repr(C)]
pub struct QpInitAttr {
    /// QP transport type.
    pub qp_type: QpType,
    /// Maximum outstanding send work requests.
    pub max_send_wr: u32,
    /// Maximum outstanding receive work requests.
    pub max_recv_wr: u32,
    /// Maximum scatter-gather elements per send WR.
    pub max_send_sge: u32,
    /// Maximum scatter-gather elements per receive WR.
    pub max_recv_sge: u32,
    /// Maximum inline data size in bytes for send WRs (0 to disable inline).
    pub max_inline_data: u32,
    /// Signal completion for all send WRs or only flagged WRs.
    pub sq_sig_all: u8, // 0 = only flagged WRs, 1 = all WRs
    /// Explicit padding (repr(C) alignment: u32-aligned struct, offset 25, pad to 28).
    pub _pad: [u8; 3],
}
// QpInitAttr: qp_type(4) + max_send_wr(4) + max_recv_wr(4) + max_send_sge(4) +
//   max_recv_sge(4) + max_inline_data(4) + sq_sig_all(1) + _pad(3) = 28 bytes.
const_assert!(core::mem::size_of::<QpInitAttr>() == 28);

/// Queue pair transport type (IB Architecture Section 3.2).
#[repr(u32)]
pub enum QpType {
    /// Reliable Connected: ordered, reliable delivery. Most common for storage
    /// and cluster IPC. One QP per remote peer.
    Rc  = 2,
    /// Unreliable Connected: connected but no retransmission. Rarely used.
    Uc  = 3,
    /// Unreliable Datagram: connectionless, one QP serves all peers.
    /// Used for subnet management and multicast.
    Ud  = 4,
}

/// Queue pair modification attributes. Passed to `modify_qp()`.
///
/// Only fields indicated by the corresponding `QpAttrMask` bits are read;
/// all other fields are ignored.
#[repr(C)]
pub struct QpAttr {
    /// Target QP state for this transition.
    pub qp_state: QpState,            // offset 0, 4 bytes
    /// Path MTU (128, 256, 512, 1024, 2048, 4096 bytes).
    pub path_mtu: u32,                // offset 4, 4 bytes
    /// Remote QP number (required for INIT→RTR on RC/UC).
    pub dest_qp_num: u32,             // offset 8, 4 bytes
    /// Packet sequence number for receive (INIT→RTR).
    pub rq_psn: u32,                  // offset 12, 4 bytes
    /// Maximum number of outstanding RDMA Reads as responder.
    pub max_dest_rd_atomic: u8,       // offset 16, 1 byte
    /// Minimum RNR NAK timer (0-31, encoding per IB spec).
    pub min_rnr_timer: u8,            // offset 17, 1 byte
    /// Explicit padding to align ah_attr to u32 (4-byte boundary for flow_label).
    pub _pad_pre_ah: [u8; 2],         // offset 18, 2 bytes
    /// Address handle for the primary path (contains GRH, DLID, SL, etc.).
    pub ah_attr: AddressHandle,       // offset 20, 32 bytes
    /// Packet sequence number for send (RTR→RTS).
    pub sq_psn: u32,                  // offset 52, 4 bytes
    /// Maximum outstanding RDMA Read/Atomic as initiator.
    pub max_rd_atomic: u8,            // offset 56, 1 byte
    /// Retry count for transport errors (0-7).
    pub retry_cnt: u8,                // offset 57, 1 byte
    /// RNR retry count (0-7; 7 means infinite).
    pub rnr_retry: u8,                // offset 58, 1 byte
    /// Timeout for ack (0-31, encoding per IB spec: 4.096 * 2^timeout usec).
    pub timeout: u8,                  // offset 59, 1 byte
    /// Queue key for UD QPs. Required for datagram sends.
    pub qkey: u32,                    // offset 60, 4 bytes
}
// QpAttr size: 4(qp_state) + 4(path_mtu) + 4(dest_qp_num) + 4(rq_psn)
//   + 1(max_dest_rd_atomic) + 1(min_rnr_timer) + 2(_pad_pre_ah) + 32(ah_attr)
//   + 4(sq_psn) + 1(max_rd_atomic) + 1(retry_cnt) + 1(rnr_retry) + 1(timeout)
//   + 4(qkey) = 64 bytes.
const_assert!(core::mem::size_of::<QpAttr>() == 64);

bitflags! {
    /// Mask indicating which fields in `QpAttr` are valid for this modify call.
    /// Matches Linux UAPI `enum ibv_qp_attr_mask` values.
    pub struct QpAttrMask: u32 {
        const QP_STATE              = 1 << 0;
        const CUR_STATE             = 1 << 1;
        const EN_SQD_ASYNC_NOTIFY   = 1 << 2;
        const ACCESS_FLAGS          = 1 << 3;
        const PKEY_INDEX            = 1 << 4;
        const PORT                  = 1 << 5;
        const QKEY                  = 1 << 6;
        const AV                    = 1 << 7;
        const PATH_MTU              = 1 << 8;
        const TIMEOUT               = 1 << 9;
        const RETRY_CNT             = 1 << 10;
        const RNR_RETRY             = 1 << 11;
        const RQ_PSN                = 1 << 12;
        const MAX_QP_RD_ATOMIC      = 1 << 13;
        const ALT_PATH              = 1 << 14;
        const MIN_RNR_TIMER         = 1 << 15;
        const SQ_PSN                = 1 << 16;
        const MAX_DEST_RD_ATOMIC    = 1 << 17;
        const PATH_MIG_STATE        = 1 << 18;
        const CAP                   = 1 << 19;
        const DEST_QPN              = 1 << 20;
    }
}

/// Address handle: routing information for reaching a remote port.
/// Matches Linux `ib_uverbs_ah_attr` layout (32 bytes) for RDMA ABI compatibility.
/// Embedded in `QpAttr` (which is `#[repr(C)]`), so must have stable layout.
///
/// Contains GRH (Global Route Header) fields as a flat inline structure
/// matching Linux's `ib_uverbs_global_route` (24 bytes) followed by the
/// local addressing fields.
#[repr(C)]
pub struct AddressHandle {
    // --- GRH fields (matches Linux ib_uverbs_global_route, 24 bytes) ---
    /// Global Routing Header destination GID (128-bit IPv6-format address).
    pub dgid: [u8; 16],         // offset 0, 16 bytes
    /// GRH flow label (20 bits used). Required for RoCEv2 ECMP load balancing
    /// — without this, all connections share flow label 0 on spine-leaf fabrics,
    /// defeating multi-path hashing. Linux `rdma_set_ah_attr_from_wc()` and
    /// `ib_init_ah_attr_from_wc()` both populate this field.
    pub flow_label: u32,         // offset 16, 4 bytes
    /// Source GID index in the local port's GID table.
    pub sgid_index: u8,          // offset 20
    /// GRH hop limit (TTL). Without this, routed InfiniBand fabrics will
    /// not forward packets (hop_limit=0 means "do not route").
    pub hop_limit: u8,           // offset 21
    /// GRH traffic class (DSCP + ECN). Required for QoS differentiation
    /// on InfiniBand and RoCEv2 networks.
    pub traffic_class: u8,       // offset 22
    /// Reserved (matches Linux ib_uverbs_global_route.reserved).
    pub _grh_reserved: u8,       // offset 23

    // --- Local addressing fields ---
    /// Destination Local Identifier. Primary addressing for InfiniBand
    /// (non-RoCE). Without this, QPs cannot reach remote peers on IB networks.
    /// For RoCEv2, set to 0 (GRH/dgid is used instead).
    pub dlid: u16,               // offset 24, 2 bytes
    /// Service Level (0-15, for QoS differentiation).
    pub sl: u8,                  // offset 26
    /// Source path bits (for LID masking — multi-port HCAs).
    /// Default 0 works on most fabrics.
    pub src_path_bits: u8,       // offset 27
    /// Static rate encoding (IB spec Table 45). Controls inter-packet gap
    /// for rate matching between different link speeds. Default 0 = full speed.
    pub static_rate: u8,         // offset 28
    /// Flag indicating whether GRH is present. When `is_global == 0`,
    /// `dgid`/`flow_label`/etc. are ignored and `dlid` is the sole routing field.
    pub is_global: u8,           // offset 29
    /// Port number on the local HCA.
    pub port_num: u8,            // offset 30
    /// Reserved (matches Linux ib_uverbs_ah_attr.reserved).
    pub _reserved: u8,           // offset 31
    // Total: 16(dgid) + 4(flow_label) + 1(sgid_index) + 1(hop_limit)
    //   + 1(traffic_class) + 1(_grh_reserved) + 2(dlid) + 1(sl) + 1(src_path_bits)
    //   + 1(static_rate) + 1(is_global) + 1(port_num) + 1(_reserved) = 32 bytes.
}
const_assert!(core::mem::size_of::<AddressHandle>() == 32);

Kernel-bypass model: After map_qp_doorbell() and mmap of the QP work queue memory, userspace RDMA libraries (libibverbs, rdma-core) operate entirely without kernel involvement on the send path. The kernel is re-entered only for: QP state transitions, CQ overflow recovery, error handling, and address handle (AH) creation. This model is compatible with the OpenMPI and UCX transports used by HPC applications.
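The doorbell ring itself is a single volatile 64-bit store to the page returned by map_qp_doorbell(). A minimal sketch (with a plain u64 standing in for the MMIO page; `ring_doorbell` is an illustrative helper, not part of the KABI):

```rust
/// Hot-path doorbell ring: after the work-request descriptors are written
/// to the mmap'ed QP queue memory, one volatile 64-bit store to the
/// doorbell page notifies the HCA. No syscall occurs on this path.
///
/// Safety: `doorbell` must point to the mapped doorbell page (or, as in
/// the test below, to ordinary writable memory standing in for it).
unsafe fn ring_doorbell(doorbell: *mut u8, descriptor: u64) {
    // Volatile prevents the compiler from eliding or reordering the MMIO
    // store relative to the preceding descriptor writes.
    core::ptr::write_volatile(doorbell as *mut u64, descriptor);
}

fn main() {
    let mut fake_page: u64 = 0; // stand-in for the MMIO doorbell page
    unsafe { ring_doorbell(&mut fake_page as *mut u64 as *mut u8, 0x1234_5678) };
    assert_eq!(fake_page, 0x1234_5678);
}
```

A real driver additionally issues a write barrier before the doorbell store so that the descriptor writes are globally visible to the device first.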

IOMMU integration: alloc_mr programs the device's IOMMU domain (same model as Section 13.5 GpuContext) so that only the registered address range is accessible to the device. A buffer overflow in an RDMA payload cannot reach outside the registered MR.

Multikernel integration: The distributed lock manager (Section 15.15) and the inter-node IPC transport (Section 5.1) both use RDMA as their high-speed fabric. The RDMA protection domain model maps directly to UmkaOS capability domains: each cluster node that participates in the multikernel has one PD per trust domain.

IB verbs compatibility: The RdmaDevice trait is a strict superset of the IB verbs interface exposed by Linux's ib_verbs.h. The umka-sysapi layer translates ibv_* library calls to the corresponding RdmaDevice methods, allowing unmodified rdma-core, OpenMPI, and OpenFabrics applications to run.
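For illustration, the attr-mask bits a caller must supply at each step of the standard RC bring-up (RESET→INIT→RTR→RTS) can be tabulated. The bit values follow the QpAttrMask definition in this section and the usual IB verbs convention; `required_mask` is a hypothetical helper for validation, not part of the KABI:

```rust
// Simplified local restatement of the QP states involved in bring-up.
#[derive(Clone, Copy, PartialEq, Debug)]
enum QpState { Reset, Init, Rtr, Rts }

/// Returns the QpAttrMask bits required for the given RC transition, or
/// None if the transition is not part of the RESET→INIT→RTR→RTS sequence.
fn required_mask(from: QpState, to: QpState) -> Option<u32> {
    const QP_STATE: u32 = 1 << 0;
    const ACCESS_FLAGS: u32 = 1 << 3;
    const PKEY_INDEX: u32 = 1 << 4;
    const PORT: u32 = 1 << 5;
    const AV: u32 = 1 << 7;
    const PATH_MTU: u32 = 1 << 8;
    const TIMEOUT: u32 = 1 << 9;
    const RETRY_CNT: u32 = 1 << 10;
    const RNR_RETRY: u32 = 1 << 11;
    const RQ_PSN: u32 = 1 << 12;
    const MAX_QP_RD_ATOMIC: u32 = 1 << 13;
    const MIN_RNR_TIMER: u32 = 1 << 15;
    const SQ_PSN: u32 = 1 << 16;
    const MAX_DEST_RD_ATOMIC: u32 = 1 << 17;
    const DEST_QPN: u32 = 1 << 20;
    match (from, to) {
        // RESET→INIT: bind the QP to a port, partition, and access rights.
        (QpState::Reset, QpState::Init) =>
            Some(QP_STATE | PKEY_INDEX | PORT | ACCESS_FLAGS),
        // INIT→RTR: configure the path (AV), remote QP, and receive PSN.
        (QpState::Init, QpState::Rtr) =>
            Some(QP_STATE | AV | PATH_MTU | DEST_QPN | RQ_PSN
                 | MAX_DEST_RD_ATOMIC | MIN_RNR_TIMER),
        // RTR→RTS: enable sends with timeout/retry policy and send PSN.
        (QpState::Rtr, QpState::Rts) =>
            Some(QP_STATE | TIMEOUT | RETRY_CNT | RNR_RETRY | SQ_PSN
                 | MAX_QP_RD_ATOMIC),
        _ => None,
    }
}

fn main() {
    // INIT→RTR must carry the AV (bit 7) and DEST_QPN (bit 20) bits.
    let m = required_mask(QpState::Init, QpState::Rtr).unwrap();
    assert!(m & (1 << 7) != 0 && m & (1 << 20) != 0);
    // Skipping states is not part of the bring-up sequence.
    assert!(required_mask(QpState::Reset, QpState::Rts).is_none());
}
```

The umka-sysapi translation layer performs the equivalent checks when forwarding ibv_modify_qp calls to modify_qp.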

Hardware-specific detail: Per-vendor RDMA driver architecture (Mellanox/NVIDIA mlx5, Intel irdma, Broadcom bnxt_re, Marvell qedr) is documented inline in this section and covered during per-driver implementation using vendor documentation and the IB verbs specification.

13.7 Video / Media Pipeline

Tier: Tier 1 for hardware codec engines (Intel Quick Sync, AMD VCN, Qualcomm Venus, Mediatek VENC/VDEC, Apple VideoToolbox-equivalent hardware). Tier 2 for pure software codecs: a CPU-based ffmpeg instance is already a userspace process and requires no special KABI beyond ordinary DMA-BUF file descriptor passing and shared memory.

KABI interface name: media_device_v1 (in interfaces/media_device.kabi).

// umka-core/src/media/mod.rs — authoritative media pipeline driver contract

/// A hardware media processing device (codec engine, ISP, or similar).
/// Implemented by drivers for SoC video IP blocks and discrete capture cards.
pub trait MediaDevice: Send + Sync {
    // --- Capability discovery ---

    /// Enumerate the codec configurations supported by the hardware,
    /// writing up to `buf.len()` entries and returning the number written.
    /// Each entry specifies codec type, profile, level, maximum resolution,
    /// maximum frame rate, and whether encode and/or decode is supported.
    fn query_codecs(&self, buf: &mut [CodecCapability]) -> Result<u32, MediaError>;

    // --- Session lifecycle ---

    /// Create a codec session. `config` specifies the codec, direction
    /// (encode or decode), input/output pixel formats, and initial encoding
    /// parameters (bitrate, QP, keyframe interval, rate control mode) for
    /// encode sessions or output pixel format (NV12, P010, etc.) for decode.
    ///
    /// Returns a `MediaSession` handle used for subsequent buffer operations.
    fn create_session(
        &self,
        config: &SessionConfig,
    ) -> Result<MediaSession, MediaError>;

    /// Destroy a codec session. All queued buffers are flushed and returned
    /// with `BufferState::Error` before the session handle is invalidated.
    fn destroy_session(&self, session: MediaSession) -> Result<(), MediaError>;

    /// Graceful end-of-stream drain. Signals the codec that no more input
    /// buffers will be submitted for this session. The codec flushes all
    /// internally-buffered frames and produces output buffers for each.
    /// The final output buffer is marked with `BufferState::LastFrame`.
    /// Subsequent `dequeue_buf()` calls after the last frame return
    /// `MediaError::Drained`.
    ///
    /// Equivalent to V4L2 `VIDIOC_DECODER_CMD(V4L2_DEC_CMD_STOP)` for
    /// decoders and `VIDIOC_ENCODER_CMD(V4L2_ENC_CMD_STOP)` for encoders.
    /// After drain completes, the session remains valid and can accept
    /// new input buffers (re-entering the active state).
    fn drain(&self, session: &MediaSession) -> Result<(), MediaError>;

    // --- Buffer queue ---

    /// Submit an input buffer (as a DMA-BUF handle) to the session for
    /// processing. For encode sessions the buffer contains raw video frames;
    /// for decode sessions it contains compressed bitstream data.
    ///
    /// `sequence` is a caller-assigned, monotonically increasing sequence
    /// number, returned with the corresponding output buffer so the caller
    /// can match outputs to inputs even when they complete out of order.
    fn queue_buf(
        &self,
        session: &MediaSession,
        buf: DmaBufHandle,
        sequence: u64,
        flags: QueueFlags,
    ) -> Result<(), MediaError>;

    /// Retrieve the next completed output buffer. Blocks until a buffer is
    /// available or the session is destroyed. Returns the DMA-BUF handle of
    /// the output, the input sequence number it corresponds to, and a
    /// `DmaFence` ([Section 13.5](#gpu-compute)) that is signaled when the hardware has finished
    /// writing to the buffer (the caller must wait on this fence before
    /// reading the buffer contents from CPU or passing it to the display).
    fn dequeue_buf(
        &self,
        session: &MediaSession,
    ) -> Result<DequeuedBuffer, MediaError>;

    // --- Media graph topology ---

    /// Return all pads (typed I/O ports) belonging to this device node.
    fn pads(&self) -> &[MediaPad];

    /// Create a directed link between an output pad of this device and an
    /// input pad of another device. Both pads must be compatible (same
    /// pixel format, resolution, and frame rate). Returns a `MediaLink`
    /// handle. Enabling the link causes DMA-BUFs to flow from the source
    /// pad to the sink pad without copying.
    /// **KABI note**: `sink_device: &dyn MediaDevice` is a fat pointer (vtable + data
    /// pointer) and cannot cross KABI domain boundaries. This method is called
    /// exclusively within the Tier 0 core (the media controller infrastructure),
    /// not by Tier 1 drivers directly. Tier 1 drivers request link creation via
    /// a KABI opcode (`MediaKabiOp::CreateLink`) that passes `DeviceNodeId` +
    /// `PadId` as integer values; the Tier 0 core resolves these to `&dyn MediaDevice`
    /// references within its own address space.
    fn create_link(
        &self,
        src_pad: PadId,
        sink_device: &dyn MediaDevice,
        sink_pad: PadId,
        format: LinkFormat,
    ) -> Result<MediaLink, MediaError>;

    /// Destroy a link, stopping buffer flow between the two pads.
    fn destroy_link(&self, link: MediaLink) -> Result<(), MediaError>;

    // --- Dynamic parameter updates ---

    /// Update encoding parameters on a running encode session without
    /// destroying and recreating it. Only encode-direction parameters
    /// (bitrate, QP range, keyframe force) may be updated this way.
    fn update_encode_params(
        &self,
        session: &MediaSession,
        params: &EncodeParams,
    ) -> Result<(), MediaError>;
}

/// Pad identifier: index within a media device's pad array (0-based).
/// Matches Linux's `struct media_pad_desc.index` (u16).
pub type PadId = u16;

/// Direction of a media pad.
#[repr(u32)]
pub enum PadDirection {
    /// Pad produces DMA-BUFs (output of a video source or encoder).
    Source = 0,
    /// Pad consumes DMA-BUFs (input to a decoder or display).
    Sink   = 1,
}

/// Pixel format and frame size advertised by a media pad.
#[repr(C)]
pub struct PadFormat {
    /// V4L2 pixel format FourCC (e.g., `V4L2_PIX_FMT_NV12`).
    pub pixelformat: u32,
    /// Frame width in pixels.
    pub width: u32,
    /// Frame height in pixels.
    pub height: u32,
    /// Frame rate numerator (e.g., 30 for 30 fps).
    pub fps_numerator: u32,
    /// Frame rate denominator (e.g., 1 for 30 fps).
    pub fps_denominator: u32,
}
// PadFormat: 5 × u32 = 20 bytes.
const_assert!(size_of::<PadFormat>() == 20);

/// Negotiated format for a media link between two pads.
/// Describes the data format flowing from the source pad to the sink pad.
#[repr(C)]
pub struct LinkFormat {
    /// V4L2 pixel format FourCC carried on the link.
    pub pixelformat: u32,
    /// Frame width in pixels.
    pub width: u32,
    /// Frame height in pixels.
    pub height: u32,
}
// LinkFormat: 3 × u32 = 12 bytes.
const_assert!(size_of::<LinkFormat>() == 12);

/// A codec session handle.
///
/// `#[repr(C)]` for KABI stability — returned by `create_session()` across
/// the KABI boundary.
#[repr(C)]
pub struct MediaSession {
    /// Opaque kernel handle.
    pub handle: u64,
    /// Session direction (Encode or Decode).
    pub direction: CodecDirection,
    /// Explicit trailing padding for u64 struct alignment.
    pub _pad0: [u8; 4],
}
// MediaSession: handle(u64=8) + direction(u32=4) + _pad0(4) = 16 bytes.
const_assert!(size_of::<MediaSession>() == 16);

/// A directed link between two media pads. The link transfers ownership of
/// each DMA-BUF from the source pad to the sink pad atomically.
///
/// `#[repr(C)]` for KABI stability — returned by `create_link()` across the
/// KABI boundary.
#[repr(C)]
pub struct MediaLink {
    /// Opaque kernel handle.
    /// Constrained to u32 by Linux Media Controller ABI (`struct media_link_desc`
    /// uses `__u32` for link IDs). Allocated from an Idr (recycling allocator):
    /// handles are reused when links are destroyed, so only simultaneous link
    /// count matters, not cumulative allocations. At 1000 link creations/sec,
    /// monotonic wrap would occur in ~49.7 days; with Idr recycling, exhaustion
    /// requires >4 billion simultaneous links (impossible — physical device count
    /// bounds this). Acceptable for 50-year uptime.
    pub handle: u32,
    /// Source device pad identifier.
    pub src_pad: PadId,
    /// Sink device pad identifier.
    pub sink_pad: PadId,
    /// Negotiated format carried on this link.
    pub format: LinkFormat,
}
// MediaLink: handle(u32=4) + src_pad(u16=2) + sink_pad(u16=2) + LinkFormat(12) = 20 bytes.
const_assert!(size_of::<MediaLink>() == 20);

/// Maximum number of pixel formats a single media pad can advertise.
/// V4L2 devices typically support 10-30 formats per pad; 64 provides
/// headroom for multi-plane and vendor-specific formats. Devices with
/// more formats than this must split across multiple pads. If a driver
/// attempts to register more than 64 formats on a single pad, the
/// registration call returns `Err(KernelError::ResourceExhausted)` and
/// an FMA warning is emitted identifying the driver and pad.
pub const MEDIA_PAD_MAX_FORMATS: usize = 64;

/// A typed I/O port on a media device.
///
/// `#[repr(C)]` for KABI stability — returned by `MediaDevice::pads()`,
/// `CameraDevice::pads()`, and `CameraSubdevice::pads()`.
#[repr(C)]
pub struct MediaPad {
    /// Identifier unique within the owning device.
    pub id: PadId,
    /// Explicit padding (PadId = u16, PadDirection = u32 alignment).
    pub _pad0: [u8; 2],
    /// Whether this pad produces (Source) or consumes (Sink) DMA-BUFs.
    pub direction: PadDirection,
    /// Set of pixel formats and frame sizes this pad can accept or produce.
    /// Bounded to avoid heap allocation across the KABI boundary.
    /// Stored as flat array + count for stable C layout.
    pub supported_formats: [PadFormat; MEDIA_PAD_MAX_FORMATS],
    /// Number of valid entries in `supported_formats` (0..=MEDIA_PAD_MAX_FORMATS).
    pub format_count: u16,
    /// Explicit trailing padding.
    pub _pad1: [u8; 2],
}
// MediaPad: PadId(u16=2) + _pad0(2) + PadDirection(u32=4) + [PadFormat;64](20×64=1280) +
//   format_count(u16=2) + _pad1(2) = 1292 bytes.
const_assert!(size_of::<MediaPad>() == 1292);

/// Buffer lifecycle state. Returned with each dequeued buffer to indicate
/// the processing outcome.
#[repr(u32)]
pub enum BufferState {
    /// Buffer contains valid output data (normal completion).
    Done      = 0,
    /// Buffer is the last frame produced by a `drain()` operation.
    /// No more output buffers will be produced until new input is queued.
    LastFrame = 1,
    /// Buffer contents are invalid due to a codec error or session reset.
    /// The DMA-BUF may contain partial data and should be discarded.
    Error     = 2,
    /// Buffer was returned without processing because `destroy_session()`
    /// was called while it was queued.
    Cancelled = 3,
}

/// A completed output buffer returned by `dequeue_buf`.
///
/// `#[repr(C)]` for KABI stability — crosses the KABI boundary on every
/// `dequeue_buf()` return.
#[repr(C)]
pub struct DequeuedBuffer {
    /// DMA-BUF handle of the output data. For encode: compressed bitstream.
    /// For decode: raw frame in the pixel format requested in `SessionConfig`.
    pub buf: DmaBufHandle,
    /// Caller-assigned sequence number from the corresponding `queue_buf`.
    pub sequence: u64,
    /// Processing outcome for this buffer.
    pub state: BufferState,
    /// Fence signaled when hardware has finished writing to `buf`. The
    /// caller MUST wait on this fence before reading or forwarding the buffer.
    pub ready_fence: DmaFence,
}

/// Video codec type identifier.
///
/// UmkaOS-native sequential values. The V4L2 compatibility layer maps these to
/// V4L2 stateless codec API control IDs at the sysapi boundary.
/// Each variant identifies a specific video compression standard that the hardware
/// codec may support.
#[repr(u32)]
pub enum CodecType {
    /// H.264 (AVC). Most widely supported hardware codec.
    H264    = 0,
    /// H.265 (HEVC). 4K/8K primary codec on modern hardware.
    H265    = 1,
    /// VP9. WebM container standard (Google). Hardware decode on Intel Gen9+, AMD VCN 1+.
    Vp9     = 2,
    /// AV1. Open royalty-free codec (AOMedia). Hardware decode on Intel DG2+, AMD VCN 4+.
    Av1     = 3,
    /// JPEG. Still image codec (also used for MJPEG streams).
    Jpeg    = 4,
    /// VP8. Legacy WebM codec.
    Vp8     = 5,
    /// MPEG-2. Legacy broadcast codec (DVD, DVB-T).
    Mpeg2   = 6,
}

/// Media format identifier. Used for both pixel formats (raw video) and
/// compressed bitstream containers.
///
/// Pixel format values match the V4L2 `V4L2_PIX_FMT_*` fourcc codes.
/// Container format values use the Linux MEDIA_BUS_FMT_* convention.
#[repr(u32)]
pub enum MediaFormat {
    // --- Raw pixel formats (decode output / encode input) ---
    /// NV12: Y plane + interleaved UV plane, 4:2:0, 8-bit. Default decode output.
    Nv12      = 0x3231_564E,
    /// P010: Y plane + interleaved UV plane, 4:2:0, 10-bit (16-bit storage).
    /// Used for HDR content (BT.2020).
    P010      = 0x3031_3050,
    /// NV21: Y plane + interleaved VU plane, 4:2:0, 8-bit (Android default).
    Nv21      = 0x3132_564E,
    /// YUYV: packed 4:2:2, 8-bit. Common intermediate format.
    Yuyv      = 0x5659_5559,

    // --- Compressed bitstream containers (decode input / encode output) ---
    /// H.264 Annex B byte stream (0x00000001 start codes).
    H264AnnexB = 0x3436_3248,
    /// H.265 Annex B byte stream.
    HevcAnnexB = 0x4356_4548,
    /// VP9 IVF container.
    Vp9Ivf     = 0x3930_5056,
    /// AV1 OBU (Open Bitstream Units) format.
    Av1Obu     = 0x3130_5641,
    /// Motion JPEG / JPEG frames. Note: this is the `MJPG` fourcc
    /// (`V4L2_PIX_FMT_MJPEG`), a stream of JFIF-framed images.
    Jpeg       = 0x4750_4A4D,
}

/// Configuration for a new codec session.
#[repr(C)]
pub struct SessionConfig {
    /// Codec type (H264, H265, AV1, VP9, JPEG, etc.).
    pub codec: CodecType,
    /// Encode or Decode.
    pub direction: CodecDirection,
    /// Input pixel format (for encode) or bitstream container (for decode).
    pub input_format: MediaFormat,
    /// Output pixel format (for decode: NV12, P010, etc.; for encode: N/A).
    pub output_format: MediaFormat,
    /// Initial encode parameters (ignored for decode sessions).
    pub encode_params: EncodeParams,
}
// SessionConfig: CodecType(u32=4) + CodecDirection(u32=4) + MediaFormat(u32=4) +
//   MediaFormat(u32=4) + EncodeParams(16) = 32 bytes.
const_assert!(size_of::<SessionConfig>() == 32);

/// Encoding parameters. All fields are writable after session creation via
/// `update_encode_params`.
#[repr(C)]
pub struct EncodeParams {
    /// Target bitrate in bits per second. 0 means CQP (constant QP) mode.
    /// u32 max ~4.3 Gbps; sufficient for all current hardware codec engines.
    pub bitrate_bps: u32,
    /// Minimum quantization parameter (lower = better quality, larger frames).
    pub qp_min: u8,
    /// Maximum quantization parameter.
    pub qp_max: u8,
    /// Explicit padding for u32 alignment of `keyframe_interval`.
    pub _pad0: [u8; 2],
    /// Force a keyframe every N frames. 0 disables periodic keyframes.
    pub keyframe_interval: u32,
    /// Rate control mode (CBR, VBR, CQP, CRF).
    pub rc_mode: RateControlMode,
}
const_assert!(size_of::<EncodeParams>() == 16);

#[repr(u32)]
pub enum CodecDirection {
    /// Hardware encoder: raw frames in, compressed bitstream out.
    Encode = 0,
    /// Hardware decoder: compressed bitstream in, raw frames out.
    Decode = 1,
}

#[repr(u32)]
pub enum RateControlMode {
    /// Constant bitrate. Buffer fullness is maintained; quality varies.
    Cbr = 0,
    /// Variable bitrate. Average bitrate target; quality peaks on I-frames.
    Vbr = 1,
    /// Constant quantization parameter. Bitrate varies; quality is fixed.
    Cqp = 2,
    /// Constant rate factor (quality-based VBR, similar to x264 CRF).
    Crf = 3,
}

bitflags! {
    /// Flags for `queue_buf`.
    pub struct QueueFlags: u32 {
        /// Mark this buffer as the last in a stream (EOS). The session will
        /// flush and return all pending output buffers after processing this
        /// input.
        const END_OF_STREAM = 1 << 0;
        /// Force a keyframe on this input buffer (encode only).
        const FORCE_KEYFRAME = 1 << 1;
    }
}

Buffer graph model: A complete media pipeline is a directed acyclic graph of MediaDevice nodes connected by MediaLink edges. DMA-BUFs flow from source pads to sink pads without copying. A typical pipeline:

[camera sensor] → [ISP] → [encoder] → [network or file]

The ISP and encoder are separate MediaDevice instances. The link between them carries DMA-BUFs whose lifetime is managed by the producing node; the consuming node signals via a DmaFence when it has finished reading the buffer so the producer can reuse it.
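The link-compatibility rule stated for `create_link` (same pixel format, resolution, and frame rate on both pads) can be sketched as a pure predicate. This is an illustrative sketch, not part of the media_device_v1 KABI: `PadFmt` and `link_compatible` are hypothetical names, and frame rates are compared as cross-multiplied ratios so that 30/1 and 60/2 compare equal.

```rust
/// Illustrative stand-in for the pad format fields `create_link` compares.
#[derive(Clone, Copy, PartialEq)]
struct PadFmt {
    pixelformat: u32, // V4L2 fourcc
    width: u32,
    height: u32,
    fps_num: u32,
    fps_den: u32,
}

/// A source pad and a sink pad may be linked only if they agree on pixel
/// format, frame size, and frame rate. The rate comparison cross-multiplies
/// the fractions to avoid integer-division rounding.
fn link_compatible(src: PadFmt, sink: PadFmt) -> bool {
    src.pixelformat == sink.pixelformat
        && src.width == sink.width
        && src.height == sink.height
        && (src.fps_num as u64 * sink.fps_den as u64)
            == (sink.fps_num as u64 * src.fps_den as u64)
}
```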

V4L2 M2M compatibility: umka-sysapi translates V4L2 memory-to-memory device ioctls (VIDIOC_QBUF, VIDIOC_DQBUF, VIDIOC_STREAMON) on M2M nodes to queue_buf / dequeue_buf / session start. The pixel format negotiation (VIDIOC_S_FMT) maps to SessionConfig field selection. Applications using libv4l2 or GStreamer's v4l2h264enc / v4l2h264dec elements run unmodified.
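The `MediaFormat` discriminants earlier in this section are V4L2 fourcc codes stored little-endian (byte 0 is the least-significant byte). A quick sketch of that encoding — the `fourcc` helper is illustrative, mirroring what V4L2's `v4l2_fourcc()` macro computes:

```rust
/// Pack four ASCII characters into a little-endian fourcc code.
const fn fourcc(a: u8, b: u8, c: u8, d: u8) -> u32 {
    (a as u32) | ((b as u32) << 8) | ((c as u32) << 16) | ((d as u32) << 24)
}

// Reproducing discriminants used by MediaFormat above:
const NV12: u32 = fourcc(b'N', b'V', b'1', b'2'); // 0x3231_564E
const P010: u32 = fourcc(b'P', b'0', b'1', b'0'); // 0x3031_3050
const H264: u32 = fourcc(b'H', b'2', b'6', b'4'); // 0x3436_3248
```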

Hardware-specific detail: Per-vendor media codec and camera ISP driver architecture (Intel Quick Sync/GuC/HuC, AMD VCN, Qualcomm Venus, MediaTek VENC/VDEC, camera ISP — ARM Mali C71, Qualcomm Spectra) is documented inline in this section. Camera/video capture architecture is in Section 13.16.
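The handle-exhaustion estimate in the `MediaLink` doc comment (u32 handle space, 1000 link creations/sec, ~49.7 days to monotonic wrap without recycling) can be checked arithmetically. A sketch, with `days_until_u32_wrap` as an illustrative helper:

```rust
/// Worst-case monotonic (non-recycling) consumption of a 32-bit handle
/// space, in days, at a given allocation rate.
fn days_until_u32_wrap(allocs_per_sec: u64) -> f64 {
    // 2^32 handles / rate = seconds to wrap; divide by 86 400 s/day.
    (u32::MAX as u64 + 1) as f64 / allocs_per_sec as f64 / 86_400.0
}
```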

13.8 AI / NPU Accelerator

Tier: Tier 1. Large model weight tensors require physically contiguous DMA allocations that the kernel memory allocator must satisfy. Inference latency requirements (< 1 ms first-token for edge models) preclude the overhead of a Tier 2 boundary crossing on each inference submission.

KABI interface name: accel_device_v1 (in interfaces/accel_device.kabi).

// umka-core/src/accel/mod.rs — authoritative NPU/accelerator driver contract

/// A hardware accelerator device: NPU, DSP, or tensor processor.
/// Implemented by drivers for Qualcomm Hexagon, Intel VPU (Meteor Lake NPU),
/// Apple ANE (via open-source reimplementation), MediaTek APU, and custom
/// ASICs.
pub trait AccelDevice: Send + Sync {
    // --- Buffer object management (shared model with [Section 13.5](#gpu-compute)) ---

    /// Allocate a buffer object in accelerator-accessible memory. `size` is
    /// in bytes (page-aligned). `placement` selects between accelerator-local
    /// SRAM/DRAM, coherent system memory, or non-coherent DMA-able system
    /// memory depending on what the hardware supports.
    ///
    /// Requires `CAP_ACCEL_INFERENCE`.
    fn alloc_bo(
        &self,
        size: u64,
        placement: AccelPlacementFlags,
    ) -> Result<BufferObject, AccelError>;

    /// Free a buffer object. Must not be in use by a model or in-flight
    /// inference when freed.
    fn free_bo(&self, bo: BufferObject) -> Result<(), AccelError>;

    // --- Model lifecycle ---

    /// Upload a pre-compiled model blob (produced by the vendor NPU compiler
    /// running in userspace) to the accelerator. The blob format is
    /// device-specific and opaque to the kernel; the kernel validates only
    /// its size and alignment constraints. The kernel does NOT JIT-compile or
    /// interpret the blob; it DMA-copies it to accelerator SRAM/DRAM and
    /// registers it with the firmware.
    ///
    /// Returns a `ModelHandle` used in subsequent `submit_inference` calls.
    ///
    /// Requires `CAP_ACCEL_INFERENCE`.
    fn load_model(
        &self,
        blob: DmaBufHandle,
        blob_size: u64,
    ) -> Result<ModelHandle, AccelError>;

    /// Unload a model, freeing accelerator SRAM and deregistering the model
    /// from firmware. Any in-flight inference using this model must complete
    /// before calling this; returns `AccelError::ModelInUse` if not.
    fn unload_model(&self, model: ModelHandle) -> Result<(), AccelError>;

    // --- Inference submission ---

    /// Submit an inference request. `input` is a DMA-BUF containing the
    /// input tensor data in the layout expected by the model (described in
    /// the model blob metadata). `output` is a DMA-BUF that the accelerator
    /// will write inference results to. Both buffers must be at least as
    /// large as the model's declared input/output tensor sizes.
    ///
    /// `wait_fences` lists `DmaFence` ([Section 13.5](#gpu-compute)) values that must be signaled
    /// before the inference begins (e.g., a camera frame that is still being
    /// written by the ISP). Returns a `DmaFence` signaled when the output
    /// tensor is complete and `output` is safe to read.
    ///
    /// Requires `CAP_ACCEL_INFERENCE`.
    fn submit_inference(
        &self,
        model: &ModelHandle,
        input: DmaBufHandle,
        output: DmaBufHandle,
        wait_fences: &[DmaFence],
    ) -> Result<DmaFence, AccelError>;

    // --- Capability query ---

    /// Return static device capabilities: supported data types (INT8, FP16,
    /// BF16, FP32), maximum model size in bytes, maximum batch size, list of
    /// supported operator sets (ONNX opset version, TFLite version, etc.),
    /// and hardware performance counters layout.
    fn query_capabilities(&self) -> AccelCapabilities;

    // --- TDR ---

    /// Reset the accelerator after a hung or timed-out inference. The kernel
    /// calls this automatically when an inference does not complete within
    /// the configured TDR timeout (default: 30 s for large models, adjustable
    /// via `/sys/class/accel/<dev>/tdr_timeout_ms` with `CAP_ACCEL_ADMIN`).
    ///
    /// All sessions on the device are reset. In-flight inferences return
    /// `AccelError::Timeout` (-ETIMEDOUT) to their callers. If the hardware
    /// supports per-session context isolation, only the hung session is
    /// terminated; other sessions resume.
    ///
    /// Requires `CAP_ACCEL_ADMIN`.
    fn tdr_reset(&self) -> Result<(), AccelError>;

    // --- Power management ---

    /// Suspend the accelerator. Called during system suspend (S3/S4) or
    /// runtime PM idle timeout. The driver must:
    /// 1. Quiesce all in-flight inferences (wait for completion or cancel).
    /// 2. Save loaded model state to system memory (DMA from device SRAM
    ///    to a kernel-allocated bounce buffer, if the device has dedicated
    ///    memory).
    /// 3. Power down the device (clock gate, power gate, or PCIe D3hot).
    ///
    /// On failure, the suspend path aborts and the device remains active
    /// ([Section 7.5](07-scheduling.md#suspend-resume-and-runtime-pm--power-state-machine)).
    fn suspend(&self) -> Result<(), AccelError>;

    /// Resume the accelerator after suspend. The driver must:
    /// 1. Power up and re-initialize the device hardware.
    /// 2. Restore previously loaded models from the saved state buffer.
    ///    If model restoration fails (e.g., firmware incompatibility after
    ///    a live kernel evolution), return `AccelError::ModelLost`. Callers
    ///    must re-load affected models via `load_model()`.
    /// 3. Mark all sessions as ready.
    fn resume(&self) -> Result<(), AccelError>;
}

/// A loaded model handle.
/// Returned by `AccelDevice::load_model()` and passed to `submit_inference()`
/// across the KABI boundary. `#[repr(C)]` is required to prevent the compiler
/// from reordering the four u64 fields — a driver reading `handle` at offset 0
/// could get `blob_size` if the kernel placed fields differently.
#[repr(C)]
pub struct ModelHandle {
    /// Opaque kernel handle.
    pub handle: u64,
    /// Size of the model blob in bytes.
    pub blob_size: u64,
    /// Required input tensor size in bytes.
    pub input_size: u64,
    /// Required output tensor size in bytes.
    pub output_size: u64,
}
// ModelHandle: 4 × u64 = 32 bytes, no padding (all fields alignment 8).
const_assert!(core::mem::size_of::<ModelHandle>() == 32);

/// Static capabilities of an accelerator device.
/// Returned by `AccelDevice::query_capabilities()` across the KABI boundary.
#[repr(C)]
pub struct AccelCapabilities {
    /// Peak INT8 throughput in tera-operations per second.
    pub tops_int8: u32,
    /// Peak FP16 throughput in tera-operations per second.
    pub tops_fp16: u32,
    /// Accelerator-local memory size in bytes (SRAM + on-package DRAM).
    pub local_memory_bytes: u64,
    /// Maximum single model blob size in bytes.
    pub max_model_size_bytes: u64,
    /// Supported numeric data types.
    pub data_types: AccelDataTypeFlags,
    /// Supported operator sets (bitmask: ONNX, TFLite, QNN, OpenVINO IR).
    pub operator_sets: AccelOpSetFlags,
}
// AccelCapabilities: tops_int8(u32=4) + tops_fp16(u32=4) + local_memory_bytes(u64=8)
//   + max_model_size_bytes(u64=8) + data_types(u32=4) + operator_sets(u32=4) = 32 bytes.
const_assert!(core::mem::size_of::<AccelCapabilities>() == 32);

bitflags! {
    /// Numeric data types the accelerator can execute natively.
    pub struct AccelDataTypeFlags: u32 {
        const INT8  = 1 << 0;
        const INT16 = 1 << 1;
        const FP16  = 1 << 2;
        const BF16  = 1 << 3;
        const FP32  = 1 << 4;
    }
}

bitflags! {
    /// Supported operator set languages.
    pub struct AccelOpSetFlags: u32 {
        /// ONNX opset (any version accepted by this device's firmware).
        const ONNX      = 1 << 0;
        /// TensorFlow Lite flatbuffer format.
        const TFLITE    = 1 << 1;
        /// Qualcomm QNN binary format.
        const QNN       = 1 << 2;
        /// Intel OpenVINO IR format.
        const OPENVINO  = 1 << 3;
    }
}

bitflags! {
    /// Where to source backing memory for an accelerator buffer object.
    pub struct AccelPlacementFlags: u32 {
        /// Accelerator-local SRAM or on-package DRAM (highest bandwidth).
        const ACCEL_LOCAL = 1 << 0;
        /// System DRAM, coherent with CPU caches.
        const SYSTEM_COHERENT = 1 << 1;
        /// System DRAM, non-coherent (explicit cache flush/invalidate needed).
        const SYSTEM_NONCOHERENT = 1 << 2;
    }
}

Compiler model: The kernel never compiles or JIT-translates model graphs. Vendor SDKs (Qualcomm QNN SDK, Intel OpenVINO, Google XNNPACK, Arm Ethos toolchain) run entirely in userspace and produce a hardware-specific binary blob. The kernel's role is limited to loading that blob into accelerator memory, managing its lifetime, and scheduling inference jobs. This boundary keeps the attack surface small and avoids incorporating license-encumbered compiler code into the kernel.
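Since the kernel treats the blob as opaque, the checks in `load_model()` reduce to size and alignment validation before the DMA copy. A minimal sketch of that gate — `validate_blob`, `BlobError`, and the power-of-two DMA alignment assumption are illustrative, not part of accel_device_v1:

```rust
#[derive(Debug, PartialEq)]
enum BlobError {
    Empty,
    TooLarge,
    BadAlignment,
}

/// Validate a model blob before DMA-copying it to accelerator memory.
/// The kernel never parses blob contents; it checks only that the size is
/// non-zero, within the device's max_model_size_bytes, and aligned to the
/// device's DMA alignment (assumed here to be a power of two).
fn validate_blob(blob_size: u64, max_model_size: u64, dma_align: u64) -> Result<(), BlobError> {
    if blob_size == 0 {
        return Err(BlobError::Empty);
    }
    if blob_size > max_model_size {
        return Err(BlobError::TooLarge);
    }
    if blob_size & (dma_align - 1) != 0 {
        return Err(BlobError::BadAlignment);
    }
    Ok(())
}
```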

Shared synchronization with GPU: AccelDevice uses DmaFence (Section 13.5) for all completion signaling. A camera-to-inference pipeline can therefore express its dependencies as:

camera_fence = ISP_submit(frame)
infer_fence  = accel.submit_inference(model, input, output, &[camera_fence])
display_fence = compositor.atomic_commit(plane, &[infer_fence])

No additional synchronization primitive is needed.
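The gating rule behind that pipeline — a submission may begin only once every fence in `wait_fences` has signaled — can be sketched with fences mocked as points on a timeline. `MockFence` and `ready_to_start` are toy names for illustration, not KABI types:

```rust
/// Mock of a DmaFence: the timeline point this fence waits for, plus the
/// timeline progress observed so far.
#[derive(Clone, Copy)]
struct MockFence {
    value: u64,          // point on the timeline this fence waits for
    signaled_up_to: u64, // timeline's currently signaled value
}

impl MockFence {
    fn is_done(&self) -> bool {
        self.signaled_up_to >= self.value
    }
}

/// submit_inference()-style gating: hardware may start only when every
/// wait fence has signaled. An empty fence list imposes no dependency.
fn ready_to_start(wait_fences: &[MockFence]) -> bool {
    wait_fences.iter().all(MockFence::is_done)
}
```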

Capability gating: CAP_ACCEL_INFERENCE gates buffer allocation, model loading, and inference submission. CAP_ACCEL_ADMIN additionally gates TDR, thermal policy override, and access to hardware performance counters.

Hardware-specific detail: Per-vendor NPU/DSP driver architecture (Qualcomm Hexagon, Intel Meteor Lake NPU/OpenVINO, MediaTek APU, ONNX Runtime FPGA backend) is documented inline in this section and in Sections 22.1–22.46 (22-accelerators.md) for the unified accelerator scheduling framework.

13.9 DMA Engine

Tier: Tier 1. DMA engines are platform infrastructure directly used by other Tier 1 subsystems (audio DMA in Section 13.4, display framebuffer DMA in Section 13.3, storage DMA in Section 11.8). They must operate in kernel context to program IOMMU tables and to route completion interrupts to the correct waiters.

KABI interface name: dma_engine_v1 (in interfaces/dma_engine.kabi).

// umka-core/src/dma_engine/mod.rs — authoritative DMA engine driver contract

/// A platform DMA engine controller. Implemented by drivers for Intel CBDMA /
/// DSA, ARM PL330, Synopsys eDMA, TI UDMA, and Xilinx AXI DMA.
pub trait DmaEngine: Send + Sync {
    /// Request a DMA channel from this engine. `capabilities` specifies the
    /// minimum set of capabilities the channel must provide (e.g.,
    /// `MEM_TO_MEM | SCATTER_GATHER`). The engine selects the best-matching
    /// channel from its pool; returns `DmaError::NoChannel` if none is
    /// available.
    ///
    /// On ACPI platforms, the channel is cross-referenced to the CSRT entry
    /// describing it. On DT platforms, the channel is cross-referenced to the
    /// `dmas` phandle in the requesting device's DT node. See the ACPI/DT
    /// enumeration note below.
    fn request_channel(
        &self,
        capabilities: DmaChannelCapabilities,
    ) -> Result<Arc<dyn DmaChannel>, DmaError>;

    /// Release a DMA channel back to the engine's pool. The channel must not
    /// have any in-flight transactions (`DmaFence` values that have not yet
    /// been signaled) when released; returns `DmaError::ChannelBusy` if so.
    /// (`Arc<dyn DmaChannel>` is a kernel-internal trait object per the
    /// Section 13.1 note; it never crosses the KABI boundary.)
    fn release_channel(&self, channel: Arc<dyn DmaChannel>) -> Result<(), DmaError>;
}

/// A DMA channel: a single logical DMA stream backed by one hardware channel.
pub trait DmaChannel: Send + Sync {
    /// Submit a flat memory-to-memory copy of `len` bytes from physical
    /// address `src_pa` to physical address `dst_pa`. Returns a `DmaFence`
    /// signaled when the copy is complete.
    ///
    /// Both addresses must be within IOMMU-mapped regions. The caller is
    /// responsible for cache coherency (flush source, invalidate destination)
    /// on non-coherent platforms before and after the transfer.
    fn memcpy(
        &self,
        dst_pa: u64,
        src_pa: u64,
        len: u64,
    ) -> Result<DmaFence, DmaError>;

    /// Submit a scatter-gather copy. `entries` is a list of
    /// `(src_pa, dst_pa, len)` tuples. The engine processes entries in order.
    /// Returns a single `DmaFence` signaled after all entries complete.
    ///
    /// The maximum number of entries per call is bounded by
    /// `DmaChannelInfo::max_sg_entries`; split across multiple calls if
    /// needed.
    fn sg_copy(
        &self,
        entries: &[(u64, u64, u64)],
    ) -> Result<DmaFence, DmaError>;

    /// Fill `len` bytes starting at physical address `dst_pa` with the
    /// repeating byte pattern `value`. Returns a `DmaFence` signaled on
    /// completion. Used for zeroing newly allocated pages and clearing
    /// framebuffers.
    fn fill(
        &self,
        dst_pa: u64,
        len: u64,
        value: u8,
    ) -> Result<DmaFence, DmaError>;

    /// Return static information about this channel (capabilities,
    /// maximum transfer size, maximum scatter-gather entry count).
    fn channel_info(&self) -> DmaChannelInfo;
}

/// A DMA completion handle. Uses the canonical `DmaFence` from
/// [Section 22.1](22-accelerators.md#unified-accelerator-framework--dmafence):
/// - `device_id` = DMA engine's DeviceNodeId
/// - `context_id` = DMA channel index (u32)
/// - `value` = sequence number on the channel's completion timeline
/// - `fence_type` = `DmaFenceType::DeviceLocal`
///
/// Cheap to copy; backed by a hardware status word.
pub use crate::accel::DmaFence;

/// DmaFence operations are provided by the DMA engine driver via the KABI vtable.
/// The methods below document the contract; actual implementation is in the driver.
pub trait DmaFenceOps: Send + Sync {
    /// Poll whether this DMA transfer has completed. Returns immediately
    /// without blocking. Safe to call from interrupt context.
    fn is_done(&self, fence: &DmaFence) -> bool;

    /// Block the current thread until this DMA transfer completes or until
    /// `timeout_ns` nanoseconds elapse. Returns `Ok(())` on completion,
    /// `Err(DmaError::Timeout)` on timeout.
    fn wait(&self, fence: &DmaFence, timeout_ns: u64) -> Result<(), DmaError>;
}

/// Static information about a DMA channel.
///
/// `#[repr(C)]` is required because `DmaChannelInfo` is returned by
/// `DmaChannel::channel_info()`, a KABI vtable method — the struct
/// crosses the KABI boundary.
#[repr(C)]
pub struct DmaChannelInfo {
    /// Capabilities of this specific channel.
    pub capabilities: DmaChannelCapabilities, // 4 bytes  (offset 0)
    /// Explicit padding between u32 `capabilities` and u64 `max_transfer_bytes`
    /// (repr(C) alignment: u64 requires 8-byte alignment).
    pub _pad0: [u8; 4],                       // 4 bytes  (offset 4)
    /// Maximum number of bytes per single `memcpy` or `fill` call.
    pub max_transfer_bytes: u64,              // 8 bytes  (offset 8)
    /// Maximum number of scatter-gather entries per `sg_copy` call.
    pub max_sg_entries: u32,                  // 4 bytes  (offset 16)
    /// Whether this channel's transfers are observable by the CPU without
    /// an explicit cache flush (i.e., the DMA path is cache-coherent).
    /// 0 = false (non-coherent, explicit cache flush needed), 1 = true.
    // Derived from capabilities.contains(COHERENT). Set by the kernel DMA
    // engine core in channel_info(), not by the driver. Provided as a
    // convenience for callers; the capabilities bitflag is authoritative.
    pub coherent: u8,                         // 1 byte   (offset 20)
    /// Explicit trailing padding for u64 struct alignment.
    pub _pad1: [u8; 3],                       // 3 bytes  (offset 21)
    // Total: 4 + 4 + 8 + 4 + 1 + 3 = 24 bytes.
}
const_assert!(size_of::<DmaChannelInfo>() == 24);

bitflags! {
    /// Capabilities that a DMA channel may provide.
    pub struct DmaChannelCapabilities: u32 {
        /// Memory-to-memory flat copy.
        const MEM_TO_MEM     = 1 << 0;
        /// Memory-to-device transfers (device is the sink).
        const MEM_TO_DEV     = 1 << 1;
        /// Device-to-memory transfers (device is the source).
        const DEV_TO_MEM     = 1 << 2;
        /// Scatter-gather transfer support.
        const SCATTER_GATHER = 1 << 3;
        /// Memory fill (pattern write, used for zeroing).
        const FILL           = 1 << 4;
        /// Cache-coherent DMA path (no manual flush/invalidate required).
        const COHERENT       = 1 << 5;
    }
}

Shared infrastructure model: DmaChannel is the common abstraction for all bulk-data DMA in UmkaOS. Subsystems that need DMA use it as follows:

  • Audio (Section 13.4): the PcmStream DMA ring uses a MEM_TO_DEV or DEV_TO_MEM channel obtained from the audio controller's built-in DMA or from a platform DMA engine channel bound in ACPI/DT.
  • Display (Section 13.3): cursor and framebuffer uploads on platforms without a GPU use a MEM_TO_DEV channel.
  • Storage (Section 11.8): on platforms where the storage controller does not have its own scatter-gather engine, DmaChannel::sg_copy is used for PRD tables.

On platforms where the device has its own built-in DMA (NVMe PRPs, AHCI PRDT, PCIe DMA engines on GPUs), the device driver does not use DmaEngine at all; the built-in DMA is programmed directly and the completion is reported via the device's own interrupt.
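The capability-driven path selection a DmaChannel consumer performs can be sketched in isolation. The following is illustrative only — the `Strategy` helper and the plain-`u32` capability constants are stand-ins (real callers test the `DmaChannelCapabilities` bitflags directly), and the `coherent` check mirrors the convenience field in `DmaChannelInfo`:

```rust
// Sketch: choosing a DmaChannel transfer strategy from its capability bits.
// The bit values mirror DmaChannelCapabilities above; the helper itself is
// illustrative, not part of the contract.

const MEM_TO_MEM: u32     = 1 << 0;
const SCATTER_GATHER: u32 = 1 << 3;
const COHERENT: u32       = 1 << 5;

#[derive(Debug, PartialEq)]
enum Strategy {
    /// One sg_copy call covering all segments.
    SgCopy,
    /// One memcpy per segment (channel lacks scatter-gather).
    PerSegmentMemcpy,
    /// Channel cannot do memory-to-memory at all; caller falls back to a CPU copy.
    CpuFallback,
}

fn pick_strategy(caps: u32, segment_count: usize) -> Strategy {
    if caps & MEM_TO_MEM == 0 {
        Strategy::CpuFallback
    } else if segment_count > 1 && caps & SCATTER_GATHER != 0 {
        Strategy::SgCopy
    } else {
        Strategy::PerSegmentMemcpy
    }
}

/// Non-coherent channels additionally require an explicit cache flush before
/// the transfer result is observable (the `coherent` convenience field above).
fn needs_cache_flush(caps: u32) -> bool {
    caps & COHERENT == 0
}
```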

ACPI/DT enumeration: On ACPI platforms, DMA engine channels are described in the ACPI CSRT (Core System Resources Table, documented in the Microsoft Core System Resources Table (CSRT) specification, not part of the ACPI specification itself). The kernel's ACPI layer parses CSRT at boot and registers each channel group as a DmaEngine instance. Consumers reference channels by ACPI _CRS DMA descriptor. On Device Tree platforms, channels are described using the dmas and dma-names properties in the consuming device node, following the DMA Engine binding in the Linux kernel DT bindings (used as the authoritative reference for this property format).

Hardware-specific detail: Per-platform DMA engine driver architecture (Intel CBDMA/DSA, ARM PL330, TI UDMA-P on AM65x/J7, Synopsys eDMA on PCIe controllers) is documented inline in this section. Platform-specific channel discovery uses ACPI CSRT or Device Tree dmas/dma-names properties.

13.10 GPIO and Pin Control

Tier: Tier 1. GPIO controllers are low-level platform hardware directly used by many other Tier 1 drivers for chip-select lines, reset/enable signals, and interrupt routing. GPIO interrupts must be demultiplexed in the kernel IRQ subsystem (Section 11.2) before they can be delivered to drivers or (via eventfd) to userspace.

KABI interface names: gpio_controller_v1, pinctrl_v1 (in interfaces/gpio.kabi).

// umka-core/src/gpio/mod.rs — authoritative GPIO and pin control contract

/// A GPIO controller. One instance per hardware GPIO IP block (which may
/// expose dozens to hundreds of individual lines). Implemented by drivers for
/// Intel Broxton/Cannon Lake PCH GPIO, ARM PL061, NXP RGPIO, Qualcomm TLMM,
/// and Broadcom BCM2835 GPIO.
pub trait GpioController: Send + Sync {
    // --- Pin configuration ---

    /// Configure a GPIO line's direction, pull resistor, and drive mode.
    /// Must be called before `read` or `write` on the line.
    fn configure(
        &self,
        line: GpioLine,
        direction: GpioDirection,
        pull: GpioPull,
        drive: GpioDrive,
    ) -> Result<(), GpioError>;

    // --- Digital I/O ---

    /// Read the current logic level of an input (or output in read-back mode)
    /// GPIO line. Returns `true` for high, `false` for low. Returns
    /// `GpioError::NotInput` if the line is configured as output-only and the
    /// hardware does not support output read-back.
    fn read(&self, line: GpioLine) -> Result<bool, GpioError>;

    /// Set the output level of an output-configured GPIO line. `high` = true
    /// drives the line high; `high` = false drives it low. Returns
    /// `GpioError::NotOutput` if the line is configured as input.
    fn write(&self, line: GpioLine, high: bool) -> Result<(), GpioError>;

    // --- Interrupt registration ---

    /// Register an interrupt handler for a GPIO line. `mode` selects the
    /// edge or level trigger condition. `handler` is called in a Tier 1
    /// threaded interrupt context ([Section 3.12](03-concurrency.md#irq-chip-and-irqdomain-hierarchy) threaded IRQ model). Returns a
    /// `GpioIrqHandle`; dropping the handle atomically deregisters the
    /// handler and ensures no further invocations occur.
    ///
    /// Only one handler may be registered per line at a time; returns
    /// `GpioError::AlreadyRegistered` if a handler is already registered.
    fn request_irq(
        &self,
        line: GpioLine,
        mode: IrqMode,
        handler: GpioIrqHandler,
    ) -> Result<GpioIrqHandle, GpioError>;

    /// Deregister the interrupt handler associated with `handle`. Equivalent
    /// to dropping the `GpioIrqHandle` but provides an explicit error return.
    fn free_irq(&self, handle: GpioIrqHandle) -> Result<(), GpioError>;

    // --- Controller metadata ---

    /// Return the number of GPIO lines managed by this controller.
    fn line_count(&self) -> u32;

    /// Return the controller's unique identifier (used to construct
    /// `GpioLine` handles for cross-subsystem use).
    fn controller_id(&self) -> u32;
}

/// A pin control block. Manages the per-pin function multiplexer on SoCs
/// where physical pads can be assigned to multiple peripheral signals
/// (GPIO, I2C, SPI, UART, PCIe reference clock, etc.).
///
/// On platforms where pin multiplexing is co-located inside the GPIO
/// controller, both traits are implemented by the same driver struct.
pub trait PinCtrl: Send + Sync {
    /// Query the list of functions available for a given pin index. Writes
    /// `PinFunction` values into the caller-supplied buffer and returns the
    /// number written. Each entry has a name (e.g., "gpio", "i2c_sda",
    /// "spi_clk", "uart_tx") and the peripheral it routes to.
    /// KABI note: uses caller-supplied buffer, not Vec, for C driver compat.
    fn query_functions(&self, pin: u32, buf: &mut [PinFunction], max_count: u32) -> Result<u32, PinCtrlError>;

    /// Select a function for a pin, connecting the physical pad to the
    /// named peripheral signal. Any previously selected function is
    /// deactivated. Returns `PinCtrlError::Conflict` if another driver has
    /// claimed this pin in an incompatible function.
    fn select_function(
        &self,
        pin: u32,
        function: &PinFunction,
    ) -> Result<(), PinCtrlError>;

    /// Release ownership of a pin, returning it to a default high-impedance
    /// state. Safe to call even if no function is currently selected.
    fn release_pin(&self, pin: u32) -> Result<(), PinCtrlError>;
}
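The caller-supplied-buffer convention noted on `query_functions()` can be shown in miniature. This is a sketch, not the contract: the free-standing `query_functions` below models only the fill-and-count behavior (truncating when the buffer is small, where a real driver might instead return an error per its `PinCtrlError` contract), and `mk` is a hypothetical constructor:

```rust
// Sketch of the caller-supplied-buffer pattern used by query_functions():
// the callee writes into a fixed-capacity buffer and returns the count,
// avoiding heap allocation across the KABI boundary. The slice length plays
// the role of the max_count parameter.

#[repr(C)]
#[derive(Clone, Copy)]
pub struct PinFunction {
    pub name: [u8; 32],      // null-terminated C string, zero-padded
    pub peripheral_id: u32,
}

/// Hypothetical helper: build a PinFunction from a short name.
fn mk(name: &str, peripheral_id: u32) -> PinFunction {
    let mut buf = [0u8; 32];
    buf[..name.len()].copy_from_slice(name.as_bytes());
    PinFunction { name: buf, peripheral_id }
}

/// Callee side: copy up to buf.len() entries, return how many were written.
fn query_functions(available: &[PinFunction], buf: &mut [PinFunction]) -> u32 {
    let n = available.len().min(buf.len());
    buf[..n].copy_from_slice(&available[..n]);
    n as u32
}
```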

/// Handle to a single GPIO line: the combination of a controller and a
/// zero-based pin index within that controller.
///
/// This type is referenced by [Section 13.13](#i2csmbus-bus-framework) (I2C-HID interrupt line) and [Section 13.4](#audio-subsystem)
/// (audio jack detection) and is formally defined here. All other subsystems
/// that reference a GPIO line MUST use this type.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct GpioLine {
    /// Identifier of the `GpioController` that owns this line.
    pub controller_id: u32,
    /// Zero-based index of the line within the controller (0 …
    /// `controller.line_count() - 1`).
    pub pin_index: u32,
}

/// RAII handle for a registered GPIO interrupt. Dropping this value
/// deregisters the handler. Implemented as a token that the kernel associates
/// with the registration record; no raw pointers are exposed.
pub struct GpioIrqHandle {
    /// Opaque kernel handle. The kernel uses this to locate and remove the
    /// registration entry on drop.
    pub(crate) handle: u64,
}

impl Drop for GpioIrqHandle {
    /// Deregister the GPIO interrupt handler. Guaranteed to be called even
    /// if the owning driver panics, preventing stale handlers from firing
    /// after the driver struct is freed.
    fn drop(&mut self) { /* kernel deregistration via syscall or direct call */ }
}

/// Type alias for a GPIO interrupt handler function pointer.
/// The handler is called in a threaded interrupt context ([Section 3.12](03-concurrency.md#irq-chip-and-irqdomain-hierarchy)). It must not
/// block indefinitely; it may acquire short-duration spinlocks and queue
/// work to a kernel work queue.
///
/// **Per-instance context**: The handler receives `GpioLine` (not a raw void
/// pointer). Drivers that need per-instance context use the GPIO line
/// number as an index into a driver-local `ArrayVec` or `XArray` of
/// per-line state (analogous to Linux's `void *dev_id` pattern but
/// type-safe). The `GpioLine` is available at `request_irq()` time and
/// can be stored as the lookup key.
pub type GpioIrqHandler = fn(line: GpioLine, mode: IrqMode);
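The per-instance-context pattern described above can be sketched as follows. `GpioLine` and `IrqMode` are restated locally so the sketch is self-contained; `LineState`, `MyDriver`, and the use of `Vec` (rather than the `ArrayVec`/`XArray` a real driver would use, with locking) are illustrative assumptions:

```rust
// Sketch: the handler receives only the GpioLine, so the driver keys its
// per-line state off the pin index -- the type-safe analogue of Linux's
// `void *dev_id`.

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct GpioLine { pub controller_id: u32, pub pin_index: u32 }

#[derive(Clone, Copy, Debug)]
pub enum IrqMode { RisingEdge, FallingEdge, BothEdges, HighLevel, LowLevel }

/// Hypothetical per-line driver state (e.g., an I2C-HID ATTN line).
#[derive(Default)]
struct LineState { irq_count: u64 }

/// Driver-local table indexed by pin_index; sized at probe time from
/// controller.line_count().
struct MyDriver { lines: Vec<LineState> }

impl MyDriver {
    /// The function registered via request_irq(). Looks up per-line state
    /// by pin index.
    fn on_irq(&mut self, line: GpioLine, _mode: IrqMode) {
        if let Some(state) = self.lines.get_mut(line.pin_index as usize) {
            state.irq_count += 1;
            // ... queue work, wake a waiter, etc.
        }
    }
}
```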

/// Available trigger modes for GPIO interrupts.
#[repr(u32)]
pub enum IrqMode {
    /// Trigger on a low-to-high transition.
    RisingEdge  = 0,
    /// Trigger on a high-to-low transition.
    FallingEdge = 1,
    /// Trigger on both transitions.
    BothEdges   = 2,
    /// Trigger while the line is held high (level-triggered).
    HighLevel   = 3,
    /// Trigger while the line is held low (level-triggered).
    LowLevel    = 4,
}

#[repr(u32)]
/// GPIO line direction.
pub enum GpioDirection {
    /// Line is an input; the driver reads the external logic level.
    Input  = 0,
    /// Line is an output; the driver drives the logic level.
    Output = 1,
}

#[repr(u32)]
/// Internal pull resistor configuration.
pub enum GpioPull {
    /// No pull resistor (high impedance when not driven).
    None     = 0,
    /// Weak pull-up to VCC.
    PullUp   = 1,
    /// Weak pull-down to GND.
    PullDown = 2,
}

#[repr(u32)]
/// Output drive mode.
pub enum GpioDrive {
    /// Totem-pole (push-pull): the driver actively drives both high and low.
    PushPull   = 0,
    /// Open-drain: the driver only pulls low; high is achieved by an external
    /// pull-up. Required for I2C bus lines and wired-AND configurations.
    OpenDrain  = 1,
}

/// A multiplexable function available on a SoC pin.
///
/// `#[repr(C)]` for KABI stability — filled by driver via
/// `PinCtrl::query_functions()` across the KABI boundary.
#[repr(C)]
pub struct PinFunction {
    /// Human-readable function name (e.g., "gpio", "i2c0_sda", "uart2_tx").
    /// Fixed-size, null-terminated C string for KABI compatibility. 32 bytes
    /// covers all realistic pin function names.
    pub name: [u8; 32],
    /// The peripheral subsystem this function connects to (e.g., I2C
    /// controller index 0, UART controller index 2).
    pub peripheral_id: u32,
}
// PinFunction: name(32) + peripheral_id(4) = 36 bytes. No pointer fields;
// size is identical on 32-bit and 64-bit targets.
const_assert!(core::mem::size_of::<PinFunction>() == 36);

IRQ model: request_irq() registers the handler with the kernel IRQ subsystem (Section 11.2). The GPIO controller's top-level interrupt line is demuxed by the GPIO driver: on each top-level interrupt, the driver reads the controller's pending interrupt register, identifies which lines are active, and dispatches the registered handlers for those lines in threaded IRQ context. Handlers run at a normal kernel thread priority with preemption enabled unless the handler explicitly raises its priority via the Section 11.2 scheduling API.
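The demux step above can be sketched for a hypothetical 32-line controller. Register access is simulated as plain fields (a real driver performs MMIO reads and a write-one-to-clear acknowledge), and handler storage is simplified to an array of optional function pointers:

```rust
// Sketch: on the controller's top-level interrupt, read the pending register
// once, dispatch each active line's registered handler, then acknowledge.

type Handler = fn(line: u32);

struct GpioDemux {
    pending: u32,                       // stand-in for the pending-IRQ register
    handlers: [Option<Handler>; 32],    // per-line registrations
}

impl GpioDemux {
    /// Called from the threaded top-level IRQ. Returns the number of line
    /// handlers dispatched; pending lines with no registered handler are
    /// still acknowledged (a real driver would also mask them).
    fn demux(&mut self) -> u32 {
        let mut dispatched = 0;
        let pending = self.pending;     // one read of the hardware register
        for line in 0..32u32 {
            if pending & (1 << line) != 0 {
                if let Some(h) = self.handlers[line as usize] {
                    h(line);
                    dispatched += 1;
                }
            }
        }
        self.pending = 0;               // write-1-to-clear acknowledge (simulated)
        dispatched
    }
}
```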

ACPI/DT enumeration: On ACPI platforms, GPIO lines are described using GpioInt (interrupt) and GpioIo (I/O) resource descriptors in device _CRS methods, following the GpioInt and GpioIo resource descriptor macros defined in the ACPI specification's ASL reference. The GPIO subsystem resolves these descriptors to GpioLine handles and registers IRQs automatically during device enumeration. On Device Tree platforms, the gpios phandle-with-args property and the standard GPIO binding (two-cell format: <&gpio_controller pin_index flags>) are parsed to produce GpioLine handles.

Cross-reference with Section 13.13: The GpioLine type and request_irq() method used by the I2C-HID driver (Section 13.13) to register the ATTN interrupt are formally defined by this Section 13.10 contract. Section 13.13 is authoritative on how I2C-HID uses GPIO; this section is authoritative on what GPIO provides.

Hardware-specific detail: Per-platform GPIO/pinctrl driver architecture (Intel PCH GPIO — Broxton/Cannon Lake/Tiger Lake, ARM PL061, Qualcomm TLMM, NXP i.MX IOMUXC, Broadcom BCM2835/2711) is documented inline in this section. Each platform's pin mux register layout is covered during per-driver implementation using vendor datasheets.

13.11 Crypto Accelerator

Tier: Tier 1. Hardware crypto engines need DMA access to key material and plaintext/ciphertext buffers. Tier 2 boundary crossing would add one to two microseconds per operation — unacceptable for TLS session establishment (RSA or ECDH operations on the critical path of every connection) and bulk record encryption (AES-GCM on every TCP segment with TLS offload).

KABI interface name: crypto_engine_v1 (in interfaces/crypto_engine.kabi).

// umka-core/src/crypto_engine/mod.rs — authoritative crypto accelerator contract

/// A hardware cryptographic accelerator. Implemented by drivers for:
/// - On-SoC crypto engines (Intel QAT, ARM TrustZone CryptoCell, NXP CAAM)
/// - NIC-integrated TLS offload engines (Mellanox ConnectX-6 TLS)
/// - HSM-adjacent secure enclaves
/// - Software fallback (when no hardware engine is present)
pub trait CryptoEngine: Send + Sync {
    // --- Capability discovery ---

    /// Return the algorithm configurations supported by this engine. Writes
    /// `AlgorithmDescriptor` entries into the caller-supplied buffer and
    /// returns the number written. Each entry specifies the algorithm family,
    /// key sizes, and performance tier (hardware-accelerated or software
    /// fallback). The caller selects an algorithm from this list when
    /// creating a session.
    /// KABI note: uses caller-supplied buffer, not Vec, for C driver compat.
    fn query_algorithms(&self, buf: &mut [AlgorithmDescriptor], max_count: u32) -> Result<u32, CryptoError>;

    // --- Key management ---

    /// Import raw key material into the engine under a wrapping key (or in
    /// plaintext if `wrapping_key` is None and the engine permits it). The
    /// engine stores the key internally; the caller's buffer is zeroed after
    /// import. Returns an opaque `KeyHandle`. Raw key bytes are never
    /// accessible after this call; all subsequent operations use the handle.
    ///
    /// `flags` controls whether the key may be exported (wrapped) later or
    /// is permanently non-extractable. Non-extractable keys cannot leave the
    /// hardware even if the kernel is fully compromised — the hardware
    /// enforces this at the engine level.
    ///
    /// Requires `CAP_CRYPTO_ADMIN`.
    /// `wrapping_key: Option<&KeyHandle>` is FFI-safe per the Rust Reference
    /// ("FFI: Nullable pointers"): `Option<&T>` has the same layout as the
    /// underlying pointer, with `None` represented as null. This is guaranteed
    /// on all architectures and pointer widths.
    ///
    /// `key_bytes: &[u8]` is a fat pointer (ptr + len). The IDL compiler
    /// decomposes it to `(key_ptr: *const u8, key_len: u64)` for the C ABI.
    fn import_key(
        &self,
        algorithm: AlgorithmId,
        key_bytes: &[u8],
        wrapping_key: Option<&KeyHandle>,
        flags: KeyFlags,
    ) -> Result<KeyHandle, CryptoError>;

    /// Export a key that was imported with `EXPORTABLE`. The key material is
    /// encrypted under `wrapping_key` and written into the caller-provided
    /// `out` buffer. Returns the actual key blob length on success.
    /// Returns `CryptoError::NonExtractable` if the key was imported with
    /// `NON_EXTRACTABLE`. Returns `CryptoError::BufferTooSmall` if `out`
    /// is not large enough.
    ///
    /// Caller provides buffer; return value is actual key length.
    /// Avoids heap allocation across KABI boundary.
    ///
    /// Requires `CAP_CRYPTO_ADMIN`.
    fn export_key(
        &self,
        key: &KeyHandle,
        wrapping_key: &KeyHandle,
        out: &mut [u8],
    ) -> Result<usize, CryptoError>;

    /// Destroy a key handle. The engine erases all key material. After this
    /// call, the `KeyHandle` is invalid and any session using it will return
    /// `CryptoError::InvalidKey` on the next operation.
    ///
    /// Requires `CAP_CRYPTO_ADMIN`.
    fn destroy_key(&self, key: KeyHandle) -> Result<(), CryptoError>;

    // --- Session lifecycle ---

    /// Allocate a session for a specific algorithm and key. A session holds
    /// per-operation state (IV/nonce counters, HMAC state, RSA blinding
    /// factors) and is bound to one `KeyHandle`. Sessions are not thread-safe;
    /// concurrent callers must allocate separate sessions.
    ///
    /// Requires `CAP_CRYPTO_ACCEL`.
    fn alloc_session(
        &self,
        algorithm: AlgorithmId,
        key: &KeyHandle,
    ) -> Result<CryptoSession, CryptoError>;

    /// Free a session. Any in-flight operation on this session must complete
    /// first; returns `CryptoError::SessionBusy` if not.
    fn free_session(&self, session: CryptoSession) -> Result<(), CryptoError>;

    // --- Operation submission ---

    /// Submit a cryptographic operation. The request is placed on the
    /// engine's ring buffer (Section 11.8 ring model). Returns immediately with a
    /// `DmaFence` ([Section 13.5](#gpu-compute) timeline semaphore) that is signaled when the
    /// output DMA-BUF has been fully written and is safe to read.
    ///
    /// `request` fully describes the operation: op type (encrypt, decrypt,
    /// sign, verify, hash, key-exchange), input DMA-BUF, output DMA-BUF,
    /// associated data (for AEAD ciphers), nonce/IV, and tag buffer.
    ///
    /// For AEAD operations (AES-GCM, ChaCha20-Poly1305): authentication tag
    /// is appended to ciphertext on encrypt, and verified + stripped on
    /// decrypt. A tag mismatch on decrypt returns `CryptoError::AuthFailed`
    /// via the fence's error status.
    ///
    /// Requires `CAP_CRYPTO_ACCEL`.
    fn submit(
        &self,
        session: &CryptoSession,
        request: &CryptoRequest,
    ) -> Result<DmaFence, CryptoError>;

    // --- Streaming (incremental) crypto ---

    /// Feed a chunk of data into an ongoing hash or streaming cipher operation.
    /// For hash algorithms (SHA-256, SHA-512, etc.): accumulates input into the
    /// internal hash state. For streaming ciphers (AES-CBC, AES-CTR): encrypts
    /// the chunk and writes output to `output_buf`, carrying the cipher state
    /// (IV / counter) to the next `update()` call.
    ///
    /// `submit()` is the one-shot path (complete input in a single DMA-BUF).
    /// `update()` + `finalize()` is the streaming path for data too large to
    /// fit in a single DMA-BUF or for incremental processing (e.g., hashing
    /// a multi-gigabyte file during writeback).
    /// `output` uses sentinel value `DmaBufHandle(-1)` for "no output buffer"
    /// instead of `Option<DmaBufHandle>`. `Option<i32>` has no niche optimization
    /// (i32 uses all bit patterns), making it 8 bytes with no stable C ABI
    /// representation. This mirrors the sentinel approach of `CryptoRequest.aad`,
    /// which uses `DmaBufHandle(0)` as its "ignored" value.
    fn update(
        &self,
        session: &CryptoSession,
        input: DmaBufHandle,
        input_len: u64,
        output: DmaBufHandle,  // -1 = no output buffer (hash-only update)
    ) -> Result<DmaFence, CryptoError>;

    /// Finalize a streaming operation. For hash: writes the final digest to
    /// `output`. For streaming ciphers: flushes any buffered partial block
    /// (with ciphertext stealing for XTS, PKCS#7 padding for CBC) and writes
    /// the final output. Resets the session's streaming state.
    fn finalize(
        &self,
        session: &CryptoSession,
        output: DmaBufHandle,
    ) -> Result<DmaFence, CryptoError>;
}
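The split between the one-shot and streaming paths can be made concrete with a small sizing helper. This is an illustrative sketch only — `SubmitPlan` and `plan` are hypothetical caller-side helpers, not part of the `CryptoEngine` contract:

```rust
// Sketch: decide between submit() (one-shot) and update()/finalize()
// (streaming) from the payload size versus the DMA-BUF capacity.

#[derive(Debug, PartialEq)]
enum SubmitPlan {
    /// Complete input fits one DMA-BUF: a single submit() call.
    OneShot,
    /// Streaming path: n_full update() calls of dmabuf_len bytes, one final
    /// update() of tail_len bytes (if nonzero), then finalize().
    Streaming { n_full: u64, tail_len: u64 },
}

fn plan(total_len: u64, dmabuf_len: u64) -> SubmitPlan {
    assert!(dmabuf_len > 0);
    if total_len <= dmabuf_len {
        SubmitPlan::OneShot
    } else {
        SubmitPlan::Streaming {
            n_full: total_len / dmabuf_len,
            tail_len: total_len % dmabuf_len,
        }
    }
}
```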

/// An opaque key handle. The raw key material is inaccessible after
/// `import_key`; this handle is the only means to reference the key in
/// subsequent operations.
///
/// `#[repr(C)]` is required because `&KeyHandle` crosses the KABI boundary
/// (driver dereferences fields via KABI vtable methods `export_key`,
/// `destroy_key`, `alloc_session`).
///
/// **Security note**: `pub(crate)` on `id` is a code-level encapsulation hint,
/// not a security boundary. A Tier 1 driver receiving `&KeyHandle` can read
/// `id` at offset 0 via pointer arithmetic on the `#[repr(C)]` struct.
/// Opacity of `id` is enforced by the crypto engine's per-operation validation
/// (key existence + algorithm binding + session association), not by Rust
/// visibility. Knowing the `id` value without a valid session and matching
/// algorithm is useless — the engine rejects operations that fail validation.
#[repr(C)]
pub struct KeyHandle {
    /// Opaque engine-assigned key identifier.
    pub(crate) id: u64,
    /// The algorithm this key is bound to.
    pub algorithm: AlgorithmId,
    /// Whether this key is allowed to be exported.
    pub exportable: u8, // 0 = false, 1 = true
    /// Explicit trailing padding (struct alignment = 8 from u64 field).
    _pad: [u8; 3],
}
const_assert!(size_of::<KeyHandle>() == 16);

/// A crypto session: per-operation state bound to one key and algorithm.
pub struct CryptoSession {
    /// Opaque kernel handle.
    pub handle: u64,
    /// Algorithm this session is configured for.
    pub algorithm: AlgorithmId,
}

/// A single cryptographic operation request, placed on the engine's ring.
#[repr(C)]
pub struct CryptoRequest {
    /// The type of operation to perform.
    pub op: CryptoOp,
    /// DMA-BUF containing input data (plaintext for encrypt, ciphertext for
    /// decrypt, message for hash/sign, public value for key agreement).
    pub input: DmaBufHandle,
    /// DMA-BUF that the engine will write output into (ciphertext for
    /// encrypt, plaintext for decrypt, digest for hash, signature for sign,
    /// shared secret for key agreement).
    pub output: DmaBufHandle,
    /// Associated data for AEAD operations (authenticated but not encrypted).
    /// When `aad_len == 0` (no associated data), this field is ignored by the
    /// engine — callers should set it to `DmaBufHandle(0)` (null sentinel).
    /// For non-AEAD operations (hash, symmetric encrypt/decrypt without AAD),
    /// both `aad` and `aad_len` are ignored.
    pub aad: DmaBufHandle,
    /// Nonce or IV for symmetric ciphers. Length and format are
    /// algorithm-specific: 12 bytes for AES-GCM, 12 bytes for
    /// ChaCha20-Poly1305. Ignored for hash and asymmetric operations.
    /// WARNING: AES-GCM nonces MUST be exactly 12 bytes per NIST SP
    /// 800-38D. Other lengths trigger GHASH-based IV derivation with
    /// weaker security. The 16-byte buffer accommodates future algorithms.
    pub nonce: [u8; 16],
    /// Actual nonce/IV length in bytes (0 if not applicable).
    pub nonce_len: u8,
    /// Explicit padding for u64 alignment of `input_len`. Must be zeroed
    /// to prevent kernel memory disclosure to hardware DMA engines.
    pub _pad: [u8; 7],
    /// Input data length in bytes.
    pub input_len: u64,
    /// Output buffer length in bytes. The engine bounds-checks output writes
    /// against this value. For AEAD encryption, output_len must accommodate
    /// ciphertext + authentication tag. For hash operations, output_len is
    /// the digest size. Engine returns an error if the output buffer is too small.
    pub output_len: u64,
    /// Associated data length in bytes.
    pub aad_len: u64,
}
const_assert!(core::mem::size_of::<CryptoRequest>() == 64);
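The 64-byte layout claimed above can be verified mechanically. The sketch below mirrors `CryptoRequest` with `DmaBufHandle` modeled as a 4-byte `i32` newtype (an assumption implied by the `DmaBufHandle(-1)`/`DmaBufHandle(0)` sentinels used in this section) and checks the field offsets at compile time with `core::mem::offset_of!` (stable since Rust 1.77):

```rust
use core::mem::{offset_of, size_of};

/// Assumed 4-byte DMA-BUF handle (sketch).
#[repr(C)]
#[derive(Clone, Copy)]
pub struct DmaBufHandle(pub i32);

#[repr(u32)]
#[derive(Clone, Copy)]
pub enum CryptoOp { Encrypt = 0 } // remaining variants elided for the sketch

#[repr(C)]
pub struct CryptoRequest {
    pub op: CryptoOp,         // offset  0
    pub input: DmaBufHandle,  // offset  4
    pub output: DmaBufHandle, // offset  8
    pub aad: DmaBufHandle,    // offset 12
    pub nonce: [u8; 16],      // offset 16
    pub nonce_len: u8,        // offset 32
    pub _pad: [u8; 7],        // offset 33 (must be zeroed)
    pub input_len: u64,       // offset 40 (8-byte aligned)
    pub output_len: u64,      // offset 48
    pub aad_len: u64,         // offset 56
}

// Compile-time layout checks mirroring the const_assert! above.
const _: () = assert!(size_of::<CryptoRequest>() == 64);
const _: () = assert!(offset_of!(CryptoRequest, nonce) == 16);
const _: () = assert!(offset_of!(CryptoRequest, input_len) == 40);
const _: () = assert!(offset_of!(CryptoRequest, aad_len) == 56);
```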

/// The specific cryptographic operation requested.
#[repr(u32)]
pub enum CryptoOp {
    /// Symmetric encryption (AES-GCM, ChaCha20-Poly1305, AES-CBC, AES-CTR).
    Encrypt          = 0,
    /// Symmetric decryption with authentication tag verification (AEAD) or
    /// plain decryption (non-AEAD).
    Decrypt          = 1,
    /// Compute a message digest (SHA-256, SHA-384, SHA-512, SHA-3-256).
    Hash             = 2,
    /// Compute an HMAC (HMAC-SHA-256, HMAC-SHA-384, HMAC-SHA-512).
    Hmac             = 3,
    /// Asymmetric signing (RSA-PSS, ECDSA P-256, ECDSA P-384, Ed25519).
    Sign             = 4,
    /// Asymmetric signature verification.
    Verify           = 5,
    /// Key agreement / scalar multiplication (ECDH P-256, ECDH P-384,
    /// X25519, X448). Output is the shared secret.
    KeyAgreement     = 6,
    /// TLS record encryption (NIC TLS offload engines only). Input is a
    /// plaintext TLS record; output is the encrypted wire-format record.
    TlsRecordEncrypt = 7,
    /// TLS record decryption (NIC TLS offload engines only).
    TlsRecordDecrypt = 8,
}

/// Identifier for a specific algorithm configuration.
#[repr(u32)]
pub enum AlgorithmId {
    AesGcm128       = 0,
    AesGcm256       = 1,
    ChaCha20Poly1305 = 2,
    AesCbc128       = 3,
    AesCbc256       = 4,
    AesCtr128       = 5,
    AesCtr256       = 6,
    /// AES-XTS (XOR-encrypt-XOR with ciphertext stealing). The standard
    /// disk encryption algorithm used by dm-crypt/LUKS, fscrypt, and
    /// hardware crypto offload (Intel QAT, NXP CAAM). Required for
    /// encrypted swap ([Section 4.13](04-memory.md#swap-subsystem)), dm-crypt ([Section 15.2](15-storage.md#block-io-and-volume-management)),
    /// and fscrypt ([Section 15.20](15-storage.md#fscrypt-file-level-encryption)).
    AesXts128       = 7,
    AesXts256       = 8,
    Sha256          = 16,
    Sha384          = 17,
    Sha512          = 18,
    Sha3_256        = 19,
    HmacSha256      = 32,
    HmacSha384      = 33,
    HmacSha512      = 34,
    RsaPss2048Sha256 = 48,
    RsaPss4096Sha384 = 49,
    EcdsaP256Sha256 = 64,
    EcdsaP384Sha384 = 65,
    Ed25519         = 66,
    EcdhP256        = 80,
    EcdhP384        = 81,
    X25519          = 82,
    X448            = 83,
}

bitflags! {
    /// Flags controlling key lifecycle and extractability.
    pub struct KeyFlags: u32 {
        /// Key may be exported (wrapped) by a process holding
        /// `CAP_CRYPTO_ADMIN`. Mutually exclusive with `NON_EXTRACTABLE`.
        const EXPORTABLE       = 1 << 0;
        /// Key material never leaves the hardware security boundary. Once
        /// imported, it cannot be read out even with physical access to DRAM.
        /// Mutually exclusive with `EXPORTABLE`.
        const NON_EXTRACTABLE  = 1 << 1;
        /// Key is persistent across power cycles (stored in hardware key
        /// store, e.g., TPM NV index or TrustZone secure storage). Engines
        /// that do not support persistence return `CryptoError::Unsupported`
        /// if this flag is set.
        const PERSISTENT       = 1 << 2;
    }
}

/// Descriptor of one algorithm configuration supported by an engine.
///
/// `#[repr(C)]` is required because `AlgorithmDescriptor` is returned by
/// `query_algorithms()`, a KABI vtable method.
#[repr(C)]
pub struct AlgorithmDescriptor {
    /// The algorithm this descriptor covers.
    pub id: AlgorithmId,
    /// Whether this algorithm is executed in hardware or software fallback.
    pub hardware_accelerated: u8, // 0 = software fallback, 1 = hardware accelerated
    /// Explicit padding for `#[repr(C)]` layout stability.
    pub _pad: [u8; 3],
    /// Approximate throughput in MiB/s for bulk operations (encrypt/decrypt/
    /// hash). 0 for asymmetric operations where throughput is not meaningful.
    pub throughput_mibps: u32,
    /// Approximate latency in microseconds per single operation (for
    /// asymmetric operations such as sign/verify/key-agreement).
    pub latency_us: u32,
}
const_assert!(size_of::<AlgorithmDescriptor>() == 16);

Software fallback: If query_algorithms() returns a descriptor with hardware_accelerated == 0 for a requested algorithm, the submit() path executes the algorithm in a kernel software implementation (Rust aes-gcm, chacha20poly1305, sha2, p256, x25519-dalek). The API is identical regardless of acceleration. Callers that require hardware acceleration for security reasons (e.g., to achieve constant-time execution or non-extractable keys) must check the hardware_accelerated flag in the descriptor before creating a session.
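The caller-side check described above can be sketched as follows. `AlgorithmDescriptor` is restated locally so the sketch is self-contained, with `id` simplified to a plain `u32` (standing in for `AlgorithmId`); `require_hw` is a hypothetical helper, not part of the contract:

```rust
// Sketch: scan the descriptors returned by query_algorithms() and refuse
// software fallback when hardware properties (constant-time execution,
// non-extractable keys) are required.

#[repr(C)]
pub struct AlgorithmDescriptor {
    pub id: u32,                   // stands in for AlgorithmId
    pub hardware_accelerated: u8,  // 0 = software fallback, 1 = hardware
    pub _pad: [u8; 3],
    pub throughput_mibps: u32,
    pub latency_us: u32,
}

/// Returns true only if `want` is offered AND executed in hardware.
/// A caller needing hardware guarantees creates a session only when this
/// passes; otherwise the transparent software fallback would be used.
fn require_hw(descs: &[AlgorithmDescriptor], want: u32) -> bool {
    descs.iter().any(|d| d.id == want && d.hardware_accelerated == 1)
}
```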

TLS offload integration: A NIC with TLS record-layer offload (e.g., Mellanox ConnectX-6, Marvell OcteonTX2) registers itself as a CryptoEngine with TlsRecordEncrypt and TlsRecordDecrypt in its algorithm list. The kernel TLS layer (ktls, Section 16.1 net-tls) queries the CryptoEngine registry and, if a matching engine is found for the session's cipher suite, offloads record encryption there. The TCP send path then bypasses the software TLS layer and passes plaintext records directly to the NIC. The kernel retains ownership of the session key via a KeyHandle; the NIC's shadow copy is invalidated when destroy_key is called.

Capability gating: CAP_CRYPTO_ACCEL is required for session allocation and operation submission. CAP_CRYPTO_ADMIN is additionally required for key import, export, and destruction, and for reading hardware performance counters. Processes without CAP_CRYPTO_ACCEL receive the software fallback path transparently; they do not receive an error.

Hardware-specific detail: Per-vendor crypto accelerator driver architecture (Intel QAT, ARM TrustZone CryptoCell cc712/cc713, NXP CAAM, Mellanox ConnectX TLS offload) is documented inline in this section. TPM 2.0 key storage integration is specified in Section 9.3 (08-security.md).


13.12 USB Class Drivers and Mass Storage

USB devices follow a class-based driver model. The USB host controller driver (xHCI for USB 3.x, EHCI for USB 2.0) is a Tier 1 platform driver that manages host controller hardware and the root hub. Class drivers are layered above it and bind to devices by USB class code, subclass, and protocol — not by vendor/product ID — giving a single driver coverage across all standards-compliant devices of a class.

13.12.1 USB Host Controller (xHCI, Tier 1)

The xHCI driver (USB 3.2 specification) manages:

  • Transfer ring management: each endpoint has a ring buffer (producer/consumer pointers in memory). The driver enqueues Transfer Request Blocks (TRBs); the controller processes them and posts Transfer Event TRBs to the Event Ring.
  • Command ring: host-issued commands (Enable Slot, Disable Slot, Configure Endpoint, Reset Device) use a separate command ring.
  • Interrupt moderation: MSI-X per-interrupter; Event Ring Segment Table (ERST) maps event ring memory to the controller.

Device enumeration: root hub port status change → reset the port and address the device at default address 0 (reading the first 8 bytes of the device descriptor to learn bMaxPacketSize0) → assign an address via SET_ADDRESS (issued through the xHCI Address Device command) → GET_DESCRIPTOR (full device, configuration, interface, endpoint descriptors) → bind class driver based on bDeviceClass or bInterfaceClass.

13.12.2 USB Mass Storage (UMS) and USB Attached SCSI (UAS)

Both protocols expose USB storage devices as block devices to umka-block.

UMS (USB Mass Storage, Bulk-Only Transport):

  • Wraps SCSI commands in a Command Block Wrapper (CBW) sent over a bulk-out endpoint; the device responds with data and a Command Status Wrapper (CSW) on bulk-in. One outstanding command at a time.
  • The device registers as a BlockDevice with umka-block upon a successful SCSI INQUIRY → READ CAPACITY(16) sequence.

UAS (USB Attached SCSI, USB 3.0+):

  • Four-endpoint protocol (command, status, data-in, data-out). Multiple outstanding commands (up to 65535 via stream IDs). Significantly higher throughput and lower latency than UMS for fast SSDs.
  • Preferred over UMS when both are supported (bInterfaceProtocol = 0x62).
  • Same BlockDevice registration as UMS; umka-block sees no difference.
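The "prefer UAS when offered" rule reduces to a check on the interface descriptor triple (class 0x08 = Mass Storage, subclass 0x06 = SCSI transparent command set, protocol 0x50 = Bulk-Only Transport, 0x62 = UAS). The helper below is a sketch of that decision in isolation; a real binder would also walk alternate settings, since UAS-capable devices typically expose BOT and UAS as separate alt settings of one interface:

```rust
// Sketch: mass-storage transport selection from bInterfaceClass /
// bInterfaceSubClass / bInterfaceProtocol.

#[derive(Debug, PartialEq)]
enum MsTransport { Uas, BulkOnly, Unsupported }

fn pick_transport(class: u8, subclass: u8, protocol: u8) -> MsTransport {
    if class != 0x08 || subclass != 0x06 {
        return MsTransport::Unsupported; // not SCSI-transparent mass storage
    }
    match protocol {
        0x62 => MsTransport::Uas,      // preferred when the device offers it
        0x50 => MsTransport::BulkOnly, // BOT fallback
        _    => MsTransport::Unsupported,
    }
}
```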

Hotplug: USB device removal triggers an Unregister event in the device registry (Section 11.4). The volume layer (Section 15.2) transitions dependent block devices to DEVICE_FAILED state. Auto-mount/unmount policy is handled by a userspace daemon (udev-compatible via umka-sysapi) reacting to device registry events.

BOT wire format — Command Block Wrapper (CBW) and Command Status Wrapper (CSW):

/// USB Mass Storage Bulk-Only Transport Command Block Wrapper.
/// Sent on the bulk-out endpoint before each SCSI command.
/// USB Mass Storage Class spec §5.1.
#[repr(C, packed)]
pub struct UmsCbw {
    /// Signature: 0x43425355 ("USBC" in little-endian).
    pub signature: Le32,
    /// Tag: unique per-command, echoed in the corresponding CSW.
    pub tag: Le32,
    /// Data transfer length in bytes (may be 0 for no-data commands).
    pub data_transfer_length: Le32,
    /// Bit 7: direction (0 = host-to-device, 1 = device-to-host).
    /// Bits 6:0: reserved (zero).
    pub flags: u8,
    /// Target logical unit number (bits 3:0; bits 7:4 reserved).
    pub lun: u8,
    /// Length of the SCSI CDB in bytes (1-16). Bits 4:0 are the length;
    /// bits 7:5 are reserved (zero).
    pub cb_length: u8,
    /// SCSI Command Descriptor Block, zero-padded to 16 bytes.
    pub cb: [u8; 16],
}
const_assert!(size_of::<UmsCbw>() == 31);

/// USB Mass Storage Bulk-Only Transport Command Status Wrapper.
/// Received on the bulk-in endpoint after data transfer completes.
/// USB Mass Storage Class spec §5.2.
#[repr(C, packed)]
pub struct UmsCsw {
    /// Signature: 0x53425355 ("USBS" in little-endian).
    pub signature: Le32,
    /// Tag: matches the tag from the corresponding CBW.
    pub tag: Le32,
    /// Residue: difference between data_transfer_length and actual bytes transferred.
    pub data_residue: Le32,
    /// Status: 0x00 = command passed, 0x01 = command failed, 0x02 = phase error.
    pub status: u8,
}
const_assert!(size_of::<UmsCsw>() == 13);

Minimum required SCSI commands for UMS block device registration:

| Opcode | Command | Purpose |
|--------|---------|---------|
| 0x12 | INQUIRY | Device identification and capability discovery |
| 0x00 | TEST UNIT READY | Check if device is ready for I/O |
| 0x25 | READ CAPACITY(10) | Get device size (32-bit LBA) |
| 0x9E/0x10 | READ CAPACITY(16) | Get device size (64-bit LBA, for >2TB); opcode 0x9E, service action 0x10 |
| 0x28 | READ(10) | Read data (32-bit LBA; 16-bit transfer length, up to 65535 blocks) |
| 0x2A | WRITE(10) | Write data (32-bit LBA; 16-bit transfer length, up to 65535 blocks) |
| 0x35 | SYNCHRONIZE CACHE(10) | Flush volatile cache to stable media |
| 0x1A | MODE SENSE(6) | Query device parameters (write protect, cache mode) |
| 0x03 | REQUEST SENSE | Retrieve error details after a failed command |

Error recovery state machine (BOT reset recovery, USB MS spec §5.3.4):

1. Send CBW for SCSI command.
2. If CSW status = 0x02 (phase error) or a bulk endpoint stalls:
   a. Send the USB class-specific request Bulk-Only Mass Storage Reset (bRequest=0xFF).
   b. Clear HALT on the bulk-in endpoint (CLEAR_FEATURE(ENDPOINT_HALT)).
   c. Clear HALT on the bulk-out endpoint.
   d. Retry the original command (max 3 retries).
3. If reset recovery fails: transition the device to DEVICE_FAILED.
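
The recovery loop can be expressed as a transport-agnostic retry helper. A minimal sketch: `BotTransport`, `CswStatus`, and `execute_with_recovery` are illustrative names, not part of the UMS driver contract.

```rust
/// CSW status byte values (model of the BOT status phase).
#[derive(Clone, Copy, PartialEq, Debug)]
enum CswStatus { Passed, Failed, PhaseError }

/// Hypothetical transport hooks; a real driver issues these over the
/// device's control and bulk endpoints.
trait BotTransport {
    fn execute(&mut self, cdb: &[u8]) -> CswStatus;
    fn bulk_only_reset(&mut self);   // class request, bRequest = 0xFF
    fn clear_halt_in(&mut self);     // CLEAR_FEATURE(ENDPOINT_HALT), bulk-in
    fn clear_halt_out(&mut self);    // CLEAR_FEATURE(ENDPOINT_HALT), bulk-out
}

const MAX_RESET_RETRIES: u32 = 3;

/// Returns Err(()) when the command fails at the SCSI level or reset
/// recovery is exhausted (caller transitions the device to DEVICE_FAILED).
fn execute_with_recovery<T: BotTransport>(t: &mut T, cdb: &[u8]) -> Result<(), ()> {
    for _ in 0..=MAX_RESET_RETRIES {
        match t.execute(cdb) {
            CswStatus::Passed => return Ok(()),
            // SCSI-level failure: handled via REQUEST SENSE, not BOT reset.
            CswStatus::Failed => return Err(()),
            CswStatus::PhaseError => {
                t.bulk_only_reset();
                t.clear_halt_in();
                t.clear_halt_out();
                // fall through: retry the original command
            }
        }
    }
    Err(())
}
```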

Tier classification: UMS/UAS drivers are Tier 2 — they communicate over USB (inherently higher latency than PCIe), and the attack surface of USB storage firmware justifies full process isolation over the modest CPU overhead.

13.12.3 USB4 and Thunderbolt

USB4 (based on Thunderbolt 3 protocol) and Thunderbolt 3/4 are high-bandwidth interconnects (40 Gbps) that tunnel multiple protocols — PCIe, DisplayPort, USB — over a single cable. They are relevant across server (external NVMe enclosures, 40GbE NICs), workstation (external GPUs), and embedded (dock stations) contexts.

Architecture: A USB4/Thunderbolt port is controlled by a retimer/router chip with its own firmware. The host-side driver configures the router and establishes tunnels. The tunneled protocols then appear as native devices:

Physical cable (USB4/TB4)
  └── USB4 router (host controller + retimer firmware)
       ├── PCIe tunnel → appears as PCIe device (NVMe, GPU, NIC)
       ├── DisplayPort tunnel → appears as DP connector (Section 21.4.3, `20-user-io.md`)
       └── USB tunnel → appears as USB hub → USB class devices

Kernel responsibilities:

  1. Router enumeration: Discover USB4 routers via their management interface (MMIO registers or USB control endpoint). Read router topology descriptor to find upstream/downstream adapters and their capabilities.

  2. IOMMU enforcement (mandatory for PCIe tunnels): Before establishing a PCIe tunnel to an external device, the kernel allocates an IOMMU domain for the tunneled device. The PCIe device behind the tunnel is treated identically to a native PCIe device — it gets its own IOMMU domain, its own device registry entry, and its driver follows the normal Tier 1/2 model. IOMMU protection is not optional; external PCIe devices are untrusted by definition.

  3. Tunnel authorization: The kernel blocks PCIe tunnel establishment until an authorization signal is received via sysfs:

    /sys/bus/thunderbolt/devices/<device>/authorized
    
    Writing 1 authorizes the device; writing 0 de-authorizes and tears down the tunnel. This is the kernel's policy interface — what triggers the write (user prompt, pre-approved list, automatic trust) is userspace policy.

  4. Hotplug lifecycle:

     - Connect: router detects device → kernel enumerates → IOMMU domain allocated → authorization check → tunnel established → PCIe/DP/USB device appears
     - Disconnect: router reports link-down → kernel tears down tunnel → IOMMU domain revoked → device registry Unregister event → volume/display/USB layers handle disappearance gracefully

/// USB4/Thunderbolt router state.
// Kernel-internal, not KABI: contains ArrayVec, XArray, Option (no stable C layout).
pub struct Usb4Router {
    /// Router hardware generation and capabilities.
    pub gen: Usb4Generation,
    /// Upstream adapter (host-facing port).
    pub upstream: Usb4Adapter,
    /// Downstream adapters (device-facing ports).
    // ArrayVec: bounded iteration (probe-time, max 64 per USB4 spec).
    pub downstream: ArrayVec<Usb4Adapter, 64>,
    /// Currently active tunnels.
    // ArrayVec: bounded iteration (tunnel setup, max 64 = one per adapter pair).
    pub tunnels: ArrayVec<Usb4Tunnel, 64>,
    /// IOMMU domains for active PCIe tunnels.
    // XArray: O(1) lookup by adapter ID (runtime hot path for PCIe DMA mapping).
    pub pcie_domains: XArray<IommuDomain>,
}

// UmkaOS-internal discriminant values, not from USB4/Thunderbolt specification.
// Usb4Gen2/Gen3 match the USB4 generation number for clarity; Tb3/Tb4 are
// arbitrarily chosen values (30/40) that do not correspond to any standard field.
#[repr(u32)]
pub enum Usb4Generation {
    Usb4Gen2 = 2,   // 20 Gbps
    Usb4Gen3 = 3,   // 40 Gbps
    Tb3      = 30,  // Thunderbolt 3 (40 Gbps)
    Tb4      = 40,  // Thunderbolt 4 (40 Gbps, mandatory PCIe + DP)
}

/// A USB4/Thunderbolt adapter (port) within a router.
/// Each adapter is a protocol endpoint: PCIe, DisplayPort, or USB3.
// Kernel-internal, not KABI: embedded in Usb4Router (no repr(C), contains bool).
pub struct Usb4Adapter {
    /// Adapter number within the router (0-63).
    pub id: Usb4AdapterId,
    /// Adapter type (which protocol this port supports).
    pub kind: Usb4AdapterKind,
    /// Current link state of this adapter.
    pub link_up: bool,
    /// Negotiated lane speed for this adapter.
    pub gen: Usb4Generation,
}

#[repr(u8)]
pub enum Usb4AdapterKind {
    /// Lane adapter (physical link).
    Lane    = 0,
    /// PCIe protocol adapter.
    Pcie    = 1,
    /// DisplayPort protocol adapter (IN or OUT).
    Dp      = 2,
    /// USB 3.x protocol adapter.
    Usb3    = 3,
}

/// Adapter number within a USB4 router (0-63).
/// Per USB4 spec §9.3, each router has up to 64 adapters (protocol adapters
/// and lane adapters). The adapter ID is the index into the router's adapter
/// configuration space.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Usb4AdapterId(pub u8);

// Kernel-internal, not KABI: Option is acceptable (not repr(C)).
pub struct Usb4Tunnel {
    pub kind: Usb4TunnelKind,
    pub adapter_id: Usb4AdapterId,
    /// Some for PCIe tunnels (require IOMMU isolation), None for DP/USB3 tunnels.
    pub iommu_domain: Option<IommuDomain>,
}

#[repr(u32)]
pub enum Usb4TunnelKind {
    Pcie        = 0,
    DisplayPort = 1,
    Usb3        = 2,
}

IOMMU domain lifecycle on disconnect/reconnect:

To prevent IOMMU domain reuse on rapid disconnect/reconnect sequences:

  1. On disconnect: the device's IOMMU domain is immediately invalidated (all IOMMU mappings flushed). The domain ID enters a quarantine period (TTL = 5 seconds). The device's CAP_DMA capability is revoked immediately via cap_revoke(device_cap_handle).

  2. Quarantine: the quarantined domain ID is reserved and cannot be assigned to any new device until TTL expires and all in-flight DMA transactions are confirmed drained (via iommu_domain_drain_wait()).

  3. On reconnect: the reconnecting device receives a fresh IOMMU domain with a new domain ID. It never inherits the quarantined domain. Authorization re-runs from scratch (user prompt or policy check).

  4. DMA capability binding: CAP_DMA is bound to the IOMMU domain ID, not the device identity. A reconnecting device gets a new CAP_DMA capability after authorization; the old capability is permanently revoked.

This prevents the race where old IOMMU mappings remain active when a new device appears at the same slot, and prevents capability reuse across device identities.
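
The quarantine bookkeeping reduces to a small table of (domain ID, expiry) pairs consulted by the allocator. A self-contained model under assumed names (`DomainQuarantine` is illustrative); the real allocator additionally gates release on `iommu_domain_drain_wait()` as described in step 2.

```rust
/// Quarantine TTL from the lifecycle rules above: 5 seconds.
const QUARANTINE_TTL_NS: u64 = 5_000_000_000;

/// Illustrative quarantine table; the real implementation lives in the
/// IOMMU domain allocator and also waits for in-flight DMA to drain.
struct DomainQuarantine {
    /// (domain_id, expiry in monotonic ns)
    entries: Vec<(u32, u64)>,
}

impl DomainQuarantine {
    fn new() -> Self { Self { entries: Vec::new() } }

    /// Called on disconnect, after mappings are flushed and CAP_DMA revoked.
    fn quarantine(&mut self, domain_id: u32, now_ns: u64) {
        self.entries.push((domain_id, now_ns + QUARANTINE_TTL_NS));
    }

    /// May `domain_id` be handed to a newly connected device?
    /// Expired entries are garbage-collected on the way through.
    fn is_allocatable(&mut self, domain_id: u32, now_ns: u64) -> bool {
        self.entries.retain(|&(_, exp)| exp > now_ns);
        !self.entries.iter().any(|&(id, _)| id == domain_id)
    }
}
```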

Firmware updates for TB controllers: Controller firmware is updatable via the NVM update protocol (vendor-specific, typically via the thunderbolt sysfs interface). The kernel exposes the firmware version and provides a write interface for firmware blobs. Actual firmware image selection and update policy is userspace.

Relationship to Section 5.3: External PCIe devices attached via USB4/Thunderbolt use the same IOMMU hard boundary and unilateral controls (bus master disable, FLR, slot power) as internal PCIe devices. If the external device runs an UmkaOS peer kernel (Section 5.2), it participates in the cluster exactly as an internal device would — the tunnel is transparent to the cluster protocol.

13.12.3.1 Authorization TOCTOU Prevention

The authorization flow described above has a time-of-check / time-of-use (TOCTOU) window: a malicious device could present one identity at authorization time and then swap its firmware or topology between authorization and PCIe tunnel enumeration, gaining access to an authorized tunnel under a different identity. UmkaOS closes this window with a cryptographic authorization token that binds to immutable hardware identifiers, plus a mandatory re-verification step at the start of enumeration.

Authorization token:

/// Default authorization token lifetime for interactive sessions.
/// After this duration, the token expires and a new authorization is required.
/// Balances security (limits damage window if device is swapped) with usability.
/// Overridable via the `umka.thunderbolt_auth_timeout_s` kernel parameter.
pub const TBT_AUTH_DEFAULT_TIMEOUT_S: u64 = 1800; // 30 minutes

/// Timeout waiting for the userspace authorization daemon to respond.
/// If no response within this window, the tunnel request is denied (fail-closed).
/// Prevents hangs when the authorization daemon is unresponsive.
pub const TBT_AUTH_DAEMON_RESPONSE_TIMEOUT_S: u64 = 30;

/// Cryptographic authorization token for a USB4/Thunderbolt PCIe tunnel.
/// Generated by the security manager when the user authorizes a device.
/// Stored in umka-core memory (not in the driver's isolation domain).
// Kernel-internal, not KABI: HMAC is computed over field values, not struct bytes.
pub struct TbtAuthToken {
    /// HMAC-SHA256(auth_key, device_uuid || device_serial || topology_path_bytes).
    /// `auth_key` is a kernel-private key generated at boot (never exported).
    pub token: [u8; 32],
    /// Thunderbolt device UUID as reported by router firmware (immutable field).
    pub device_uuid: [u8; 16],
    /// Thunderbolt device serial number as reported by router firmware (immutable).
    pub device_serial: u64,
    /// Topology path (upstream router UIDs + adapter indices) at authorization time.
    pub topology_path: TbtTopologyPath,
    /// Monotonic nanosecond timestamp when authorization was granted.
    pub authorized_at_ns: u64,
    /// Expiry timestamp (monotonic ns). 0 means valid until disconnect.
    /// Defaults to authorized_at_ns + TBT_AUTH_DEFAULT_TIMEOUT_S * 1_000_000_000
    /// for interactive sessions. Set to 0 for explicit "valid until disconnect" policy.
    pub expires_at_ns: u64,
}

/// Topology path: ordered list of (router_uid, adapter_index) pairs from the
/// host controller down to the authorized device. Max depth = 6 hops (USB4 spec).
// Kernel-internal, not KABI: embedded in TbtAuthToken (no repr(C), no ABI boundary).
pub struct TbtTopologyPath {
    pub hops: [(u64, u8); 6], // (router_uid, adapter_index)
    pub depth: u8,
}

Headless and daemon-less authorization policy:

If no USB4/Thunderbolt authorization daemon is registered (headless server, container, or daemon crash), ALL PCIe tunnel requests are denied by default until explicit authorization via umka-tbtctl authorize <uuid>. USB-class endpoints (not PCIe tunnels) are unaffected by this policy. The deny-default is logged at KERN_INFO level.

Daemon response timeout: If no daemon response arrives within TBT_AUTH_DAEMON_RESPONSE_TIMEOUT_S seconds, the kernel auto-denies the tunnel and logs the timeout event at KERN_WARNING.

Token generation at authorization time:

When the security manager grants authorization (in response to a write of 1 to /sys/bus/thunderbolt/devices/<device>/authorized):

  1. Read the device's UUID and serial from the router firmware via the Thunderbolt management interface (read-only fields in the router topology descriptor; these fields are populated at cable plug-in by the router firmware from the device's identity block and cannot be modified by software).
  2. Snapshot the current topology path (upstream router UIDs + adapter indices from host controller down to this device).
  3. Compute HMAC-SHA256(auth_key, device_uuid || device_serial || topology_path_bytes) using the kernel's boot-time-generated auth_key.
  4. Store the resulting TbtAuthToken in umka-core memory, associated with the device's Usb4Router entry.
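
The MAC input in step 3 needs an unambiguous byte layout so independently compiled implementations agree. The sketch below covers the serialization only: the field order follows the HMAC formula above, but the fixed-width little-endian encoding and the `mac_input` name are assumptions, and the HMAC-SHA256 call itself is elided.

```rust
/// Assemble device_uuid || device_serial || topology_path_bytes as the
/// HMAC-SHA256 message. Serial is encoded little-endian; each topology
/// hop as (router_uid as 8 LE bytes, adapter_index as 1 byte). This
/// layout is an illustrative assumption, not the normative encoding.
fn mac_input(uuid: &[u8; 16], serial: u64, hops: &[(u64, u8)]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(16 + 8 + hops.len() * 9);
    buf.extend_from_slice(uuid);
    buf.extend_from_slice(&serial.to_le_bytes());
    for &(router_uid, adapter_index) in hops {
        buf.extend_from_slice(&router_uid.to_le_bytes());
        buf.push(adapter_index);
    }
    buf
}
```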

Re-verification at enumeration time:

Before the USB4/TBT driver establishes the PCIe tunnel and presents the tunneled device to the PCIe bus, the kernel performs a mandatory re-verification:

PCIe tunnel enumeration protocol (enforced by umka-core, not the driver):

1. USB4/TBT driver requests PCIe tunnel enumeration for adapter <id>.
2. Security manager retrieves the stored TbtAuthToken for that adapter.
3. If no token exists: enumeration denied (PermissionDenied).
4. If token has expired (expires_at_ns != 0 && monotonic_ns() > expires_at_ns):
   revoke authorization, log security event, return PermissionDenied.
5. Re-read device UUID + serial from router firmware.
6. Re-snapshot current topology path.
7. Recompute HMAC-SHA256(auth_key, uuid || serial || path_bytes).
8. Compare computed token with stored token — must match byte-for-byte.
9. If mismatch:
   a. Log security event: "TBT TOCTOU: device identity changed after authorization,
      adapter <id>, expected UUID <stored_uuid>, got <current_uuid>"
   b. Revoke authorization (clear authorized bit, destroy stored token).
   c. Disconnect: instruct the router firmware to disable the PCIe adapter.
   d. Return SecurityViolation to the caller.
10. If match: proceed with PCIe tunnel establishment and enumeration,
    holding a reference to the auth token for the duration of enumeration.
    **Atomicity of steps 5-8**: Steps 5-8 must execute as a single critical
    section under `tbt_security_lock` (per-host-controller spinlock). This
    prevents a device swap between the UUID re-read (step 5) and the HMAC
    comparison (step 8). The lock is also held during step 1 of authorization
    (token generation), ensuring that authorization and re-verification are
    serialized. The lock is NOT held during the actual PCIe enumeration
    (steps 10-12), which may take milliseconds — only the cryptographic
    verification window is protected.
11. During enumeration: verify that the PCIe device hierarchy rooted at the
    tunnel matches the topology snapshot (router count, UIDs at each hop).
    Any discrepancy aborts enumeration with the same TOCTOU revocation sequence.
12. Post-enumeration: associate the PCIe device nodes with this auth token.
    Store the token reference in each `DeviceDescriptor` for the tunneled devices.

Topology change monitoring after enumeration:

After the PCIe tunnel is established, the USB4 driver monitors router firmware events (hotplug notifications, link-state change interrupts from the host controller):

  • Router added or removed: any unexpected change in the topology between the host controller and the authorized device triggers re-verification. If the re-verify fails (token mismatch due to topology change), the kernel disconnects the PCIe tunnel and revokes authorization.
  • Link-down on authorized adapter: treated as a disconnect event. The auth token is destroyed. Reconnection requires a fresh authorization cycle.
  • Router UID mismatch: if a router at any hop in the stored topology path reports a different UID than the token recorded, the kernel disconnects immediately. This catches the attack where an intermediate router (not the endpoint device) is replaced.

The topology monitoring event loop runs in the USB4 host controller driver (Tier 1). Events are delivered via the host controller's interrupt, processed in the driver's interrupt handler, and dispatched to the security manager via an MPSC ring.


13.13 I2C/SMBus Bus Framework

I2C (Inter-Integrated Circuit) and SMBus (System Management Bus, a subset of I2C) are low-speed serial buses used throughout the hardware stack — in servers as well as consumer and embedded devices:

Server / datacenter uses:

- BMC (Baseboard Management Controller) sensor buses: CPU, DIMM, and VRM temperature sensors; fan speed controllers; PSU monitoring
- PMBus (Power Management Bus, layered on SMBus): voltage regulators, power sequencing, power rail telemetry
- SPD (Serial Presence Detect): JEDEC EEPROM on each DIMM, read at boot for memory training; JEDEC JEP106 manufacturer ID, capacity, speed grade, thermal sensor register on DDR4/5 DIMMs
- IPMI satellite controllers (IPMB — IPMI over I2C)

Consumer / embedded uses:

- Touchpads and touchscreens (I2C-HID protocol, Section 13.13 below)
- Audio codecs (I2C control path for volume, routing, power state)
- Ambient light sensors, accelerometers (shock/vibration detection)
- Battery and charger controllers (Smart Battery System over SMBus)

13.13.1 I2C Bus Trait

Platform I2C controller drivers (Intel LPSS, AMD FCH, Synopsys DesignWare, Broadcom BCM2835, Aspeed AST2600 BMC) implement the I2cBus trait. The trait is in umka-core/src/bus/i2c.rs.

/// I2C device address (7-bit, right-aligned; 0x00–0x7F).
pub type I2cAddr = u8;

/// I2C transfer result.
#[repr(u32)]
pub enum I2cResult {
    Ok              = 0,
    /// No ACK (device not present or not responding).
    NoAck           = 1,
    /// Bus arbitration lost (multi-master collision).
    ArbitrationLost = 2,
    /// Timeout (clock stretching exceeded or device hung).
    Timeout         = 3,
    InvalidParam    = 4,
}

/// I2C bus trait. Implemented by platform-specific controller drivers.
/// Used only within Rust-internal code (same compilation unit). For KABI
/// boundaries between separately-compiled modules, use `I2cBusVTable` below.
pub trait I2cBus: Send + Sync {
    /// Combined write-then-read (I2C repeated START).
    /// Typical pattern: write register address, read value.
    fn transfer(&self, addr: I2cAddr, write: &[u8], read: &mut [u8]) -> I2cResult;

    fn write(&self, addr: I2cAddr, data: &[u8]) -> I2cResult {
        self.transfer(addr, data, &mut [])
    }

    fn read(&self, addr: I2cAddr, buf: &mut [u8]) -> I2cResult {
        self.transfer(addr, &[], buf)
    }
}

/// C-ABI vtable for I2C bus controller operations, used at KABI boundaries.
/// When a Tier 1 HID/sensor driver needs to call the I2C bus controller (which
/// may be a separately-compiled Tier 0 module), it receives an `I2cDevice`
/// (below) containing a pointer to this vtable rather than an `Arc<dyn I2cBus>`.
#[repr(C)]
pub struct I2cBusVTable {
    /// Bounds-safety check: vtable size in bytes. Always
    /// `core::mem::size_of::<I2cBusVTable>()` for the implementing driver.
    pub vtable_size: u64,
    /// Primary version discriminant: `KabiVersion::as_u64()`. See [Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle) Rule 6.
    pub kabi_version: u64,
    /// Combined write-then-read (I2C repeated START).
    /// `ctx`: opaque per-bus context pointer (first arg to all operations).
    pub transfer: unsafe extern "C" fn(
        ctx:       *mut c_void,
        addr:      I2cAddr,
        write:     *const u8,
        write_len: u32,
        read:      *mut u8,
        read_len:  u32,
    ) -> I2cResult,
}
// I2cBusVTable: vtable_size(u64=8) + kabi_version(u64=8) + transfer(fn ptr).
#[cfg(target_pointer_width = "64")]
const_assert!(core::mem::size_of::<I2cBusVTable>() == 24);
#[cfg(target_pointer_width = "32")]
const_assert!(core::mem::size_of::<I2cBusVTable>() == 20);

/// Handle to a device at a fixed address on a specific I2C bus.
/// Uses C-ABI compatible vtable pointer + opaque context instead of
/// `Arc<dyn I2cBus>` to allow use across KABI boundaries between separately
/// compiled Tier 0 bus controller and Tier 1 device driver modules.
pub struct I2cDevice {
    /// Pointer to the bus controller's operation vtable. Points to a static
    /// vtable allocated in the bus controller module; never null.
    ///
    /// **Invariant**: After construction, `(*bus_ops).vtable_size >=
    /// size_of::<I2cBusVTable>()` is guaranteed. All methods may dereference
    /// `bus_ops` without re-checking vtable_size. The check is performed once
    /// at probe time in `I2cDevice::new()`.
    pub bus_ops:  *const I2cBusVTable,
    /// Opaque per-bus context pointer passed as the first argument to every
    /// vtable function. Points to the controller driver's internal bus state.
    pub bus_ctx:  *mut c_void,
    pub addr: I2cAddr,
}

// SAFETY: `bus_ops` points to a static vtable in the bus controller module
// that outlives any I2cDevice (the controller is a Tier 0 module that is
// never unloaded). `bus_ctx` points to the controller's internal state which
// implements `I2cBus: Send + Sync`. All vtable functions are `extern "C"`
// with no thread-local state — they are safe to call from any CPU.
// The I2cDevice may be shared across CPUs (e.g., interrupt handler on one
// CPU, process context on another) with external synchronization provided
// by the bus controller's per-bus mutex.
unsafe impl Send for I2cDevice {}
unsafe impl Sync for I2cDevice {}

impl I2cDevice {
    /// Construct an `I2cDevice`, validating the vtable_size bound.
    ///
    /// # Panics
    /// Panics if `(*bus_ops).vtable_size < size_of::<I2cBusVTable>()` (KABI
    /// version mismatch — the bus controller was compiled against an incompatible
    /// vtable layout).
    pub fn new(bus_ops: *const I2cBusVTable, bus_ctx: *mut c_void, addr: I2cAddr) -> Self {
        assert!(
            unsafe { (*bus_ops).vtable_size } as usize >= core::mem::size_of::<I2cBusVTable>(),
            "I2cBusVTable size mismatch: driver vtable too small"
        );
        Self { bus_ops, bus_ctx, addr }
    }
}

impl I2cDevice {
    pub fn read_reg(&self, reg: u8) -> Result<u8, I2cResult> {
        let mut buf = [0u8];
        // SAFETY: bus_ops and bus_ctx come from the bus controller at probe time.
        let result = unsafe {
            ((*self.bus_ops).transfer)(
                self.bus_ctx, self.addr, &reg as *const u8, 1, buf.as_mut_ptr(), 1,
            )
        };
        match result {
            I2cResult::Ok => Ok(buf[0]),
            e => Err(e),
        }
    }

    pub fn write_reg(&self, reg: u8, val: u8) -> I2cResult {
        let data = [reg, val];
        // SAFETY: bus_ops and bus_ctx are valid; data is stack-local and valid for transfer duration.
        unsafe {
            ((*self.bus_ops).transfer)(
                self.bus_ctx, self.addr, data.as_ptr(), 2, core::ptr::null_mut(), 0,
            )
        }
    }

    /// Read a 16-bit little-endian register (common on SMBus devices).
    pub fn read_reg16_le(&self, reg: u8) -> Result<u16, I2cResult> {
        let mut buf = [0u8; 2];
        // SAFETY: bus_ops/bus_ctx valid; buf is stack-local and valid for transfer duration.
        let result = unsafe {
            ((*self.bus_ops).transfer)(
                self.bus_ctx, self.addr, &reg as *const u8, 1, buf.as_mut_ptr(), 2,
            )
        };
        match result {
            I2cResult::Ok => Ok(u16::from_le_bytes(buf)),
            e => Err(e),
        }
    }

    /// Read a 16-bit big-endian register (common on JEDEC thermal sensors, I2C EEPROMs).
    pub fn read_reg16_be(&self, reg: u8) -> Result<u16, I2cResult> {
        let mut buf = [0u8; 2];
        // SAFETY: bus_ops/bus_ctx valid; buf is stack-local and valid for transfer duration.
        let result = unsafe {
            ((*self.bus_ops).transfer)(
                self.bus_ctx, self.addr, &reg as *const u8, 1, buf.as_mut_ptr(), 2,
            )
        };
        match result {
            I2cResult::Ok => Ok(u16::from_be_bytes(buf)),
            e => Err(e),
        }
    }
}

Tier classification: I2C controller drivers are Tier 1 — they are platform-integrated and accessed from multiple other Tier 1 drivers (audio, sensor, battery). Device drivers using I2C (touchpads, sensors) follow their own tier classification based on their function.

Device enumeration: I2C devices are enumerated from ACPI (_HID, _CRS with I2cSerialBusV2 resource) or device-tree compatible strings. The bus manager matches each ACPI/DT node to a registered I2C device driver.

13.13.2 SMBus and Hardware Sensors

SMBus restricts I2C to well-defined transaction types (Quick Command, Send Byte, Read Byte, Read Word, Block Read) and adds a PEC (Packet Error Code) byte for data integrity. The UmkaOS SMBus layer wraps I2cBus and enforces SMBus transaction semantics.
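
The PEC byte is the standard CRC-8 with polynomial 0x07 (x^8 + x^2 + x + 1), initial value 0, no bit reflection, and no final XOR, computed over every transmitted byte of the transaction including the address/R-W bytes. A freestanding sketch (the `smbus_pec` name is illustrative):

```rust
/// SMBus Packet Error Code: CRC-8, polynomial 0x07, init 0x00,
/// no reflection, no final XOR, computed over all transaction bytes
/// including the slave address/R-W bytes.
fn smbus_pec(bytes: &[u8]) -> u8 {
    let mut crc: u8 = 0;
    for &b in bytes {
        crc ^= b;
        for _ in 0..8 {
            crc = if crc & 0x80 != 0 { (crc << 1) ^ 0x07 } else { crc << 1 };
        }
    }
    crc
}
```

A receiver can validate a frame by running the CRC over all received bytes including the trailing PEC byte; a result of 0 indicates an intact frame (a property of CRCs with zero final XOR).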

13.13.2.1 Hardware Monitoring (hwmon) Interface

Server and workstation motherboards expose dozens of sensors over I2C/SMBus. UmkaOS provides a HwmonDevice trait analogous to Linux's hwmon subsystem:

/// Limits and alarm thresholds for a single sensor channel.
/// Fields are `None` when the hardware does not support that limit.
#[derive(Copy, Clone, Debug, Default)]
pub struct SensorLimits {
    /// Lower warning threshold (same unit as the sensor reading).
    pub min: Option<i32>,
    /// Upper warning threshold.
    pub max: Option<i32>,
    /// Upper critical threshold (hardware or thermal shutdown level).
    pub crit: Option<i32>,
    /// Lower critical threshold (e.g. critically low voltage).
    pub crit_min: Option<i32>,
}

/// Alarm (out-of-range) status for a single sensor channel.
/// Latched by hardware until read; cleared on read on most devices.
#[derive(Copy, Clone, Debug, Default)]
pub struct SensorAlarm {
    pub below_min: bool,
    pub above_max: bool,
    pub above_crit: bool,
    pub below_crit_min: bool,
    /// Hardware fault (open-circuit thermistor, fan stall, etc.).
    pub fault: bool,
}

/// A hardware monitor device: temperature, fan, voltage, current, power.
///
/// All index arguments are 0-based and hardware-specific. Implementations
/// return `None` for indices beyond the device's channel count.
///
/// The sysfs layout under `/sys/class/hwmon/hwmon<N>/` is generated by the
/// hwmon core from these methods, following Linux hwmon ABI naming exactly
/// so that existing userspace tools (lm-sensors, fancontrol, collectd,
/// Prometheus node-exporter) work without modification:
///   - `temp<i+1>_input`    ← `temperature_mc(i)`
///   - `temp<i+1>_max`      ← `temp_limits_mc(i).max`
///   - `temp<i+1>_crit`     ← `temp_limits_mc(i).crit`
///   - `temp<i+1>_alarm`    ← `temp_alarm(i).above_max || above_crit`
///   - `fan<i+1>_input`     ← `fan_rpm(i)`
///   - `fan<i+1>_min`       ← `fan_limits_rpm(i).min`
///   - `fan<i+1>_alarm`     ← `fan_alarm(i).below_min || fault`
///   - `pwm<i+1>`           ← `set_fan_pwm(i, pwm)` (write-only sysfs attr)
///   - `in<i>_input`        ← `voltage_mv(i)` (voltage channels are 0-based in the Linux ABI)
///   - `in<i>_min`/`max`    ← `voltage_limits_mv(i)`
///   - `in<i>_alarm`        ← `voltage_alarm(i)`
///   - `curr<i+1>_input`    ← `current_ma(i)`
///   - `power<i+1>_input`   ← `power_uw(i)` (µW)
///   - `energy<i+1>_input`  ← `energy_uj(i)` (µJ, monotonic counter)
///   - `update_interval`    ← `update_interval_ms()`
pub trait HwmonDevice: Send + Sync {
    /// Device name reported in `name` sysfs attribute (e.g. "nct6779", "k10temp").
    /// Must be a valid kernel identifier: lowercase, no spaces, ≤ 32 bytes.
    fn name(&self) -> &str;

    // ── Temperature ─────────────────────────────────────────────────────────

    /// Current temperature in millidegrees Celsius. `None` if channel absent.
    fn temperature_mc(&self, index: u8) -> Option<i32>;

    /// Hardware-programmed temperature limits. `None` if the channel is absent
    /// or the device does not support readable limits for this channel.
    fn temp_limits_mc(&self, index: u8) -> Option<SensorLimits> { None }

    /// Latched temperature alarm status. `None` if not supported.
    fn temp_alarm(&self, index: u8) -> Option<SensorAlarm> { None }

    // ── Fan ─────────────────────────────────────────────────────────────────

    /// Fan speed in RPM. `None` if channel absent.
    fn fan_rpm(&self, index: u8) -> Option<u32>;

    /// Hardware-programmed fan speed limits.
    fn fan_limits_rpm(&self, index: u8) -> Option<SensorLimits> { None }

    /// Fan alarm status (stall, below-minimum, etc.).
    fn fan_alarm(&self, index: u8) -> Option<SensorAlarm> { None }

    /// Set fan PWM duty cycle (0 = off, 255 = full speed).
    /// Returns `Err` if the device does not support PWM control or the channel
    /// is absent. The hwmon core exposes this as a writable `pwm<N>` sysfs
    /// attribute only when this method returns `Ok` for at least one index.
    fn set_fan_pwm(&self, index: u8, pwm: u8) -> Result<(), HwmonError>;

    // ── Voltage ─────────────────────────────────────────────────────────────

    /// Voltage in millivolts. `None` if channel absent.
    fn voltage_mv(&self, index: u8) -> Option<i32>;

    /// Hardware voltage limits in millivolts.
    fn voltage_limits_mv(&self, index: u8) -> Option<SensorLimits> { None }

    /// Voltage alarm status (over/under-voltage).
    fn voltage_alarm(&self, index: u8) -> Option<SensorAlarm> { None }

    // ── Current ─────────────────────────────────────────────────────────────

    /// Current in milliamperes. `None` if channel absent.
    fn current_ma(&self, index: u8) -> Option<i32>;

    // ── Power and Energy ─────────────────────────────────────────────────────

    /// Instantaneous power in microwatts. `None` if channel absent.
    /// Used for RAPL domains, PMBus power rails, platform power meters.
    fn power_uw(&self, index: u8) -> Option<u64> { None }

    /// Monotonically increasing energy counter in microjoules.
    /// Wraps at `u64::MAX`. `None` if the device does not maintain an energy
    /// accumulator (as distinct from instantaneous power sampling).
    fn energy_uj(&self, index: u8) -> Option<u64> { None }

    // ── Metadata ─────────────────────────────────────────────────────────────

    /// Sensor update interval in milliseconds. The hwmon core polls at this
    /// rate when userspace reads stale data; it is also exposed as
    /// `update_interval` in sysfs (writable if `set_update_interval` is impl'd).
    fn update_interval_ms(&self) -> u32 { 1000 }

    /// Optionally allow userspace to change the update interval. Return `Err`
    /// if the device does not support reconfigurable rates.
    fn set_update_interval_ms(&self, _ms: u32) -> Result<(), HwmonError> {
        Err(HwmonError::NotSupported)
    }
}

/// Errors returned by `HwmonDevice` methods.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HwmonError {
    /// The requested operation is not supported by this device.
    NotSupported,
    /// Underlying bus error (I2C NACK, SMBus timeout, register read failure).
    BusError,
    /// The channel index is out of range for this device.
    InvalidIndex,
}

Registered HwmonDevice instances are exposed via sysfs under /sys/class/hwmon/hwmon<N>/. The hwmon core generates all sysfs attributes from the trait methods above, including both read-only sensor values and read-write control attributes (PWM, update interval). UmkaOS's hwmon sysfs layout is binary-compatible with Linux's hwmon ABI: every sysfs path and data format matches Linux exactly, so userspace daemons (lm-sensors, fancontrol, libsensors, Prometheus node-exporter, IPMI daemons) run without modification.
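
To illustrate the 1-based `temp<N>_input` naming, here is a reduced temperature-only stub of the trait together with the attribute generation the hwmon core performs. `HwmonTempSource`, `FakeChip`, and `temp_input_attrs` are hypothetical names for illustration only:

```rust
/// Reduced, illustrative subset of HwmonDevice: temperature channels only.
trait HwmonTempSource {
    fn name(&self) -> &str;
    fn temperature_mc(&self, index: u8) -> Option<i32>;
}

/// Fake two-channel sensor chip for demonstration.
struct FakeChip;
impl HwmonTempSource for FakeChip {
    fn name(&self) -> &str { "fakechip" }
    fn temperature_mc(&self, index: u8) -> Option<i32> {
        match index { 0 => Some(42_000), 1 => Some(38_500), _ => None }
    }
}

/// Mirror of the hwmon core's attribute generation: probe 0-based indices
/// until the device returns None, emitting 1-based Linux-ABI names.
fn temp_input_attrs(dev: &dyn HwmonTempSource) -> Vec<String> {
    let mut attrs = Vec::new();
    for i in 0u8.. {
        if dev.temperature_mc(i).is_none() { break; }
        attrs.push(format!("temp{}_input", i as u32 + 1));
    }
    attrs
}
```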

13.13.2.2 PMBus (Power Management Bus)

PMBus is a layered protocol over SMBus for communicating with power conversion devices (VRMs, PSUs, battery chargers). PMBus defines a standardised command set (PMBUS_READ_VIN, PMBUS_READ_VOUT, PMBUS_READ_IOUT, PMBUS_READ_TEMPERATURE_1, etc.) with standardised data formats.

The UmkaOS PMBus driver:
1. Probes devices via ACPI/DT with a pmbus compatible string.
2. Reads PMBUS_MFR_ID and PMBUS_MFR_MODEL for identification.
3. Registers a HwmonDevice exposing all PMBus telemetry channels.
4. Monitors STATUS_WORD for fault conditions (over-voltage, over-current, over-temperature, fan fault) and reports via fma_report_health() (Section 20.1) so the FMA can correlate thermal events with EDAC memory error rates on the same DIMM. The FMA telemetry path is the sole reporting mechanism — there is no separate hwmon-to-system-event-bus path that would bypass FMA correlation.
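Most PMBus telemetry commands (READ_VIN, READ_IOUT, READ_TEMPERATURE_1, ...) return values in the LINEAR11 format: an 11-bit two's-complement mantissa Y in bits [10:0] and a 5-bit two's-complement exponent N in bits [15:11], with value = Y × 2^N. (READ_VOUT typically uses the separate LINEAR16 format governed by VOUT_MODE and is not covered here.) A decoding sketch — the milli-unit scaling helper is illustrative, not the driver's actual API:

```rust
/// Decode a PMBus LINEAR11 word into milli-units (mV, mA, m°C, ...).
/// LINEAR11: value = mantissa × 2^exponent, with an 11-bit signed mantissa
/// in bits [10:0] and a 5-bit signed exponent in bits [15:11].
pub fn linear11_to_milli(raw: u16) -> i64 {
    // Sign-extend the 11-bit mantissa.
    let mut mantissa = (raw & 0x07FF) as i64;
    if mantissa & 0x0400 != 0 { mantissa -= 0x0800; }
    // Sign-extend the 5-bit exponent.
    let mut exp = ((raw >> 11) & 0x1F) as i32;
    if exp & 0x10 != 0 { exp -= 0x20; }
    // Scale to milli-units BEFORE applying the (usually negative) exponent
    // so fractional results survive integer arithmetic.
    let milli = mantissa * 1000;
    if exp >= 0 { milli << exp } else { milli >> (-exp) }
}
```

For example, raw 0xE0C0 encodes exponent −4 and mantissa 192, i.e. 192/16 = 12.000 V when read from READ_VIN.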

13.13.2.3 DIMM SPD and Thermal Sensors

DDR4/DDR5 DIMMs have an SPD EEPROM at I2C address 0x50–0x57 (slot-indexed). The memory controller driver reads SPD at boot for training parameters. DDR4 DIMMs also expose a thermal sensor at address 0x18–0x1F via a JEDEC JC42.4-compliant interface (TSE2004 or compatible).

/// SPD EEPROM read (partial — first 256 bytes sufficient for JEDEC training).
pub fn read_spd(bus: &dyn I2cBus, slot: u8) -> Result<[u8; 256], I2cResult> {
    let addr = 0x50u8 | (slot & 0x07);
    let mut buf = [0u8; 256];
    // SPD page select not needed for first 256 bytes on DDR4.
    // I2cBus::transfer() returns I2cResult (not Result), so use match instead of ?.
    match bus.transfer(addr, &[0x00], &mut buf) {
        I2cResult::Ok => Ok(buf),
        e => Err(e),
    }
}

/// DDR4 thermal sensor read (TS register, 13-bit signed, 0.0625°C LSB).
/// `dev` must point to a JEDEC JC42.4 thermal sensor at address 0x18 | slot.
/// Returns temperature in millidegrees Celsius.
pub fn read_dimm_temp_mc(dev: &I2cDevice) -> Result<i32, I2cResult> {
    // JEDEC JC42.4 thermal sensors transmit MSB first (big-endian).
    let raw = dev.read_reg16_be(0x05)?;
    // Bits [15:13] are alarm flags (CRIT, MAX, MIN) — MUST be masked off.
    // Bit [12] is the sign bit. Bits [11:0] are magnitude in 1/16°C units.
    // Masking to 13 bits and sign-extending from bit 12 (matching Linux
    // drivers/hwmon/jc42.c `jc42_temp_from_reg()`).
    let masked = raw & 0x1FFF;
    let sign_extended = if masked & 0x1000 != 0 {
        (masked as i32) - 0x2000
    } else {
        masked as i32
    };
    // sign_extended is in units of 1/16°C → millidegrees: * 125 / 2 = * 62.5
    Ok(sign_extended * 125 / 2)
}

13.13.2.4 LPC Bus and Super I/O Sensor Chips

On x86 PC and server motherboards, the motherboard sensor controller (fan speed, temperature, voltage) is typically a Super I/O chip connected to the LPC (Low Pin Count) bus — the legacy ISA bus successor that carries UART, LPC-attached EC, and sensor devices. Common chips: Nuvoton NCT6xxx, IT87xx series, Winbond W83xxx, SMSC/Microchip MEC series.

Architecture scope: LPC bus is x86-specific (part of the Intel Platform Controller Hub heritage). ARM, RISC-V, and PPC platforms expose motherboard sensors via SCMI (§7.2.3), I2C/SMBus (§11.10.2), or IPMI (§12.x). This subsection applies exclusively to x86-64 UmkaOS builds.

Super I/O access protocol: Super I/O chips decode I/O port cycles on the LPC bus at two standard address pairs: 0x2E/0x2F (index/data) or 0x4E/0x4F. A 16-bit chip-ID is read by writing logical device 0xFF and reading registers 0x20/0x21. Logical devices (LDN) are selected by writing the LDN byte to the index register; individual registers within a device are then accessed by writing the register address to index and reading/writing data.

/// Super I/O configuration port pair.
#[derive(Copy, Clone)]
pub enum SuperIoBase {
    Primary   = 0x2E,   // Most Super I/O chips default here
    Secondary = 0x4E,   // Some boards use this; try both during probe
}

/// Probe a Super I/O chip at the given base address.
/// Returns `None` if no chip responds; `Some((vendor, chip_id))` otherwise.
///
/// # Safety
/// Caller must ensure no other code is concurrently accessing these I/O ports.
/// Called only during boot probe with interrupts disabled on the BSP.
pub unsafe fn superio_probe(base: SuperIoBase) -> Option<SuperIoId> {
    // Enter configuration mode: most chips use 0x87/0x87 or 0x55/0xAA sequence.
    // UmkaOS tries both; the chip responds to whichever matches.
    superio_enter_config(base);
    let id_high = superio_read(base, LDN_GLOBAL, REG_CHIP_ID_HIGH);
    let id_low  = superio_read(base, LDN_GLOBAL, REG_CHIP_ID_LOW);
    let chip_id = ((id_high as u16) << 8) | (id_low as u16);
    superio_exit_config(base);
    if chip_id == 0x0000 || chip_id == 0xFFFF { return None; }
    Some(SuperIoId { base, chip_id })
}

/// Detected Super I/O device identity.
pub struct SuperIoId {
    pub base: SuperIoBase,
    /// Manufacturer-assigned 16-bit chip identifier.
    /// e.g. 0xD42A = Nuvoton NCT6796D, 0x8728 = ITE IT8728F
    pub chip_id: u16,
}
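The index/data access protocol behind superio_read() can be exercised without hardware by abstracting port I/O behind a trait. In this sketch, PortIo, superio_read_reg, and MockChip are illustrative names (the real kernel implementation issues in/out instructions directly); the mock models a chip whose data port echoes the register selected via the index port:

```rust
use std::collections::HashMap;

/// Abstract byte-wide port I/O so the protocol can be tested with a mock.
pub trait PortIo {
    fn outb(&mut self, port: u16, val: u8);
    fn inb(&mut self, port: u16) -> u8;
}

const LDN_SELECT_REG: u8 = 0x07; // CR07 selects the logical device

/// Read one configuration register from logical device `ldn` of the Super
/// I/O chip at index port `base` (0x2E or 0x4E) / data port `base + 1`.
pub fn superio_read_reg<P: PortIo>(io: &mut P, base: u16, ldn: u8, reg: u8) -> u8 {
    io.outb(base, LDN_SELECT_REG);  // index ← CR07
    io.outb(base + 1, ldn);         // data  ← logical device number
    io.outb(base, reg);             // index ← target register
    io.inb(base + 1)                // data  → register value
}

/// Mock chip: registers keyed by (logical device, register index).
/// Both standard port pairs (0x2E/0x2F, 0x4E/0x4F) have an even index port
/// and odd data port, so `port & 1` distinguishes them.
pub struct MockChip {
    ldn: u8,
    index: u8,
    pub regs: HashMap<(u8, u8), u8>,
}

impl PortIo for MockChip {
    fn outb(&mut self, port: u16, val: u8) {
        if port & 1 == 0 {
            self.index = val;                       // index port write
        } else if self.index == LDN_SELECT_REG {
            self.ldn = val;                         // LDN select
        } else {
            self.regs.insert((self.ldn, self.index), val);
        }
    }
    fn inb(&mut self, port: u16) -> u8 {
        if port & 1 == 1 {
            *self.regs.get(&(self.ldn, self.index)).unwrap_or(&0xFF)
        } else {
            0xFF // index port reads back as open bus in this mock
        }
    }
}
```

This is also the shape a unit test for the real driver would take: populate the mock's register map with a known chip ID, then assert that the probe path decodes it.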

Driver registration: Each supported chip has a SuperIoDriver registered in a static table (no heap allocation). At boot, superio_probe() is called for both base addresses; if a chip responds, its chip_id is matched against the driver table and the matching driver's init() is called. The driver creates a HwmonDevice implementation and registers it with the hwmon subsystem.

/// A driver for one Super I/O chip variant.
pub struct SuperIoDriver {
    /// Chip ID match mask and value: matches if `(chip_id & mask) == value`.
    /// Allows a single driver to cover minor chip revisions.
    pub id_mask:  u16,
    pub id_value: u16,
    /// Human-readable chip name (e.g. "nct6796d", "it8728f").
    pub name: &'static str,
    /// Initialise the chip and register a `HwmonDevice`.
    /// Returns the hwmon device registration handle on success.
    pub init: fn(id: SuperIoId) -> Result<HwmonHandle, HwmonError>,
}
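The mask-based matching can be sketched as a plain lookup over a static table. The chip IDs below come from the SuperIoId example above; the mask values are illustrative (real drivers choose masks per each vendor's revision-encoding scheme), and match_superio_driver is a hypothetical helper:

```rust
/// Illustrative (mask, value, name) match table. Nuvoton encodes the chip
/// revision in the low nibble, so its entry masks it off; these entries are
/// examples, not an exhaustive table.
const SUPERIO_MATCH_TABLE: &[(u16, u16, &str)] = &[
    (0xFFF0, 0xD420, "nct6796d"), // matches 0xD42A and sibling revisions
    (0xFFFF, 0x8728, "it8728f"),  // exact match
];

/// Return the first driver whose `(chip_id & mask) == value` test passes.
pub fn match_superio_driver(chip_id: u16) -> Option<&'static str> {
    SUPERIO_MATCH_TABLE
        .iter()
        .find(|(mask, value, _)| chip_id & mask == *value)
        .map(|(_, _, name)| *name)
}
```

First-match-wins ordering lets a specific entry shadow a broader masked entry, so tables should list exact matches before masked ones.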

Hardware monitor logical device: Super I/O chips expose a dedicated Hardware Monitor logical device (LDN varies by chip; e.g. LDN 0x0B on NCT6xxx). The HWM LDN has an I/O base address register; the driver reads this and maps the HWM register space at that address. Temperature, fan speed, and voltage registers are then accessed via direct I/O port reads from the HWM register bank.

Fan RPM measurement uses a 16-bit counter clocked at a chip-specific frequency (typically 22.5 kHz for NCT6xxx): rpm = 1,350,000 / raw_count for NCT6xxx. Temperature sensors are 8-bit or 9-bit two's complement in units of 1°C or 0.5°C depending on the register. Voltage channels measure V_in through a resistor divider; the conversion factor is chip-specific and must be documented per channel.
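These conversions can be sketched as follows. The fan constant is the NCT6xxx figure from the text; the 9-bit temperature layout (value in the top 9 bits of a 16-bit register pair) is a common arrangement but is chip-specific, so treat temp9_mc as an assumption:

```rust
/// NCT6xxx-style fan tachometer: 16-bit count of 22.5 kHz clock periods
/// → RPM = 1,350,000 / count. Counts of 0 and 0xFFFF mean stalled / no fan.
pub fn nct6xxx_fan_rpm(raw: u16) -> u32 {
    match raw {
        0 | 0xFFFF => 0,
        n => 1_350_000 / n as u32,
    }
}

/// 8-bit two's-complement temperature in 1 °C units → millidegrees.
pub fn temp8_mc(raw: u8) -> i32 {
    (raw as i8 as i32) * 1000
}

/// 9-bit two's-complement temperature in 0.5 °C units → millidegrees.
/// Assumes the 9 bits occupy bits [15:7] of a 16-bit register pair
/// (chip-specific; check the datasheet per channel).
pub fn temp9_mc(raw16: u16) -> i32 {
    // Arithmetic right shift of i16 sign-extends the 9-bit value.
    ((raw16 as i16) >> 7) as i32 * 500
}
```

A raw tach count of 1350 thus reads as 1000 RPM, and an 8-bit raw value of 0xFF reads as −1 °C (−1000 m°C).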

The Super I/O hardware monitor driver reports all channels through the standard HwmonDevice trait; no Super I/O-specific APIs are exposed to the rest of the kernel. This ensures lm-sensors and userspace tools work via the standard /sys/class/hwmon/ interface without Super I/O knowledge.

13.13.2.5 RAPL and Platform Power as hwmon Devices

Intel RAPL (Running Average Power Limit) and AMD equivalent energy measurement MSRs (§7.2.5) expose energy counters per power domain. These are registered as HwmonDevice instances so that userspace power monitoring tools (lm-sensors power channels, turbostat, Prometheus node_cpu_energy_joules_total) can read them through the standard /sys/class/hwmon/ interface without requiring MSR access privileges.

Registration: The RAPL subsystem (§7.2.5) calls rapl_hwmon_register() at the end of boot phase 9 (after RAPL is initialised). One HwmonDevice is registered per RAPL domain that is readable on the current platform:

| RAPL domain     | hwmon device name  | power1_input source | energy1_input source                                 |
|-----------------|--------------------|---------------------|------------------------------------------------------|
| Package (PKG)   | intel_rapl_pkg     | Δenergy / Δtime     | MSR_PKG_ENERGY_STATUS (x86) or RAPL_ENERGY_PKG (AMD) |
| Core (PP0)      | intel_rapl_core    | Δenergy / Δtime     | MSR_PP0_ENERGY_STATUS                                |
| Uncore (PP1)    | intel_rapl_uncore  | Δenergy / Δtime     | MSR_PP1_ENERGY_STATUS (where supported)              |
| DRAM            | intel_rapl_dram    | Δenergy / Δtime     | MSR_DRAM_ENERGY_STATUS (Haswell+ servers)            |
| Platform (PSys) | intel_rapl_psys    | Δenergy / Δtime     | MSR_PLATFORM_ENERGY_STATUS (Skylake+ mobile)         |

/// A hwmon device backed by a single RAPL energy domain.
pub struct RaplHwmonDevice {
    /// Human-readable name for the sysfs `name` attribute.
    pub name_str: &'static str,
    /// The RAPL domain this device reads (from §7.2.5 `PowerDomain` enum).
    pub domain: PowerDomainId,
    /// Last (energy_uj, timestamp_ns) pair protected by SpinLock.
    /// SpinLock is the primary (and only) implementation — `AtomicU128` is not
    /// available on stable Rust for any target, and not available at all on 4
    /// of 8 supported architectures (ARMv7, PPC32, RISC-V 64, LoongArch64).
    /// Acceptable because power monitoring is a cold path (polled at ~1-10 Hz).
    last_sample: SpinLock<(u64, u64)>,  // (energy_uj, timestamp_ns)
}

impl HwmonDevice for RaplHwmonDevice {
    fn name(&self) -> &str { self.name_str }

    // No temperature, fan, voltage, or current sensors.
    fn temperature_mc(&self, _: u8) -> Option<i32> { None }
    fn fan_rpm(&self, _: u8) -> Option<u32> { None }
    fn voltage_mv(&self, _: u8) -> Option<i32> { None }
    fn current_ma(&self, _: u8) -> Option<i32> { None }
    fn set_fan_pwm(&self, _: u8, _: u8) -> Result<(), HwmonError> {
        Err(HwmonError::NotSupported)
    }

    /// Instantaneous power in microwatts, derived from the energy counter
    /// delta since the previous read. First read always returns 0 µW (no
    /// prior sample).
    ///
    /// **Atomicity**: SpinLock protects both energy and timestamp fields,
    /// ensuring a reader never sees a mismatched pair.
    fn power_uw(&self, index: u8) -> Option<u64> {
        if index != 0 { return None; }
        let energy_now = rapl_read_energy_uj(self.domain)?;
        let ts_now     = clock::monotonic_ns();
        let mut guard = self.last_sample.lock();
        let (energy_prev, ts_prev) = *guard;
        *guard = (energy_now, ts_now);
        drop(guard);
        if ts_prev == 0 { return Some(0); } // first read
        let delta_uj = energy_now.wrapping_sub(energy_prev);
        let delta_ns = ts_now.saturating_sub(ts_prev);
        if delta_ns == 0 { return Some(0); }
        // µW = µJ/s = µJ × 1e9 / ns. Widen to u128: delta_uj × 1e9 would
        // overflow u64 once the interval's energy delta exceeds ~18 J.
        Some(((delta_uj as u128) * 1_000_000_000 / (delta_ns as u128)) as u64)
    }

    /// Monotonic energy counter in microjoules. Wraps at u64::MAX (≫ system lifetime).
    fn energy_uj(&self, index: u8) -> Option<u64> {
        if index != 0 { return None; }
        rapl_read_energy_uj(self.domain)
    }

    fn update_interval_ms(&self) -> u32 { 100 }  // 10 Hz: reasonable for power
}

ARM/RISC-V/PPC equivalents: On AArch64 SCMI platforms, SCMI_POWER_DOMAIN and SCMI_SENSOR protocols expose per-domain power readings; the SCMI thermal driver (§7.2.3) registers corresponding HwmonDevice instances with the same power_uw() / energy_uj() methods and identical sysfs paths. On PPC64LE OPAL platforms, OPAL's opal_sensor_read() call provides equivalent data registered under the same hwmon framework. RISC-V platforms expose platform power (where available) via the device tree power-domains node or ACPI RAPL-compatible interface on server-class RISC-V systems. The hwmon ABI is identical regardless of the underlying platform power source.

13.13.3 I2C-HID Protocol

I2C-HID (HID over I2C, HIDI2C v1.0 specification) is used for touchpads, touchscreens, fingerprint readers, and other HID devices with I2C interfaces. The kernel implements the transport layer; HID report parsing is shared with the USB HID stack (Section 13.12).

Protocol flow:
1. ACPI reports the device with PNP0C50 (_HID) or ACPI0C50; _CRS provides the I2C address, IRQ GPIO line, and descriptor register address.
2. Driver reads the HID descriptor (30 bytes) from the descriptor register.
3. Driver reads the HID Report Descriptor and passes it to the shared HidParser.
4. Device asserts the IRQ GPIO (falling edge) when a new input report is ready.
5. ISR: reads the input report from the input register address specified in the descriptor; parses via HidParser; posts InputEvent to the input subsystem ring buffer (Section 21.3, 20-user-io.md).

#[repr(C, packed)]
pub struct I2cHidDescriptor {
    pub length:            u16,   // Must be 30 (per HIDI2C v1.0 spec)
    pub bcd_version:       u16,   // 0x0100 for v1.0
    pub report_desc_len:   u16,
    pub report_desc_reg:   u16,
    pub input_reg:         u16,
    pub max_input_len:     u16,
    pub output_reg:        u16,
    pub max_output_len:    u16,
    pub cmd_reg:           u16,
    pub data_reg:          u16,
    pub vendor_id:         u16,
    pub product_id:        u16,
    pub version_id:        u16,
    /// Reserved (4 bytes) — read and ignored. The HIDI2C v1.0 wire format is
    /// exactly 30 bytes: 13 × u16 fields plus these 4 reserved bytes. When
    /// reading from the device, read exactly 30 bytes into this struct.
    pub _reserved:         [u8; 4],
}
// I2cHidDescriptor: 13 × u16 + 4 reserved bytes = 30 bytes (packed, no padding).
const_assert!(size_of::<I2cHidDescriptor>() == 30);

HID parser security bounds (all input from the HID device — I2C or USB — is UNTRUSTED):

/// Maximum HID report descriptor byte length.
/// Matches Linux's HID_MAX_DESCRIPTOR_SIZE (4096 bytes); no known device
/// needs more. UmkaOS enforces this as a hard limit to prevent parser state
/// explosion from untrusted (potentially malicious) devices.
pub const HID_REPORT_DESC_MAX_BYTES: usize = 4096;

/// Maximum number of usage/field items per HID report ID.
/// Limits parser memory to HID_MAX_FIELDS_PER_REPORT × sizeof(HidField) per report.
pub const HID_MAX_FIELDS_PER_REPORT: usize = 256;

/// Maximum number of report descriptors per HID device.
/// (Enforced structurally by ArrayVec<HidReport, HID_MAX_REPORTS>.)
pub const HID_MAX_REPORTS: usize = 16;

HID descriptor parsing error handling (all input is UNTRUSTED — it comes from the device):
- Descriptor exceeds HID_REPORT_DESC_MAX_BYTES → return Err(HidError::DescriptorTooLong)
- Unknown item tag → skip the item per USB HID §6.2.2.3 (long-item skipping) and continue parsing (permissive, for hardware compatibility with quirky devices)
- Fields exceed HID_MAX_FIELDS_PER_REPORT → truncate excess fields, log KERN_WARNING
- report_count × report_size overflows u32 → return Err(HidError::ReportSizeOverflow)
- Descriptor ends mid-item → return Err(HidError::TruncatedDescriptor)
- Collection nesting exceeds MAX_COLLECTION_DEPTH → return Err(HidError::CollectionTooDeep)

13.13.3.1 HID Report Descriptor Parser

The HidReportDescriptor parser is a state machine shared between the I2C-HID and USB HID stacks. It processes the raw byte stream of a HID report descriptor (USB HID spec §6.2.2) and produces a structured HidReportMap of typed fields.

// umka-core/src/hid/parser.rs

/// Maximum nesting depth of Collection items in a HID report descriptor.
/// USB HID spec §6.2.2.6 defines Collections as nestable; devices rarely
/// exceed 4 levels. Depth 8 accommodates all known hardware while bounding
/// stack usage in the parser state machine.
pub const MAX_COLLECTION_DEPTH: usize = 8;

/// A single field extracted from a HID report descriptor.
/// Represents one data item (button, axis, etc.) within a report.
#[derive(Clone, Debug)]
pub struct HidField {
    /// HID Usage Page (e.g., 0x01 = Generic Desktop, 0x09 = Button,
    /// 0x0D = Digitizer). Set by the most recent Usage Page global item.
    pub usage_page:   u16,
    /// Minimum Usage ID in the range assigned to this field.
    /// For single-usage fields, `usage_min == usage_max`.
    pub usage_min:    u16,
    /// Maximum Usage ID in the range (inclusive).
    pub usage_max:    u16,
    /// Minimum logical value this field can report.
    /// Signed: negative values are valid (e.g., relative axes).
    pub logical_min:  i32,
    /// Maximum logical value this field can report.
    pub logical_max:  i32,
    /// Bit offset of this field within the report (accumulated by the parser).
    pub bit_offset:   u32,
    /// Size of each datum in bits (from Report Size item).
    pub report_size:  u32,
    /// Number of data values in this field (from Report Count item).
    pub report_count: u32,
    /// Item flags from the Main item (Input/Output/Feature):
    /// bit 0 = Data(0)/Constant(1), bit 1 = Array(0)/Variable(1),
    /// bit 2 = Absolute(0)/Relative(1), bit 3 = NoWrap(0)/Wrap(1),
    /// bit 4 = Linear(0)/NonLinear(1), bit 5 = PreferredState(0)/NoPreferred(1),
    /// bit 6 = NoNullPosition(0)/NullState(1), bit 8 = BitField(0)/Buffered(1).
    pub flags:        u32,
}

/// Parsed output of the HID report descriptor parser.
/// Contains all reports and their fields, bounded by compile-time limits.
pub struct HidReportMap {
    /// Parsed reports, indexed by report ID. At most `HID_MAX_REPORTS` entries.
    pub reports: ArrayVec<HidReport, HID_MAX_REPORTS>,
}

/// Parser state machine global items (USB HID §6.2.2.7).
/// These items persist across Main items until explicitly changed.
#[derive(Clone, Default)]
struct HidGlobalState {
    usage_page:   u16,
    logical_min:  i32,
    logical_max:  i32,
    report_size:  u32,
    report_count: u32,
    report_id:    u8,
}

/// Parser state machine local items (USB HID §6.2.2.8).
/// These items are reset after each Main item.
#[derive(Clone, Default)]
struct HidLocalState {
    usage_min: u16,
    usage_max: u16,
    /// Usage stack for single-usage items (Usage items before a Main item).
    /// Bounded to prevent unbounded growth from malicious descriptors.
    usages:    ArrayVec<u16, 256>,
}

/// Errors returned by the HID report descriptor parser.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HidError {
    /// Raw descriptor exceeds `HID_REPORT_DESC_MAX_BYTES`.
    DescriptorTooLong,
    /// Descriptor byte stream ends in the middle of an item header or payload.
    TruncatedDescriptor,
    /// `report_count * report_size` overflows `u32`, indicating a malformed
    /// or malicious descriptor.
    ReportSizeOverflow,
    /// Collection nesting exceeds `MAX_COLLECTION_DEPTH`.
    CollectionTooDeep,
    /// End Collection item without matching Collection (nesting underflow).
    CollectionUnderflow,
    /// More reports than `HID_MAX_REPORTS` in a single descriptor.
    TooManyReports,
}

/// HID report descriptor state-machine parser.
///
/// Processes items sequentially from the raw byte stream. The parser
/// maintains a global state stack (for Push/Pop items) and local state
/// that resets after each Main item (Input/Output/Feature).
///
/// **State machine transitions** (USB HID §6.2.2):
///
/// ```text
///   ┌─────────────────────────────────────────────────────┐
///   │                    IDLE                              │
///   │  (initial state, or after processing a Main item)   │
///   └──────┬──────────┬──────────┬──────────┬─────────────┘
///          │          │          │          │
///     Global item  Local item  Collection  Main item
///     (Usage Page, (Usage,     (begin/end) (Input/Output/
///      Log Min/Max, Usage Min,              Feature)
///      Report Size, Usage Max)
///      Report Count,
///      Report ID,
///      Push/Pop)
///          │          │          │          │
///          ▼          ▼          ▼          ▼
///    Update global  Update     Push/pop   Emit HidField(s)
///    state fields   local      collection  from global+local
///                   state      depth       state, reset local
///                   fields     counter     state, return to IDLE
///          │          │          │          │
///          └──────────┴──────────┴──────────┘
///                         │
///                    back to IDLE
/// ```
pub struct HidReportDescriptor;

impl HidReportDescriptor {
    /// Parse a raw HID report descriptor byte stream into a `HidReportMap`.
    ///
    /// # Arguments
    /// - `data`: Raw descriptor bytes (from the device). Length must not
    ///   exceed `HID_REPORT_DESC_MAX_BYTES`.
    ///
    /// # Returns
    /// - `Ok(HidReportMap)` with all parsed reports and fields.
    /// - `Err(HidError)` on malformed, truncated, or oversized descriptors.
    ///
    /// # Algorithm
    /// 1. Validate `data.len() <= HID_REPORT_DESC_MAX_BYTES`.
    /// 2. Initialize empty `HidReportMap`, `HidGlobalState`, `HidLocalState`,
    ///    collection depth = 0, global state stack = `ArrayVec<_, MAX_COLLECTION_DEPTH>`.
    /// 3. While bytes remain, read the item header byte:
    ///    - Bits [1:0] = size (0, 1, 2, 4 bytes payload). If size bits = 0b11, payload is 4 bytes.
    ///    - Bits [3:2] = type (0 = Main, 1 = Global, 2 = Local, 3 = reserved/long item).
    ///    - Bits [7:4] = tag (item-specific).
    ///    - For long items (tag = 0xF, type = 3): read 1-byte data size + 1-byte long tag,
    ///      skip `data_size` bytes (USB HID §6.2.2.3). Continue parsing.
    /// 4. Dispatch by type:
    ///    - **Global**: Update `HidGlobalState` fields. For Push: push current global
    ///      state onto the stack (bounded by `MAX_COLLECTION_DEPTH`). For Pop: restore.
    ///    - **Local**: Update `HidLocalState` fields (Usage, Usage Min/Max).
    ///    - **Main (Collection)**: Increment collection depth. If depth >
    ///      `MAX_COLLECTION_DEPTH`, return `Err(CollectionTooDeep)`.
    ///    - **Main (End Collection)**: Decrement collection depth.
    ///    - **Main (Input/Output/Feature)**: Emit field(s) — see step 5.
    /// 5. Emit fields for a Main item:
    ///    a. Look up or create the `HidReport` for `global.report_id`.
    ///    b. Compute total bits = `global.report_count * global.report_size`.
    ///       If overflow, return `Err(ReportSizeOverflow)`.
    ///    c. Build `HidField` from global + local state. If local `usages`
    ///       stack has entries, assign one usage per field (Variable items) or
    ///       use Usage Min/Max range (Array items).
    ///    d. Push field to report's `fields: ArrayVec<HidField, HID_MAX_FIELDS_PER_REPORT>`.
    ///       If full, truncate and log `KERN_WARNING`.
    ///    e. Advance report's `total_bits` by the computed bit count.
    ///    f. Reset local state.
    /// 6. Return `Ok(HidReportMap)`.
    pub fn parse_items(data: &[u8]) -> Result<HidReportMap, HidError> {
        if data.len() > HID_REPORT_DESC_MAX_BYTES {
            return Err(HidError::DescriptorTooLong);
        }
        let mut map = HidReportMap::new();
        let mut global = HidGlobalState::default();
        let mut local = HidLocalState::default();
        let mut depth: u8 = 0;
        let mut global_stack: ArrayVec<HidGlobalState, MAX_COLLECTION_DEPTH> = ArrayVec::new();
        let mut pos: usize = 0;
        while pos < data.len() {
            let header = data[pos];
            let size = match header & 0x03 { 0b11 => 4, s => s as usize };
            let item_type = (header >> 2) & 0x03;
            let tag = (header >> 4) & 0x0F;
            pos += 1;
            if tag == 0xF && item_type == 3 { // long item
                if pos >= data.len() { return Err(HidError::TruncatedDescriptor); }
                let long_size = data[pos] as usize;
                pos += 2 + long_size; // skip data-size byte + long tag + data
                if pos > data.len() { return Err(HidError::TruncatedDescriptor); }
                continue;
            }
            if pos + size > data.len() { return Err(HidError::TruncatedDescriptor); }
            let payload = &data[pos..pos + size];
            pos += size;
            match item_type {
                0 => { /* Main: dispatch Input/Output/Feature/Collection/EndCollection */
                    match tag {
                        0x8 => { /* Input */   emit_field(&mut map, &global, &mut local, payload)?; }
                        0x9 => { /* Output */  emit_field(&mut map, &global, &mut local, payload)?; }
                        0xB => { /* Feature */ emit_field(&mut map, &global, &mut local, payload)?; }
                        0xA => { /* Collection */
                            depth += 1;
                            if depth > MAX_COLLECTION_DEPTH as u8 { return Err(HidError::CollectionTooDeep); }
                        }
                        0xC => { /* End Collection */
                            if depth == 0 { return Err(HidError::CollectionUnderflow); }
                            depth -= 1;
                        }
                        _ => {} // unknown main tag: skip
                    }
                }
                1 => { /* Global: update global state, handle Push/Pop */
                    global.apply_global_item(tag, payload, &mut global_stack)?;
                }
                2 => { /* Local: update local state */
                    local.apply_local_item(tag, payload);
                }
                _ => {} // reserved type: skip
            }
        }
        Ok(map)
    }
}
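The item-header decode in step 3 can be exercised standalone. The sketch below walks a standard 3-button boot mouse report descriptor (adapted from the example mouse descriptor in the USB HID specification's appendix), accumulating Input bits and collection depth with the same header decoding as parse_items. scan_report_descriptor is a simplified stand-in for illustration, not the kernel parser:

```rust
/// Simplified descriptor walker: returns (total Input bits, max collection
/// depth). Item header layout as in `parse_items`: bits [1:0] payload size,
/// [3:2] item type, [7:4] tag.
pub fn scan_report_descriptor(data: &[u8]) -> Result<(u32, u8), &'static str> {
    let (mut report_size, mut report_count) = (0u32, 0u32);
    let mut input_bits = 0u32;
    let (mut depth, mut max_depth) = (0u8, 0u8);
    let mut pos = 0usize;
    while pos < data.len() {
        let header = data[pos];
        let size = match header & 0x03 { 0b11 => 4, s => s as usize };
        let item_type = (header >> 2) & 0x03;
        let tag = (header >> 4) & 0x0F;
        pos += 1;
        if pos + size > data.len() { return Err("truncated"); }
        let mut val = 0u32;
        for (i, b) in data[pos..pos + size].iter().enumerate() {
            val |= (*b as u32) << (8 * i); // payloads are little-endian
        }
        pos += size;
        match (item_type, tag) {
            (1, 0x7) => report_size = val,  // Global: Report Size
            (1, 0x9) => report_count = val, // Global: Report Count
            (0, 0x8) => input_bits += report_size * report_count, // Main: Input
            (0, 0xA) => { depth += 1; max_depth = max_depth.max(depth); } // Collection
            (0, 0xC) => { if depth == 0 { return Err("underflow"); } depth -= 1; }
            _ => {} // everything else is irrelevant to this summary
        }
    }
    Ok((input_bits, max_depth))
}

/// 3-button boot mouse report descriptor: buttons (3×1 bit), padding (5 bits),
/// X/Y relative axes (2×8 bits), inside Application/Physical collections.
pub const MOUSE_DESC: &[u8] = &[
    0x05, 0x01, 0x09, 0x02, 0xA1, 0x01, 0x09, 0x01, 0xA1, 0x00,
    0x05, 0x09, 0x19, 0x01, 0x29, 0x03, 0x15, 0x00, 0x25, 0x01,
    0x95, 0x03, 0x75, 0x01, 0x81, 0x02, 0x95, 0x01, 0x75, 0x05,
    0x81, 0x01, 0x05, 0x01, 0x09, 0x30, 0x09, 0x31, 0x15, 0x81,
    0x25, 0x7F, 0x75, 0x08, 0x95, 0x02, 0x81, 0x06, 0xC0, 0xC0,
];
```

Walking MOUSE_DESC yields 24 Input bits (3 button bits + 5 constant padding bits + two 8-bit relative axes) at a maximum collection depth of 2.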

The full I2cHidDevice implementation and interrupt handler:

// umka-core/src/hid/i2c_hid.rs

/// I2C-HID driver state.
///
/// **Allocation invariant**: Allocated as `Pin<Box<I2cHidDevice>>` at probe time.
/// The device is stored in a module-level registry as `*const I2cHidDevice`.
/// The `Pin` guarantee ensures the device is never moved after the pointer
/// is registered, preventing dangling pointers in the interrupt dispatch path.
pub struct I2cHidDevice {
    /// I2C device handle.
    pub i2c: I2cDevice,
    /// Descriptor (fetched at probe time).
    pub desc: I2cHidDescriptor,
    /// Interrupt GPIO line (from ACPI `_CRS` GpioInt resource).
    pub irq_gpio: GpioLine,
    /// RAII handle for the registered GPIO interrupt. Dropping this
    /// deregisters the handler. Obtained from `GpioController::request_irq()`
    /// at probe time ([Section 13.10](#gpio-and-pin-control)).
    pub irq_handle: GpioIrqHandle,
    /// HID report descriptor (fetched once at probe time). `Box<[u8]>` over
    /// `Vec<u8>`: the slice is allocated at probe with the exact length returned
    /// by the device and never resized. Prevents accidental reallocation if a
    /// method on `Vec` is called after probe.
    pub report_desc: Box<[u8]>,
    /// Pre-allocated input report buffer sized to `desc.max_input_len` at probe.
    /// Wrapped in `UnsafeCell` for interior mutability: the interrupt handler
    /// obtains `&mut [u8]` via `UnsafeCell::get()` through a shared `&self`
    /// reference. `Box<[u8]>` over `Vec<u8>`: the fixed-capacity slice prevents
    /// reallocation in interrupt context. No heap allocation during IRQ handling.
    pub report_buf: UnsafeCell<Box<[u8]>>,
    /// Parsed HID report parser state. Parses a HID report descriptor
    /// (sequence of items per USB HID spec §6.2.2) into a structured
    /// representation of reports, fields, and usages.
    /// Bounds enforced: HID_MAX_REPORTS, HID_MAX_FIELDS_PER_REPORT, HID_REPORT_DESC_MAX_BYTES.
    ///
    /// ```rust
    /// pub struct HidParser {
    ///     /// Parsed report descriptors, indexed by report ID.
    ///     pub reports: ArrayVec<HidReport, HID_MAX_REPORTS>,
    /// }
    /// pub struct HidReport {
    ///     pub report_id: u8,
    ///     pub report_type: HidReportType, // Input, Output, Feature
    ///     pub fields: ArrayVec<HidField, HID_MAX_FIELDS_PER_REPORT>,
    ///     pub total_bits: u32,
    /// }
    /// pub struct HidField {
    ///     pub usage_page: u16,
    ///     pub usage_min: u16,
    ///     pub usage_max: u16,
    ///     pub logical_min: i32,
    ///     pub logical_max: i32,
    ///     pub bit_offset: u32,
    ///     pub bit_size: u32,
    ///     pub count: u32,
    ///     pub flags: u32, // Variable, Array, Absolute, Wrap, etc.
    /// }
    /// ```
    pub parser: HidParser,
}

impl I2cHidDevice {
    /// Probe an I2C-HID device. Called when ACPI reports `PNP0C50` (I2C-HID).
    pub fn probe(i2c: I2cDevice, irq_gpio: GpioLine) -> Result<Self, ProbeError> {
        // Read descriptor from register 0x0001.
        let mut desc_buf = [0u8; 30];
        // SAFETY: bus_ops and bus_ctx come from the bus controller at probe time.
        unsafe {
            ((*i2c.bus_ops).transfer)(
                i2c.bus_ctx, i2c.addr, [0x01, 0x00].as_ptr(), 2,
                desc_buf.as_mut_ptr(), desc_buf.len(),
            )
        }.into_result()?;
        // SAFETY: I2cHidDescriptor is #[repr(C, packed)] with all u16/u8 fields,
        // matching the 30-byte wire format. read_unaligned is required because the
        // I2C transfer buffer may not be 2-byte aligned.
        let desc: I2cHidDescriptor = unsafe { core::ptr::read_unaligned(desc_buf.as_ptr() as *const _) };

        // Read HID report descriptor.
        let mut report_desc = vec![0u8; desc.report_desc_len as usize];
        let reg_bytes = desc.report_desc_reg.to_le_bytes();
        // SAFETY: bus_ops and bus_ctx validated at probe time; reg_bytes is stack-local.
        unsafe {
            ((*i2c.bus_ops).transfer)(
                i2c.bus_ctx, i2c.addr, reg_bytes.as_ptr(), reg_bytes.len(),
                report_desc.as_mut_ptr(), report_desc.len(),
            )
        }.into_result()?;

        // Parse HID report descriptor to build parser.
        let parser = HidParser::parse(&report_desc)?;

        // Register interrupt handler via the GPIO controller
        // ([Section 13.10](#gpio-and-pin-control)). `request_irq()` returns a
        // `GpioIrqHandle` that deregisters the handler on drop.
        let gpio_ctrl = gpio_controller_for_line(irq_gpio)?;
        let irq_handle = gpio_ctrl.request_irq(
            irq_gpio,
            IrqMode::FallingEdge,
            Self::handle_interrupt_dispatch,
        )?;

        let report_buf = UnsafeCell::new(vec![0u8; desc.max_input_len as usize].into_boxed_slice());
        let report_desc = report_desc.into_boxed_slice();
        // The caller pins the returned device and inserts its address into the
        // I2C_HID_DEVICES registry; until then, handle_interrupt_dispatch finds
        // no entry for this GPIO line and ignores early interrupts.
        Ok(Self { i2c, desc, irq_gpio, irq_handle, report_desc, report_buf, parser })
    }

    /// Dispatch function matching `GpioIrqHandler` signature `fn(GpioLine, IrqMode)`.
    /// Looks up the I2cHidDevice instance from a module-level registry keyed
    /// by GPIO line, then calls `handle_interrupt()` on the found instance.
    /// The registry is populated in `probe()` and cleared on drop.
    fn handle_interrupt_dispatch(line: GpioLine, _mode: IrqMode) {
        // Module-level device registry: SpinLock<ArrayVec<(GpioLine, *const I2cHidDevice), 8>>
        // keyed by GPIO line number.
        //
        // Lock protects ONLY the lookup. The I2C transfer (50-500us at 400 kHz)
        // runs AFTER the lock is released. Holding a SpinLock during an I2C
        // transfer would block all other interrupts on this CPU and risk
        // deadlock if the I2C controller shares an interrupt line.
        let dev_ptr = {
            let guard = I2C_HID_DEVICES.lock();
            guard.iter().find(|(l, _)| *l == line).map(|(_, p)| *p)
        }; // SpinLock released here
        if let Some(dev_ptr) = dev_ptr {
            // SAFETY: dev_ptr is valid for the lifetime of the I2cHidDevice
            // (deregistered in Drop before the device is freed).
            // Aliasing: UnsafeCell<Box<[u8]>>::get() returns *mut Box<[u8]>.
            // Exclusive access guaranteed: GPIO interrupt handlers for a given
            // line are serialized by the interrupt controller hardware — the IRQ
            // is masked during handler execution and only unmasked after return
            // (per the two-phase interrupt model). Therefore, only one `&mut`
            // reference to this report_buf exists at any time.
            let dev = unsafe { &*dev_ptr };
            let report_buf = unsafe { &mut *dev.report_buf.get() };
            Self::handle_interrupt(&dev.i2c, &dev.desc, &dev.parser, report_buf);
        }
    }

    /// Interrupt handler: read HID report, parse, deliver events.
    fn handle_interrupt(i2c: &I2cDevice, desc: &I2cHidDescriptor, parser: &HidParser,
                        report_buf: &mut [u8]) {
        // Pre-allocated in probe() — interrupt handlers must not perform heap allocation.
        let reg_bytes = desc.input_reg.to_le_bytes();
        // SAFETY: bus_ops and bus_ctx validated at probe time; reg_bytes is stack-local.
        let result = unsafe {
            ((*i2c.bus_ops).transfer)(
                i2c.bus_ctx, i2c.addr, reg_bytes.as_ptr(), reg_bytes.len(),
                report_buf.as_mut_ptr(), report_buf.len(),
            )
        };
        if result != I2cResult::Ok {
            return; // Ignore read errors (spurious interrupt or device glitch).
        }

        // Parse HID report → InputEvent structs.
        let events = parser.parse_input_report(report_buf);
        for event in events {
            umka_input::post_event(event); // Write to input subsystem ring buffer (Section 21.2).
        }
    }
}

13.13.4 Precision Touchpad (PTP)

Windows Precision Touchpad devices use HID Usage Page 0x0D (Digitizers), Usage 0x05 (Touch Pad). The HID report contains: - Contact count: Number of active touches (0-10+). - Per-contact data: X/Y position (absolute, in logical units), contact width/height, pressure, contact ID. - Button state: Physical button click (if present), pad click (tap-to-click handled in userspace).

// umka-core/src/hid/touchpad.rs

/// Parsed Precision Touchpad report.
pub struct PtpReport {
    /// Number of active contacts.
    pub contact_count: u8,
    /// Per-contact data (up to 10 simultaneous touches).
    pub contacts: [PtpContact; 10],
    /// Button state (bit 0 = left button, bit 1 = right button).
    pub buttons: u8,
}

/// Single touch contact on a Precision Touchpad.
#[derive(Clone, Copy)]
pub struct PtpContact {
    /// Contact ID (persistent across reports while finger is down).
    pub id: u8,
    /// Tip switch (1 = finger down, 0 = finger lifted).
    pub tip: bool,
    /// X position (logical units, 0 = left edge).
    pub x: u16,
    /// Y position (logical units, 0 = top edge).
    pub y: u16,
    /// Width (logical units, or 0 if not reported).
    pub width: u16,
    /// Height (logical units, or 0 if not reported).
    pub height: u16,
}

Gesture recognition: Kernel delivers raw multi-touch HID reports via the input ring buffer. Gesture recognition (palm rejection, tap-to-click, multi-finger swipes) is handled by a userspace input library (libinput or equivalent).
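The PTP structs above can be consumed as sketched below. The `reported()` and `touching()` helpers are hypothetical (not part of the PTP spec or the UmkaOS API); the struct definitions are repeated locally so the sketch is self-contained:

```rust
/// Mirrors the doc's PtpContact (Section 13.13.4).
#[derive(Clone, Copy, Default)]
pub struct PtpContact {
    pub id: u8,
    pub tip: bool,
    pub x: u16,
    pub y: u16,
    pub width: u16,
    pub height: u16,
}

/// Mirrors the doc's PtpReport.
pub struct PtpReport {
    pub contact_count: u8,
    pub contacts: [PtpContact; 10],
    pub buttons: u8,
}

impl PtpReport {
    /// Contacts present in this frame (bounded by contact_count, capped at 10).
    /// Lift-off entries (tip == false) are included: userspace needs them to
    /// retire the contact ID.
    pub fn reported(&self) -> &[PtpContact] {
        let n = (self.contact_count as usize).min(self.contacts.len());
        &self.contacts[..n]
    }

    /// Only the contacts whose tip switch is down (finger on the pad).
    pub fn touching<'a>(&'a self) -> impl Iterator<Item = &'a PtpContact> + 'a {
        self.reported().iter().filter(|c| c.tip)
    }
}
```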


13.14 Bluetooth HCI Driver

Interface contract: Section 13.2 (WirelessDriver trait covers 802.11; BT HCI uses a separate HCI socket interface exposed via umka-sysapi). Tier decision: Tier 2 for the BT stack (control path not latency-sensitive), Tier 1 for the kernel HCI transport driver.

Stack Decision: BlueZ-compatible via umka-sysapi HCI socket interface — UmkaOS provides a kernel HCI (Host Controller Interface) driver that exposes the standard Linux AF_BLUETOOTH socket interface (socket(AF_BLUETOOTH, SOCK_RAW, BTPROTO_HCI)) via umka-sysapi. The /dev/hciN character device interface is deprecated in modern Linux (removed in BlueZ 5.x); UmkaOS uses AF_BLUETOOTH HCI sockets exclusively. The BlueZ userspace daemon (bluetoothd) runs in Tier 2, implementing L2CAP, SDP, RFCOMM, A2DP, HID, and pairing logic. This approach: - Reuses the mature BlueZ stack (~200K lines, 15+ years of protocol compatibility testing). - Avoids the multi-year effort of a clean-room Bluetooth stack. - Maintains compatibility with existing Bluetooth management tools (bluez-utils, bluetoothctl).

13.14.1 Kernel HCI Driver (Tier 1)

The HCI driver is Tier 1 (MPK-isolated) and handles raw HCI packet transport. Common transports: - USB HCI: Bulk endpoints (ACL data), interrupt endpoint (events), control endpoint (commands). Most common on laptops (Intel, Realtek, Qualcomm combo modules). - UART HCI: Serial port (ttyS, ttyUSB) with H4/H5/BCSP framing. Common on ARM SoCs (RPi, embedded).

// umka-core/src/bluetooth/hci.rs

/// HCI packet type.
#[repr(u8)]
pub enum HciPacketType {
    /// HCI command (host → controller).
    Command = 0x01,
    /// ACL data (bidirectional, L2CAP payload).
    AclData = 0x02,
    /// SCO data (bidirectional, voice payload).
    ScoData = 0x03,
    /// HCI event (controller → host).
    Event = 0x04,
    /// ISO data (bidirectional, isochronous channels — LE Audio / LC3 / Auracast).
    /// Added in Bluetooth Core Specification v5.2 (2019). Required for LE Audio
    /// transport. BlueZ 5.66+ assumes ISO support.
    IsoData = 0x05,
}

/// HCI device handle (opaque to userspace).
#[repr(C)]
pub struct HciDeviceId(u32);
const_assert!(size_of::<HciDeviceId>() == 4);

/// HCI command packet (max 259 bytes: 1 byte type + 2 bytes opcode + 1 byte len + 255 bytes data).
///
/// **Validation**: On every ring buffer dequeue (before sending to the controller),
/// the HCI driver validates:
/// - `packet_type == 0x01` (HCI_COMMAND_PKT), else drop with `EINVAL`
/// - `param_len <= 255` (guaranteed by u8, but explicitly checked after ring read
///   to guard against ring corruption)
/// - For known opcodes: `param_len` matches the expected parameter length from
///   the HCI specification (Bluetooth Core 5.4, Vol 4 Part E). Unknown opcodes
///   accept any `param_len` (vendor-specific commands).
/// - Total packet size `4 + param_len` does not exceed the ring slot size.
#[repr(C, packed)]
pub struct HciCommand {
    /// Packet type (always 0x01 for commands).
    pub packet_type: u8,
    /// Opcode (OCF + OGF encoded as little-endian u16 on wire).
    /// Stored as `[u8; 2]` instead of `u16` because this struct is `packed`
    /// and the field sits at offset 1 (unaligned). In Rust, creating a
    /// reference to a misaligned `u16` is undefined behavior. `[u8; 2]`
    /// has alignment 1, so it is always safe to reference at any offset.
    /// Use `opcode()` helper to read as `u16`.
    pub opcode: [u8; 2],
    /// Parameter length (0-255).
    pub param_len: u8,
    /// Parameters (variable length).
    pub params: [u8; 255],
}

impl HciCommand {
    /// Read the opcode as a `u16` (little-endian, matching HCI wire order).
    pub fn opcode(&self) -> u16 {
        u16::from_le_bytes(self.opcode)
    }
    /// Extract the OGF (Opcode Group Field, upper 6 bits).
    pub fn ogf(&self) -> u8 {
        (self.opcode() >> 10) as u8
    }
    /// Extract the OCF (Opcode Command Field, lower 10 bits).
    pub fn ocf(&self) -> u16 {
        self.opcode() & 0x03FF
    }
}

// HciCommand: packed, 1(packet_type) + 2(opcode) + 1(param_len) + 255(params) = 259 bytes.
const_assert!(size_of::<HciCommand>() == 259);
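A minimal sketch of building a command with this layout, using HCI_Reset (OGF 0x03, OCF 0x0003, so opcode 0x0C03 per the Bluetooth Core specification). The `new()` constructor is hypothetical; the struct and accessors mirror the definitions above:

```rust
/// Mirrors the doc's HciCommand (Section 13.14.1).
#[repr(C, packed)]
pub struct HciCommand {
    pub packet_type: u8,
    pub opcode: [u8; 2],
    pub param_len: u8,
    pub params: [u8; 255],
}

impl HciCommand {
    /// Hypothetical constructor: encode OGF (6 bits) and OCF (10 bits) into
    /// the little-endian wire opcode, and copy the parameter bytes.
    pub fn new(ogf: u8, ocf: u16, params: &[u8]) -> Option<HciCommand> {
        if ogf > 0x3F || ocf > 0x03FF || params.len() > 255 {
            return None;
        }
        let opcode = ((ogf as u16) << 10) | ocf;
        let mut cmd = HciCommand {
            packet_type: 0x01, // HCI_COMMAND_PKT
            opcode: opcode.to_le_bytes(),
            param_len: params.len() as u8,
            params: [0u8; 255],
        };
        cmd.params[..params.len()].copy_from_slice(params);
        Some(cmd)
    }
    pub fn opcode(&self) -> u16 { u16::from_le_bytes(self.opcode) }
    pub fn ogf(&self) -> u8 { (self.opcode() >> 10) as u8 }
    pub fn ocf(&self) -> u16 { self.opcode() & 0x03FF }
}
```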

/// HCI event packet (max 258 bytes: 1 byte type + 1 byte event code + 1 byte len + 255 bytes data).
///
/// **Validation**: On every event received from the controller (ring buffer enqueue
/// to userspace), the driver validates:
/// - `packet_type == 0x04` (HCI_EVENT_PKT), else drop with `EPROTO`
/// - `param_len` does not exceed the remaining bytes in the USB/UART receive buffer.
///   A `param_len` claiming more data than actually received is truncated to the
///   actual received length, and a `BT_WARNING` tracepoint is emitted.
// All fields are u8/[u8; N] -- alignment is 1, no padding. repr(C) is
// sufficient. (HciCommand keeps `packed` defensively; since its opcode is
// stored as `[u8; 2]`, it too contains only alignment-1 fields.)
#[repr(C)]
pub struct HciEvent {
    /// Packet type (always 0x04 for events).
    pub packet_type: u8,
    /// Event code.
    pub event_code: u8,
    /// Parameter length (0-255).
    pub param_len: u8,
    /// Parameters (variable length).
    pub params: [u8; 255],
}
// HciEvent: all u8 fields, no padding. 1+1+1+255 = 258 bytes.
const_assert!(size_of::<HciEvent>() == 258);

The HCI driver exposes a ring buffer interface (Section 11.8) to the BlueZ daemon: - Command ring: BlueZ writes HciCommand structs, driver sends them to the controller via USB bulk OUT or UART TX. - Event ring: Driver receives events from the controller (USB interrupt IN or UART RX), writes HciEvent structs to the ring. - ACL TX ring: BlueZ writes ACL data packets (L2CAP frames), driver sends them to the controller. - ACL RX ring: Driver receives ACL data from the controller, writes to the ring.

13.14.2 BlueZ Daemon (Tier 2)

bluetoothd runs as a Tier 2 process. It opens an AF_BLUETOOTH HCI socket (which is backed by the HCI ring buffer interface via umka-sysapi), reads/writes HCI packets, and implements all higher-layer protocols: - L2CAP: Logical Link Control and Adaptation Protocol (multiplexing, segmentation). - SDP: Service Discovery Protocol (enumerate remote device capabilities). - RFCOMM: Serial port emulation over Bluetooth (for legacy apps). - A2DP: Advanced Audio Distribution Profile (high-quality stereo audio streaming). - AVRCP: Audio/Video Remote Control Profile (play/pause, volume control). - HID: Human Interface Device (keyboards, mice, game controllers). - HSP/HFP: Headset/Hands-Free Profiles (phone call audio).

Pairing: BlueZ userspace daemon (bluetoothd) handles pairing logic and stores the pairing database. Kernel provides HCI transport only.

13.14.3 A2DP Audio Routing to PipeWire

When a Bluetooth headset is paired and A2DP is active: 1. bluetoothd decodes the SBC/AAC/LDAC A2DP stream from ACL packets (received via the HCI ACL RX ring). 2. Writes decoded PCM samples to a PipeWire ring buffer (Section 21.4, the same ring buffers used for wired audio). 3. PipeWire mixes/routes the audio to the audio subsystem (Section 21.4, 20-user-io.md). 4. For playback, the reverse path: PipeWire writes PCM to a ring, bluetoothd encodes to SBC/AAC, sends via ACL TX.

Latency: A2DP adds ~100-200ms latency (codec encoding/decoding, BT scheduling). This is unavoidable (Bluetooth spec limitation). Gaming audio and video calls use SCO (Synchronous Connection-Oriented) links for lower latency at the cost of lower quality (64kbps, 8kHz sample rate).

13.14.4 HID Input Routing

When a Bluetooth keyboard/mouse is paired: 1. bluetoothd receives HID reports via L2CAP over ACL. 2. Translates them to standard InputEvent structs (Section 21.3, same format as USB HID). 3. Writes to the input subsystem ring buffer (Section 21.3). 4. umka-input (the input multiplexer) routes events to the active Wayland compositor or VT.

Wake-on-Bluetooth: Before S3 suspend, bluetoothd tells the HCI driver to enable "wake on HID activity" (any HID report from a paired device wakes the system). The driver programs the USB controller's PME mask or UART's RTS line to wake on RX. Pressing a key on the Bluetooth keyboard wakes the laptop.

13.14.5 Architectural Decision

Bluetooth: BlueZ-compatible via umka-sysapi HCI

Decision: Kernel HCI driver (Tier 1) exposes AF_BLUETOOTH HCI sockets. BlueZ daemon (Tier 2) implements L2CAP, A2DP, HID, pairing. Reuses mature BlueZ stack (~200K lines, 15+ years of testing) instead of multi-year clean-room effort. UmkaOS maintains Linux HCI socket ABI compatibility.


13.15 WiFi Driver

Interface contract: Section 13.2 (WirelessDriver trait, wireless_device_v1 KABI). This section specifies the Intel/Realtek/Qualcomm/MediaTek/Broadcom implementations of that contract. Tier and ring-buffer design decisions are authoritative in Section 13.2.

Tier: Tier 1 (per Section 13.2 — latency-sensitive; IOMMU-bounded firmware threat model).

Chipset coverage (minimum for launch): - Intel: AX210, AX211, AX411 (WiFi 6E) - Realtek: RTL8852AE, RTL8852BE, RTL8852CE (common in consumer laptops) - Qualcomm: QCA6390, QCA6391, WCN6855 (Snapdragon-based laptops) - MediaTek: MT7921, MT7922 (budget laptops) - Broadcom: BCM4350, BCM4352 (older MacBooks, some ThinkPads)

13.15.1 WiFi Driver Architecture

WiFi drivers implement the WirelessDriver trait defined in Section 13.2 (wireless_device_v1 KABI). Each chipset driver is Tier 1, MPK-isolated on x86-64 (Section 11.3), and communicates with umka-net via the TX/RX ring buffers specified in Section 13.2.

// umka-core/src/net/wireless.rs

/// WiFi device handle. Opaque to userspace, used for ioctl operations.
#[repr(C)]
pub struct WirelessDeviceId(u64);
const_assert!(core::mem::size_of::<WirelessDeviceId>() == 8);

/// WiFi scan result. **DEPRECATED**: Use `BssEntry` from
/// [Section 13.2](#wireless-subsystem) instead. This type is retained only for
/// mac80211-level driver diagnostics where the full cfg80211 BssEntry
/// is not available. New code should use `BssEntry`.
#[repr(C)]
pub struct WifiScanResult {
    /// BSSID (MAC address of the AP).
    pub bssid: [u8; 6],
    /// SSID length (0-32 bytes).
    pub ssid_len: u8,
    /// SSID (variable length, up to 32 bytes; remaining bytes are zero).
    pub ssid: [u8; 32],
    /// RSSI (signal strength in dBm, typically -100 to 0).
    pub rssi: i8,
    /// Channel number (1-14 for 2.4 GHz, 36-165 for 5 GHz, 1-233 for 6 GHz).
    pub channel: u16,
    /// Explicit padding (offset 42, security needs u32 alignment at offset 44).
    pub _pad0: [u8; 2],
    /// Security type (bitmask: WPA2=0x1, WPA3=0x2, Enterprise=0x4).
    pub security: u32,
    /// BSS load (0-255, indicates AP congestion; 255 = unknown).
    pub bss_load: u8,
    /// Explicit trailing padding (struct alignment = 4 from u32 field).
    pub _pad1: [u8; 3],
}
// WifiScanResult: bssid(6) + ssid_len(1) + ssid(32) + rssi(1) + channel(2) +
//   _pad0(2) + security(4) + bss_load(1) + _pad1(3) = 52 bytes.
const_assert!(core::mem::size_of::<WifiScanResult>() == 52);
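SSIDs are length-prefixed (not NUL-terminated) in the fixed 32-byte buffer, and the 802.11 spec allows arbitrary bytes in an SSID. A sketch of the resulting access pattern; the `ssid_bytes`/`ssid_display` helpers are hypothetical, and only the two relevant fields of `WifiScanResult` are reproduced here:

```rust
/// Pared-down view of the doc's WifiScanResult (SSID fields only).
pub struct WifiScanResult {
    pub ssid_len: u8,
    pub ssid: [u8; 32],
}

impl WifiScanResult {
    /// The SSID as raw bytes, bounded by ssid_len (clamped to the buffer).
    pub fn ssid_bytes(&self) -> &[u8] {
        let n = (self.ssid_len as usize).min(self.ssid.len());
        &self.ssid[..n]
    }

    /// For display only: SSIDs are arbitrary bytes, so the conversion
    /// must be lossy rather than assuming valid UTF-8.
    pub fn ssid_display(&self) -> String {
        String::from_utf8_lossy(self.ssid_bytes()).into_owned()
    }
}
```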

/// WiFi connection parameters. **DEPRECATED**: Use `ConnectParams` from
/// [Section 13.2](#wireless-subsystem) instead. This type exists as a driver-level
/// simplified view; the cfg80211 `ConnectParams` is the canonical type.
#[repr(C)]
pub struct WifiConnectParams {
    /// SSID length (1-32 bytes).
    pub ssid_len: u8,
    /// SSID.
    pub ssid: [u8; 32],
    /// BSSID (all zeros = any BSSID; specific BSSID = forced roam to that AP).
    pub bssid: [u8; 6],
    /// Explicit padding: bssid ends at offset 39; security (u32) needs alignment 4.
    /// 39 % 4 = 3, so 1 byte padding. Per CLAUDE.md rule 11.
    pub _pad0: u8,
    /// Security type (WPA2=0x1, WPA3=0x2, Enterprise=0x4).
    pub security: u32,
    /// PSK (pre-shared key) length in bytes (0 for open networks).
    pub psk_len: u8,
    /// PSK (for WPA2-PSK / WPA3-SAE personal).
    pub psk: [u8; 64],
    /// Explicit padding: psk ends at offset 109; eap (contains u64, alignment 8).
    /// 109 % 8 = 5, so 3 bytes padding. Per CLAUDE.md rule 11.
    pub _pad1: [u8; 3],
    /// 802.1X parameters (for enterprise; zero if not used).
    pub eap: Eap8021xParams,
}
// WifiConnectParams: ssid_len(1) + ssid(32) + bssid(6) + pad(1) + security(4) +
//   psk_len(1) + psk(64) + pad(3) + eap(Eap8021xParams=280) = 392 bytes.
const_assert!(core::mem::size_of::<WifiConnectParams>() == 392);

/// 802.1X / EAP parameters for enterprise WiFi.
#[repr(C)]
pub struct Eap8021xParams {
    /// EAP method (0=none, 1=PEAP, 2=TTLS, 3=TLS).
    pub method: u8,
    /// Identity length.
    pub identity_len: u8,
    /// Identity (username).
    pub identity: [u8; 128],
    /// Password length (0 for certificate-based).
    pub password_len: u8,
    /// Password (for PEAP/TTLS).
    pub password: [u8; 128],
    /// Explicit padding: password ends at offset 259; ca_cert (u64) needs alignment 8.
    /// 259 % 8 = 3, so 5 bytes padding. Per CLAUDE.md rule 11.
    pub _pad_align: [u8; 5],
    /// CA certificate handle (for TLS verification; 0 = no pinning).
    pub ca_cert: u64,
    /// Explicit reserved space for future EAP extensions. Content without this
    /// field = 272 bytes, already 8-byte aligned (272 % 8 = 0). 272 + 8 = 280.
    /// Per CLAUDE.md rule 11 — all padding explicit.
    pub _pad_tail: [u8; 8],
}
// Eap8021xParams layout (all padding explicit):
//   method(u8=1) + identity_len(u8=1) + identity([u8;128]=128) +
//   password_len(u8=1) + password([u8;128]=128) + _pad_align([u8;5]=5) +
//   ca_cert(u64=8) + _pad_tail([u8;8]=8) = 280 bytes.
//   Struct alignment = 8 (from ca_cert: u64). 280 % 8 = 0. No implicit padding.
const_assert!(core::mem::size_of::<Eap8021xParams>() == 280);

/// WiFi power save mode. **DEPRECATED**: Use `WirelessPowerSave` from
/// [Section 13.2](#wireless-subsystem) instead. This enum is a duplicate; the
/// cfg80211-level `WirelessPowerSave` is the canonical definition.
#[repr(u32)]
pub enum WifiPowerSaveMode {
    /// No power save (CAM - Constantly Awake Mode). Lowest latency, highest power.
    Disabled = 0,
    /// 802.11 Power Save Mode (PSM). Sleep between beacons, wake for DTIM.
    Enabled = 1,
    /// Aggressive power save (skip DTIMs, rely on TIM). Highest battery savings.
    Aggressive = 2,
}

/// WiFi connection state.
#[repr(u32)]
pub enum WifiState {
    /// Not connected, not scanning.
    Idle = 0,
    /// Scanning for networks.
    Scanning = 1,
    /// Authenticating with AP (4-way handshake in progress).
    Authenticating = 2,
    /// Connected, link up.
    Connected = 3,
    /// Disconnecting (deauth sent, waiting for confirmation).
    Disconnecting = 4,
}

/// WiFi statistics.
#[repr(C)]
pub struct WifiStats {
    /// Current state.
    pub state: WifiState,
    /// Connected SSID length (0 if not connected).
    pub ssid_len: u8,
    /// Connected SSID.
    pub ssid: [u8; 32],
    /// Connected BSSID (all zeros if not connected).
    pub bssid: [u8; 6],
    /// Explicit padding (offset 43 is odd, channel needs u16 alignment).
    pub _pad0: u8,
    /// Current channel.
    pub channel: u16,
    /// RSSI (dBm).
    pub rssi: i8,
    /// Explicit padding (offset 47 is odd, link_speed_mbps needs u16 alignment).
    pub _pad1: u8,
    /// Link speed (Mbps).
    pub link_speed_mbps: u16,
    /// Explicit padding (offset 50, tx_packets needs u64 alignment at offset 56).
    pub _pad2: [u8; 6],
    /// TX packets.
    pub tx_packets: u64,
    /// RX packets.
    pub rx_packets: u64,
    /// TX bytes.
    pub tx_bytes: u64,
    /// RX bytes.
    pub rx_bytes: u64,
    /// TX errors (failed transmissions). Internal u64 counter; converted to
    /// u32 for nl80211 wire format (`NL80211_STA_INFO_TX_FAILED` is `__u32`).
    /// At 100K errors/sec (saturated error path), u64 wraps in ~5.8 billion years.
    pub tx_errors: u64,
    /// RX errors (FCS errors, drops). Internal u64 counter; converted to
    /// u32 for nl80211 wire format (`NL80211_STA_INFO_RX_DROP_MISC` is `__u32`).
    /// Same longevity as tx_errors.
    pub rx_errors: u64,
}
// WifiStats: state(4) + ssid_len(1) + ssid(32) + bssid(6) + _pad0(1) + channel(2)
//   + rssi(1) + _pad1(1) + link_speed_mbps(2) + _pad2(6) + tx_packets(8) +
//   rx_packets(8) + tx_bytes(8) + rx_bytes(8) + tx_errors(8) + rx_errors(8) = 104 bytes.
const_assert!(core::mem::size_of::<WifiStats>() == 104);
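The error-counter comments above note a u64-to-u32 narrowing for the nl80211 wire format. A one-line sketch of that conversion; the text does not say whether UmkaOS saturates or truncates, so the saturating behavior shown here is an assumption (it avoids a wrapped wire value reading as "few errors"):

```rust
/// Narrow an internal u64 counter to the nl80211 __u32 wire width
/// (e.g. NL80211_STA_INFO_TX_FAILED). Saturating conversion (assumed policy).
pub fn to_nl80211_u32(counter: u64) -> u32 {
    counter.min(u32::MAX as u64) as u32
}
```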

13.15.2 Firmware Isolation Model

WiFi firmware runs on the chip (Intel AX210's embedded ARM core, Qualcomm's dedicated DSP), not in host CPU Ring 0. The Tier 1 driver manages:

  1. Firmware upload: Firmware blobs loaded from /ukfs/firmware/wifi/<vendor>/<chip>.bin at driver probe time via umka_driver_firmware_load() KABI call (maps the blob DMA-accessible, issues chip-specific firmware load command).
  2. Control path: Commands (scan, connect, disconnect) sent via MMIO registers or command rings (chip-specific).
  3. Data path: TX/RX ring buffers (see Section 13.15.3) populated by driver, consumed/produced by firmware DMA engine.

IOMMU enforcement: The WiFi chip's DMA is bounded to: - TX ring buffer pages (read-only from chip perspective) - RX ring buffer pages (write-only from chip perspective) - Firmware upload buffer (read-only, unmapped after upload completes)

The driver cannot access arbitrary physical memory, and the firmware cannot DMA outside its assigned buffers. This matches the NVMe threat model (Section 11.4): firmware is untrusted, IOMMU is the hard boundary.

Firmware blob loading: Firmware is NOT shipped in the kernel binary (bloat, licensing). /ukfs/firmware/ is a separate partition or directory populated during install. The kernel provides umka_driver_firmware_load(device_id, "iwlwifi-ax210-v71.ucode") which: 1. Reads the file from the firmware partition (uses VFS, Tier 1 filesystem driver). 2. Allocates an IOMMU-fenced DMA buffer. 3. Copies the firmware blob to the buffer. 4. Returns a DmaBufferHandle to the driver. 5. Driver passes the handle to the chip's firmware loader.
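The five-step flow above can be sketched as follows. `umka_driver_firmware_load()` and `DmaBufferHandle` are the names used in the text, but the stub bodies here are purely illustrative (the real call performs the VFS read, IOMMU-fenced allocation, and copy inside the kernel):

```rust
/// Opaque handle to an IOMMU-fenced DMA buffer (mirrors the doc's name;
/// field made public only for this sketch).
#[derive(Debug, PartialEq)]
pub struct DmaBufferHandle(pub u64);

#[derive(Debug)]
pub enum FwError { NotFound, DmaAllocFailed }

/// Stub of the KABI call: in the real kernel this reads the blob from
/// /ukfs/firmware/, allocates an IOMMU-fenced DMA buffer, copies the blob
/// into it, and returns the handle (steps 1-4).
fn umka_driver_firmware_load(_device_id: u32, name: &str) -> Result<DmaBufferHandle, FwError> {
    if name.is_empty() { return Err(FwError::NotFound); }
    Ok(DmaBufferHandle(0x1000)) // illustrative handle value
}

/// Probe-time firmware bring-up. Step 5 (handing the handle to the chip's
/// firmware loader) is chip-specific and not shown.
pub fn load_wifi_firmware(device_id: u32) -> Result<DmaBufferHandle, FwError> {
    umka_driver_firmware_load(device_id, "iwlwifi-ax210-v71.ucode")
}
```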

13.15.3 TX/RX Ring Buffer Design

WiFi uses the same ring buffer protocol as NVMe (Section 11.8). The driver allocates two rings:

  1. TX ring: Host writes packet descriptors (packet buffer address, length, metadata). Firmware DMA engine reads descriptors, fetches packets, transmits over the air.
  2. RX ring: Firmware DMA engine writes received packet descriptors (packet buffer address, length, RSSI, channel, timestamp). Host reads descriptors, processes packets.

// umka-driver-sdk/src/wireless.rs

/// WiFi TX descriptor (64 bytes, cache-line aligned).
#[repr(C, align(64))]
pub struct WifiTxDescriptor {
    /// Physical address of packet buffer (DMA-mapped).
    pub buffer_addr: u64,
    /// Packet length in bytes (14-2304 for 802.11).
    pub length: u16,
    /// TX flags (ACK required, QoS TID, encryption).
    pub flags: u16,
    /// Sequence number (for retransmissions).
    pub seq: u16,
    /// Retry count (0 for first attempt).
    pub retries: u8,
    /// TX power (dBm, or 0xFF for default).
    pub tx_power: i8,
    /// Rate index (driver-specific rate table).
    pub rate_index: u8,
    _pad: [u8; 47],
}
// Layout: 8 + 2 + 2 + 2 + 1 + 1 + 1 + 47 = 64 bytes. align(64) satisfied.
const_assert!(size_of::<WifiTxDescriptor>() == 64);

/// WiFi RX descriptor (64 bytes, cache-line aligned).
///
/// Layout (with `#[repr(C, align(64))]`):
///   Offset 0:  buffer_addr (u64)     = 8 bytes
///   Offset 8:  length (u16)          = 2 bytes
///   Offset 10: flags (u16)           = 2 bytes
///   Offset 12: rssi (i8)             = 1 byte
///   Offset 13: noise (i8)            = 1 byte
///   Offset 14: channel (u16)         = 2 bytes
///   Offset 16: timestamp_us (u64)    = 8 bytes  (naturally aligned at 16)
///   Offset 24: _pad                  = 40 bytes
///   Total: 64 bytes (matches align(64), no implicit padding needed).
#[repr(C, align(64))]
pub struct WifiRxDescriptor {
    /// Physical address of packet buffer (firmware wrote packet here).
    pub buffer_addr: u64,
    /// Packet length in bytes.
    pub length: u16,
    /// RX flags (FCS OK, decryption OK, AMPDU).
    pub flags: u16,
    /// RSSI (dBm).
    pub rssi: i8,
    /// Noise floor (dBm).
    pub noise: i8,
    /// Channel number.
    pub channel: u16,
    /// Timestamp (hardware TSF, microseconds).
    pub timestamp_us: u64,
    _pad: [u8; 40],
}
// Layout: 8 + 2 + 2 + 1 + 1 + 2 + 8 + 40 = 64 bytes. align(64) satisfied.
const_assert!(size_of::<WifiRxDescriptor>() == 64);

Tier 1 isolation and the two-level ring design:

WiFi drivers run in Tier 1 (MPK/POE-isolated memory domain). They cannot directly access firmware-owned hardware rings (which reside in the device's DMA-accessible region, outside the driver's memory domain). Instead, the data path uses a two-level ring — the same pattern as NIC drivers in Section 11.8:

  1. KABI ring (Tier 1 domain): The driver writes WifiTxDescriptor entries to a KABI-managed ring buffer in the driver's own memory domain. This ring is allocated by the Tier 0 core and shared with the driver via the KABI DMA interface.

  2. Firmware ring (Tier 0 domain): The kernel's NIC/WiFi infrastructure (running in Tier 0 core) copies descriptors from the KABI ring to the actual firmware-accessible hardware ring and kicks the doorbell. On RX, the kernel copies from the firmware ring to the KABI ring.

This copy adds ~50-100ns per descriptor (~64 bytes, L1-resident). For WiFi (roughly 1-2 Gbps real-world throughput for a 2x2 802.11ax link), this overhead is negligible relative to air interface latency (~1-10ms). The copy ensures that a crashing Tier 1 WiFi driver cannot corrupt firmware ring state or inject arbitrary DMA addresses.

Zero-copy path (from driver's perspective): When umka-net (the Tier 1 network stack) needs to send a packet over WiFi: 1. umka-net allocates a packet buffer from the DMA-capable memory pool (Section 12.3 umka_driver_dma_alloc). 2. Writes the 802.11 frame (header + payload) to the buffer. 3. Writes a WifiTxDescriptor to the KABI TX ring. 4. Signals the kernel (KABI doorbell). 5. Kernel copies the descriptor to the firmware TX ring, kicks the hardware doorbell (MMIO write to Tier 0 domain). 6. Firmware DMA-reads the descriptor, DMA-reads the packet, transmits. 7. Firmware writes a completion entry to the firmware TX completion ring; kernel copies it to the KABI completion ring for the driver.
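Step 3 of the path above (filling a TX descriptor for a DMA-mapped frame) can be sketched as below. The descriptor layout mirrors the definition in this section; the `TX_FLAG_ACK_REQUIRED` bit assignment and the `make_tx_desc` helper are assumptions for illustration:

```rust
/// Mirrors the doc's WifiTxDescriptor (64 bytes, cache-line aligned).
#[repr(C, align(64))]
pub struct WifiTxDescriptor {
    pub buffer_addr: u64,
    pub length: u16,
    pub flags: u16,
    pub seq: u16,
    pub retries: u8,
    pub tx_power: i8,
    pub rate_index: u8,
    _pad: [u8; 47],
}

/// Assumed bit assignment (the real flag bits are chip/driver-specific).
pub const TX_FLAG_ACK_REQUIRED: u16 = 1 << 0;

/// Hypothetical helper: validate and fill a descriptor for one frame.
pub fn make_tx_desc(dma_addr: u64, len: usize, seq: u16) -> Option<WifiTxDescriptor> {
    // 802.11 frame bounds from the descriptor's field comment (14-2304 bytes).
    if !(14..=2304).contains(&len) {
        return None;
    }
    Some(WifiTxDescriptor {
        buffer_addr: dma_addr,
        length: len as u16,
        flags: TX_FLAG_ACK_REQUIRED,
        seq,
        retries: 0,        // first attempt
        tx_power: -1i8,    // bit pattern 0xFF = "default" per the field comment
        rate_index: 0,
        _pad: [0; 47],
    })
}
```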

13.15.4 Power Management

WiFi power management integrates with Section 7.4 power budgeting and Section 7.9 suspend/resume.

Power save modes: - WifiPowerSaveMode::Disabled: Driver keeps the radio in CAM (Constantly Awake Mode). Lowest latency, ~1.5W idle power. - WifiPowerSaveMode::Enabled: Driver enables 802.11 PSM. Radio sleeps between beacons, wakes for DTIM. ~300mW idle power, ~10-20ms wake latency. - WifiPowerSaveMode::Aggressive: Driver enables DTIM skipping (only wake every 3rd DTIM), beacon filtering (hardware drops beacons not containing traffic indication). ~150mW idle power, ~50-100ms wake latency.

Mode selection: Controlled by the power profile (Section 7.5): - Performance profile: Disabled - Balanced profile: Enabled - BatterySaver profile: Aggressive
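The profile-to-mode mapping above, as a sketch (`PowerProfile` is a stand-in for the Section 7.5 type, and `power_save_for` is a hypothetical helper):

```rust
/// Stand-in for the Section 7.5 power profile type.
pub enum PowerProfile { Performance, Balanced, BatterySaver }

/// Mirrors the doc's WifiPowerSaveMode.
#[derive(Debug, PartialEq)]
pub enum WifiPowerSaveMode { Disabled, Enabled, Aggressive }

/// Map the active power profile to the WiFi power save mode.
pub fn power_save_for(profile: PowerProfile) -> WifiPowerSaveMode {
    match profile {
        PowerProfile::Performance => WifiPowerSaveMode::Disabled,
        PowerProfile::Balanced => WifiPowerSaveMode::Enabled,
        PowerProfile::BatterySaver => WifiPowerSaveMode::Aggressive,
    }
}
```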

Fast wake: When the radio is in PSM and an outbound packet arrives, the driver: 1. Immediately sends a null data frame with PM=0 (telling the AP "I'm awake now"). 2. Queues the outbound packet in the TX ring. 3. The firmware buffers it until the AP acknowledges the PM=0 frame (~5-10ms). 4. Then transmits the queued packet.

13.15.5 WoWLAN (Wake-on-WLAN)

Before entering S3 suspend (Section 7.9), the driver registers wake patterns with the firmware: - Magic Packet: Wake on receiving a WoL magic packet (6 bytes of 0xFF followed by 16 repetitions of the interface MAC). - Disconnect: Wake on AP deauth/disassoc (lost connection). - GTK Rekey: Wake on WPA2/WPA3 group key rekey (maintains encryption sync). WPA3 GTK rekey offload uses variable-length KEK/KCK (see WirelessDriver::set_rekey_offload()).

The firmware remains powered (in D3hot, not D3cold) during S3. When a wake pattern matches, the firmware asserts the PCIe PME (Power Management Event) signal, waking the system. The driver's resume() callback (Section 7.9) re-establishes the connection.

Security consideration: WoWLAN patterns are capability-gated. Only processes with CAP_NET_ADMIN can configure wake patterns, preventing DoS (malicious process sets "wake on any packet" → battery drain).

13.15.6 Scan Offload

The driver supports background scanning while suspended (S0ix Modern Standby, Section 7.9): 1. Before S0ix entry, the driver programs the firmware with a scan schedule (every 30 seconds, channels 1/6/11 only, passive scan). 2. Firmware performs scans autonomously while the host CPU is in C10 (powered down). 3. If scan results differ significantly (RSSI drop >20dB, AP disappeared), firmware wakes the host via PME. 4. Driver's resume handler evaluates roaming decision.

This enables "instant reconnect" on lid open: the firmware already scanned for APs and selected the best candidate while the laptop was asleep.
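The schedule programmed in step 1 might look like the following. The struct name, layout, and the wake threshold field are illustrative assumptions; the text specifies only the policy (every 30 seconds, channels 1/6/11, passive, wake on >20dB RSSI drop):

```rust
/// Hypothetical scan-schedule descriptor handed to firmware before S0ix entry.
pub struct ScanSchedule {
    /// Scan period while the host sleeps.
    pub interval_secs: u32,
    /// Channels to scan (2.4 GHz non-overlapping set).
    pub channels: [u16; 3],
    /// Passive scan (listen for beacons, no probe requests).
    pub passive: bool,
    /// RSSI drop (dB) on the current AP that triggers a host wake via PME.
    pub wake_rssi_drop_db: u8,
}

impl ScanSchedule {
    /// The S0ix policy described in the text.
    pub fn s0ix_default() -> Self {
        ScanSchedule {
            interval_secs: 30,
            channels: [1, 6, 11],
            passive: true,
            wake_rssi_drop_db: 20,
        }
    }
}
```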

13.15.7 Roaming

When the driver detects poor link quality (RSSI < -75dBm, packet loss >5%), it triggers a roam: 1. Background scan for APs on the same SSID. 2. Select best candidate (highest RSSI, lowest BSS load). 3. Send reassociation request to the new AP. 4. If successful, TX/RX rings continue using the same buffers (no data plane disruption). 5. If failed, stay connected to the current AP, retry roam in 5 seconds.

Seamless roaming: The driver batches the last ~10 outbound packets in a shadow buffer during reassociation. If roaming succeeds, retransmits them to the new AP. If roaming fails, discards them (they're already lost). This avoids TCP connection resets during roaming.
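Step 2 of the roam flow (select best candidate: highest RSSI, lowest BSS load) can be sketched as below. `CandidateAp` is a pared-down illustrative view of `WifiScanResult`, and the tie-breaking rule (RSSI first, then load) is an assumption consistent with the text:

```rust
/// Pared-down view of WifiScanResult for roam candidate selection.
#[derive(Clone, Copy)]
pub struct CandidateAp {
    pub bssid: [u8; 6],
    pub rssi: i8,
    /// BSS load 0-255; 255 = unknown, treated as most loaded.
    pub bss_load: u8,
}

/// Pick the candidate with the highest RSSI; break ties on lowest BSS load.
pub fn best_candidate(candidates: &[CandidateAp]) -> Option<CandidateAp> {
    candidates
        .iter()
        .copied()
        .max_by_key(|c| (c.rssi, core::cmp::Reverse(c.bss_load)))
}
```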

13.15.8 Architectural Decision: WiFi Tier Classification

Decision: WiFi drivers are Tier 1 (in-kernel, isolation-domain-sandboxed).

Rationale: Tier 2 (separate process) would add ~200–500 cycles of IPC overhead per packet on the hot RX path. WiFi is latency-sensitive: video calls, SSH sessions, and cloud gaming are all affected by millisecond-scale jitter. WiFi firmware runs on-chip (AX210's embedded ARM core, Qualcomm's DSP) — not on the host CPU — so Tier 1 does not mean "trust the firmware"; IOMMU enforcement is the hard boundary, matching the NVMe threat model (Section 11.4). Tier 2 would add latency without improving isolation.


13.15.9 nl80211 — Linux Wireless Configuration Interface

nl80211 is the Linux Generic Netlink-based wireless configuration interface. Userspace tools — wpa_supplicant, iw, hostapd, iwd, NetworkManager — use nl80211 to scan for networks, configure connections, and manage access point (AP) mode. Without nl80211, WiFi is invisible to standard Linux userspace.

UmkaOS implements nl80211 in the umka-wireless module (Tier 1, inside umka-net). The module registers a Generic Netlink family named "nl80211" and translates nl80211 commands into WirelessDriver KABI calls (Section 13.2). No cfg80211 or mac80211 kernel modules are needed — UmkaOS's implementation is a direct translation layer.

13.15.9.1.1 Architecture
wpa_supplicant / iw / hostapd / NetworkManager
    │  AF_NETLINK socket, NETLINK_GENERIC, family "nl80211"
    │  NL80211_CMD_* commands, NL80211_ATTR_* attributes
umka-wireless: nl80211 Generic Netlink handler
    │  Translates nl80211 → WirelessDriver KABI calls
    │  Delivers WirelessEvent → nl80211 multicast notifications
WirelessDriver (KABI, §13.2)
    │  WirelessDriver::scan(), connect(), disconnect(), ...
WiFi chip driver (Tier 1: Intel AX210, Realtek RTL8852AE, ...)

/// nl80211 Generic Netlink family registration.
pub struct Nl80211Family {
    /// Family ID (auto-assigned at registration; user queries via
    /// `CTRL_CMD_GETFAMILY` on the "nlctrl" family).
    pub family_id: u16,
    /// Multicast groups for unsolicited event delivery.
    pub mcast_groups: [Nl80211McastGroup; 5],
}

pub enum Nl80211McastGroup {
    /// Scan notifications (scan started, scan results ready).
    Scan,
    /// Regulatory domain notifications.
    Regulatory,
    /// MLME events (auth, assoc, disassoc, connect, disconnect).
    Mlme,
    /// Vendor-specific events.
    Vendor,
    /// NAN (Neighbor Awareness Networking) events.
    Nan,
}

13.15.9.1.3 Key NL80211 Commands

The following NL80211 commands are implemented. All commands use NETLINK_GENERIC with family "nl80211". Requests carry NL80211_ATTR_IFINDEX to identify the wireless interface.
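The shape of the translation layer is a straight dispatch from command to driver call. A minimal sketch; the `Driver` trait here is a pared-down illustrative stand-in for the WirelessDriver KABI, and only three commands are shown:

```rust
/// A few nl80211 commands (illustrative subset, no wire values assigned).
pub enum Nl80211Cmd { TriggerScan, Connect, Disconnect }

/// Pared-down stand-in for the WirelessDriver KABI.
pub trait Driver {
    fn scan(&mut self) -> Result<(), i32>;
    fn connect(&mut self) -> Result<(), i32>;
    fn disconnect(&mut self) -> Result<(), i32>;
}

/// umka-wireless dispatch: translate an nl80211 command into a driver call.
/// (The real handler also parses NL80211_ATTR_* attributes; omitted here.)
pub fn dispatch(cmd: Nl80211Cmd, drv: &mut dyn Driver) -> Result<(), i32> {
    match cmd {
        Nl80211Cmd::TriggerScan => drv.scan(),
        Nl80211Cmd::Connect => drv.connect(),
        Nl80211Cmd::Disconnect => drv.disconnect(),
    }
}
```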

NL80211 Command wpa_supplicant use UmkaOS implementation
NL80211_CMD_GET_WIPHY Query hardware capabilities (bands, rates, features) WirelessDriver::capabilities() + hardware query
NL80211_CMD_GET_INTERFACE Get interface mode (station/AP/monitor) per-interface state
NL80211_CMD_SET_INTERFACE Change interface mode WirelessDriver::set_interface_type()
NL80211_CMD_TRIGGER_SCAN Start a scan (SSIDs, channels, IEs) WirelessDriver::scan()
NL80211_CMD_GET_SCAN Dump scan results (BSS list) WirelessDriver::get_scan_results()
NL80211_CMD_AUTHENTICATE Send 802.11 authentication frame WirelessDriver::authenticate()
NL80211_CMD_ASSOCIATE Send 802.11 association request WirelessDriver::associate()
NL80211_CMD_DEAUTHENTICATE Send deauthentication frame WirelessDriver::disconnect()
NL80211_CMD_DISASSOCIATE Send disassociation frame WirelessDriver::disconnect()
NL80211_CMD_CONNECT SME-controlled connect (full auth+assoc) WirelessDriver::connect()
NL80211_CMD_DISCONNECT SME-controlled disconnect WirelessDriver::disconnect()
NL80211_CMD_GET_STATION Per-station info (RSSI, TX rate, etc.) WirelessDriver::get_station()
NL80211_CMD_SET_STATION Set per-station flags WirelessDriver::change_station()
NL80211_CMD_NEW_STATION Add a new station entry (AP mode) WirelessDriver::add_station()
NL80211_CMD_DEL_STATION Remove a station entry (AP mode) WirelessDriver::del_station()
NL80211_CMD_DUMP_STATION Iterate all known stations WirelessDriver::dump_station()
NL80211_CMD_NEW_KEY Install pairwise/group key (WPA2/WPA3) WirelessDriver::add_key()
NL80211_CMD_DEL_KEY Remove an installed key WirelessDriver::del_key()
NL80211_CMD_GET_KEY Read key sequence counter WirelessDriver::get_key()
NL80211_CMD_SET_KEY Set default TX key index WirelessDriver::set_default_key() / set_default_mgmt_key()
NL80211_CMD_SET_BSS Configure AP parameters (beacon interval, DTIM, HT) WirelessDriver::set_bss_params()
NL80211_CMD_START_AP Start access point mode WirelessDriver::start_ap()
NL80211_CMD_STOP_AP Stop access point mode WirelessDriver::stop_ap()
NL80211_CMD_REGISTER_FRAME Register for specific management frames WirelessDriver::register_mgmt_frame()
NL80211_CMD_FRAME Send a management frame (probe req, auth, etc.) WirelessDriver::mgmt_tx()
NL80211_CMD_FRAME_WAIT_CANCEL Cancel pending mgmt frame TX wait WirelessDriver::mgmt_tx_cancel_wait()
NL80211_CMD_SET_POWER_SAVE Enable/disable power save mode WirelessDriver::set_power_save()
NL80211_CMD_SET_CHANNEL Set monitor channel WirelessDriver::set_channel()
NL80211_CMD_REMAIN_ON_CHANNEL Stay on a channel for off-channel TX WirelessDriver::remain_on_channel()
NL80211_CMD_CANCEL_REMAIN_ON_CHANNEL Cancel remain-on-channel session WirelessDriver::cancel_remain_on_channel()
NL80211_CMD_SET_PMKSA Add PMKSA cache entry (fast roaming) WirelessDriver::set_pmksa()
NL80211_CMD_DEL_PMKSA Remove PMKSA cache entry WirelessDriver::del_pmksa()
NL80211_CMD_FLUSH_PMKSA Flush all PMKSA entries WirelessDriver::flush_pmksa()
NL80211_CMD_SET_CQM Configure connection quality monitoring (RSSI thresholds) WirelessDriver::set_cqm_rssi_config()
NL80211_CMD_CHANNEL_SWITCH Initiate CSA (Channel Switch Announcement, AP mode) WirelessDriver::channel_switch()
NL80211_CMD_SET_REKEY_OFFLOAD Offload GTK rekeying to firmware (suspend) WirelessDriver::set_rekey_offload()
NL80211_CMD_ADD_VIRTUAL_INTERFACE Create secondary virtual interface (p2p, monitor) WirelessDriver::add_interface()
NL80211_CMD_DEL_VIRTUAL_INTERFACE Delete secondary interface WirelessDriver::del_interface()
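
As an illustration of the dispatch path this table implies, the sketch below maps a few parsed commands onto WirelessDriver calls. The reduced Nl80211Cmd enum, the stringly-typed reply, and the NullDriver stub are hypothetical simplifications for the example, not the real KABI or umka-net types:

```rust
/// A few of the commands from the table above (illustrative subset;
/// real numeric command codes come from nl80211.h).
#[allow(dead_code)]
enum Nl80211Cmd {
    TriggerScan,
    Connect,
    Disconnect,
    GetStation,
}

/// Reduced stand-in for the WirelessDriver trait surface used here.
trait WirelessDriver {
    fn scan(&self) -> Result<(), &'static str>;
    fn connect(&self) -> Result<(), &'static str>;
    fn disconnect(&self) -> Result<(), &'static str>;
    /// Returns the station RSSI in dBm.
    fn get_station(&self) -> Result<i32, &'static str>;
}

/// Translate one parsed nl80211 request into a driver call. A real
/// implementation builds a Netlink reply; here we return a string.
fn dispatch(drv: &dyn WirelessDriver, cmd: Nl80211Cmd) -> Result<String, &'static str> {
    match cmd {
        Nl80211Cmd::TriggerScan => { drv.scan()?; Ok("scan started".into()) }
        Nl80211Cmd::Connect => { drv.connect()?; Ok("connecting".into()) }
        Nl80211Cmd::Disconnect => { drv.disconnect()?; Ok("disconnected".into()) }
        Nl80211Cmd::GetStation => Ok(format!("rssi={} dBm", drv.get_station()?)),
    }
}

/// Stub driver that accepts everything (example only).
struct NullDriver;
impl WirelessDriver for NullDriver {
    fn scan(&self) -> Result<(), &'static str> { Ok(()) }
    fn connect(&self) -> Result<(), &'static str> { Ok(()) }
    fn disconnect(&self) -> Result<(), &'static str> { Ok(()) }
    fn get_station(&self) -> Result<i32, &'static str> { Ok(-55) }
}
```

Error returns from the driver propagate back through `dispatch` and would be serialized as NLMSG_ERROR replies to the requesting socket.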
13.15.9.1.4 Asynchronous Events (Multicast Notifications)

UmkaOS delivers wireless events as nl80211 multicast notifications to registered listeners (wpa_supplicant subscribes via NL80211_MCGRP_MLME).

The canonical event type for nl80211 multicast notifications is WirelessEvent defined in Section 13.2. Each WirelessEvent variant maps 1:1 to an NL80211_CMD_* multicast message. The nl80211 translation layer in umka-net reads events from the per-device event ring and serializes them as Netlink attribute messages for userspace subscribers (wpa_supplicant, iw, NetworkManager).

Variant-to-command mapping (see Section 13.2 for the full enum definition):

WirelessEvent variant nl80211 command Key attributes
ScanDone NL80211_CMD_NEW_SCAN_RESULTS NL80211_ATTR_GENERATION
AuthResult NL80211_CMD_AUTHENTICATE NL80211_ATTR_FRAME, NL80211_ATTR_STATUS_CODE
AssocResult NL80211_CMD_ASSOCIATE NL80211_ATTR_FRAME, NL80211_ATTR_STATUS_CODE, NL80211_ATTR_RESP_IE
Connected NL80211_CMD_CONNECT NL80211_ATTR_STATUS_CODE
Disconnected NL80211_CMD_DISCONNECT NL80211_ATTR_REASON_CODE, NL80211_ATTR_DISCONNECTED_BY_AP
CqmRssi NL80211_CMD_NOTIFY_CQM NL80211_ATTR_CQM
Roamed NL80211_CMD_ROAM NL80211_ATTR_MAC, NL80211_ATTR_REQ_IE, NL80211_ATTR_RESP_IE
MgmtFrame NL80211_CMD_FRAME NL80211_ATTR_FRAME, NL80211_ATTR_WIPHY_FREQ
MicFailure NL80211_CMD_MICHAEL_MIC_FAILURE NL80211_ATTR_MAC, NL80211_ATTR_KEY_TYPE
NewStation NL80211_CMD_NEW_STATION NL80211_ATTR_MAC
DelStation NL80211_CMD_DEL_STATION NL80211_ATTR_MAC
MgmtTxStatus NL80211_CMD_FRAME_TX_STATUS NL80211_ATTR_COOKIE, NL80211_ATTR_ACK
13.15.9.1.5 WiPhy and Band Information

NL80211_CMD_GET_WIPHY returns a comprehensive capabilities structure that wpa_supplicant and iw use to configure connections. Key nested attributes:

// Query response struct — NOT stored in the event ring buffer.
// Vec is acceptable (cold path: capability query ioctl).
/// Wireless phy capabilities (nested in NL80211_ATTR_WIPHY_BANDS).
pub struct Nl80211Band {
    /// Frequency range.
    pub band: Nl80211BandId,
    /// Supported channels (frequency in MHz + channel flags).
    pub channels: Vec<Nl80211Channel>,
    /// Supported TX bit rates (HT/VHT/HE MCS tables).
    pub rates: Vec<Nl80211Rate>,
    /// HT capabilities (NL80211_BAND_ATTR_HT_CAPA): MIMO streams, channel width, etc.
    pub ht_cap: Option<HtCapabilities>,
    /// VHT capabilities (NL80211_BAND_ATTR_VHT_CAPA): 80/160 MHz, MU-MIMO.
    pub vht_cap: Option<VhtCapabilities>,
    /// HE capabilities (NL80211_BAND_ATTR_IFTYPE_DATA): WiFi 6/6E rates.
    pub he_cap: Option<HeCapabilities>,
}

pub enum Nl80211BandId {
    Ghz2_4 = 0,
    Ghz5   = 1,
    Ghz60  = 2,
    Ghz6   = 3,
}

pub struct Nl80211Channel {
    /// Channel center frequency in MHz.
    pub freq_mhz:  u32,
    /// Channel flags (NL80211_FREQUENCY_ATTR_*).
    pub flags:     ChannelFlags,
    /// Maximum TX power in units of 0.1 dBm (e.g., 200 = 20.0 dBm).
    pub max_power: u32,
}

bitflags! {
    pub struct ChannelFlags: u32 {
        /// Passive scan only (no probe requests transmitted).
        const PASSIVE_SCAN    = 1 << 0;
        /// Beaconing not allowed.
        const NO_IBSS         = 1 << 1;
        /// Radar detection required (DFS channel).
        const RADAR           = 1 << 2;
        /// No HT40- operation.
        const NO_HT40_MINUS   = 1 << 3;
        /// No HT40+ operation.
        const NO_HT40_PLUS    = 1 << 4;
        /// No 80 MHz operation.
        const NO_80MHZ        = 1 << 5;
        /// No 160 MHz operation.
        const NO_160MHZ       = 1 << 6;
        /// Indoor only.
        const INDOOR_ONLY     = 1 << 7;
        /// GO concurrent: P2P GO operation permitted when a concurrent
        /// station connection exists on this channel.
        const GO_CONCURRENT   = 1 << 8;
        /// No 20 MHz operation (only available in wider modes).
        const NO_20MHZ        = 1 << 9;
        /// No HE operation.
        const NO_HE           = 1 << 10;
        /// Disabled (regulatory constraint).
        const DISABLED        = 1 << 11;
    }
}

/// A supported bit rate descriptor. Reported in NL80211_BAND_ATTR_RATES.
/// Maps to `struct nl80211_rate` in nl80211.h.
pub struct Nl80211Rate {
    /// Bit rate in units of 100 kbps (e.g., 10 = 1 Mbps, 110 = 11 Mbps,
    /// 540 = 54 Mbps for legacy 802.11a/b/g rates).
    pub bitrate: u32,
    /// Rate flags (NL80211_RATE_INFO_*).
    pub flags: Nl80211RateFlags,
}

bitflags! {
    pub struct Nl80211RateFlags: u32 {
        /// Short preamble supported (802.11b only: 2, 5.5, 11 Mbps).
        const SHORT_PREAMBLE = 1 << 0;
    }
}

HtCapabilities, VhtCapabilities, and HeCapabilities are defined in Section 13.2 and referenced here for nl80211 band capabilities.

13.15.9.1.6 Regulatory Domain

UmkaOS enforces regulatory channel restrictions via the CRDA (Central Regulatory Domain Agent) or a compiled-in regulatory database (wireless-regdb):

// Query response struct — NOT stored in the event ring buffer.
// Vec is acceptable (cold path: capability query ioctl).
/// Regulatory domain (ISO 3166-1 alpha-2 country code).
pub struct RegDomain {
    /// Country code (e.g., "US", "DE", "JP", "00" = world regulatory domain).
    pub alpha2: [u8; 2],
    /// DFS region (for radar detection requirements).
    pub dfs_region: DfsRegion,
    /// Frequency rules.
    pub rules: Vec<RegRule>,
}

pub struct RegRule {
    /// Frequency range (MHz).
    pub freq_range:  core::ops::RangeInclusive<u32>,
    /// Maximum bandwidth allowed (MHz).
    pub max_bw_mhz:  u32,
    /// Maximum EIRP (equivalent isotropically radiated power) in dBm.
    pub max_eirp_dbm: u32,
    /// Rule flags.
    pub flags:       RegRuleFlags,
}

bitflags! {
    pub struct RegRuleFlags: u32 {
        const NO_OFDM     = 1 << 0; // OFDM not allowed
        const NO_CCK      = 1 << 1; // CCK not allowed
        const NO_INDOOR   = 1 << 2; // Indoor operation prohibited
        const NO_OUTDOOR  = 1 << 3; // Outdoor operation prohibited
        const DFS         = 1 << 4; // DFS required
        const PTP_ONLY    = 1 << 5; // Point-to-point links only
        const PTMP_ONLY   = 1 << 6; // Point-to-multipoint only
        const NO_IR       = 1 << 7; // No initiating radiation (passive listen only)
        const AUTO_BW     = 1 << 11; // Auto-select bandwidth based on local conditions
        const IR_CONCURRENT = 1 << 12; // IR permitted for P2P GO with a concurrent station connection
        const NO_HT40MINUS  = 1 << 13;
        const NO_HT40PLUS   = 1 << 14;
        const NO_80MHZ      = 1 << 15;
        const NO_160MHZ     = 1 << 16;
    }
}

/// DFS (Dynamic Frequency Selection) region. Determines radar detection
/// requirements and channel availability check (CAC) durations.
/// Values match NL80211_DFS_* in nl80211.h.
#[repr(u8)]
pub enum DfsRegion {
    /// No DFS region (DFS channels unavailable).
    Unset = 0,
    /// FCC (Federal Communications Commission): US, Canada, Brazil.
    /// CAC time: 60 seconds (weather radar channels: 600 seconds).
    Fcc   = 1,
    /// ETSI (European Telecommunications Standards Institute): EU, UK, AU.
    /// CAC time: 60 seconds (weather radar channels: 600 seconds).
    Etsi  = 2,
    /// MKK (Japan — Ministry of Internal Affairs and Communications).
    /// CAC time: 60 seconds. Unique radar patterns for J52/W52/W53 channels.
    Jp    = 3,
}

Regulatory domain changes are broadcast via NL80211_CMD_REG_CHANGE multicast to the NL80211_MCGRP_REGULATORY group.
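
A sketch of the rule lookup the regulatory layer might perform before permitting active scanning on a channel. This simplifies RegRuleFlags to a plain bitmask with only NO_IR modeled, and checks the channel center frequency rather than fitting the full occupied bandwidth inside the rule's frequency range as real regdb evaluation does:

```rust
/// Matches the NO_IR bit position in RegRuleFlags.
const NO_IR: u32 = 1 << 7;

/// Simplified regulatory rule (flags reduced to a raw bitmask).
struct RegRule {
    freq_range: std::ops::RangeInclusive<u32>, // MHz
    max_bw_mhz: u32,
    flags: u32,
}

/// Returns true if a channel centered at `freq_mhz` with `bw_mhz` of
/// bandwidth may initiate radiation (active scan, beaconing) under
/// these rules.
fn channel_allows_ir(rules: &[RegRule], freq_mhz: u32, bw_mhz: u32) -> bool {
    rules.iter().any(|r| {
        r.freq_range.contains(&freq_mhz)
            && bw_mhz <= r.max_bw_mhz
            && r.flags & NO_IR == 0
    })
}
```

A channel on a NO_IR rule is still usable for passive listening; the framework only refuses TX-initiating operations on it.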

13.15.9.1.7 P2P (Wi-Fi Direct)

Wi-Fi P2P (peer-to-peer, used by Miracast/screen mirroring and Android Beam) is implemented as a virtual interface mode:

pub enum Nl80211IfType {
    Unspecified = 0,
    Adhoc       = 1,  // IBSS (independent BSS)
    Station     = 2,  // Client station (default)
    Ap          = 3,  // Access Point
    ApVlan      = 4,  // AP VLAN (virtual interface per client)
    Wds         = 5,  // WDS (4-address mode)
    Monitor     = 6,  // Monitor (receive all frames, no TX)
    MeshPoint   = 7,  // 802.11s mesh
    P2pClient   = 8,  // P2P client
    P2pGo       = 9,  // P2P Group Owner (acts as AP for P2P group)
    P2pDevice   = 10, // P2P device (for discovery; not an AP or station)
    Ocb         = 11, // Outside Context of BSS (802.11p, V2X)
    Nan         = 12, // NAN (Neighbor Awareness Networking)
}

NL80211_CMD_ADD_VIRTUAL_INTERFACE with NL80211_IFTYPE_P2P_DEVICE creates the P2P discovery interface. wpa_supplicant handles P2P negotiation in userspace; the kernel provides the management frame exchange mechanism via NL80211_CMD_FRAME / NL80211_CMD_REGISTER_FRAME.

13.15.9.1.8 Linux Compatibility
  • Same "nl80211" Generic Netlink family name
  • Same NL80211_CMD_* command codes (compatible with kernel 5.15+ nl80211.h)
  • Same NL80211_ATTR_* attribute IDs
  • Same multicast group names (scan, regulatory, mlme, vendor, nan)
  • Same NL80211_BAND_* band descriptors
  • iw(8): station and AP management works without modification
  • wpa_supplicant 2.10+: full WPA2-Personal, WPA2-Enterprise (EAP-PEAP/TTLS/TLS), WPA3-SAE
  • hostapd 2.10+: AP mode, 802.11r fast roaming, 802.11w management frame protection
  • iwd 2.x: full station mode, systemd-iwd integration
  • NetworkManager 1.44+: uses wpa_supplicant or iwd, both work
  • rfkill integration: rfkill soft-block disables the wiphy; cfg80211 tears down active connections and sends NL80211_CMD_DISCONNECT to subscribers. Userspace monitors rfkill state via /dev/rfkill (struct rfkill_event), not nl80211. There is no NL80211_CMD_RFKILL_EVENT — rfkill uses its own character device interface.

13.16 Camera and Video Capture

Tier: Tier 1. Camera drivers handle continuous frame capture from USB webcams, embedded SoC sensors, and ISP (image signal processor) pipelines. Frame delivery is latency-sensitive (dropped frames visible to users in video calls) and camera hardware performs DMA directly into frame buffers, so crash containment via Tier 1 isolation is mandatory. A compromised camera driver must not be able to read kernel memory, access other devices, or suppress the privacy indicator LED.

KABI interface name: camera_device_v1 (in interfaces/camera_device.kabi).

Design principles:

  1. Mandatory privacy enforcement — The kernel, not the driver, controls the camera indicator LED. A Tier 1 driver crash or exploit cannot suppress the indicator. Access requires CAP_CAMERA (Section 9.2).
  2. DMA-BUF-first buffer model — All frame buffers are DmaBufHandle references (Section 4.14). No kernel-side copies. Zero-copy camera→GPU and camera→encoder pipelines via shared DMA-BUF file descriptors.
  3. ISP pipeline graph — Camera ISP subdevices (sensor, CSI receiver, ISP, scaler) are connected via MediaPad/MediaLink (Section 13.7). The CameraDevice trait handles streaming and controls; the ISP topology reuses the existing media graph primitives.

Relationship to MediaDevice (Section 13.7): CameraDevice is NOT a subtype of MediaDevice. They serve different purposes: MediaDevice handles codec sessions (encode/decode with explicit start/end). CameraDevice handles continuous frame capture with sensor controls (brightness, exposure, autofocus). However, ISP blocks within a camera pipeline ARE CameraSubdevice entities connected via MediaPad/MediaLink, reusing the same topology primitives.

13.16.1 CameraDevice Trait

// umka-core/src/camera/mod.rs — authoritative camera driver contract

/// A camera capture device. Implemented by UVC (USB Video Class) drivers,
/// MIPI CSI-2 sensor drivers, and platform-specific ISP drivers.
///
/// # Lifecycle
///
/// 1. `query_formats()` / `enum_frame_sizes()` / `enum_frame_intervals()` — discover capabilities.
/// 2. `create_stream(config)` → `CameraStreamHandle` — allocate a capture stream.
/// 3. `queue_buf()` N times — fill the buffer pool.
/// 4. `start_stream()` — hardware begins DMA into queued buffers.
/// 5. `dequeue_buf()` in a loop — retrieve completed frames.
/// 6. `stop_stream()` → `destroy_stream()` — release resources.
///
/// The kernel camera framework (not the driver) manages the indicator LED
/// and capability checks. The driver never interacts with the indicator directly.
pub trait CameraDevice: Send + Sync {
    // --- Capability discovery ---

    /// Enumerate pixel formats supported by this device into a caller-supplied buffer.
    /// Returns the number of formats written, up to the length of `buf`.
    fn query_formats(
        &self,
        buf: &mut [CameraFormatDesc],
        max_count: u32,
    ) -> Result<u32, CameraError>;

    /// Enumerate supported frame sizes for a given pixel format.
    /// For USB webcams: typically discrete sizes (640x480, 1280x720, 1920x1080).
    /// For ISP outputs: typically stepwise (min/max/step).
    fn enum_frame_sizes(
        &self,
        format: CameraPixelFormat,
        buf: &mut [FrameSizeDesc],
        max_count: u32,
    ) -> Result<u32, CameraError>;

    /// Enumerate supported frame intervals (frame rates) for a given format and size.
    fn enum_frame_intervals(
        &self,
        format: CameraPixelFormat,
        width: u32,
        height: u32,
        buf: &mut [FrameInterval],
        max_count: u32,
    ) -> Result<u32, CameraError>;

    // --- Stream lifecycle ---

    /// Create a capture stream with the specified configuration. The driver
    /// validates that the hardware supports the requested format, resolution,
    /// and frame rate. Returns a stream handle used for all subsequent operations.
    ///
    /// A device may support multiple concurrent streams (e.g., main + thumbnail)
    /// if the hardware supports it. Returns `StreamsExhausted` if all hardware
    /// streams are in use.
    fn create_stream(
        &self,
        config: &CameraStreamConfig,
    ) -> Result<CameraStreamHandle, CameraError>;

    /// Start DMA on an open stream. The hardware begins writing frames into
    /// queued buffers. At least one buffer must be queued before calling this.
    ///
    /// The kernel camera framework calls `CameraIndicator::activate()` before
    /// this method returns — the driver must not interact with the indicator.
    fn start_stream(&self, handle: CameraStreamHandle) -> Result<(), CameraError>;

    /// Stop DMA on a running stream. All queued buffers are returned with
    /// `CaptureFrameFlags::LAST`. The stream remains allocated and can be
    /// restarted with `start_stream()`.
    ///
    /// The kernel camera framework calls `CameraIndicator::deactivate()` after
    /// this method completes.
    fn stop_stream(&self, handle: CameraStreamHandle) -> Result<(), CameraError>;

    /// Destroy a stream handle and release all associated hardware resources.
    /// The stream must be stopped first. All unreturned buffers are invalidated.
    fn destroy_stream(&self, handle: CameraStreamHandle) -> Result<(), CameraError>;

    // --- Buffer cycle ---

    /// Queue a DMA buffer for the hardware to write the next captured frame into.
    /// The buffer must be a valid `DmaBufHandle` of at least the frame size
    /// specified in the stream configuration.
    fn queue_buf(
        &self,
        handle: CameraStreamHandle,
        buf: DmaBufHandle,
    ) -> Result<(), CameraError>;

    /// Dequeue a completed frame. Blocks until a frame is ready or the stream
    /// stops. The returned `CapturedFrame` contains the DMA buffer, a fence
    /// that is signaled when the hardware finishes writing, a monotonically
    /// increasing sequence number, and a nanosecond timestamp.
    fn dequeue_buf(
        &self,
        handle: CameraStreamHandle,
    ) -> Result<CapturedFrame, CameraError>;

    // --- Control framework ---

    /// Enumerate hardware controls (brightness, exposure, autofocus, etc.)
    /// into a caller-supplied buffer. Returns the number of controls written.
    fn enum_controls(
        &self,
        buf: &mut [CameraControlInfo],
        max_count: u32,
    ) -> Result<u32, CameraError>;

    /// Read a control value.
    fn get_control(&self, id: CameraControlId) -> Result<i64, CameraError>;

    /// Write a control value. Returns `ControlReadOnly` for hardware-locked
    /// controls (e.g., `Privacy` reflects physical shutter state).
    fn set_control(&self, id: CameraControlId, value: i64) -> Result<(), CameraError>;

    // --- ISP topology (for devices with subdevice pipelines) ---

    /// Return the media pads on this device. Simple USB webcams have a single
    /// Source pad. ISP pipelines have multiple pads for input/output connections.
    /// Caller provides a buffer; driver fills up to `max_count` pads and returns
    /// the actual count written.
    fn pads(&self, buf: &mut [MediaPad], max_count: u32) -> Result<u32, CameraError>;

    /// Camera events (frame ready, control changed, privacy shutter toggled,
    /// device lost) are delivered via the standard KABI shared-memory event
    /// ring pair. The camera framework allocates the ring at device probe time;
    /// the driver writes CameraEvent entries; the framework polls the ring.
    /// No method needed here — the framework manages the ring lifecycle.

    // --- Power ---

    /// Suspend camera device. Stops DMA, powers down sensor. Called before
    /// platform S3/S0ix entry.
    fn suspend(&self) -> Result<(), CameraError>;

    /// Resume camera device. Re-initializes sensor, restores control state.
    fn resume(&self) -> Result<(), CameraError>;
}
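
The numbered lifecycle in the trait documentation can be exercised end-to-end with a toy model. MockCamera below is a purely illustrative queue-backed stand-in for DMA hardware (the real driver writes frames via DMA into DmaBufHandle buffers); it demonstrates the queue → start → dequeue → stop ordering the contract requires:

```rust
use std::collections::VecDeque;

// Illustrative stand-ins for kernel types defined elsewhere in this chapter.
type DmaBufHandle = u64;
struct CapturedFrame { buf: DmaBufHandle, sequence: u64 }

/// Toy camera that "captures" into whatever buffers are queued, in order.
struct MockCamera {
    queued: VecDeque<DmaBufHandle>,
    streaming: bool,
    next_seq: u64,
}

impl MockCamera {
    fn new() -> Self {
        Self { queued: VecDeque::new(), streaming: false, next_seq: 0 }
    }
    fn queue_buf(&mut self, buf: DmaBufHandle) {
        self.queued.push_back(buf);
    }
    fn start_stream(&mut self) -> Result<(), &'static str> {
        // Contract: at least one buffer must be queued before starting.
        if self.queued.is_empty() { return Err("no buffers queued"); }
        self.streaming = true;
        Ok(())
    }
    fn dequeue_buf(&mut self) -> Option<CapturedFrame> {
        if !self.streaming { return None; }
        let buf = self.queued.pop_front()?;
        let frame = CapturedFrame { buf, sequence: self.next_seq };
        self.next_seq += 1;
        Some(frame)
    }
    fn stop_stream(&mut self) {
        self.streaming = false;
    }
}
```

Note what the mock also demonstrates: `start_stream()` fails if no buffers are queued, and frames come back in FIFO queue order with monotonically increasing sequence numbers, matching the trait's documented semantics.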

13.16.2 Camera Controls

Camera controls represent tunable hardware parameters. Each control has a type, a valid range, and a default value. Controls are discovered via enum_controls() and read/written via get_control() / set_control().

/// Camera control identifier. User-class controls (0-99) correspond to V4L2
/// user controls; camera-class controls (100-299) correspond to V4L2 camera
/// controls. Values are UmkaOS-native; the V4L2 compat layer maps them
/// to/from Linux V4L2_CID_* values.
#[repr(u32)]
pub enum CameraControlId {
    // --- User class (V4L2 user controls) ---
    /// Image brightness adjustment (-128..127, 0 = no adjustment).
    Brightness          = 0,
    /// Image contrast adjustment (0..255, 128 = default).
    Contrast            = 1,
    /// Color saturation (0..255, 128 = default, 0 = grayscale).
    Saturation          = 2,
    /// Hue rotation in degrees (-180..180, 0 = no shift).
    Hue                 = 3,
    /// Gamma correction curve exponent (100 = gamma 1.0, 220 = gamma 2.2).
    Gamma               = 4,
    /// Edge enhancement level (0 = off, 255 = maximum).
    Sharpness            = 5,
    /// Backlight compensation mode (0 = off, 1 = on).
    BacklightCompensation = 6,
    /// Power line frequency for flicker avoidance.
    /// 0 = disabled, 1 = 50 Hz, 2 = 60 Hz, 3 = auto-detect.
    PowerLineFrequency   = 7,

    // --- Camera class (V4L2 camera controls) ---
    /// Automatic exposure mode. 0 = manual, 1 = auto, 2 = shutter priority,
    /// 3 = aperture priority.
    ExposureAuto         = 100,
    /// Manual exposure time in 100-microsecond units.
    ExposureAbsolute     = 101,
    /// Automatic exposure priority. 0 = constant frame rate (may under-expose
    /// in low light), 1 = variable frame rate (may reduce fps in low light).
    ExposureAutoPriority = 102,
    /// Manual focus position (device-specific units, 0 = infinity).
    FocusAbsolute        = 110,
    /// Relative focus adjustment (negative = nearer, positive = farther).
    FocusRelative        = 111,
    /// Autofocus enable. 0 = manual, 1 = continuous autofocus.
    FocusAuto            = 112,
    /// Trigger a single autofocus sweep (write-only, button-type control).
    AutoFocusStart       = 113,
    /// Horizontal pan in arc-seconds (device-specific range).
    PanAbsolute          = 120,
    /// Vertical tilt in arc-seconds.
    TiltAbsolute         = 121,
    /// Absolute zoom level (1x = 100, 2x = 200, etc.).
    ZoomAbsolute         = 122,
    /// White balance color temperature in Kelvin (2800..6500 typical).
    WhiteBalanceTemperature = 130,
    /// Automatic white balance. 0 = manual, 1 = auto.
    AutoWhiteBalance     = 131,
    /// ISO sensitivity (100, 200, 400, 800, ...). 0 = auto.
    IsoSensitivity       = 140,
    /// Auto ISO mode. 0 = manual, 1 = auto.
    IsoSensitivityAuto   = 141,
    /// Digital image stabilization. 0 = off, 1 = on.
    ImageStabilization   = 152,
    /// Physical privacy shutter state (read-only). 0 = open, 1 = closed.
    /// Reflects hardware GPIO state. Cannot be set by software.
    Privacy              = 200,
}

/// Metadata for a single camera control, returned by `enum_controls()`.
#[repr(C)]
pub struct CameraControlInfo {
    /// Which control this describes.
    pub id: CameraControlId,
    /// Data type of the control value.
    pub control_type: CameraControlType,
    /// Minimum allowed value (inclusive).
    pub min: i64,
    /// Maximum allowed value (inclusive).
    pub max: i64,
    /// Step size for value changes. For menus, step = 1.
    pub step: i64,
    /// Factory default value.
    pub default_value: i64,
    /// Behavioral flags (read-only, volatile, etc.).
    pub flags: CameraControlFlags,
    /// Explicit tail padding: flags ends at offset 44; struct alignment = 8 (from i64).
    /// 44 → 48 requires 4 bytes. Per CLAUDE.md rule 11.
    pub _pad: [u8; 4],
}
// CameraControlInfo: CameraControlId(u32=4) + CameraControlType(u32=4) + min(i64=8) +
//   max(i64=8) + step(i64=8) + default_value(i64=8) + flags(u32=4) + _pad(4) = 48.
const_assert!(size_of::<CameraControlInfo>() == 48);

/// Type of a camera control value.
#[repr(u32)]
pub enum CameraControlType {
    /// Signed integer with min/max/step range.
    Integer     = 1,
    /// Boolean (0 or 1).
    Boolean     = 2,
    /// Menu: value is an index into a driver-provided menu item list.
    Menu        = 3,
    /// Write-only trigger (e.g., AutoFocusStart). Value is ignored.
    Button      = 4,
    /// 64-bit signed integer with min/max/step range.
    Integer64   = 5,
    /// Bitmask: each bit is independently settable.
    Bitmask     = 6,
}

bitflags! {
    /// Behavioral flags for camera controls.
    // Kernel-internal, not KABI.
    #[repr(C)]
    pub struct CameraControlFlags: u32 {
        /// Control is disabled by hardware or current mode. Read returns the
        /// last value; write returns `CameraError::InvalidControl`.
        const DISABLED   = 1 << 0;
        /// Control value is determined by hardware and cannot be set.
        /// Example: `Privacy` (reflects physical shutter GPIO).
        const READ_ONLY  = 1 << 1;
        /// Control value changes asynchronously (e.g., auto-exposure adjusts
        /// the exposure value continuously). The read value may differ from
        /// the last written value.
        const VOLATILE   = 1 << 2;
        /// Control is temporarily inactive because another control overrides
        /// it (e.g., manual exposure is inactive when ExposureAuto = auto).
        const INACTIVE   = 1 << 3;
    }
}
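
As a sketch of how the framework might pre-validate a set_control() value against the limits advertised in CameraControlInfo (range check plus step alignment; flags and menu handling omitted, and ControlLimits is a simplified stand-in for the full struct):

```rust
/// Simplified subset of CameraControlInfo for validation purposes.
struct ControlLimits { min: i64, max: i64, step: i64 }

/// Reject values outside [min, max] or not aligned to the control's
/// step grid (measured from `min`, the V4L2 convention).
fn validate(limits: &ControlLimits, value: i64) -> Result<(), &'static str> {
    if value < limits.min || value > limits.max {
        return Err("out of range");
    }
    if limits.step > 1 && (value - limits.min) % limits.step != 0 {
        return Err("not step-aligned");
    }
    Ok(())
}
```

Doing this check in the framework, before the driver's `set_control()` is called, keeps malformed values out of Tier 1 driver code entirely.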

13.16.3 Pixel Formats

Pixel formats use the same fourcc encoding as Linux V4L2 for binary compatibility. The compat layer translates between CameraPixelFormat values and V4L2_PIX_FMT_* constants without conversion — they are numerically identical.

/// Camera pixel format. Values are V4L2-compatible fourcc codes computed as
/// `(a as u32) | ((b as u32) << 8) | ((c as u32) << 16) | ((d as u32) << 24)`.
#[repr(u32)]
pub enum CameraPixelFormat {
    // --- YUV packed ---
    /// YUYV 4:2:2 (16 bpp). Most common USB webcam format.
    Yuyv    = 0x5659_5559, // v4l2_fourcc('Y','U','Y','V')
    /// UYVY 4:2:2 (16 bpp).
    Uyvy    = 0x5956_5955, // v4l2_fourcc('U','Y','V','Y')
    /// YVYU 4:2:2 (16 bpp).
    Yvyu    = 0x5559_5659, // v4l2_fourcc('Y','V','Y','U')

    // --- YUV semi-planar ---
    /// NV12: Y plane + interleaved UV plane. 12 bpp. Primary GPU input format.
    Nv12    = 0x3231_564E, // v4l2_fourcc('N','V','1','2')
    /// NV21: Y plane + interleaved VU plane. 12 bpp. Legacy Android Camera API format.
    Nv21    = 0x3132_564E, // v4l2_fourcc('N','V','2','1')
    /// NV16: Y plane + interleaved UV plane. 4:2:2, 16 bpp.
    Nv16    = 0x3631_564E, // v4l2_fourcc('N','V','1','6')

    // --- YUV planar ---
    /// YU12 (I420): Y + U + V planes. 12 bpp.
    Yu12    = 0x3231_5559, // v4l2_fourcc('Y','U','1','2')
    /// YV12: Y + V + U planes. 12 bpp.
    Yv12    = 0x3231_5659, // v4l2_fourcc('Y','V','1','2')

    // --- Compressed ---
    /// Motion JPEG. Hardware-compressed on many USB webcams.
    Mjpeg   = 0x4750_4A4D, // v4l2_fourcc('M','J','P','G')
    /// H.264 (Annex B byte stream). UVC 1.5 payload type.
    H264    = 0x3436_3248, // v4l2_fourcc('H','2','6','4')
    /// H.265 / HEVC.
    Hevc    = 0x4356_4548, // v4l2_fourcc('H','E','V','C')

    // --- RGB ---
    /// 24-bit RGB (R, G, B byte order). Rare in cameras, common in test patterns.
    Rgb24   = 0x3342_4752, // v4l2_fourcc('R','G','B','3')
    /// 24-bit BGR (B, G, R byte order).
    Bgr24   = 0x3352_4742, // v4l2_fourcc('B','G','R','3')

    // --- Raw Bayer (ISP input) ---
    /// 8-bit Bayer RGGB.
    Srggb8  = 0x4247_4752, // v4l2_fourcc('R','G','G','B')
    /// 10-bit Bayer RGGB (packed: 4 pixels in 5 bytes).
    Srggb10 = 0x3031_4752, // v4l2_fourcc('R','G','1','0') — V4L2_PIX_FMT_SRGGB10
    /// 12-bit Bayer RGGB.
    Srggb12 = 0x3231_4752, // v4l2_fourcc('R','G','1','2') — V4L2_PIX_FMT_SRGGB12
    /// 8-bit Bayer GRBG.
    Sgrbg8  = 0x4742_5247, // v4l2_fourcc('G','R','B','G')
    /// 8-bit Bayer GBRG.
    Sgbrg8  = 0x4752_4247, // v4l2_fourcc('G','B','R','G')
    /// 8-bit Bayer BGGR.
    Sbggr8  = 0x3138_4142, // v4l2_fourcc('B','A','8','1') — V4L2_PIX_FMT_SBGGR8
}
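
The fourcc packing described above can be expressed as a small const helper (mirroring the v4l2_fourcc() macro; the helper name is illustrative):

```rust
/// Pack four ASCII bytes into a little-endian fourcc code, exactly as
/// v4l2_fourcc() does: first character in the low byte.
const fn fourcc(a: u8, b: u8, c: u8, d: u8) -> u32 {
    (a as u32) | ((b as u32) << 8) | ((c as u32) << 16) | ((d as u32) << 24)
}

// These reproduce the enum values above.
const YUYV: u32 = fourcc(b'Y', b'U', b'Y', b'V'); // 0x5659_5559
const NV12: u32 = fourcc(b'N', b'V', b'1', b'2'); // 0x3231_564E
const MJPG: u32 = fourcc(b'M', b'J', b'P', b'G'); // 0x4750_4A4D
```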

/// Description of a supported pixel format, returned by `query_formats()`.
#[repr(C)]
pub struct CameraFormatDesc {
    /// Pixel format fourcc code.
    pub format: CameraPixelFormat,
    /// Human-readable name (null-terminated UTF-8, max 32 bytes).
    pub description: [u8; 32],
    /// Format property flags.
    pub flags: FormatDescFlags,
}
// CameraFormatDesc: CameraPixelFormat(u32=4) + description([u8;32]=32) + flags(u32=4) = 40 bytes.
const_assert!(size_of::<CameraFormatDesc>() == 40);

bitflags! {
    /// Pixel format property flags.
    // Kernel-internal, not KABI.
    #[repr(C)]
    pub struct FormatDescFlags: u32 {
        /// Format produces compressed data (MJPEG, H.264, HEVC).
        const COMPRESSED   = 1 << 0;
        /// Format requires the ISP pipeline to convert to a displayable format.
        const RAW_SENSOR   = 1 << 1;
    }
}

/// Frame size descriptor, returned by `enum_frame_sizes()`.
/// Size: discriminant (4) + Discrete payload (4+4) = 12 bytes;
/// discriminant (4) + Stepwise payload (6×4) = 28 bytes. A repr(C, u32)
/// enum is sized for its largest variant, so the type occupies 28 bytes.
#[repr(C, u32)]
pub enum FrameSizeDesc {
    /// Device supports only specific discrete resolutions.
    Discrete {
        width: u32,
        height: u32,
    },
    /// Device supports any resolution within a range with step constraints.
    Stepwise {
        min_width: u32,
        max_width: u32,
        step_width: u32,
        min_height: u32,
        max_height: u32,
        step_height: u32,
    },
}
const_assert!(core::mem::size_of::<FrameSizeDesc>() == 28);
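
A sketch of the resolution check create_stream() validation might perform against a FrameSizeDesc (the enum is re-declared here in simplified form, without repr(C), so the example is self-contained):

```rust
/// Simplified mirror of the FrameSizeDesc enum above.
enum FrameSizeDesc {
    Discrete { width: u32, height: u32 },
    Stepwise {
        min_width: u32, max_width: u32, step_width: u32,
        min_height: u32, max_height: u32, step_height: u32,
    },
}

/// Returns true if the requested `w` x `h` resolution is supported:
/// exact match for Discrete, range + step alignment for Stepwise.
fn size_supported(desc: &FrameSizeDesc, w: u32, h: u32) -> bool {
    match desc {
        FrameSizeDesc::Discrete { width, height } => *width == w && *height == h,
        FrameSizeDesc::Stepwise {
            min_width, max_width, step_width,
            min_height, max_height, step_height,
        } => {
            w >= *min_width && w <= *max_width
                && (w - *min_width) % *step_width == 0
                && h >= *min_height && h <= *max_height
                && (h - *min_height) % *step_height == 0
        }
    }
}
```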

/// Frame interval (inverse frame rate) as a rational number.
/// Example: 1/30 = 30 fps, 1/60 = 60 fps, 1/15 = 15 fps.
#[repr(C)]
pub struct FrameInterval {
    /// Numerator (typically 1).
    pub numerator: u32,
    /// Denominator (frame rate in fps when numerator = 1).
    pub denominator: u32,
}
// FrameInterval: 2 × u32 = 8 bytes.
const_assert!(size_of::<FrameInterval>() == 8);

13.16.4 Stream Configuration and Captured Frames

/// Configuration for creating a capture stream via `create_stream()`.
#[repr(C)]
pub struct CameraStreamConfig {
    /// Desired pixel format.
    pub format: CameraPixelFormat,
    /// Frame width in pixels.
    pub width: u32,
    /// Frame height in pixels.
    pub height: u32,
    /// Desired frame interval (reciprocal of frame rate).
    pub interval: FrameInterval,
    /// Number of buffers to pre-allocate in the kernel buffer pool.
    /// Minimum: 2 (double-buffering). Recommended: 4-8 for smooth capture.
    /// The driver may adjust this upward if the hardware requires more.
    pub buffer_count: u32,
}
// CameraStreamConfig: CameraPixelFormat(u32=4) + width(u32=4) + height(u32=4) +
//   interval(FrameInterval=8) + buffer_count(u32=4) = 24 bytes.
const_assert!(size_of::<CameraStreamConfig>() == 24);

/// Opaque handle to an active capture stream, returned by `create_stream()`.
#[repr(C)]
pub struct CameraStreamHandle {
    /// Kernel-internal stream identifier. Unique per device.
    /// Short-lived (allocated by `create_stream()`, released by `destroy_stream()`).
    /// No external protocol constrains this to u32; CameraStreamHandle is a
    /// UmkaOS-native concept. u64 per project 50-year-uptime policy.
    pub handle: u64,
}
// CameraStreamHandle: u64 = 8 bytes.
const_assert!(size_of::<CameraStreamHandle>() == 8);

/// A captured frame returned by `dequeue_buf()`.
// kernel-internal, not KABI (embeds DmaFence and DmaBufHandle which are kernel types;
// the KABI wire format for captured frames is generated separately by kabi-gen).
#[repr(C)]
pub struct CapturedFrame {
    /// DMA buffer containing the frame data. The buffer contents are valid
    /// only after `ready_fence` is signaled.
    pub buf: DmaBufHandle,
    /// Fence signaled when the hardware has finished writing to the buffer.
    /// For zero-copy pipelines: pass this fence to the GPU or encoder as a
    /// wait dependency to avoid reading incomplete frame data.
    pub ready_fence: DmaFence,
    /// Monotonically increasing frame sequence number (starts at 0 for each
    /// stream). Gaps indicate dropped frames.
    pub sequence: u64,
    /// Capture timestamp: nanoseconds since boot (monotonic clock).
    /// Taken by the driver at interrupt time (frame-end interrupt or SOF).
    pub timestamp_ns: u64,
    /// Field order for interlaced capture. `Progressive` for all modern cameras.
    pub field: FieldOrder,
    /// Per-frame flags.
    pub flags: CaptureFrameFlags,
    /// Actual number of bytes written by the hardware. For compressed formats
    /// (MJPEG, H.264), this is the compressed frame size. For uncompressed
    /// formats, this equals `width * height * bytes_per_pixel`.
    pub bytes_used: u32,
}

/// Field order for interlaced video capture.
#[repr(u32)]
pub enum FieldOrder {
    /// Non-interlaced (progressive scan). All modern cameras use this.
    Progressive = 0,
    /// Top field only.
    Top         = 1,
    /// Bottom field only.
    Bottom      = 2,
    /// Both fields interleaved, top first.
    InterlacedTB = 3,
    /// Both fields interleaved, bottom first.
    InterlacedBT = 4,
}

bitflags! {
    /// Per-frame flags on a captured frame.
// kernel-internal, not KABI
    #[repr(C)]
    pub struct CaptureFrameFlags: u32 {
        /// Frame data is corrupted (DMA error, sensor overflow). The buffer
        /// is returned to the application for recycling but should not be
        /// displayed or processed.
        const ERROR    = 1 << 0;
        /// For compressed formats: this frame is a keyframe (IDR for H.264).
        const KEYFRAME = 1 << 1;
        /// This is the last frame before stream stop. No more frames will
        /// be delivered until `start_stream()` is called again.
        const LAST     = 1 << 2;
    }
}

/// Buffer state machine for the capture buffer pool.
///
/// ```text
///   queue_buf()         hardware DMA           dequeue_buf()
/// ┌──────────┐     ┌──────────────────┐     ┌──────────────┐
/// │  Free    │────▶│     Queued       │────▶│    Done      │
/// │          │     │ (waiting for HW) │     │ (frame ready)│
/// └──────────┘     └──────────────────┘     └──────────────┘
///       ▲                    │                      │
///       │                    │ DMA error            │
///       │                    ▼                      │
///       │              ┌──────────┐                 │
///       │              │  Error   │                 │
///       │              │ (corrupt)│                 │
///       │              └──────────┘                 │
///       │                    │                      │
///       └────────────────────┴──────────────────────┘
///                    (user recycles buffer)
/// ```
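The transitions above can be expressed as a small transition function. This is an illustrative sketch, not spec API: `BufState`, `BufEvent`, and `step` are hypothetical names, and the event-to-arrow mapping follows the diagram (queue_buf() moves Free→Queued, DMA completion or error moves Queued→Done/Error, and the user's dequeue/recycle returns the buffer to Free).

```rust
/// Illustrative buffer states matching the diagram above.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum BufState { Free, Queued, Done, Error }

/// Illustrative events that drive the state machine.
#[derive(Clone, Copy)]
pub enum BufEvent { QueueBuf, DmaComplete, DmaError, DequeueBuf, Recycle }

/// Returns the next state, or `None` for a protocol violation
/// (e.g., dequeuing a buffer that was never queued).
pub fn step(state: BufState, ev: BufEvent) -> Option<BufState> {
    use BufEvent::*;
    use BufState::*;
    match (state, ev) {
        (Free, QueueBuf)      => Some(Queued), // queue_buf()
        (Queued, DmaComplete) => Some(Done),   // hardware finished writing
        (Queued, DmaError)    => Some(Error),  // frame corrupt (ERROR flag set)
        (Done, DequeueBuf)    => Some(Free),   // dequeue_buf() hands frame out, buffer recycled
        (Error, Recycle)      => Some(Free),   // user recycles the corrupt buffer
        _ => None,                             // any other transition is invalid
    }
}
```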

13.16.5 ISP Pipeline Model

Camera devices with hardware ISP (image signal processors) expose a pipeline of processing blocks as CameraSubdevice entities connected via MediaPad and MediaLink (Section 13.7). The topology is discovered at probe time and exposed to userspace via the media controller compat ioctls.

A typical SoC camera pipeline:

┌──────────┐   CSI-2   ┌───────────────┐       ┌───────────┐       ┌─────────────┐
│  Sensor  │──────────▶│  CSI-2 RX     │──────▶│    ISP    │──────▶│ DMA Writer  │──▶ Memory
│ (OV5640) │   D-PHY   │ (SoC-specific)│ Bayer │ (demosaic,│  NV12 │ (frame DMA) │
│  pad[0]  │  2 lanes  │  pad[0] [1]   │       │  scale)   │       │   pad[0]    │
└──────────┘           └───────────────┘       │ pad[0][1] │       └─────────────┘
                                               └───────────┘

Each block is a CameraSubdevice with typed pads. Links between pads carry negotiated format and resolution constraints. The kernel validates the pipeline before streaming starts: format propagation must be consistent along every link.

/// A camera ISP subdevice (sensor, CSI-2 receiver, ISP block, scaler).
///
/// Subdevices are internal pipeline components that do not manage user-visible
/// buffers. Only the final DMA writer endpoint (the `CameraDevice` itself)
/// produces `CapturedFrame` output.
///
/// Subdevice pad operations follow the same model as V4L2 subdev pad ops
/// for Linux media controller compatibility.
pub trait CameraSubdevice: Send + Sync {
    /// Return the entity function (Sensor, ISP, etc.).
    fn entity_function(&self) -> CameraEntityFunction;

    /// Return the media pads on this subdevice. Caller provides buffer;
    /// driver fills up to `max_count` pads and returns actual count.
    fn pads(&self, buf: &mut [MediaPad], max_count: u32) -> Result<u32, CameraError>;

    /// Enumerate media bus codes (pixel formats) supported on a given pad.
    /// Media bus codes describe the format of data flowing between subdevices
    /// (as opposed to `CameraPixelFormat` which describes memory layout).
    fn enum_mbus_codes(
        &self,
        pad_id: PadId,
        buf: &mut [MediaBusCode],
        max_count: u32,
    ) -> Result<u32, CameraError>;

    /// Get the currently configured format on a pad.
    fn get_pad_format(
        &self,
        pad_id: PadId,
    ) -> Result<MediaBusFormat, CameraError>;

    /// Set the format on a pad. The driver may adjust the format to the
    /// nearest supported configuration. Returns the actual format applied.
    fn set_pad_format(
        &self,
        pad_id: PadId,
        format: &MediaBusFormat,
    ) -> Result<MediaBusFormat, CameraError>;

    /// Enumerate supported frame sizes on a pad for a given media bus code.
    fn enum_frame_sizes_on_pad(
        &self,
        pad_id: PadId,
        code: MediaBusCode,
        buf: &mut [FrameSizeDesc],
        max_count: u32,
    ) -> Result<u32, CameraError>;

    /// Get the crop/compose rectangle on a pad. Used for ISP crop and scaler
    /// configuration.
    fn get_selection(
        &self,
        pad_id: PadId,
        target: SelectionTarget,
    ) -> Result<Rectangle, CameraError>;

    /// Set the crop/compose rectangle on a pad.
    fn set_selection(
        &self,
        pad_id: PadId,
        target: SelectionTarget,
        rect: &Rectangle,
    ) -> Result<Rectangle, CameraError>;
}

/// ISP pipeline entity function codes. Values match Linux
/// `MEDIA_ENT_F_*` constants for media controller compatibility.
#[repr(u32)]
pub enum CameraEntityFunction {
    /// Image sensor (OmniVision, Sony, Samsung, etc.).
    Sensor          = 0x0002_0001, // MEDIA_ENT_F_CAM_SENSOR
    /// Flash controller (LED or xenon).
    Flash           = 0x0002_0002, // MEDIA_ENT_F_FLASH
    /// Lens controller (autofocus actuator).
    Lens            = 0x0002_0003, // MEDIA_ENT_F_LENS
    /// Image signal processor (demosaic, color correction, noise reduction).
    Isp             = 0x0000_4009, // MEDIA_ENT_F_PROC_VIDEO_ISP
    /// Video scaler / resizer.
    Scaler          = 0x0000_4005, // MEDIA_ENT_F_PROC_VIDEO_SCALER
    /// Statistics engine (auto-exposure, auto-white-balance histograms).
    StatisticsEngine = 0x0000_4006, // MEDIA_ENT_F_PROC_VIDEO_STATISTICS
    /// CSI-2 receiver (SoC-specific D-PHY/C-PHY interface).
    CsiReceiver     = 0x0000_5002, // MEDIA_ENT_F_VID_IF_BRIDGE
}

/// Media bus format: describes pixel data format on the wire between
/// subdevices (not the in-memory layout — that's CameraPixelFormat).
#[repr(C)]
pub struct MediaBusFormat {
    /// Media bus code (e.g., MEDIA_BUS_FMT_SRGGB10_1X10 for 10-bit raw Bayer).
    pub code: MediaBusCode,
    /// Frame width in pixels.
    pub width: u32,
    /// Frame height in pixels.
    pub height: u32,
    /// Color space (sRGB, BT.601, BT.709, RAW).
    pub colorspace: CameraColorSpace,
}
// MediaBusFormat: MediaBusCode(u32=4) + width(u32=4) + height(u32=4) + CameraColorSpace(u32=4)
//   = 16 bytes.
const_assert!(size_of::<MediaBusFormat>() == 16);

/// Media bus format code. Values match Linux `MEDIA_BUS_FMT_*`.
pub type MediaBusCode = u32;

/// Selection target for crop/compose operations on ISP pads.
#[repr(u32)]
pub enum SelectionTarget {
    /// Crop rectangle on a sink pad (what portion of the input to process).
    Crop        = 0,
    /// Compose rectangle on a source pad (where to place output within frame).
    Compose     = 1,
    /// Hardware limits: maximum crop region.
    CropBounds  = 2,
    /// Hardware limits: maximum compose region.
    ComposeBounds = 3,
}

/// Axis-aligned rectangle for crop/compose operations.
#[repr(C)]
pub struct Rectangle {
    /// Left edge offset in pixels.
    pub left: u32,
    /// Top edge offset in pixels.
    pub top: u32,
    /// Width in pixels.
    pub width: u32,
    /// Height in pixels.
    pub height: u32,
}
// Rectangle: 4 × u32 = 16 bytes.
const_assert!(size_of::<Rectangle>() == 16);

/// Camera color space, carried in media bus format negotiation.
#[repr(u32)]
pub enum CameraColorSpace {
    /// Raw sensor data (no color processing applied). Used between Sensor and ISP.
    Raw     = 0,
    /// sRGB color space. Default for consumer cameras after ISP processing.
    Srgb    = 1,
    /// BT.601 (SD video). Used by some webcams in YUYV mode.
    Bt601   = 2,
    /// BT.709 (HD video). Used for 720p+ webcam output.
    Bt709   = 3,
}

Pipeline validation rules (enforced by the kernel camera framework before start_stream() is allowed):

  1. Every link in the pipeline must have matching formats on source and sink pads. The kernel walks the graph from sensor to DMA writer and verifies that source_pad.format.code is accepted by the connected sink_pad.
  2. By default, the kernel rejects links where sink resolution exceeds source resolution along the pipeline: an ISP scaler may downscale (sensor 4032x3024 → ISP output 1920x1080), but upscaling is refused. Drivers that support hardware upscaling (e.g., Qualcomm Spectra ISP for digital zoom) must explicitly declare the capability via ISP_CAP_UPSCALE in their device capabilities, at which point the kernel allows upscaling on links involving that ISP entity. The driver validates the specific scaling ratio via set_pad_format().
  3. Raw Bayer formats (Srggb8/10/12, Sgrbg*, Sgbrg*, Sbggr*) are only valid on links between Sensor and ISP entities. After the ISP demosaics, the output must be a processed format (NV12, YUYV, etc.).
  4. Compressed formats (MJPEG, H264, HEVC) are only valid on the output pad of a hardware encoder entity or on a UVC device that delivers hardware-compressed frames directly.
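Rules 1 and 3 amount to a per-link predicate evaluated while the kernel walks the graph. The following is a minimal sketch under assumed names — `Link`, `EntityKind`, and `validate_link` are illustrative, the media bus codes in the usage are hypothetical placeholders (not real MEDIA_BUS_FMT_* values), and the real framework operates on MediaPad/MediaLink structures rather than this flattened view:

```rust
/// Illustrative entity classification for the two rules sketched here.
#[derive(Clone, Copy, PartialEq)]
pub enum EntityKind { Sensor, Isp, DmaWriter, Other }

/// A link flattened to the fields the validation rules inspect.
#[derive(Clone, Copy)]
pub struct Link {
    pub src_kind: EntityKind,
    pub sink_kind: EntityKind,
    pub src_code: u32,                // media bus code on the source pad
    pub sink_accepts: &'static [u32], // codes the sink pad accepts
    pub src_raw_bayer: bool,          // true if src_code is a raw Bayer format
}

pub fn validate_link(l: &Link) -> Result<(), &'static str> {
    // Rule 1: the sink must accept the source pad's media bus code.
    if !l.sink_accepts.contains(&l.src_code) {
        return Err("format mismatch on link");
    }
    // Rule 3: raw Bayer is only legal on Sensor -> ISP links.
    if l.src_raw_bayer
        && !(l.src_kind == EntityKind::Sensor && l.sink_kind == EntityKind::Isp)
    {
        return Err("raw Bayer after ISP");
    }
    Ok(())
}
```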

13.16.6 Privacy and Security

UmkaOS enforces camera privacy at the kernel level. Unlike Linux, where indicator LED control is driver-dependent (and can be suppressed by a compromised driver), UmkaOS separates indicator control from the camera driver.

Capability requirement: Opening a camera device (/dev/videoN) requires CAP_CAMERA (bit 89, Section 9.2). Without this capability, open() returns EPERM. Container runtimes and sandboxes can deny camera access by not granting CAP_CAMERA to the process.

/// Camera indicator LED controller. Separated from `CameraDevice` so that
/// the kernel framework — not the driver — controls the indicator.
///
/// The driver registers a `CameraIndicator` at probe time. The kernel camera
/// framework calls `activate()` / `deactivate()` around streaming operations.
/// The driver never calls these methods directly.
pub trait CameraIndicator: Send + Sync {
    /// Turn on the indicator LED. Called by the kernel before `start_stream()`
    /// returns to the application. Must not fail silently — if the LED cannot
    /// be activated, streaming must not proceed.
    fn activate(&self) -> Result<(), CameraError>;

    /// Turn off the indicator LED. Called by the kernel after `stop_stream()`
    /// completes on all streams for this device.
    fn deactivate(&self) -> Result<(), CameraError>;

    /// Returns true if the indicator is wired to the sensor clock or power
    /// line (hardware-enforced — the LED is physically impossible to suppress
    /// while the sensor is active). Returns false if the indicator is a
    /// separate GPIO that the kernel toggles in software.
    fn is_hardware_enforced(&self) -> bool;
}

Mandatory indicator protocol:

  1. Application calls start_stream().
  2. Kernel calls indicator.activate() before start_stream() returns. If activate() returns IndicatorFault, streaming is denied and start_stream() returns CameraError::IndicatorFault.
  3. Application calls stop_stream().
  4. Kernel calls indicator.deactivate() after stop_stream() completes.
  5. If a Tier 1 driver crashes during streaming: the kernel crash recovery path (Section 11.9) calls indicator.deactivate() after driver reload completes and before new streams are accepted. During the ~50-150ms recovery window, the indicator remains on (fail-safe: visible to user).
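Step 2 can be sketched as a guard in the framework's stream-start path. This is illustrative, not spec API: `Indicator` and `start_guarded` are hypothetical names, `StubLed` stands in for a software GPIO LED, and `CameraError` here is trimmed to two variants for the sketch:

```rust
/// Trimmed two-variant stand-in for the spec's CameraError (sketch only).
#[derive(Debug, PartialEq)]
pub enum CameraError { IndicatorFault, HardwareError }

pub trait Indicator {
    fn activate(&self) -> Result<(), CameraError>;
    fn deactivate(&self) -> Result<(), CameraError>;
}

/// Wraps the driver's stream-start callback (step 2 of the protocol).
/// The kernel framework, not the driver, owns the indicator.
pub fn start_guarded<I: Indicator>(
    ind: &I,
    start_stream: impl FnOnce() -> Result<(), CameraError>,
) -> Result<(), CameraError> {
    // LED first: if it cannot be activated, streaming is denied outright.
    ind.activate().map_err(|_| CameraError::IndicatorFault)?;
    match start_stream() {
        Ok(()) => Ok(()),
        Err(e) => {
            // The stream never started, so turn the LED back off (best effort).
            let _ = ind.deactivate();
            Err(e)
        }
    }
}

/// Software-GPIO LED stand-in, used only to exercise the guard.
#[derive(Default)]
pub struct StubLed(std::cell::Cell<bool>);
impl Indicator for StubLed {
    fn activate(&self) -> Result<(), CameraError> { self.0.set(true); Ok(()) }
    fn deactivate(&self) -> Result<(), CameraError> { self.0.set(false); Ok(()) }
}
impl StubLed { pub fn is_on(&self) -> bool { self.0.get() } }
```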

Hardware-enforced indicators: Some camera hardware (notably Apple FaceTime cameras and some Lenovo ThinkPad designs) wire the indicator LED to the sensor clock line. When the sensor receives a clock signal (i.e., is active), the LED is physically on. No software can suppress this. The driver reports is_hardware_enforced() == true for these devices.

Software-enforced indicators: On devices where the LED is a separate GPIO, the kernel toggles the GPIO pin directly (via the GPIO framework, Section 13.10). The GPIO phandle for the indicator is read from the device tree (led-gpios property) or ACPI (_DSD method). The driver is NOT given access to this GPIO — the kernel camera framework holds it exclusively.

Privacy shutter detection: If the device has a physical privacy shutter with a GPIO sensor, the state is exposed as CameraControlId::Privacy (read-only, boolean). When the shutter is closed, the kernel delivers CameraEvent::PrivacyChanged { closed: true } via the event ring. The driver may continue to stream when the shutter is closed (delivering black frames) — it is the application's responsibility to handle the privacy control notification.

Audit logging: Every camera lifecycle event generates an audit log entry (Section 20.1):

| Event | Audit Fields |
|-------|--------------|
| Device open | pid, uid, device_path, timestamp |
| start_stream() | pid, uid, device_path, stream_config (format, resolution, fps) |
| stop_stream() | pid, uid, device_path, frames_captured, duration_ns |
| Device close | pid, uid, device_path, total_streams_opened |

13.16.7 UVC Driver Contract

UVC (USB Video Class) is the most common camera interface for desktop and laptop webcams. A conformant UVC driver must implement the following:

USB descriptor parsing: The driver parses the UVC-specific descriptors from the USB configuration descriptor: - VideoControl (VC) interface: processing units (brightness, contrast controls), camera terminal (pan/tilt/zoom/exposure), extension units (vendor-specific). - VideoStreaming (VS) interface: format descriptors (MJPEG, uncompressed YUYV, H.264), frame descriptors (resolutions and frame intervals per format).

Control mapping: UVC controls are mapped to CameraControlId:

| UVC Control | UVC Unit | CameraControlId |
|-------------|----------|-----------------|
| Brightness | Processing Unit | Brightness |
| Contrast | Processing Unit | Contrast |
| Saturation | Processing Unit | Saturation |
| Sharpness | Processing Unit | Sharpness |
| White Balance Temperature | Processing Unit | WhiteBalanceTemperature |
| Backlight Compensation | Processing Unit | BacklightCompensation |
| Power Line Frequency | Processing Unit | PowerLineFrequency |
| Auto-Exposure Mode | Camera Terminal | ExposureAuto |
| Exposure Time (Absolute) | Camera Terminal | ExposureAbsolute |
| Auto-Exposure Priority | Camera Terminal | ExposureAutoPriority |
| Focus (Absolute) | Camera Terminal | FocusAbsolute |
| Focus Auto | Camera Terminal | FocusAuto |
| Zoom (Absolute) | Camera Terminal | ZoomAbsolute |
| Pan (Absolute) | Camera Terminal | PanAbsolute |
| Tilt (Absolute) | Camera Terminal | TiltAbsolute |
| Privacy | Camera Terminal | Privacy (read-only) |

USB transfer management: UVC uses isochronous USB transfers for streaming video (guaranteed bandwidth, bounded latency) and bulk transfers for still image capture. The driver allocates URBs (USB Request Blocks) and manages the isochronous ring buffer, extracting complete frames from the USB payload stream.

Quirk handling: Many USB cameras claim UVC compliance but deviate from the specification. The driver maintains a quirk table:

/// Maximum number of UVC quirk entries.
const UVC_MAX_QUIRKS: usize = 128;

/// A UVC device quirk override.
#[repr(C)]
pub struct UvcQuirk {
    /// USB vendor ID (0 = wildcard).
    pub vendor_id: u16,
    /// USB product ID (0 = wildcard).
    pub product_id: u16,
    /// Quirk flags to apply.
    pub flags: UvcQuirkFlags,
}
// UvcQuirk: vendor_id(u16=2) + product_id(u16=2) + flags(u32=4) = 8 bytes.
const_assert!(size_of::<UvcQuirk>() == 8);

bitflags! {
    /// UVC quirk flags for non-compliant devices.
// kernel-internal, not KABI
    #[repr(C)]
    pub struct UvcQuirkFlags: u32 {
        /// Device claims to support UVC 1.5 but only speaks UVC 1.0/1.1.
        const FORCE_UVC_1_0       = 1 << 0;
        /// Device does not properly report stream errors in the header.
        const IGNORE_STREAM_ERROR = 1 << 1;
        /// Device needs a delay between SET_INTERFACE and first URB submit.
        const PROBE_DELAY         = 1 << 2;
        /// Device reports incorrect dwMaxPayloadTransferSize; use the
        /// endpoint's wMaxPacketSize instead.
        const FIX_MAX_PAYLOAD     = 1 << 3;
        /// Device does not support GET_INFO on controls; skip capability
        /// probing and assume all advertised controls are read-write.
        const SKIP_CTRL_INFO      = 1 << 4;
    }
}
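A hypothetical sketch of quirk-table matching follows. It assumes that a 0-valued vendor or product ID acts as a wildcard (per the field comments above) and that flags from every matching entry are OR-ed together — the OR semantics, the `lookup_quirks` helper, and the example entries are illustrative, not spec-mandated:

```rust
// Flag bits follow the UvcQuirkFlags definition above.
pub const FORCE_UVC_1_0: u32 = 1 << 0;
pub const PROBE_DELAY: u32 = 1 << 2;

/// Simplified quirk entry (plain u32 flags instead of the bitflags type).
pub struct QuirkEntry {
    pub vendor_id: u16,  // 0 = wildcard
    pub product_id: u16, // 0 = wildcard
    pub flags: u32,
}

/// Collect the union of flags from every entry matching this device.
pub fn lookup_quirks(table: &[QuirkEntry], vid: u16, pid: u16) -> u32 {
    table
        .iter()
        .filter(|q| (q.vendor_id == 0 || q.vendor_id == vid)
                 && (q.product_id == 0 || q.product_id == pid))
        .fold(0, |acc, q| acc | q.flags)
}
```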

Reference: USB Implementers Forum, USB Video Class Specification 1.5, 2023.

13.16.8 MIPI CSI-2 Integration

MIPI CSI-2 (Camera Serial Interface 2) is the standard for embedded SoC cameras on ARM, RISC-V, and some x86 platforms. The camera pipeline consists of multiple hardware blocks modeled as CameraSubdevice entities.

Typical embedded camera pipeline:

Sensor (e.g., OV5640, IMX219, IMX477)
  │  MIPI CSI-2 serial link (D-PHY or C-PHY)
  │  1-4 data lanes at 0.5-4.5 Gbps per lane
CSI-2 Receiver (SoC-specific: i.MX, Qualcomm CSIPHY, TI CAL)
  │  Raw Bayer or YUV pixel data
ISP (if present: Qualcomm Spectra, i.MX ISI, ARM Mali C71)
  │  Demosaic → color correction → noise reduction → scaling
DMA Writer → System memory (page-aligned DMA-BUF)

/// MIPI CSI-2 receiver configuration, set during subdevice probe.
/// Discovered from device tree (`data-lanes`, `clock-frequency` properties)
/// or ACPI DSDT tables. Never hardcoded.
#[repr(C)]
pub struct CsiReceiverConfig {
    /// Number of data lanes (1-4 for D-PHY, 1-3 trios for C-PHY).
    pub num_lanes: u8,                      // 1 byte, offset 0
    /// Explicit padding: num_lanes ends at offset 1, lane_rate_mbps (u32)
    /// needs alignment 4. 1 % 4 = 1, need 3 bytes. CLAUDE.md rule 11.
    pub _pad0: [u8; 3],                     // 3 bytes, offset 1
    /// Per-lane data rate in megabits per second. Determines the pixel
    /// throughput: `pixel_rate = lane_rate * 2 * num_lanes / bits_per_sample`.
    pub lane_rate_mbps: u32,                // 4 bytes, offset 4
    /// PHY type.
    pub phy_type: CsiPhyType,               // 4 bytes, offset 8
    /// Virtual channel ID (0-3). Enables multiple cameras on a single CSI-2
    /// link by assigning each sensor a different virtual channel.
    pub virtual_channel: u8,                // 1 byte, offset 12
    /// Explicit tail padding: content ends at offset 13. Struct align = 4
    /// (from lane_rate_mbps: u32 and CsiPhyType: repr(u32)). 13 % 4 = 1,
    /// need 3 bytes. CLAUDE.md rule 11.
    pub _pad1: [u8; 3],                     // 3 bytes, offset 13
}
// CsiReceiverConfig layout (all padding explicit):
//   num_lanes(u8=1) + _pad0([u8;3]=3) + lane_rate_mbps(u32=4) +
//   phy_type(CsiPhyType repr(u32)=4) + virtual_channel(u8=1) +
//   _pad1([u8;3]=3) = 16 bytes. Struct align 4. 16 % 4 = 0. No implicit padding.
const_assert!(size_of::<CsiReceiverConfig>() == 16);

/// MIPI CSI-2 physical layer type.
#[repr(u32)]
pub enum CsiPhyType {
    /// D-PHY: up to 4.5 Gbps/lane (MIPI D-PHY v2.5). Most common.
    /// 1-4 differential data lanes + 1 clock lane.
    DPhy = 0,
    /// C-PHY: up to 6.0 Gsps/trio (MIPI C-PHY v2.0). Higher throughput
    /// per pin than D-PHY. 1-3 trios, no dedicated clock lane.
    CPhy = 1,
}
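The throughput formula from the `lane_rate_mbps` doc comment can be turned into a worked helper. `pixel_rate_hz` is an illustrative name, the ×2 factor simply follows the spec's formula as written, and the values in the usage are arbitrary examples:

```rust
/// pixel_rate = lane_rate * 2 * num_lanes / bits_per_sample, per the
/// CsiReceiverConfig doc comment. Input in Mbps; result in pixels/second.
pub fn pixel_rate_hz(lane_rate_mbps: u64, num_lanes: u64, bits_per_sample: u64) -> u64 {
    lane_rate_mbps * 1_000_000 * 2 * num_lanes / bits_per_sample
}
// e.g. 2 lanes at 456 Mbps carrying 10-bit raw samples -> 182,400,000 px/s.
```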

Per-SoC platform data: Each SoC's CSI-2 receiver and ISP have different register interfaces, DMA engine configurations, and pipeline topologies. These details are implemented in per-SoC Tier 1 drivers (e.g., imx_csi, qcom_camss, ti_cal). The CameraSubdevice trait provides the uniform interface; platform data comes from device tree bindings.

Multi-camera support: SoCs with multiple CSI-2 ports support multiple concurrent cameras. Each camera is a separate CameraDevice instance with its own pipeline graph. The virtual channel field in CsiReceiverConfig enables multiplexing up to 4 cameras on a single CSI-2 link (common in automotive and industrial applications).

13.16.9 V4L2 Compatibility

umka-sysapi translates V4L2 (VIDIOC_*) and Media Controller (MEDIA_IOC_*) ioctls to CameraDevice and CameraSubdevice method calls, so existing applications (GStreamer v4l2src, FFmpeg v4l2, libcamera, Chromium VideoCaptureDeviceLinux) run unmodified.

V4L2 ioctl mapping (magic 'V', 0x56):

| V4L2 ioctl | Nr | UmkaOS method |
|------------|----|---------------|
| VIDIOC_QUERYCAP | 0 | Synthetic: returns device caps, driver name, bus info |
| VIDIOC_ENUM_FMT | 2 | query_formats() |
| VIDIOC_G_FMT | 4 | Return current stream format (or default if no stream) |
| VIDIOC_S_FMT | 5 | create_stream() with format/resolution/fps from v4l2_format |
| VIDIOC_TRY_FMT | 64 | Validate format without applying (dry-run create_stream()) |
| VIDIOC_REQBUFS | 8 | Allocate capture buffer pool (DMA-BUF backed) |
| VIDIOC_QUERYBUF | 9 | Return buffer metadata (offset, length, flags) |
| VIDIOC_QBUF | 15 | queue_buf() |
| VIDIOC_DQBUF | 17 | dequeue_buf() |
| VIDIOC_EXPBUF | 16 | Export buffer as DMA-BUF file descriptor |
| VIDIOC_STREAMON | 18 | start_stream() |
| VIDIOC_STREAMOFF | 19 | stop_stream() |
| VIDIOC_QUERYCTRL | 36 | enum_controls() (single control by ID) |
| VIDIOC_G_CTRL | 27 | get_control() |
| VIDIOC_S_CTRL | 28 | set_control() |
| VIDIOC_G_EXT_CTRLS | 71 | get_control() × N (batch read) |
| VIDIOC_S_EXT_CTRLS | 72 | set_control() × N (batch write) |
| VIDIOC_ENUM_FRAMESIZES | 74 | enum_frame_sizes() |
| VIDIOC_ENUM_FRAMEINTERVALS | 75 | enum_frame_intervals() |
| VIDIOC_G_PARM | 21 | Frame rate query (from stream config) |
| VIDIOC_S_PARM | 22 | Frame rate set (via create_stream() interval) |

Buffer memory mode translation:

| V4L2 mode | Value | UmkaOS behavior |
|-----------|-------|-----------------|
| V4L2_MEMORY_DMABUF | 4 | Native: DMA-BUF file descriptor passed directly. Zero-copy. |
| V4L2_MEMORY_MMAP | 1 | Kernel allocates DMA-BUF internally, provides mmap() offset for userspace mapping. Functionally equivalent to DMABUF from the driver's perspective. |
| V4L2_MEMORY_USERPTR | 2 | Returns EINVAL. Deprecated in modern V4L2; all use cases covered by DMABUF and MMAP. |
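The memory-mode translation reduces to a small match. `MemoryMode` and `translate_memory` are illustrative names for this sketch, and -22 is the conventional Linux EINVAL value:

```rust
/// Illustrative internal buffer-backing mode (both end up as DMA-BUF).
#[derive(Debug, PartialEq)]
pub enum MemoryMode { DmaBuf, Mmap }

/// Translate a v4l2_requestbuffers.memory value; unknown or deprecated
/// modes (including USERPTR = 2) fail with -EINVAL.
pub fn translate_memory(v4l2_memory: u32) -> Result<MemoryMode, i32> {
    const EINVAL: i32 = 22;
    match v4l2_memory {
        1 => Ok(MemoryMode::Mmap),   // V4L2_MEMORY_MMAP: kernel-allocated DMA-BUF
        4 => Ok(MemoryMode::DmaBuf), // V4L2_MEMORY_DMABUF: caller-provided fd, zero-copy
        _ => Err(-EINVAL),           // V4L2_MEMORY_USERPTR (2) and anything unknown
    }
}
```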

Media Controller ioctl mapping (magic '|', 0x7C):

| Media ioctl | Nr | UmkaOS method |
|-------------|----|---------------|
| MEDIA_IOC_DEVICE_INFO | 0 | Return device topology metadata |
| MEDIA_IOC_ENUM_ENTITIES | 1 | Enumerate CameraSubdevice entities |
| MEDIA_IOC_ENUM_LINKS | 2 | Enumerate MediaLink connections between pads |
| MEDIA_IOC_SETUP_LINK | 3 | create_link() / destroy_link() on CameraDevice |
| MEDIA_IOC_G_TOPOLOGY | 4 | Full topology snapshot (v2 API, Linux 4.19+) |

V4L2 subdevice ioctls (for ISP pipeline configuration via /dev/v4l-subdevN):

| V4L2 subdev ioctl | UmkaOS method |
|-------------------|---------------|
| VIDIOC_SUBDEV_ENUM_MBUS_CODE | CameraSubdevice::enum_mbus_codes() |
| VIDIOC_SUBDEV_G_FMT | CameraSubdevice::get_pad_format() |
| VIDIOC_SUBDEV_S_FMT | CameraSubdevice::set_pad_format() |
| VIDIOC_SUBDEV_ENUM_FRAME_SIZE | CameraSubdevice::enum_frame_sizes_on_pad() |
| VIDIOC_SUBDEV_G_SELECTION | CameraSubdevice::get_selection() |
| VIDIOC_SUBDEV_S_SELECTION | CameraSubdevice::set_selection() |

13.16.10 Error Types and Events

/// Error type returned by CameraDevice and CameraSubdevice methods.
#[repr(C, u32)]
pub enum CameraError {
    /// The requested pixel format is not supported by this device.
    UnsupportedFormat     = 1,
    /// The requested resolution is not supported.
    UnsupportedResolution = 2,
    /// The requested frame rate is not supported.
    UnsupportedFrameRate  = 3,
    /// All hardware streams are in use. Destroy an existing stream first.
    StreamsExhausted      = 4,
    /// DMA buffer allocation failed.
    NoMemory              = 5,
    /// Hardware error (register access failed, DMA engine stalled).
    HardwareError         = 6,
    /// Operation requires an active stream but none is started.
    StreamNotStarted      = 7,
    /// `start_stream()` called on a stream that is already started.
    StreamAlreadyStarted  = 8,
    /// The specified control ID is not supported by this device.
    InvalidControl        = 9,
    /// The control is read-only (e.g., Privacy reflects hardware state).
    ControlReadOnly       = 10,
    /// The control value is outside the valid range (min..=max).
    ControlOutOfRange     = 11,
    /// The ISP pipeline has an invalid link (format mismatch, resolution
    /// increase, raw Bayer after ISP). Fix the pipeline before starting.
    PipelineInvalid       = 12,
    /// The camera indicator LED could not be activated. Streaming is denied
    /// to preserve the mandatory privacy guarantee.
    IndicatorFault        = 13,
    /// The physical privacy shutter is closed. The driver may still stream
    /// (delivering black frames), but this error is returned from
    /// `start_stream()` as a warning if the shutter GPIO reads closed.
    PrivacyShutterClosed  = 14,
    /// Operation timed out (e.g., dequeue_buf with no frame ready within
    /// the timeout period).
    Timeout               = 15,
    /// The device was disconnected or the Tier 1 driver crashed.
    DeviceLost            = 16,
}
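When these errors surface through the V4L2 compat layer they must become negative errno values. The following partial mapping is hypothetical — only the DeviceLost → ENODEV pairing is implied by this section's text; the other choices are plausible conventions, not spec-mandated, and the enum is trimmed to the variants shown:

```rust
/// Trimmed copy of CameraError for this sketch.
#[derive(Clone, Copy)]
pub enum CameraError { UnsupportedFormat, NoMemory, StreamsExhausted, Timeout, DeviceLost }

/// Map a camera error to a negative Linux-conventional errno.
pub fn to_errno(e: CameraError) -> i32 {
    const EINVAL: i32 = 22;
    const ENOMEM: i32 = 12;
    const EBUSY: i32 = 16;
    const ETIMEDOUT: i32 = 110;
    const ENODEV: i32 = 19;
    match e {
        CameraError::UnsupportedFormat => -EINVAL,
        CameraError::NoMemory => -ENOMEM,
        CameraError::StreamsExhausted => -EBUSY,
        CameraError::Timeout => -ETIMEDOUT,
        // Matches the ENODEV semantics described for DeviceLost events.
        CameraError::DeviceLost => -ENODEV,
    }
}
```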

/// Asynchronous camera events delivered via the per-device event ring.
/// Applications poll the event ring for frame completion, control changes,
/// and privacy state transitions.
///
/// Size: largest variant is FrameReady. Layout for #[repr(C, u32)]:
///   discriminant (u32 = 4 bytes, offset 0)
///   + implicit alignment padding (4 bytes, offset 4, for u64 alignment)
///   + sequence (u64 = 8 bytes, offset 8)
///   + timestamp_ns (u64 = 8 bytes, offset 16)
///   = 4 + 4 + 8 + 8 = 24 bytes total.
/// const_assert!(size_of::<CameraEvent>() == 24);
#[repr(C, u32)]
pub enum CameraEvent {
    /// A captured frame is ready for dequeue.
    FrameReady {
        /// Frame sequence number.
        sequence: u64,
        /// Capture timestamp (nanoseconds since boot).
        timestamp_ns: u64,
    } = 0,
    /// An error occurred during streaming.
    StreamError {
        /// What went wrong.
        error: CameraError,
    } = 1,
    /// A hardware control value changed asynchronously (e.g., auto-exposure
    /// adjusted the exposure time, or the physical privacy shutter toggled).
    ControlChanged {
        /// Which control changed.
        id: CameraControlId,
        /// New value.
        value: i64,
    } = 2,
    /// The physical privacy shutter state changed.
    PrivacyChanged {
        /// 0 = open, 1 = closed. u8 instead of bool for cross-domain safety.
        closed: u8,
    } = 3,
    /// The device was disconnected or the driver crashed.
    ///
    /// Applications distinguish the two cases by querying the device node
    /// after receiving this event:
    /// - **Tier 1 driver crash (recoverable)**: The device node remains in
    ///   sysfs. `open()` on the device returns `EAGAIN` during the ~50-150 ms
    ///   recovery window ([Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)), then
    ///   succeeds after the driver is reloaded. The application should close
    ///   all stream handles, wait briefly, and re-open the device. The privacy
    ///   indicator LED remains on during recovery to signal that the camera
    ///   hardware is still active.
    /// - **Physical disconnect (permanent)**: The device node is removed from
    ///   sysfs. `open()` returns `ENODEV`. The application should close all
    ///   handles and release resources. A `KOBJ_REMOVE` uevent is emitted for
    ///   userspace device managers (udev).
    ///
    /// In both cases, no further `CameraEvent`s will be delivered on existing
    /// handles. The application must close all stream handles.
    DeviceLost = 4,
}
const_assert!(core::mem::size_of::<CameraEvent>() == 24);
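Given the `#[repr(C, u32)]` layout documented above (u32 discriminant at offset 0, four bytes of alignment padding, two u64s at offsets 8 and 16), a `FrameReady` record can be decoded from its raw 24-byte form. `decode_frame_ready` is an illustrative helper, and little-endian byte order is assumed:

```rust
/// Decode a FrameReady event (discriminant 0) from a raw 24-byte record,
/// returning (sequence, timestamp_ns), or None for any other discriminant.
pub fn decode_frame_ready(rec: &[u8; 24]) -> Option<(u64, u64)> {
    let disc = u32::from_le_bytes(rec[0..4].try_into().unwrap());
    if disc != 0 {
        return None; // not a FrameReady event
    }
    // Offsets 4..8 are the documented alignment padding; skip them.
    let sequence = u64::from_le_bytes(rec[8..16].try_into().unwrap());
    let timestamp_ns = u64::from_le_bytes(rec[16..24].try_into().unwrap());
    Some((sequence, timestamp_ns))
}
```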

13.16.11 Cross-References

  • Section 4.14 (DMA Subsystem): DmaBufHandle, DmaDevice, CoherentDmaBuf, StreamingDmaMap — all frame buffer DMA operations.
  • Section 9.2 (SystemCaps): CAP_CAMERA (bit 89) gates camera device access.
  • Section 11.1 (Driver Isolation): Tier 1 classification — camera drivers run in Ring 0 with MPK/POE/DACR isolation.
  • Section 11.9 (Crash Recovery): Tier 1 driver reload (~50-150ms). Privacy indicator remains on during recovery.
  • Section 12.6 (KABI Transport): camera_device_v1 uses T1 ring buffer transport for Tier 1 drivers.
  • Section 13.7 (Media Pipeline): MediaPad, MediaLink, DmaFence — reused for ISP topology and frame synchronization.
  • Section 19.1 (Syscall Interface): V4L2 ioctl translation in umka-sysapi.
  • Section 20.1 (Fault Management): Camera audit log entries.
  • Section 22.1 (Accelerators): Camera → GPU zero-copy pipeline via shared DmaBufHandle.

13.17 Printers and Scanners

Printer and scanner support is out of scope for the UmkaOS kernel architecture.

Modern printing is entirely a userspace concern: CUPS (Common Unix Printing System) handles print job management, IPP Everywhere and AirPrint provide driverless network printing over HTTP/HTTPS, and legacy printer drivers (HPLIP, Gutenprint) are userspace filter programs. USB printers use the standard USB printer class (bulk endpoint), which is already covered by the USB subsystem (Section 11.4). CUPS, HPLIP, and Gutenprint run unmodified on UmkaOS.

Scanning is similarly userspace: SANE (Scanner Access Now Easy) and eSCL (AirScan) are userspace libraries. USB scanners use standard USB bulk endpoints. Network scanners communicate via HTTP. No kernel driver is needed.

No KABI traits, device class frameworks, or kernel-level abstractions are defined for printers or scanners.

13.18 Live Kernel Evolution

13.18.1 The Theseus Model

Theseus OS (Rice University, 2020) demonstrated that kernel components can be individually replaced at runtime without rebooting, by making state ownership explicit and granular.

UmkaOS already does this for drivers (Section 11.9 crash recovery). This section extends it to core kernel components.

13.18.2 Design: Explicit State Ownership Graph

```rust
// umka-core/src/evolution/mod.rs

/// Every kernel component declares its state explicitly.
/// This enables:
///   1. Live replacement: old component's state is migrated to new component.
///   2. Crash recovery: component's state can be reconstructed from invariants.
///   3. State inspection: debugging and observability.

/// Trait that every replaceable kernel component implements.
pub trait EvolvableComponent {
    /// Component's serializable state.
    /// Must capture ALL mutable state that persists across calls.
    type State: Serialize + Deserialize;

    /// Export current state for migration to a new version.
    fn export_state(&self) -> Self::State;

    /// Import migrated state from a previous version (for live replacement).
    ///
    /// Called on an already-constructed instance (via `initialize_fresh()` or
    /// a zero-initialized allocation). The function mutates `self` in place
    /// to incorporate the exported state. This matches the C-ABI bridge in
    /// `VtableHeader.import_state_fn` which takes `*mut VtableHeader` and
    /// mutates in place.
    ///
    /// When `dry_run` is true, the function validates that it CAN import
    /// the state (version compat, schema checks) without actually mutating —
    /// used by `is_compatible_batch()` during pre-swap validation.
    fn import_state(&mut self, state: Self::State, dry_run: bool) -> Result<(), MigrationError>;

    /// Initialize fresh (for first boot or after state loss).
    fn initialize_fresh(config: &KernelConfig) -> Self
    where Self: Sized;

    /// Version of this component's state format.
    /// Migration rule: v(N) can import v(N-1) state ONLY.
    /// For larger jumps (v1 → v5): chained migration through intermediates
    /// (v1 → v2 → v3 → v4 → v5). Each version carries ONE migration
    /// function from the immediately prior version. The chain runs
    /// during import_state() before the atomic swap.
    fn state_version(&self) -> u64;
}
```

Chain length bound: To prevent unbounded migration chains, the maximum chain length is 8 intermediate versions: a component at version v(K) can be live-evolved to at most version v(K+8) in a single operation. Larger version jumps require either:

- (a) a direct v(K)→v(K+N) migration function registered by the new component (the component author provides a migration path that skips intermediates), or
- (b) multiple sequential live evolutions (v(K)→v(K+8)→v(K+16)→...), each of which is a separate atomic operation with its own rollback capability.

The 8-version limit bounds the worst-case migration time to ~8× the single-step migration cost. If a chained migration exceeds 500 ms total elapsed time, the evolution is aborted and the old component continues running. This timeout is configurable via evolution.max_chain_time_ms.
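The chained-migration walk can be sketched as follows. This is a minimal userspace model, not SDK code: `MigrationFn`, `run_migration_chain`, and the step-lookup callback are illustrative names, and the 500 ms wall-clock abort is omitted.

```rust
/// Maximum number of single-step migrations in one evolution (spec: 8).
const MAX_CHAIN_LEN: u64 = 8;

#[derive(Debug, PartialEq)]
enum MigrationError {
    ChainTooLong,
    /// No migration function registered INTO this version.
    MissingStep(u64),
}

/// One single-step migration: transforms serialized v(N-1) state into v(N) state.
type MigrationFn = fn(Vec<u8>) -> Vec<u8>;

/// Walk the chain v(from) → v(from+1) → ... → v(to), applying the one
/// registered migration function per step. `step_for(v)` returns the
/// function that migrates INTO version `v`, if any.
fn run_migration_chain(
    mut state: Vec<u8>,
    from: u64,
    to: u64,
    step_for: &dyn Fn(u64) -> Option<MigrationFn>,
) -> Result<Vec<u8>, MigrationError> {
    if to.saturating_sub(from) > MAX_CHAIN_LEN {
        return Err(MigrationError::ChainTooLong);
    }
    for v in (from + 1)..=to {
        let step = step_for(v).ok_or(MigrationError::MissingStep(v))?;
        state = step(state);
    }
    Ok(state)
}
```

A v(K)→v(K+9) request fails with `ChainTooLong` before any step runs, matching rule (b) above: the caller must split it into two sequential evolutions.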

State Serialization Format:

```rust
/// Serialized component state for live replacement.
pub struct ComponentState {
    /// Component identifier (e.g., "scheduler", "page_replacement").
    /// Fixed-size string to ensure validity across live-replacement boundaries
    /// (heap/static pointers from the replaced component are invalid after replacement).
    pub component_id: ArrayString<64>,
    /// State format version (matches EvolvableComponent::state_version).
    pub version: u64,
    /// Serialized state data (component-owned schema).
    /// Allocated from the kernel heap via `alloc::vec::Vec` — this is acceptable
    /// because state export/import runs only during live replacement (rare, cold
    /// path, well after the heap allocator is initialized). State sizes are
    /// bounded per component (see [Section 13.18](#live-kernel-evolution--state-size-budget) table).
    pub data: Vec<u8>,
    /// CRC32C of all preceding fields, using hardware acceleration
    /// (SSE4.2 `crc32` on x86, ARMv8 CRC instructions).
    ///
    /// **Checksum**: CRC32C provides adequate 32-bit error detection for this
    /// small, cold-path structure. A cryptographic hash is unnecessary here —
    /// state integrity against malicious tampering is enforced by the evolution
    /// framework's capability checks and signature verification ([Section 13.18](#live-kernel-evolution--state-size-budget)),
    /// not by this checksum.
    pub checksum: u32,
}
```

Each component owns its serialization schema. The kernel provides StateSerializer helpers for common patterns (serialize BTreeMap, serialize per-CPU arrays, serialize LRU lists) but does not impose a format. Components choose what to serialize and how — the contract is that import_state(export_state()) produces an equivalent component.
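The `import_state(export_state())` equivalence contract can be illustrated with a deliberately simplified component. Everything here is illustrative: `PidTable`/`PidTableState` are hypothetical, the schema is hand-rolled instead of using the SDK's `Serialize`/`Deserialize` and `StateSerializer` helpers, and `dry_run`/versioning are omitted.

```rust
use std::collections::BTreeMap;

/// Stand-in for a component with global mutable state (hypothetical).
struct PidTable {
    next_pid: u32,
    names: BTreeMap<u32, String>,
}

/// The component-owned state schema: everything mutable that persists
/// across calls, and nothing else — no pointers, no derived caches.
#[derive(Clone, Debug, PartialEq)]
struct PidTableState {
    next_pid: u32,
    entries: Vec<(u32, String)>,
}

impl PidTable {
    /// Export: flatten owned structures into the plain-data schema.
    fn export_state(&self) -> PidTableState {
        PidTableState {
            next_pid: self.next_pid,
            entries: self.names.iter().map(|(k, v)| (*k, v.clone())).collect(),
        }
    }
    /// Import: rebuild owned structures from the schema, in place.
    fn import_state(&mut self, s: PidTableState) {
        self.next_pid = s.next_pid;
        self.names = s.entries.into_iter().collect();
    }
}
```

The contract check is that a fresh instance, after importing the old instance's export, exports byte-identical state — derived structures (here, the `BTreeMap`) are rebuilt rather than serialized.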

State spill avoidance (preferred pattern): Components that can structure their state so that per-client data is owned by the client (task, file descriptor, etc.) rather than by the component itself avoid the need for export_state/import_state entirely. The component becomes a stateless processor: swap replaces code, data persists in client-owned structures. This pattern, inspired by Theseus OS (Boos et al., OSDI 2020), eliminates serialization bugs, reduces swap latency to ~1-10 μs, and allows in-flight operations to continue uninterrupted. io_uring uses this pattern — see Section 19.3 for the full design. Components with inherently global state (scheduler run queues, page replacement LRU) still use export_state/import_state.

Replacement mechanism selection guide:

| Criterion | Mechanism | Swap latency | Example components |
| --- | --- | --- | --- |
| Component is stateless (no owned mutable state; all state in client-owned structures) | AtomicPtr swap | ~1 μs | PhysAllocPolicy, VmmPolicy, IoSchedOps, QdiscOps, CongestionOps (the vtable is `&'static dyn` — per-connection CongPriv state is owned by the socket, not the algorithm; existing connections continue using the old vtable until close, new connections pick up the swapped vtable) |
| Component owns global mutable state (run queues, LRU lists, caches) | Full Phase A/A'/B/C lifecycle | ~10-100 ms | VFS, network stack, scheduler policy, slab allocator, block layer |

Decision rule: does the component own mutable state that must survive replacement? If no → AtomicPtr swap. If yes → the full Phase A/A'/B/C lifecycle.

Stateless policy batch replacement: When the evolution orchestrator replaces an entire EvolvableComponent batch, stateless policy modules (e.g., IoSchedOps, QdiscOps) are included in the swap list alongside stateful components. Each stateless policy's AtomicPtr is swapped in Phase B as part of the batch. No quiescence is needed for stateless policies — they have no owned state to drain. The batch orchestrator iterates the swap list and performs all AtomicPtr::store operations within the same stop-the-world IPI window.
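The stateless-policy `AtomicPtr` swap amounts to a single atomic store over a fn-pointer vtable. A minimal sketch, assuming an illustrative `IoSchedVtable` layout (not the real `IoSchedOps` contract) and std atomics in place of the kernel's:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

/// Fn-pointer vtable for a stateless I/O scheduling policy (illustrative).
#[repr(C)]
struct IoSchedVtable {
    /// Pick the next request slot given the current queue depth.
    pick: fn(queue_depth: u32) -> u32,
}

fn fifo_pick(_depth: u32) -> u32 { 0 }
fn half_pick(depth: u32) -> u32 { depth / 2 }

static FIFO: IoSchedVtable = IoSchedVtable { pick: fifo_pick };
static HALF: IoSchedVtable = IoSchedVtable { pick: half_pick };

/// The single mutable word. Callers load it on every dispatch and never
/// cache the pointer across calls.
static ACTIVE: AtomicPtr<IoSchedVtable> =
    AtomicPtr::new(&FIFO as *const IoSchedVtable as *mut IoSchedVtable);

fn pick_next(depth: u32) -> u32 {
    // SAFETY: ACTIVE only ever holds pointers to 'static vtables, and the
    // swap is one atomic store — a reader sees the old or new table whole.
    let vt = unsafe { &*ACTIVE.load(Ordering::Acquire) };
    (vt.pick)(depth)
}

/// The entire "evolution" for a stateless policy: one Release store.
fn swap_policy(new: &'static IoSchedVtable) {
    ACTIVE.store(new as *const IoSchedVtable as *mut IoSchedVtable, Ordering::Release);
}
```

Because the policy owns no mutable state, there is nothing to export, drain, or quiesce — which is exactly why the batch orchestrator can fold these stores into the same stop-the-world IPI window as the stateful swaps.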

13.18.3 Component Replacement Flow

Live kernel component replacement (e.g., new scheduler algorithm):

Pre-Phase A — Signature verification:
  0. The evolution orchestrator verifies the replacement module's ML-DSA-65
     signature against the KABI keyring
     ([Section 12.7](12-kabi.md#kabi-service-dependency-resolution--signing-key-initialization))
     via `verify_evolution_signature()`:

     ```rust
     /// Verify the ML-DSA-65 signature on a replacement component ELF.
     ///
     /// This is Evolvable code (replaceable) — the signature verification
     /// algorithm itself can be live-evolved. The function reads the
     /// `.kabi_signature` ELF section containing the detached ML-DSA-65
     /// signature and verifies it against the KABI keyring's public keys.
     ///
     /// # Errors
     ///
     /// Returns `Err(SignatureError::Invalid)` if the signature does not
     /// verify, `Err(SignatureError::MissingSection)` if the ELF lacks
     /// a `.kabi_signature` section, or `Err(SignatureError::KeyNotFound)`
     /// if no matching public key is in the keyring.
     pub fn verify_evolution_signature(
         elf: &ElfImage,
         keyring: &KabiKeyring,
     ) -> Result<(), SignatureError>;
     ```

     If signature verification fails, the evolution is rejected before any
     state export. No IPI, no quiescence, no disruption.

Phase A — Preparation (runs concurrently with normal operation, NOT stop-the-world):
  1. New component binary loaded (same mechanism as policy module, Section 19.7).
  2. Old component: export_state() → serialized state.
     This may walk large data structures (all run queues, LRU lists, etc.).
     Time: potentially milliseconds for complex components.
     Normal operation continues during this phase — the old component
     is still active and handling requests.
     **Timeout**: `export_state()` has a 5-second timeout (configurable via
     `evolution.export_timeout_ms`, default 5000). If the component fails
     to export within this period, the evolution attempt is aborted and
     the component remains in its current version. The timeout prevents a
     buggy or deadlocked component from blocking the entire evolution
     process indefinitely. On timeout, the evolution orchestration emits
     an FMA event (`FaultEvent::EvolutionExportTimeout`) and returns
     `Err(EvolutionError::ExportTimeout)`.
  3. New component: import_state(serialized_state) → initialized.
     **Timeout**: `import_state()` has the same 5-second timeout. On
     timeout, the new component is discarded and the old continues.
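The Phase A timeout rule has roughly this shape. This is a userspace sketch only: the kernel enforces `evolution.export_timeout_ms` with a watchdog against the in-place export, not with a helper thread and channel, and `EvolutionError` here carries only the one variant needed for the sketch.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum EvolutionError {
    ExportTimeout,
}

/// Run `export` with a bounded deadline. On timeout, the caller aborts the
/// evolution and the old component remains active (per Phase A, step 2).
fn export_with_timeout<F>(export: F, timeout: Duration) -> Result<Vec<u8>, EvolutionError>
where
    F: FnOnce() -> Vec<u8> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the receiver already timed out and dropped, send() fails harmlessly.
        let _ = tx.send(export());
    });
    rx.recv_timeout(timeout).map_err(|_| EvolutionError::ExportTimeout)
}
```

The key property modeled here is that a deadlocked `export_state()` cannot block the orchestrator: the wait is bounded, and on timeout the evolution attempt simply returns an error while the old component keeps running.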

Phase A' — Quiescence (bounded, runs before the atomic swap):
  Before the atomic swap, the old component enters a **quiescence phase**: all
  in-flight operations are allowed to complete (with a bounded deadline), and new
  operations are queued. The quiescence deadline is configurable per component type
  (default: 10ms for scheduler, 50ms for page replacement, 100ms for Evolvable services).
  If the deadline expires before all in-flight operations drain, the replacement is
  aborted and the old component resumes normal operation without disruption.

  **DLM quiescence** (when the evolution batch includes the DLM subsystem):
  DLM lock operations (acquire, release, convert, cancel) use direct Evolvable calls
  (T0 transport), not KABI rings. Phase A' must drain all in-flight DLM operations
  before vtable swap. Protocol: (1) Set `dlm_quiescing` flag — new DLM operations
  return ERESTARTSYS (callers retry after evolution). (2) Wait for
  `dlm_in_flight_ops.load(Acquire) == 0` (each DLM entry increments on entry,
  decrements on exit). (3) DLM deadline: 100ms (matches Evolvable default). If
  in-flight ops do not drain (e.g., stuck waiting for remote lock grant), abort
  evolution. DLM operations that hold distributed locks across the quiescence
  window are safe: the lock state is exported in Phase A and re-imported in
  Phase C. Remote peers continue seeing the lock as held.

  **Scheduler evolution scope**: Scheduler evolution replaces the Evolvable
  scheduler module — all EEVDF formulas and tree operations (`update_curr`,
  `pick_eevdf`, `enqueue_entity`, `dequeue_entity`, `avg_vruntime`,
  `entity_eligible`, `update_entity_lag`, PELT), plus policy heuristics
  (weight calculations, preemption thresholds, idle balance strategy, load
  balancer logic). Only the Nucleus data structure layouts (`VruntimeTree`,
  `EevdfRunQueue`, `EevdfTask`, RB-tree node layout) and the fixed priority dispatch ladder
  (DL > RT > CBS/EEVDF > IDLE) are non-replaceable. The dispatch ladder is a
  static enum match, not a vtable call. Correctness of replaced formulas is
  ensured by Nucleus invariant checkers before the swap is committed. This is
  consistent with [Section 7.1](07-scheduling.md#scheduler).

  **Scheduler-specific quiescence note**: For the scheduler, `pick_next_task` is
  called from the timer interrupt on every tick on every CPU. During quiescence,
  these calls are intercepted by the trampoline and queued — **except for DL and
  RT tasks**, which are exempt (see below). This means no new EEVDF/CBS scheduling
  decisions are made during the quiescence window — CPUs running EEVDF/CBS tasks
  continue executing their current task. The 10ms quiescence window spans only
  ~2-3 scheduler ticks (one tick is typically 4ms at HZ=250), so at most 2-3
  timer ticks are queued per CPU. The queued `pick_next_task` calls are drained
  and discarded after the Phase B atomic swap — the runqueue state they would
  have produced is transferred wholesale in Phase B (full runqueue ownership
  transfer, below).

  **DL and RT task exemption during quiescence**: The vtable trampoline inspects
  the current CPU's runqueue head task's scheduling class before intercepting
  `pick_next_task`. DL and RT tasks are **exempt from quiescence interception**:
  the trampoline dispatches their `pick_next_task` calls directly through the
  **old** policy vtable, bypassing the `PendingOpsPerCpu` entirely. This ensures:

  - **DL tasks** never miss their absolute deadlines. A DL task with a 2ms
    deadline cannot tolerate a 10ms scheduling blackout. The old scheduler
    continues to pick DL tasks using the existing EDF ordering until Phase B
    completes.
  - **RT-FIFO and RT-RR tasks** maintain their hard-latency guarantees. RT tasks
    at priorities 1-99 have deterministic dispatch requirements; a 10ms gap would
    violate POSIX real-time scheduling guarantees. The old scheduler continues to
    pick RT tasks using the existing priority-ordered dispatch until Phase B
    completes.
  - **EEVDF/CBS tasks** are the only classes whose `pick_next_task` calls are
    queued. Their ~10ms worst-case latency impact is comparable to a long
    spinlock hold and acceptable for a live-evolution operation that occurs at
    most a few times per kernel lifetime.

  The trampoline exemption check is O(1) — a single comparison against the
  scheduling class enum stored in the per-CPU runqueue's `curr->sched_class`
  field. No lock is acquired; the field is read under the existing per-CPU
  runqueue lock that `pick_next_task` already holds.

```rust
/// Trampoline exemption check for scheduler quiescence.
/// Called within the `pick_next_task` interception path when `quiescing` is true.
///
/// Returns `true` if the current CPU has DL or RT tasks at the head of its
/// runqueue, meaning `pick_next_task` must be dispatched through the old vtable
/// immediately rather than being queued in the PendingOpsPerCpu.
///
/// # Safety
///
/// Must be called with the per-CPU runqueue lock held (guaranteed by
/// `pick_next_task`'s calling convention).
#[inline(always)]
fn sched_quiescence_exempt(rq: &Runqueue) -> bool {
    // DL class has runnable tasks — exempt (absolute deadline guarantee).
    if rq.dl.nr_running > 0 {
        return true;
    }
    // RT class has runnable tasks — exempt (POSIX RT latency guarantee).
    if rq.rt.nr_running > 0 {
        return true;
    }
    // EEVDF / CBS / Idle — not exempt, queue the pick_next_task call.
    false
}
```

  **Full runqueue ownership transfer (Phase A' → Phase B atomic)**:

  All runqueue classes (DL, RT, and CFS/EEVDF) are transferred **atomically as
  part of Phase B**, not deferred to Phase C. This eliminates any window where
  tasks could be "invisible" to the scheduler:

  - A DL task missing from all runqueues would miss its absolute deadline.
  - An RT task missing from all runqueues would violate POSIX real-time guarantees.
  - A CFS task missing from all runqueues would experience unbounded latency
    (the PendingOps replay path cannot guarantee scheduling order fidelity).

  By transferring all runqueue state atomically alongside the vtable swap, the
  new scheduler inherits the exact runqueue state and can dispatch all classes
  immediately after Phase B. Phase C is only needed for non-runqueue state
  (timers, bandwidth controllers, PELT load-tracking state) that can be
  reconstructed lazily without affecting scheduling correctness.

  During Phase B (stop-the-world IPI), while all CPUs are halted, the
  orchestration layer snapshots the per-CPU runqueue state for all scheduling
  classes into the evolution slot. The snapshot must happen during Phase B
  (not Phase A') because DL/RT tasks continue dispatching through the old
  vtable during Phase A', meaning runqueues are still being actively modified:

  ```rust
  /// Per-evolution state for full runqueue ownership transfer.
  /// Populated during Phase B (stop-the-world IPI window) and consumed
  /// immediately afterward in Phase B step 5c (transfer into the new
  /// scheduler's runqueues). The snapshot is taken during Phase B — not
  /// Phase A' — because Phase A' is not stop-the-world, and runqueues
  /// are being actively modified by DL/RT-exempt dispatches. Only the
  /// Phase B IPI guarantees all CPUs are halted and runqueue locks can
  /// be acquired without racing concurrent scheduling decisions.
  pub struct SchedEvolutionSlot {
      /// Per-CPU DL runqueue snapshots. Indexed by CPU ID (0..nr_online).
      /// Captured under each CPU's runqueue lock during the Phase B
      /// stop-the-world IPI. Contains the full EDF-ordered task list with
      /// each task's (runtime, deadline, period) parameters.
      ///
      /// NOT `PerCpu<T>` — uses a heap-allocated `Vec` indexed by CPU ID
      /// because the orchestrating CPU must access ALL CPUs' slots during
      /// Phase B step 5c (remote access). `PerCpu<T>::get()` requires
      /// `&PreemptGuard` and only accesses the current CPU's slot.
      /// Heap allocation is acceptable: this is a cold path (evolution
      /// occurs at most a few times per kernel lifetime).
      pub dl_rqs: Vec<Option<DlRunQueueSnapshot>>,
      /// Per-CPU RT runqueue snapshots. Same allocation model as `dl_rqs`.
      pub rt_rqs: Vec<Option<RtRunQueueSnapshot>>,
      /// Per-CPU CFS/EEVDF runqueue snapshots. Each snapshot is
      /// heap-allocated via `Box<CfsRunQueueSnapshot>` because
      /// `CfsRunQueueSnapshot` is ~320 KiB (4096 tasks × 80 bytes).
      /// Storing this inline in a `Vec<Option<CfsRunQueueSnapshot>>`
      /// would require ~320 KiB × NR_CPUS contiguous allocation.
      /// Instead, each CPU's snapshot is individually `Box`ed, and the
      /// Vec holds `Option<Box<CfsRunQueueSnapshot>>`.
      pub cfs_rqs: Vec<Option<Box<CfsRunQueueSnapshot>>>,
  }

  /// Maximum DL tasks per CPU snapshot. Typical systems have <10 DL tasks;
  /// 64 covers pathological pinning. If exceeded, evolution aborts with
  /// `EvolutionError::RunqueueTooLarge` (same pattern as CFS snapshots).
  pub const MAX_DL_TASKS_PER_RQ: usize = 64;

  /// Maximum RT tasks per CPU snapshot. RT task counts are bounded by
  /// priority levels (100 RT priorities); 128 covers all realistic cases.
  /// If exceeded, evolution aborts with `EvolutionError::RunqueueTooLarge`.
  pub const MAX_RT_TASKS_PER_RQ: usize = 128;

  pub struct DlRunQueueSnapshot {
      /// EDF-ordered task list (earliest deadline first).
      /// Bounded by `MAX_DL_TASKS_PER_RQ`. If the per-CPU DL runqueue
      /// exceeds this capacity during snapshot capture, the snapshot phase
      /// fails and evolution is aborted with `EvolutionError::RunqueueTooLarge`
      /// — all CPUs are released and the old component continues. This
      /// prevents an `ArrayVec::push` panic during Phase B stop-the-world,
      /// which would be unrecoverable (no CPU calls `release_all_cpus()`).
      pub tasks: ArrayVec<DlTaskParams, MAX_DL_TASKS_PER_RQ>,
      pub nr_running: u32,
  }

  pub struct RtRunQueueSnapshot {
      /// Priority-ordered task list (highest priority first).
      /// Bounded by `MAX_RT_TASKS_PER_RQ`. Same overflow handling as
      /// `DlRunQueueSnapshot`: abort evolution on capacity exceeded.
      pub tasks: ArrayVec<RtTaskParams, MAX_RT_TASKS_PER_RQ>,
      pub nr_running: u32,
  }

  /// Maximum tasks per CFS runqueue snapshot. Bounds memory usage and
  /// prevents OOM during evolution. 4096 is sufficient for even heavily
  /// loaded systems — a single CPU runqueue rarely exceeds a few hundred
  /// runnable CFS tasks. If a runqueue has more than MAX_TASKS_PER_RQ
  /// tasks (pathological overload), the snapshot captures the first 4096
  /// in vruntime order and the evolution is aborted with
  /// `EvolutionError::RunqueueTooLarge` — the system is too loaded for
  /// safe evolution and should be drained before retrying.
  pub const MAX_TASKS_PER_RQ: usize = 4096;

  pub struct CfsRunQueueSnapshot {
      /// EEVDF-ordered task entries with full scheduling state.
      /// Bounded by `MAX_TASKS_PER_RQ` to avoid unbounded heap allocation.
      /// The `ArrayVec` is stored inline within the per-CPU `Box`ed
      /// snapshot — no heap allocation beyond the `Box` itself (see the
      /// memory-footprint note below). If the runqueue exceeds
      /// `MAX_TASKS_PER_RQ` tasks, the snapshot phase fails and the
      /// evolution is aborted — the old component continues without
      /// disruption.
      ///
      /// **Memory footprint**: Each `CfsTaskParams` is 80 bytes (per
      /// `const_assert!` below); at `MAX_TASKS_PER_RQ` (4096), this array
      /// is ~320 KiB per CPU.  On a 128-core system, total snapshot memory
      /// is ~40 MiB.  Each `CfsRunQueueSnapshot` is individually
      /// heap-allocated via `Box::new()` and stored in the
      /// `SchedEvolutionSlot.cfs_rqs` Vec as `Option<Box<CfsRunQueueSnapshot>>`.
      /// Each Box is allocated from the NUMA-local memory pool for that CPU
      /// (via `alloc_on_node(cpu_to_node(cpu_id))`) to avoid remote memory
      /// access during Phase B snapshot capture and transfer.
      /// The `SchedEvolutionSlot` itself is also heap-allocated
      /// (not stack-allocated, not in PerCpu section).  On systems
      /// that do not need live kernel evolution, this feature can be
      /// disabled at compile time (`cfg(feature = "live-evolution")`),
      /// reclaiming all ~320 KiB/CPU.
      /// This is acceptable for a server-class system where live evolution
      /// justifies the memory cost.
      pub tasks: ArrayVec<CfsTaskParams, MAX_TASKS_PER_RQ>,
      pub nr_running: u32,
      /// Min vruntime of the runqueue (used to normalize new task placement).
      pub min_vruntime: u64,
  }

  /// `#[repr(C)]` required: this struct crosses compiler version boundaries
  /// during live evolution. `Option<CbsServerParams>` replaced with explicit
  /// presence flag for deterministic layout across Rust toolchain versions.
  #[repr(C)]
  pub struct CfsTaskParams {
      /// Task identifier for re-linking after transfer.
      pub pid: Pid,                   // 4 bytes  (offset 0)
      /// Explicit padding: Pid (i32) -> vruntime (u64) alignment.
      pub _pad0: [u8; 4],            // 4 bytes  (offset 4)
      /// Virtual runtime (the EEVDF ordering key).
      pub vruntime: u64,              // 8 bytes  (offset 8)
      /// EEVDF lag value (signed: positive = owed CPU time).
      pub lag: i64,                   // 8 bytes  (offset 16)
      /// Task weight (nice-derived, used for proportional sharing).
      pub weight: u32,                // 4 bytes  (offset 24)
      /// Explicit padding: weight (u32) -> deadline (u64) alignment.
      pub _pad1: [u8; 4],            // 4 bytes  (offset 28)
      /// EEVDF virtual deadline (vruntime + eligible_delta / weight).
      pub deadline: u64,              // 8 bytes  (offset 32)
      /// 0 = no CBS server, 1 = CBS server params in `cbs` are valid.
      pub has_cbs: u8,                // 1 byte   (offset 40)
      /// Explicit padding: has_cbs (u8) -> cbs (CbsServerParams, align 8).
      pub _pad2: [u8; 7],            // 7 bytes  (offset 41)
      /// CBS server state. Only valid when `has_cbs == 1`.
      pub cbs: CbsServerParams,       // 32 bytes (offset 48)
      // Total: 4+4+8+8+4+4+8+1+7+32 = 80 bytes.
  }
  const_assert!(core::mem::size_of::<CfsTaskParams>() == 80);

  /// CBS (Constant Bandwidth Server) parameters for a task, used during
  /// live scheduler evolution to preserve bandwidth guarantees across the
  /// swap. See [Section 7.1](07-scheduling.md#scheduler--cpu-bandwidth-guarantees) for the runtime CBS
  /// protocol. `#[repr(C)]` required: embedded in `CfsTaskParams` which
  /// crosses compiler version boundaries.
  #[repr(C)]
  pub struct CbsServerParams {
      /// Assigned bandwidth budget per period, in nanoseconds.
      pub budget_ns: u64,
      /// Replenishment period in nanoseconds.
      pub period_ns: u64,
      /// Remaining budget in the current period (nanoseconds).
      pub remaining_ns: i64,
      /// Absolute deadline of the current period (wall-clock nanoseconds).
      pub deadline_ns: u64,
  }
  const_assert!(core::mem::size_of::<CbsServerParams>() == 32);
  ```

  **Phase B transfer protocol** (within the stop-the-world IPI, step 5c):

  Before Phase B, the orchestration layer freezes all CPUs by sending an
  NMI IPI (on x86-64) or equivalent non-maskable interrupt. Because NMI
  is non-maskable, a CPU may be halted while holding its runqueue lock
  (e.g., inside `schedule()` or `update_curr()`). The transfer protocol
  handles this via `try_lock()` with a safe fallback:

  1. For each CPU (already halted by the IPI):
     a. Attempt to acquire the per-CPU runqueue lock via `try_lock()`.
        If `try_lock()` succeeds, proceed normally.
        If `try_lock()` fails, the halted CPU holds this lock. Since the
        CPU is halted and cannot race, the orchestrating CPU accesses the
        runqueue data directly via raw pointer (bypassing the lock).
        // SAFETY: The halted CPU cannot execute any instructions until
        // `release_all_cpus()` is called, so there is no data race.
        // This lock bypass is analogous to Linux's `stop_machine()`.
        //
        // **Post-resume safety**: After CPUs resume, a halted CPU that was
        // mid-write to the old runqueue (e.g., inside `migrate_task()`)
        // completes that write. This is safe because old component memory
        // is retained for the 5-second watchdog window (Phase C2, step 10),
        // which is orders of magnitude longer than the post-resume
        // instruction completion window (~10-50 instructions, ~50ns).
        // The old runqueue data is never freed until the watchdog expires
        // AND an RCU grace period completes.
        //
        // **Mid-migration task scenario**: A task being moved between
        // runqueues at the instant of NMI may be absent from all
        // runqueues (removed from source, not yet added to destination).
        // The `post_swap_runqueue_audit` (Phase C1a) detects this via the
        // "task absent from all runqueues" check and repairs it by adding
        // the task to its `task.cpu` runqueue in the new scheduler.
        The orchestrating CPU must NOT release the lock on the halted
        CPU's behalf — the halted CPU will release it when it resumes.
     b. Transfer `SchedEvolutionSlot.dl_rqs[cpu]` into the new scheduler's
        DL runqueue data structure. Each DL task's `(runtime, deadline, period)`
        is copied verbatim — no parameter recalculation. The new scheduler
        inherits the task's current EDF state exactly.
     c. Transfer `SchedEvolutionSlot.rt_rqs[cpu]` into the new scheduler's
        RT runqueue. For RT-RR tasks, the remaining time slice is preserved.
     d. Transfer `SchedEvolutionSlot.cfs_rqs[cpu]` into the new scheduler's
        CFS/EEVDF runqueue. Each task's `(vruntime, lag, weight, deadline)` is
        copied verbatim. CBS server parameters are preserved. The new scheduler
        inherits the exact vruntime ordering — no vruntime normalization or
        recalculation is performed (the `min_vruntime` base is also transferred).
     e. Release the runqueue lock.

  2. After all per-CPU runqueues are transferred, the vtable pointer swap occurs
     (step 6). The new scheduler is now active with the complete runqueue state.

  3. After Phase B completes (CPUs released), the scheduler is unfrozen:
     preemption is re-enabled on all CPUs via a follow-up IPI.

  This transfer adds ~1-5 μs per CPU to the Phase B stop-the-world window
  (O(N_tasks_per_cpu) per CPU, dominated by CFS task count). On a system with
  10,000 CFS tasks across 64 CPUs (~156 tasks/CPU average), the per-CPU transfer
  completes in ~2-3 μs (memcpy of pre-allocated snapshot buffers). The total
  Phase B window increases from ~1-10 μs to ~10-50 μs for scheduler evolution —
  acceptable because scheduler evolution occurs at most a few times per kernel
  lifetime, and the window is still below the 100 μs real-time latency budget.
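The step 5a `try_lock()`-with-bypass pattern can be modeled in miniature. This sketch substitutes a std `Mutex` and `UnsafeCell` for the per-CPU runqueue lock and data; `CpuRq` and `snapshot_rq` are illustrative names, and the stop-the-world guarantee that makes the bypass sound is assumed, not enforced, by the code.

```rust
use std::cell::UnsafeCell;
use std::sync::Mutex;

/// Illustrative per-CPU runqueue: the Mutex models the runqueue lock,
/// the UnsafeCell models the data it protects (task ids standing in for
/// the DL/RT/CFS snapshot contents).
struct CpuRq {
    lock: Mutex<()>,
    tasks: UnsafeCell<Vec<u32>>,
}
// SAFETY: access is either under `lock` or under the modeled Phase B
// stop-the-world guarantee (no other CPU can be executing).
unsafe impl Sync for CpuRq {}

/// Phase B step 5a sketch: try_lock, falling back to a direct read when
/// the (halted) lock holder cannot release it. We never unlock on the
/// holder's behalf — it releases the lock itself when CPUs resume.
fn snapshot_rq(rq: &CpuRq) -> Vec<u32> {
    match rq.lock.try_lock() {
        // Lock free: ordinary locked read.
        Ok(_guard) => unsafe { (*rq.tasks.get()).clone() },
        // Lock held by a halted CPU. SAFETY: every CPU is frozen by the
        // Phase B IPI, so no writer can race this read (stop_machine-style
        // reasoning from the transfer protocol above).
        Err(_) => unsafe { (*rq.tasks.get()).clone() },
    }
}
```

The point of the model is that both arms read the same data: the lock exists to serialize concurrent CPUs, and once the IPI guarantees there are none, bypassing it is a read of quiescent memory, not a lock-ordering violation.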

  **Phase C scope narrowed for scheduler evolution**: Because all runqueue
  entries are transferred in Phase B, Phase C for scheduler evolution is limited
  to reconstructing non-runqueue state:
  - PELT (Per-Entity Load Tracking) running averages — reconstructed from the
    current task weights and recent CPU utilization samples.
  - CPU bandwidth controllers — timer state re-initialized from the cgroup
    bandwidth parameters (quota, period) which are stored in cgroup-owned
    structures that survive the evolution.
  - Scheduler statistics counters — reset to zero (informational only, no
    correctness impact).
  Phase C no longer replays `pick_next_task` calls from PendingOps for scheduler
  evolution. PendingOps for the scheduler class are drained and discarded after
  Phase B (the runqueue state they would have produced is already transferred).

  **Invariant**: At no point during the evolution lifecycle does any task (DL,
  RT, or CFS) exist outside of a runqueue. During Phase A', tasks run on the old
  scheduler's runqueues. During the Phase B IPI, ownership transfers atomically
  to the new scheduler's runqueues. After Phase B, the new scheduler dispatches
  all classes immediately — no Phase C replay is needed for runqueue state.

  This transfer is bounded: O(N_total_tasks) across all CPUs. Typical systems
  have <10 DL tasks, <50 RT tasks, and <10,000 CFS tasks total across all CPUs,
  so the full transfer completes in <50 μs.

```rust
/// Maximum serialized argument size for a deferred vtable call.
/// Total struct size = 16 (header) + 240 (payload) = 256 bytes = 4 cache lines.
pub const PENDING_OP_MAX_ARG_SIZE: usize = 240;

/// Completion token for a deferred vtable call. The caller's thread parks
/// on this token (via `CompletionToken::wait()`) until the drain thread
/// replays the op through the new component's vtable and calls
/// `CompletionToken::complete(result)`.
///
/// Allocated from a per-CPU slab cache (`COMPLETION_TOKEN_CACHE`) during
/// Phase A' interception. The interception trampoline allocates a token,
/// stores a pointer in `PendingOp.completion`, and parks the caller.
/// The drain thread (Phase C1) calls `complete()` which stores the result
/// and wakes the parked thread. The caller then reads the result and
/// frees the token back to the slab.
// kernel-internal, not KABI
pub struct CompletionToken {
    /// Result of the replayed vtable call (method-specific encoding).
    /// Written by the drain thread, read by the parked caller.
    pub result: UnsafeCell<i64>,
    /// Wakeup state: 0 = pending, 1 = complete. Written by drain, polled by caller.
    pub state: AtomicU8,
    /// The parked thread's task pointer (for `wake_up_process()`).
    pub waiter: *mut Task,
}

impl CompletionToken {
    /// Store the result and wake the parked caller.
    ///
    /// # Safety
    /// Must be called exactly once per token. `self.waiter` must be a valid
    /// task pointer (set during Phase A' interception, before the caller parks).
    pub unsafe fn complete(&self, result: i64) {
        // Store result before signaling completion (Release ordering ensures
        // the result write is visible before the state transition).
        *self.result.get() = result;
        self.state.store(1, Ordering::Release);
        // Wake the parked thread.
        wake_up_process(self.waiter);
    }

    /// Block the current thread until the drain thread calls `complete()`.
    /// Returns the result value stored by the drain thread.
    pub fn wait(&self) -> i64 {
        while self.state.load(Ordering::Acquire) == 0 {
            // Yield via schedule(). The task remains runnable, so this is
            // a re-checking yield loop: a wake_up_process() from the drain
            // thread that races ahead of our schedule() is never lost —
            // `state` is re-checked on every pass.
            schedule();
        }
        // SAFETY: state == 1 means the drain thread has written the result
        // with Release ordering. Our Acquire load above synchronizes-with it.
        unsafe { *self.result.get() }
    }
}

/// A vtable call deferred during component quiescence (live driver evolution).
/// Fixed-size layout enables a statically-allocated ring buffer — no heap allocation
/// during the quiescence window when memory operations may be restricted.
///
/// `method_id = u32::MAX` is a sentinel for an empty/invalid slot.
/// Vtable method ordinals are 0-based, so `u32::MAX` cannot collide
/// with any valid method index.
///
/// **Large-argument methods** (e.g., scheduler `pick_next_task` with ~2.5 KB
/// `SchedPolicyContext`): arguments that exceed `PENDING_OP_MAX_ARG_SIZE` are
/// stored in a per-CPU pre-allocated snapshot buffer. `PendingOp.args` then
/// carries only a `PendingOpBufRef` (8 bytes: cpu_id + buffer_index) that
/// the replay loop resolves to the pre-allocated buffer. This avoids both
/// heap allocation and a 2.5 KB copy into the ring entry.
// kernel-internal, not KABI
#[repr(C, align(64))]
pub struct PendingOp {
    /// Vtable method index (matches the `KernelServicesVTable` or `DriverVTable` ordinal).
    pub method_id: u32,
    /// Number of valid bytes in `args` (0 if method takes no arguments).
    pub arg_len: u32,
    /// Completion token for waking the parked caller after replay.
    /// Null pointer for fire-and-forget ops that do not need a response.
    /// Populated by the interception trampoline during Phase A'.
    /// Consumed by `drain_pending_ops()` during Phase C1 (reads pointer,
    /// then writes null to mark as consumed).
    ///
    /// Uses `*mut CompletionToken` with null sentinel instead of
    /// `Option<*mut CompletionToken>` because `PendingOp` crosses the
    /// live-replacement boundary (old orchestration writes, new
    /// orchestration reads via AtomicPtr transfer in Phase B step 2).
    /// `Option` has no stable C layout across compiler versions;
    /// explicit null sentinel is safe for `#[repr(C)]` structs.
    pub completion: *mut CompletionToken,
    /// Serialized method arguments. Encoding is method-specific (documented per method).
    /// For inline arguments: raw bytes up to PENDING_OP_MAX_ARG_SIZE.
    /// For large arguments: first 8 bytes contain a `PendingOpBufRef`.
    pub args: [u8; PENDING_OP_MAX_ARG_SIZE],
}

/// Reference to a per-CPU pre-allocated snapshot buffer for large arguments.
/// Stored in `PendingOp.args[0..8]` when the argument exceeds
/// `PENDING_OP_MAX_ARG_SIZE`. The replay loop resolves this to the actual
/// buffer using `PENDING_OP_BUFFERS[cpu_id][buffer_index]`.
#[repr(C)]
pub struct PendingOpBufRef {
    /// CPU that captured the snapshot (identifies the per-CPU buffer pool).
    /// u16: 65535 CPUs sufficient through 2060+ projection.
    pub cpu_id: u16,
    /// Index into that CPU's pre-allocated buffer array.
    pub buffer_index: u16,
    /// Actual byte length of the argument data in the buffer.
    pub data_len: u32,
}
// PendingOpBufRef: cpu_id(u16=2) + buffer_index(u16=2) + data_len(u32=4) = 8 bytes.
const _PENDING_OP_BUF_REF_SIZE_CHECK: () = assert!(
    core::mem::size_of::<PendingOpBufRef>() == 8,
    "PendingOpBufRef must be exactly 8 bytes"
);

/// Per-CPU pool of pre-allocated large-argument buffers for pending ops.
/// Each buffer is 4 KB (one page), sufficient for the largest policy context
/// (SchedPolicyContext ≈ 2.5 KB). 4 buffers per CPU handles the maximum
/// concurrent quiescence depth (one per evolvable subsystem).
pub const PENDING_OP_BUF_SIZE: usize = 4096;
pub const PENDING_OP_BUFS_PER_CPU: usize = 4;

// Compile-time assertion: struct must be exactly 256 bytes (4 × 64-byte cache lines).
const _PENDING_OP_SIZE_CHECK: () = assert!(
    core::mem::size_of::<PendingOp>() == 256,
    "PendingOp must be exactly 256 bytes"
);

/// Maximum number of ops that can be queued during a single quiescence window.
/// At ~1000 control-plane vtable calls/sec, provides ~64ms of buffering.
///
/// **Scope**: this queue captures CONTROL-PLANE vtable calls only (configuration,
/// status queries, lifecycle operations). DATA-PLANE I/O (NVMe submissions, NIC
/// packet processing) does NOT flow through the pending-ops queue — it flows
/// through DomainRingBuffer ring pairs that are drained and transferred atomically
/// during Phase B (the ring pointer swap is O(1), not per-entry). High-throughput
/// drivers (100K+ IOPS) are therefore unaffected by this queue's capacity.
/// Drivers that cannot tolerate even control-plane quiescence (e.g., real-time
/// audio with <5ms deadline) should use the Tier 1 crash recovery path
/// (Section 11.7) instead of live evolution.
pub const PENDING_OPS_QUEUE_CAPACITY: usize = 64;

/// Per-CPU pending-op slot for a quiescing driver instance.
///
/// During Phase A', each CPU executing old code registers its pending op on
/// its own slot. The drain thread iterates all CPU slots to collect pending ops.
/// This avoids CAS contention on a shared ring — each per-CPU slot is SPSC
/// (the CPU is the sole producer, the drain thread is the sole consumer),
/// preserving the formal verification property (INV-6).
///
/// Concurrency model: Each CPU writes to its own PendingOpSlot. The drain thread
/// reads all CPU slots. This is per-CPU SPSC, not MPSC.
///
/// Each CPU writes only to its own slot during Phase A' interception — there is no
/// cross-CPU contention and no MPSC ring needed. The drain thread (running during
/// Phase A'/B on the orchestrating CPU) reads from all CPU slots sequentially.
/// This replaces the earlier PendingOpsRing which was incorrectly designed as
/// SPSC despite being used in an MPSC pattern (multiple CPUs intercepting ops).
pub struct PendingOpsPerCpu {
    /// Per-CPU pending op queues. Indexed by cpu_id (0..nr_cpus).
    /// NOT `PerCpu<T>` — uses a heap-allocated `Vec` indexed by CPU ID
    /// because the drain thread (running on the orchestrating CPU during
    /// Phase C1) must access ALL CPUs' slots, not just the current CPU's.
    /// `PerCpu<T>::get()` requires `&PreemptGuard` and accesses only the
    /// current CPU's slot. The same pattern is used by `SchedEvolutionSlot`
    /// (line 312) for identical reasons.
    /// Allocated at component registration time (warm path, bounded by nr_cpus).
    /// Use `get_remote()` pattern: `&self.slots[cpu_id]` for cross-CPU access.
    slots: Vec<PendingOpSlot>,
    /// Total pending ops across all CPUs (for capacity checking).
    total_pending: AtomicU64,
}

/// Per-CPU slot: bounded queue for ops intercepted on this CPU.
/// Most CPUs have 0-1 pending ops during quiescence; the queue handles
/// the rare case of re-entrant or batched operations.
/// Capacity 32 accommodates worst-case quiescence periods (~25 scheduler
/// ticks at ~4ms each = ~100ms). If capacity is exceeded, the pending op
/// is executed synchronously against the old component (safe because
/// quiescence is not yet confirmed).
pub struct PendingOpSlot {
    buf: ArrayVec<PendingOp, 32>,
    /// Set to 1 when this CPU has pending ops to drain. 0 = empty.
    /// Uses `AtomicU8` instead of `AtomicBool` for cross-compiler-version
    /// safety — `PendingOpSlot` is embedded in `PendingOpsPerCpu` which
    /// is transferred across the live-replacement boundary via
    /// `AtomicPtr::store` in Phase B step 2. `AtomicBool` has no stable
    /// ABI guarantee across compiler versions. Consistent with
    /// `VtableHeader.quiescing` (also `AtomicU8` for this reason).
    has_pending: AtomicU8,
}
```
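As a layout cross-check of the structs above: `PendingOp`'s fixed header (`method_id`, `arg_len`, `completion`) occupies 16 bytes on a 64-bit target, which pins the inline argument capacity at 240 bytes. A minimal sketch of the arithmetic (illustrative constants only; the kernel's actual `PENDING_OP_MAX_ARG_SIZE` definition lives with its sources):

```rust
// Illustrative layout arithmetic for the 256-byte PendingOp (assumes a
// 64-bit target, i.e., an 8-byte `completion` pointer). Not kernel code.
const PENDING_OP_TOTAL_SIZE: usize = 256; // 4 x 64-byte cache lines
const PENDING_OP_HEADER_SIZE: usize = 4   // method_id: u32
    + 4                                   // arg_len: u32
    + 8;                                  // completion: *mut CompletionToken
const PENDING_OP_MAX_ARG_SIZE: usize = PENDING_OP_TOTAL_SIZE - PENDING_OP_HEADER_SIZE;

fn main() {
    // 256 - 16 = 240 inline argument bytes; an 8-byte PendingOpBufRef fits
    // in args[0..8] when a large argument spills to the per-CPU buffer pool.
    assert_eq!(PENDING_OP_MAX_ARG_SIZE, 240);
}
```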

  **Cross-CPU ordering for replayed PendingOps**: PendingOps captured on
  different CPUs during Phase A' have no inherent cross-CPU ordering — CPU 0's
  intercepted ops and CPU 3's intercepted ops were executing concurrently with
  no happens-before relationship between them. Cross-CPU ordering is established
  by the **Phase B IPI barrier**: when `stop_the_world()` fires (step 4 of
  Phase B), all CPUs halt at a known safe point. At that instant:
  - Every CPU has committed all its pending ops to its per-CPU `PendingOpSlot`
    (the interception trampoline runs to completion before the IPI handler
    saves state).
  - The drain thread (step 5) collects ops from all CPU slots sequentially.
  - The collection order (CPU 0, CPU 1, ..., CPU N) is deterministic but
    arbitrary — it does not imply a temporal ordering between ops from
    different CPUs.

  **Drain function**: During Phase C1 (step 9), the orchestrating CPU drains
  all pending ops and replays them against the new component:

  ```rust
  /// Drain all per-CPU pending op slots and replay against the new component.
  ///
  /// Called by the orchestrating CPU after Phase B completes and the new
  /// component's vtable is installed. All other CPUs have been released
  /// from the halt loop and CAN dispatch through the new vtable
  /// (`quiescing` was cleared in Phase B step 7, BEFORE `release_all_cpus()`
  /// in step 8). This means `drain_pending_ops` runs CONCURRENTLY with
  /// new-vtable callers on other CPUs.
  ///
  /// **Concurrency contract**: The new component MUST handle concurrent
  /// access during replay. PendingOps from Phase A' and new calls from
  /// post-resume CPUs may execute simultaneously against the same
  /// component state. For stateless components (simple method dispatch)
  /// this is trivially safe. For stateful components (scheduler, VFS),
  /// the component's internal locking handles concurrency — the same
  /// locking that protects concurrent calls during normal operation.
  ///
  /// The drain iterates CPUs in ascending order (CPU 0, 1, ..., N).
  /// Within each CPU's slot, ops are replayed in FIFO order (matching
  /// the interception order). Cross-CPU ordering is not preserved
  /// because it did not exist (ops were concurrent).
  ///
  /// For ops whose method was tombstoned in the new component (method
  /// removed or renamed), the op is dropped and the caller receives
  /// `Err(KabiError::NotSupported)` via the completion mechanism.
  ///
  /// # Errors
  ///
  /// If any replayed op fails (new component returns an error), the
  /// error is propagated to the original caller via the op's completion
  /// future. The drain continues with remaining ops — individual op
  /// failures do not abort the evolution.
  pub fn drain_pending_ops(
      pending: &mut PendingOpsPerCpu,
      new_vtable: *const VtableHeader,
      new_ctx: *mut (),
      nr_cpus: usize,
  ) {
      for cpu_id in 0..nr_cpus {
          // SAFETY: After Phase B, all CPUs have committed their pending
          // ops and are no longer writing to their slots. We access each
          // CPU's slot by Vec index (slots[cpu_id]). This is safe because:
          // (1) slots is Vec<PendingOpSlot>, indexed by CPU ID;
          // (2) no CPU is writing to its slot after the IPI barrier in Phase B;
          // (3) the drain thread is the sole reader at this point.
          let slot = &mut pending.slots[cpu_id];
          if slot.has_pending.load(Ordering::Acquire) == 0 {
              continue;
          }
          for op in slot.buf.iter_mut() {
              // Replay: call the new vtable's method with the saved args.
              // SAFETY: new_vtable is the freshly-installed vtable from
              // Phase B step 4. new_ctx is the new component's opaque
              // context pointer (obtained from the new VtableHeader during
              // Phase B). The args slice is truncated to op.arg_len (the
              // actual argument length, not the fixed-size buffer).
              // Cast *const VtableHeader -> *const () for vtable_dispatch.
              let result = unsafe {
                  vtable_dispatch(
                      new_vtable as *const (),
                      op.method_id,
                      &op.args[..op.arg_len as usize],
                      new_ctx,
                  )
              };
              // Convert vtable_dispatch Result to i64 for CompletionToken.
              // vtable_dispatch returns Result<ResultBuffer, KabiError>.
              // The internal dispatch_fn returns i32 status; ResultBuffer
              // is always empty for vtable calls. Map to i64 errno.
              let status: i64 = match result {
                  Ok(_) => 0,
                  Err(e) => e.to_errno() as i64,
              };
              // Deliver result to the original caller's completion future.
              // Null check replaces Option::take() — completion is *mut
              // CompletionToken with null = no waiter (fire-and-forget op).
              if !op.completion.is_null() {
                  // SAFETY: completion was allocated from COMPLETION_TOKEN_CACHE
                  // during Phase A' interception and has not been freed. The drain
                  // thread is the sole consumer of this token.
                  unsafe { (*op.completion).complete(status) };
                  op.completion = core::ptr::null_mut();
              }
          }
          // Clear the slot for future use.
          slot.buf.clear();
          slot.has_pending.store(0, Ordering::Release);
      }
      pending.total_pending.store(0, Ordering::Release);
  }
  ```

  **Replay guarantee**: During Phase C1 (step 9), the new component replays
  PendingOps in per-CPU order (all ops from CPU 0, then all from CPU 1, etc.).
  Within each CPU's op list, order matches the interception order (the per-CPU
  `PendingOpSlot` is FIFO). Cross-CPU op ordering is not preserved because it
  did not exist — the ops were concurrent. This is correct because:
  - Control-plane vtable calls from different CPUs are independent (no
    shared mutable state between concurrent callers at the vtable interface).
  - If two ops from different CPUs conflict (e.g., both configure the same
    device register), the last-writer-wins semantics of the vtable method
    apply — the same semantics as if both calls had executed concurrently
    on the old component without interception.

  **Operation interception mechanism**: At Phase A' entry, a per-component
  `quiescing: AtomicU8` flag is set to `1`. The vtable entry trampoline checks
  this flag before dispatching each call. When `quiescing` is `1`, the trampoline
  appends the operation descriptor (a serialized `PendingOp` containing the method ID
  and argument blob) to the current CPU's `PendingOpSlot` instead of invoking the old
  component. This interception is lock-free and contention-free (each CPU writes only
  to its own per-CPU slot, see `PendingOpsPerCpu` above). The vtable pointer itself is not yet swapped —
  interception happens at the trampoline level, not the pointer level.

  **Queued operation handling**: Operations that arrive during Phase A' are appended
  to `pending_ops` via the interception mechanism above. If `pending_ops` reaches
  capacity (default: 64 entries), the quiescence deadline is extended by up to
  100ms. If the deadline expires and in-flight operations have still not drained,
  the evolution is aborted: `quiescing` is cleared to `0`, the trampoline resumes
  normal dispatch, and the old component resumes without disruption.
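  The interception and overflow-fallback rules above can be sketched as a small
  model. This is an illustrative userspace simplification, not the kernel
  trampoline: a plain `Vec` stands in for the fixed-capacity `ArrayVec` slot,
  and `dispatch_old` stands in for real vtable dispatch.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const SLOT_CAPACITY: usize = 32; // per-CPU PendingOpSlot capacity

/// Simplified per-CPU slot: each entry is (method_id, serialized args).
struct Slot {
    buf: Vec<(u32, Vec<u8>)>,
    has_pending: AtomicU8,
}

/// Trampoline logic: if `quiescing` is set, queue the op on this CPU's slot;
/// on slot overflow (or when not quiescing), execute synchronously against
/// the old component. Returns true if the op was queued for replay.
fn trampoline(
    quiescing: &AtomicU8,
    slot: &mut Slot,
    method_id: u32,
    args: &[u8],
    dispatch_old: &mut dyn FnMut(u32, &[u8]),
) -> bool {
    if quiescing.load(Ordering::Acquire) == 1 && slot.buf.len() < SLOT_CAPACITY {
        slot.buf.push((method_id, args.to_vec()));
        slot.has_pending.store(1, Ordering::Release);
        true
    } else {
        // Not quiescing, or slot full: safe to run against the old component
        // because quiescence has not yet been confirmed.
        dispatch_old(method_id, args);
        false
    }
}

fn main() {
    let quiescing = AtomicU8::new(0);
    let mut slot = Slot { buf: Vec::new(), has_pending: AtomicU8::new(0) };
    let mut old_calls = 0u32;

    // Not quiescing: the op dispatches straight to the old component.
    assert!(!trampoline(&quiescing, &mut slot, 7, &[1], &mut |_, _| old_calls += 1));
    assert_eq!(old_calls, 1);

    // Phase A' entry: quiescing set; ops are queued for replay instead.
    quiescing.store(1, Ordering::Release);
    assert!(trampoline(&quiescing, &mut slot, 8, &[2], &mut |_, _| old_calls += 1));
    assert_eq!((slot.buf.len(), old_calls), (1, 1));
}
```

  The key property of this fallback is that it never blocks the caller on
  quiescence: an op that cannot be queued simply executes against the old
  component, which remains fully valid until Phase B.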

  **State re-export**: After in-flight operations drain, the old component's state
  is re-exported (`export_state()` on the now-quiesced component). This re-export
  does NOT capture `pending_ops` — the queue is transferred separately in Phase B.

Phase B — Atomic swap (stop-the-world, ~1-10 μs):

  **Nucleus primitive** (`evolution_apply()` / `evolution_apply_batch()`):
  The following steps are implemented in the Nucleus code listing for
  `evolution_apply()`:
  4. All CPUs are briefly halted (stop-the-world IPI).
  5. The per-CPU `PendingOpsPerCpu` reference is transferred to the new component.
     The drain thread iterates all CPU slots and collects pending ops into the
     new component's replay queue. This is O(N_cpus) but each slot access is
     cache-local. Operations that arrived between the Phase A' re-export and
     the IPI are captured because the interception trampoline continues
     appending to per-CPU slots until the IPI fires.

  **Orchestration steps** (called by the Evolvable orchestration layer WITHIN the
  STW window, before or after `evolution_apply()` returns). These steps are NOT
  part of the Nucleus primitive — they are Evolvable code that runs while all CPUs
  are halted. For `evolution_apply_batch()`, these steps are interleaved with the
  batch swap sequence. An implementing agent should treat the Nucleus primitive
  (steps 4-8) and these orchestration steps as separate functions called
  within the same STW critical section:

  5a. **ValidatedCap cache invalidation** (conditional): If the evolved service
      uses a different isolation domain ID than the old service (i.e.,
      `new_domain_id != old_domain_id`), all per-CPU CapValidationToken caches
      are purged for the old domain ID. This reuses the same IPI-based cache
      flush protocol defined for driver crash recovery
      ([Section 12.3](12-kabi.md#kabi-bilateral-capability-exchange--capvalidationtoken-invalidation-on-driver-crash)):
      each halted CPU clears matching cache entries before resuming. The typical
      case for live evolution is that the new component inherits the old
      component's DriverDomainId (same KABI service, same isolation boundary),
      so no flush is needed and step 5a is a no-op. Domain ID changes occur
      only when live evolution also changes the component's isolation tier
      (e.g., promoting a Tier 1 service to Tier 0 or vice versa).
      **Domain ID inheritance rule**: By default, live evolution preserves the
      old component's `DriverDomainId`. The evolution orchestration allocates a
      new domain ID only if the `KabiPolicyManifest` or `KabiDriverManifest`
      declares a different tier than the old component. When a new domain ID is
      allocated, the old domain ID is marked stale (generation incremented to
      even) and all CapValidationToken caches referencing the old ID are flushed
      as part of the Phase B IPI.
  5b. **KABI version registry freeze assertion**: Before the IPI fires,
      orchestration asserts that no KABI registry write lock is held on any CPU.
      The KABI version registry (`RcuHashMap<(DriverClass, KabiVersion), VtablePtr>`)
      uses an RCU writer lock for register/unregister operations. A registry
      update concurrent with Phase B would deadlock: the stop-the-world IPI
      blocks RCU grace period advancement (all CPUs are halted, so no CPU can
      pass through a quiescent state), and the registry writer waiting for an
      RCU grace period would never complete. The ordering constraint:
      **no KABI registry write operations may be in progress when Phase B
      begins.** Orchestration enforces this by acquiring the registry's
      write-side mutex during Phase A (before quiescence), holding it through
      Phase B, and releasing it in Phase C after the new component is activated.
      This serializes registry updates with evolution — a concurrent
      `driver_load()` or `driver_unload()` blocks on the mutex until evolution
      completes. The mutex acquisition in Phase A is non-blocking (trylock with
      retry): if the lock is contended, orchestration retries up to 10 times
      with 1ms backoff. If all retries fail, evolution is aborted and the old
      component continues unchanged.
  5c. **Scheduler runqueue transfer** (scheduler evolution only): For each CPU,
      acquire the per-CPU runqueue lock, transfer all runqueue entries (DL, RT,
      and CFS/EEVDF) from `SchedEvolutionSlot` into the new scheduler's
      runqueue data structures, then release the lock. This ensures all tasks
      are continuously present on a runqueue throughout Phase B. See "Full
      runqueue ownership transfer" above for the complete protocol. Adds ~1-5 μs
      per CPU to the stop-the-world window for scheduler evolution only; zero
      cost for non-scheduler evolution.
  5d. **KABI service registry update**: The KABI registry's vtable pointer table
      is updated via RCU swap — new component's vtable pointers atomically replace
      old entries. This ensures that any KABI dispatch initiated after Phase B
      targets the new component. In-flight dispatches that entered before Phase B
      use the old vtable (still valid — old binary is retained until Phase C
      watchdog expires). Registry update adds ~50ns (single `rcu_assign_pointer()`).
  5e. **T0 transport generation increment**: Increment the global T0 vtable
      generation counter (`t0_vtable_generation.fetch_add(1, Release)`). This
      counter serves as a stale-dispatch detector AFTER Phase B completes:
      code paths that cache vtable pointers across multiple calls (bypassing
      the standard `kabi_call_t0!` per-call RCU protection) can compare their
      cached generation against the current value and detect that a swap
      occurred. **Note**: During Phase B (stop-the-world), all CPUs are halted,
      so no CPU can observe the increment until after `release_all_cpus()`.
      The generation check is therefore a post-swap safety net, NOT a
      pre-swap blocking mechanism. For the standard `kabi_call_t0!` path
      (which takes an RCU read lock per call), the generation counter is
      redundant — RCU protection already ensures the vtable remains valid
      for the duration of the call. **T0 callers MUST hold `rcu_read_lock()`
      for the entire vtable dereference + method call** — the old vtable
      memory is freed via `call_rcu()` after Phase C, and without RCU
      read-side protection a caller could dereference freed memory. All T0
      dispatch macros (`kabi_call_t0!`) enforce this by expanding to
      `rcu_read_lock(); dispatch; rcu_read_unlock()`.
      See also [Section 11.2](11-drivers.md#isolation-mechanisms-and-performance-modes).
      T0 callers MUST be non-blocking — any potentially
      blocking Evolvable operation uses T1 ring transport instead.
  6. Old component's vtable pointer replaced with new component's vtable
     (store with `Release` ordering).
  7. Interrupt handlers redirected: the IRQ dispatch table entry for the
     driver's interrupt line(s) is updated to point to the new module's
     handler. Since Phase B runs with all CPUs halted (IPI stop-the-world),
     no interrupt can fire during the swap. After Phase B, the next
     interrupt dispatches to the new handler.
     The `quiescing` flag is cleared and the T0 evolution waitqueue is
     released (blocked callers now enter new code via the updated vtable).
     **Memory ordering for resumed CPUs**: Each CPU's first vtable load after
     the STW IPI return uses `Acquire` ordering, pairing with the `Release`
     store in step 6. This ensures the CPU observes the new vtable pointer
     and all memory writes performed by the new component's `import_state()`
     before dispatching through it.
  8. CPUs released. New component is now active.
  **Tasks blocked in waitqueues inside the old component's code**: These tasks
  are sleeping on a WaitQueue that survives the vtable swap, because WaitQueues
  are data-side structures persisting in Nucleus/Core memory, not code-side.
  When the wait condition is satisfied after Phase B, the wakeup path uses
  the NEW vtable (via the swapped `AtomicPtr`), so the waking task returns
  from sleep and enters the new code seamlessly. No special handling is needed.
  **Tasks actively executing old component code on a CPU** (not sleeping, not in
  a waitqueue — running): the IPI in Phase B step 4 forces these CPUs to enter the
  IPI handler, which sets a per-CPU `evolution_pending` flag. When the task returns
  from the IPI (still on the same stack frame), it checks the flag and re-dispatches
  through the new vtable. The old code page is not freed until the RCU grace period
  ensures no CPU holds a reference.

  **RCU callbacks during STW**: RCU callbacks are not invoked during the
  stop-the-world window (steps 4-8). All CPUs are halted by the IPI, so no
  CPU passes through a quiescent state and no grace period can complete.
  Pending RCU callbacks resume processing after step 8 when CPUs are released
  and normal tick processing restarts.
  Only the pointer swap + queue transfer is stop-the-world. The queue transfer
  (step 5) adds ~100ns to the stop-the-world window. Step 5a adds ~50ns per
  purged cache entry (at most 16 entries per CPU) when a domain ID change
  triggers a flush; zero cost when domain ID is inherited. Step 5c adds ~1-5 μs
  per CPU for scheduler evolution (full runqueue transfer); zero cost otherwise.
  Steps 4-8 are the entirety of the Nucleus evolution primitive.

Phase C — Activation and cleanup:
  Phase C1 — New component activation:
    9. New component drains the `pending_ops` queue before accepting new
       operations. Each pending op is replayed through the new component's vtable.
  Phase C1a — Post-swap integrity audit (scheduler-specific):
    9a. For scheduler evolution only: iterate the task table and verify that
        every task with state TASK_RUNNING is present on exactly one CPU's
        runqueue, and no runqueue contains a task that is not TASK_RUNNING.
        This audit runs after pending ops are drained (step 9) and before
        accepting new scheduling decisions. See "Post-Swap Runqueue Audit"
        below for the full specification.
  Phase C2 — Deferred cleanup (after watchdog window):
    10. The old component is NOT immediately freed. It is frozen (no new calls)
        but its memory is retained for the Post-Swap Watchdog window (5 seconds,
        see below). If the watchdog triggers a revert, the old component is
        reactivated from this frozen state.
    11. After the watchdog window expires without revert, the old component's
        vtable is freed via an RCU grace period (`call_rcu`). This ensures that
        any reader still holding an RCU-protected reference to the old vtable
        (e.g., a T0 call that loaded the pointer before Phase B but has not yet
        dereferenced it) completes before the memory is reclaimed. After the
        grace period, the old component's code pages are unloaded.
  Phase C3 — Old code page cleanup (after the RCU grace period in step 11):
    12. Unmap old code pages from kernel page tables via
        `kernel_unmap_pages(old_code_base, old_code_size)`.
    13. Flush TLB on all CPUs via IPI:
        `flush_tlb_kernel_range(old_code_base, old_code_base + old_code_size)`.
    14. Return physical frames to the buddy allocator.
    This sequence is identical to module unload in Linux. The TLB shootdown
    IPI adds ~10-50 μs but runs in Phase C (non-STW), so it does not affect
    Phase B latency.
  Total disruption: ~1-10 μs (the Phase B stop-the-world window only).

If import_state fails (incompatible version):
  → Abort replacement. Old component continues. No disruption.

If new component crashes after replacement:
  → Crash recovery (Section 11.7). Reload old component with initialize_fresh().
  → Component state lost, but system continues.

Post-Swap Watchdog:

After the atomic swap (Phase B), a 5-second watchdog timer starts. If the new component crashes or triggers a fault within this window, the kernel reverts to the old component using the RETAINED serialized state (from export_state() in Phase A), not initialize_fresh(). This preserves accumulated state (run queue weights, LRU ordering, learned parameters) across a failed swap attempt. Only if the retained state itself is corrupted does the kernel fall back to initialize_fresh().
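The watchdog's revert policy can be captured as a small decision function. This is a hypothetical sketch of the logic described above (names and shape are illustrative; the real orchestration wires this into the crash-recovery path of Section 11.7):

```rust
/// Sketch of the post-swap watchdog's revert policy (illustrative only).
#[derive(Debug, PartialEq)]
enum RevertAction {
    /// Restore the old component and feed it the retained export_state() blob.
    ImportRetainedState,
    /// Retained state failed validation: cold-start the old component.
    InitializeFresh,
    /// Watchdog window expired without fault: keep the new component.
    KeepNewComponent,
}

const WATCHDOG_WINDOW_MS: u64 = 5_000;

/// `fault_at_ms`: time of the new component's first fault relative to the
/// swap, or None if no fault occurred.
fn revert_action(fault_at_ms: Option<u64>, retained_state_valid: bool) -> RevertAction {
    match fault_at_ms {
        Some(t) if t < WATCHDOG_WINDOW_MS => {
            if retained_state_valid {
                RevertAction::ImportRetainedState
            } else {
                RevertAction::InitializeFresh
            }
        }
        // Fault after the window (or no fault at all): the swap itself is
        // considered successful; normal crash recovery applies thereafter.
        _ => RevertAction::KeepNewComponent,
    }
}

fn main() {
    assert_eq!(revert_action(Some(100), true), RevertAction::ImportRetainedState);
    assert_eq!(revert_action(Some(100), false), RevertAction::InitializeFresh);
    assert_eq!(revert_action(None, true), RevertAction::KeepNewComponent);
    assert_eq!(revert_action(Some(6_000), true), RevertAction::KeepNewComponent);
}
```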

13.18.3.1 Post-Evolution Behavioral Health Monitoring

The 5-second crash watchdog detects hard failures (panics, traps, infinite loops). For stateful EvolvableComponent replacements (scheduler, VMM, VFS, TCP stack), subtle behavioral degradation — increased tail latency, elevated error rates, throughput regression — may not manifest as a crash but still indicates a defective evolution. The behavioral health monitoring protocol extends the crash watchdog with a longer soak period that compares post-evolution metrics against a pre-evolution baseline.

Phase assignment: Phase 4 (requires full FMA health scoring from Section 20.1).

1. Baseline capture. Before Phase A begins, the evolution orchestration captures a behavioral baseline from the FMA health scoring subsystem:

```rust
/// Pre-evolution behavioral baseline, captured immediately before Phase A.
///
/// The orchestration samples the target component's FMA health metrics over
/// a 60-second trailing window and records their statistical summary. This
/// baseline is used during the post-evolution soak period to detect
/// behavioral degradation that does not manifest as a crash.
///
/// Stored in the EvolutionOrchestration context alongside the retained
/// old component state. Not serialized — valid only for the duration of
/// this evolution operation.
pub struct EvolutionBaseline {
    /// Per-subsystem metrics snapshot taken immediately before evolution.
    /// Capacity 32 covers all standard FMA metrics for any single component
    /// (typical: 4-8 metrics per component).
    pub metrics: ArrayVec<BaselineMetric, 32>,
    /// Timestamp of baseline capture.
    pub captured_at: Instant,
}

/// One metric's statistical summary from the pre-evolution window.
/// All values in nanoseconds (fixed-point). No FPU dependency — safe for
/// all architectures including PPC32 softfloat and some ARMv7 configs.
pub struct BaselineMetric {
    /// FMA metric ID (matches FmaHealthMetric::metric_id).
    pub metric_id: u32,
    /// Mean value in nanoseconds over the 60-second window before evolution.
    pub mean_ns: u64,
    /// P99 value in nanoseconds over the same window.
    pub p99_ns: u64,
}
```

2. Soak period. After Phase C completes and the crash watchdog is disarmed (5 seconds without hard failure), the behavioral health monitor begins. The soak period is configurable per component:

| Component Class | Default Soak Period | Rationale |
|---|---|---|
| Evolvable services (KABI VFS, net, block) | 60 seconds | Service-level metrics stabilize within seconds; 60s captures steady-state behavior under typical server load. |
| Core components (scheduler, VMM, page reclaim) | 300 seconds | Core component regressions may only manifest under sustained load or rare scheduling scenarios (RT deadline miss, memory pressure). |

During the soak period, the FMA health scoring system samples the evolved component's metrics at 500ms intervals and compares each sample against the baseline.

3. Degradation threshold. Configurable per component via the evolution orchestration configuration:

```rust
/// Per-component behavioral health thresholds for post-evolution monitoring.
///
/// These thresholds define what constitutes "behavioral degradation" after
/// a live evolution. They are intentionally conservative defaults — the
/// operator tunes them based on workload characteristics.
///
/// The thresholds are NOT used for automatic rollback (forward-only
/// semantics). They trigger FMA alerts that inform the operator to prepare
/// a corrective evolution module.
pub struct EvolutionHealthConfig {
    /// Soak period duration after Phase C.
    pub soak_period: Duration,
    /// Maximum acceptable P99 latency increase ratio in per-mille
    /// (e.g., 1500 = 1.5x = 50% increase). Compared against
    /// BaselineMetric::p99_ns for latency-class metrics.
    /// Fixed-point avoids FPU dependency in kernel code.
    pub max_latency_ratio_permille: u32,
    /// Maximum acceptable error rate increase (absolute, per mille).
    /// Compared against BaselineMetric::mean_ns for error-rate-class metrics.
    pub max_error_rate_increase_per_mille: u32,
    /// Number of consecutive degraded checks before alert.
    /// Default: 3 (at 500ms intervals = 1.5 seconds of sustained degradation).
    pub alert_after_consecutive: u32,
}
```

Default configurations:

| Parameter | Evolvable Services | Core Components |
|---|---|---|
| soak_period | 60s | 300s |
| max_latency_ratio_permille | 1500 (1.5x = 50% increase) | 1300 (1.3x = 30% increase) |
| max_error_rate_increase_per_mille | 5 | 2 |
| alert_after_consecutive | 3 | 3 |
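The permille thresholds translate into integer-only comparisons, consistent with the no-FPU constraint above. A sketch of the latency check and the consecutive-check gate (illustrative helper functions, not the FMA implementation):

```rust
/// Returns true if the observed P99 exceeds the baseline by more than the
/// configured per-mille ratio (e.g., 1500 = 1.5x). Integer-only: no FPU use.
fn p99_degraded(baseline_p99_ns: u64, observed_p99_ns: u64, max_ratio_permille: u32) -> bool {
    // observed / baseline > ratio / 1000  <=>  observed * 1000 > baseline * ratio.
    // u128 intermediates avoid overflow for nanosecond-scale values.
    (observed_p99_ns as u128) * 1000 > (baseline_p99_ns as u128) * (max_ratio_permille as u128)
}

/// Sustained-degradation gate: alert only after N consecutive degraded checks
/// (default 3 checks at 500ms intervals = 1.5s of sustained degradation).
fn should_alert(consecutive_degraded: u32, alert_after_consecutive: u32) -> bool {
    consecutive_degraded >= alert_after_consecutive
}

fn main() {
    // Baseline P99 of 1000ns, threshold 1500 permille (1.5x):
    assert!(!p99_degraded(1_000, 1_400, 1500)); // 1.4x: within threshold
    assert!(p99_degraded(1_000, 1_600, 1500));  // 1.6x: degraded
    assert!(!should_alert(2, 3));
    assert!(should_alert(3, 3));
}
```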

4. Alert, not rollback. If degradation exceeds the threshold for alert_after_consecutive consecutive checks (default: 3, at 500ms intervals = 1.5 seconds of sustained degradation), the system emits FmaEvent::EvolutionDegradation with the specific metric ID, baseline value, and observed value. No automatic rollback occurs — forward-only semantics are preserved. The operator is expected to:

  • Inspect the FMA event to identify the degraded metric.
  • Prepare a corrective evolution module (bug fix or configuration adjustment).
  • Apply the corrective module via the standard evolution trigger interface (Section 13.18).

If the soak period expires without sustained degradation, the evolution is considered fully successful. The baseline is discarded and the old component's retained pages are freed.

Observability: Soak state is visible at /ukfs/kernel/evolution/{component}/soak_state with values: inactive (no evolution in progress or soak complete), monitoring (soak period active, checks running), degraded (alert threshold exceeded, FMA event emitted). The baseline metrics and current comparison values are readable at /ukfs/kernel/evolution/{component}/soak_baseline (JSON format, cold path only).
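The three soak_state values form a small state machine. The sketch below models the transitions implied by the text; the state names come from the ukfs specification above, while the event names are assumptions for illustration:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum SoakState {
    Inactive,   // no evolution in progress, or soak complete
    Monitoring, // soak period active, 500ms checks running
    Degraded,   // alert threshold exceeded, FMA event emitted
}

// Hypothetical event names (not from the spec), for illustration only.
#[derive(Clone, Copy)]
enum SoakEvent {
    EvolutionCompleted, // Phase C done, crash watchdog disarmed
    ThresholdExceeded,  // alert_after_consecutive degraded checks observed
    SoakExpired,        // soak period ended without sustained degradation
}

fn next_state(state: SoakState, event: SoakEvent) -> SoakState {
    use {SoakEvent::*, SoakState::*};
    match (state, event) {
        (Inactive, EvolutionCompleted) => Monitoring,
        (Monitoring, ThresholdExceeded) => Degraded,
        (Monitoring, SoakExpired) => Inactive,
        // Degraded is sticky: forward-only semantics mean the operator must
        // apply a corrective evolution (which starts a fresh soak cycle).
        (s, _) => s,
    }
}

fn main() {
    use {SoakEvent::*, SoakState::*};
    assert_eq!(next_state(Inactive, EvolutionCompleted), Monitoring);
    assert_eq!(next_state(Monitoring, SoakExpired), Inactive);
    assert_eq!(next_state(Monitoring, ThresholdExceeded), Degraded);
    assert_eq!(next_state(Degraded, SoakExpired), Degraded); // sticky
}
```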

Post-Swap Runqueue Audit:

After scheduler evolution Phase C1 (pending ops drained) and the DL/RT hot-switch completes (step 9a), the orchestration layer performs a consistency audit before enabling normal scheduling:

/// Post-swap scheduler integrity audit.
///
/// Iterates the task table and all per-CPU runqueues to verify that the
/// scheduler state is consistent after live evolution. This audit runs
/// once per scheduler swap, at Phase C1a, while the orchestration layer
/// still holds the evolution lock (no concurrent evolution can start).
///
/// # Invariants verified
///
/// 1. Every TASK_RUNNING task appears on exactly one CPU's runqueue.
/// 2. No runqueue contains a task that is not TASK_RUNNING.
/// 3. DL task count matches: sum of per-CPU `dl.nr_running` equals the
///    global DL task count from the exported state.
/// 4. RT task count matches: sum of per-CPU `rt.nr_running` equals the
///    global RT task count from the exported state.
///
/// # Error handling
///
/// On audit failure:
/// - Log a kernel warning via FMA with the specific invariant violation
///   (which task, which CPU, expected vs actual state).
/// - Attempt automatic repair: if a TASK_RUNNING task is on zero runqueues,
///   re-enqueue it on its last-known CPU. If a task is on multiple runqueues,
///   remove duplicates (keep the entry on the task's `cpu` field).
/// - If repair succeeds, log the correction and continue.
/// - If repair fails (contradictory state), trigger watchdog rollback to
///   the old scheduler (same as a post-swap crash).
///
/// # Performance
///
/// The audit iterates the task table once (O(N_tasks)) and each runqueue
/// once (O(sum of runqueue lengths)). On a system with 10,000 tasks, this
/// takes ~100 us. The audit runs exactly once per scheduler evolution, not
/// on every tick.
pub fn post_swap_runqueue_audit(
    task_table: &TaskTable,
    evolution_slot: &SchedEvolutionSlot,
    exported_state: &SchedStateExport,
) -> Result<(), RunqueueAuditError> {
    // Phase 1: Build a map of all tasks found on any runqueue.
    // XArray-backed: task PID -> (cpu_id, sched_class) for each runqueue entry.
    let mut rq_membership: XArray<(CpuId, SchedClassId)> = XArray::new();

    // Iterate per-CPU runqueues via the evolution slot's Vec (indexed by CPU ID).
    // This works because SchedEvolutionSlot uses Vec<Option<...>> indexed by CPU ID,
    // not PerCpu<T>. The orchestrating CPU can access all slots.
    for (cpu_id, rq) in evolution_slot.cfs_rqs.iter()
        .enumerate()
        .filter_map(|(i, opt)| opt.as_ref().map(|rq| (i, rq)))
    {
        for task in rq.iter_all_classes() {
            if task.state != TaskState::Running {
                // INV-2 violation: non-RUNNING task on runqueue.
                log_fma_warning!(
                    "post_swap_audit: task {} (state={:?}) on cpu {} runqueue",
                    task.pid, task.state, cpu_id
                );
                return Err(RunqueueAuditError::NonRunningOnQueue {
                    pid: task.pid,
                    cpu: CpuId(cpu_id as u32),
                    state: task.state,
                });
            }
            if let Some(prev) = rq_membership.store(
                task.pid.as_u64(),
                (CpuId(cpu_id as u32), task.sched_class),
            ) {
                // INV-1 violation: task on multiple runqueues.
                log_fma_warning!(
                    "post_swap_audit: task {} on both cpu {} and cpu {}",
                    task.pid, prev.0, cpu_id
                );
                return Err(RunqueueAuditError::DuplicateEnqueue {
                    pid: task.pid,
                    cpu_a: prev.0,
                    cpu_b: CpuId(cpu_id as u32),
                });
            }
        }
    }

    // Phase 2: Check that every TASK_RUNNING task is in the membership map.
    for task in task_table.iter_running() {
        if rq_membership.load(task.pid.as_u64()).is_none() {
            // INV-1 violation: RUNNING task not on any runqueue.
            log_fma_warning!(
                "post_swap_audit: RUNNING task {} not on any runqueue (last cpu={})",
                task.pid, task.cpu
            );
            return Err(RunqueueAuditError::MissingFromQueue {
                pid: task.pid,
                last_cpu: task.cpu,
            });
        }
    }

    // Phase 3: Verify DL and RT counts match exported state.
    // Iterate DL/RT runqueues from evolution_slot (same pattern as Phase 1
    // for CFS — evolution_slot holds per-CPU runqueue snapshots).
    let dl_total: u32 = evolution_slot.dl_rqs.iter()
        .filter_map(|opt| opt.as_ref())
        .map(|rq| rq.nr_running)
        .sum();
    let rt_total: u32 = evolution_slot.rt_rqs.iter()
        .filter_map(|opt| opt.as_ref())
        .map(|rq| rq.nr_running)
        .sum();
    if dl_total != exported_state.dl_task_count {
        log_fma_warning!(
            "post_swap_audit: DL count mismatch: runqueues={}, exported={}",
            dl_total, exported_state.dl_task_count
        );
    }
    if rt_total != exported_state.rt_task_count {
        log_fma_warning!(
            "post_swap_audit: RT count mismatch: runqueues={}, exported={}",
            rt_total, exported_state.rt_task_count
        );
    }

    Ok(())
}

/// Errors detected by the post-swap runqueue audit.
pub enum RunqueueAuditError {
    /// A TASK_RUNNING task was not found on any CPU's runqueue.
    MissingFromQueue { pid: Pid, last_cpu: CpuId },
    /// A task appeared on more than one CPU's runqueue.
    DuplicateEnqueue { pid: Pid, cpu_a: CpuId, cpu_b: CpuId },
    /// A non-RUNNING task was found on a runqueue.
    NonRunningOnQueue { pid: Pid, cpu: CpuId, state: TaskState },
}

The audit is a debug-level safety net, not a performance-critical path. It runs once per scheduler evolution (an event that occurs at most a few times per kernel lifetime). If the audit detects a mismatch, the FMA warning provides precise diagnostic information for root-cause analysis. The automatic repair path (re-enqueue missing tasks, remove duplicates) handles the most likely failure modes — tasks that were mid-migration between CPUs during the DL/RT hot-switch.
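As a rough illustration of the repair contract from the audit's doc comment, the repair decision might look like the following. The types are simplified stand-ins for the kernel's own, and treating a non-RUNNING task on a runqueue as unrepairable is an assumption here — the contract only specifies repairs for the missing and duplicate cases:

```rust
/// Simplified stand-ins for the kernel's audit-error and repair types.
#[derive(Debug, PartialEq)]
pub enum RepairAction {
    Reenqueue { cpu: u32 },   // re-enqueue on last-known CPU
    KeepOnly { cpu: u32 },    // remove duplicate entries, keep this one
    WatchdogRollback,         // contradictory state: roll back to old scheduler
}

pub enum AuditError {
    MissingFromQueue { last_cpu: u32 },
    DuplicateEnqueue { cpu_a: u32, cpu_b: u32 },
    NonRunningOnQueue,
}

/// `task_cpu` is the task's own `cpu` field, which the contract treats
/// as authoritative when resolving duplicates.
pub fn plan_repair(err: &AuditError, task_cpu: u32) -> RepairAction {
    match err {
        AuditError::MissingFromQueue { last_cpu } =>
            RepairAction::Reenqueue { cpu: *last_cpu },
        AuditError::DuplicateEnqueue { cpu_a, cpu_b } => {
            if *cpu_a == task_cpu || *cpu_b == task_cpu {
                RepairAction::KeepOnly { cpu: task_cpu }
            } else {
                // Neither runqueue entry matches the task's `cpu` field:
                // contradictory state, fall back to watchdog rollback.
                RepairAction::WatchdogRollback
            }
        }
        // Assumption: no safe automatic repair for a non-RUNNING task
        // found on a runqueue.
        AuditError::NonRunningOnQueue => RepairAction::WatchdogRollback,
    }
}
```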

Memory During Swap:

The dual-load approach (old + new component coexist during Phase A) requires sufficient memory for both. Typical component state sizes: scheduler ~64KB, page replacement ~128KB, I/O scheduler ~8KB per device. If insufficient memory is available for the new component's state, the swap returns ENOMEM and the old component continues unchanged. Maximum expected dual-load overhead: ~128KB (a second copy of the page-replacement state, the largest replaceable component).

13.18.4 Export Symbol Contract

When a component is live-replaced, other components may depend on its exported symbols (vtable entries, public functions, constants). The following rules govern export compatibility during live evolution:

  1. Compatible exports required. The new version MUST export the same KABI vtable entries with compatible types (same layout, same semantics). If the new version changes an export's signature (different parameter types, different return type, different struct layout), the live evolution is rejected at load time during Phase A. The loader compares vtable sizes and entry signatures before proceeding to state export.

  2. Indirection-based resolution. Export addresses are resolved through the KABI vtable indirection table, not direct pointers. When the new version loads, the vtable pointer is atomically updated during Phase B (step 5). Dependent components never hold raw function pointers to the old version's code -- they dispatch through the vtable pointer, which is updated in the stop-the-world window. This is the same mechanism used for policy module vtable dispatch (Section 19.9).

  3. Removed exports rejected. If the new version removes a vtable entry (reduces vtable_size), the evolution is rejected unless no loaded component references the removed entry. The loader scans the dependency graph during Phase A to verify this. Adding new entries (increasing vtable_size) is always safe -- existing callers never reference entries beyond the size they were compiled against.
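A sketch of the Phase A compatibility check implied by rules 1 and 3, assuming entry signatures are reduced to comparable hashes. `SigHash`, `check_exports`, and the error names are illustrative, not the loader's actual API:

```rust
/// Stand-in for whatever signature encoding the real loader uses.
type SigHash = u64;

#[derive(Debug, PartialEq)]
pub enum ExportCheck {
    Ok,
    SignatureMismatch { entry: usize },
    RemovedEntryReferenced { entry: usize },
}

/// `max_referenced`: highest vtable entry index any loaded component
/// references, if any (from the Phase A dependency-graph scan).
pub fn check_exports(
    old_sigs: &[SigHash],
    new_sigs: &[SigHash],
    max_referenced: Option<usize>,
) -> ExportCheck {
    // Rule 1: entries present in both versions must match exactly.
    let common = old_sigs.len().min(new_sigs.len());
    for i in 0..common {
        if old_sigs[i] != new_sigs[i] {
            return ExportCheck::SignatureMismatch { entry: i };
        }
    }
    // Rule 3: shrinking the vtable is rejected if any loaded component
    // still references a removed entry.
    if new_sigs.len() < old_sigs.len() {
        if let Some(max_ref) = max_referenced {
            if max_ref >= new_sigs.len() {
                return ExportCheck::RemovedEntryReferenced { entry: max_ref };
            }
        }
    }
    // Growing the vtable is always safe: existing callers never reference
    // entries beyond the size they were compiled against.
    ExportCheck::Ok
}
```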

13.18.5 What Can Be Live-Replaced

Reading the "Replaceable?" column: "No" means the component's code and struct layout cannot be hot-swapped for a different implementation. It does NOT mean the data is immutable. Non-replaceable data components (memory allocator data, capability data, CpuFeatureTable) hold values that change continuously or during evolution — only the struct definitions, field layouts, and the code that manipulates them are frozen. Non-replaceable code components (page table hardware ops, KABI trampoline, alt_patch_apply(), syscall entry) are pure verified instructions with no policy decisions — they are small enough to be correct-by-verification.

13.18.5.1 State Size Budget

The following table documents each component's serialized state size for live replacement. These bounds are used by the evolution framework to pre-allocate the state transfer buffer during Phase A (before stop_the_world()), ensuring that state export never fails due to allocation pressure. Components whose state exceeds 256 KB require explicit approval in the evolution manifest.

Evolution manifest: Each evolution payload (ELF image + metadata) includes a manifest header that declares the component's identity, version, state budget, and dependency requirements:

/// Manifest embedded in evolution payload ELF .umka_manifest section.
#[repr(C)]
pub struct EvolutionManifest {
    /// Component name (e.g., "phys_alloc_policy", "eevdf_scheduler").
    pub component_name: [u8; 64],             // 64 bytes (offset 0)
    /// State format version (DATA_FORMAT_EPOCH). Must be monotonically
    /// increasing for successful import_state().
    pub state_version: u64,                   // 8 bytes  (offset 64)
    /// Maximum serialized state size in bytes. The evolution framework
    /// pre-allocates this much memory in Phase A. If > 256 KB, the
    /// administrator must set `evolution.allow_large_state=true`.
    pub max_state_bytes: u32,                 // 4 bytes  (offset 72)
    /// Explicit padding between u32 and u64 (repr(C) alignment).
    pub _pad0: [u8; 4],                       // 4 bytes  (offset 76)
    /// KABI version required by this component.
    pub kabi_version: u64,                    // 8 bytes  (offset 80)
    /// Number of service imports this component requires.
    pub nr_imports: u16,                      // 2 bytes  (offset 88)
    /// Number of service exports this component provides.
    pub nr_exports: u16,                      // 2 bytes  (offset 90)
    /// Explicit trailing padding to u64 struct alignment.
    pub _pad1: [u8; 4],                       // 4 bytes  (offset 92)
    // Total: 64 + 8 + 4 + 4 + 8 + 2 + 2 + 4 = 96 bytes.
}
const_assert!(size_of::<EvolutionManifest>() == 96);
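The admission rules stated in the manifest's field comments can be sketched as follows. The error names are assumptions, as is the choice to accept an equal state_version (same layout, no migration needed):

```rust
/// Illustrative rejection reasons for an evolution manifest.
#[derive(Debug, PartialEq)]
pub enum ManifestReject {
    /// state_version must be monotonically increasing for import_state().
    StateVersionRegressed,
    /// max_state_bytes > 256 KB requires evolution.allow_large_state=true.
    LargeStateNotAllowed,
}

pub fn admit_manifest(
    running_state_version: u64,
    manifest_state_version: u64,
    max_state_bytes: u32,
    allow_large_state: bool,
) -> Result<(), ManifestReject> {
    if manifest_state_version < running_state_version {
        return Err(ManifestReject::StateVersionRegressed);
    }
    if max_state_bytes > 256 * 1024 && !allow_large_state {
        return Err(ManifestReject::LargeStateNotAllowed);
    }
    Ok(())
}
```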
Component Replaceable? State Size Notes
CPU scheduler Yes Per-CPU run queues, CBS servers (~64KB total) Policy module swap (Section 19.9) covers most cases. Hot-path exception: pick_next_task() reads the current SchedPolicy pointer without lock (AtomicPtr::load(Acquire)). Policy swap is visible on the next scheduler tick — no explicit synchronization barrier is needed because the tick handler re-reads the AtomicPtr on every invocation.
Page reclaim (data) No ZoneLru, CgroupLru, LruGeneration, ShadowEntry, LruDrainBuffer — the LRU lists, generation counters, and shadow entries Struct layouts non-replaceable; code operating on these structs is in the Page reclaim (policy) entry. Values change continuously (every reclaim cycle — pages move between generations, shadow entries created/destroyed, drain buffers filled/flushed). Verified (Section 24.4).
Page reclaim (policy) Yes (stateless) No owned state; operates on ZoneLru/CgroupLru fields PageReclaimPolicy trait: anon/file ratio, scan budget, refault threshold, cross-NUMA promotion, cgroup priority, generation advancement. Atomic pointer swap (~1 us). Same pattern as PhysAllocPolicy. See Section 4.4.
I/O scheduler Yes (stateless) Queues owned by BlockDevice; scheduler is stateless pick_next() Atomic pointer swap (~1 us); no queue drain. See Section 16.21
Network qdisc/classifier Yes (stateless) Queued packets + shaping state owned by Qdisc; algorithm is stateless ops Atomic pointer swap (~100ns same-type); no packet loss. See Section 16.21
Memory allocator (data) No PageArray (vmemmap), BuddyFreeList, PcpPagePool — the physical memory map itself Struct layouts non-replaceable; code operating on these structs is in the Memory allocator (policy) entry. Values change continuously (every alloc/free/compact). Verified (Section 24.4).
Memory allocator (policy) Yes (stateless) No owned state; operates on BuddyAllocator fields PhysAllocPolicy trait: block selection, split/merge, NUMA fallback, watermark tuning, compaction trigger. Atomic pointer swap (~1 us). Same pattern as I/O scheduler. See Section 4.2.
Page table manager (hardware ops) No PTE read/write/encode — ~10 arch-specific instructions per architecture Hardware primitives in arch::current::mm. Verified (Section 24.4).
Page table manager (policy) Yes (stateless) No owned state; operates on MmStruct/Vma/Page VmmPolicy trait: fault handler strategy, THP promotion, TLB flush policy, PCID/ASID eviction, readahead. Warm path: VmmPolicy is called on warm paths only — strategy decisions around the fault, not the PTE installation itself. The per-fault indirect call (~3-5 ns on a ~1-2 μs operation = <0.3% overhead) is acceptable. See Section 4.8.
Capability system (data) No CapTable (XArray), CapEntry, generation/permission checks — ~5 instructions per validation Struct layout and validation code non-replaceable. Exception: capability validation primitives (cap_lookup(), cap_generation_valid(), cap_has_rights()) remain Nucleus because they are the security perimeter — a compromised hot-swap of cap_has_rights() would bypass the entire capability model. This is the ONE case where runtime code is Nucleus. Values change continuously (every grant/revoke/delegate). KABI hot path calls only these. Verified (Section 24.4).
Capability system (policy) Yes (stateless) No owned state; operates on CapTable/CapEntry CapPolicy trait: capable(), delegate_check(), revocation_order(), inherit_on_exec(), evaluate_constraints(), syscaps_to_permissions(), LSM hooks, cluster revocation. Atomic pointer swap (~1 us). MonotonicVerifier gate. See Section 9.1.
KABI dispatch trampoline No vtable[method_id](args) — 3-4 instructions Formally verified (Section 24.4). Trivial indirect call.
KABI version registry Yes (RCU update) RcuHashMap<(DriverClass, KabiVersion), VtablePtr> Add/remove KABI versions at runtime. Lock-free reads on dispatch path; writer lock on register/unregister (cold path). New kernel KABI versions are added during live evolution without reboot.
Syscall entry (Layer 1) No Per-arch trap handler + entry function pointer Trap handler code non-replaceable. Entry fn pointer updated when SysAPI layer (Layer 2) is replaced — Nucleus writes the new dispatch address during evolution Phase B. Saves registers, sign-extends nr, calls Layer 2.
SysAPI layer (umka-sysapi) Yes (Section 13.18) Bidirectional dispatch table + cached ABI translation tables (~72KB) Owns the dispatch table (positive=Linux, negative=UmkaOS native). Stateless per-call. Table rebuilt on replacement.
KABI services (VFS, net, block) Yes (Section 13.18) MB-scale (connection tables, caches) Incremental state export; 100ms quiescence; 10s watchdog
DLM (distributed lock manager) Yes (Phase 4+) Lock resource hash table, waiter queues, master node assignments Export: serialize DlmState { lock_resources: XArray<LockResource>, master_map: XArray<NodeId>, waiter_queues }. Quiescence: drain in-flight lock ops (100ms deadline, see Phase A' DLM quiescence). Import: rebuild from exported state, re-query mastery from cluster peers. Cluster-wide coordination required (all nodes must evolve DLM in lockstep or use versioned wire protocol).
Tier 1 drivers Yes (two paths) Driver-internal state Crash path: fault detection + -EIO + device reset + reload (Section 11.9). Graceful path: cooperative drain + export_state + vtable swap, no device reset, no -EIO (see Section 13.18).
CpuFeatureTable (data) No Per-CPU feature sets, universal intersection, errata union Struct layout non-replaceable. Values updated during evolution (page temporarily writable during Phase B; Evolvable's detect_features() re-reads CPU registers, updates errata union, re-freezes). ErrataCaps u64 has ~48 unused bits. Extensible via Data Format Evolution Pattern 1 if exhausted.
CPU adaptation (policy) Yes (Evolvable) No owned state beyond CpuFeatureTable reads detect_features(), errata matching tables, cpu_features_freeze(), alt_patch_all(), algo_dispatch_init_all() — all swappable. New errata and instruction alternatives deployed without reboot. See Section 2.21.
Evolution primitive No Stop-the-world IPI, page remap/copy, vtable pointer atomic swap, CPU release (~2-3 KB) The irreducible mechanism that installs any pre-verified, pre-loaded image. No ELF parsing, no signature verification, no policy decisions. Verified (Section 24.4).
Evolution orchestration Yes (self-evolving) ELF loader, ML-DSA-65 verification, Phase A/A'/B/C sequencing, PendingOpsPerCpu management, symbol resolution, state export/import, DATA_FORMAT_EPOCH (~12-13 KB) New orchestration verified by old orchestration, then installed by Nucleus primitive. Self-upgrade: the first thing evolution can evolve is its own orchestration logic.
alt_patch_apply() No Writes bytes to code + flushes I-cache (~10 instructions per arch) Formally verified. The primitive tool that alt_patch_all() calls.
Slab allocator (data) No SlabCache metadata, depot inventory, per-CPU magazines Struct layouts non-replaceable; code operating on these structs is in the Slab allocator (policy) entry. Values change continuously (every alloc/free). Magazines drain to depot during quiescence. Verified (Section 24.4).
Slab allocator (policy) Yes (Evolvable) No owned state; operates on SlabCache/depot/magazine fields Cache sizing, magazine depth tuning, GC policy. Atomic pointer swap (~1 μs). State export: cache list + depot counts (~1-10 KB).
Workqueue Yes (Evolvable) Active work items, thread pool config, per-CPU worker states (~100 KB) Quiescence: drain all pending work items (bounded by max work duration), block new queue_work() calls. Single-blob export.
IPC dispatch Yes (Evolvable) Per-process fd table references, pipe buffer states, SysV IPC objects Quiescence: drain in-flight IPC operations via refcount. Depends on VFS (pipe fds are file objects). Replace after VFS if both replaced simultaneously.
ACPI/DTB parser Yes (Evolvable) Parsed device tree, ACPI table cache Cold path only — no quiescence needed. Export parsed tree + raw table cache. Rarely replaced (only on firmware update or new errata table).
LSM framework Yes (Evolvable) Per-inode security labels, policy database, audit rules (1-50 MB for SELinux) Quiescence: drain in-flight hook evaluations (bounded, <1 ms). Policy database is chunked export.
FMA/observability Yes (Evolvable) Active fault records, tracepoint registration table, metric counters Quiescence: drain in-flight fma_report() calls. Metric counters are lossy — acceptable to drop observations during swap window.
eBPF verifier + JIT Yes (Evolvable) Loaded program registry (XArray<BpfProg>), map FD table, JIT code pages Quiescence: block new BPF_PROG_LOAD syscalls. In-flight eBPF program execution (attached to hooks) continues on the old JIT code until the next attach-point invocation picks up the new dispatch. Programs+maps survive replacement (data in XArray; JIT pages are not freed until programs are explicitly detached).
KVM (umka-kvm) Yes (Evolvable) Per-VM state (VMCS/VMCB, EPT/NPT, virtual APIC pages), per-vCPU registers Quiescence: kick all vCPUs out of guest mode via IPI (KVM_REQ_VM_DEAD equivalent). Export: per-VM VMCS/VMCB shadow state + virtual device state. Import: reconstruct VMCS/VMCB from exported state, re-enter guest mode. Guest VMs experience a pause (tens of ms) but no crash.
Cgroup subsystem Yes (Evolvable) Cgroup hierarchy (tree of CgroupNode), per-controller state (cpu.weight, memory.max, io.weight), task→cgroup assignments Quiescence: block new cgroup.procs writes and new cgroup mkdir/rmdir. Export: hierarchy tree (parent-first BFS) + per-controller parameters + task assignments. Task→cgroup associations are u64 cgroup IDs (stable across replacement).
Namespace subsystem Yes (Evolvable) Namespace objects (PidNs, NetNs, MountNs, etc.), per-task ns_proxy references Quiescence: block new unshare()/clone(CLONE_NEW*) syscalls. In-flight operations complete via refcount. Export: namespace hierarchy + task associations. Depends on VFS (mount ns), network (net ns), cgroups (cgroup ns).
Crypto API Yes (Evolvable) Algorithm registry (XArray<CryptoAlg>), active CryptoTfm handles Quiescence: block new crypto_alloc_*() calls. In-flight crypto operations complete on existing CryptoTfm instances (refcounted). Export: algorithm registry + priority table. New algorithms automatically available after replacement.
Seccomp-BPF Yes (Evolvable) Per-task seccomp filter chains (BPF programs) No quiescence needed — filter chains are per-task, immutable once attached (seccomp(SECCOMP_FILTER_FLAG_TSYNC) is atomic). Replacement only affects the seccomp infrastructure (syscall interception dispatch); existing filter chains remain valid BPF bytecode interpreted/JIT'd by the eBPF subsystem.
TTY/PTY subsystem Yes (Evolvable) TTY device table, line discipline state, PTY pairs Quiescence: drain in-flight line discipline operations. Export: open TTY list + ldisc state per TTY. Depends on VFS (TTY is a char device file).
Audio subsystem Yes (Evolvable) PCM stream state, mixer state, ALSA control elements Quiescence: drain in-flight PCM period callbacks (bounded by period size). Export: device list + mixer state + active stream parameters. Glitch during swap (~50-100 ms silence).
Display/DRM subsystem Yes (Evolvable) Mode objects (CRTC, encoder, connector), framebuffer metadata, atomic state Quiescence: complete current atomic commit, block new commits. Export: mode object table + active display configuration. Brief visual glitch during swap (one or two lost frames).
Device registry No DeviceNode tree, PropertyTable, BusIdentity, DeviceResources Core data structure read by all driver operations. Struct layout non-replaceable. Values change continuously (device hotplug/unplug). Verified (Section 24.4).
Clock framework Yes (Evolvable) Clock tree (CLK_TREE), rate/parent assignments Cold/warm path. Quiescence: block new clk_set_rate() calls. Export: clock tree topology + current rates. Quick (<10 ms).
Power management Yes (Evolvable) RAPL domains, runtime PM counters, suspend/resume callbacks Quiescence: complete any in-flight suspend/resume transition. Export: per-device PM state + RAPL limits.
NFS client Yes (Evolvable) Mount table, open file state, delegation cache, RPC transport state Quiescence: drain in-flight RPC calls (bounded by RPC timeout). Export: mount list + delegation state + open file handles. NFSv4 leases survive (client ID is stable across replacement).
Process/task management No (data) / Yes (policy) Task structs, PID table, task groups — data is non-replaceable. fork()/exec()/exit() policy logic (resource accounting, inheritance rules) is Evolvable. Task struct layout is Nucleus (formally verified, read by scheduler/capability/signal code). Process lifecycle policy is replaceable.
Signal handling No (data) / Yes (policy) Signal queues, sighand structs — data is non-replaceable. Signal delivery policy (default actions, coredump logic, ptrace interaction) is Evolvable. Signal queue and sigaction table layouts are Nucleus. Delivery/dispatch policy is replaceable.
io_uring Yes (Evolvable) Per-ring SQ/CQ state, registered files/buffers Quiescence: drain in-flight SQEs (wait for all CQEs). Block new io_uring_enter() calls. Export: ring metadata + registered buffers. Applications see -EAGAIN during brief swap window.
Netfilter/conntrack Yes (Evolvable) Connection tracking table, NAT mappings, rule chains Quiescence: drain in-flight packet hook evaluations. Export: conntrack table (hash entries) + rule chains. Active connections survive (conntrack entries transferred).
NAPI polling No Per-NIC NAPI struct, poll lists, budget accounting Integral to NIC driver operation. NAPI state is per-device, non-replaceable (replaced only when the owning NIC driver is replaced). Budget tuning is part of the network stack policy.
Accelerator framework Yes (Evolvable) AccelScheduler state, device context table, fence pool Quiescence: drain in-flight GPU/NPU commands (wait for all fences to signal). Export: scheduler priority queues + context table + fence state. Device commands see brief pause.
DSM (distributed shared memory) Yes (Phase 4+) Region table, coherence directory, subscriber lists Cluster-wide coordination required (all nodes must agree on protocol version). Export: region metadata + directory state. Active regions see brief pause during swap.
RDMA transport Yes (Phase 4+) QP table, MR (memory region) registrations, completion queues Quiescence: drain in-flight RDMA operations (wait for all CQEs). Export: QP state + MR table. Cluster-wide versioned wire protocol required.
ML policy framework (data) No KernelParamStore (2048-slot param array), ObservationRing per-CPU ring buffers, POLICY_HANDLERS table Struct layouts non-replaceable; param_read()/param_write() code and observation emission logic are in the ML policy framework (orchestration) entry. Values change continuously (every ML parameter update, every observation emission). Param access on hot path is a single AtomicI64::load(Relaxed) on Nucleus data (~1 cycle).
ML policy framework (orchestration) Yes (Evolvable) PolicyServiceRegistration table, MlPolicyCss per-cgroup overrides, decay enforcement timer, rate limiter state Quiescence: block new ML_POLICY_REGISTER ioctls and PolicyUpdateMsg ring submissions. Drain in-flight parameter updates (bounded by rate limiter — max 1000 msg/s, drain completes in <10 ms). Per-cgroup MlPolicyCss overrides survive replacement (cgroup state is kernel-owned). Export: registered service table + active decay timers + per-cgroup override snapshots (~10-50 KB). Import: rebuild PolicyServiceRegistration entries, reconnect Tier 2 services (services detect replacement via generation counter in the shared-memory param store header and re-register).

Generation counter systems reconciliation: UmkaOS uses three distinct generation counter mechanisms, each serving a different purpose. They are orthogonal -- no cross-system comparison is ever performed. Each counter is independently sufficient for its subsystem:

Counter Type Scope Comparison Purpose
t0_vtable_generation u64 monotonic Global (one per kernel) new > old Live evolution: detect stale vtable pointers after Phase B swap. T0 callers load the generation before dispatch and compare after return; a mismatch means the vtable was swapped mid-call and the caller must retry via the evolution waitqueue.
DriverDomain::generation u64 monotonic Per-driver-domain current != cached Crash recovery: detect domain reset after driver reload. Callers (e.g., dispatch_xmit()) cache the generation at binding time; a mismatch on the hot path means the driver crashed and was replaced, and the cached vtable/ring is stale. Returns IoError::NODEV.
KernelParamStoreShadow::generation u64 monotonic Per-parameter-store new > cached ML policy: detect stale parameter snapshots. Tier 2 policy services snapshot the generation when reading parameters; on the next read cycle, a generation mismatch triggers a full re-read of changed parameters. Services detect policy framework replacement via this counter and re-register.

T0 generation serialization: The global t0_vtable_generation counter serializes all T0 evolution operations. This is acceptable because T0 evolution is rare (minutes to hours between events). Per-component generation would add complexity without measurable benefit.

Stateful policy modules in Tier 0: Stateful policy modules in the T0 domain are dispatched through the same t0_vtable_generation check as all T0 calls. The generation is checked on every dispatch; a mismatch causes the caller to retry via the evolution waitqueue.
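The load-before/compare-after pattern for t0_vtable_generation can be sketched in miniature. The waitqueue retry itself is elided; `t0_dispatch` and `DispatchOutcome` are illustrative names:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Global monotonic generation, bumped during Phase B vtable swap.
pub static T0_VTABLE_GENERATION: AtomicU64 = AtomicU64::new(0);

#[derive(Debug, PartialEq)]
pub enum DispatchOutcome<T> {
    Done(T),
    /// Vtable swapped mid-call; caller retries via the evolution waitqueue.
    RetryViaWaitqueue,
}

/// Snapshot the generation before dispatch and compare after return.
/// A mismatch means Phase B happened underneath the call.
pub fn t0_dispatch<T>(call: impl FnOnce() -> T) -> DispatchOutcome<T> {
    let before = T0_VTABLE_GENERATION.load(Ordering::Acquire);
    let result = call();
    let after = T0_VTABLE_GENERATION.load(Ordering::Acquire);
    if before == after {
        DispatchOutcome::Done(result)
    } else {
        DispatchOutcome::RetryViaWaitqueue
    }
}
```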

state_version is per-component: EvolutionManifest.state_version is per-component (tracks that component's state layout version). It is NOT a global epoch. Two components can have different state_version values. The global ordering is provided by t0_vtable_generation.

Ring drain EAGAIN handling: When a Tier 1 ring producer receives EAGAIN during evolution drain (ring consumer is draining), the producer retries with exponential backoff (1 us, 2 us, 4 us, ..., 1 ms max). After 100 ms total, the producer returns EBUSY to its caller.
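The stated backoff schedule (1 us doubling to a 1 ms cap, 100 ms total budget, then EBUSY) works out as follows in a minimal sketch that computes the delay sequence rather than sleeping:

```rust
/// Returns the sequence of backoff delays (in microseconds) a ring
/// producer would sleep through before giving up with EBUSY, assuming
/// every retry sees EAGAIN. Sketch only — the real producer sleeps
/// between retries and stops as soon as the ring accepts the entry.
pub fn eagain_backoff_schedule() -> Vec<u64> {
    let mut delays = Vec::new();
    let mut delay_us: u64 = 1;
    let mut total_us: u64 = 0;
    while total_us < 100_000 {                // 100 ms total budget
        delays.push(delay_us);
        total_us += delay_us;
        delay_us = (delay_us * 2).min(1_000); // exponential, capped at 1 ms
    }
    delays // exhausted: producer returns EBUSY to its caller
}
```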

Data/policy state spill: The physical allocator, page reclaim, VMM, and capability system all follow the same state-spill pattern as the I/O scheduler (Section 16.21) and qdisc subsystem (Section 16.21). All mutable state (free lists, watermarks, Page descriptors, PcpPagePool, LRU generation lists, shadow entries, VMA trees, page tables, CapTable, CapEntry) is owned by non-replaceable data structures. The replaceable policy traits (PhysAllocPolicy, PageReclaimPolicy, VmmPolicy, CapPolicy) are stateless algorithm dispatchers. Replacement swaps only the vtable pointer (~1 μs stop-the-world). No state export/import, no quiescence, no queue drain — because there is no policy-owned state to migrate.
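A miniature of this state-spill pattern, with a stand-in policy vtable: all mutable state (here, the free-list counts) is owned by the caller, the policy is a stateless function table behind an AtomicPtr, and replacement is a single Release store. `PolicyVtable`, `pick_order`, and the two toy policies are illustrative; the real traits are PhysAllocPolicy and friends:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

/// Stateless policy: pure functions over owner-held state.
pub struct PolicyVtable {
    /// Pick a block order to allocate from, given owner-held free counts.
    pub pick_order: fn(free_per_order: &[u64]) -> usize,
}

fn first_fit(free: &[u64]) -> usize {
    free.iter().position(|&n| n > 0).unwrap_or(0)
}
fn largest_fit(free: &[u64]) -> usize {
    free.iter().rposition(|&n| n > 0).unwrap_or(0)
}

static FIRST: PolicyVtable = PolicyVtable { pick_order: first_fit };
static LARGEST: PolicyVtable = PolicyVtable { pick_order: largest_fit };

/// The active policy: one pointer, no policy-owned state to migrate.
pub static ACTIVE: AtomicPtr<PolicyVtable> =
    AtomicPtr::new(&FIRST as *const PolicyVtable as *mut PolicyVtable);

pub fn dispatch(free: &[u64]) -> usize {
    // Hot path: one atomic load, then an indirect call. No lock.
    let vt = unsafe { &*ACTIVE.load(Ordering::Acquire) };
    (vt.pick_order)(free)
}

pub fn swap_policy(new: &'static PolicyVtable) {
    // The whole "replacement": a single pointer store. No state
    // export/import, no quiescence, no queue drain.
    ACTIVE.store(new as *const PolicyVtable as *mut PolicyVtable, Ordering::Release);
}
```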

13.18.5.2 Nucleus/Evolvable Classification Checklist

When adding a new subsystem or component, use these questions to determine whether it belongs in Nucleus (non-replaceable, verified) or Evolvable (replaceable):

  1. Data or policy? Data structures that multiple subsystems read/write → Nucleus candidate. Algorithm/decision logic that can be swapped without affecting other subsystems → Evolvable.
  2. Can the component be replaced without losing in-flight operations? If yes → Evolvable. If replacement requires draining all callers and no safe quiescence point exists → Nucleus.
  3. Does correctness require formal verification? If the component is small enough to verify and a bug would corrupt fundamental kernel invariants → Nucleus.
  4. Does a bug cause silent data corruption or a recoverable error? Silent corruption (wrong page mapped, capability check bypassed) → Nucleus. Recoverable error (suboptimal scheduling, wrong reclaim ratio) → Evolvable.
  5. Can Extension Array / DATA_FORMAT_EPOCH handle future field additions? If the struct needs to be extended over a 50-year lifetime and Extension Array suffices → safe for Nucleus (the data layout is stable; new fields are appended).

Default: Evolvable unless classification proves otherwise. The verified Nucleus must remain minimal (~25-35 KB). Every component placed in Nucleus increases the formal verification burden and reduces live-evolution flexibility.
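The first four checklist questions can be read as a toy decision procedure. This is an illustration only — real classification requires the judgment the checklist encodes, including question 5's Extension Array analysis:

```rust
/// Answers to checklist questions 1-4 (simplified to booleans).
pub struct Answers {
    pub shared_data_structure: bool,        // Q1: data read/written by many subsystems
    pub safe_quiescence_exists: bool,       // Q2: replaceable without losing in-flight ops
    pub needs_formal_verification: bool,    // Q3: small + invariant-critical
    pub bug_causes_silent_corruption: bool, // Q4: vs. recoverable error
}

#[derive(Debug, PartialEq)]
pub enum Class { Nucleus, Evolvable }

pub fn classify(a: &Answers) -> Class {
    if a.shared_data_structure
        || !a.safe_quiescence_exists
        || a.needs_formal_verification
        || a.bug_causes_silent_corruption
    {
        Class::Nucleus
    } else {
        Class::Evolvable // the default unless classification proves otherwise
    }
}
```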

13.18.5.3 Stateless Policy Swap Rollback Protocol

The 10 stateless policy traits (PhysAllocPolicy, PageReclaimPolicy, VmmPolicy, CapPolicy, IoSchedOps, QdiscOps, CongestionOps, TierPolicy, NetClassPolicy, SlabAllocPolicy) use AtomicPtr swap with no Phase A/A'/B/C lifecycle. Without quiescence and state export, there is no built-in rollback mechanism. The following lightweight watchdog protocol ensures that a buggy policy replacement is detected and reverted automatically:

/// Per-policy-point rollback state. Allocated by orchestration before
/// the AtomicPtr swap and retained for the watchdog window duration.
/// One instance exists per active stateless policy swap (at most 10
/// concurrent swaps — one per policy point).
pub struct StatelessPolicyWatchdog {
    /// Old vtable pointer, retained for rollback.
    /// SAFETY: `old_vtable` points into memory backed by `old_pages`.
    /// Valid as long as `old_pages` is retained (i.e., for the lifetime
    /// of this `StatelessPolicyWatchdog` instance). Must not be
    /// dereferenced after `old_pages` is freed.
    old_vtable: *const PolicyVtableHeader,
    /// Old module's memory pages, retained until watchdog expiry.
    /// ArrayVec because policy modules are small (<64 KB typically,
    /// well under 16 pages of code).
    old_pages: ArrayVec<PhysPage, 16>,
    /// Watchdog expiry timestamp (monotonic ns).
    /// Default: 5 seconds after swap.
    deadline_ns: u64,
    /// Health check interval (monotonic ns). Default: 500ms.
    check_interval_ns: u64,
    /// Pointer to the AtomicPtr that holds the active vtable.
    vtable_slot: &'static AtomicPtr<PolicyVtableHeader>,
    /// Pre-swap baseline: average call latency (ns) over the last
    /// 1000 calls, captured before the swap.
    baseline_latency_ns: u64,
    /// Pre-swap baseline: error count per 1000 calls.
    baseline_error_rate: u32,
    /// Consecutive anomaly count (reset to 0 on each clean check).
    anomaly_count: u32,
}
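The baseline and anomaly fields above drive the per-check decision in step 4 of the protocol below. A minimal userspace sketch of that decision logic — the `AnomalyCheck` struct and `Verdict` enum are illustrative stand-ins, not the kernel types:

```rust
/// Illustrative stand-in for the step-4 anomaly decision (not kernel code).
/// Thresholds per the protocol: error rate above max(3x baseline, 50 per
/// mille), latency above baseline + 10 us, faults weighted 3x. Two
/// accumulated anomalies trigger revert; a clean check resets the counter.
#[derive(Debug, PartialEq)]
enum Verdict {
    Healthy,
    Revert,
}

struct AnomalyCheck {
    baseline_latency_ns: u64,
    baseline_error_rate: u32, // errors per 1000 calls, captured pre-swap
    anomaly_count: u32,
}

impl AnomalyCheck {
    fn health_check(&mut self, err_per_mille: u32, avg_latency_ns: u64, faulted: bool) -> Verdict {
        let mut hits: u32 = 0;
        // b. error-rate check: exceeds max(baseline * 3, 50 per mille)
        if err_per_mille > self.baseline_error_rate.saturating_mul(3).max(50) {
            hits += 1;
        }
        // c. latency check: exceeds baseline by 10 us absolute
        if avg_latency_ns > self.baseline_latency_ns + 10_000 {
            hits += 1;
        }
        // d. fault check: a domain fault or panic weighs 3 (immediate revert)
        if faulted {
            hits += 3;
        }
        // e. a clean check resets the consecutive-anomaly counter
        if hits == 0 {
            self.anomaly_count = 0;
        } else {
            self.anomaly_count += hits;
        }
        // f. two accumulated anomalies trigger automatic revert
        if self.anomaly_count >= 2 {
            Verdict::Revert
        } else {
            Verdict::Healthy
        }
    }
}
```

Because a clean check resets the counter, the `>= 2` threshold fires only on anomalies in consecutive checks, a single check that trips both the error-rate and latency tests, or a fault.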

Protocol:

  1. Before swap (orchestration, Phase A equivalent for stateless swaps):
     a. Load and verify the new policy module (signature, manifest, KABI version).
     b. Capture baseline health metrics from the old policy's FMA counters: baseline_latency_ns and baseline_error_rate.
     c. Allocate StatelessPolicyWatchdog, retain old vtable pointer.

  2. AtomicPtr swap: vtable_slot.store(new_vtable, Ordering::Release). In-flight callers on other CPUs see either old or new vtable atomically (no torn reads — pointer-sized atomic on all supported architectures).

  3. Watchdog activation: Register a periodic timer callback (hrtimer, period = check_interval_ns = 500ms by default). The timer runs for deadline_ns / check_interval_ns = 10 checks over 5 seconds.

  4. Each health check (timer callback context, runs on arbitrary CPU):
     a. Read the policy module's FMA health counters: current error rate and average call latency since the swap.
     b. Error rate check: if current error rate exceeds max(baseline_error_rate * 3, 50 per mille), increment anomaly_count.
     c. Latency check: if average call latency exceeds baseline_latency_ns + 10_000ns (10 μs absolute increase), increment anomaly_count.
     d. Fault check: if the policy module triggered a domain fault or panic since the last check, increment anomaly_count by 3 (immediate revert).
     e. If none of the above: reset anomaly_count to 0 (transient spike cleared).
     f. If anomaly_count >= 2: trigger automatic revert (step 5).

  5. Automatic revert (when anomaly threshold exceeded):
     a. vtable_slot.store(old_vtable, Ordering::Release).
     b. Emit FMA event: HealthEventClass::PolicyAutoRevert with anomaly details.
     c. Unload the new module's pages.
     d. Log: "stateless policy {name}: auto-reverted after {anomaly_count} anomalies within {elapsed}ms of swap".
     e. The old policy resumes transparently — in-flight callers that loaded the new vtable before the revert may complete one more call on the new code, but the next call on every CPU will dispatch to the old vtable. This is safe because stateless policies have no mutable state — there is no inconsistency window.

  6. Watchdog expiry (no revert triggered):
     a. Cancel the periodic timer.
     b. Wait for an RCU grace period to ensure no CPU is mid-call through the old vtable.
     c. Invoke all registered PostSwapNotify callbacks (step 7 below).
     d. Free the old module's pages and StatelessPolicyWatchdog struct.
     e. The swap is considered successful.

  7. Post-swap notification: After the AtomicPtr swap and RCU grace period (ensuring all CPUs have observed the new vtable), invoke the post-swap callback list. Callbacks perform subsystem-specific reconciliation that the new policy may require:

/// Post-swap notification callback signature.
///
/// Called after a stateless policy swap has completed and an RCU grace period
/// has elapsed. The callback receives the policy point identifier so it can
/// determine which policy was swapped.
pub type PostSwapNotifyFn = fn(policy: PolicyPointId);

/// Maximum registered post-swap callbacks across all policy points.
/// Bounded because each policy point has at most 2-3 downstream
/// subsystems that need notification (e.g., PhysAllocPolicy → buddy
/// allocator watermarks + compaction thresholds).
const MAX_POST_SWAP_CALLBACKS: usize = 16;

/// Global post-swap notification registry. Static array — no heap
/// allocation. Populated at boot by subsystem init functions.
/// Protected by a SpinLock (cold path only — registration at boot,
/// invocation at policy swap time which is at most once per evolution event).
pub static POST_SWAP_NOTIFY: SpinLock<ArrayVec<PostSwapEntry, MAX_POST_SWAP_CALLBACKS>>
    = SpinLock::new(ArrayVec::new_const());

/// One entry in the post-swap notification registry.
pub struct PostSwapEntry {
    /// Which policy point this callback cares about. The callback is
    /// only invoked when the matching policy is swapped.
    pub policy: PolicyPointId,
    /// The notification function.
    pub callback: PostSwapNotifyFn,
    /// Debug name for FMA logging (e.g., "buddy_recalc_watermarks").
    pub name: &'static str,
}

/// Policy point identifiers (one per stateless policy trait).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum PolicyPointId {
    PhysAllocPolicy   = 0,
    PageReclaimPolicy = 1,
    VmmPolicy         = 2,
    CapPolicy         = 3,
    IoSchedOps        = 4,
    QdiscOps          = 5,
    CongestionOps     = 6,
    /// Memory tiering policy — controls page placement across NUMA tiers
    /// (CXL, HBM, DRAM, PMEM). Maps to `PolicyType::MemoryTiering` (5)
    /// in the KABI manifest.
    TierPolicy        = 7,
    /// Network classification policy — packet prioritization, QoS marking.
    /// Maps to `PolicyType::NetClassifier` (4) in the KABI manifest.
    NetClassPolicy    = 8,
    /// Slab allocator policy — NUMA node selection, slab growth, GC eligibility.
    /// The hot-path magazine pop/push is NOT dispatched through this trait.
    SlabAllocPolicy   = 9,
}

/// Register a post-swap callback. Called during subsystem init (boot time).
/// Panics if MAX_POST_SWAP_CALLBACKS is exceeded (static configuration error).
pub fn register_post_swap_notify(entry: PostSwapEntry) {
    let mut list = POST_SWAP_NOTIFY.lock();
    assert!(
        !list.is_full(),
        "post_swap_notify: too many callbacks (max {})",
        MAX_POST_SWAP_CALLBACKS
    );
    list.push(entry);
}

/// Invoke all callbacks registered for the given policy point.
/// Called by the stateless swap protocol after AtomicPtr swap + RCU grace period.
/// Runs on the orchestration CPU, not in interrupt context.
fn invoke_post_swap_notify(policy: PolicyPointId) {
    let list = POST_SWAP_NOTIFY.lock();
    for entry in list.iter() {
        if entry.policy == policy {
            (entry.callback)(policy);
        }
    }
}
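A usage sketch of the registry above, with `std::sync::Mutex` and `Vec` standing in for the kernel's `SpinLock` and `ArrayVec`, and with hypothetical callback names:

```rust
use std::sync::Mutex;

// Simplified stand-ins: only two policy points shown.
#[derive(Clone, Copy, PartialEq, Debug)]
enum PolicyPointId {
    PhysAllocPolicy,
    IoSchedOps,
}

type PostSwapNotifyFn = fn(PolicyPointId);

struct PostSwapEntry {
    policy: PolicyPointId,
    callback: PostSwapNotifyFn,
    name: &'static str,
}

static POST_SWAP_NOTIFY: Mutex<Vec<PostSwapEntry>> = Mutex::new(Vec::new());
// Records which callbacks actually ran (demo instrumentation only).
static FIRED: Mutex<Vec<&'static str>> = Mutex::new(Vec::new());

// Hypothetical subsystem reconciliation hooks.
fn recalc_watermarks(_p: PolicyPointId) {
    FIRED.lock().unwrap().push("buddy_recalc_watermarks");
}
fn rebalance_queues(_p: PolicyPointId) {
    FIRED.lock().unwrap().push("blk_rebalance_queues");
}

fn register_post_swap_notify(entry: PostSwapEntry) {
    POST_SWAP_NOTIFY.lock().unwrap().push(entry);
}

/// Invoke only the callbacks whose `policy` matches the swapped point.
fn invoke_post_swap_notify(policy: PolicyPointId) {
    for entry in POST_SWAP_NOTIFY.lock().unwrap().iter() {
        if entry.policy == policy {
            (entry.callback)(policy);
        }
    }
}
```

Dispatch filters on `policy`, so a PhysAllocPolicy swap never fires the I/O-scheduler hook even though both live in the same registry.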

Registered callbacks (populated at boot by each subsystem's init function):

| Policy Point | Callback | Purpose |
|---|---|---|
| PhysAllocPolicy | buddy_allocator::recalc_watermarks | Recompute zone watermarks (min/low/high) using the new policy's watermark_scale() method. Without this, stale watermarks could cause premature OOM or excessive page reclaim. |
| PhysAllocPolicy | compaction::recalc_thresholds | Update compaction trigger thresholds based on the new policy's fragmentation parameters. |
| PageReclaimPolicy | kswapd::recalc_scan_balance | Recalculate anon/file scan balance ratios for each memory zone. |
| VmmPolicy | vmm::flush_policy_caches | Invalidate any cached VMA merge decisions that may differ under the new policy. |
| IoSchedOps | blk::rebalance_queues | Rebalance per-CPU I/O queue depths if the new scheduler has different concurrency expectations. |
| CapPolicy | cap::rebuild_delegation_cache | Invalidate cached delegation check results. The new CapPolicy may have different constraint propagation rules, so stale cache entries could grant or deny capabilities incorrectly. |
| QdiscOps | net::qdisc_reset_backlogs | Reset per-interface qdisc backlog counters and recompute rate shaping parameters. The new qdisc may have different queue depth or scheduling properties. |
| CongestionOps | tcp::reset_cong_monitoring | Reset per-socket monitoring counters (retransmit count, ECN-CE count, loss event count) used by the watchdog anomaly detector to establish a fresh baseline for the new algorithm. Does NOT reset per-connection congestion state (cwnd, ssthresh, RTT estimates) — existing connections continue using the old vtable until close (see replacement mechanism table above), so their congestion state must remain consistent with the algorithm that set it. New connections pick up the swapped vtable and start with the new algorithm's initial values naturally. |
| TierPolicy | numa::recalc_tier_weights | Recalculate memory tier placement weights and migration thresholds for NUMA/CXL/HBM tiers based on the new policy's tier affinity parameters. |

Observability: Watchdog state is exposed at /ukfs/kernel/policy_modules/{name}/watchdog_state with values: inactive (no swap in progress), monitoring (watchdog active, checks running), reverted (auto-revert triggered). FMA events are emitted for both successful swaps (after watchdog expiry) and reverts (on anomaly detection).
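Both the initial swap (step 2) and the automatic revert (step 5a) are single pointer-sized atomic stores. A minimal sketch of why that is sufficient for stateless policies — the `PolicyVtable` type and the weight functions here are illustrative, not kernel code:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Illustrative stateless policy vtable: one pure function, no state.
struct PolicyVtable {
    weight: fn(u32) -> u32,
}

fn old_weight(x: u32) -> u32 { x }       // old policy
fn new_weight(x: u32) -> u32 { x * 2 }   // replacement policy

static OLD: PolicyVtable = PolicyVtable { weight: old_weight };
static NEW: PolicyVtable = PolicyVtable { weight: new_weight };

/// Caller-side dispatch: one atomic load of the vtable pointer, then call.
/// A caller sees either the old or the new vtable — never a torn mix —
/// because the pointer itself is the unit of atomicity.
fn dispatch(slot: &AtomicPtr<PolicyVtable>, x: u32) -> u32 {
    // SAFETY: the slot always holds a pointer to a 'static vtable.
    let vt = unsafe { &*slot.load(Ordering::Acquire) };
    (vt.weight)(x)
}
```

Because each call loads the pointer exactly once, every caller runs entirely on one policy version; with no mutable policy state, old-version and new-version answers can interleave freely across the swap and the revert.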

Cross-reference: The safe-kernel-extensibility module lifecycle step 3.e (Section 19.9) references this watchdog protocol for stateless policy swaps.

Console subsystem exclusion: The console/serial output path is explicitly not an EvolvableComponent. The console is a Tier 0 component that must remain functional during the entire evolution flow — including during Phase B (stop-the-world) and during panic handlers invoked mid-evolution. Making the console replaceable would create a circular dependency: the evolution framework needs console output for diagnostics, but the console itself would be in an indeterminate state during its own replacement. The console driver's code is small (< 500 lines), architecture-specific, and correct-by-verification. If a console driver update is needed, it is applied via full kernel image update (reboot), not live evolution.

The non-replaceable components (listed above) are Priority 1 verification targets (Section 24.4) — verification is the sole defence against defects in code that cannot be live-fixed. All nine — memory allocator data (PageArray, BuddyFreeList, PcpPagePool), page reclaim data (ZoneLru, CgroupLru, LruGeneration, ShadowEntry), page table hardware ops (arch::current::mm), capability data (CapTable, CapEntry, generation/permission checks), evolution primitive (IPI + remap + vtable swap), CpuFeatureTable data, alt_patch_apply(), syscall entry, and KABI dispatch trampoline — must be verified before Phase 2 exit. (The KABI dispatch trampoline is trivial — 3-4 instructions — but formally verified for completeness. The evolution primitive is ~2-3 KB of straight-line code with no loops or branches beyond error checks — a tractable Verus target.) The replaceable policy traits (PhysAllocPolicy, PageReclaimPolicy, VmmPolicy, CapPolicy) are verified against their dispatch correctness but are live-evolvable — a MonotonicVerifier gate ensures policy swaps only tighten security, never open holes. The evolution orchestration (~12-13 KB) is live-evolvable — self-upgrade is the first thing evolution can do. Together: verified core + evolvable policy + evolvable orchestration.

13.18.6 Performance Impact

Steady-state: zero. Between replacements, code paths are identical to a monolithic kernel. The EvolvableComponent trait adds no runtime code — it's a development contract.

During replacement: ~1-10 μs stop-the-world. Happens at most once per kernel update. Amortized over months of uptime: unmeasurable.

13.18.7 Evolution Framework Data/Code Split

The evolution framework follows the same data/policy split pattern as the memory allocator (PhysAllocPolicy), page reclaim (PageReclaimPolicy), VMM (VmmPolicy), and capability system (CapPolicy). The irreducible Nucleus primitive (~2-3 KB) performs only the atomic swap mechanism. The Evolvable orchestration (~12-13 KB) handles everything else — and is itself live-replaceable.

13.18.7.1 Nucleus: Evolution Primitive (~2-3 KB, verified)

The evolution primitive is the only code that cannot be live-fixed. It is minimal, straight-line, and formally verified:

Type cross-reference: Stateless policy vtables use PolicyVtableHeader (defined in Section 19.9) which contains only vtable_size and kabi_version — no runtime state fields, because stateless policies have no mutable state to track. The VtableHeader below is for stateful components that participate in the full Phase A/A'/B/C lifecycle.

// umka-core/src/evolution/primitive.rs — Nucleus (non-replaceable)

/// Common header for all stateful component vtables managed by the evolution
/// framework. Every vtable pointer stored in an `AtomicPtr<VtableHeader>` slot
/// begins with this layout. The evolution primitive reads and writes these
/// fields during stop-the-world swap; orchestration reads them during
/// pre-swap validation (Phase A).
///
/// # Layout
///
/// This struct is `#[repr(C)]` to guarantee field order across compiler
/// versions and across live-replacement boundaries (old and new Evolvable
/// images must agree on the header layout). Fields are ordered to avoid
/// padding on all supported architectures (all fields are naturally aligned).
///
/// # Relationship to `PolicyVtableHeader`
///
/// `PolicyVtableHeader` (see safe-kernel-extensibility) is the minimal
/// header for **stateless** policy vtables — it contains only `vtable_size`
/// and `kabi_version`. `VtableHeader` extends that concept for **stateful**
/// components: it includes runtime coordination fields (`quiescing`,
/// `pending_ops_ptr`, `pending_ops_total`) and state migration function
/// pointers (`state_version`, `export_state`, `import_state`) that the
/// evolution primitive calls during batch validation and swap.
// kernel-internal, not KABI
#[repr(C)]
pub struct VtableHeader {
    // ── Identity and bounds ──────────────────────────────────────────

    /// Byte size of the complete vtable struct (including this header).
    /// Used for bounds safety: the kernel reads only the first
    /// `min(vtable_size, KERNEL_EXPECTED_SIZE)` bytes. Methods beyond
    /// `vtable_size` are treated as absent (tombstoned).
    pub vtable_size: u64,

    /// Primary KABI version discriminant: `KabiVersion::as_u64()`.
    /// Orchestration checks this against the kernel's expected KABI
    /// version range before loading the component.
    pub kabi_version: u64,

    /// KABI service class identifier. Identifies which subsystem service
    /// this vtable implements (e.g., BLOCK_DRIVER, NET_DRIVER, SCHEDULER).
    /// Used by `verify_kabi_cross_compat()` to walk the dependency DAG
    /// and check that all dependent modules remain compatible with the
    /// new vtable version being swapped in. Registered in the KABI
    /// dependency registry at module load time.
    pub service_class: u32,

    /// Padding for alignment (service_class is u32, next field is u64).
    pub _sc_pad: u32,

    /// State format version. Monotonically increasing (u64 — no wrap
    /// within operational lifetime; see performance budget §Counter
    /// Longevity). Orchestration enforces `new.state_version >
    /// old.state_version` before swap. The evolution primitive reads
    /// this field during `is_compatible_batch()` validation.
    ///
    /// **Invariant (INV-5)**: `state_version` MUST be strictly monotonically
    /// increasing across successive replacements of the same component.
    /// The evolution framework rejects any replacement where
    /// `new.state_version <= old.state_version`. This prevents accidental
    /// downgrades and ensures state migration chains are well-ordered.
    /// The field is set by the component author at compile time (embedded
    /// in the vtable constant); it is never mutated at runtime. Violation
    /// of this invariant causes `EvolutionCompatError::VersionNotMonotonic`.
    pub state_version: u64,

    // ── Runtime coordination (written by primitive during swap) ──────

    /// Quiescence flag. Set to `true` by orchestration during Phase A
    /// to stop new calls from being dispatched through this vtable.
    /// Cleared by the evolution primitive in Phase B after the atomic
    /// swap completes. The vtable dispatch trampoline checks this flag
    /// before every call.
    ///
    /// **Quiescence model**: When `quiescing == 1`, new calls are enqueued
    /// to the per-CPU `PendingOpSlot` instead of being dispatched (stateful
    /// Evolvable components — Phase A' interception). The caller's thread
    /// is parked on a per-op completion token until Phase C replays the op
    /// through the new component's vtable.
    ///
    /// For ring-based Tier 1 dispatch (cross-domain), the ring consumer
    /// loop checks quiescing and returns `-EAGAIN` to the ring producer,
    /// which retries with exponential backoff. This is the separate ring
    /// quiescence mechanism, not the VtableHeader trampoline path.
    ///
    /// For the scheduler specifically, `pick_next_task` interception returns
    /// "keep running current task" as a sentinel value (no enqueue, no block).
    /// AtomicU8 instead of AtomicBool for cross-compiler-version safety
    /// (same rationale as `dry_run: u32` below — VtableHeader crosses live-
    /// replacement boundaries where AtomicBool layout is not guaranteed).
    pub quiescing: AtomicU8, // 0 = normal, 1 = quiescing

    /// Explicit padding between AtomicU8 `quiescing` and AtomicPtr
    /// `pending_ops_ptr` (repr(C) alignment: pointer requires 8-byte
    /// alignment on 64-bit targets). Initialized to zero. This padding
    /// crosses live-replacement boundaries and MUST NOT leak uninitialized
    /// kernel memory.
    pub _quiesce_pad: [u8; 7],

    /// Pointer to the per-CPU pending-ops structure. During Phase B,
    /// the evolution primitive transfers the old component's pending
    /// ops to the new component by storing the `PendingOpsPerCpu`
    /// pointer into this field on the new vtable. This is an O(1)
    /// pointer copy — the underlying per-CPU slot array is shared.
    pub pending_ops_ptr: AtomicPtr<PendingOpsPerCpu>,

    /// Total pending operation count across all CPUs. Written by the
    /// evolution primitive during batch swap to transfer the aggregate
    /// count from the old component's `PendingOpsPerCpu::total_pending`.
    pub pending_ops_total: AtomicU64,

    // ── State migration function pointers ────────────────────────────
    //
    // These are C-ABI function pointers, not Rust trait methods, because
    // the caller (Nucleus primitive or Evolvable orchestration) and the
    // callee (the component being swapped) may be compiled from different
    // Rust toolchain versions during live replacement. The `extern "C"`
    // ABI ensures call-site compatibility across compiler versions.

    /// Return this component's state format version.
    ///
    /// Called by `is_compatible_batch()` during Phase A to validate
    /// version monotonicity (`new > old`). Must be a pure read with
    /// no side effects — it may be called multiple times.
    pub state_version_fn: unsafe extern "C" fn(this: *const VtableHeader) -> u64,

    /// Export the component's current state as a serialized `ComponentState`.
    ///
    /// Called during Phase A on the **old** component while it is quiesced
    /// (no in-flight operations). The returned `ComponentState` is passed
    /// to the new component's `import_state_fn`. The caller owns the
    /// returned allocation and is responsible for freeing it after import
    /// (or after rollback, whichever comes first).
    ///
    /// Returns a pointer to a heap-allocated `ComponentState` on success,
    /// or null on failure (with error details written to the FMA log).
    pub export_state_fn: unsafe extern "C" fn(
        this: *const VtableHeader,
    ) -> *mut ComponentState,

    /// Import serialized state from a previous component version.
    ///
    /// Called during Phase A on the **new** component. When `dry_run` is
    /// `true`, the function validates that it *can* import the state
    /// (version compatibility, schema checks) without actually mutating
    /// anything — this is the pre-swap validation used by
    /// `is_compatible_batch()`. When `dry_run` is `false`, the function
    /// performs the actual state import.
    ///
    /// Returns 0 on success, or a negative errno on failure.
    pub import_state_fn: unsafe extern "C" fn(
        this: *mut VtableHeader,
        state: *const ComponentState,
        dry_run: u32, // 0 = live, 1 = dry_run. u32 instead of bool for cross-compiler-version safety.
    ) -> i32,
}
// Layout (64-bit): 8+8+4+4+8+1+7+8+8+8+8+8 = 80 bytes.
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<VtableHeader>() == 80);
// Layout (32-bit): vtable_size(u64=8) + kabi_version(u64=8) + service_class(u32=4) +
//   _sc_pad(u32=4) + state_version(u64=8) + quiescing(u8=1) + _quiesce_pad([u8;7]=7)
//   + pending_ops_ptr(AtomicPtr=4) → offset 44. AtomicU64 requires 8-byte alignment
//   on most 32-bit targets (ARMv7 LDREXD/STREXD, PPC32 ldarx) → 4 bytes implicit
//   padding → pending_ops_total(AtomicU64=8) at offset 48 + state_version_fn(fn=4) +
//   export_state_fn(fn=4) + import_state_fn(fn=4) → offset 68. Struct align 8 (u64) →
//   4 bytes trailing → 72 bytes. Kernel-internal struct: the implicit padding between
//   pending_ops_ptr and pending_ops_total is acceptable because this struct never
//   crosses a KABI or wire boundary — it exists only within a single compilation unit.
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<VtableHeader>() == 72);

impl VtableHeader {
    /// Read the state format version from the header's function pointer.
    ///
    /// # Safety
    ///
    /// `self` must point to a valid, fully initialized `VtableHeader` in
    /// memory that has not been freed. The `state_version_fn` pointer must
    /// be valid for the lifetime of the call.
    pub unsafe fn state_version(&self) -> u64 {
        (self.state_version_fn)(self as *const Self)
    }

    /// Export state via the header's function pointer.
    ///
    /// # Safety
    ///
    /// `self` must point to a valid, quiesced component. The caller must
    /// free the returned `ComponentState` after use.
    pub unsafe fn export_state(&self) -> *mut ComponentState {
        (self.export_state_fn)(self as *const Self)
    }

    /// Import state via the header's function pointer.
    ///
    /// # Safety
    ///
    /// `self` must point to a valid `VtableHeader` for the new component.
    /// `state` must point to a valid `ComponentState`. When `dry_run` is
    /// `false`, this mutates the component's internal state — the caller
    /// must ensure no concurrent access.
    /// NOTE: `&mut self` is required for the live import path (DryRun::No)
    /// which writes state into the new component's fields. The dry-run path
    /// (DryRun::Yes) MUST NOT mutate despite having `&mut` access — it only
    /// validates compatibility. A separate `validate_state_fn` taking `*const`
    /// would be cleaner but adds 8 bytes to VtableHeader. The `dry_run` parameter
    /// encodes the contract: callers pass DryRun::Yes to signal read-only semantics.
    ///
    /// **Optional detection**: The evolution framework may compute CRC32C of the
    /// new component's mutable state before and after a dry-run call. A mismatch
    /// indicates a contract violation — the new component is rejected with
    /// `MigrationError::DryRunMutated`. This check is debug-only (disabled in
    /// release builds for performance) and adds ~2-5 us per import_state call.
    pub unsafe fn import_state(
        &mut self,
        state: *const ComponentState,
        dry_run: DryRun,
    ) -> Result<(), MigrationError> {
        // extern "C" ABI: pass u32, not bool. Rust's `bool` has a validity
        // invariant (must be 0 or 1); `u32` is safe across compiler versions
        // and across the live-replacement boundary.
        let dry_run_u32: u32 = if matches!(dry_run, DryRun::Yes) { 1u32 } else { 0u32 };
        let rc = (self.import_state_fn)(
            self as *mut Self,
            state,
            dry_run_u32,
        );
        if rc == 0 {
            Ok(())
        } else {
            Err(MigrationError::ImportFailed { errno: rc })
        }
    }
}
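A component-side sketch of how an `import_state_fn` implementation honors the dry-run contract — the `DemoComponent`/`DemoState` types, the simplified signature (no `VtableHeader` indirection), and the `-22` error value (standing in for -EINVAL) are all illustrative:

```rust
// Hypothetical component; the real callback takes *mut VtableHeader and
// *const ComponentState. Layout is repr(C) as state crosses the KABI-style
// migration boundary.
#[repr(C)]
struct DemoState {
    state_version: u64,
    counter: u64,
}

struct DemoComponent {
    counter: u64,
    min_importable_version: u64,
}

/// dry_run == 1: validate only — version/schema checks, no mutation.
/// dry_run == 0: perform the live import.
unsafe extern "C" fn demo_import_state(
    this: *mut DemoComponent,
    state: *const DemoState,
    dry_run: u32,
) -> i32 {
    // SAFETY: caller guarantees valid, non-aliased pointers for the call.
    let comp = unsafe { &mut *this };
    let state = unsafe { &*state };
    // Validation runs in BOTH modes: refuse formats too old to migrate.
    if state.state_version < comp.min_importable_version {
        return -22; // stand-in for -EINVAL
    }
    if dry_run == 1 {
        return 0; // compatible; MUST NOT mutate (dry-run contract)
    }
    comp.counter = state.counter; // live import
    0
}
```

The version gate runs in both modes, so a dry-run rejection during Phase A validation aborts the replacement before any stop-the-world window is entered.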

/// Global evolution serialization mutex. ALL evolution operations system-wide
/// (single-component, batch, and whole-Evolvable replacements) must hold this
/// mutex. This prevents concurrent evolution attempts from interfering with
/// each other — only one evolution can be in progress at any time.
///
/// The mutex is acquired by the evolution orchestration at the beginning of
/// Phase A (before `export_state()`) and held through Phase C (post-swap
/// activation and watchdog arming). It is released after the post-swap watchdog
/// is disarmed (or after rollback completes, if the watchdog fires).
///
/// The `stop_the_world()` function in `evolution_apply()` documents that the
/// caller must not hold any spinlocks "except the evolution mutex" — this is
/// that mutex. It is a sleeping mutex (not a spinlock) because Phase A may
/// take milliseconds (state export).
///
/// Lock ordering: `EVOLUTION_MUTEX` is at the top of the lock hierarchy
/// (level 0). No lock may be held when acquiring `EVOLUTION_MUTEX`. The
/// KABI registry write-side mutex (acquired in Phase A, line 636 above)
/// is acquired AFTER `EVOLUTION_MUTEX`, establishing the ordering:
/// `EVOLUTION_MUTEX` → KABI registry mutex → per-CPU runqueue locks.
pub static EVOLUTION_MUTEX: Mutex<()> = Mutex::new(());

/// The irreducible evolution mechanism. Installs a pre-verified, pre-loaded
/// component image by atomically swapping the active vtable pointer.
///
/// Preconditions (enforced by Evolvable orchestration before calling):
///   - `verified_image` has been loaded into memory and relocated
///   - Signature has been verified (ML-DSA-65 at runtime, LMS at boot)
///   - State has been exported from old component and imported into new
///   - `quiescing` flag has been set and in-flight ops have drained
///   - `new_vtable` points to the new component's vtable in `verified_image`
///   - `pending_ops` ring has been prepared for transfer
///
/// This function performs ONLY:
///   1. Stop-the-world IPI (all CPUs halt at known safe point)
///   2. Transfer `pending_ops` ring pointer to new component (O(1))
///   3. Remap pages if needed (or validate flat image already mapped)
///   4. Atomic write: `*target_slot = new_vtable`
///   5. Clear `quiescing` flag
///   6. Release CPUs (IPI acknowledge)
///
/// No ELF parsing. No signature verification. No symbol resolution.
/// No state export/import. No version checks. No error recovery beyond
/// "return error and leave old component active."
///
/// # Safety
///
/// Caller must ensure all preconditions are met. The primitive trusts that
/// Evolvable orchestration has validated the image. This trust is bootstrapped:
/// at boot, `lms_verify_shake256()` validates the initial Evolvable image
/// against the LMS public key baked into Nucleus (Phase 0.8,
/// [Section 2.21](02-boot-hardware.md#kernel-image-structure--phase-08-evolvable-boot-loading-protocol)).
/// Subsequent Evolvable replacements are verified by the currently-running
/// (trusted) Evolvable orchestration before calling this primitive.
pub unsafe fn evolution_apply(
    target_slot: &AtomicPtr<VtableHeader>,
    new_vtable: *mut VtableHeader,
    pending_ops: &PendingOpsPerCpu,
    image_pages: &[PhysPage],
) -> Result<(), EvolutionPrimitiveError> {
    // Step 1: Stop all CPUs via IPI
    let ipi_token = stop_the_world()?;

    // Step 2: Transfer per-CPU pending ops (O(N_cpu) slot pointer copy)
    // Each CPU's PendingOpSlot is transferred to the new component's
    // PendingOpsPerCpu — the slot array backing storage is shared via
    // Vec<PendingOpSlot>, so the transfer copies the Vec pointer
    // and total_pending counter atomically while all CPUs are stopped.
    // SAFETY: pending_ops is a &'static PendingOpsPerCpu allocated at
    // module registration time. The *const -> *mut cast is safe because:
    // (1) PendingOpsPerCpu uses interior mutability (AtomicU32, AtomicU8
    //     on all mutable fields) — all mutations go through atomic
    //     operations that take &self, not &mut self.
    // (2) All CPUs are halted (Phase B invariant), so no concurrent
    //     access exists during this store.
    // (3) The pointer remains valid for 'static lifetime (the backing
    //     storage outlives any component replacement cycle).
    (*new_vtable).pending_ops_ptr.store(
        pending_ops as *const PendingOpsPerCpu as *mut PendingOpsPerCpu,
        Ordering::Release,
    );
    // Transfer the pending_ops_total counter to match the batch path.
    // Without this, Phase C1 replay would see total_pending == 0 on the
    // new vtable and skip replay even though ops were enqueued.
    (*new_vtable).pending_ops_total.store(
        pending_ops.total_pending.load(Ordering::Acquire),
        Ordering::Release,
    );

    // Step 3: Ensure image pages are mapped (remap if loaded at new PA)
    for page in image_pages {
        // SAFETY: image_pages are pre-validated by orchestration.
        // Remapping is idempotent — if already mapped, this is a no-op.
        // CRITICAL: Do NOT use `?` here — if ensure_mapped fails, we must
        // release all CPUs before returning, or the system hangs permanently.
        // The batch path (evolution_apply_batch) handles this correctly;
        // this single-component path must match.
        if let Err(_) = arch::current::mm::ensure_mapped(page.phys, page.virt, page.flags) {
            release_all_cpus(ipi_token);
            return Err(EvolutionPrimitiveError::RemapFailed);
        }
    }

    // Step 4: Atomic vtable pointer swap
    // This is the point of no return for the swap itself.
    // After this store, all CPUs will dispatch to the new component.
    target_slot.store(new_vtable, Ordering::Release);

    // Step 5: Clear quiescing flag.
    // No explicit fence needed: all CPUs are halted during Phase B
    // (stop-the-world), so no CPU can observe the stores until after
    // release_all_cpus(). The Release store on quiescing provides
    // the ordering guarantee when CPUs resume.
    (*new_vtable).quiescing.store(0, Ordering::Release);

    // Step 6: Release CPUs
    release_all_cpus(ipi_token);

    Ok(())
}
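The step-3 error path above encodes a liveness rule: once `stop_the_world()` succeeds, every exit — success or error — must pass through `release_all_cpus()`, which is why the remap loop avoids `?`. A contrived single-threaded sketch of that control-flow shape (names and the token type are illustrative):

```rust
// Contrived sketch; real stop_the_world()/release_all_cpus() are IPI-based.
// The point: no early return may skip release_all_cpus().
struct IpiToken {
    released: bool,
}

fn stop_the_world() -> IpiToken {
    IpiToken { released: false } // all CPUs halted
}

fn release_all_cpus(token: &mut IpiToken) {
    token.released = true; // CPUs resume
}

/// Returns (cpus_released, result). The error arm must not use `?`:
/// returning with CPUs still halted would hang the machine permanently.
fn apply_sketch(remap_ok: bool) -> (bool, Result<(), &'static str>) {
    let mut token = stop_the_world();
    if !remap_ok {
        release_all_cpus(&mut token); // release BEFORE the error return
        return (token.released, Err("RemapFailed"));
    }
    // ... pending-ops transfer, vtable swap, clear quiescing ...
    release_all_cpus(&mut token);
    (token.released, Ok(()))
}
```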

/// Multi-slot variant: atomically swap multiple components in a single
/// stop-the-world window. Used for whole-Evolvable replacement.
///
/// Each entry in `swaps` describes one component to swap. All swaps happen
/// within a single IPI hold, ensuring no CPU ever observes a mix of old and
/// new components. The swap order within the IPI is arbitrary (all CPUs are
/// halted), but the dependency DAG is enforced by the export/import ordering
/// in Phase A (see "Full Evolvable replacement dependency ordering" below).
///
/// # Partial Failure Semantics: Forward-Only
///
/// `evolution_apply_batch()` uses **forward-only** semantics. If a swap
/// fails mid-batch (e.g., `ensure_mapped()` fails for a page in swap N),
/// the already-swapped slots (0..N-1) remain in their new state. The
/// failing slot and all remaining slots are NOT swapped. This is safe
/// because:
///
/// 1. **Pre-evolution compatibility check**: Before `evolution_apply_batch()`
///    is called, orchestration validates that ALL new component versions are
///    compatible with each other via `is_compatible_batch()`. This check runs
///    in Phase A (before stop-the-world) and verifies:
///    - Each new component's `state_version()` is strictly greater than the
///      old component's version (INV-5).
///    - Each new component's `import_state()` accepts the exported state from
///      the old component (dry-run import with `DryRun::Yes`).
///    - Cross-component KABI version compatibility: if component A exports
///      KABI version V1 and component B imports V1, the new versions of both
///      A and B must agree on V1's wire format. Checked via
///      `KabiVersion::is_compatible_with()`.
///    If any pre-check fails, the entire batch is aborted before entering
///    the stop-the-world window. No slots are swapped.
///
/// 2. **Forward-only justification**: Once inside the stop-the-world window,
///    rolling back already-swapped slots would require re-swapping them to
///    their old vtables — but the old vtable's `quiescing` flag has been
///    cleared and pending_ops transferred. Attempting a reverse swap in the
///    same IPI window adds complexity and cannot restore pending_ops state.
///    Instead, the watchdog mechanism handles recovery: if the partial batch
///    leaves the system in an inconsistent state, the post-swap watchdog
///    fires and reverts ALL slots (including the successfully swapped ones)
///    using the retained old vtables stored in `EvolutionWatchdog::batch_entries`.
///
/// 3. **Reporting**: On partial failure, the function returns
///    `EvolutionPrimitiveError::PartialBatchFailure { completed, failed_index, cause }`.
///    Orchestration logs the partial result via FMA and arms the watchdog.
///    The watchdog timeout gives the partially-swapped system a chance to
///    stabilize (the successfully swapped components may function correctly
///    even without the remaining swaps). If the system is healthy after the
///    watchdog window, orchestration retries the failed and remaining slots
///    in a new batch. If the system is unhealthy, the watchdog reverts all.
///
/// 4. **Partial failure watchdog arming**: When a batch evolution partially
///    fails (some components swapped, others not), the orchestration arms
///    the watchdog timer for the ENTIRE batch (all entries, including
///    successfully swapped ones) with a 30-second timeout (default,
///    configurable via `evolution.partial_failure_watchdog_ms`). This is
///    longer than the normal 5-second post-swap watchdog because the
///    partially-swapped system needs time to stabilize. During this
///    window:
///    - If any component (successfully swapped or still on its old
///      version) exhibits
///      issues (crash, fault, or behavioral degradation detected by FMA),
///      the watchdog fires and reverts the entire batch.
///    - If no issues are detected after the 30-second window, orchestration
///      disarms the watchdog and retries the failed components in a new
///      batch operation.
///    - The `EvolutionWatchdog::batch_entries` array stores the OLD vtable
///      pointers for ALL entries in the original batch, enabling a full
///      revert even if only some were swapped.
///
/// # Safety
///
/// Same preconditions as `evolution_apply`, applied to each swap entry.
/// Additionally: no two entries may reference the same `target_slot`.
///
/// # Multi-Component Dependency Ordering
///
/// When a full kernel image swap replaces multiple interdependent components
/// (e.g., scheduler + VFS + block layer), the `swaps` array MUST be ordered
/// by dependency depth (leaves first, roots last):
///
/// 1. **Leaf components** (no dependencies on other swapped components):
///    swapped first. Example: congestion control, I/O scheduler, page policy.
/// 2. **Intermediate components** (depend only on already-swapped leaves):
///    swapped next. Example: block layer (depends on I/O scheduler).
/// 3. **Root components** (depend on intermediates): swapped last.
///    Example: VFS (depends on block layer), scheduler (depends on
///    per-cgroup bandwidth which may be swapped).
///
/// This ordering ensures that when a component's `import_state()` runs,
/// all components it depends on are already at their new versions. The
/// pre-check (`is_compatible_batch()`) validates this ordering: if
/// component B depends on component A, A must appear before B in the
/// `swaps` array. Violation returns `EvolutionPrimitiveError::DependencyOrder`.
///
/// The dependency graph is extracted from KABI service declarations: if
/// component B's vtable calls methods on component A's vtable, B depends
/// on A. Circular dependencies are forbidden (enforced at build time by
/// the KABI IDL compiler).
pub unsafe fn evolution_apply_batch(
    swaps: &[EvolutionSwapEntry],
) -> Result<(), EvolutionPrimitiveError> {
    // Validate: no duplicate target_slots
    for i in 0..swaps.len() {
        for j in (i + 1)..swaps.len() {
            if core::ptr::eq(swaps[i].target_slot, swaps[j].target_slot) {
                return Err(EvolutionPrimitiveError::DuplicateSlot);
            }
        }
    }

    // Step 1: Stop all CPUs via IPI
    let ipi_token = stop_the_world()?;

    // Step 2: For each swap, perform steps 2-5 of evolution_apply
    let mut completed: usize = 0;
    for (idx, swap) in swaps.iter().enumerate() {
        // Transfer pending_ops per-CPU state: both the pointer (shared
        // Vec<PendingOpSlot> backing storage) and the aggregate counter.
        if !swap.pending_ops.is_null() {
            // SAFETY: pending_ops was validated non-null by the check above.
            // The pointer was set by evolution_prepare() from a &'static
            // PendingOpsPerCpu, so it remains valid for 'static lifetime.
            let pending = unsafe { &*swap.pending_ops };
            (*swap.new_vtable).pending_ops_ptr.store(
                pending as *const PendingOpsPerCpu as *mut PendingOpsPerCpu,
                Ordering::Release,
            );
            (*swap.new_vtable).pending_ops_total.store(
                pending.total_pending.load(Ordering::Acquire),
                Ordering::Release,
            );
        }

        // Ensure image pages are mapped — may fail (forward-only on failure)
        for page in swap.image_pages {
            if let Err(_) = arch::current::mm::ensure_mapped(page.phys, page.virt, page.flags) {
                // Release CPUs before returning the partial failure
                release_all_cpus(ipi_token);
                return Err(EvolutionPrimitiveError::PartialBatchFailure {
                    completed,
                    failed_index: idx,
                    cause: EvolutionPrimitiveCause::RemapFailed,
                });
            }
        }

        // Atomic vtable pointer swap
        swap.target_slot.store(swap.new_vtable, Ordering::Release);

        // Memory barrier between vtable swap and quiescing clear.
        // SeqCst fence is used (rather than a lighter Release fence) because
        // it orders two stores to DIFFERENT locations: (1) the vtable pointer
        // swap above (Release store to target_slot) and (2) the quiescing
        // flag clear below (Release store to quiescing). Without SeqCst,
        // on weakly-ordered architectures, a CPU could observe quiescing == 0
        // (cleared) before observing the new vtable pointer — causing it to
        // dispatch through the old (freed) vtable. SeqCst provides a total
        // store order guarantee: all CPUs observe the vtable pointer update
        // BEFORE observing quiescing == 0. This matches the acquire-side
        // pattern in kabi_call! which loads quiescing (Acquire) and then
        // loads the vtable pointer (Acquire).
        core::sync::atomic::fence(Ordering::SeqCst);
        (*swap.new_vtable).quiescing.store(0, Ordering::Release);
        completed += 1;
    }

    // Step 3: Release all CPUs — all swaps are now visible
    release_all_cpus(ipi_token);

    Ok(())
}

Phase B duration bound for batch swap:

Each component's Phase B work is a single atomic pointer store (target_slot.store(new_vtable, Release)) plus a SeqCst fence and a quiescing-flag clear — total cost per component: ≤ 1 μs. A batch of N components therefore takes ≤ N × 1 μs for the swap loop itself. The IPI round-trip (stop_the_world() + release_all_cpus()) adds a fixed overhead of ~10-50 μs depending on CPU count and interconnect latency (cross-NUMA IPI on large servers: ~50 μs; single-socket: ~10 μs).

Total Phase B for a full batch (11 components, worst case): ≤ 11 μs (swap loop) + 50 μs (IPI) = 61 μs. RT and DL tasks are delayed by this amount, which is within the acceptable RT latency budget: POSIX real-time scheduling guarantees tolerate microsecond-scale IPI delays (the same stop_the_world() primitive is used by TLB shootdown, which occurs regularly under normal operation). The per-component swap cost is O(1) — no iteration over data structures, no allocation, and no lock acquisition beyond the IPI serialization.
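The budget above can be restated as a tiny arithmetic helper (a sketch; the constants are the worst-case figures quoted in this section, and the function name is illustrative, not a Nucleus API):

```rust
/// Worst-case Phase B wall-clock bound in microseconds (illustrative).
const SWAP_US_PER_COMPONENT: u64 = 1; // one pointer store + fence + flag clear
const IPI_ROUND_TRIP_WORST_US: u64 = 50; // cross-NUMA stop/release cost

fn phase_b_bound_us(n_components: u64) -> u64 {
    n_components * SWAP_US_PER_COMPONENT + IPI_ROUND_TRIP_WORST_US
}

fn main() {
    // Full-Evolvable replacement: 11 components => 61 us worst case.
    assert_eq!(phase_b_bound_us(11), 61);
}
```

Sizing against the worst-case (cross-NUMA) IPI figure keeps the bound conservative for single-socket machines.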

/// Pre-evolution batch compatibility check. Runs in Phase A (before
/// stop-the-world) to validate that all components in the batch are
/// mutually compatible. If this check passes, `evolution_apply_batch()`
/// is expected to succeed without partial failure (the only remaining
/// failure mode is `ensure_mapped()` which indicates a severe memory
/// subsystem error).
///
/// # Checks performed
///
/// 1. State version monotonicity: new.state_version() > old.state_version()
///    for every component in the batch.
/// 2. State import dry-run: new.import_state(exported_state, DryRun::Yes)
///    succeeds for every component. The `exported_states` parameter contains
///    the already-exported state from Phase A step 2 (`export_state()` on
///    the quiesced old component). This function does NOT call `export_state()`
///    itself — calling `export_state()` on a live (non-quiesced) component
///    would race with concurrent modifications and produce an inconsistent
///    snapshot.
/// 3. KABI cross-compatibility: for every (exporter, importer) pair in the
///    batch's dependency DAG, the new exporter's KABI version is compatible
///    with the new importer's expected KABI version.
///
/// # Calling convention
///
/// Orchestration calls `is_compatible_batch()` AFTER the Phase A re-export
/// (`export_state()` run under quiescence) has completed, passing the exported
/// state buffers. This ensures the dry-run import validates against a
/// consistent state snapshot produced under quiescence, not a live moving
/// target from the concurrent Phase A export.
///
/// Returns Ok(()) if all checks pass, or Err with the first incompatibility.
pub fn is_compatible_batch(
    swaps: &[EvolutionSwapEntry],
    old_vtables: &[*const VtableHeader],
    exported_states: &[*const ComponentState],
) -> Result<(), EvolutionCompatError> {
    // The three parameters are parallel arrays; a length mismatch is an
    // orchestration bug (the zip below would silently truncate the check).
    debug_assert_eq!(swaps.len(), old_vtables.len());
    debug_assert_eq!(swaps.len(), exported_states.len());
    for (idx, ((swap, &old), &state)) in swaps.iter()
        .zip(old_vtables.iter())
        .zip(exported_states.iter())
        .enumerate()
    {
        // Version monotonicity
        let new_ver = unsafe { (*swap.new_vtable).state_version() };
        let old_ver = unsafe { (*old).state_version() };
        if new_ver <= old_ver {
            return Err(EvolutionCompatError::VersionNotMonotonic {
                index: idx, old_ver, new_ver,
            });
        }
        // Dry-run import using the already-exported state (read-only check).
        // Does NOT call export_state() — that was done in Phase A step 2.
        if let Err(e) = unsafe { (*swap.new_vtable).import_state(state, DryRun::Yes) } {
            return Err(EvolutionCompatError::ImportIncompatible {
                index: idx, cause: e,
            });
        }
    }
    // KABI cross-compatibility (dependency DAG walk). The `?` relies on a
    // From<Vec<KabiIncompatibility>> impl to lift the incompatibility list
    // into EvolutionCompatError.
    verify_kabi_cross_compat(swaps)?;
    Ok(())
}

/// Walk the KABI dependency DAG and verify that all dependent modules
/// remain compatible with the new vtable versions being swapped in.
///
/// # Algorithm
///
/// 1. For each `EvolutionSwapEntry` in `swaps`, identify the KABI service
///    class being replaced (from the vtable header's `service_class` field).
/// 2. Walk the dependency DAG: for each module that declares a dependency
///    on this service class (registered in the KABI dependency registry),
///    check that `dependent_module.min_required_vtable_size <= new_vtable.vtable_size`.
/// 3. If any dependent module requires a vtable larger than the new one
///    provides (i.e., it uses methods beyond the new vtable's extent),
///    record the incompatibility.
/// 4. Return `Ok(())` if all dependents are compatible, or
///    `Err(Vec<KabiIncompatibility>)` listing each incompatible dependent.
fn verify_kabi_cross_compat(
    swaps: &[EvolutionSwapEntry],
) -> Result<(), Vec<KabiIncompatibility>> {
    // Vec bound: at most one incompatibility per swap × max dependents.
    // swaps is bounded by MAX_EVOLUTION_SWAPS (32); dependents per service
    // class are bounded by KABI_MAX_DEPENDENTS (256). Worst case: 32 × 256
    // = 8192 entries (~200 KiB). Cold path (evolution is rare), acceptable.
    let mut incompatibilities = Vec::new();
    for swap in swaps {
        let header = unsafe { &*swap.new_vtable };
        for dep in kabi_registry::dependents_of(header.service_class) {
            if dep.min_required_vtable_size > header.vtable_size {
                incompatibilities.push(KabiIncompatibility {
                    service_class: header.service_class,
                    dependent_module: dep.module_name,
                    required_size: dep.min_required_vtable_size,
                    provided_size: header.vtable_size,
                });
            }
        }
    }
    if incompatibilities.is_empty() { Ok(()) } else { Err(incompatibilities) }
}

/// One entry in a batch evolution swap.
// Kernel-internal: used within evolution primitive, not KABI or wire format.
// No #[repr(C)] — standard Rust layout is sufficient (populated and consumed
// within the same compilation unit in batch_validate_and_prepare/execute_batch_swap).
pub struct EvolutionSwapEntry {
    pub target_slot: &'static AtomicPtr<VtableHeader>,
    pub new_vtable: *mut VtableHeader,
    /// Per-CPU pending-ops structure to transfer to the new component.
    /// Null if the old component has no pending operations.
    /// Uses raw pointer for explicit null semantics: `core::ptr::null()` is
    /// unambiguous in unsafe/NMI code.  `Option<&'static T>` would be valid
    /// here (single compilation unit) but raw pointer + null check is preferred
    /// in Nucleus primitives for clarity in formal verification.
    pub pending_ops: *const PendingOpsPerCpu,
    pub image_pages: &'static [PhysPage],
}

13.18.7.2 stop_the_world() Specification

The stop_the_world() function is the IPI-based CPU halt mechanism used during Phase B. It is part of the Nucleus primitive and must be formally verified.

/// Halt all CPUs except the caller at a known-safe point.
///
/// Returns an opaque token that must be passed to `release_all_cpus()`.
/// The caller must not hold any spinlocks except the evolution mutex
/// (which is acquired by orchestration in Phase A).
///
/// # Mechanism (per architecture)
///
/// | Architecture | IPI vector | Halt method | I-cache sync |
/// |---|---|---|---|
/// | x86-64 | NMI (vector 2) | `cli; hlt` loop | None needed (coherent) |
/// | AArch64 | SGI #15 (FIQ-level) | `wfi` loop | ISB after swap |
/// | ARMv7 | SGI #15 | `wfi` loop | ISB + BPIALL after swap |
/// | RISC-V | IPI via SBI `sbi_send_ipi()` | `wfi` loop | `fence.i` after swap |
/// | PPC32 | IPI via OpenPIC | `or 1,1,1; or 2,2,2` (yield hint) | `isync` after swap |
/// | PPC64LE | IPI via XIVE | `or 1,1,1` (yield hint) | `isync` after swap |
/// | s390x | SIGP `STOP` order | CPU enters stopped state | `CSP` (Compare and Swap and Purge) after swap |
/// | LoongArch64 | IPI via IOCSR mailbox | `idle` instruction | `ibar 0` after swap |
///
/// # Protocol
///
/// 1. Caller disables preemption and local interrupts.
/// 2. Caller sets `EVOLUTION_PHASE_B` flag in a global atomic.
/// 3. Caller sends IPI to all other online CPUs.
/// 4. Each receiving CPU:
///    a. Acknowledges the IPI (EOI to interrupt controller).
///    b. Snapshots `ABORT_GENERATION` into a local variable `my_gen`.
///    c. Saves minimal state (only what the IPI handler clobbers).
///    d. Checks `ABORT_GENERATION.load(Acquire) == my_gen` — if not, an abort
///       occurred between IPI receipt and halt entry; skip the halt loop.
///    e. Sets its per-CPU `HALTED` flag.
///    f. Enters a tight halt loop (architecture-specific, see table above).
///       The halt loop also checks `ABORT_GENERATION` on each iteration.
/// 5. Caller spins until all online CPUs have set `HALTED`.
///    Bounded by: IPI delivery latency + handler entry ≤ 10 μs on all archs.
///    If any CPU does not respond within 100 μs, the function returns
///    `EvolutionPrimitiveError::CpuHaltTimeout { cpu_id }` and the caller
///    must abort the evolution.
/// 6. Caller returns `IpiToken` — the evolution swap can now proceed.
///
/// # NMI Safety
///
/// On x86-64, the halt IPI uses NMI (non-maskable interrupt) to guarantee
/// delivery even if a CPU has interrupts disabled (e.g., inside a spinlock
/// critical section). The NMI handler checks `EVOLUTION_PHASE_B`:
///   - If set: enter halt loop (the spinlock state is frozen; the lock will
///     be released when the CPU resumes after Phase B).
///   - If not set: handle as normal NMI (performance counter overflow, etc.).
///
/// On architectures without NMI (AArch64 uses FIQ, RISC-V uses SBI), the
/// IPI is level-triggered and the handler polls until the target CPU
/// reaches a safe point (not inside a critical section). This is bounded
/// because all spinlock critical sections in UmkaOS are bounded (max hold
/// time documented per lock, see [Section 3.5](03-concurrency.md#locking-strategy)).
///
/// # Disambiguation with Crash Recovery NMI
///
/// On x86-64, both evolution STW and crash recovery reuse the NMI vector.
/// The handler disambiguates by checking `EVOLUTION_PHASE_B` first (if set,
/// enter evolution halt loop). On non-x86 architectures, evolution STW and
/// crash recovery use DISTINCT IPI mechanisms (see per-arch IPI table in
/// [Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)): crash recovery uses NMI/FIQ
/// while evolution uses a separate software IPI vector. No disambiguation
/// is needed — the IPI vectors are disjoint by construction.
///
/// # Invariant
///
/// After `stop_the_world()` returns successfully:
///   ∀ cpu ∈ online_cpus \ {caller}: cpu.state == HALTED ∧ cpu.ip ∈ halt_loop
/// The caller is the only CPU executing.

/// Per-CPU halt flag for the stop-the-world IPI protocol.
/// Indexed by CPU ID (0..NR_CPUS). Each CPU sets its own flag when it
/// enters the halt loop; the orchestrating CPU polls all flags to
/// determine when all CPUs have halted.
///
/// `NR_CPUS` is a compile-time capacity constant (see
/// [Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path--nr-cpus-capacity)).
/// The actual CPU count is discovered at boot; the kernel panics at boot if
/// it exceeds `NR_CPUS`. Typical value: 1024 (covers all current server
/// platforms).
static CPU_HALTED: [AtomicBool; NR_CPUS] = {
    // const-initialize all entries to false.
    // NR_CPUS is a compile-time constant (static array capacity).
    const INIT: AtomicBool = AtomicBool::new(false);
    [INIT; NR_CPUS]
};

/// Monotonically increasing abort generation counter. Incremented on each
/// `stop_the_world()` timeout abort. CPUs entering the halt loop snapshot
/// this value on IPI receipt; if it has changed by the time they reach
/// the halt loop entry, an abort occurred and they skip the halt loop.
static ABORT_GENERATION: AtomicU64 = AtomicU64::new(0);

/// Maximum cycles to wait for all CPUs to halt. Computed at boot:
/// `calibrate_cycles_per_us() * 100` (100 microsecond timeout).
/// Each architecture's `arch::current::time::calibrate_cycles_per_us()`
/// reads the cycle counter frequency (x86 TSC, AArch64 CNTFRQ_EL0,
/// RISC-V mtime freq from DTB, PPC timebase, s390x TOD clock, LoongArch
/// stable counter). The value is stored once during early init.
static HALT_TIMEOUT_CYCLES: AtomicU64 = AtomicU64::new(0);
// Initialized during boot: HALT_TIMEOUT_CYCLES.store(
//     arch::current::time::calibrate_cycles_per_us() * 100, Relaxed);

pub fn stop_the_world() -> Result<IpiToken, EvolutionPrimitiveError> {
    // Inhibit CPU hotplug for the duration of the STW window.
    // A CPU coming online during Phase B would not have received the
    // halt IPI and could observe a partially-swapped vtable state.
    // This uses the same approach as Linux's stop_machine():
    // acquire the cpu_hotplug_lock for write, preventing any
    // cpu_up() or cpu_down() from proceeding.
    let hotplug_guard = cpu_hotplug_lock_write();

    arch::current::interrupts::disable_local();
    EVOLUTION_PHASE_B.store(true, Ordering::SeqCst);

    let online = arch::current::cpu::online_mask();
    let self_id = arch::current::cpu::id();

    // Send IPI to all other online CPUs. On a send failure, unwind via the
    // same abort path as the halt timeout below: bump the abort generation,
    // clear the phase flag, release any CPUs that already halted, and
    // re-enable local interrupts before returning. (A bare `?` here would
    // leave the caller with interrupts disabled and Phase B latched; the
    // hotplug guard is released by drop on the early return.)
    for cpu in online.iter() {
        if cpu != self_id {
            if let Err(e) = arch::current::ipi::send_halt(cpu) {
                ABORT_GENERATION.fetch_add(1, Ordering::SeqCst);
                EVOLUTION_PHASE_B.store(false, Ordering::SeqCst);
                for other in online.iter() {
                    if other != self_id {
                        CPU_HALTED[other].store(false, Ordering::Release);
                        arch::current::ipi::send_abort(other);
                    }
                }
                arch::current::interrupts::enable_local();
                return Err(e.into());
            }
        }
    }

    // Spin until all CPUs confirm halted
    let deadline = arch::current::time::read_cycle_counter()
        + HALT_TIMEOUT_CYCLES.load(Ordering::Relaxed);
    loop {
        let all_halted = online.iter()
            .filter(|&cpu| cpu != self_id)
            .all(|cpu| CPU_HALTED[cpu].load(Ordering::Acquire));
        if all_halted {
            break;
        }
        if arch::current::time::read_cycle_counter() > deadline {
            // Timeout — release any halted CPUs and abort.
            //
            // The abort path uses a generation counter to prevent the
            // halt-loop race: a CPU entering the halt loop after the
            // abort has already cleared EVOLUTION_PHASE_B would be
            // stranded permanently. By incrementing ABORT_GENERATION
            // before clearing the phase flag, late-arriving CPUs can
            // detect the abort:
            //   - CPUs already in the halt loop: exit when they see
            //     CPU_HALTED[self] == false (normal release path).
            //   - CPUs entering the halt loop after abort: check
            //     ABORT_GENERATION before entering; if it has
            //     incremented since IPI receipt, skip the halt loop
            //     entirely.
            //
            // Additionally, an IPI_EVOLUTION_ABORT is sent to all CPUs
            // to wake any that entered halt between the timeout check
            // and the flag clear.
            ABORT_GENERATION.fetch_add(1, Ordering::SeqCst);
            EVOLUTION_PHASE_B.store(false, Ordering::SeqCst);
            for cpu in online.iter() {
                if cpu != self_id {
                    CPU_HALTED[cpu].store(false, Ordering::Release);
                    arch::current::ipi::send_abort(cpu);
                }
            }
            return Err(EvolutionPrimitiveError::CpuHaltTimeout);
        }
        core::hint::spin_loop();
    }

    Ok(IpiToken { _hotplug_guard: hotplug_guard })
}
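Steps 4b-4f of the protocol (generation snapshot, abort check, halt loop) can be modeled in isolation. This userspace sketch substitutes a local `AtomicBool` for `CPU_HALTED[cpu]` and a closure for the architecture halt instruction; names and structure are illustrative only, not the Nucleus implementation.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

static ABORT_GEN: AtomicU64 = AtomicU64::new(0);

/// Model of the receiving-CPU side of the halt protocol (steps 4b-4f).
fn halt_ipi_handler(halted: &AtomicBool, mut halt_once: impl FnMut()) {
    // 4b: snapshot the abort generation at IPI receipt.
    let my_gen = ABORT_GEN.load(Ordering::Acquire);
    // 4d: if an abort happened between receipt and here, skip the loop.
    if ABORT_GEN.load(Ordering::Acquire) != my_gen {
        return;
    }
    // 4e: announce halted so the orchestrating CPU's poll can complete.
    halted.store(true, Ordering::Release);
    // 4f: spin until released (flag cleared) or aborted (gen bumped).
    while halted.load(Ordering::Acquire)
        && ABORT_GEN.load(Ordering::Acquire) == my_gen
    {
        halt_once();
    }
}

fn main() {
    let halted = AtomicBool::new(false);
    let mut iters = 0;
    // Release after three "halt" iterations, as release_all_cpus would.
    halt_ipi_handler(&halted, || {
        iters += 1;
        if iters == 3 {
            halted.store(false, Ordering::Release);
        }
    });
    assert_eq!(iters, 3);
}
```

The double read of the generation counter (once at receipt, once at loop entry and on every iteration) is what closes the stranded-CPU window described in the timeout abort path.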

/// Release all halted CPUs. Must be called with the IpiToken from
/// stop_the_world(). After this returns, all CPUs resume normal execution.
pub fn release_all_cpus(_token: IpiToken) {
    // Architecture-specific I-cache synchronization before release
    // (see table above — some architectures need ISB/isync/fence.i
    // after code pages are remapped).
    arch::current::mm::icache_sync_all();

    // Clear HALTED flags — each CPU's halt loop checks this
    let online = arch::current::cpu::online_mask();
    let self_id = arch::current::cpu::id();
    for cpu in online.iter() {
        if cpu != self_id {
            CPU_HALTED[cpu].store(false, Ordering::Release);
        }
    }

    // Clear global phase flag
    EVOLUTION_PHASE_B.store(false, Ordering::Release);

    // Re-enable local interrupts on caller
    arch::current::interrupts::enable_local();
}

/// Opaque token proving that all CPUs are halted.
/// Prevents calling release_all_cpus() without a matching stop_the_world().
/// Embeds the `CpuHotplugWriteGuard` to prevent CPU hotplug during the
/// entire stop-the-world window (from stop_the_world() through
/// release_all_cpus()). The guard is dropped when IpiToken is consumed
/// by release_all_cpus(), releasing the cpu_hotplug_lock.
pub struct IpiToken { _hotplug_guard: CpuHotplugWriteGuard }
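The guard-embedding pattern behind `IpiToken` can be illustrated outside the kernel with a `Mutex` standing in for the hotplug lock. This is a userspace model; these `stop_the_world`/`release_all_cpus` are stand-ins, not the Nucleus functions.

```rust
use std::sync::{Mutex, MutexGuard};

/// Stand-in for IpiToken: the token embeds the guard, so the lock is
/// held for exactly the token's lifetime.
struct IpiToken<'a> {
    _hotplug_guard: MutexGuard<'a, ()>,
}

fn stop_the_world(hotplug: &Mutex<()>) -> IpiToken<'_> {
    IpiToken { _hotplug_guard: hotplug.lock().unwrap() }
}

fn release_all_cpus(_token: IpiToken<'_>) {
    // Guard dropped here: the hotplug lock is released on consumption.
}

fn main() {
    let hotplug = Mutex::new(());
    let token = stop_the_world(&hotplug);
    // Hotplug is inhibited while the token is alive...
    assert!(hotplug.try_lock().is_err());
    release_all_cpus(token);
    // ...and possible again once the token is consumed.
    assert!(hotplug.try_lock().is_ok());
}
```

Because the token is consumed by value, the type system makes it impossible to release the CPUs twice or to release the hotplug lock while the stop-the-world window is still open.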

13.18.7.3 Phase B Panic: Known Design Limitation

A panic during the Phase B stop-the-world window is unrecoverable and requires a system reboot. This is a deliberate design trade-off, not an oversight.

Why Phase B panics cannot be recovered:

  1. All CPUs except the caller are halted. The caller is the only executing CPU. If the caller panics (e.g., a bug in evolution_apply() or evolution_apply_batch()), no other CPU can detect the failure or initiate recovery — they are frozen in their halt loops waiting for release_all_cpus() which will never be called.

  2. The hardware watchdog cannot help during Phase B. The evolution watchdog is armed only after Phase B completes (in Phase C); during Phase B itself it is not yet active. Even if a pre-armed hardware timer fired as an NMI, its recovery handler calls evolution_apply(), which calls stop_the_world() — but all CPUs are already stopped (the EVOLUTION_PHASE_B flag is set and the CPU_HALTED flags are set), producing a deadlock.

  3. Partial atomic swap state is unrecoverable. In evolution_apply_batch(), if the caller panics after swapping some slots but before completing the loop, the system is in an inconsistent state: some components point to new vtables, others to old vtables. No external observer can determine which slots were swapped and which were not. Even if recovery were possible, reconstructing a consistent state from this partial swap is not feasible.

Why this is acceptable:

  1. Phase B code is in the Nucleus (~2-3 KB). The Nucleus is the smallest, most critical code in the kernel. It is formally verified via Verus proofs (INV-1 and INV-6) and contains no dynamic allocation, no recursion, and no unbounded loops. The evolution_apply() and evolution_apply_batch() functions are straight-line code: pointer stores, atomic fences, and a bounded loop over the swap array. The only fallible operation is ensure_mapped(), which is also Nucleus code operating on pre-validated physical addresses.

  2. The stop-the-world window is microseconds. Phase B lasts 1-10 μs (up to ~50 μs for scheduler evolution with runqueue transfer). The probability of a hardware fault (ECC error, NMI from an external source) during this window is negligible — comparable to the probability of hardware failure during any other microsecond-scale critical section in any operating system kernel.

  3. No safe alternative exists. Any recovery mechanism for a Phase B panic would itself run in the Nucleus and face the same verification burden as the Phase B code. Adding recovery logic increases Nucleus complexity and attack surface without meaningfully improving reliability — the recovery code itself could panic. The simplest correct design is: verify Phase B code is correct (formal verification), accept that if it panics despite verification the system must reboot.

Mitigation: The Nucleus panic!() handler, when invoked during Phase B (detected via EVOLUTION_PHASE_B.load(Acquire) == true), outputs a diagnostic message to the serial console identifying the panic location and the evolution batch state (which slots were swapped, which were pending). This information aids post-mortem analysis. The handler then enters an infinite halt loop; the platform's hardware watchdog timer (if configured, typically 30-60 seconds) eventually triggers a full system reset.

13.18.7.4 Orchestration Crash Recovery (Nucleus Watchdog)

The orchestration crash rollback problem: if the orchestration itself crashes during or after an evolution operation, who handles the rollback? The orchestration is the component responsible for rollback, creating a circular dependency.

Resolution: Nucleus contains a minimal evolution watchdog that operates independently of orchestration. This is part of the Nucleus primitive (~300 bytes additional code) and is formally verified.

// umka-core/src/evolution/watchdog.rs — Nucleus (non-replaceable)

/// Post-swap watchdog. Started by orchestration after Phase B completes.
/// If the watchdog fires (orchestration or the new component crashes
/// before calling watchdog_disarm()), the Nucleus primitive automatically
/// reverts to the retained old component.
///
/// The watchdog is a simple hardware timer (architecture-specific) that
/// fires a NMI/FIQ when it expires. The NMI handler checks the
/// WATCHDOG_ARMED flag and, if set, calls evolution_apply() with the
/// retained old vtable to revert the swap.
///
/// Key design decisions:
///   1. The watchdog timer is a HARDWARE timer, not a software timer.
///      Software timers depend on the scheduler and interrupt framework,
///      which may be the very components being replaced. Hardware timers
///      (LAPIC timer on x86, CNTPS_CTL_EL1 on AArch64, SBI timer on
///      RISC-V, CPU Timer on s390x, Stable Counter on LoongArch64) run
///      independently of all kernel software.
///   2. The old vtable and image pages are RETAINED in memory until the
///      watchdog is disarmed. This is the "rollback reservation" — memory
///      that cannot be freed until the new component is confirmed healthy.
///   3. The NMI/FIQ handler for watchdog expiry calls evolution_apply()
///      directly — no orchestration involvement. This breaks the circular
///      dependency: Nucleus can revert to the old component even if
///      orchestration is crashed.
///   4. Self-evolution (orchestration replacing itself) uses the same
///      watchdog. If the new orchestration crashes, the Nucleus watchdog
///      fires and reverts to the old orchestration.

pub struct EvolutionWatchdog {
    /// 1 when a post-swap watchdog is active, 0 otherwise.
    /// AtomicU8 instead of AtomicBool for cross-compiler-version safety
    /// (consistent with VtableHeader.quiescing).
    pub armed: AtomicU8,
    /// Retained old vtable pointer for revert.
    pub old_vtable: AtomicPtr<VtableHeader>,
    /// Target slot that was swapped (for reverting).
    pub target_slot: AtomicPtr<AtomicPtr<VtableHeader>>,
    /// Timeout in hardware timer ticks.  AtomicU64 to prevent UB from
    /// concurrent NMI access (NMI fires at arbitrary points, ignoring
    /// release/acquire fences).  On 32-bit targets, plain u64 write is
    /// two 32-bit stores; an NMI between them sees a torn value.
    pub timeout_ticks: AtomicU64,
    /// For batch swaps: array of (target_slot, old_vtable) pairs.
    /// MAX_BATCH_REVERT = 16 (covers all Evolvable components). Any batch
    /// that must be revertible by this watchdog is bounded by this
    /// capacity, not by MAX_EVOLUTION_SWAPS.
    pub batch_entries: [WatchdogRevertEntry; 16],
    pub batch_count: AtomicU32,
}

// kernel-internal, not KABI (used within evolution primitive only,
// never crosses isolation domain or wire boundaries).
#[repr(C)]
pub struct WatchdogRevertEntry {
    pub target_slot: *const AtomicPtr<VtableHeader>,
    pub old_vtable: *mut VtableHeader,
}

/// Arm the watchdog after Phase B completes.
/// Called by Evolvable orchestration.
///
/// The caller MUST populate the revert data (`old_vtable`, `target_slot`,
/// and optionally `batch_entries`/`batch_count`) BEFORE calling this
/// function. The `armed` flag is stored with `Release` ordering to ensure
/// all revert data writes are visible to the NMI handler before it reads
/// `armed == 1` (the field is an `AtomicU8`, not a bool).
///
/// # Single-component revert
///
/// ```ignore
/// wd.old_vtable.store(old_vtable, Ordering::Relaxed);
/// wd.target_slot.store(target_slot, Ordering::Relaxed);
/// wd.batch_count.store(0, Ordering::Relaxed);
/// watchdog_arm(timeout_ms);
/// ```
///
/// # Batch revert
///
/// ```ignore
/// for (i, entry) in batch.iter().enumerate() {
///     wd.batch_entries[i] = WatchdogRevertEntry {
///         target_slot: entry.target_slot,
///         old_vtable: entry.old_vtable,
///     };
/// }
/// wd.batch_count.store(batch.len() as u32, Ordering::Relaxed);
/// watchdog_arm(timeout_ms);
/// ```
pub fn watchdog_arm(timeout_ms: u32) {
    let wd = &EVOLUTION_WATCHDOG;
    let ticks = arch::current::time::ms_to_hw_ticks(timeout_ms as u64);
    wd.timeout_ticks.store(ticks, Ordering::Relaxed);
    // Release ensures all prior stores (old_vtable, target_slot,
    // batch_entries, batch_count) are visible before armed == 1.
    wd.armed.store(1, Ordering::Release);
    arch::current::time::set_hw_watchdog(ticks);
}

/// Disarm the watchdog after the new component is confirmed healthy.
/// Called by Evolvable orchestration after Phase C completes and the
/// soak period passes without error.
pub fn watchdog_disarm() {
    let wd = &EVOLUTION_WATCHDOG;
    arch::current::time::clear_hw_watchdog();
    wd.armed.store(0, Ordering::Release);
    wd.batch_count.store(0, Ordering::Release);
    // Old vtable/images can now be freed by orchestration.
}

/// NMI/FIQ handler for watchdog expiry. Nucleus code — no orchestration.
///
/// # Safety
///
/// This runs in NMI context. It can only call evolution_apply() (which
/// calls stop_the_world).
///
/// # Concurrent NMI handling
///
/// On x86-64, the halt IPI uses NMI (same vector as the watchdog LAPIC
/// timer NMI). The two are distinguished by the `EVOLUTION_PHASE_B` flag:
/// - If `EVOLUTION_PHASE_B` is set: the NMI is a halt IPI, enter halt loop.
/// - If `EVOLUTION_PHASE_B` is not set: the NMI is a watchdog expiry.
///
/// The hardware watchdog timer fires on ALL CPUs simultaneously (LAPIC
/// timer NMI). To prevent multiple CPUs from concurrently reverting the
/// evolution, we use `armed.swap(0, Acquire)` instead of `armed.load()`.
/// Only the CPU that successfully swaps `armed` from 1 to 0 proceeds;
/// all others observe 0 and return immediately.
pub unsafe fn watchdog_nmi_handler() {
    let wd = &EVOLUTION_WATCHDOG;
    // Atomic swap ensures exactly one CPU proceeds with the revert.
    // All other CPUs that enter this handler concurrently will observe
    // armed == 0 (the swap already set it to 0) and return.
    if wd.armed.swap(0, Ordering::Acquire) == 0 {
        return; // Spurious, already disarmed, or another CPU is handling it
    }

    let batch = wd.batch_count.load(Ordering::Acquire);
    if batch > 0 {
        // Batch revert: reconstruct EvolutionSwapEntry array from saved entries
        let mut swaps: ArrayVec<EvolutionSwapEntry, 16> = ArrayVec::new();
        for i in 0..batch as usize {
            let entry = &wd.batch_entries[i];
            swaps.push(EvolutionSwapEntry {
                target_slot: &*entry.target_slot,
                new_vtable: entry.old_vtable, // "new" = old, we're reverting
                pending_ops: core::ptr::null(), // NMI crash path: PendingOps were already
                                  // replayed by Phase C1 before the crash.
                                  // The old component's re-imported state
                                  // reflects the pre-swap snapshot; any ops
                                  // that executed against the new component
                                  // are lost, but this is inherent to crash
                                  // recovery (same as any component crash).
                image_pages: &[], // Old pages already mapped
            });
        }
        // Check the Result: if evolution_apply_batch fails, the revert
        // could not complete. In NMI context we cannot propagate errors,
        // but we MUST log the failure for post-mortem diagnosis and
        // re-arm the watchdog (so a subsequent timer expiry retries).
        match evolution_apply_batch(&swaps) {
            Ok(()) => {
                // armed is already 0 (swapped at handler entry). Success.
                arch::current::serial::puts(
                    "EVOLUTION WATCHDOG: batch reverted to old components\n"
                );
            }
            Err(_e) => {
                // Revert failed. Re-arm so a subsequent NMI can retry.
                wd.armed.store(1, Ordering::Release);
                arch::current::serial::puts(
                    "EVOLUTION WATCHDOG: batch revert FAILED, watchdog re-armed\n"
                );
                // Forward-only semantics: already-swapped entries stay in
                // their new state. The system is in a partially-reverted
                // configuration — this is an unrecoverable evolution failure
                // that will likely require a reboot.
            }
        }
    } else {
        // Single-component revert
        match evolution_apply(
            &*wd.target_slot.load(Ordering::Acquire),
            wd.old_vtable.load(Ordering::Acquire),
            // Pending ops: use the boot-initialized empty sentinel.
            // Old component's pending ops were already replayed in Phase C1
            // before the crash. The empty sentinel ensures no stale ops
            // are replayed during revert.
            EMPTY_PENDING_OPS.get().expect("evolution init completed"),
            &[], // Old pages already mapped
        ) {
            Ok(()) => {
                // armed is already 0 (swapped at handler entry). Success.
                arch::current::serial::puts(
                    "EVOLUTION WATCHDOG: reverted to old component\n"
                );
            }
            Err(_e) => {
                // Revert failed. Re-arm so a subsequent NMI can retry.
                wd.armed.store(1, Ordering::Release);
                arch::current::serial::puts(
                    "EVOLUTION WATCHDOG: revert FAILED, watchdog re-armed\n"
                );
            }
        }
    }
}

13.18.7.5 Platform Driver Quiescence Protocol

Platform-critical drivers (interrupt controllers, timers, IOMMU) cannot be quiesced using the standard drain-and-swap protocol because their absence would halt all interrupt delivery, timer ticks, and DMA protection. These are handled specially.

Platform drivers are Tier 0 (in-kernel, non-isolated) and are NOT individually replaceable during normal operation. They can only be replaced as part of a whole-Evolvable replacement, where the new Evolvable image contains updated platform driver code. The replacement is atomic (Phase B stop-the-world), so the old platform code runs until the swap and the new platform code runs immediately after.

| Platform Component | Why No Individual Quiescence | Replacement Strategy |
|---|---|---|
| Interrupt controller (GIC/APIC/PLIC/PSW-swap/EIOINTC) | Disabling would stop ALL interrupts, including timer ticks, preventing any timeout-based quiescence | Phase B atomic swap: old IRQ handler runs until IPI, new handler runs after IPI. Interrupt state (pending/active) preserved across swap. |
| Timer subsystem | Disabling would prevent scheduler ticks, watchdog timers, and quiescence deadlines | Phase B atomic swap: timer hardware continues running; only the handler code pointer changes. Next tick fires into new handler. |
| IOMMU | Disabling would allow rogue DMA from any device | Phase B atomic swap: IOMMU page tables are DATA (not code), so they survive the swap. Only the fault handler code pointer changes. |
| Memory management (page fault handler) | Disabling would triple-fault on any page fault | Phase B atomic swap: page tables are DATA. Only the fault handler dispatch code changes. |

Quiescence for platform drivers within whole-Evolvable replacement:

The whole-Evolvable replacement protocol handles platform drivers by exploiting the stop-the-world window:

  1. Before Phase B: All non-platform Evolvable components are quiesced using the standard protocol (drain in-flight ops, export state). Platform drivers continue running normally during this period. For full Evolvable replacement, each Tier 1 driver in the batch has its DMA quiesced individually before the batch Phase B. The orchestrator calls dma_quiesce(driver) for each driver in dependency order, accumulating drained state. Phase B then swaps all vtables atomically.

  2. Phase B (IPI halt): Once all CPUs are halted:
     a. Swap ALL vtable pointers (both platform and non-platform) in a single evolution_apply_batch() call.
     b. Platform driver state is PRESERVED because it is data, not code:

        • Interrupt controller: pending interrupt bitmask, affinity routing tables, and IRQ→handler mappings live in XArray/data structures, not in vtable code.
        • Timer: the current timer value and next-fire timestamp are hardware registers or per-CPU data, not part of the replaced code.
        • IOMMU: page tables and domain mappings are data structures that survive the code swap.

     c. The new platform code inherits exactly the same data state as the old code.

  3. After Phase B release: The first interrupt/timer tick/page fault dispatches to the new handler code. Because the data is identical, the new code picks up seamlessly. No interrupt is lost, no timer tick is missed, no DMA mapping is stale.

Why this works: Platform drivers in UmkaOS follow the data/code split principle. All mutable state (IRQ routing, timer queues, page tables) lives in Nucleus non-replaceable data structures. The replaceable platform "code" is limited to:

- IRQ dispatch logic (which handler to call for which vector)
- Timer tick handler (what to do when a tick fires)
- IOMMU fault handler (what to do on a DMA fault)

These are pure functions of the data state. Swapping the function pointers while preserving the data is always safe.

13.18.7.6 Whole-Evolvable Init Sequence (Post-Replacement)

When the entire Evolvable image is replaced (not individual components), the new Evolvable must re-initialize its subsystems. This differs from the boot init sequence (Section 2.21) because:

- The system is already running (CPUs online, memory allocated, devices active)
- State was exported from the old Evolvable and must be imported by the new Evolvable
- There is no BootAlloc — the buddy/slab allocators are already operational

Whole-Evolvable replacement init sequence (Phase C):

1. Import orchestration state (self — the new orchestration's own config)
2. Import slab allocator policy state (cache sizes, growth parameters)
3. Import workqueue state (thread pool config, active work items)
4. Import ACPI/DTB parser state (parsed device tree, table cache)
5. Import block layer state (request queues, dm target tables)
6. Import FMA/observability state (fault records, tracepoint registrations)
7. Import LSM state (security labels, policy database)
8. Import network stack state (TCP connections, routing table, socket hash)
9. Import IPC dispatch state (pipe buffers, SysV objects)
10. Import VFS state (mount tree, dentry cache, open file table)
11. Import scheduler state (runqueue contents, bandwidth controllers)
    — Scheduler is LAST because all other subsystems may create/modify
    tasks during their import. The scheduler import captures the final
    task state after all other imports are complete.

After all imports:
12. Drain pending ops from all per-CPU PendingOpsPerCpu queues.
    Each pending op is replayed through the new component's vtable.
    **Tombstoned method handling**: Before dispatching each pending op, validate
    its `method_id` against the new vtable's `vtable_size`. If `method_id >=
    new_vtable.vtable_size` (method was removed in the new version), the op is
    NOT replayed. Instead, return the method's tombstone error code: the new
    component's KABI manifest declares per-method removal error codes (see
    [Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle--tombstone-stub-protocol)). If no explicit
    error code is declared, return `ENOSYS`. Log the skipped op via FMA
    (`EvolutionPendingOpSkipped` event) for auditability.

    **Replayed op error handling**: If a replayed pending op returns an error
    (the NEW component rejects the operation): the error is propagated to the
    original caller. The caller sees the error as if the operation failed
    normally — no special "evolution error" code. This is correct because the
    new component's rejection is a valid semantic response (e.g., a stricter
    policy rejects an operation the old policy accepted). The evolution framework
    does NOT retry or suppress errors from replayed ops. FMA event
    `EvolutionReplayError { method_id, error_code }` is logged for diagnostics.

    **Argument type compatibility across versions**: KABI vtable methods use
    `#[repr(C)]` types with explicit sizes (u32, u64, never usize). Field
    additions are append-only (KABI Rule 5). The pending op's serialized
    arguments are a fixed-size byte buffer matching the OLD method signature.
    If the NEW method has ADDITIONAL trailing fields (appended in a newer KABI
    version), the dispatch trampoline zero-fills the extra fields before calling
    the new method. If the NEW method REMOVED or REORDERED fields (violates KABI
    rules — should not happen), `is_compatible_batch()` rejects the swap during
    Phase A pre-validation. No runtime argument type checking is needed because
    KABI ABI rules guarantee forward-compatible argument layouts.
13. Start post-swap watchdog (default 5000ms).
14. Write evolution result to /sys/kernel/evolution/last_result.

This is the REVERSE of the export order (which is bottom-up). The import order is top-down: components with the fewest dependencies import first, and heavily dependent components import later (VFS at position 10, scheduler at position 11). Each component's import_state_chunk() receives opaque serialized state from the matching export_state_chunk() — it does NOT reference other components' live state during import. Cross-component references (e.g., VFS holding block device handles) are re-established in Phase C1 (pending ops replay) after all imports complete and all vtable slots have been atomically swapped.

State version chain migration: If the new Evolvable is multiple versions ahead of the old Evolvable (e.g., v5 → v12), the orchestration applies a migration chain: export_v5 → migrate_v5_to_v6 → migrate_v6_to_v7 → ... → import_v12. Each migration step is a pure function in the new Evolvable image. The chain length is bounded by the support window (Section 12.4).

/// Boot-only: verify Evolvable flat image against LMS signature using the
/// baked-in public key. Called exactly once during boot Phase 0.8
/// ([Section 2.21](02-boot-hardware.md#kernel-image-structure--phase-08-evolvable-boot-loading-protocol)).
///
/// LMS verification (NIST SP 800-208) is purely hash-based:
///   1. SHAKE256 hash of signed content → message digest
///   2. Winternitz chain completion (~530 hash calls for W=4)
///   3. Merkle authentication path walk (H hash calls, e.g., 15 for H=15)
///   4. Compare candidate root to public key's T[1]
///
/// Uses the same Keccak-f[1600] permutation as SHA3-256 (already in Nucleus).
/// SHAKE256 differs only in the padding byte (0x1F vs 0x06).
///
/// After boot, all subsequent image verification is done by Evolvable's
/// ML-DSA-65 signature verifier (which is itself live-replaceable).
///
/// # Safety
///
/// `signed_content` must point to the Evolvable image bytes [0, image_size).
/// `signature` must point to the appended LMS signature bytes.
pub unsafe fn lms_verify_core1(
    public_key: &[u8; 56],
    signed_content: &[u8],
    signature: &[u8],
) -> Result<(), BootVerifyError> {
    if !lms_verify_shake256(public_key, signed_content, signature) {
        return Err(BootVerifyError::SignatureInvalid);
    }
    Ok(())
}

#[derive(Debug)]
pub enum EvolutionPrimitiveError {
    /// IPI failed (hardware error — should not happen on healthy system)
    IpiFailed,
    /// Page remapping failed (out of page table entries — very unlikely)
    RemapFailed,
    /// A CPU did not respond to the halt IPI within the timeout.
    CpuHaltTimeout,
    /// Two entries in a batch swap reference the same target slot.
    DuplicateSlot,
    /// Batch swap partially completed: slots 0..completed were swapped,
    /// slot at failed_index failed, remaining slots were not swapped.
    /// Orchestration should arm the watchdog and either retry the
    /// remaining slots or let the watchdog revert all.
    PartialBatchFailure {
        /// Number of slots successfully swapped (0..completed).
        completed: usize,
        /// Index of the slot that failed.
        failed_index: usize,
        /// Underlying error from the failed slot (typically RemapFailed).
        /// Stored as a flat enum variant — no Box/heap allocation, because
        /// Phase B runs during stop-the-world with all CPUs halted and the
        /// heap allocator may hold locks on stopped CPUs.
        cause: EvolutionPrimitiveCause,
    },
}

/// Flat (non-recursive) error cause for `PartialBatchFailure`.
/// Avoids heap allocation during Phase B stop-the-world.
#[derive(Clone, Copy, Debug)]
#[repr(u8)]
pub enum EvolutionPrimitiveCause {
    /// IPI failed (hardware error — should not happen on a healthy system).
    IpiFailed = 0,
    /// Page remapping failed (out of page table entries).
    RemapFailed = 1,
    /// A CPU did not respond to the halt IPI within the timeout.
    CpuHaltTimeout = 2,
}

#[derive(Debug)]
pub enum EvolutionCompatError {
    /// New component's state_version() is not strictly greater than old.
    VersionNotMonotonic { index: usize, old_ver: u64, new_ver: u64 },
    /// New component's import_state() rejected the old component's state.
    ImportIncompatible { index: usize, cause: StateImportError },
    /// KABI cross-compatibility check failed between two components.
    KabiIncompatible { exporter_index: usize, importer_index: usize },
}

/// Orchestration-level error enum for the complete evolution pipeline.
/// This is the error type returned by the evolution ioctl handler and
/// the sysfs trigger. It wraps primitive and compat errors plus orchestration-
/// specific failure modes.
#[derive(Debug)]
pub enum EvolutionError {
    /// Phase A: export_state() timed out.
    ExportTimeout,
    /// Phase A: import_state(DryRun::No) timed out.
    ImportTimeout,
    /// Phase A: import_state() returned an error.
    ImportFailed(StateImportError),
    /// Pre-flight: new component version incompatible with old state.
    VersionIncompatible(EvolutionCompatError),
    /// Pre-flight: ELF signature verification failed.
    SignatureInvalid,
    /// Phase A': quiescence could not be achieved within the timeout.
    QuiescenceTimeout,
    /// Phase A': scheduler runqueue is too large for safe evolution.
    RunqueueTooLarge { cpu: u32, nr_running: u32 },
    /// Phase A'/B: KABI registry write-side mutex was contended.
    KabiRegistryContended,
    /// Phase B: primitive swap error (remap failure, IPI failure, etc.).
    PrimitiveError(EvolutionPrimitiveError),
    /// Pre-flight: batch compatibility check failed.
    CompatError(EvolutionCompatError),
    /// ELF loading: invalid or corrupt component ELF.
    ElfLoadFailed,
    /// Watchdog: evolution did not complete within the hard deadline.
    WatchdogTimeout,
}

#[derive(Debug)]
pub enum BootVerifyError {
    /// Evolvable LMS signature does not verify against the baked public key
    SignatureInvalid,
}

/// Sentinel empty pending-ops instance for components that do not use
/// stateful quiescence (e.g., watchdog single-component revert).
///
/// `PendingOpsPerCpu` uses `Vec<PendingOpSlot>` for its `slots` field,
/// which cannot be const-initialized (requires runtime CPU count discovery).
/// Instead, `EMPTY_PENDING_OPS` is initialized at boot via
/// `init_empty_pending_ops()` which allocates the `Vec` storage and
/// zeros all slots. The `total_pending` counter starts at 0.
///
/// The watchdog NMI handler and other cold-path callers that need an
/// empty pending ops reference use `EMPTY_PENDING_OPS.get()` after boot.
static EMPTY_PENDING_OPS: OnceCell<PendingOpsPerCpu> = OnceCell::new();

/// Initialize the sentinel empty pending-ops at boot (Phase 1.2, after
/// per-CPU sections are allocated). Called once by the evolution subsystem
/// init function.
pub fn init_empty_pending_ops() {
    // `PendingOpSlot` contains atomics and is not `Clone`, so the per-CPU
    // Vec is built element-by-element rather than with `vec![elem; n]`.
    let empty = PendingOpsPerCpu {
        slots: (0..nr_cpus())
            .map(|_| PendingOpSlot {
                buf: ArrayVec::new(),
                has_pending: AtomicU8::new(0),
            })
            .collect(),
        total_pending: AtomicU64::new(0),
    };
    if EMPTY_PENDING_OPS.set(empty).is_err() {
        panic!("init_empty_pending_ops called twice");
    }
}

Why this split: The evolution primitive is ~2-3 KB of straight-line code with no loops (except the page remap iteration, which is bounded by image size), no heap allocation, no external dependencies beyond arch::current::mm and the IPI mechanism. This is a tractable Verus verification target — comparable in complexity to the KABI dispatch trampoline. The previous ~15 KB monolithic evolution framework included ELF parsing (~4 KB), ML-DSA-65 signature verification (~5 KB), symbol resolution (~2 KB), and Phase A/A'/B/C orchestration (~4 KB) — none of which needs to be non-replaceable.

13.18.7.7 Evolvable: Evolution Orchestration (~12-13 KB, replaceable)

Everything that the evolution framework does except the atomic swap itself:

| Responsibility | Why it can be in Evolvable |
|---|---|
| ELF loader | Parses new component ELF, performs relocations, allocates pages. Runs in Phase A (before stop-the-world). If it has a bug, we can live-replace the orchestration itself. |
| ML-DSA-65 signature verifier | Verifies component signatures against the trust anchor chain. Runs in Phase A. If the PQC standard evolves (e.g., ML-DSA → SLH-DSA rotation), replace the verifier without reboot. |
| Phase A/A'/B/C sequencer | Orchestrates the full replacement flow: state export, quiescence, pending-op interception, calling evolution_apply(), Phase C activation, watchdog setup. The sequencer calls into Nucleus's primitive for the atomic swap. |
| Symbol resolution | Resolves symbols between the new component and existing loaded components. String matching + vtable index lookup. |
| State export/import coordination | Calls EvolvableComponent::export_state() on the old component and import_state() on the new. Manages chained migrations (version v(K) → v(K+8)). |
| PendingOpsPerCpu management | Sets up the per-CPU pending op slots before quiescence, monitors total capacity during Phase A', extends deadlines if needed. (The per-CPU slot structure is transferred by the Nucleus primitive; the management logic is Evolvable.) |
| DATA_FORMAT_EPOCH | Manages data format versioning and extension arrays for non-replaceable data structures. |
| Version compatibility checks | Validates state_version() ordering, vtable compatibility, chain length bounds. |
| Rollback logic | If the new component crashes within the watchdog window, orchestration reactivates the old component from retained state by calling evolution_apply() again with the old vtable. |

Self-evolution: The evolution orchestration can evolve itself. The sequence:

  1. A new orchestration image arrives (signed, verified by the currently-running old orchestration).
  2. The old orchestration calls evolution_apply() (Nucleus primitive) to install the new orchestration.
  3. The new orchestration is now active. If it crashes within the watchdog window, the Nucleus primitive plus the old orchestration (retained in memory) handle rollback — the rollback path uses only the Nucleus primitive, not orchestration logic.

Bootstrap trust chain: At boot (Phase 0.8), Nucleus verifies the embedded Evolvable image against an LMS signature (NIST SP 800-208) using a public key baked into Nucleus's .rodata. LMS verification is purely hash-based (SHAKE256 / Keccak-f[1600], the same permutation Nucleus uses for SHA3-256), adding ~1-3 KB of verifier code to Nucleus. The LMS public key is NOT tied to a specific Evolvable version — any Evolvable signed by the corresponding build system private key passes verification. This decouples Nucleus releases from Evolvable releases entirely.

After boot, the running orchestration verifies all subsequent evolution payloads using ML-DSA-65 signatures against the trust anchor chain (Section 9.3). The chain of trust:

build system LMS key → Nucleus lms_verify_shake256() → initial Evolvable orchestration
→ ML-DSA-65 trust anchor chain → all subsequent evolution operations

Two signature schemes at the boundary: LMS in the verified nucleus (minimal code, reuses existing Keccak), ML-DSA-65 in the replaceable orchestration (richer features, rotatable keys). See Section 2.21.

13.18.8 Evolution Framework Formal Invariants

The live kernel evolution primitive (Nucleus) is a Priority 1 verification target — it is the irreducible mechanism that allows all other kernel components to be fixed without reboot. A bug in the evolution primitive itself cannot be live-fixed. The following invariants are split between the Nucleus primitive and the Evolvable orchestration.

Nucleus primitive invariants (must be formally verified — cannot be live-fixed):

INV-1 through INV-1b and INV-6 apply to the primitive. INV-2 through INV-5 and INV-7 are enforced by orchestration (live-fixable if buggy).

The following invariants must hold across any replacement sequence and are candidates for formal verification via Verus (Section 24.4).

13.18.8.1 Safety Invariants (must hold at all times)

INV-1: Atomic visibility. (Nucleus primitive — formally verified) After Phase B completes (IPI released), every CPU observes exactly one of: (a) the old component's vtable, or (b) the new component's vtable. There is no state where CPU A dispatches to the old vtable while CPU B dispatches to the new vtable for the same component. Formally: ∀ cpu: vtable_ptr[cpu][component] == old_vtable ∨ vtable_ptr[cpu][component] == new_vtable, and after IPI completion: ∀ cpu: vtable_ptr[cpu][component] == new_vtable.

INV-2: No lost operations. (Evolvable orchestration — live-fixable) Every vtable call that arrives between Phase A' entry (quiescing = true) and Phase C1 completion (pending_ops drained) is either: (a) completed by the old component before the swap, or (b) enqueued in pending_ops and replayed by the new component after the swap. No operation is silently dropped. Formally: |completed_old| + |pending_ops| + |completed_new| == |total_calls|.

INV-3: State equivalence. (Evolvable orchestration — live-fixable) After import_state(export_state()), the new component produces outputs equivalent to the old component for the same inputs. This is specified per-component (each component defines its equivalence relation in its EvolvableComponent documentation). The framework guarantees that export_state() captures ALL mutable state owned by the component.

INV-4: Rollback safety. (Split: Nucleus primitive performs the rollback swap; Evolvable orchestration manages retained state and watchdog trigger) If the new component crashes within the watchdog window, the old component can be reactivated from the retained serialized state. The retained state is valid (CRC32C verified) and represents a consistent snapshot (taken during the quiescent Phase A' when no in-flight operations existed). Formally: import_state(retained_state) succeeds ⟹ old_component.equivalent_to(state_at_phase_A'). The Nucleus primitive's contribution to rollback is simply calling evolution_apply() with the old vtable — the same mechanism used for the forward swap.

On rollback, all PendingOps are replayed against the restored (old) component. The replay uses the same dispatch mechanism as normal Phase C1 activation: ops are drained from PendingOpsPerCpu in per-CPU FIFO order and dispatched through the old component's vtable. If a PendingOp targets a method that exists in the old component, it executes normally. This preserves INV-2 (no lost operations) across rollback — operations intercepted during the failed evolution's Phase A' are not discarded but are executed against the restored component.

INV-5: Generation monotonicity. (Evolvable orchestration — live-fixable) The EvolvableComponent::state_version() returned by the new component is strictly greater than the old component's version. The evolution framework rejects replacements where new.state_version() <= old.state_version(). This prevents accidental downgrade and ensures migration chains are well-ordered.

INV-6: PendingOpsPerCpu transfer integrity. (Nucleus primitive — formally verified) The PendingOpsPerCpu slot transfer during evolution_apply() preserves all enqueued operations. The primitive transfers the per-CPU slot structure pointer with Release ordering while all CPUs are stopped, ensuring no operation is lost during the transfer window. Each per-CPU slot is a bounded SPSC queue. During Phase A', each CPU writes only to its own PendingOpSlot (single producer: the interception path on that CPU) and the drain thread reads all CPU slots during Phase C1 (single consumer: the activation path). No slot overflows: if total_pending == sum(slot.buf.len()) == PENDING_OPS_TOTAL_CAPACITY, the quiescence deadline is extended rather than dropping an operation (deadline extension is Evolvable orchestration). Formally: ∀ t: total_pending[t] ≤ PENDING_OPS_TOTAL_CAPACITY.

INV-7: Trust chain continuity. (Evolvable orchestration — live-fixable; boot verification is Nucleus) After live evolution, the kernel's trust anchor chain (Section 9.3) is intact. The new kernel component's signature was verified against the active trust anchor before Phase A began. The trust anchor itself is stored in BSS and survives the evolution. No component replacement can bypass signature verification. At boot, lms_verify_shake256() (Nucleus) establishes the initial trust via the baked LMS public key; all subsequent verification is performed by Evolvable's ML-DSA-65 verifier.

13.18.8.2 Liveness Invariants (must eventually hold)

LIV-1: Quiescence termination. (Evolvable orchestration — live-fixable) Phase A' completes within quiescence_deadline_ms (configurable per component). If in-flight operations do not drain within the deadline, the evolution is aborted and the old component resumes unchanged. The system is never stuck in a quiescing state.

LIV-2: Watchdog termination. (Evolvable orchestration — live-fixable) The post-swap watchdog fires within watchdog_timeout_ms. If the new component does not acknowledge health within this window, automatic rollback to the old component occurs. The system never remains in the watchdog-pending state indefinitely.

13.18.8.3 Verification Approach

The Nucleus/Evolvable split dramatically reduces the verification burden. Only INV-1 and INV-6 (Nucleus primitive invariants) require formal Verus proofs that cannot be updated after deployment. INV-2 through INV-5, INV-7, LIV-1, and LIV-2 are enforced by Evolvable orchestration — if a bug is found, the orchestration itself is live-replaced.

These invariants are expressed as Verus proof obligations (Section 24.4):

  • INV-1, INV-6: (Nucleus — must be formally verified) Verified via Verus model of the stop-the-world IPI protocol and SPSC ring buffer. The proof shows that the IPI serializes the vtable pointer update and that the ring's head/tail invariants hold under single-producer/single-consumer access.
  • INV-2: (Evolvable — live-fixable) Verified by showing that the trampoline interception path and the Phase C1 drain path are exhaustive (every call site is intercepted during quiescence).
  • INV-3: (Evolvable — per-component obligation) The framework provides a test harness (evolution_roundtrip_test!) that asserts import_state(export_state(component)) ≅ component for each registered EvolvableComponent.
  • INV-4: (Split — Nucleus swap + Evolvable state management) The Nucleus primitive's contribution is that evolution_apply() works identically for forward and rollback swaps. Evolvable orchestration ensures CRC32C integrity and state format version compatibility imply successful import.
  • INV-5: (Evolvable — live-fixable) Trivially verified at load time (one integer comparison; rejection path returns EINVAL).
  • INV-7: (Evolvable — live-fixable; boot LMS verification is Nucleus) Verified by showing that the evolution orchestration calls crypto_verify_signature() before Phase A entry and that the trust anchor chain is in BSS (not in any replaceable component's state). At boot, Nucleus's lms_verify_shake256() establishes initial trust via the baked LMS public key (Section 2.21).

13.18.9 KABI Service Live Replacement

Sections 12.8.1-12.8.6 cover live replacement of core kernel components (scheduler, page replacement, I/O scheduler). Section 11.9 covers driver crash recovery and reload. Section 19.9 covers policy module hot-swap. This section closes the remaining gap: KABI services — the major kernel subsystems (VFS, networking stack, block layer, etc.) that are more stateful than policy modules, have complex inter-subsystem dependencies, and serve as infrastructure for drivers and userspace.

Why a separate mechanism: KABI services differ from core components and drivers:

| Aspect | Core component (§13.8) | Policy module (§19.7) | Driver (§11.7) | KABI service (this section) |
|---|---|---|---|---|
| State complexity | Medium (~64-256 KB) | Low (~4-16 KB) | Medium (device-specific) | High (connection tables, routing FIB, mount tree, block queues — MB-scale) |
| Active connections | None (internal) | None (internal) | Device I/O queues | Thousands of client syscalls in-flight |
| Cross-subsystem deps | Few (self-contained) | None (stateless policy) | KABI vtable to Core | Many (VFS↔block↔net↔cgroup) |
| Replacement frequency | Rare (kernel update) | Moderate (policy tuning) | On crash | Development cycle (potentially frequent during agentic development) |

13.18.9.1.1 Syscall Layer Architecture During Live Replacement

The syscall path has three layers, each with different replacement characteristics.

UmkaOS implements ~80% of Linux syscalls natively with identical POSIX semantics (Section 22.1). For these, the SysAPI layer performs only thin representation conversion (int fd → CapHandle<FileDescriptor>, void *buf → UserPtr<T>). The remaining ~15% need heavier translation (struct layout, flag reinterpretation), and ~5% are UmkaOS-native syscalls with no Linux equivalent. All three categories flow through the same replaceable layers:

Userspace process
  │  syscall instruction (SYSCALL / SVC / ecall / sc)
  │  nr > 0: Linux compat     nr < 0: UmkaOS native
┌─────────────────────────────────────────────────────────┐
│ Layer 1: Syscall Entry (umka-core, Tier 0)              │ NOT replaceable
│   Hardware trap handler. Saves registers, sign-extends  │ (per-arch asm,
│   syscall number (cdqe / sxtw / implicit on 32-bit).   │  stable by design)
│   Calls Layer 2 entry function via atomic fn pointer.   │
│   Per-arch: SYSCALL/SYSRET, SVC/ERET, ecall/mret, sc.  │
│   32-on-64 compat: separate entry path (int 0x80 /     │
│   sysenter), calls 32-bit compat dispatch in Layer 2.   │
└─────────────┬───────────────────────────────────────────┘
              │ function pointer (updated atomically on Layer 2 replacement)
┌─────────────────────────────────────────────────────────┐
│ Layer 2: SysAPI + Native (umka-sysapi, KABI service)    │ REPLACEABLE
│                                                         │ (via §13.8.7)
│   Bidirectional dispatch table (Section 19.1.4.1):      │
│     ┌──────────────────┬────────────────────┐           │
│     │ UmkaOS native    │ Linux compat        │           │
│     │ [ORIGIN-N]       │ [ORIGIN+N]          │           │
│     └──────────────────┴────────────────────┘           │
│   Positive nr → Linux handler (80% thin repr, 15%       │
│     heavy translation, 5% compat shims)                 │
│   Negative nr → UmkaOS native handler (caps, drivers,   │
│     WEA, sync, debug — direct dispatch, no multiplexer) │
│                                                         │
│   State: dispatch table + compat caches (~72KB).        │
│   Nearly stateless per-call.                            │
└─────────────┬───────────────────────────────────────────┘
              │ KABI vtable call
┌─────────────────────────────────────────────────────────┐
│ Layer 3: Subsystem Service (VFS, net, block, etc.)      │ REPLACEABLE
│   Actual implementation. Owns the real state            │ (via §13.8.7)
│   (connection tables, mount tree, page cache, etc.).    │
│   Each is a separate KABI service with its own          │
│   vtable and EvolvableComponent implementation.         │
│                                                         │
│   Some services delegate to non-replaceable core:       │
│   ┌───────────────────────────────────────────────┐     │
│   │ Core (memory allocator, page tables, caps,    │     │
│   │ KABI dispatch) — NOT replaceable, verified.   │     │
│   └───────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────┘

100% of syscall implementations are live-fixable, because the logic lives in replaceable layers:

| Bug location | Example | Fix by replacing |
|---|---|---|
| Thin repr conversion (80% of syscalls) | fd → CapHandle off-by-one | SysAPI layer (Layer 2) |
| Heavy ABI translation (15%) | clone3 flag misinterpretation | SysAPI layer (Layer 2) |
| UmkaOS-native dispatch (5%) | umka_op::DSM_MAP handler error | SysAPI layer (Layer 2) |
| Subsystem logic | read() returns wrong data, TCP state bug | KABI service (Layer 3) |
| seccomp filter evaluation | False-positive filter match | SysAPI layer (Layer 2) |

Exception: bugs in the non-replaceable core components (memory allocator data, page table hardware ops, capability data, evolution primitive, KABI dispatch trampoline) are NOT live-fixable. These are verified via formal methods (Section 24.4) and tested extensively to ensure correctness. Their corresponding policy/orchestration layers (PhysAllocPolicy, VmmPolicy, CapPolicy, evolution orchestration) ARE live-replaceable via atomic pointer swap with MonotonicVerifier safety guarantees (see Section 9.1). Syscalls that delegate to core (e.g., mmap → page table manager, brk → allocator) have their translation and VMA management in replaceable Layer 3, but the final page table manipulation is in verified core.

During live replacement of a Layer 3 service (e.g., networking stack):

- Layer 1 (syscall entry) is unchanged — trap handler continues dispatching.
- Layer 2 (compat) is unchanged — same Linux ABI translation.
- Layer 3 vtable pointer is atomically swapped in Phase B. New socket() calls transparently hit the new networking stack. In-flight calls complete on the old stack (quiesced in Phase A').

During live replacement of Layer 2 (SysAPI layer):

- Layer 1 blocks new syscall entries with -ERESTARTSYS during quiescence.
- The SysAPI layer's vtable is swapped. New syscalls use the new translation logic.
- Layer 3 services are unaffected — they see the same internal UmkaOS call interface.
- Use case: fix a syscall struct layout bug, add a new syscall number, update seccomp filter evaluation logic — all without rebooting.

Why Layer 1 is not replaceable: The syscall entry trampoline is ~50 instructions of per-arch assembly (register save/restore, stack switch, dispatch table index). It is verified (Section 24.4) and changes approximately never. Making it replaceable would add complexity to the most performance-critical path in the kernel (every syscall crosses this boundary) for zero practical benefit.

Why the dispatch table is not replaceable: The dispatch table is a static array of vtable entry indices, mapping Linux syscall numbers to KABI vtable slots. Adding a new syscall means adding a new entry to this table — which is done by replacing the SysAPI layer (Layer 2), not the table itself. The table is just an index; the implementation lives in the replaceable layers above.
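A minimal userspace model of the bidirectional, ORIGIN-centered indexing described above — positive syscall numbers map to Linux-compat slots, negative numbers to UmkaOS-native slots. The type name `DispatchTable`, the slot counts, and the handler signature are illustrative, not the kernel's actual definitions:

```rust
// Toy model of the bidirectional dispatch table (Section 19.1.4.1).
// Slot counts and the fn signature are assumptions for illustration.
const LINUX_SLOTS: usize = 8;
const NATIVE_SLOTS: usize = 8;

struct DispatchTable {
    /// Indices [0, NATIVE_SLOTS) hold native handlers (negative nr, ORIGIN-N);
    /// indices [NATIVE_SLOTS, ...) hold Linux handlers (positive nr, ORIGIN+N).
    slots: [Option<fn(u64) -> i64>; NATIVE_SLOTS + LINUX_SLOTS],
}

impl DispatchTable {
    /// Map a signed syscall number to a table index around the origin.
    fn index(nr: i64) -> Option<usize> {
        if nr >= 0 && (nr as usize) < LINUX_SLOTS {
            Some(NATIVE_SLOTS + nr as usize)        // ORIGIN+N: Linux compat
        } else if nr < 0 && ((-nr) as usize) <= NATIVE_SLOTS {
            Some(NATIVE_SLOTS - ((-nr) as usize))   // ORIGIN-N: UmkaOS native
        } else {
            None
        }
    }

    fn dispatch(&self, nr: i64, arg: u64) -> i64 {
        match Self::index(nr).and_then(|i| self.slots[i]) {
            Some(handler) => handler(arg),
            None => -38, // -ENOSYS: unmapped syscall number
        }
    }
}

fn main() {
    let mut t = DispatchTable { slots: [None; NATIVE_SLOTS + LINUX_SLOTS] };
    t.slots[NATIVE_SLOTS + 1] = Some((|a| a as i64 + 1) as fn(u64) -> i64); // Linux nr 1
    t.slots[NATIVE_SLOTS - 2] = Some((|a| -(a as i64)) as fn(u64) -> i64);  // native nr -2
    assert_eq!(t.dispatch(1, 41), 42);
    assert_eq!(t.dispatch(-2, 7), -7);
    assert_eq!(t.dispatch(99, 0), -38); // outside the table → -ENOSYS
    println!("ok");
}
```

Note how "adding a syscall" in this model is purely a matter of populating a new slot in the replacement Layer 2 image — the index arithmetic itself never changes.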

13.18.9.1.2 Service State Serialization Protocol

KABI services implement EvolvableComponent (Section 13.18) with an additional constraint: their state may be too large for a single ComponentState::data blob. Large services use incremental state export:

/// Extension for KABI services with large state (>1 MB).
/// State is exported in chunks to avoid monopolizing memory during Phase A.
pub trait ServiceEvolvable: EvolvableComponent {
    /// Estimate total serialized state size (bytes). Used by the evolution
    /// framework to pre-allocate buffer space and validate that sufficient
    /// memory is available before starting Phase A.
    fn estimated_state_size(&self) -> usize;

    /// Export state incrementally. Called repeatedly until it returns
    /// `ExportChunk::Done`. Each call produces one chunk (max 256 KB).
    /// The service continues handling requests between chunks — only
    /// per-chunk consistency is required (the final re-export in
    /// Phase A' captures the quiesced state atomically).
    fn export_state_chunk(&self, cursor: &mut ExportCursor) -> ExportChunk;

    /// Import state incrementally. Called with chunks in the same order
    /// as export. Returns Ready when all chunks are consumed and the
    /// service is initialized.
    fn import_state_chunk(&self, chunk: &[u8]) -> Result<ImportStatus, MigrationError>;
}

pub enum ExportChunk {
    /// More chunks follow. `data` contains up to 256 KB of serialized state.
    More { data: Vec<u8> },
    /// Final chunk. State export is complete.
    Done { data: Vec<u8> },
}

pub enum ImportStatus {
    /// More chunks needed.
    NeedMore,
    /// All state imported. Service is ready to activate.
    Ready,
}
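A userspace toy of the chunk-loop contract — substituting a byte-vector "service" (`ToyService`, an illustrative name) and a 4-byte chunk size for the real 256 KB — shows how orchestration drives `export_state_chunk()` and `import_state_chunk()` until `Done`/`Ready`:

```rust
// Toy model of the incremental export/import protocol above.
// CHUNK = 4 bytes stands in for the real 256 KB chunk limit.
const CHUNK: usize = 4;

enum ExportChunk { More { data: Vec<u8> }, Done { data: Vec<u8> } }
enum ImportStatus { NeedMore, Ready }

struct ToyService { state: Vec<u8> }

impl ToyService {
    /// Export one chunk starting at `cursor`, advancing the cursor.
    fn export_state_chunk(&self, cursor: &mut usize) -> ExportChunk {
        let end = (*cursor + CHUNK).min(self.state.len());
        let data = self.state[*cursor..end].to_vec();
        *cursor = end;
        if end == self.state.len() { ExportChunk::Done { data } }
        else { ExportChunk::More { data } }
    }
}

struct ToyImporter { buf: Vec<u8>, expected: usize }

impl ToyImporter {
    fn import_state_chunk(&mut self, chunk: &[u8]) -> ImportStatus {
        self.buf.extend_from_slice(chunk);
        if self.buf.len() >= self.expected { ImportStatus::Ready }
        else { ImportStatus::NeedMore }
    }
}

fn main() {
    let old = ToyService { state: (0u8..10).collect() };
    let mut importer = ToyImporter { buf: Vec::new(), expected: old.state.len() };
    let mut cursor = 0usize;
    // Orchestration loop: pull chunks from the old service, push into the new.
    loop {
        match old.export_state_chunk(&mut cursor) {
            ExportChunk::More { data } => { importer.import_state_chunk(&data); }
            ExportChunk::Done { data } => {
                assert!(matches!(importer.import_state_chunk(&data), ImportStatus::Ready));
                break;
            }
        }
    }
    assert_eq!(importer.buf, old.state); // roundtrip: import(export(s)) ≅ s
    println!("roundtrip ok");
}
```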
13.18.9.1.3 export_state_chunk() Execution Context

export_state_chunk() is called on the old service during Phase A while the service continues handling normal requests. This concurrent execution requires a well-defined threading and locking protocol to prevent data races between the export path and the service's normal request-handling threads.

Dedicated export kthread: The evolution orchestration (Evolvable) spawns a dedicated kernel thread (kthread_create("evolution-export-%s", service_name)) for each service undergoing evolution. This kthread is the sole caller of export_state_chunk(). It runs at SCHED_NORMAL priority with nice +5 (background priority) to avoid starving normal request processing.

/// Per-service export context, created by orchestration at Phase A entry.
/// The export kthread holds an exclusive reference to this structure.
pub struct ExportContext {
    /// The service being exported. Shared reference only — the export kthread
    /// does not hold exclusive access to the service itself.
    pub service: Arc<dyn ServiceEvolvable>,
    /// Export cursor tracking incremental progress.
    pub cursor: ExportCursor,
    /// Pre-allocated chunk buffer (256 KB). Reused across chunks to avoid
    /// repeated allocation during the export loop.
    pub chunk_buffer: Vec<u8>,
    /// Export kthread handle (for join/abort).
    pub kthread: KthreadHandle,
    /// Completion flag, set when export_state_chunk() returns ExportChunk::Done.
    pub complete: AtomicBool,
}

Locking protocol between export kthread and ring consumer threads:

The export kthread calls export_state_chunk() which reads internal service state. Normal request-handling threads (KABI ring consumers, syscall dispatch threads) write service state concurrently. The following rules prevent data races:

  1. Per-chunk RCU read-side sections: export_state_chunk() captures each chunk's worth of state under an RCU read-side critical section. This ensures that data structures being read by the export path cannot be freed by concurrent writers. RCU read locks are held for the duration of a single chunk (~256 KB serialization time, typically <1ms) — not for the entire export.

  2. Service-internal locks respected: export_state_chunk() acquires the same internal locks that normal request handlers use (e.g., socket hash lock for networking, dentry lock for VFS). The export kthread contends for these locks like any other reader. Since export reads are bounded (~256 KB per chunk) and run at background priority, lock contention is minimal.

  3. No exclusive locks across chunks: The export kthread must not hold any lock between successive export_state_chunk() calls. Each call is a self-contained operation that acquires and releases its own locks. This allows normal request processing to proceed between chunks without starvation.

  4. Cursor consistency: The ExportCursor tracks which portion of the state has been exported (e.g., dentry tree traversal position, socket table hash bucket index). The cursor is private to the export kthread — no other thread reads or writes it. Cursor values are indices into stable data structures (XArray IDs, hash bucket numbers) that remain valid even as entries are added/removed concurrently (removed entries are RCU-freed after the export chunk completes).

  5. Final re-export under quiescence: The incremental export during Phase A is a best-effort snapshot — concurrent modifications may cause inconsistencies between chunks exported at different times. The final re-export in Phase A' runs under quiescence (no concurrent writers), producing the authoritative consistent snapshot. The Phase A incremental export serves to pre-populate the import buffer, reducing the time spent under quiescence.

Abort handling: If the evolution is aborted (e.g., quiescence deadline exceeded, import_state failure), orchestration sends SIGKILL to the export kthread and joins it with a 100ms timeout. If the kthread does not exit within the timeout (stuck in a lock), orchestration marks it for deferred cleanup and proceeds with abort. The kthread holds no kernel resources that would leak — its ExportContext is freed by the join path, and any RCU read-side sections are automatically exited when the kthread terminates.
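Rules 2 and 3 above can be sketched in userspace with std threads, a Mutex standing in for the service-internal lock (the real path uses RCU read-side sections plus the subsystem's own locks, and the export loop runs on the dedicated background kthread):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Sketch of per-chunk locking: the export loop takes the service's lock for
// exactly one chunk and releases it before the next (rule 3), so the writer
// thread — standing in for normal request handling — proceeds in the gaps.
fn main() {
    let table = Arc::new(Mutex::new(vec![0u32; 64]));
    let writer = {
        let t = Arc::clone(&table);
        thread::spawn(move || {
            for i in 0..1000 {
                t.lock().unwrap()[i % 64] += 1; // concurrent state mutation
            }
        })
    };
    // Export loop: one self-contained lock acquisition per chunk (rule 2).
    let mut chunks: Vec<Vec<u32>> = Vec::new();
    for start in (0..64).step_by(16) {
        let guard = table.lock().unwrap();   // acquire for this chunk only
        chunks.push(guard[start..start + 16].to_vec());
        drop(guard);                         // nothing held between chunks
    }
    writer.join().unwrap();
    assert_eq!(chunks.iter().map(|c| c.len()).sum::<usize>(), 64);
    println!("exported {} chunks", chunks.len());
}
```

The per-chunk snapshot may be stale relative to the writer, mirroring the text: Phase A export is best-effort, and the Phase A' re-export under quiescence produces the authoritative snapshot.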

State size estimates for major KABI services:

| Service | Typical state | Key data structures |
|---|---|---|
| VFS | 2-50 MB | Dentry cache, inode cache, mount tree, open fd table |
| Networking | 5-100 MB | Socket table, routing FIB, connection tracking, NAPI state |
| Block layer | 1-10 MB | Request queues, I/O scheduler state, dm target tables |
| Cgroups | 0.5-5 MB | Cgroup hierarchy, per-cgroup counters, BPF programs |

13.18.9.1.4 Replacement Flow (Extended for Services)

KABI service replacement extends the core Phase A/A'/B/C flow with service-specific additions:

Phase A (Preparation) — concurrent with normal operation:

  1. New service binary loaded and verified (KABI signature, vtable compatibility).
  2. State export begins via export_state_chunk(). The old service continues handling requests. Chunks are buffered in kernel memory.
  3. New service begins importing chunks via import_state_chunk().
  4. Client notification: ServiceDrainNotify is sent to all connected KABI clients (drivers, other services). Clients with the alternative_peer field set may preemptively reconnect to an alternative provider. Clients without alternatives complete in-flight operations normally.

Phase A' (Quiescence) — bounded deadline (default 100ms for services, vs 10ms for core components, because services have more in-flight state):

  1. New syscall entries to this service are blocked at the umka-core domain boundary. Callers receive -ERESTARTSYS (kernel auto-retries after the swap).
  2. In-flight syscalls are allowed to complete (bounded by the quiescence deadline).
  3. Drain all DomainRingBuffer pairs for the target component. The kernel-side consumer drains remaining entries from each MPSC request ring; the kernel-side producer waits for the driver-side consumer to drain each SPSC completion ring (tail catches head). Bounded by a ring drain timeout (default 5ms). If the timeout expires before all rings are drained, the replacement is aborted. Ring state (head, tail, published indices) is captured into the EvolutionRingSnapshot for transfer to the replacement component at Phase B.

/// Snapshot of a DomainRingBuffer's position state at quiescence time.
/// Captured during Phase A' (step 3) after ring drain completes, and
/// consumed by the replacement component during Phase B to reconstruct
/// ring position continuity.
///
/// The snapshot captures only index/size metadata — ring data pages
/// are transferred separately via the DomainRingBuffer pointer swap.
/// After Phase B, the new component uses this snapshot to verify that
/// the ring was fully drained (head == tail == published) and to
/// initialize its own ring tracking state.
#[repr(C)]
pub struct EvolutionRingSnapshot {
    /// Write claim position at quiescence time. After a successful drain,
    /// `head == published` (no in-flight writes) and `head == tail`
    /// (consumer fully caught up).
    pub head: u64,
    /// Consumer read position at quiescence time.
    pub tail: u64,
    /// Published position at quiescence time. In a fully drained ring,
    /// `published == head` — all claimed slots have been published.
    pub published: u64,
    /// Ring capacity (number of entries). Power of two. Used by the
    /// replacement component to verify the ring geometry matches its
    /// expected layout.
    pub capacity: u32,
    /// Bytes per entry. Fixed at ring creation time. The replacement
    /// component verifies this matches its expected entry size —
    /// a mismatch indicates an incompatible ring format and aborts
    /// the evolution.
    pub entry_size: u32,
}
// EvolutionRingSnapshot: head(u64=8) + tail(u64=8) + published(u64=8) +
//   capacity(u32=4) + entry_size(u32=4) = 32 bytes.
const_assert!(core::mem::size_of::<EvolutionRingSnapshot>() == 32);
  4. Final atomic state re-export on the now-quiesced service.
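The Phase B drain check implied by the snapshot layout can be sketched as follows. `verify_drained` and the `RingError` variants are illustrative names; the fully-drained condition (head == tail == published) and the geometry checks come from the field documentation above:

```rust
// Sketch of the drain verification the replacement component performs on an
// EvolutionRingSnapshot at Phase B. A fully drained ring has
// head == tail == published; any mismatch, or a geometry mismatch, aborts.
#[repr(C)]
pub struct EvolutionRingSnapshot {
    pub head: u64,
    pub tail: u64,
    pub published: u64,
    pub capacity: u32,
    pub entry_size: u32,
}

#[derive(Debug, PartialEq)]
enum RingError { NotDrained, GeometryMismatch }

fn verify_drained(s: &EvolutionRingSnapshot, want_cap: u32, want_entry: u32)
    -> Result<(), RingError>
{
    if s.capacity != want_cap || s.entry_size != want_entry
        || !s.capacity.is_power_of_two()
    {
        return Err(RingError::GeometryMismatch); // incompatible ring format
    }
    if s.head != s.tail || s.head != s.published {
        return Err(RingError::NotDrained);       // in-flight entries remain
    }
    Ok(())
}

fn main() {
    let drained = EvolutionRingSnapshot {
        head: 128, tail: 128, published: 128, capacity: 256, entry_size: 64,
    };
    assert_eq!(verify_drained(&drained, 256, 64), Ok(()));
    let inflight = EvolutionRingSnapshot {
        head: 130, tail: 128, published: 129, capacity: 256, entry_size: 64,
    };
    assert_eq!(verify_drained(&inflight, 256, 64), Err(RingError::NotDrained));
    // Layout matches the const_assert above: 3×u64 + 2×u32 = 32 bytes.
    assert_eq!(core::mem::size_of::<EvolutionRingSnapshot>(), 32);
    println!("ok");
}
```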

Phase B (Atomic Swap) — same as core components (~1-10 μs stop-the-world):

  1. IPI stop-the-world.
  2. KABI vtable pointer swap for the service.
  3. ServiceDrainNotify → ServiceBindNotify transition: all clients' KABI endpoints are atomically redirected to the new service's vtable.
  4. Pending ops ring transferred.
  5. CPUs released.

Phase C (Activation) — new service drains pending ops, then accepts new requests:

  1. New service replays queued operations.
  2. Blocked syscalls are retried (transparent to userspace — applications see at most a brief latency spike, not an error).
  3. Post-swap watchdog: 10 seconds for services (vs 5 seconds for core components) due to larger state and more complex activation.

Rollback: If the new service crashes within the watchdog window, the old service is reactivated from retained state. Client KABI endpoints are redirected back. If retained state is corrupted, the service is restarted fresh via initialize_fresh() — this is the VFS crash recovery path (Section 14.1).

13.18.9.1.5 Multikernel Rolling Replacement

UmkaOS's multikernel architecture (Section 11.1, Section 5.1) enables a testing strategy unavailable to monolithic kernels:

Test on one peer before rolling to all peers.

In a multikernel cluster (host + DPU peers, or multi-node cluster), a new version of a KABI service can be deployed to a single peer first:

1. DEPLOY: Load new service version on peer P1 only.
   - P1 runs the new version; all other peers run the old version.
   - Capability services (Section 5.1.2) ensure clients transparently route
     to whichever peer has an active provider.

2. SOAK: Run workload against P1 for a configurable soak period (default: 1 hour).
   - Monitor via FMA telemetry (Section 20.1): error rates, latency percentiles,
     memory usage, crash count.
   - If any anomaly detected: automatic rollback on P1. No other peers affected.

3. ROLL: If soak succeeds, roll to remaining peers one at a time.
   - Each peer goes through the Phase A/A'/B/C replacement flow.
   - `ServiceDrainNotify` with `alternative_peer` redirects clients to peers
     already running the new version during each peer's swap window.
   - Total cluster update time: N × (Phase A time + Phase B time) + soak time.
     For a 4-peer cluster: ~4 × 200ms + 1 hour ≈ 1 hour.

4. VERIFY: After all peers updated, run cluster-wide integration test.
   - DSM coherence check, DLM lock cycling, cross-peer RPC round-trip.

This rolling replacement model is inspired by Barrelfish's per-core kernel independence and Helios's satellite kernel model, adapted for UmkaOS's peer protocol. It provides zero-downtime kernel service updates across the cluster — the cluster as a whole never loses a capability service provider.
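The rollout-time formula in step 3 can be sanity-checked with the example figures. The per-phase split below (190ms Phase A, 10ms Phase B) is an assumption that merely decomposes the ~200ms per-peer figure from the text:

```rust
// Back-of-envelope check of: total = N × (Phase A + Phase B) + soak.
// The 190/10 ms split of the ~200 ms per-peer time is an assumed example.
fn main() {
    let peers = 4u64;
    let phase_a_ms = 190u64;        // state transfer per peer (assumed)
    let phase_b_ms = 10u64;         // atomic swap per peer (assumed, generous)
    let soak_ms = 3_600_000u64;     // 1 hour soak on the first peer
    let total_ms = peers * (phase_a_ms + phase_b_ms) + soak_ms;
    assert_eq!(total_ms, 3_600_800); // ≈ 1 hour — dominated by the soak period
    println!("total rollout: {} ms", total_ms);
}
```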

13.18.9.1.6 Driver Tier Promotion Protocol

The KABI driver model (Section 12.1) combined with tier isolation (Section 11.2) enables a structured development workflow where drivers are promoted through tiers as confidence increases:

Tier 2 (safest: Ring 3 + IOMMU)
  │  ← Develop and iterate here. Crashes are contained.
  │     Agent writes code, loads driver, tests. Repeat.
  ▼ Promotion criteria:
  │  ✓ Test suite passes (100% of KABI conformance tests)
  │  ✓ No crashes in N hours of stress testing (default: 4 hours)
  │  ✓ No memory leaks (tracked via KABI allocation registry)
  │  ✓ KABI signature valid (ML-DSA-65, Section 12.4)
Tier 1 (faster: Ring 0, MPK/POE isolated)
  │  ← Performance testing. Verify latency meets requirements
  │     without IOMMU overhead.
  ▼ Promotion criteria:
  │  ✓ Stable at Tier 1 for N hours (default: 24 hours)
  │  ✓ Performance within 10% of target (measured via FMA, Section 20.1)
  │  ✓ No isolation domain escapes (MPK/POE containment verified)
  │  ✓ Kernel team sign-off (manual gate for Tier 0)
Tier 0 (fastest: in-kernel, no isolation)
     ← Production deployment for performance-critical drivers.
        Only for drivers that genuinely need Tier 0 latency
        (APIC, timer, early serial — see Section 11.2.1).

Auto-demotion on crash:

| Event | Action |
|---|---|
| Tier 0 crash | Kernel panic (Tier 0 = trusted, no isolation). Investigate root cause. |
| Tier 1 crash (first) | Crash recovery (Section 11.9): reload at Tier 1. Log incident. |
| Tier 1 crash (3 within 60 seconds) | Auto-demote to Tier 2. Alert via FMA. |
| Tier 2 crash (first) | Restart driver. Log incident. |
| Tier 2 crash (5 within 60 seconds) | Mark DeviceState::Failed. Require manual reload. |
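The sliding-window counters behind these thresholds can be sketched as below. Timestamps are plain seconds and the type names (`CrashWindow`, `Action`) are illustrative; the kernel would use monotonic nanoseconds:

```rust
use std::collections::VecDeque;

// Sketch of the 60-second crash window: 3 crashes at Tier 1 demote to
// Tier 2; 5 crashes at Tier 2 mark the device Failed.
const WINDOW_SECS: u64 = 60;

struct CrashWindow { times: VecDeque<u64> }

impl CrashWindow {
    fn new() -> Self { Self { times: VecDeque::new() } }
    /// Record a crash at `now` and return how many crashes fall in the window.
    fn record(&mut self, now: u64) -> usize {
        self.times.push_back(now);
        while let Some(&t) = self.times.front() {
            if now - t >= WINDOW_SECS { self.times.pop_front(); } else { break; }
        }
        self.times.len()
    }
}

#[derive(Debug, PartialEq)]
enum Action { Reload, DemoteToTier2, MarkFailed }

fn on_tier1_crash(w: &mut CrashWindow, now: u64) -> Action {
    if w.record(now) >= 3 { Action::DemoteToTier2 } else { Action::Reload }
}

fn on_tier2_crash(w: &mut CrashWindow, now: u64) -> Action {
    if w.record(now) >= 5 { Action::MarkFailed } else { Action::Reload }
}

fn main() {
    let mut w = CrashWindow::new();
    assert_eq!(on_tier1_crash(&mut w, 0), Action::Reload);
    assert_eq!(on_tier1_crash(&mut w, 10), Action::Reload);
    assert_eq!(on_tier1_crash(&mut w, 20), Action::DemoteToTier2); // 3 in 60 s

    let mut w2 = CrashWindow::new();
    w2.record(0);
    w2.record(1);
    assert_eq!(on_tier1_crash(&mut w2, 120), Action::Reload); // old crashes aged out

    let mut w3 = CrashWindow::new();
    for t in 0..4 { assert_eq!(on_tier2_crash(&mut w3, t), Action::Reload); }
    assert_eq!(on_tier2_crash(&mut w3, 4), Action::MarkFailed); // 5 in 60 s
    println!("ok");
}
```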

Tier control interface:

/ukfs/kernel/drivers/<name>/tier          # Read current tier; write 0/1/2 to change
/ukfs/kernel/drivers/<name>/crash_count   # Crash counter (read)
/ukfs/kernel/drivers/<name>/soak_hours    # Hours since last crash (read)

Writing to tier validates against signing certificate max_tier, license ceiling, and architecture capability. Returns -EPERM if the requested tier exceeds the driver's allowed maximum. Auto-demotion on crash (3 crashes in 60 seconds) also writes to this file internally.
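A hedged sketch of that validation. The field names (`cert_max_tier`, `license_ceiling`, `arch_min_tier`) are illustrative, and the sketch assumes — per the tier ladder above — that lower tier numbers are more privileged, so "exceeding the allowed maximum" means requesting a more privileged (lower-numbered) tier than any of the three ceilings permits:

```rust
// Sketch of the tier-write validation. Error values mirror the documented
// -EPERM rejection; the struct and its fields are illustrative.
const EPERM: i32 = 1;
const EINVAL: i32 = 22;

struct DriverPolicy {
    cert_max_tier: u8,   // most privileged tier the signing cert allows
    license_ceiling: u8, // most privileged tier the license allows
    arch_min_tier: u8,   // most privileged tier this architecture supports
}

/// Validate a requested tier (0, 1, or 2). Lower numbers are more privileged,
/// so the effective floor is the least-privileged of the three ceilings.
fn write_tier(p: &DriverPolicy, requested: u8) -> Result<u8, i32> {
    if requested > 2 { return Err(-EINVAL); }
    let floor = p.cert_max_tier.max(p.license_ceiling).max(p.arch_min_tier);
    if requested < floor { Err(-EPERM) } else { Ok(requested) }
}

fn main() {
    // Cert and license permit Tier 1 at best; architecture could do Tier 0.
    let p = DriverPolicy { cert_max_tier: 1, license_ceiling: 1, arch_min_tier: 0 };
    assert_eq!(write_tier(&p, 2), Ok(2));       // Tier 2 always allowed
    assert_eq!(write_tier(&p, 1), Ok(1));       // cert permits Tier 1
    assert_eq!(write_tier(&p, 0), Err(-EPERM)); // Tier 0 exceeds cert max
    assert_eq!(write_tier(&p, 7), Err(-EINVAL));
    println!("ok");
}
```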

Cross-references:

- Tier isolation mechanisms: Section 11.2
- Crash recovery: Section 11.9
- KABI vtable and signing: Section 12.1
- Policy module hot-swap: Section 19.9
- Core component evolution: Sections 12.8.1-12.8.6 (above)
- Multikernel peer protocol: Section 5.1
- ServiceDrainNotify: Section 5.11
- FMA telemetry: Section 20.1
- Agentic live development workflow: Section 25.17
- Data format evolution: Section 13.18 (below)

13.18.9.2 Per-Subsystem ServiceEvolvable Specifications

Each KABI service must define its ServiceEvolvable implementation: the state schema (what is exported), the quiescence mechanism (how in-flight operations drain), and the dependency ordering (which services must be replaced first).

VFS (umka-vfs):

| Aspect | Specification |
|---|---|
| State schema | Dentry cache (name→inode mappings), inode cache (in-memory inode metadata), mount tree (mount point hierarchy), open file references (fd→OpenFile mappings per task) |
| Estimated size | 2-50 MB (depends on dentry/inode cache pressure) |
| Quiescence | Block new path resolutions at syscall entry. In-flight path walks complete via refcount: each PathWalk holds a dentry refcount; quiescence waits until all PathWalk refcounts reach zero (bounded by I/O timeout). I_RWSEM write locks are not held across quiescence — only reference counts. |
| Export format | Chunked: dentry tree (breadth-first serialization), inode table (sorted by ino), mount table (parent-first), fd table (per-task). Each chunk is self-contained (no forward references). |
| Dependencies | VFS depends on block layer (for filesystem I/O) and page cache (for cached data). VFS must be replaced after block layer if both are being replaced simultaneously. |

Networking stack (umka-net):

| Aspect | Specification |
|---|---|
| State schema | TCP connection table (per-socket state: seq numbers, window, congestion state), routing FIB (prefix→nexthop), NAPI poll lists, socket references (fd→socket), netfilter/conntrack state |
| Estimated size | 5-100 MB (scales with connection count and routing table) |
| Quiescence | Disable NAPI polling (set NAPI_STATE_DISABLE on all NAPI instances). Drain in-flight packets: wait until all NetBuf refcounts reach zero. TCP connections are paused (window = 0 advertisement). UDP is stateless — no drain needed. |
| Export format | Chunked: socket table (per-protocol), FIB entries (sorted by prefix), conntrack entries, per-NIC NAPI state. |
| Dependencies | Network stack depends on NIC drivers (Tier 1). When replacing umka-net only, NIC drivers are not replaced — NAPI callbacks are redirected via vtable swap. NIC drivers themselves can be replaced independently via the graceful Tier 1 replacement protocol (Section 13.18). The two replacements must not overlap: if both umka-net and a NIC driver need replacement, replace the NIC driver first, then umka-net. |

Scheduler state export (umka-core scheduler policy):

| Aspect | Specification |
|---|---|
| State schema | Per-CPU RunqueueSnapshot: task list sorted by vruntime with lag/deadline/weight per task, CBS server states (budget, period, deadline), DL task parameters (runtime, deadline, period) |
| Estimated size | 10-500 KB (scales with task count) |
| Quiescence | IPI all CPUs to scheduler quiescence point (between pick_next_task() calls). Each CPU's runqueue is locked; state is serialized atomically. Total quiescence time bounded by max(quiescence IPI latency) ≈ 1-5ms. |
| Export format | Single blob (small enough): SchedStateExport (defined below). |

/// Exported scheduler state for live evolution. Produced by `export_state()`,
/// validated by `is_compatible()`, consumed by `import_state()` and
/// `post_swap_runqueue_audit()`.
pub struct SchedStateExport {
    /// Schema version for forward-compatibility. The new component's
    /// `is_compatible()` checks this version before accepting the state.
    pub version: u64,
    /// Per-CPU runqueue snapshots. Length equals the number of possible CPUs
    /// at export time. Each entry contains the full runqueue state for one CPU.
    pub per_cpu: Vec<RunqueueSnapshot>,
    /// Total number of DL (deadline) tasks across all runqueues.
    /// Used by `post_swap_runqueue_audit()` for cross-CPU consistency checks.
    pub dl_task_count: u32,
    /// Total number of RT (real-time) tasks across all runqueues.
    pub rt_task_count: u32,
    /// Total number of CFS (EEVDF) tasks across all runqueues.
    pub cfs_task_count: u32,
    /// Padding for alignment.
    pub _pad: u32,
}
| Aspect | Specification |
|---|---|
| DL task handling | DL tasks are exempted from quiescence interception: the trampoline's sched_quiescence_exempt() check dispatches pick_next_task through the old vtable when rq.dl.nr_running > 0. At Phase B (step 5c), DL runqueue entries are transferred atomically into the new scheduler's DL runqueue: (runtime, deadline, period) parameters are copied verbatim (see "Full runqueue ownership transfer" above). No DL deadline misses during replacement. |
| RT task handling | RT-FIFO and RT-RR tasks are exempted from quiescence interception, same mechanism as DL: the trampoline dispatches through the old vtable when rq.rt.nr_running > 0. At Phase B (step 5c), RT runqueue entries are transferred atomically: (priority, policy, remaining_rr_slice) are copied into the new scheduler's RT runqueue. No POSIX real-time latency guarantee violations during replacement. |
| CFS/EEVDF handling | CFS tasks whose pick_next_task calls were queued during Phase A' quiescence are NOT replayed from PendingOps. Instead, the complete CFS runqueue (vruntime-ordered task list, min_vruntime, CBS server state) is transferred atomically at Phase B (step 5c). PendingOps for queued pick_next_task calls are discarded post-Phase B — the runqueue transfer subsumes them. |
| Post-swap audit | After Phase B runqueue transfer and Phase C1 non-runqueue state reconstruction, post_swap_runqueue_audit() verifies every TASK_RUNNING task is on exactly one runqueue and DL/RT/CFS counts match the exported state. Mismatch triggers FMA warning and automatic repair or watchdog rollback. |
| Dependencies | None — scheduler is a core component, not a KABI service. Replaced via EvolvableComponent, not ServiceEvolvable. |
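The count half of the post-swap audit can be modeled as below. The struct bodies are toy stand-ins (the real `RunqueueSnapshot` carries full per-task state, and the real audit also verifies that every TASK_RUNNING task is on exactly one runqueue):

```rust
// Toy model of post_swap_runqueue_audit(): DL/RT/CFS task totals in the
// live (imported) runqueues must equal the totals in SchedStateExport.
struct RunqueueSnapshot { dl: u32, rt: u32, cfs: u32 }

struct SchedStateExport {
    version: u64,
    per_cpu: Vec<RunqueueSnapshot>,
    dl_task_count: u32,
    rt_task_count: u32,
    cfs_task_count: u32,
}

fn post_swap_runqueue_audit(exported: &SchedStateExport, live: &[RunqueueSnapshot]) -> bool {
    let sum = |f: fn(&RunqueueSnapshot) -> u32| live.iter().map(f).sum::<u32>();
    sum(|r| r.dl) == exported.dl_task_count
        && sum(|r| r.rt) == exported.rt_task_count
        && sum(|r| r.cfs) == exported.cfs_task_count
}

fn main() {
    let export = SchedStateExport {
        version: 1,
        per_cpu: vec![
            RunqueueSnapshot { dl: 1, rt: 2, cfs: 10 },
            RunqueueSnapshot { dl: 0, rt: 1, cfs: 7 },
        ],
        dl_task_count: 1, rt_task_count: 3, cfs_task_count: 17,
    };
    assert_eq!(export.version, 1);
    assert!(post_swap_runqueue_audit(&export, &export.per_cpu)); // counts match
    let drifted = [RunqueueSnapshot { dl: 1, rt: 3, cfs: 16 }];  // lost a CFS task
    assert!(!post_swap_runqueue_audit(&export, &drifted));       // → FMA / rollback
    println!("audit ok");
}
```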

Slab allocator (umka-core slab policy):

| Aspect | Specification |
|---|---|
| State schema | Active cache list (cache name, object size, alignment, flags), per-cache depot inventory (full/partial slab counts), per-CPU magazine state |
| Estimated size | 1-10 KB (metadata only — slab pages are not exported) |
| Quiescence | IPI drain all per-CPU magazines to depot. Verify no in-flight allocations (allocation refcount per cache). Slab pages remain in place — only the allocator policy metadata is exported/imported. |
| Export format | Single blob: SlabStateExport { caches: [(name, obj_size, align, flags, slab_count)] }. |
| Dependencies | None — slab is a core component. |

Block layer (umka-block):

| Aspect | Specification |
|---|---|
| State schema | Request queues per device, I/O scheduler state (deadline queues, BFQ weights), dm target table, partition table cache |
| Estimated size | 1-10 MB |
| Quiescence | Drain all in-flight bios: set each request queue to QUEUE_FLAG_DYING, wait for all bio completion callbacks to fire (bounded by I/O timeout, default 30s). New I/O requests return -ERESTARTSYS. |
| Export format | Chunked: per-device queue state, dm target tables, partition cache. |
| Dependencies | Block layer is below VFS and above NVMe/SCSI drivers. Replace block before VFS if both are replaced. |

ML policy framework (umka-mlpolicy orchestration):

| Aspect | Specification |
|---|---|
| State schema | PolicyServiceRegistration table (registered Tier 2 services — service_id, ring FDs, rate limiter state), MlPolicyCss per-cgroup parameter overrides, active decay timers (param_id, deadline_ns pairs), AtomicModelRef for in-kernel model weights (if loaded) |
| Estimated size | 10-50 KB (registration table is small; per-cgroup overrides scale with cgroup count; model weights are Tier 2 owned, not exported) |
| Quiescence | (1) Block new ML_POLICY_REGISTER / ML_POLICY_UNREGISTER ioctls. (2) Block new PolicyUpdateMsg ring submissions by setting a POLICY_QUIESCED flag checked by the ring consumer. (3) Drain in-flight parameter updates — bounded by rate limiter (max 1000 msg/s × max ring depth 256 = at most 256 in-flight messages, drain completes in <10 ms). (4) KernelParamStore data (the param array itself) is NOT exported — it is Nucleus data and survives replacement. Only the orchestration metadata is replaced. |
| Export format | Single blob: MlPolicyExport { services: ArrayVec<PolicyServiceSnapshot, 16>, cgroup_overrides: Vec<(u64, ArrayVec<ParamOverride, 32>)>, decay_timers: ArrayVec<(ParamId, u64), 64> }. |
| Tier 2 service reconnection | Tier 2 services detect replacement via a generation: AtomicU64 field in the shared-memory KernelParamStoreShadow header (Section 23.1). On generation mismatch, the service re-issues ML_POLICY_REGISTER. The new ML policy orchestration imports the service's previous PolicyServiceRegistration state (ring FDs, rate limiter) so reconnection is seamless if the service re-registers within 5 seconds of the swap. After 5 seconds, stale registrations are purged. |
| Dependencies | ML policy depends on FMA (observation bus uses FMA health events). Replace ML policy after FMA if both are replaced simultaneously. Independent of VFS/Net/Block — ML policy only reads KernelParamStore atomics and consumes ObservationRing entries. |
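The generation-mismatch check a Tier 2 service performs against the shared header might look like the following sketch; `Shadow`, `ServiceView`, and `needs_reregister` are illustrative names standing in for the real KernelParamStoreShadow access path:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of Tier 2 reconnection detection: the kernel bumps `generation` at
// each orchestration swap; a service that observes a mismatch must re-issue
// ML_POLICY_REGISTER (within the 5-second grace window described above).
struct Shadow { generation: AtomicU64 }

struct ServiceView { last_seen_generation: u64 }

impl ServiceView {
    /// Returns true when the kernel-side orchestration was replaced since
    /// this service last registered.
    fn needs_reregister(&mut self, shadow: &Shadow) -> bool {
        let now = shadow.generation.load(Ordering::Acquire);
        let stale = now != self.last_seen_generation;
        self.last_seen_generation = now;
        stale
    }
}

fn main() {
    let shadow = Shadow { generation: AtomicU64::new(1) };
    let mut svc = ServiceView { last_seen_generation: 1 };
    assert!(!svc.needs_reregister(&shadow));          // same generation: no action
    shadow.generation.fetch_add(1, Ordering::AcqRel); // orchestration replaced
    assert!(svc.needs_reregister(&shadow));           // mismatch → re-register
    assert!(!svc.needs_reregister(&shadow));          // caught up again
    println!("ok");
}
```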

Other Evolvable components (workqueue, IPC dispatch, ACPI/DTB parser, LSM, FMA/observability):

| Component | State | Quiescence | Notes |
|---|---|---|---|
| Workqueue | Active work items, thread pool config, per-CPU worker states | Drain all pending work items (bounded by max work duration). Block new queue_work() calls. | State is small (<100 KB). Single-blob export. |
| IPC dispatch | Per-process fd table references, pipe buffer states, SysV IPC objects | Drain in-flight IPC operations via refcount. Block new syscall entries. | Depends on VFS (pipe fds are file objects). |
| ACPI/DTB parser | Parsed device tree, ACPI table cache | No quiescence needed — cold path only. | Export parsed tree + raw table cache. Rarely replaced (only on firmware update). |
| LSM | Per-inode security labels, policy database, audit rules | Drain in-flight hook evaluations (bounded, <1ms). Block new hook entries. | Policy database may be large (1-50 MB for SELinux). Chunked export. |
| FMA/observability | Active fault records, tracepoint registration table, metric counters | Drain in-flight fault reports. Block new fma_report() calls. | Metric counters are lossy — acceptable to drop observations during swap. |

Full Evolvable replacement dependency ordering:

When multiple Evolvable components are replaced simultaneously (e.g., during a major kernel evolution), the replacement follows a dependency DAG:

Bottom-up export order (components with no deps export first):
  1. Slab allocator policy
  2. Scheduler (EEVDF policy, migration policy, load balancer)
  3. Workqueue
  4. ACPI/DTB parser
  5. Block layer
  6. FMA/observability
  7. ML policy framework (depends on FMA)
  8. LSM
  9. Network stack
 10. IPC dispatch
 11. VFS (last — depends on block, IPC, LSM)

Top-down import order (reverse of export):
 11. VFS
 10. IPC dispatch
  9. Network stack
  8. LSM
  7. ML policy framework
  6. FMA/observability
  5. Block layer
  4. ACPI/DTB parser
  3. Workqueue
  2. Scheduler
  1. Slab allocator policy

The framework serializes replacements in this order. Each component's import_state_chunk() receives its own serialized state blob and does not directly reference other components' live state. Cross-component references are re-established during Phase C1 (pending ops replay) after all imports complete and all vtable slots have been atomically swapped in Phase B.
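
The ordering invariant — import order is the exact reverse of export order — can be sketched as follows (short component identifiers are illustrative stand-ins for the names listed above):

```rust
/// Export order: dependency-free components first, VFS last.
const EXPORT_ORDER: [&str; 11] = [
    "slab", "scheduler", "workqueue", "acpi_dtb", "block",
    "fma", "ml_policy", "lsm", "net", "ipc", "vfs",
];

/// Import order is the exact reverse: VFS imports first, slab last.
fn import_order() -> Vec<&'static str> {
    EXPORT_ORDER.iter().rev().copied().collect()
}
```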

13.18.10 Runtime Evolution Trigger Interface

Live kernel evolution is initiated through three interfaces: a sysfs control interface for interactive and scripted use, an ioctl interface for programmatic use from system management daemons, and an IPC message for cluster-wide orchestration. All three interfaces feed into the same Evolvable orchestration sequencer; they differ only in invocation mechanism.

13.18.10.1 Sysfs Control Interface

The evolution control files are exposed under /sys/kernel/evolution/:

/sys/kernel/evolution/
  trigger            (write-only) — initiate an evolution operation
  status             (read-only)  — current evolution state machine state
  last_result        (read-only)  — result of the most recent evolution operation
  watchdog_timeout_ms (read-write) — post-swap watchdog duration (default: 5000)
  components/        (directory)  — per-component evolution state
    scheduler/
      version        (read-only)  — current component state_version()
      state          (read-only)  — "active" | "quiescing" | "swapping" | "watchdog"
      last_swap_ns   (read-only)  — ktime of last successful swap
    page_reclaim/
      ...
    vfs/
      ...
    <component_name>/
      ...

**Trigger file format**: Writing to `/sys/kernel/evolution/trigger` initiates an evolution operation. The write payload is a newline-terminated string:

`<target> <elf_path> [options...]`

| Field | Format | Description |
|-------|--------|-------------|
| `target` | Component name string | Target component: `scheduler`, `page_reclaim`, `vfs`, `net`, `block`, `slab`, `lsm`, `sysapi`, `kernel` (whole Evolvable), or a driver name from the device registry |
| `elf_path` | Absolute path string | Path to a signed ELF binary. For subsystem targets, the ELF contains the new component. For `kernel`, the ELF is a full kernel monolith — the loader extracts the `.evolvable_image` section and performs whole-Evolvable replacement |
| `--force` | Flag (optional) | Skip the soak period for cluster rolling replacement |
| `--timeout=N` | Integer ms (optional) | Override quiescence deadline (capped at 10000 ms) |
| `--dry-run` | Flag (optional) | Validate the ELF, verify signature, check version compatibility, but do not perform the swap |

The kernel target triggers whole-Evolvable replacement: the orchestrator opens the monolith ELF, locates the .evolvable_image section, validates the EvolvableImageHeader signature (ML-DSA-65), and proceeds with the full dependency-ordered batch swap (see "Full Evolvable replacement dependency ordering" above). This is the same Evolvable image that would be loaded at next boot — live evolution simply applies it without a reboot. There is no separate whole-Evolvable ELF; the kernel monolith in /boot/ is the canonical source for both boot and live evolution.
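
A sketch of the trigger-string parser, assuming illustrative struct and function names (only the payload grammar and the 10000 ms cap come from this section):

```rust
#[derive(Debug, PartialEq)]
struct TriggerCmd {
    target: String,
    elf_path: String,
    force: bool,
    dry_run: bool,
    timeout_ms: Option<u32>,
}

fn parse_trigger(line: &str) -> Result<TriggerCmd, &'static str> {
    // split_whitespace also discards the trailing newline.
    let mut parts = line.split_whitespace();
    let target = parts.next().ok_or("missing target")?.to_string();
    let elf_path = parts.next().ok_or("missing elf_path")?.to_string();
    if !elf_path.starts_with('/') {
        return Err("elf_path must be absolute");
    }
    let mut cmd = TriggerCmd { target, elf_path, force: false, dry_run: false, timeout_ms: None };
    for opt in parts {
        match opt {
            "--force" => cmd.force = true,
            "--dry-run" => cmd.dry_run = true,
            _ => {
                if let Some(v) = opt.strip_prefix("--timeout=") {
                    let ms: u32 = v.parse().map_err(|_| "invalid --timeout value")?;
                    cmd.timeout_ms = Some(ms.min(10_000)); // kernel caps at 10000 ms
                } else {
                    return Err("unknown option");
                }
            }
        }
    }
    Ok(cmd)
}
```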

Examples:

# Replace the scheduler policy module:
echo "scheduler /lib/umka/evolution/sched-v42.elf" > /sys/kernel/evolution/trigger

# Replace the VFS service with extended quiescence timeout:
echo "vfs /lib/umka/evolution/vfs-v7.elf --timeout=500" > /sys/kernel/evolution/trigger

# Whole-Evolvable replacement from a new kernel monolith:
echo "kernel /boot/umka-kernel.elf" > /sys/kernel/evolution/trigger

# Dry-run validation only:
echo "net /lib/umka/evolution/net-v12.elf --dry-run" > /sys/kernel/evolution/trigger

**Return values**: The `write()` syscall returns the number of bytes written on successful initiation (the evolution operation runs asynchronously). On validation failure, the write returns an error:

| Error | Meaning |
|-------|---------|
| `-EINVAL` | Malformed trigger string, unknown component name, or invalid option |
| `-ENOENT` | ELF path does not exist or is not readable |
| `-EPERM` | Caller lacks CAP_SYS_ADMIN (or the UmkaOS EvolutionCap capability) |
| `-EBUSY` | An evolution operation is already in progress |
| `-ENOSIG` | ELF signature verification failed (ML-DSA-65) |
| `-EVERSION` | State version incompatible (new version cannot import old state) |

**Status file**: Reading `/sys/kernel/evolution/status` returns the current state:

idle                    — no evolution in progress
validating <target>     — Phase A: ELF loaded, signature check in progress
exporting <target>      — Phase A: state export in progress
quiescing <target>      — Phase A': quiescence in progress
swapping <target>       — Phase B: atomic swap in progress
activating <target>     — Phase C: pending ops drain + audit
watchdog <target> <ms>  — post-swap watchdog active, <ms> remaining

**Last result file**: Reading `/sys/kernel/evolution/last_result` returns:

success <target> <old_version> -> <new_version> <duration_us>
failed <target> <phase> <error_code> <description>
rollback <target> <new_version> -> <old_version> <reason>

13.18.10.2 Ioctl Interface

For programmatic use by system management daemons (e.g., umka-evold), the evolution subsystem exposes an ioctl interface on /dev/umka-evolution:

/// File descriptor opened on /dev/umka-evolution.
/// Requires CAP_SYS_ADMIN or EvolutionCap.

/// Initiate a live evolution operation.
/// The ioctl blocks until the operation completes (success, failure, or rollback).
/// This is the synchronous equivalent of the sysfs trigger.
/// Note: The ioctl number is defined below with the EvolutionTriggerIoctl struct.

/// Query current evolution status (non-blocking).
pub const UMKA_EVOLUTION_STATUS: u32 = 0x8010_4501; // _IOR('E', 1, EvolutionStatus)

/// Abort an in-progress evolution (only during Phase A or A').
/// Cannot abort during Phase B (atomic swap) or Phase C (activation).
pub const UMKA_EVOLUTION_ABORT: u32 = 0x4004_4502; // _IOW('E', 2, u32 target_id)

/// Combined request/result structure for UMKA_EVOLUTION_TRIGGER.
/// The ioctl is `_IOWR` — the kernel reads request fields from the user buffer
/// and writes result fields back to the same buffer. Both halves must fit in
/// the same struct to avoid the size mismatch that would occur if the kernel
/// wrote an `EvolutionResult` (40 bytes) to a buffer sized for a separate
/// `EvolutionRequest` (32 bytes). The ioctl number encodes 40 bytes.
///
/// On entry: userspace fills the request fields; result fields are ignored.
/// On return: the kernel overwrites the result fields; request fields are
/// preserved for userspace logging convenience.
#[repr(C)]
pub struct EvolutionTriggerIoctl {
    // --- Request fields (written by userspace) ---

    /// Component identifier (enum value from EvolvableComponentId).
    /// Use EVOLUTION_TARGET_KERNEL (0xFFFF_FFFF) for whole-Evolvable replacement;
    /// the ELF path must point to a kernel monolith containing `.evolvable_image`.
    pub target: u32,
    /// Flags: EVOLUTION_F_FORCE (0x1), EVOLUTION_F_DRY_RUN (0x2).
    pub flags: u32,
    /// Quiescence timeout override in milliseconds. 0 = use default.
    /// Capped at 10000ms by the kernel.
    pub timeout_ms: u32,
    /// Reserved for future use. Must be zero.
    pub _reserved_req: u32,
    /// Pointer to the ELF path string (userspace address).
    pub elf_path_ptr: u64,
    /// Length of the ELF path string (excluding null terminator).
    pub elf_path_len: u32,
    /// Padding to align to 8-byte boundary.
    pub _pad_req: u32,

    // --- Result fields (written by kernel on completion) ---

    /// 0 = success, negative = error code.
    pub status: i32,
    /// Phase where failure occurred (0=A, 1=A', 2=B, 3=C, 4=watchdog).
    /// Only meaningful when status != 0.
    pub failed_phase: u32,
    /// Old component version before the swap.
    pub old_version: u64,
    /// New component version after the swap (or attempted version on failure).
    pub new_version: u64,
    /// Total operation duration in microseconds.
    pub duration_us: u64,
    /// Stop-the-world duration in nanoseconds (Phase B only).
    pub stw_duration_ns: u64,
}
// 32 bytes (request) + 40 bytes (result) = 72 bytes total.
const_assert!(size_of::<EvolutionTriggerIoctl>() == 72);

pub const UMKA_EVOLUTION_TRIGGER: u32 = 0xC048_4500; // _IOWR('E', 0, EvolutionTriggerIoctl), size=0x48=72

The ioctl interface provides synchronous semantics: the calling thread blocks in UMKA_EVOLUTION_TRIGGER until the entire Phase A/A'/B/C sequence completes. This simplifies daemon logic (no polling loop needed). The thread is interruptible during Phase A and A' (returning -EINTR triggers a clean abort). During Phase B and C, the operation is non-interruptible (the swap must complete atomically).
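The ioctl numbers above follow the conventional `_IOC(dir, type, nr, size)` bit layout (direction in bits 30-31, size in bits 16-29, type in bits 8-15, number in bits 0-7). A sketch that reproduces the three constants — the helper name `ioc` is illustrative:

```rust
/// dir: 1 = _IOW, 2 = _IOR, 3 = _IOWR.
const fn ioc(dir: u32, ty: u8, nr: u8, size: u32) -> u32 {
    (dir << 30) | (size << 16) | ((ty as u32) << 8) | (nr as u32)
}

// _IOWR('E', 0, EvolutionTriggerIoctl): 72-byte combined struct.
const UMKA_EVOLUTION_TRIGGER: u32 = ioc(3, b'E', 0, 72);
// _IOR('E', 1, EvolutionStatus): the 0x0010 size field implies a
// 16-byte EvolutionStatus struct.
const UMKA_EVOLUTION_STATUS: u32 = ioc(2, b'E', 1, 16);
// _IOW('E', 2, u32 target_id): 4-byte payload.
const UMKA_EVOLUTION_ABORT: u32 = ioc(1, b'E', 2, 4);
```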

13.18.10.3 Capability Requirements

All evolution trigger interfaces require the EvolutionCap capability (Section 9.1), which is a strict superset of Linux's CAP_SYS_ADMIN. This capability is granted only to the init process and explicitly delegated system management services. The capability check is performed at the sysfs write() / ioctl entry point, before any ELF loading or state inspection occurs.

13.18.10.4 Tier 1 Driver Evolution Trigger

For Tier 1 driver replacement (not crash recovery — see Section 11.9 for the crash path), the same trigger interfaces are used with the driver's device registry name as the target component. The sysfs path for per-driver evolution state is:

/sys/kernel/evolution/drivers/<bus>:<slot>/
  version        (read-only)  — current driver state_version()
  state          (read-only)  — "active" | "quiescing" | "swapping" | "watchdog"
  last_swap_ns   (read-only)  — ktime of last successful swap

Example: replacing an Intel NIC driver:

echo "pci:0000:03:00.0 /lib/umka/drivers/igc-v5.elf" > /sys/kernel/evolution/trigger

The driver target name is resolved through the device registry (Section 11.4). The orchestration layer determines whether to use the graceful Tier 1 replacement protocol (see "Graceful Tier 1 Driver Replacement" below) or the crash recovery path based on whether the current driver is healthy.

13.18.11 Graceful Tier 1 Driver Replacement

The crash recovery path (Section 11.9) handles unplanned driver failures: fault detection, domain revocation, -EIO for in-flight I/O, device reset, and reload. This section specifies the planned replacement protocol for Tier 1 drivers — a deliberate upgrade where the old driver is healthy and cooperating.

13.18.11.1 Evolution Trigger Classification

Every evolution operation carries an EvolutionTrigger that determines the orchestration protocol variant used:

/// Reason for initiating a driver or component replacement.
/// Determines which protocol variant the orchestration layer uses.
#[repr(u8)]
pub enum EvolutionTrigger {
    /// Unplanned driver fault detected by the crash recovery subsystem.
    /// Protocol: crash recovery path ([Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)).
    /// - Device is reset (FLR or soft reset).
    /// - In-flight I/O is drained with -EIO (non-PERSISTENT) or replayed (PERSISTENT).
    /// - Version compatibility checks are SKIPPED — the same driver binary is
    ///   reloaded (the goal is to restore service, not to upgrade).
    /// - Binary source: Module Binary Store (MBS), a kernel-resident compressed
    ///   cache populated at module load time. No filesystem I/O needed — eliminates
    ///   the circular dependency when the crashed driver IS the filesystem driver.
    ///   See [Section 11.9](11-drivers.md#crash-recovery-and-state-preservation) step 8 for MBS details.
    /// - No `export_state()` — state is reconstructed from invariants and the
    ///   HMAC-verified state buffer checkpoint.
    CrashRecovery = 0,

    /// Deliberate driver or component upgrade initiated by operator or daemon.
    /// Protocol: graceful replacement (this section).
    /// - Old driver is healthy and cooperating.
    /// - KABI version compatibility check is REQUIRED: the new driver's
    ///   `KabiDriverManifest.kabi_version` must be compatible with the kernel's
    ///   current KABI version ([Section 12.1](12-kabi.md#kabi-overview--version-compatibility)).
    ///   Incompatible versions are rejected at Phase A (load time).
    /// - `export_state()` provides complete, verified state.
    /// - In-flight I/O is drained to completion (no -EIO).
    /// - Device is NOT reset (stays operational throughout).
    Upgrade = 1,

    /// Hot-swap of a policy module (stateless AtomicPtr swap).
    /// Protocol: stateless policy swap with watchdog (see "Stateless Policy
    /// Swap Rollback Protocol" above).
    /// - No quiescence, no state export/import, no Phase A/A'/B/C lifecycle.
    /// - Version compatibility: the policy module's vtable layout must match
    ///   the current trait definition (vtable_size check).
    /// - Used for: `PhysAllocPolicy`, `PageReclaimPolicy`, `VmmPolicy`,
    ///   `CapPolicy`, `IoSchedOps`, `QdiscOps`, `CongestionOps`.
    PolicyUpdate = 2,
}

The trigger is recorded in the evolution event log and FMA telemetry. The orchestration layer selects the protocol variant based on the trigger:

| Trigger | Version check | State export | Device reset | I/O handling |
|---------|---------------|--------------|--------------|--------------|
| CrashRecovery | Skip (same binary) | No (reconstruct) | Yes (FLR) | -EIO + PERSISTENT replay |
| Upgrade | Required (KABI compat) | Yes (`export_state()`) | No | Drain to completion |
| PolicyUpdate | Vtable size only (`kabi_version` verified at load time, not re-checked at swap) | No (stateless) | N/A | No impact |

The key difference: graceful replacement drains in-flight I/O to completion (no -EIO errors visible to applications) and performs a cooperative state handoff (the old driver exports its state, rather than the kernel reconstructing it from invariants after a crash).
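
The protocol selection reduces to a dispatch on the trigger; a sketch with the enum re-declared for self-containment (the `ProtocolPlan` struct is illustrative, not part of the specified API):

```rust
#[derive(Clone, Copy, Debug)]
#[repr(u8)]
enum EvolutionTrigger {
    CrashRecovery = 0,
    Upgrade = 1,
    PolicyUpdate = 2,
}

#[derive(PartialEq, Debug)]
struct ProtocolPlan {
    kabi_version_check: bool,
    export_state: bool,
    device_reset: bool,
}

fn plan_for(trigger: EvolutionTrigger) -> ProtocolPlan {
    match trigger {
        // Same binary reloaded from the MBS; state reconstructed.
        EvolutionTrigger::CrashRecovery => ProtocolPlan {
            kabi_version_check: false, export_state: false, device_reset: true,
        },
        // Cooperative handoff: drain, export, no reset.
        EvolutionTrigger::Upgrade => ProtocolPlan {
            kabi_version_check: true, export_state: true, device_reset: false,
        },
        // Stateless AtomicPtr swap; vtable-size check only.
        EvolutionTrigger::PolicyUpdate => ProtocolPlan {
            kabi_version_check: false, export_state: false, device_reset: false,
        },
    }
}
```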

13.18.11.2 Graceful Replacement Protocol

Phase A — Preparation (same as core component evolution):
  1. Load new driver ELF, verify ML-DSA-65 signature, resolve symbols.
  2. Initialize new driver instance in a dormant state (no device access yet).
  3. Old driver exports state via EvolvableComponent::export_state():
     - Device configuration (registers, features, offload settings).
     - Per-queue state (ring buffer positions, interrupt coalescing params).
     - Statistics counters (for continuity in monitoring).

Phase A' — Graceful quiescence (driver-class-specific):

  NIC driver quiescence:
    4. Set NAPI_STATE_SCHED_DISABLE on all NAPI instances owned by this driver.
       This prevents new NAPI polls from being scheduled. Currently-running
       NAPI polls complete normally (bounded by NAPI weight, typically 64 packets).
    5. Wait for all in-flight NAPI polls to complete. The NAPI subsystem
       tracks per-instance poll state; quiescence waits for all instances
       to reach NAPI_STATE_IDLE. Bounded by: NAPI weight × packet processing
       time ≈ 64 × 1 μs = 64 μs per instance.
    6. Drain TX completions: wait for the NIC to report completion for all
       submitted TX descriptors. The TX completion ring is polled (not
       interrupt-driven during quiescence) with a 100ms timeout. If the
       timeout expires without all completions, the replacement is aborted
       (the device may be hung — fall back to the crash recovery path).
    7. Drain RX ring: process all packets currently in the RX descriptor ring
       through the network stack. No packets are dropped. This is bounded by
       the RX ring size (typically 1024-4096 descriptors) × per-packet
       processing time.
    7a. **RX packets arriving during quiescence**: Between steps 4 and 8, the
       NIC hardware continues receiving packets (link is UP, DMA is active).
       Packets arriving after NAPI is disabled are handled by reusing the
       crash recovery NIC RX drain mechanism
       ([Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)): the per-NIC recovery
       RX ring (pre-allocated at driver probe time with capacity
       `max(256, 2 * nic.rx_ring_size)` — the SAME ring used for crash
       recovery) accumulates incoming packets. The NAPI poll is replaced by a
       lightweight interrupt-driven drain loop that copies RX descriptors
       into the temporary ring without full network stack processing.
       After the Phase B swap, the new driver's NAPI instances drain the
       temporary ring as their first action in Phase C. If the temporary
       ring overflows (sustained line-rate during the quiescence window),
       excess packets are dropped and counted via
       `NicEvolutionStats::rx_quiesce_drops`. This counter is exported in
       the new driver's statistics for monitoring. Under typical conditions
       (quiescence window < 10ms, 10GbE), the temporary ring handles the
       arrival rate without overflow.
    8. After drain, the old driver holds no in-flight DMA mappings and no
       NAPI polls are active. The device is quiescent but still operational
       (link is UP, NIC hardware is not reset).

  Block driver quiescence (NVMe, VirtIO-blk, AHCI):
    4. Set the per-device request queue to `QUEUE_FLAG_QUIESCING`. New I/O
       submissions are held in the block layer's staging queue (not returned
       with -EIO). Existing in-flight bios drain normally through the old
       driver instance — only NEW submissions are held.
    5. Wait for all in-flight commands to complete. For NVMe: poll CQ entries
       until all outstanding command IDs have completions. Bounded by the
       device's I/O timeout (default: 30s for NVMe, configurable).
    6. If the timeout expires without all completions, abort the replacement
       (fall back to crash recovery path with FLR). Clear `QUEUE_FLAG_QUIESCING`
       and release held submissions.
    7. After drain, resume held I/O submissions through the new driver after
       the swap (Phase C step 18 clears `QUEUE_FLAG_QUIESCING`).

  **`QUEUE_FLAG_QUIESCING` definition:**

  ```rust
  bitflags! {
      /// Flags on a block device request queue controlling I/O submission behavior.
      pub struct QueueFlags: u32 {
          /// Queue is permanently shutting down. New submissions return -EIO.
          /// Set when a device is being removed from the system.
          const DYING      = 1 << 0;
          /// Queue is temporarily quiesced for live driver evolution.
          /// New submissions are HELD (not failed) in a staging queue.
          /// Distinct from DYING: QUIESCING is reversible — submissions
          /// resume when the flag is cleared after the driver swap.
          ///
          /// Set by: graceful Tier 1 driver replacement (Phase A', step 4).
          /// Cleared by: Phase C (step 18) after the new driver is active,
          ///   or on abort (step 6) if the replacement is cancelled.
          ///
          /// **Interaction with DYING**: If both DYING and QUIESCING are set
          /// (device removal races with an in-progress evolution), DYING takes
          /// precedence — held submissions are failed with -EIO and the
          /// evolution is aborted.
          ///
          /// **bio_submit() behavior when QUIESCING**:
          /// `bio_submit()` checks `bdev.queue_flags` before dispatch. If
          /// `QUIESCING` is set, the bio is appended to the per-device
          /// staging queue (`bdev.quiesce_queue: SpinLock<VecDeque<Bio>>`)
          /// and the function returns immediately. The staging queue is
          /// bounded by `MAX_QUIESCE_QUEUE_DEPTH` (default: 4096 bios).
          /// If the staging queue is full, `bio_submit()` blocks on the
          /// staging queue's waitqueue until space is available or the
          /// flag is cleared — bios are never silently dropped.
          const QUIESCING  = 1 << 1;
          /// Queue is frozen (no I/O dispatch to hardware). Used during
          /// device suspend/resume.
          const FROZEN     = 1 << 2;
      }
  }
  ```
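
  A sketch of the `bio_submit()` check described in the flag documentation above (types heavily simplified: the real `queue_flags` is atomic and the staging queue is a `SpinLock<VecDeque<Bio>>`; `dispatched` is an illustrative stand-in for hardware dispatch):

  ```rust
  use std::collections::VecDeque;

  const DYING: u32 = 1 << 0;
  const QUIESCING: u32 = 1 << 1;
  const EIO: i32 = 5;

  struct Bio; // simplified

  struct BlockDevice {
      queue_flags: u32,             // AtomicU32 in the real kernel
      quiesce_queue: VecDeque<Bio>, // SpinLock<VecDeque<Bio>> in the real kernel
      dispatched: usize,            // stand-in for hardware dispatch
  }

  fn bio_submit(bdev: &mut BlockDevice, bio: Bio) -> Result<(), i32> {
      // DYING takes precedence over QUIESCING: device removal wins
      // over an in-progress evolution.
      if bdev.queue_flags & DYING != 0 {
          return Err(-EIO);
      }
      if bdev.queue_flags & QUIESCING != 0 {
          // Held, not failed; real code blocks on the staging queue's
          // waitqueue when MAX_QUIESCE_QUEUE_DEPTH is reached.
          bdev.quiesce_queue.push_back(bio);
          return Ok(());
      }
      bdev.dispatched += 1;
      Ok(())
  }
  ```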

  Generic Tier 1 driver quiescence (ring dispatch model):
    With ring-based Tier 1 dispatch, quiescing a Tier 1 module means:
    4. Set all inbound rings to `Disconnected` state
       (`ring.state.store(DISCONNECTED, Release)`). This prevents new
       submissions from producers — any producer attempting to enqueue
       observes `Disconnected` on its next `state.load(Acquire)` and
       returns `Err(Disconnected)`.
    5. Wait for in-flight ring operations to complete: poll
       `ring.inflight_count.load(Acquire) == 0` for each ring. Bounded
       by ring capacity × per-entry processing time. The consumer loop
       continues draining published entries until the ring is empty.
    6. Drain pending DMA operations: issue IOMMU flush for the driver's
       domain, then wait for all DMA fences to signal. After this step,
       no DMA initiated by the old driver is in-flight.
    After steps 4-6, the driver has no in-flight operations through
    either the KABI ring interface or DMA. The ring drain replaces the
    previous "wait for direct KABI calls to return" model — with ring
    dispatch, there are no direct cross-domain calls to wait for.

  **Policy modules via Tier 1 ring transport**: If a policy module is delivered
  via Tier 1 KABI ring transport (not the default `AtomicPtr` path), the evolution
  orchestrator drains the ring using the standard DomainRingBuffer drain protocol
  (steps 4-6 above) before Phase B swap.
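
  Steps 4-5 of the ring drain can be sketched as follows (field names are illustrative; the real ring layout is defined in Section 11.8, and the real wait sleeps with a deadline rather than spinning):

  ```rust
  use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};

  const ACTIVE: u32 = 0;
  const DISCONNECTED: u32 = 1;

  struct Ring {
      state: AtomicU32,
      inflight_count: AtomicUsize,
  }

  /// Step 4: reject new submissions; step 5: wait for in-flight drain.
  fn quiesce_rings(rings: &[Ring]) {
      for r in rings {
          r.state.store(DISCONNECTED, Ordering::Release);
      }
      for r in rings {
          while r.inflight_count.load(Ordering::Acquire) != 0 {
              std::hint::spin_loop(); // real code sleeps with a deadline
          }
      }
  }

  /// Producer side: observes Disconnected on its next load and bails out.
  fn try_submit(r: &Ring) -> Result<(), &'static str> {
      if r.state.load(Ordering::Acquire) == DISCONNECTED {
          return Err("Disconnected");
      }
      r.inflight_count.fetch_add(1, Ordering::AcqRel);
      Ok(())
  }
  ```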

Phase B — Atomic swap (~1-10 μs stop-the-world):
  9.  IPI all CPUs, brief hold.
  10. Swap the driver's vtable pointer (same Nucleus evolution primitive).
  11. Transfer DomainRingBuffer ownership to the new driver instance:
      Ring buffer transfer protocol:
      (a) Quiesce producers (set `ring.quiescing` flag — already done in Phase A' step 4).
      (b) Drain pending entries: the kernel-side consumer processes all remaining
          entries in the ring (bounded by ring capacity, completed in Phase A' step 5).
      (c) Swap ring pointer via `AtomicPtr::store(new_ring, Release)` — the new
          component's ring descriptor becomes the active ring. MPSC rings: reset
          head/tail/published indices to 0, set state = Active. SPSC completion
          rings: reset similarly. **Sequence continuity**: the ring-level sequence
          numbers reset to zero because the new ring is a fresh allocation with no
          history. Higher-level protocols (KABI request/response correlation) use
          per-domain generation counters (not ring sequence numbers) for stale-
          message detection, so the reset is safe.
      (d) New component resumes production on the new ring after Phase C activation.
      Old ring memory is freed after an RCU grace period to ensure no in-flight
      readers hold stale ring pointers. The ring memory itself belongs to the
      kernel's isolation domain, not the driver's.
  12. Redirect interrupt handlers from old driver to new driver.
  13. If the old and new drivers use different isolation domain keys:
      remap the driver's memory pages to the new domain key. Otherwise,
      inherit the existing domain (zero-cost).
  14. CPUs released.
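
  The step 11 pointer swap can be sketched as follows (simplified: the real swap runs under the stop-the-world IPI, and the returned old ring is freed only after an RCU grace period):

  ```rust
  use std::sync::atomic::{AtomicPtr, Ordering};

  /// Simplified ring descriptor. A fresh allocation starts with all
  /// indices at zero (ring-level sequence numbers reset; stale-message
  /// detection uses per-domain generation counters instead).
  struct RingDesc {
      head: u64,
      tail: u64,
      published: u64,
  }

  /// Swap the active ring pointer; return the old ring for deferred free.
  fn swap_ring(slot: &AtomicPtr<RingDesc>, new_ring: Box<RingDesc>) -> *mut RingDesc {
      // Caller frees the returned pointer only after an RCU grace period.
      slot.swap(Box::into_raw(new_ring), Ordering::AcqRel)
  }
  ```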

Phase C — Activation:
  15. New driver calls import_state() with the exported state from Phase A.
  16. New driver re-programs device configuration from imported state (no
      device reset needed — the device was never stopped, just quiesced).
  17. For NIC drivers: re-enable NAPI polling (clear NAPI_STATE_SCHED_DISABLE,
      call napi_schedule() on each instance). Packet processing resumes.
  18. For block drivers: clear QUEUE_FLAG_QUIESCING, release held I/O
      submissions. The block layer's staging queue drains through the new
      driver.
  19. Post-swap watchdog starts (5 seconds).

13.18.11.3 Graceful vs Crash Recovery Comparison

| Aspect | Graceful replacement | Crash recovery |
|--------|----------------------|----------------|
| Old driver state | Healthy, cooperating | Faulted, revoked |
| In-flight I/O | Drained to completion (no errors) | Completed with -EIO |
| Device reset | Not needed (device stays operational) | FLR or soft reset |
| State source | `export_state()` (complete, verified) | State buffer (HMAC-verified checkpoint) |
| Application impact | Brief latency spike (~1-10 ms) | I/O errors, retries needed |
| Link state (NIC) | Stays UP throughout | Goes DOWN during recovery |
| TCP connections | No impact (no packet loss) | Brief pause, retransmission |
| Trigger | Operator/daemon via sysfs/ioctl | Automatic on fault detection |

13.18.11.4 NIC Driver Graceful Replacement Timing

For a typical NIC driver with 8 RX queues, 8 TX queues, 1024 descriptors per queue:

| Phase | Duration | Limiting factor |
|-------|----------|-----------------|
| Phase A (ELF load, sig verify) | ~5-20 ms | ELF parsing + ML-DSA-65 verification |
| Phase A' (NAPI drain) | ~50-200 μs | 8 NAPI instances × 64 packets × 1 μs (instances drain in parallel across CPUs) |
| Phase A' (TX completion drain) | ~1-10 ms | 8 queues × 1024 completions × ~1 μs |
| Phase A' (RX ring drain) | ~1-5 ms | 8 queues × 1024 packets × ~1 μs |
| Phase B (atomic swap) | ~1-10 μs | Nucleus evolution primitive |
| Phase C (import + re-enable NAPI) | ~100-500 μs | Device register writes + NAPI schedule |
| **Total** | **~7-35 ms** | Dominated by TX completion drain |

This is 5-20x faster than crash recovery (~50-150 ms) because no device reset is needed and no I/O errors are generated.

13.18.12 Data Format Evolution

Sections 12.8.1-12.8.8 address code evolution — replacing algorithms, policies, and entire subsystem implementations. This section addresses a complementary problem: data format evolution. Over a 50-year uptime, the non-replaceable data structures (Page descriptors, DomainRingBuffer headers, BuddyFreeList entries) may need new fields, changed semantics, or structural resizing — without ever unmapping the data or halting the subsystem that owns it.

13.18.12.1 Problem Statement

Non-replaceable data structures are by definition not swapped during live evolution. Their layout is baked into every kernel component that touches them. Adding a field to Page (currently 64 bytes, one cache line) or widening a DomainRingBuffer header field affects every producer and consumer simultaneously. A naive approach — stop the world, rewrite all instances — violates the zero-disruption principle and is infeasible for structures with billions of instances (one Page per 4 KB of physical RAM).

13.18.12.2 Design Principle: Format Epoch

Every non-replaceable data family declares a format epoch — a monotonically increasing u64 that identifies the current layout version. The epoch is stored in a global atomic and is visible to all components:

/// Global format epoch counter. Incremented when any non-replaceable data
/// format changes. All epoch changes happen exclusively during live
/// evolution Phase B (stop-the-world), guaranteeing that no concurrent
/// reader observes a partial transition.
///
/// Readers do NOT check the epoch on every access (that would add a branch
/// to every hot path). Instead, readers are recompiled against the new
/// layout during live evolution — the epoch serves as a compile-time
/// compatibility gate and a runtime diagnostic, not a per-access dispatch.
pub static DATA_FORMAT_EPOCH: AtomicU64 = AtomicU64::new(1);

Format changes are not self-service. A data format change requires a live evolution payload that contains: (a) the new component code compiled against the new layout, (b) a migration function that transforms existing data in-place or via shadow copy, and (c) the new epoch value. The evolution framework orchestrates the migration during its normal Phase A/A'/B/C flow.
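
A sketch of the epoch commit as performed by the evolution framework inside Phase B (the function name is illustrative; only the static and its semantics come from this section):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static DATA_FORMAT_EPOCH: AtomicU64 = AtomicU64::new(1);

/// Called only during Phase B (stop-the-world), after the migration
/// function has run, so no concurrent reader can observe a partial
/// transition. Returns the previous epoch on success; rejects
/// non-monotonic or mismatched transitions.
fn commit_format_epoch(expected_old: u64, new_epoch: u64) -> Result<u64, u64> {
    if new_epoch <= expected_old {
        return Err(expected_old); // epochs are monotonically increasing
    }
    DATA_FORMAT_EPOCH.compare_exchange(
        expected_old, new_epoch, Ordering::AcqRel, Ordering::Acquire,
    )
}
```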

13.18.12.3 Three Evolution Patterns

Different data structures require different migration strategies. The framework provides three patterns, selected based on instance count, access frequency, and structural constraints:

13.18.12.3.1 Pattern 1: Extension Array

Use when: The core struct must remain at its current size (cache-line aligned), but new per-instance metadata is needed.

Mechanism: A parallel array (or vmemmap-mapped sparse array) indexed by the same key as the core struct. New fields live in the extension array; the core struct is untouched. Components that need the new fields access both arrays; components that don't need them are unaffected.

Concrete example — PageExtArray:

The Page struct is frozen at 64 bytes (one cache line). Over 50 years, subsystems will need per-page metadata that doesn't exist today (memory tiering heat counters, CXL fabric locality tags, hardware-assisted tagging state, etc.). Rather than growing Page and breaking every consumer, new fields go into PageExtArray:

/// Per-page extension metadata. Parallel to PageArray, indexed by PFN.
/// Allocated via vmemmap at `PAGE_EXT_BASE` (a separate VA range from
/// the main `VMEMMAP_BASE`). Only backed by physical pages when
/// extension features are active — zero memory cost when unused.
///
/// **Size**: 16 bytes per page (one quarter cache line). At 16 B per
/// 4 KB page, a 1 TB system requires 4 GB of extension metadata.
/// For systems that don't use any extension features, no physical
/// pages are allocated — the VA range exists but faults on access
/// (triggering a kernel OOPS, caught by the FMA framework).
///
/// **NON-REPLACEABLE DATA**: same status as PageArray. Layout changes
/// require a Data Format Evolution payload (see Pattern 2 below for
/// how to resize this struct).
#[repr(C, align(16))]
pub struct PageExt {
    /// Memory tiering heat counter. Incremented by the page replacement
    /// policy on access; decayed by the kswapd background scanner.
    /// Used by the ML-guided tier migration policy (Section 23.1)
    /// to decide DRAM ↔ CXL tier placement.
    /// Initial value: 0 (cold). Range: 0..=65535.
    pub heat: AtomicU16,
    /// CXL fabric locality tag. Identifies the CXL switch/port topology
    /// path to this page's physical device. Set during CXL hot-add
    /// (Section 5.10) and used by the NUMA fallback policy for
    /// latency-aware placement. 0 = local DRAM (no CXL).
    pub cxl_locality: u16,
    /// Hardware memory tag (MTE on AArch64, future tagging on other
    /// architectures). Stored here rather than in Page.flags to keep
    /// Page at 64 bytes. 0 = untagged.
    pub hw_tag: u8,
    /// Extension feature flags. Bit 0: heat counter active.
    /// Bit 1: CXL locality valid. Bit 2: hw_tag valid.
    /// Bit 3: wb_fail_count valid.
    /// Bits 4-7: reserved (must be zero).
    pub ext_flags: u8,
    /// Writeback failure count. Incremented on I/O error; page marked
    /// PERMANENT_ERROR when >= 3. See [Section 4.6](04-memory.md#writeback-subsystem).
    pub wb_fail_count: AtomicU8,
    /// Reserved for future per-page extensions. Must be zero.
    /// When a new field is needed, carve it from _reserved and
    /// assign a new ext_flags bit. When _reserved is exhausted,
    /// use Pattern 2 (Shadow-and-Migrate) to grow PageExt.
    pub _reserved: [u8; 9],
}
// PageExt: heat(AtomicU16=2) + cxl_locality(u16=2) + hw_tag(u8=1) + ext_flags(u8=1) +
//   wb_fail_count(AtomicU8=1) + _reserved([u8;9]=9) = 16 bytes. align(16) satisfied.
const_assert!(core::mem::size_of::<PageExt>() == 16);

/// Per-architecture PageExt vmemmap base addresses.
/// Placed in a separate VA region from the main PageArray vmemmap.
///
/// | Architecture | `PAGE_EXT_BASE`           | Coverage           |
/// |--------------|---------------------------|--------------------|
/// | x86-64       | `0xFFFF_E100_0000_0000`   | 256 TB of physical |
/// | AArch64      | `0xFFFF_8100_0000_0000`   | 256 TB of physical |
/// | RISC-V (Sv48)| `0xFFFF_C100_0000_0000`   | 128 TB of physical |
/// | PPC64LE      | `0xC001_0000_0000_0000`   | 64 TB of physical  |
/// | ARMv7        | Flat array (limited VA)   | ≤4 GB physical     |
/// | PPC32        | Flat array (limited VA)   | ≤4 GB physical     |

pub static PAGE_EXT_ARRAY: OnceCell<PageExtArray> = OnceCell::new();

pub struct PageExtArray {
    /// Base VA of the extension vmemmap.
    pub base: VirtAddr,
    /// Maximum PFN covered (must match PageArray::max_pfn).
    pub max_pfn: AtomicU64,
}

impl PageExtArray {
    /// Get the extension metadata for a physical page.
    /// Returns None if extensions are not initialized or PFN is out of range.
    #[inline]
    pub fn get(&self, pfn: u64) -> Option<&PageExt> {
        if pfn >= self.max_pfn.load(Ordering::Acquire) {
            return None;
        }
        // SAFETY: vmemmap pages for PFNs < max_pfn are backed.
        Some(unsafe { &*((self.base.as_usize() + pfn as usize * 16) as *const PageExt) })
    }
}

Performance: Accessing PageExt is a second cache-line load (~3-5 ns L1 hit, ~10-15 ns L2 hit). Components that don't use extensions pay nothing — they never dereference PAGE_EXT_ARRAY. Components that do (e.g., MGLRU tiering, CXL placement) accept the extra cache miss as the cost of the feature, not the cost of the framework.

Lifecycle: PAGE_EXT_ARRAY is initialized lazily — only when the first extension feature is activated (e.g., CXL device hot-add, MTE policy enablement). Until then, PAGE_EXT_ARRAY.get() returns None and no physical pages are consumed for extension vmemmap.
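The lazy-initialization contract can be sketched standalone, with std's `OnceLock` standing in for the kernel's `OnceCell` (all names here are illustrative mocks, not the kernel API):

```rust
use std::sync::OnceLock;

/// Simplified stand-in for PageExtArray: tracks only PFN coverage.
struct MockPageExtArray {
    max_pfn: u64,
}

static MOCK_PAGE_EXT: OnceLock<MockPageExtArray> = OnceLock::new();

/// Mirrors the contract: before the first extension feature activates,
/// lookups return None and no backing memory exists.
fn mock_get(pfn: u64) -> Option<u64> {
    let arr = MOCK_PAGE_EXT.get()?; // None until first activation
    if pfn >= arr.max_pfn {
        return None; // out-of-range PFN
    }
    Some(pfn) // the real get() would dereference the vmemmap entry here
}

/// Called by the first feature activation (e.g. CXL hot-add).
fn mock_activate(max_pfn: u64) {
    MOCK_PAGE_EXT.get_or_init(|| MockPageExtArray { max_pfn });
}
```

Components that never call `mock_activate` pay nothing, matching the "cost of the feature, not the framework" principle above.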

13.18.12.3.2 Pattern 2: Shadow-and-Migrate

Use when: The data structure itself must change size or layout (e.g., growing PageExt from 16 to 32 bytes, or changing a field's type/offset).

Mechanism: Allocate a new vmemmap region with the new layout. A background kthread migrates data from old → new, page-by-page (for vmemmap arrays) or entry-by-entry (for indexed structures). During migration, reads check a per-region flag to determine which array to read from. After migration completes, the old array is unmapped.

/// State machine for a shadow-and-migrate operation.
///
/// The migration runs as a background kthread at SCHED_IDLE priority.
/// Normal kernel operation continues throughout — the only atomic
/// operation is the final pointer swap (Phase B, ~1 μs stop-the-world,
/// same as live evolution).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MigrationPhase {
    /// No migration in progress. All accesses go to the current array.
    Idle,
    /// Shadow array allocated. Background migration in progress.
    /// Reads: check per-PFN migrated bitmap; read from new if migrated,
    /// old if not. Writes: always go to the new array (write-back
    /// propagated to old array if not yet migrated for that PFN, to
    /// ensure the migration kthread copies the latest data).
    Migrating,
    /// All entries migrated. Waiting for quiescence (same as live
    /// evolution Phase A' — drain in-flight readers of the old array).
    Quiescing,
    /// Atomic swap complete. Old array is unmapped and its physical
    /// pages returned to the buddy allocator.
    Complete,
}

/// Descriptor for a shadow-and-migrate operation on a vmemmap array.
pub struct VmemmapMigration {
    /// Current (old) array base VA.
    pub old_base: VirtAddr,
    /// New array base VA (shadow copy).
    pub new_base: VirtAddr,
    /// Per-entry size in old array.
    pub old_entry_size: u32,
    /// Per-entry size in new array.
    pub new_entry_size: u32,
    /// Total entries to migrate.
    pub entry_count: u64,
    /// Bitmap: 1 bit per entry. 1 = migrated to new array.
    /// Allocated from the buddy allocator during migration init.
    pub migrated: *mut u64,
    /// Current migration position (next entry to migrate).
    pub cursor: AtomicU64,
    /// Current phase.
    pub phase: AtomicU8,
    /// Transform function: converts one entry from old format to new.
    /// Called by the migration kthread for each entry.
    /// `old_ptr` points to the entry in the old array.
    /// `new_ptr` points to the pre-allocated slot in the new array.
    /// The function must copy and transform all fields.
    pub transform: fn(old_ptr: *const u8, new_ptr: *mut u8),
}

Migration rate: The background kthread processes entries in batches of 4096 (one vmemmap page worth). At ~100 ns per entry transform (memcpy + field init), one batch takes ~400 μs. With 60-second sleep between batches (SCHED_IDLE, yields to any real work), migrating 256 GB of physical memory (67M Page entries) takes ~67M / 4096 × 60s ≈ 270 hours ≈ 11 days. This is acceptable for a format change that happens once per decade. For urgent migrations, the batch interval is tunable via /proc/sys/kernel/data_migration_interval_ms (minimum: 0, maximum: 3600000).
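The rate arithmetic can be checked mechanically (numbers taken from the text; the helper is illustrative):

```rust
/// Total wall-clock migration time: one batch per sleep interval.
/// Partial last batch rounds up (ceiling division).
fn migration_duration_secs(entries: u64, batch: u64, interval_secs: u64) -> u64 {
    let batches = (entries + batch - 1) / batch;
    batches * interval_secs
}
```

For 67M entries in 4096-entry batches at one batch per 60 s, this lands in the ~270-hour / ~11-day range quoted above.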

Read interlock during migration: Readers that need to access an entry check the migrated bitmap (one bit per entry, O(1) lookup). This adds ~2 ns per access during migration only. After migration completes and the pointer swap occurs, the bitmap check is removed (the new array is the sole source of truth). The bitmap check is gated on migration.phase == Migrating — an atomic load that is branch-predicted as "not taken" after migration completes.

Invariant: At no point does any reader see a partially-migrated entry. Each entry is migrated atomically: the transform function writes the new entry, then the migrated bit is set with Release ordering. Readers load the bit with Acquire ordering before choosing which array to read from.
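The publish/consume ordering can be sketched for a single entry (std atomics; the real implementation uses one bit per entry in the `migrated` bitmap):

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// One entry's migration cell: old-array slot, new-array slot,
/// and a migrated flag standing in for one bitmap bit.
struct EntryCell {
    old: AtomicU64,
    new: AtomicU64,
    migrated: AtomicBool,
}

impl EntryCell {
    /// Migration kthread: write the transformed entry fully, THEN
    /// publish with Release so readers never see a half-written entry.
    fn migrate(&self, transform: fn(u64) -> u64) {
        let v = self.old.load(Ordering::Acquire);
        self.new.store(transform(v), Ordering::Relaxed);
        self.migrated.store(true, Ordering::Release); // publish
    }

    /// Reader: Acquire-load the flag before choosing which array to read.
    fn read(&self) -> u64 {
        if self.migrated.load(Ordering::Acquire) {
            self.new.load(Ordering::Relaxed)
        } else {
            self.old.load(Ordering::Relaxed)
        }
    }
}
```

The Release store on the flag orders the new-entry write before publication; the Acquire load on the reader side guarantees that once the flag is seen, the transformed entry is fully visible.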

13.18.12.3.3 Pattern 3: Versioned Wire Protocol

Use when: The data structure is a message format used in a ring buffer or IPC channel, where producer and consumer may be at different evolution versions (e.g., during rolling replacement of KABI services, or between host kernel and DPU peer kernel at different versions).

Mechanism: Each message carries a version tag and a length field. Consumers use the version to select the appropriate parser. New fields are always appended (never reordered or removed). Consumers at an older version ignore trailing bytes they don't understand. Consumers at a newer version handle missing trailing fields with documented defaults.

Concrete example — DomainRingBuffer message envelope:

/// Versioned message envelope for DomainRingBuffer entries.
///
/// Every entry in a DomainRingBuffer begins with this 8-byte header.
/// The header is part of the entry's `entry_size` allocation — it does
/// NOT add overhead beyond the existing per-entry budget.
///
/// **Version negotiation**: When a ring is created, producer and consumer
/// exchange their maximum supported `format_version` via the ring setup
/// handshake (existing KABI vtable `create_ring()` call). The ring
/// operates at `min(producer_version, consumer_version)`. If a live
/// evolution upgrades the producer to a newer version, the ring
/// continues at the old version until the consumer is also upgraded.
/// The producer checks the negotiated version before each produce and
/// formats the entry accordingly.
///
/// **Backward compatibility rule**: Version N+1 messages are a strict
/// superset of version N. Fields are append-only — existing field
/// offsets never change. A version N consumer can safely read a version
/// N+1 message by ignoring bytes beyond `payload_len`. A version N+1
/// consumer reading a version N message uses documented defaults for
/// the missing trailing fields.
#[repr(C)]
pub struct RingEntryHeader {
    /// Format version of this entry. Starts at 1.
    /// Incremented when new fields are appended to the entry format.
    pub format_version: u16,
    /// Total bytes of payload following this header (excluding the
    /// header itself). `payload_len + size_of::<RingEntryHeader>()
    /// <= ring.entry_size`.
    pub payload_len: u16,
    /// Entry type discriminant (command, completion, event, etc.).
    /// Interpretation is ring-specific (defined by the KABI service
    /// that owns the ring).
    pub entry_type: u16,
    /// Reserved. Must be zero. Available for future header extensions.
    pub _reserved: u16,
    // Payload bytes follow immediately.
}
// RingEntryHeader: 4 × u16 = 8 bytes.
const_assert!(core::mem::size_of::<RingEntryHeader>() == 8);

Compatibility matrix:

| Producer version | Consumer version | Behavior |
|------------------|------------------|----------|
| N                | N                | Normal operation. All fields understood. |
| N+1              | N                | Producer formats at version N (negotiated). No new fields sent. |
| N                | N+1              | Consumer reads version N entries. Missing new fields use defaults. |
| N+1              | N+1              | Both upgraded. New fields available. |

Performance: The 8-byte header is part of the existing entry allocation. Consumers already read the first bytes of each entry to determine the command type — adding format_version and payload_len to the read is zero additional cache misses. The version check is a single u16 comparison, branch-predicted as "taken" (versions match in steady state).
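The negotiation and trailing-byte tolerance rules can be sketched as follows (the `op_arg`/`flags` payload fields are hypothetical, invented for illustration):

```rust
/// Ring operates at the minimum of the two supported versions.
fn negotiate(producer_max: u16, consumer_max: u16) -> u16 {
    producer_max.min(consumer_max)
}

/// A version-2 consumer parsing a payload that may be version 1:
/// v1 carries a 4-byte `op_arg`; v2 appends a 4-byte `flags`.
/// Missing trailing fields take their documented default (here 0);
/// any bytes beyond what this consumer understands are ignored.
fn parse_v2(payload: &[u8]) -> (u32, u32) {
    let op_arg = u32::from_le_bytes(payload[0..4].try_into().unwrap());
    let flags = if payload.len() >= 8 {
        u32::from_le_bytes(payload[4..8].try_into().unwrap())
    } else {
        0 // documented default for a v1 sender
    };
    (op_arg, flags)
}
```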

13.18.12.4 Migration Orchestration

Data format evolution is orchestrated by the existing live evolution framework (Sections 12.8.1-12.8.6) with one additional phase:

Live Evolution with Data Format Change:

  Phase A — Preparation (existing):
    1. New component loaded. Contains code compiled against new data layout.
    2. MIGRATION ADDITION: evolution payload includes a VmemmapMigration
       descriptor (for Pattern 2) or updated format_version (for Pattern 3).
    3. For Pattern 2: shadow array allocated, background migration kthread
       started. Migration runs concurrently with normal operation.
    4. For Pattern 1: PageExtArray vmemmap pages mapped for the new fields.
       No migration needed — extension array is new (zero-initialized).

  Phase A' — Quiescence (existing, extended):
    5. For Pattern 2: wait for migration to reach Quiescing phase
       (all entries migrated). If migration is incomplete at quiescence
       deadline, evolution is ABORTED (old component continues, shadow
       array freed). This bounds the stop-the-world window.
    6. For Pattern 3: no special action — wire protocol is backward
       compatible by construction.

  Phase B — Atomic swap (existing, ~1-10 μs):
    7. Existing: vtable pointer swap, pending ops transfer, IPI.
    8. MIGRATION ADDITION: for Pattern 2, swap the array pointer
       (e.g., PAGE_EXT_ARRAY base) from old to new. For Pattern 1,
       update ext_flags to indicate new fields are valid.
    9. Increment DATA_FORMAT_EPOCH.

  Phase C — Cleanup (existing, extended):
    10. For Pattern 2: unmap old array's vmemmap pages, return physical
        memory to buddy allocator. Free migration bitmap.
    11. New component uses the new data layout natively.

13.18.12.5 Safety Invariants for Data Format Evolution

INV-DF1: No partial reads. During Pattern 2 migration, a reader never sees a half-written entry. The migrated bitmap is set with Release ordering after the full entry is written; readers load it with Acquire before choosing the array.

INV-DF2: No lost writes. During Pattern 2 migration, writes to already-migrated entries go to the new array. Writes to not-yet-migrated entries go to the old array AND are propagated to the new array by the migration kthread when it reaches that entry (the kthread re-reads the old entry before transforming it, using Acquire ordering).

INV-DF3: Epoch monotonicity. DATA_FORMAT_EPOCH is strictly increasing. It is only incremented during Phase B (stop-the-world), so no concurrent reader observes a stale epoch value after the IPI completes.

INV-DF4: Wire protocol backward compatibility. For Pattern 3, version N+1 messages never reorder or remove fields from version N. This is enforced by compile-time layout assertions: each version's struct includes the prior version's fields at the same offsets, with new fields appended.

INV-DF5: Extension array isolation. For Pattern 1, the extension array (PAGE_EXT_ARRAY) is in a separate VA range from the core array (PAGE_ARRAY). A bug in extension array access (null PAGE_EXT_ARRAY, wrong PFN bounds) cannot corrupt the core Page descriptors.

13.18.12.6 When to Use Each Pattern

| Criterion | Pattern 1: Extension Array | Pattern 2: Shadow-and-Migrate | Pattern 3: Versioned Wire |
|-----------|----------------------------|-------------------------------|---------------------------|
| Instance count | Billions (per-page) | Millions-billions | Thousands (per-ring) |
| Core struct must stay same size? | Yes | No (struct grows/shrinks) | N/A (message format) |
| Migration time | Instant (new array) | Hours-days (background) | Instant (per-message) |
| Hot-path overhead | Extra cache miss (when used) | ~2 ns bitmap check (during migration only) | Zero (header in existing read) |
| Use case | New per-page metadata | Grow PageExt, resize BuddyFreeList | DomainRingBuffer entry format, KABI wire protocol |
| Frequency | Every few years | Once per decade | Every major KABI version |

13.18.12.7 Cross-references


13.18.13 Evolvable Module Developer SDK

The live evolution architecture above is powerful but places a correctness burden on every driver contributor: if a developer accidentally stores a static AtomicU64 in a replaceable module instead of the persistent data layer, the next live update silently loses that state. Linux does not have this problem because Linux modules are never replaced at runtime.

UmkaOS addresses this with three layers of defense — link-time enforcement (zero false negatives), compile-time ergonomics (#[umka_evolvable] proc macro), and an SDK that makes the correct pattern the path of least resistance (PersistentState derive macro with automatic state migration).

13.18.13.1 Layer 1: Link-Time ELF Section Check (Load-Time Enforcement)

An evolvable .uko module must have no .data or .bss section — no mutable statics at all. The KABI module loader (Section 12.7) already parses ELF sections during driver_load(). The evolvable-module validation adds a hard reject gate:

/// Called during driver_load() for modules with transport_mask indicating
/// evolvable replacement capability. Rejects modules that contain mutable
/// global state — such state would be silently lost on live replacement.
fn validate_evolvable_module(elf: &ElfImage) -> Result<(), LoadError> {
    for section_name in [".data", ".bss", ".data.rel.ro", ".tdata", ".tbss"] {
        if let Some(sec) = elf.section_by_name(section_name) {
            if sec.size > 0 {
                return Err(LoadError::StatefulEvolvable {
                    section: section_name,
                    size: sec.size,
                    hint: "Evolvable modules must be stateless. Move mutable \
                           state to the Nucleus data layer via PersistentSlot<T>. \
                           See umka-driver-sdk::evolvable for patterns.",
                });
            }
        }
    }
    // .rodata is allowed (immutable constants, vtable pointers, string literals).
    // .text is required (the code itself).
    // .umka_evolvable_marker is required (placed by #[umka_evolvable] macro).
    if elf.section_by_name(".umka_evolvable_marker").is_none() {
        return Err(LoadError::MissingEvolvableMarker {
            hint: "Module declares evolvable capability but was not compiled \
                   with #[umka_evolvable]. Add the attribute to the module \
                   entry struct.",
        });
    }
    Ok(())
}

This catches static mut, static AtomicU64, lazy_static!, OnceCell, static Mutex<T>, and any other construct that generates a mutable global (.data/.bss). It also catches thread_local! (Rust) and __thread (C), which generate .tdata/.tbss sections — per-thread mutable state that is equally lost on module replacement. It is a hard gate at load time, not a lint that can be overridden. Zero false negatives — if the module has any mutable static or thread-local state, it has one of the five checked sections.
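The gate's accept/reject logic can be exercised against a mock section table (the real check walks parsed ELF section headers; `MockElf` and the simplified error strings are illustrative):

```rust
use std::collections::HashMap;

/// Mock stand-in for the parsed ELF image: section name -> size.
struct MockElf {
    sections: HashMap<&'static str, u64>,
}

const FORBIDDEN: [&str; 5] = [".data", ".bss", ".data.rel.ro", ".tdata", ".tbss"];

/// Same decision logic as validate_evolvable_module(), minus the rich
/// error payloads: reject any non-empty mutable-state section, then
/// require the evolvable marker section.
fn validate(elf: &MockElf) -> Result<(), &'static str> {
    for name in FORBIDDEN {
        if elf.sections.get(name).copied().unwrap_or(0) > 0 {
            return Err("stateful evolvable module");
        }
    }
    if !elf.sections.contains_key(".umka_evolvable_marker") {
        return Err("missing evolvable marker");
    }
    Ok(())
}
```

A single `static AtomicU64` produces 8 bytes of .bss, which is enough to trip the gate.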

What about stack and heap state?

  • Stack variables (local variables in callbacks) are not a concern. Callbacks run, use locals, and return. Module replacement happens between callbacks (the Phase A quiescence protocol ensures no CPU is mid-execution in the old module). Stack frames are transient by nature — they were never meant to survive across calls and don't need migration.

  • Long-running worker threads with loop-local state are also safe. The old thread continues executing old code (RCU protects the old module from being unloaded). The new module starts a new worker. The old thread's stack state is consumed naturally before the old module is freed.

  • Heap-allocated state hidden behind raw pointers (e.g., Box::into_raw() cast to u64 and smuggled into a CongPriv field) is a misuse that PersistentState prevents: the derive macro enforces all fields are Copy types (no pointers, no Box, no Arc). Developers who bypass the SDK and write raw bytes into CongPriv.data can circumvent this — but they are explicitly opting out of the safety contract. The PersistentState derive macro's Copy bound is the defense; runtime detection of smuggled pointers is not feasible without hardware tagging (MTE/HWASAN).

13.18.13.2 Layer 2: #[umka_evolvable] Proc Macro (Compile-Time Enforcement)

The #[umka_evolvable] attribute macro, provided by umka-driver-sdk and applied in code as #[evolvable::entry] (the SDK's evolvable module re-exports it under that path), enforces statelessness at compile time and generates the link-time marker section:

use umka_driver_sdk::evolvable;

#[evolvable::entry]
pub struct MyCongestionOps;

impl CongestionOps for MyCongestionOps {
    fn on_ack(&self, state: &mut CongPriv, ack: &AckInfo) {
        // `state` is per-connection persistent data (lives in the socket,
        // survives module replacement). `self` is zero-sized — the struct
        // is a pure policy selector with no owned state.
        let bbr = state.as_typed::<BbrState>();
        bbr.cwnd = min(bbr.cwnd + 1, bbr.ssthresh);
    }
}

The proc macro performs these compile-time checks:

  1. Zero-size or immutable-only struct: The entry struct must either be a ZST (no fields) or have only &'static reference fields (immutable configuration pointers). Any field containing AtomicXxx, UnsafeCell, Mutex, RefCell, Cell, or raw mutable pointers emits a compile error:

    error: evolvable module entry struct contains mutable state
      --> my_congestion.rs:4:5
      |
    4 |     counter: AtomicU64,
      |     ^^^^^^^^^^^^^^^^^^ mutable state in evolvable module
      |
      = help: move this state to a #[derive(PersistentState)] struct
              and access it via PersistentSlot<T> or CongPriv::as_typed()
    

  2. ZST assertion: Generates const _: () = assert!(core::mem::size_of::<Self>() == 0); for entry structs with no fields (catches phantom state via PhantomData<T> where T has non-zero size).

  3. Marker section: Generates #[link_section = ".umka_evolvable_marker"] on a static byte, which the link-time check (Layer 1) verifies.

  4. Lint lockdown: The macro's generated code includes #[deny(mutable_transmutes)], rejecting unsafe workarounds that transmute immutable references to mutable ones.

13.18.13.3 Layer 3: PersistentState — State Definition and Migration

The persistent state infrastructure has two entry paths that produce identical output:

| Path | Source | When to use |
|------|--------|-------------|
| KABI IDL (recommended) | .kabi file with state block | Cross-language modules (C + Rust), third-party drivers, any module that may be reimplemented in a different language |
| Rust derive macro | #[derive(PersistentState)] in .rs | Rust-only modules where the developer prefers native Rust syntax and no .kabi file exists for the state struct |

Both paths produce layout-identical #[repr(C)] structs with the same version numbers, migration functions, and binary format. The KABI compiler's Rust output and the derive macro's output are interchangeable — a state slot written by an IDL-generated module can be read by a derive-macro module and vice versa.

When the .kabi IDL already defines the state struct (the common case for KABI drivers that ship both C and Rust bindings), Rust code uses the generated bbr_state.rs directly — no #[derive(PersistentState)] needed. The derive macro exists for Rust-only modules that have no .kabi IDL file (e.g., kernel-internal evolvable components like the scheduler policy, page reclaim algorithm, or VFS writeback policy, which are Rust-only and never need C bindings).

When a .kabi IDL file defines the state struct (see the C Driver Support section below for the full IDL syntax), umka-kabi-gen generates both C and Rust versions. The Rust driver simply imports the generated struct:

// Generated by umka-kabi-gen from bbr.kabi — layout-identical to C version
use bbr_kabi::BbrState;

#[evolvable::entry]
pub struct Bbr;

impl CongestionOps for Bbr {
    fn on_ack(&self, priv_state: &mut CongPriv, ack: &AckInfo) {
        let state = priv_state.as_typed::<BbrState>();  // IDL-generated type
        state.btl_bw = max(state.btl_bw, ack.delivered_bytes * 1_000_000 / ack.rtt_us);
    }
}

No #[derive(PersistentState)] annotation is needed — the IDL compiler generates the impl PersistentState block, including migrate_in_place(), version constants, and layout hash. The generated code is in a separate crate (bbr_kabi) that both C and Rust modules depend on.

13.18.13.3.2 PersistentState Derive Macro (Rust-Only Convenience Path)

For Rust-only evolvable components that have no .kabi IDL, the #[derive(PersistentState)] macro provides equivalent functionality with native Rust syntax:

use umka_driver_sdk::evolvable::PersistentState;

/// Per-connection state for the BBR congestion control algorithm.
/// Lives in the socket's CongPriv slot (Nucleus data), NOT in the
/// evolvable module. Survives module replacement transparently.
#[derive(PersistentState)]
#[repr(C)]
pub struct BbrState {
    // --- v1 fields (offsets fixed forever once published) ---
    /// Minimum RTT observed during the connection lifetime.
    pub min_rtt_us: u64,        // offset 0
    /// Estimated bottleneck bandwidth (bytes/sec).
    pub btl_bw: u64,            // offset 8
    /// Current BBR state machine phase.
    pub phase: BbrPhase,        // offset 16  (u8 enum)
    pub _pad0: [u8; 3],         // offset 17  (alignment)
    /// Congestion window (bytes).
    pub cwnd: u32,              // offset 20

    // --- v2 additions (appended — v1 state reads zeros here) ---
    /// Number of rounds spent in PROBE_RTT phase. Added in v2 to
    /// improve exit timing. Default 10 for connections upgraded from v1.
    #[since(version = 2, default = "10")]
    pub probe_rtt_rounds: u32,  // offset 24

    // --- v3 additions ---
    /// ECN alpha parameter for DCTCP-style marking. Added in v3.
    #[since(version = 3, default = "0")]
    pub ecn_alpha: u32,         // offset 28
}
// BbrState: min_rtt_us(u64=8) + btl_bw(u64=8) + phase(u8=1) + _pad0(3) +
//   cwnd(u32=4) + probe_rtt_rounds(u32=4) + ecn_alpha(u32=4) = 32 bytes.
const_assert!(core::mem::size_of::<BbrState>() == 32);

The #[derive(PersistentState)] macro generates:

impl PersistentState for BbrState {
    const CURRENT_VERSION: u64 = 3;
    const MIN_SUPPORTED_VERSION: u64 = 1;
    const LAYOUT_HASH: u64 = 0x...; // FNV-1a of field names + types + offsets

    /// In-place migration from an older version. Called by the SDK when
    /// the stored state version is older than CURRENT_VERSION.
    /// Fields beyond `stored_size` are already zero (Nucleus zeroes slots).
    fn migrate_in_place(&mut self, stored_version: u64, _stored_size: usize) {
        if stored_version < 2 {
            self.probe_rtt_rounds = 10;
        }
        if stored_version < 3 {
            self.ecn_alpha = 0;
        }
    }
}
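The generated migration can be exercised standalone. This sketch uses a simplified struct with no derive macro; the migrate body mirrors the generated one above, and fields beyond the stored size start zeroed, matching the Nucleus slot-zeroing contract:

```rust
/// Simplified BbrState carrying only the fields needed to demonstrate
/// versioned in-place migration.
#[repr(C)]
#[derive(Default)]
struct DemoBbrState {
    min_rtt_us: u64,
    btl_bw: u64,
    cwnd: u32,
    probe_rtt_rounds: u32, // @since(2), default 10
    ecn_alpha: u32,        // @since(3), default 0
}

impl DemoBbrState {
    /// Mirrors the generated migrate_in_place(): fill defaults for
    /// every field added after the stored version.
    fn migrate_in_place(&mut self, stored_version: u64) {
        if stored_version < 2 {
            self.probe_rtt_rounds = 10; // v2 default for upgraded state
        }
        if stored_version < 3 {
            self.ecn_alpha = 0; // v3 default (already zero; kept explicit)
        }
    }
}
```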

Compile-time invariants enforced by the macro:

  • #[repr(C)] is required (stable binary layout across module versions).
  • All #[since] fields must come after all non-#[since] fields (append-only invariant). Violating this emits:
    error: non-#[since] field after #[since] field — append-only layout violated
      --> bbr_state.rs:15:5
      |
    15 |     pub z: u64,
      |     ^^^^^^^^^^ move before #[since] fields or add #[since(version = N)]
    
  • Fields must not contain pointers, Box, Arc, Vec, String, or any heap-allocated type (non-portable across replacement boundaries). Only fixed-size Copy types are allowed. The macro checks T: Copy for every field type.
  • #[deprecated_since(version = N)] marks fields that are no longer used by the current version but must remain in the struct for layout compatibility. The generated code zeroes them on migration from versions where they were active.
13.18.13.3.3 PersistentSlot<T> — Nucleus-Owned State Handle

For components with global state (not per-connection), the SDK provides PersistentSlot<T>:

/// Nucleus-owned state slot for an evolvable component. The Nucleus
/// allocates and owns the memory; evolvable code borrows it via
/// type-safe accessor. State persists across module replacements.
pub struct PersistentSlot<T: PersistentState> {
    /// State allocation (Nucleus-owned memory). Uses `UnsafeCell` to allow
    /// `&self` → `&mut T` borrows; external synchronization (the socket lock
    /// or per-component mutex) guarantees exclusive access.
    data: UnsafeCell<T>,
    /// Version of the state as last written.
    stored_version: AtomicU64,
    /// Byte size of the state as last written.
    stored_size: AtomicU32,
}

impl<T: PersistentState> PersistentSlot<T> {
    /// Borrow the persistent state with automatic version migration.
    /// Called by evolvable module code to access its persistent state.
    ///
    /// If the stored version is older than `T::CURRENT_VERSION`, runs
    /// in-place migration (for append-only changes) or returns
    /// `MigrationRequired` (for breaking changes that need explicit
    /// `MigrateFrom` implementation).
    pub fn borrow(&self) -> Result<&mut T, SlotError> {
        let stored_ver = self.stored_version.load(Acquire);
        let stored_sz = self.stored_size.load(Acquire);

        if stored_ver == T::CURRENT_VERSION {
            // Fast path: same version, no migration needed.
            // SAFETY: External synchronization (socket lock / per-component mutex)
            // guarantees that only one thread calls borrow() at a time, so the
            // &mut T does not alias. UnsafeCell permits interior mutability.
            return Ok(unsafe { &mut *self.data.get() });
        }

        if stored_ver < T::MIN_SUPPORTED_VERSION {
            return Err(SlotError::VersionTooOld {
                stored: stored_ver,
                min_supported: T::MIN_SUPPORTED_VERSION,
                hint: "State is from a version older than the 5-release \
                       support window. Slot will be re-initialized.",
            });
        }

        // Migration path: fill defaults for fields added in newer versions.
        // SAFETY: Same exclusive-access guarantee as the fast path above.
        let state = unsafe { &mut *self.data.get() };
        state.migrate_in_place(stored_ver, stored_sz as usize);
        self.stored_version.store(T::CURRENT_VERSION, Release);
        self.stored_size.store(core::mem::size_of::<T>() as u32, Release);
        Ok(state)
    }
}

The Nucleus stores a version registry alongside each slot: (module_id, state_version: u64, state_size: u32, layout_hash: u64). The layout hash is a compile-time FNV-1a of field names, types, and offsets. If the hash of the prefix (up to stored_size) matches the new module's prefix hash, migration is skipped even if the version number differs (pure field additions produce compatible prefix layouts).
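The prefix-hash property can be sketched as follows. This is FNV-1a over per-field descriptors; the `name:type@offset` descriptor format is illustrative, not the registry's actual encoding:

```rust
/// FNV-1a (64-bit) over the descriptors of all fields that start within
/// `prefix_size` bytes. Append-only evolution never changes an earlier
/// field's name, type, or offset, so the hash of v(N+1) restricted to
/// v(N)'s size equals v(N)'s full hash — and migration can be skipped.
fn layout_prefix_hash(fields: &[(&str, &str, usize)], prefix_size: usize) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for (name, ty, offset) in fields {
        if *offset >= prefix_size {
            continue; // field appended after this prefix
        }
        for b in format!("{name}:{ty}@{offset}").bytes() {
            h ^= b as u64;
            h = h.wrapping_mul(0x100_0000_01b3); // FNV-1a 64-bit prime
        }
    }
    h
}
```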

13.18.13.3.4 MigrateFrom<Old> — Breaking State Migration

When append-only evolution is insufficient (field type change, unit conversion, struct reorganization), the developer implements MigrateFrom:

#[derive(PersistentState)]
#[repr(C)]
pub struct BbrStateV4 {
    pub min_rtt_ns: u64,       // Changed: was min_rtt_us (μs → ns)
    pub btl_bw_bytes: u64,     // Changed: was btl_bw (packets → bytes)
    pub phase: BbrPhase,
    pub _pad0: [u8; 3],
    pub cwnd_bytes: u32,       // Changed: was cwnd (packets → bytes)
    pub probe_rtt_rounds: u32,
    pub ecn_alpha: u32,
}
// BbrStateV4: same field layout as BbrState v3 (field meanings changed, not layout).
const_assert!(core::mem::size_of::<BbrStateV4>() == 32);

impl MigrateFrom<BbrStateV3> for BbrStateV4 {
    /// Convert v3 state to v4 layout. Called once per slot during
    /// the evolution primitive's Phase A (preparation).
    ///
    /// `context` provides connection metadata (MSS, socket info) needed
    /// for unit conversions. For global state, context carries the
    /// component's configuration.
    fn migrate(old: &BbrStateV3, context: &MigrationContext) -> Result<Self, MigrationError> {
        let mss = context.tcp_mss().ok_or(MigrationError::MissingContext("tcp_mss"))?;
        Ok(Self {
            min_rtt_ns: old.min_rtt_us.checked_mul(1000)
                .ok_or(MigrationError::Overflow("min_rtt_ns"))?,
            btl_bw_bytes: old.btl_bw.saturating_mul(mss as u64),
            phase: old.phase,
            _pad0: [0; 3],
            cwnd_bytes: old.cwnd.saturating_mul(mss),
            probe_rtt_rounds: old.probe_rtt_rounds,
            ecn_alpha: old.ecn_alpha,
        })
    }
}
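The unit conversions above can be demonstrated with simplified structs (the `mss` parameter stands in for the `MigrationContext` lookup; struct and field names here are illustrative):

```rust
/// Simplified v3 state: RTT in microseconds, bandwidth/window in packets.
struct V3 { min_rtt_us: u64, btl_bw_pkts: u64, cwnd_pkts: u32 }
/// Simplified v4 state: RTT in nanoseconds, bandwidth/window in bytes.
struct V4 { min_rtt_ns: u64, btl_bw_bytes: u64, cwnd_bytes: u32 }

/// Mirrors MigrateFrom::migrate(): checked us -> ns conversion,
/// saturating packets -> bytes scaling by MSS.
fn migrate_v3_to_v4(old: &V3, mss: u32) -> Result<V4, &'static str> {
    Ok(V4 {
        min_rtt_ns: old.min_rtt_us.checked_mul(1000).ok_or("overflow")?,
        btl_bw_bytes: old.btl_bw_pkts.saturating_mul(mss as u64),
        cwnd_bytes: old.cwnd_pkts.saturating_mul(mss),
    })
}
```

An overflow in the checked conversion surfaces as an error, which the dry-run validation below turns into an aborted evolution rather than corrupted state.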

The SDK dispatches migration based on version distance:

// Inside PersistentSlot<T>::borrow() — breaking migration path
if T::has_breaking_migration(stored_ver) {
    // SAFETY: external synchronization (socket lock / per-component mutex)
    // guarantees exclusive access to the slot, so reading the old bytes
    // and overwriting the slot in place do not race.
    let old_bytes = unsafe {
        core::slice::from_raw_parts(self.data.get() as *const u8, stored_sz as usize)
    };
    let new_state = T::migrate_from_version(stored_ver, old_bytes, &migration_ctx)?;
    unsafe { core::ptr::write(self.data.get(), new_state) };
    self.stored_version.store(T::CURRENT_VERSION, Release);
    self.stored_size.store(core::mem::size_of::<T>() as u32, Release);
}

Dry-run validation: During Phase A of the evolution primitive (Section 13.18), the migration is executed with DryRun::Yes on a copy of the state. If migration fails (overflow, missing context, incompatible layout), the evolution swap is aborted before Phase B. No state is modified. The FMA subsystem logs the migration failure with the specific error and version pair.

Per-connection lazy migration: For per-socket state (congestion control, TCP options), migration does not run all-at-once during Phase A. Instead:

  1. Phase A records (old_version, new_version) in a global migration descriptor.
  2. Phase B swaps the module pointer (new code is now active).
  3. Each connection migrates its CongPriv state lazily on the next callback invocation (ACK processing, retransmit timer). The callback checks state.version < CURRENT_VERSION and calls migrate_in_place() or MigrateFrom::migrate() before proceeding.
  4. Migration cost is spread across connections and time — no thundering herd.

Connections that are idle (no packets for hours) migrate on their next activity. Connections that close before migrating simply free their old-format state — no migration needed for dead connections.
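The per-callback check can be sketched as follows (the versioned slot wrapper here is hypothetical; in the real SDK the check lives behind `CongPriv::as_typed()`):

```rust
/// Hypothetical per-connection slot: state plus the version it was
/// last written at.
struct LazySlot {
    version: u64,
    probe_rtt_rounds: u32, // @since(2), default 10
}

const CURRENT_VERSION: u64 = 2;

/// Entry point of every callback: migrate-if-stale, then proceed.
/// Idle connections pay nothing until their next packet arrives;
/// the branch is predicted "not taken" once the fleet has migrated.
fn on_callback(slot: &mut LazySlot) -> u32 {
    if slot.version < CURRENT_VERSION {
        slot.probe_rtt_rounds = 10; // fill the v2 default
        slot.version = CURRENT_VERSION;
    }
    slot.probe_rtt_rounds
}
```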

13.18.13.3.5 Version Support Window

Each PersistentState struct declares MIN_SUPPORTED_VERSION — the oldest version from which migration is possible. This follows the KABI 5-release support window (Section 12.2):

| Scenario | Behavior |
|----------|----------|
| stored_version == CURRENT_VERSION | Fast path — no migration |
| stored_version ∈ [MIN_SUPPORTED, CURRENT) | Migration (in-place or breaking) |
| stored_version < MIN_SUPPORTED_VERSION | Slot re-initialized from defaults; old state discarded. For per-connection state: connection is reset (RST). FMA warning logged. |

The version support window ensures migration functions remain bounded in complexity — each version only needs to migrate from the previous version, not from arbitrary ancient versions. Chain migration (v1→v2→v3→v4) is supported but the chain length is bounded by the support window (max 5 hops).

13.18.13.4 Complete Developer Examples

The IDL file is the single source of truth. Both C and Rust code use IDL-generated types.

Step 1: Define state in .kabi IDL:

// bbr.kabi
state BbrState @version(3) @min_version(1) {
    min_rtt_us:       u64;
    btl_bw:           u64;
    phase:            u8;
    @pad(3);
    cwnd:             u32;
    pacing_rate:      u64;
    round_count:      u64;

    @since(2) @default(10)
    probe_rtt_rounds: u32;

    @since(3) @default(0)
    ecn_alpha:        u32;
}

Step 2: Rust driver imports generated type:

// file: umka-bbr/src/lib.rs
use core::cmp::{max, min};
use bbr_kabi::BbrState;  // Generated by umka-kabi-gen from bbr.kabi
use umka_driver_sdk::evolvable;
use umka_driver_sdk::net::{CongestionOps, CongPriv, AckInfo, LossInfo};

#[evolvable::entry]
pub struct Bbr;

impl CongestionOps for Bbr {
    fn on_ack(&self, priv_state: &mut CongPriv, ack: &AckInfo) {
        let state = priv_state.as_typed::<BbrState>();
        state.btl_bw = max(state.btl_bw, ack.delivered_bytes * 1_000_000 / ack.rtt_us);
        state.min_rtt_us = min(state.min_rtt_us, ack.rtt_us);
        state.round_count += 1;
    }

    fn on_loss(&self, priv_state: &mut CongPriv, loss: &LossInfo) {
        let state = priv_state.as_typed::<BbrState>();
        state.cwnd = max(state.cwnd / 2, 4 * loss.mss);
    }

    fn init_state(&self, priv_state: &mut CongPriv, mss: u32) {
        let state = priv_state.init_typed::<BbrState>();
        state.cwnd = 10 * mss;
        state.pacing_rate = 0;
        state.min_rtt_us = u64::MAX;
    }
}

Step 3: C driver (same .kabi, same state format):

/* file: umka-bbr-c/src/bbr.c */
#include "bbr_state.h"  /* Generated by umka-kabi-gen from bbr.kabi */

UMKA_EVOLVABLE_ENTRY(bbr_ops);

static void bbr_on_ack(struct cong_priv *priv, const struct ack_info *ack)
{
    struct bbr_state *s = umka_typed(priv, bbr_state);
    s->btl_bw = max(s->btl_bw, ack->delivered_bytes * 1000000 / ack->rtt_us);
    s->min_rtt_us = min(s->min_rtt_us, ack->rtt_us);
    s->round_count++;
}

The Rust module and the C module can replace each other at runtime — the persistent BbrState is binary-compatible because both use IDL-generated types.

Compile and load behavior (same for both languages):

  • Layer 1 validates the .uko (no .data/.bss sections).
  • #[evolvable::entry] (Rust) or UMKA_EVOLVABLE_ENTRY (C) generates the .umka_evolvable_marker section.
  • State migration is handled by the IDL-generated migrate_in_place() / bbr_state_migrate().
  • Per-connection BbrState slots are lazily migrated on the next ACK.
  • Connections whose state was written by the old module version continue undisturbed until their next callback migrates it; no connections are disrupted.

13.18.13.4.2 Example B: Rust Derive Macro Path (Rust-Only Convenience)

For kernel-internal evolvable components (scheduler policy, page reclaim, VFS writeback) that will never have C implementations, the developer can define the state struct directly in Rust without a .kabi IDL file:

// file: umka-core/src/reclaim/policy_state.rs

use umka_driver_sdk::evolvable::PersistentState;

#[derive(PersistentState)]
#[repr(C)]
pub struct ReclaimPolicyState {
    pub refault_distance: u64,              // 8 bytes, offset 0
    pub scan_balance: i32,                  // 4 bytes, offset 8
    pub _pad: [u8; 4],                      // 4 bytes, offset 12
    pub generation: u64,                    // 8 bytes, offset 16

    #[since(version = 2, default = "50")]
    pub anon_cost_ratio: u32,               // 4 bytes, offset 24
    /// Explicit tail padding: content ends at offset 28. Struct align = 8
    /// (from u64 fields). 28 % 8 = 4, need 4 bytes. CLAUDE.md rule 11.
    /// Crosses live-replacement boundaries — must not leak uninitialized memory.
    pub _pad_tail: [u8; 4],                 // 4 bytes, offset 28
}
// ReclaimPolicyState layout (all padding explicit):
//   refault_distance(u64=8) + scan_balance(i32=4) + _pad([u8;4]=4) +
//   generation(u64=8) + anon_cost_ratio(u32=4) + _pad_tail([u8;4]=4) = 32 bytes.
//   Struct align 8. 32 % 8 = 0. No implicit padding.
const_assert!(core::mem::size_of::<ReclaimPolicyState>() == 32);

This path produces the same PersistentState trait impl, the same version checking, and the same migration behavior as the IDL path. The only difference is that no C header is generated — this state struct is only usable from Rust.

When to use which path:

Scenario Path
KABI driver with C and/or Rust implementations IDL (mandatory)
KABI driver that is Rust-only today but may get C port IDL (recommended)
Kernel-internal evolvable component (never externally visible) Derive macro (convenience)
Third-party driver IDL (mandatory — enables cross-language replacement)

13.18.13.5 Cross-references

  • Live evolution Phase A/B/C flow: Section 13.18
  • VtableHeader (stateful evolution header): above
  • PolicyVtableHeader (stateless swap): Section 19.9
  • KABI ABI rules and 5-release support window: Section 12.2
  • CongPriv per-connection state: Section 16.8
  • Data Format Evolution patterns (Extension Array, Shadow-and-Migrate): above

13.18.13.6 C Driver Support via KABI IDL Code Generation

The sections above define the SDK abstractions (Layers 1-3) and show both the IDL path and the Rust derive-macro path. This section specifies the C-specific code generation — how the KABI compiler (umka-kabi-gen, Section 12.5) transforms the .kabi IDL state annotations into C headers, migration functions, and registration macros.

The key principle: the .kabi IDL file is the single source of truth for state layout, versioning, and migration. A C driver and a Rust driver can replace each other at runtime — the persistent state format is language-independent because it is defined in the IDL, not in either language's type system.

13.18.13.6.1 KABI IDL State Versioning Syntax

The state keyword defines a persistent state struct with versioning metadata:

// bbr.kabi — KABI IDL with state versioning

state BbrState @version(3) @min_version(1) {
    // v1 fields (offsets fixed forever once published)
    min_rtt_us:        u64;       // Minimum RTT observed (microseconds)
    btl_bw:            u64;       // Estimated bottleneck bandwidth (bytes/sec)
    phase:             u8;        // BBR state machine phase
    @pad(3);
    cwnd:              u32;       // Congestion window (bytes)

    // v2 additions (appended — v1 state reads zeros here)
    @since(2) @default(10)
    probe_rtt_rounds:  u32;       // Rounds in PROBE_RTT phase

    // v3 additions
    @since(3) @default(0)
    ecn_alpha:         u32;       // ECN alpha for DCTCP-style marking
}

IDL annotations:

Annotation Scope Meaning
@version(N) state block Current version of this state struct
@min_version(N) state block Oldest version that can be migrated from
@since(N) field Field was added in version N; absent in earlier versions
@default(expr) field Default value when migrating from a version before @since
@pad(N) field Explicit padding (N bytes) for alignment
@deprecated_since(N) field Field unused since version N; zeroed on migration
@migrate_from(OldType, func) state block Breaking migration from OldType via named function

The KABI compiler validates at IDL parse time:

  • All @since fields appear after all non-@since fields (append-only invariant).
  • @default expressions are compile-time constants.
  • @pad sizes produce correct alignment for the following field.
  • @version ≥ @min_version.
  • All field types are fixed-size value types (no pointers, no heap types).
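The append-only invariant check is mechanical. A sketch of how a parser might enforce it (the field representation here is illustrative, not the compiler's actual AST):

```rust
// Illustrative IDL field after parsing: `since == None` means a base (v1) field.
struct IdlField {
    name: &'static str,
    since: Option<u64>,
}

/// Reject field orderings that violate the append-only invariant:
/// no base field after an @since field, and @since versions never decrease.
fn validate_append_only(fields: &[IdlField]) -> Result<(), String> {
    let mut last_since: Option<u64> = None;
    for f in fields {
        match (last_since, f.since) {
            (Some(_), None) => {
                return Err(format!("base field `{}` appears after an @since field", f.name));
            }
            (Some(prev), Some(s)) if s < prev => {
                return Err(format!("@since({}) on `{}` goes backwards", s, f.name));
            }
            (_, s) => last_since = s.or(last_since),
        }
    }
    Ok(())
}

fn main() {
    let good = [
        IdlField { name: "min_rtt_us", since: None },
        IdlField { name: "probe_rtt_rounds", since: Some(2) },
        IdlField { name: "ecn_alpha", since: Some(3) },
    ];
    assert!(validate_append_only(&good).is_ok());

    let bad = [
        IdlField { name: "ecn_alpha", since: Some(3) },
        IdlField { name: "cwnd", since: None }, // base field after an @since field
    ];
    assert!(validate_append_only(&bad).is_err());
}
```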

13.18.13.6.2 Generated C Output

From the IDL above, umka-kabi-gen produces:

/* bbr_state.h — Auto-generated by umka-kabi-gen from bbr.kabi — DO NOT EDIT */
#pragma once
#include <umka/driver_sdk.h>

#define BBR_STATE_CURRENT_VERSION  3
#define BBR_STATE_MIN_VERSION      1
/* Lowercase aliases: umka_typed() token-pastes the struct tag (bbr_state)
 * with _CURRENT_VERSION, so the generator emits both spellings. */
#define bbr_state_CURRENT_VERSION  BBR_STATE_CURRENT_VERSION
#define bbr_state_MIN_VERSION      BBR_STATE_MIN_VERSION

struct __attribute__((packed)) bbr_state {
    uint64_t min_rtt_us;        /* offset 0  — v1 */
    uint64_t btl_bw;            /* offset 8  — v1 */
    uint8_t  phase;             /* offset 16 — v1 */
    uint8_t  _pad0[3];          /* offset 17 — alignment */
    uint32_t cwnd;              /* offset 20 — v1 */
    uint32_t probe_rtt_rounds;  /* offset 24 — v2, default 10 */
    uint32_t ecn_alpha;         /* offset 28 — v3, default 0 */
};

_Static_assert(sizeof(struct bbr_state) == 32, "bbr_state layout");

/// Auto-generated in-place migration from @since/@default annotations.
/// Called by the SDK when stored_version < BBR_STATE_CURRENT_VERSION.
/// Fields beyond stored_size are already zero (Nucleus zeroes slots).
static inline void bbr_state_migrate(struct bbr_state *s,
                                     uint64_t stored_version)
{
    if (stored_version < 2)
        s->probe_rtt_rounds = 10;
    if (stored_version < 3)
        s->ecn_alpha = 0;
}

/// Register this state type with the PersistentSlot infrastructure.
/// Provides version, min_version, size, and migration function pointer.
UMKA_REGISTER_STATE(bbr_state, BBR_STATE_CURRENT_VERSION,
                    BBR_STATE_MIN_VERSION, bbr_state_migrate);

The UMKA_REGISTER_STATE macro expands to a static descriptor placed in a dedicated ELF section (.umka_state_registry), which the Nucleus scans at module load time to populate its version registry.
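The load-time scan amounts to walking the descriptors in that section and sanity-checking each one before it enters the version registry. A sketch under assumed (not specified) descriptor fields:

```rust
// Illustrative descriptor as UMKA_REGISTER_STATE might emit it; the real
// in-section layout is defined by the SDK, not by this sketch.
struct StateDescriptor {
    name: &'static str,
    current_version: u64,
    min_version: u64,
    size: usize,
}

/// Populate a version registry from a module's .umka_state_registry section
/// (represented here as a slice), rejecting malformed descriptors.
fn scan_registry(descs: &[StateDescriptor]) -> Result<Vec<(&'static str, u64)>, String> {
    let mut registry = Vec::new();
    for d in descs {
        if d.min_version > d.current_version || d.size == 0 {
            return Err(format!("rejecting descriptor `{}`", d.name));
        }
        registry.push((d.name, d.current_version));
    }
    Ok(registry)
}

fn main() {
    let descs = [StateDescriptor {
        name: "bbr_state", current_version: 3, min_version: 1, size: 32,
    }];
    assert_eq!(scan_registry(&descs).unwrap(), vec![("bbr_state", 3)]);
}
```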

13.18.13.6.3 Generated Rust Output

The same IDL produces the equivalent Rust code:

// bbr_state.rs — Auto-generated by umka-kabi-gen from bbr.kabi — DO NOT EDIT

#[repr(C)]
pub struct BbrState {
    pub min_rtt_us: u64,
    pub btl_bw: u64,
    pub phase: u8,
    pub _pad0: [u8; 3],
    pub cwnd: u32,
    pub probe_rtt_rounds: u32,
    pub ecn_alpha: u32,
}
const_assert!(core::mem::size_of::<BbrState>() == 32);

impl PersistentState for BbrState {
    const CURRENT_VERSION: u64 = 3;
    const MIN_SUPPORTED_VERSION: u64 = 1;

    fn migrate_in_place(&mut self, stored_version: u64, _stored_size: usize) {
        if stored_version < 2 { self.probe_rtt_rounds = 10; }
        if stored_version < 3 { self.ecn_alpha = 0; }
    }
}

Both outputs are layout-identical (#[repr(C)] matches default C struct layout with natural alignment and padding for the same field sequence). A C module can be replaced by a Rust module (or vice versa) and the persistent state is binary-compatible.

13.18.13.6.4 C Driver Evolvable Entry

The C equivalent of #[evolvable::entry]:

#include <umka/driver_sdk.h>
#include "bbr_state.h"

/* Declare this module as evolvable — generates .umka_evolvable_marker
 * section and zero-size static assertion. */
UMKA_EVOLVABLE_ENTRY(bbr_ops);

UMKA_EVOLVABLE_ENTRY(name) expands to:

#define UMKA_EVOLVABLE_ENTRY(name)                                          \
    __attribute__((section(".umka_evolvable_marker")))                       \
    static const char _umka_evolvable_##name = 1;                           \
    /* No struct definition — C evolvable entries are not instantiated.   */ \
    /* The vtable (const struct xxx_ops) is the module's sole export.     */

Unlike Rust's ZST enforcement, C cannot prevent a developer from declaring static int counter = 0; in the module source. Layer 1 (the ELF .data/.bss section check) catches this at module load time — the module is rejected with a diagnostic message pointing to the offending section.
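The Layer 1 check reduces to inspecting section names and sizes in the .uko image. A minimal sketch, with sections modeled as (name, size) pairs rather than a real ELF parse:

```rust
/// Illustrative Layer-1 check: an evolvable module must carry no mutable
/// static state, i.e. its .data/.bss sections (and their subsections) are empty.
fn validate_no_mutable_statics(sections: &[(&str, u64)]) -> Result<(), String> {
    for &(name, size) in sections {
        let forbidden = name == ".data" || name == ".bss"
            || name.starts_with(".data.") || name.starts_with(".bss.");
        if forbidden && size > 0 {
            return Err(format!("evolvable module has {} bytes in `{}`", size, name));
        }
    }
    Ok(())
}

fn main() {
    let clean = [(".text", 4096u64), (".rodata", 512), (".umka_evolvable_marker", 1)];
    assert!(validate_no_mutable_statics(&clean).is_ok());

    // `static int counter = 0;` lands in .bss — rejected at load time.
    let dirty = [(".text", 4096u64), (".bss", 4)];
    assert!(validate_no_mutable_statics(&dirty).is_err());
}
```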

13.18.13.6.5 C State Access API

The C SDK provides umka_typed() for type-safe access to persistent state with automatic version migration:

/// Access persistent state with automatic lazy migration.
/// On first access after module replacement, checks stored_version
/// against CURRENT_VERSION and calls the registered migrate function
/// if needed. Subsequent accesses are a single pointer dereference.
///
/// # Serialization requirement
///
/// Caller must hold the owning socket lock or equivalent serialization
/// when calling umka_typed(). The macro performs a plain (non-atomic)
/// read of stored_version and a plain store on version update. On
/// weakly-ordered architectures (ARM, RISC-V, PPC), these plain stores
/// are safe only under external serialization — without the socket lock,
/// a concurrent reader on another CPU may observe a torn version write
/// and skip or double-execute the migration function.
///
/// The Rust PersistentSlot uses AtomicU64 for stored_version because
/// Rust callers may not hold the socket lock (e.g., read-side RCU access
/// to congestion control state). The C SDK deliberately uses plain stores
/// for performance — the socket lock is always held on the C congestion
/// control fast path.
///
/// Usage:
///   struct bbr_state *s = umka_typed(priv, bbr_state);
///   s->cwnd = max(s->cwnd / 2, 4 * mss);
#define umka_typed(priv, type_name)                                         \
    ({                                                                      \
        struct type_name *_s = (struct type_name *)(priv)->data;            \
        if (__builtin_expect((priv)->stored_version                         \
                             < type_name##_CURRENT_VERSION, 0)) {           \
            type_name##_migrate(_s, (priv)->stored_version);                \
            (priv)->stored_version = type_name##_CURRENT_VERSION;           \
            (priv)->stored_size = sizeof(struct type_name);                 \
        }                                                                   \
        _s;                                                                 \
    })

The __builtin_expect(..., 0) hint tells the compiler the migration path is cold (only taken once after module replacement). The fast path (version matches) is a single branch + pointer cast — no function call overhead.

13.18.13.6.6 Breaking Migration in C

For breaking state changes (field type conversion, semantic reinterpretation), the developer writes a migration function and declares it in the IDL:

// bbr_v4.kabi

state BbrStateV4 @version(4) @min_version(2)
    @migrate_from(BbrStateV3, bbr_migrate_v3_to_v4)
{
    min_rtt_ns:        u64;   // Changed: was min_rtt_us (μs → ns)
    btl_bw_bytes:      u64;   // Changed: was btl_bw (packets → bytes)
    phase:             u8;
    @pad(3);
    cwnd_bytes:        u32;   // Changed: was cwnd (packets → bytes)
    probe_rtt_rounds:  u32;
    ecn_alpha:         u32;
}

The KABI compiler generates the dispatch table; the developer implements the function in C:

/* Developer-written breaking migration (referenced in .kabi IDL).
 * Called during Phase A (preparation) on a copy of the state for
 * dry-run validation, then during lazy per-connection migration. */
int bbr_migrate_v3_to_v4(const struct bbr_state_v3 *old,
                          struct bbr_state_v4 *new,
                          const struct umka_migration_ctx *ctx)
{
    uint32_t mss = ctx->tcp_mss;
    if (mss == 0)
        return -EINVAL;  /* Missing context — abort migration */

    new->min_rtt_ns    = old->min_rtt_us * 1000;
    new->btl_bw_bytes  = old->btl_bw * (uint64_t)mss;
    new->phase         = old->phase;
    new->cwnd_bytes    = old->cwnd * mss;
    new->probe_rtt_rounds = old->probe_rtt_rounds;
    new->ecn_alpha     = old->ecn_alpha;
    return 0;  /* 0 = success, negative errno = failure */
}

The KABI compiler generates a dispatch function that chains migrations if the version distance spans multiple breaking changes:

/* Auto-generated migration dispatcher */
static int bbr_state_v4_migrate_dispatch(void *old_data, uint32_t old_ver,
                                          struct bbr_state_v4 *new,
                                          const struct umka_migration_ctx *ctx)
{
    switch (old_ver) {
    case 3:
        return bbr_migrate_v3_to_v4(
            (const struct bbr_state_v3 *)old_data, new, ctx);
    case 2: {
        /* Chain: v2 → v3 (append-only) → v4 (breaking) */
        struct bbr_state_v3 tmp;
        memcpy(&tmp, old_data, sizeof(struct bbr_state_v2));
        bbr_state_v3_migrate(&tmp, 2);  /* append-only v2→v3 */
        return bbr_migrate_v3_to_v4(&tmp, new, ctx);
    }
    default:
        return -ENOTSUP;  /* Below MIN_VERSION — slot re-initialized */
    }
}

13.18.13.6.7 C SDK Header Summary

The <umka/driver_sdk.h> header provides:

Macro / Function Purpose
UMKA_EVOLVABLE_ENTRY(name) Declare module as evolvable (generates marker section)
UMKA_REGISTER_STATE(type, ver, min_ver, migrate_fn) Register state type with Nucleus version registry
umka_typed(priv, type_name) Access persistent state with lazy migration
umka_init_typed(priv, type_name) Initialize a fresh state slot (new connection)
umka_state_version(priv) Read the stored version without migration
struct umka_migration_ctx Context for breaking migrations (MSS, config, etc.)

13.18.13.6.8 Language Interop Guarantee

Because both C and Rust outputs are generated from the same .kabi IDL:

  • Layout identical: #[repr(C)] Rust matches default C struct layout (with natural alignment and padding) for the same field sequence. The KABI compiler verifies byte-for-byte layout match as part of its build step.
  • Migration compatible: A v2 state written by a C module can be migrated by a Rust module's migrate_in_place() (or vice versa) because both use the same version numbers and field offsets.
  • Cross-language replacement: A C congestion control module can be replaced at runtime by a Rust implementation (or vice versa). The persistent state format is defined by the IDL, not by either language's runtime.

This means the choice of implementation language is purely a developer preference — it does not affect the evolution story, the state migration path, or the operational lifecycle.
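The byte-for-byte layout verification can be approximated in a few lines of Rust using core::mem::offset_of (the struct mirrors the generated BbrState; the expected values come from the offset comments in the generated bbr_state.h above):

```rust
use core::mem::{offset_of, size_of};

// Mirror of the IDL-generated BbrState (illustrative copy for the check).
#[repr(C)]
struct BbrState {
    min_rtt_us: u64,
    btl_bw: u64,
    phase: u8,
    _pad0: [u8; 3],
    cwnd: u32,
    probe_rtt_rounds: u32,
    ecn_alpha: u32,
}

fn main() {
    // Each offset must match the `/* offset N */` comments in bbr_state.h.
    assert_eq!(offset_of!(BbrState, min_rtt_us), 0);
    assert_eq!(offset_of!(BbrState, btl_bw), 8);
    assert_eq!(offset_of!(BbrState, phase), 16);
    assert_eq!(offset_of!(BbrState, cwnd), 20);
    assert_eq!(offset_of!(BbrState, probe_rtt_rounds), 24);
    assert_eq!(offset_of!(BbrState, ecn_alpha), 28);
    assert_eq!(size_of::<BbrState>(), 32);
}
```

The KABI compiler performs the same comparison as part of its build step; this sketch only demonstrates that #[repr(C)] with explicit padding reproduces the C offsets exactly.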


13.19 Hardware Watchdog Framework

The hardware watchdog framework exposes /dev/watchdog and /dev/watchdogN character devices to userspace. Its purpose is production system health monitoring: if the privileged userspace daemon (typically systemd) stops petting the watchdog before its timeout expires, the hardware unconditionally resets the machine. This guarantees recovery from hung kernels, deadlocked daemons, and runaway processes — scenarios where orderly shutdown is impossible.

This section describes the system-level hardware watchdog (WDOG). It is distinct from: - The clocksource watchdog (Section 7.8): detects unstable TSC and switches clocksources. Internal kernel mechanism, no userspace interface. - The driver crash watchdog (Section 11.4): per-driver health timer used by the crash recovery subsystem. Also internal.

The WDOG is the final line of defense: it runs in hardware and is guaranteed to fire even if the kernel itself is completely hung.

13.19.1 WatchdogOps KABI Vtable

Tier 1 watchdog drivers implement the WatchdogOps vtable. The vtable follows the standard KABI conventions (Section 12.1): vtable_size for forward compatibility, KabiResult for error propagation, unsafe extern "C" for ABI stability.

// umka-core/src/watchdog/ops.rs

/// KABI vtable for hardware watchdog drivers (Tier 1).
/// All function pointers are called from the watchdog core with IRQs enabled
/// and no spinlocks held, unless noted otherwise.
#[repr(C)]
pub struct WatchdogOps {
    /// Must be set to `size_of::<WatchdogOps>()` by the driver.
    /// Bounds-safety check: the watchdog core reads only
    /// `min(vtable_size, KERNEL_WATCHDOG_OPS_SIZE)` bytes.
    pub vtable_size: u64,
    /// Primary version discriminant: `KabiVersion::as_u64()`. See [Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle) Rule 6.
    pub kabi_version: u64,

    /// Start the watchdog countdown. After this call returns `KabiResult::Ok`,
    /// the kernel MUST call `keepalive` before `timeout_s` seconds elapse.
    /// Called once when `/dev/watchdog` is first opened.
    /// Returns: KabiResult::Ok, KabiResult::Err(EBUSY) if already running.
    pub start: unsafe extern "C" fn(wdd: *mut WatchdogDev) -> KabiResult,

    /// Stop the watchdog. Not all hardware supports stopping once started.
    /// If `None`, the watchdog cannot be stopped after `start()`.
    /// Production systems should set `nowayout` to prevent calling `stop`.
    pub stop: Option<unsafe extern "C" fn(wdd: *mut WatchdogDev) -> KabiResult>,

    /// Pet the watchdog: reset the hardware countdown to `timeout_s`.
    /// This is the hot path — called on every write to `/dev/watchdog`
    /// and on every `WDIOC_KEEPALIVE` ioctl. Must be fast and non-sleeping.
    pub keepalive: unsafe extern "C" fn(wdd: *mut WatchdogDev) -> KabiResult,

    /// Set the watchdog timeout. `timeout_s` is the requested timeout in
    /// seconds. The hardware may round to the nearest supported granularity;
    /// the driver writes the actual timeout back into `wdd.timeout_s`. If
    /// `None`, the timeout is fixed at the hardware default. Called before
    /// `start()` if the user sets a timeout via `WDIOC_SETTIMEOUT`.
    /// Returns 0 on success, negative errno on failure.
    /// Matches Linux `watchdog_ops::set_timeout`, which returns `int`.
    pub set_timeout: Option<unsafe extern "C" fn(
        wdd: *mut WatchdogDev,
        timeout_s: u32,
    ) -> i32>,

    /// Return remaining time before hardware reset, in seconds.
    /// Optional: returns 0 if not implemented (field is `None`).
    pub get_timeleft: Option<unsafe extern "C" fn(wdd: *mut WatchdogDev) -> u32>,

    /// Return the current hardware status word (bitfield of `WatchdogStatus`).
    /// Optional: returns 0 if not implemented.
    pub status: Option<unsafe extern "C" fn(wdd: *mut WatchdogDev) -> WatchdogStatus>,
}
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<WatchdogOps>() == 64);
// 32-bit: vtable_size(8) + kabi_version(8) + start(4) + stop(4) + keepalive(4) +
//   set_timeout(4) + get_timeleft(4) + status(4) = 40 bytes.
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<WatchdogOps>() == 40);

13.19.2 WatchdogDev — The Watchdog Device Descriptor

Each registered watchdog has one WatchdogDev. The watchdog core allocates and owns this struct; drivers receive a raw pointer to it in every vtable call.

// umka-core/src/watchdog/dev.rs

use core::ffi::c_void;
use crate::notify::NotifierBlock;

/// A hardware (or software) watchdog device.
/// Owned by the watchdog core after `watchdog_register_device()` succeeds.
///
/// **KABI constraint**: This struct is passed by raw pointer (`*mut WatchdogDev`)
/// to every `WatchdogOps` vtable function. `#[repr(C)]` ensures stable field
/// layout across separately-compiled modules (kernel core vs Tier 1 drivers).
///
/// **Architectural note**: The watchdog vtable passes `*mut WatchdogDev` (full
/// kernel struct) rather than `*mut c_void` (opaque context). This matches
/// Linux's watchdog_ops pattern for driver porting ease, but exposes kernel
/// internals to Tier 1 drivers. The SoundWire vtable ([Section 13.26](#soundwire-bus-framework))
/// uses the safer `*mut c_void` pattern. Future KABI revisions may migrate
/// watchdog to the opaque-context pattern.
///
/// **Driver access contract**: Tier 1 watchdog drivers MUST access only the
/// `priv_data` field to reach their own driver-specific state. All other fields
/// are owned by the watchdog core. The driver receives `wdd: *mut WatchdogDev`
/// to allow reading `priv_data` and calling core helpers (e.g., `timeout_s`
/// update via vtable return convention). Direct field access to core-owned fields
/// (open_mutex, flags, reboot_nb) from a Tier 1 driver is undefined behavior.
///
/// **Size note**: `NotifierBlock` and `Mutex<()>` have implementation-dependent
/// sizes. The const_assert below documents the expected layout for the reference
/// implementation; alternative Mutex implementations must match this size or
/// the struct must be restructured (e.g., move Mutex out-of-line behind a pointer).
#[repr(C)]
pub struct WatchdogDev {
    /// Vtable of driver-implemented operations.
    pub ops: &'static WatchdogOps,
    /// Static device identity (filled in by the driver before registration).
    pub info: WatchdogInfo,
    /// Current timeout in seconds. Updated by `set_timeout`.
    /// AtomicU32: may be read from timer callback while written from ioctl.
    pub timeout_s: AtomicU32,
    /// Minimum timeout supported by the hardware. Validated on `WDIOC_SETTIMEOUT`.
    pub min_timeout_s: u32,
    /// Maximum timeout supported by the hardware.
    pub max_timeout_s: u32,
    /// Pretimeout in seconds: fire the pretimeout notifier this many seconds
    /// before the hardware reset. `0` means pretimeout is disabled.
    pub pretimeout_s: u32,
    /// Operational flags. Stored as `AtomicU32` because flags are read and
    /// mutated from multiple contexts (open/close/ioctl/reboot notifier)
    /// without a single encompassing lock. Use `flags_load()` / `flags_cas()`
    /// helpers below.
    pub flags: AtomicU32,
    /// Status at last boot: why the previous session ended.
    /// Checked at driver registration time and cached here.
    pub bootstatus: WatchdogStatus,
    /// Notifier block used to cleanly stop the watchdog on orderly reboot.
    /// Registered with the kernel reboot notifier chain at device registration.
    pub reboot_nb: NotifierBlock,
    /// Device index (0 = /dev/watchdog0). Assigned by the core.
    pub index: u32,
    /// Driver-private data pointer. Opaque to the watchdog core.
    ///
    /// # Safety
    /// Owned by the driver. The watchdog core never dereferences this pointer.
    /// The driver is responsible for ensuring the pointed-to data outlives the
    /// `WatchdogDev` registration (i.e., valid from `watchdog_register_device()`
    /// until `watchdog_unregister_device()`). Typically points to a driver-
    /// specific struct allocated alongside the device and freed in the driver's
    /// `remove()` callback.
    pub priv_data: *mut c_void,
    /// Exclusive-open mutex: only one userspace process may open at a time.
    pub open_mutex: Mutex<()>,
    /// `MagicClose` state: whether a `'V'` character has been written.
    /// Used to distinguish orderly close from userspace crash.
    /// `AtomicU8` (0 = false, 1 = true) instead of `AtomicBool` because
    /// `AtomicBool` wraps `bool` which has a validity invariant (must be 0 or 1)
    /// and no guaranteed stable C ABI across separately-compiled modules.
    /// Same pattern as `VtableHeader.quiescing` in live-kernel-evolution.md.
    /// The write path (setting 1 on 'V') and the release path (reading + clearing)
    /// may execute on different CPUs.
    pub magic_close_armed: AtomicU8, // 0 = false, 1 = true
}

/// Static identity information filled in by the driver.
#[repr(C)]
pub struct WatchdogInfo {
    /// WDIOF_* option flags — bitmask of `WatchdogStatus` capability bits.
    pub options: u32,
    /// Driver/firmware version (driver-defined, often the hardware revision).
    pub firmware_version: u32,
    /// Human-readable identity string (e.g., "Intel TCO Watchdog").
    pub identity: [u8; 32],
}
// WatchdogInfo: options(u32=4) + firmware_version(u32=4) + identity([u8;32]=32) = 40 bytes.
const_assert!(size_of::<WatchdogInfo>() == 40);

Flag types:

// umka-core/src/watchdog/flags.rs

/// Operational state flags for WatchdogDev (used as bit constants).
/// The actual flags field in `WatchdogDev` is `AtomicU32`; these constants
/// define the bit positions. Use the atomic helpers below for all access.
pub mod WatchdogFlags {
    /// The watchdog hardware is currently running (counting down).
    pub const ACTIVE: u32           = 1 << 0;
    /// The watchdog was kept alive at least once (first ping received).
    pub const ALIVE: u32            = 1 << 1;
    /// Userspace wrote 'V' — safe to stop on close.
    pub const MAGIC_CLOSE: u32      = 1 << 2;
    /// The watchdog was started when /dev/watchdog was opened.
    pub const RUNNING: u32          = 1 << 3;
    /// nowayout: watchdog cannot be stopped once started.
    /// Set by `umka.watchdog.nowayout=1` boot parameter.
    pub const NOWAYOUT: u32         = 1 << 4;
    /// Pretimeout interrupt was delivered (informational, cleared on keepalive).
    pub const PRETIMEOUT_FIRED: u32 = 1 << 5;
    /// Handshake timeout mode: driver requires 2-phase keepalive (rare).
    /// Phase 1: userspace writes the keepalive ("ping"). Phase 2: the
    /// driver acknowledges by clearing a hardware status bit within
    /// `handshake_timeout_ms` (default: half the watchdog timeout).
    /// If the driver fails to acknowledge, the watchdog fires even though
    /// the keepalive was written — detecting driver hangs, not just
    /// userspace hangs. Used by some BMC/IPMI watchdog controllers.
    pub const HANDSHAKE: u32        = 1 << 6;
    /// Device file is currently open. Set in watchdog_open, cleared in
    /// WatchdogFile::drop. Ensures exclusive-open semantics persist
    /// across the entire open duration.
    pub const OPEN: u32             = 1 << 15;
}

impl WatchdogDev {
    /// Atomically read the current flags.
    pub fn flags_load(&self) -> u32 {
        self.flags.load(Ordering::Acquire)
    }
    /// Atomically set `bits` in flags (single fetch_or RMW).
    pub fn flags_insert(&self, bits: u32) {
        let _ = self.flags.fetch_or(bits, Ordering::AcqRel);
    }
    /// Atomically clear `bits` in flags (single fetch_and RMW).
    pub fn flags_remove(&self, bits: u32) {
        let _ = self.flags.fetch_and(!bits, Ordering::AcqRel);
    }
    /// Test whether all `bits` are set.
    pub fn flags_contains(&self, bits: u32) -> bool {
        self.flags_load() & bits == bits
    }
}
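The helpers above are plain atomic read-modify-write operations; a self-contained demonstration of their semantics (bit constants copied from WatchdogFlags, using std atomics in place of the kernel's):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const ACTIVE: u32 = 1 << 0;
const MAGIC_CLOSE: u32 = 1 << 2;
const OPEN: u32 = 1 << 15;

fn main() {
    let flags = AtomicU32::new(0);

    // open() pattern: fetch_or returns the PRIOR value, so "already open"
    // is detected and the bit is set in one atomic RMW — no CAS loop needed.
    assert_eq!(flags.fetch_or(OPEN, Ordering::AcqRel) & OPEN, 0); // first open succeeds
    assert_ne!(flags.fetch_or(OPEN, Ordering::AcqRel) & OPEN, 0); // second open -> EBUSY

    // flags_insert / flags_remove equivalents:
    flags.fetch_or(ACTIVE | MAGIC_CLOSE, Ordering::AcqRel);
    flags.fetch_and(!MAGIC_CLOSE, Ordering::AcqRel);

    let v = flags.load(Ordering::Acquire);
    assert!(v & ACTIVE != 0);
    assert!(v & MAGIC_CLOSE == 0);
    assert!(v & OPEN != 0);
}
```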

bitflags! {
    /// Hardware status bits: device capabilities (in WatchdogInfo::options)
    /// and runtime status (returned by WatchdogOps::status and boot status).
    pub struct WatchdogStatus: u32 {
        /// Temperature exceeded threshold.
        const OVERHEAT        = 0x0001;
        /// Fan fault detected.
        const FANFAULT        = 0x0002;
        /// External fault input 1 asserted.
        const EXTERN1         = 0x0004;
        /// External fault input 2 asserted.
        const EXTERN2         = 0x0008;
        /// Power supply undervoltage detected.
        const POWERUNDER      = 0x0010;
        /// Last reset was caused by the watchdog (this boot).
        const CARDRESET       = 0x0020;
        /// Power supply overvoltage detected.
        const POWEROVER       = 0x0040;
        /// Timeout is settable at runtime (WDIOF_SETTIMEOUT).
        const SETTIMEOUT      = 0x0080;
        /// Magic-close ('V') is required to stop the watchdog on close (WDIOF_MAGICCLOSE).
        const MAGICCLOSE      = 0x0100;
        /// Pretimeout notification before reset is supported (WDIOF_PRETIMEOUT).
        const PRETIMEOUT      = 0x0200;
        /// Alarm notification (driver-specific) supported (WDIOF_ALARMONLY).
        const ALARMONLY       = 0x0400;
        /// A keepalive ping was successfully received (WDIOF_KEEPALIVEPING).
        const KEEPALIVEPING   = 0x8000;
    }
}

13.19.3 Character Device Interface — /dev/watchdog

The watchdog core registers a cdev at /dev/watchdogN for each registered watchdog (N = device index). /dev/watchdog is a symlink to /dev/watchdog0. The cdev implements the standard Linux watchdog interface; systemd, daemon supervisors, and any client of the Linux watchdog ioctl ABI work without modification.

open()

// umka-core/src/watchdog/cdev.rs

fn watchdog_open(dev: &Arc<WatchdogDev>) -> Result<WatchdogFile, KernelError> {
    // Exclusive open: serialize with close() via open_mutex to prevent the
    // TOCTOU race where a new open races with a concurrent close. The mutex
    // ensures that if close is clearing OPEN + calling stop(), we wait for
    // that to complete before checking OPEN.
    let _guard = dev.open_mutex.lock();

    // Check the OPEN flag under the mutex. If OPEN is already set, another
    // opener holds the device (they are not in close — close holds the mutex).
    if dev.flags.fetch_or(WatchdogFlags::OPEN, Ordering::AcqRel)
        & WatchdogFlags::OPEN != 0
    {
        return Err(KernelError::EBUSY);
    }

    // Start the watchdog if it is not already running.
    if !dev.flags_contains(WatchdogFlags::RUNNING) {
        // SAFETY: ops pointer is valid for the lifetime of WatchdogDev.
        // start() is called with no spinlocks held, IRQs enabled.
        let result = unsafe { (dev.ops.start)(dev.as_ptr()) };
        result.into_result()?;
        dev.flags_insert(WatchdogFlags::ACTIVE | WatchdogFlags::RUNNING);
    }

    // Clear magic-close state from any previous open.
    dev.flags_remove(WatchdogFlags::MAGIC_CLOSE);
    dev.magic_close_armed.store(0, Ordering::Release);

    Ok(WatchdogFile { dev: dev.clone() })
}

write(buf, len)

A write of any bytes pets the watchdog. If the buffer contains the ASCII character 'V' (0x56), the MagicClose flag is set, signalling that the next close() is an orderly shutdown (safe to stop the watchdog). This matches Linux's magic-close convention: the watchdog client writes "V" immediately before closing to signal intentional teardown, as opposed to crashing.
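The close-path consequence of magic-close can be sketched as a two-input decision: the hardware is stopped only when userspace armed magic-close and nowayout is not set. The function name is illustrative; the spec describes this behavior in prose rather than naming a helper:

```rust
/// Sketch of the decision taken when the device file is released
/// (WatchdogFile::drop): stop the hardware only on an orderly close.
/// A close without 'V' means userspace likely crashed — keep counting
/// down so the hardware reset still fires. nowayout overrides everything.
fn stop_on_close(magic_close_armed: bool, nowayout: bool) -> bool {
    magic_close_armed && !nowayout
}

fn main() {
    assert!(stop_on_close(true, false));   // wrote 'V', normal config: stop cleanly
    assert!(!stop_on_close(false, false)); // client crashed: reset will fire
    assert!(!stop_on_close(true, true));   // nowayout: never stop once started
}
```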

// umka-core/src/watchdog/cdev.rs

fn watchdog_write(
    file: &mut WatchdogFile,
    buf: &[u8],
) -> Result<usize, KernelError> {
    let dev = &file.dev;

    // Check for magic-close character.
    if buf.iter().any(|&b| b == b'V') {
        dev.flags_insert(WatchdogFlags::MAGIC_CLOSE);
    }

    // Pet the watchdog.
    // SAFETY: keepalive() is designed for hot-path use; it is non-sleeping
    // and safe to call from any context where IRQs are enabled.
    let result = unsafe { (dev.ops.keepalive)(dev.as_ptr()) };
    result.into_result()?;

    dev.flags_insert(WatchdogFlags::ALIVE | WatchdogFlags::KEEPALIVEPING);
    dev.flags_remove(WatchdogFlags::PRETIMEOUT_FIRED);

    Ok(buf.len())
}
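For reference, a minimal userspace client following this convention might look like the sketch below. The function name is hypothetical, and the path is a parameter so the example can be exercised against any writable file; a real client would loop on a timer against /dev/watchdog.

```rust
use std::fs::OpenOptions;
use std::io::Write;

/// Hypothetical minimal watchdog client: pet the device `pets` times, then
/// arm magic-close and exit. Any write not containing 'V' pets the watchdog;
/// a write containing 'V' arms magic-close.
fn run_watchdog_client(path: &str, pets: u32) -> std::io::Result<()> {
    let mut wd = OpenOptions::new().write(true).open(path)?;
    for _ in 0..pets {
        wd.write_all(b"1")?; // keepalive: payload does not contain 'V'
    }
    wd.write_all(b"V")?; // signal orderly teardown for the coming close()
    Ok(()) // dropping `wd` closes the fd; the core may now stop the watchdog
}
```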

ioctl handlers:

// umka-core/src/watchdog/cdev.rs

fn watchdog_ioctl(
    file: &mut WatchdogFile,
    cmd: u32,
    arg: usize,
) -> Result<i32, KernelError> {
    let dev = &file.dev;

    match cmd {
        WDIOC_GETSUPPORT => {
            // Copy WatchdogInfo to userspace.
            copy_to_user(arg as *mut WatchdogInfo, &dev.info)?;
            Ok(0)
        }
        WDIOC_GETSTATUS => {
            let status = if let Some(status_fn) = dev.ops.status {
                // SAFETY: status() is read-only, non-sleeping.
                unsafe { status_fn(dev.as_ptr()) }
            } else {
                WatchdogStatus::empty()
            };
            copy_to_user(arg as *mut u32, &status.bits())?;
            Ok(0)
        }
        WDIOC_GETBOOTSTATUS => {
            copy_to_user(arg as *mut u32, &dev.bootstatus.bits())?;
            Ok(0)
        }
        WDIOC_SETTIMEOUT => {
            let mut timeout_s: u32 = 0;
            copy_from_user(&mut timeout_s, arg as *const u32)?;

            if timeout_s < dev.min_timeout_s || timeout_s > dev.max_timeout_s {
                return Err(KernelError::EINVAL);
            }

            match dev.ops.set_timeout {
                Some(set_fn) => {
                    // SAFETY: set_timeout() updates hardware registers; non-sleeping.
                    // Returns 0 on success, negative errno on failure.
                    // On success, wdd.timeout_s is updated by the driver.
                    let ret = unsafe { set_fn(dev.as_ptr(), timeout_s) };
                    if ret < 0 {
                        return Err(KernelError::from_errno(ret));
                    }
                }
                None => return Err(KernelError::EOPNOTSUPP),
            };
            // Read back the actual timeout (driver may have rounded).
            let actual = dev.timeout_s.load(Ordering::Acquire);
            copy_to_user(arg as *mut u32, &actual)?;
            Ok(0)
        }
        WDIOC_GETTIMEOUT => {
            let t = dev.timeout_s.load(Ordering::Acquire);
            copy_to_user(arg as *mut u32, &t)?;
            Ok(0)
        }
        WDIOC_GETTIMELEFT => {
            let left = match dev.ops.get_timeleft {
                Some(f) => {
                    // SAFETY: get_timeleft() reads a hardware counter; non-sleeping.
                    unsafe { f(dev.as_ptr()) }
                }
                None => 0,
            };
            copy_to_user(arg as *mut u32, &left)?;
            Ok(0)
        }
        WDIOC_KEEPALIVE => {
            // Explicit keepalive ioctl — same effect as writing to the device.
            // SAFETY: keepalive() is non-sleeping.
            let result = unsafe { (dev.ops.keepalive)(dev.as_ptr()) };
            result.into_result()?;
            dev.flags_insert(WatchdogFlags::ALIVE);
            Ok(0)
        }
        _ => Err(KernelError::ENOTTY),
    }
}
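The WDIOC_* request codes above are assumed bit-identical to Linux's <linux/watchdog.h> (magic 'W', standard _IOR/_IOWR encoding), which is what makes unmodified POSIX watchdog clients work. A sketch of the encoding, checked against two well-known Linux values:

```rust
/// Linux _IOC encoding: dir (2 bits) | size (14 bits) | type (8 bits) | nr (8 bits).
const fn ioc(dir: u32, ty: u8, nr: u8, size: u16) -> u32 {
    (dir << 30) | ((size as u32) << 16) | ((ty as u32) << 8) | nr as u32
}
const IOC_WRITE: u32 = 1;
const IOC_READ:  u32 = 2;

// struct watchdog_info is 4 + 4 + 32 = 40 bytes.
const WDIOC_GETSUPPORT: u32 = ioc(IOC_READ, b'W', 0, 40);            // _IOR
const WDIOC_SETTIMEOUT: u32 = ioc(IOC_READ | IOC_WRITE, b'W', 6, 4); // _IOWR
```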

close()

// umka-core/src/watchdog/cdev.rs

fn watchdog_release(file: WatchdogFile) {
    let dev = &file.dev;

    // Hold open_mutex across the entire release path, including stop()
    // and keepalive(). This serializes close-vs-open: a concurrent
    // watchdog_open() will block on this mutex until the release path
    // completes, preventing the TOCTOU race where a new opener calls
    // start() while the closer is about to call stop().
    //
    // Linux avoids this race by holding the open_lock across close()
    // in drivers/watchdog/watchdog_dev.c::watchdog_release().
    let _guard = dev.open_mutex.lock();

    if dev.flags_contains(WatchdogFlags::NOWAYOUT) {
        // nowayout: never stop. Pet once to extend lifetime (avoid
        // accidental expiry during close processing).
        // SAFETY: keepalive() is non-sleeping.
        let _ = unsafe { (dev.ops.keepalive)(dev.as_ptr()) };
        log::warn!(
            "watchdog{}: nowayout set — watchdog stays active",
            dev.index
        );
        // Clear OPEN flag AFTER the mutex-protected work completes.
        dev.flags.fetch_and(!WatchdogFlags::OPEN, Ordering::Release);
        return;
    }

    if dev.flags_contains(WatchdogFlags::MAGIC_CLOSE) {
        // Orderly close: stop the watchdog if the driver supports it.
        if let Some(stop_fn) = dev.ops.stop {
            // SAFETY: stop() is non-sleeping; called with no spinlocks held.
            let result = unsafe { stop_fn(dev.as_ptr()) };
            if result.into_result().is_ok() {
                dev.flags_remove(WatchdogFlags::ACTIVE | WatchdogFlags::RUNNING);
                log::info!("watchdog{}: stopped (magic close)", dev.index);
                // Clear OPEN flag AFTER stop() succeeds.
                dev.flags.fetch_and(!WatchdogFlags::OPEN, Ordering::Release);
                return;
            }
        }
    }

    // No magic close, or stop() failed or unavailable.
    // Log a warning. The watchdog continues counting and will reset
    // the machine unless another process opens and pets it in time.
    log::warn!(
        "watchdog{}: closed without magic character ('V') — watchdog NOT stopped",
        dev.index
    );

    // Clear OPEN flag AFTER all release processing completes.
    dev.flags.fetch_and(!WatchdogFlags::OPEN, Ordering::Release);
    // open_mutex guard is dropped here, unblocking any waiting open().
}

13.19.4 Nowayout Boot Option

umka.watchdog.nowayout=1 is a kernel boot parameter that permanently enables WatchdogFlags::NOWAYOUT for all watchdog devices at registration time. Once set, no stop() call is ever issued, regardless of magic-close. On hardware that supports it (e.g., Intel TCO watchdog with NO_REBOOT bit cleared), the nowayout state is also committed to hardware so that even a compromised kernel cannot disable it.

Nowayout is the correct default for production: a malicious or buggy userspace process that manages to open /dev/watchdog and write 'V' should not be able to disable the last-resort reset mechanism. nowayout=0 is provided for development environments where a watchdog-triggered reboot during testing is disruptive.
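How the boot parameter reaches registration can be sketched as follows. The hook name and the use of std atomics are assumptions for illustration; the kernel's actual command-line plumbing lives elsewhere. The global is the same WATCHDOG_NOWAYOUT flag consulted by watchdog_register_device() in Section 13.19.8.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Global consulted by watchdog_register_device() when applying NOWAYOUT.
static WATCHDOG_NOWAYOUT: AtomicBool = AtomicBool::new(false);

/// Hypothetical early-boot hook: scan the kernel command line for
/// `umka.watchdog.nowayout=1` and latch the global flag. Latch-only:
/// the flag is never cleared once set.
fn apply_watchdog_bootparams(cmdline: &str) {
    if cmdline
        .split_whitespace()
        .any(|tok| tok == "umka.watchdog.nowayout=1")
    {
        WATCHDOG_NOWAYOUT.store(true, Ordering::Relaxed);
    }
}
```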

13.19.5 Pretimeout Notifier

If pretimeout_s > 0, the watchdog core fires the WATCHDOG_PRETIMEOUT_GOVERNOR notifier chain at pretimeout_s seconds before the expiry deadline, driven by a hardware pretimeout interrupt where the device supports one, or by a kernel timer in the core otherwise. The chain dispatches to the active pretimeout governor, giving the system a final window to collect a crash dump or trigger a controlled panic before the hard reset occurs.

// umka-core/src/watchdog/pretimeout.rs

/// Pretimeout event delivered to the governor.
pub struct WatchdogPretimeoutEvent {
    pub dev: Arc<WatchdogDev>,
    /// Seconds remaining until hard reset at the moment of this event.
    pub timeleft_s: u32,
}

/// A pretimeout governor: decides what to do when pretimeout fires.
pub trait PretimeoutGovernor: Send + Sync {
    fn name(&self) -> &'static str;
    fn pretimeout(&self, event: &WatchdogPretimeoutEvent);
}

/// Built-in governor: log and do nothing. Default.
pub struct NoopGovernor;
impl PretimeoutGovernor for NoopGovernor {
    fn name(&self) -> &'static str { "noop" }
    fn pretimeout(&self, event: &WatchdogPretimeoutEvent) {
        log::warn!(
            "watchdog{}: pretimeout — hard reset in {}s",
            event.dev.index, event.timeleft_s
        );
    }
}

/// Built-in governor: trigger kernel panic to generate a crash dump.
/// The panic handler writes a minidump via the crash dump subsystem
/// before the hard reset occurs.
pub struct PanicGovernor;
impl PretimeoutGovernor for PanicGovernor {
    fn name(&self) -> &'static str { "panic" }
    fn pretimeout(&self, event: &WatchdogPretimeoutEvent) {
        panic!(
            "watchdog{}: pretimeout governor triggered panic for crash dump \
             ({} seconds before hard reset)",
            event.dev.index, event.timeleft_s
        );
    }
}

The active governor is selectable per-device via sysfs:

/sys/bus/watchdog/devices/watchdog0/pretimeout_governor

Writing "noop" or "panic" to this file changes the active governor atomically. The list of available governors is read from /sys/bus/watchdog/devices/watchdog0/pretimeout_available_governors.
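The sysfs write path reduces to a name lookup plus an atomic swap of the active governor. A minimal sketch, using std containers in place of kernel primitives (the registry shape and `select` handler name are assumptions):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

pub trait PretimeoutGovernor: Send + Sync {
    fn name(&self) -> &'static str;
}

struct NoopGovernor;
impl PretimeoutGovernor for NoopGovernor {
    fn name(&self) -> &'static str { "noop" }
}

struct PanicGovernor;
impl PretimeoutGovernor for PanicGovernor {
    fn name(&self) -> &'static str { "panic" }
}

/// Per-device governor slot backing the pretimeout_governor sysfs file.
struct GovernorSlot {
    available: HashMap<&'static str, Arc<dyn PretimeoutGovernor>>,
    active:    RwLock<Arc<dyn PretimeoutGovernor>>,
}

impl GovernorSlot {
    /// sysfs write handler: unknown names are rejected (EINVAL); known names
    /// swap the active governor atomically with respect to pretimeout delivery.
    fn select(&self, name: &str) -> Result<(), &'static str> {
        let gov = self.available.get(name).ok_or("EINVAL")?.clone();
        *self.active.write().unwrap() = gov;
        Ok(())
    }

    /// sysfs read handler for pretimeout_governor.
    fn active_name(&self) -> &'static str {
        self.active.read().unwrap().name()
    }
}
```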

13.19.6 Software Watchdog (softdog)

When the system has no hardware watchdog (embedded platforms, VMs without virtio-wdt), softdog provides a kernel-timer-based fallback. Its WatchdogOps implementation:

// umka-core/src/watchdog/softdog.rs

static SOFTDOG_TIMER: OnceLock<KernelTimer> = OnceLock::new();
static SOFTDOG_DEV: OnceLock<WatchdogDev> = OnceLock::new();

const SOFTDOG_DEFAULT_TIMEOUT_S: u32 = 60;
const SOFTDOG_MIN_TIMEOUT_S: u32     = 1;
const SOFTDOG_MAX_TIMEOUT_S: u32     = 65535;

unsafe extern "C" fn softdog_start(wdd: *mut WatchdogDev) -> KabiResult {
    let wdd = unsafe { &*wdd };
    let timeout_jiffies = secs_to_jiffies(wdd.timeout_s.load(Ordering::Acquire));
    // SAFETY: timer is initialized before registration; mod_timer is safe
    // with a valid timer pointer and non-zero expiry.
    unsafe {
        mod_timer(
            SOFTDOG_TIMER.get().unwrap(),
            jiffies_add(jiffies(), timeout_jiffies),
        );
    }
    KabiResult::Ok
}

unsafe extern "C" fn softdog_stop(wdd: *mut WatchdogDev) -> KabiResult {
    // SAFETY: del_timer_sync blocks until any running timer callback completes.
    unsafe { del_timer_sync(SOFTDOG_TIMER.get().unwrap()) };
    KabiResult::Ok
}

unsafe extern "C" fn softdog_keepalive(wdd: *mut WatchdogDev) -> KabiResult {
    let wdd = unsafe { &*wdd };
    let timeout_jiffies = secs_to_jiffies(wdd.timeout_s.load(Ordering::Acquire));
    // SAFETY: same as softdog_start; mod_timer is idempotent if already pending.
    unsafe {
        mod_timer(
            SOFTDOG_TIMER.get().unwrap(),
            jiffies_add(jiffies(), timeout_jiffies),
        );
    }
    KabiResult::Ok
}

unsafe extern "C" fn softdog_set_timeout(wdd: *mut WatchdogDev, timeout_s: u32) -> i32 {
    // Software timer has 1-second granularity — accept any requested value.
    // SAFETY: wdd is valid (passed by the watchdog core which owns it).
    let wdd = unsafe { &*wdd };
    wdd.timeout_s.store(timeout_s, Ordering::Release);
    0 // success
}

/// Called when the kernel timer fires (keepalive not received in time).
fn softdog_fire(_timer: &KernelTimer) {
    let wdd = SOFTDOG_DEV.get().expect("softdog timer fired before device init");

    if wdd.flags_contains(WatchdogFlags::NOWAYOUT)
        || wdd.flags_contains(WatchdogFlags::ACTIVE)
    {
        log::crit!("softdog: watchdog timer expired — initiating emergency reboot");
        // kernel_restart() triggers the reboot path (orderly if possible).
        // SAFETY: we are in a safe call context; no locks are held.
        unsafe { kernel_restart(core::ptr::null()) };
    }
}

static SOFTDOG_OPS: WatchdogOps = WatchdogOps {
    vtable_size: core::mem::size_of::<WatchdogOps>() as u64,
    kabi_version: KabiVersion::new(1, 0, 0).as_u64(),
    start:       softdog_start,
    stop:        Some(softdog_stop),
    keepalive:   softdog_keepalive,
    set_timeout: Some(softdog_set_timeout),
    get_timeleft: None,
    status:      None,
};

softdog is registered during kernel init via watchdog_register_device() if and only if no hardware watchdog driver has claimed device index 0. It is not compiled out — having a software fallback is always safer than no watchdog at all.

13.19.7 systemd Integration

systemd opens /dev/watchdog at startup and uses it as its primary liveness signal:

  • RuntimeWatchdogSec= in system.conf sets the hardware timeout: at startup systemd issues WDIOC_SETTIMEOUT, reads the value back (the driver may round it), and then pets the watchdog at half the effective timeout, i.e. twice per timeout period.
  • Keepalive pets are plain write()s whose payload does not contain 'V', so they never arm magic-close.
  • On clean shutdown, systemd writes 'V' to /dev/watchdog before closing, enabling the watchdog to be stopped (unless nowayout is set).
  • The per-service watchdog mechanism (WATCHDOG_USEC=N in a service's environment plus sd_notify(0, "WATCHDOG=1") keepalives) is a separate liveness channel between supervised services and PID 1 over the notification socket; it does not touch /dev/watchdog directly.

No UmkaOS-specific changes to systemd are needed — the standard Linux WDOG interface is fully compatible.

13.19.8 Device Registration

Drivers call watchdog_register_device() to install a watchdog. The function validates the vtable, assigns a device index, creates the cdev and sysfs entries, and queries the hardware for the boot status (cause of last reset).

// umka-core/src/watchdog/register.rs

/// Maximum number of concurrently registered watchdog devices.
const WATCHDOG_MAX_DEVICES: u32 = 32;

/// Register a watchdog device. `wdd` must be fully initialized before calling.
/// On success, wdd.index is set and /dev/watchdogN + sysfs entries are created.
pub fn watchdog_register_device(wdd: &mut WatchdogDev) -> Result<(), KernelError> {
    // Validate vtable.
    if wdd.ops.vtable_size < core::mem::size_of::<WatchdogOps>() as u64 {
        return Err(KernelError::EINVAL);
    }

    // Validate timeout bounds. `timeout_s` is AtomicU32; since we have `&mut`,
    // use `get_mut()` for exclusive access (no atomic overhead).
    let timeout = *wdd.timeout_s.get_mut();
    if wdd.min_timeout_s == 0
        || wdd.max_timeout_s < wdd.min_timeout_s
        || timeout < wdd.min_timeout_s
        || timeout > wdd.max_timeout_s
    {
        return Err(KernelError::EINVAL);
    }

    // Assign device index. allocate_index() enforces the
    // WATCHDOG_MAX_DEVICES cap and returns None when the table is full,
    // so no index can leak on the error path.
    let index = WATCHDOG_REGISTRY.lock().allocate_index()
        .ok_or(KernelError::ENOSPC)?;
    debug_assert!(index < WATCHDOG_MAX_DEVICES);
    wdd.index = index;

    // Query boot status (why did the system last reset?).
    wdd.bootstatus = if let Some(status_fn) = wdd.ops.status {
        // SAFETY: device is not yet started; status() is read-only.
        unsafe { status_fn(wdd as *mut WatchdogDev) }
    } else {
        WatchdogStatus::empty()
    };

    // Apply nowayout boot parameter.
    if WATCHDOG_NOWAYOUT.load(Ordering::Relaxed) {
        wdd.flags_insert(WatchdogFlags::NOWAYOUT);
    }

    // Create character device at /dev/watchdogN.
    cdev_register(wdd)?;

    // Register sysfs entries under /sys/bus/watchdog/devices/watchdogN/.
    sysfs_watchdog_register(wdd)?;

    // Register reboot notifier: stop watchdog on orderly reboot
    // (unless nowayout is set).
    register_reboot_notifier(&mut wdd.reboot_nb, watchdog_reboot_handler)?;

    // If this is watchdog0, create /dev/watchdog symlink.
    if index == 0 {
        devfs_symlink("watchdog", "watchdog0")?;
    }

    log::info!(
        "watchdog{}: registered '{}' (timeout: {}s, min: {}s, max: {}s, nowayout: {})",
        wdd.index,
        core::str::from_utf8(&wdd.info.identity).unwrap_or("?").trim_end_matches('\0'),
        timeout, // AtomicU32 does not impl Display; use the value read above
        wdd.min_timeout_s,
        wdd.max_timeout_s,
        wdd.flags_contains(WatchdogFlags::NOWAYOUT),
    );

    Ok(())
}

/// Reboot notifier callback: stop the watchdog on orderly system shutdown.
/// Not called if nowayout is set.
fn watchdog_reboot_handler(wdd: &mut WatchdogDev) {
    if wdd.flags_contains(WatchdogFlags::NOWAYOUT)
        || !wdd.flags_contains(WatchdogFlags::ACTIVE)
    {
        return;
    }
    if let Some(stop_fn) = wdd.ops.stop {
        // SAFETY: reboot notifier runs with IRQs enabled, no spinlocks held.
        let _ = unsafe { stop_fn(wdd as *mut WatchdogDev) };
        log::info!("watchdog{}: stopped for orderly reboot", wdd.index);
    }
}

sysfs entries under /sys/bus/watchdog/devices/watchdogN/:

File                             Access  Description
identity                         ro      WatchdogInfo::identity string
timeout                          rw      Current timeout in seconds
min_timeout                      ro      Hardware minimum
max_timeout                      ro      Hardware maximum
pretimeout                       rw      Pretimeout in seconds (0 = disabled)
pretimeout_governor              rw      Active governor name
pretimeout_available_governors   ro      Space-separated list of available governors
timeleft                         ro      Remaining time in seconds (calls get_timeleft, or 0)
bootstatus                       ro      WatchdogStatus bits from last boot
nowayout                         ro      1 if nowayout is in effect
status                           ro      Current WatchdogStatus bits

13.20 SPI Bus Framework

SPI (Serial Peripheral Interface) is a synchronous full-duplex serial bus connecting a master controller to one or more peripheral devices (ADCs, DACs, flash memory, display controllers, RF transceivers, SD cards in SPI mode, and sensor modules). Unlike I2C, SPI transfers are full-duplex: MOSI (Master Out Slave In) and MISO (Master In Slave Out) operate simultaneously on every clock edge. The UmkaOS SPI framework (umka-core/src/bus/spi.rs) provides a KABI trait for platform SPI controller drivers, a higher-level SpiDevice handle for peripheral drivers, and a spidev character device for userspace access.

13.20.1 SpiController KABI Trait

Platform SPI controller drivers implement SpiController. The trait is in umka-core/src/bus/spi.rs.

/// SPI bus mode (polarity + phase combination).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum SpiMode {
    /// CPOL=0, CPHA=0: clock idle low, data captured on rising edge.
    Mode0 = 0,
    /// CPOL=0, CPHA=1: clock idle low, data captured on falling edge.
    Mode1 = 1,
    /// CPOL=1, CPHA=0: clock idle high, data captured on falling edge.
    Mode2 = 2,
    /// CPOL=1, CPHA=1: clock idle high, data captured on rising edge.
    Mode3 = 3,
}

/// A single SPI transfer (one segment of a complete SPI message).
pub struct SpiTransfer<'a> {
    /// Data to transmit (MOSI). None → send zeros.
    pub tx_buf:           Option<&'a [u8]>,
    /// Buffer to receive data into (MISO). None → discard received data.
    pub rx_buf:           Option<&'a mut [u8]>,
    /// Transfer length in bytes (max of tx_buf.len() and rx_buf.len()).
    /// Kernel-internal type (not ABI-crossing); usize is appropriate.
    /// The ABI-crossing `SpiIocTransfer` uses u32 for Linux compatibility.
    pub len:              usize,
    /// Override clock speed for this transfer (Hz). 0 = use device default.
    pub speed_hz:         u32,
    /// Bits per word for this transfer (4–32). 0 = use device default (typically 8).
    pub bits_per_word:    u8,
    /// Delay in microseconds after this transfer before CS is deasserted or
    /// the next transfer starts.
    pub delay_us:         u16,
    /// If true, deassert CS between this transfer and the next.
    /// If false, hold CS asserted for the following transfer (typical for
    /// multi-segment reads).
    pub cs_change:        bool,
    /// Word delay in nanoseconds between each word within this transfer
    /// (for slow devices). u16 allows up to 65535 ns (~65 us).
    /// Converted from the userspace SpiIocTransfer.word_delay_usecs field
    /// (u8, microseconds) by multiplying by 1000, saturating at u16::MAX
    /// since requests above 65 us exceed the field's range. The nanosecond
    /// unit gives finer granularity for hardware drivers that support
    /// sub-microsecond word delays.
    pub word_delay_ns:    u16,
}

/// A complete SPI message: one or more transfers sharing a single CS assertion.
pub struct SpiMessage<'a> {
    /// Ordered list of transfers.
    pub transfers: &'a mut [SpiTransfer<'a>],
    /// Completion status (Ok or error code). Set by the controller after transfer.
    pub status:    Option<Result<(), KernelError>>,
}

/// SPI controller trait (kernel-internal, Tier 0). Implemented by platform SPI
/// controller drivers. This is a Rust trait, not a KABI vtable — no #[repr(C)],
/// no vtable_size/kabi_version. SPI controllers are platform-specific hardware
/// compiled into the kernel.
pub trait SpiController: Send + Sync {
    /// Execute a complete SPI message synchronously.
    ///
    /// The controller asserts CS for the peripheral at `cs_index` for the entire
    /// duration of the message, deasserts between transfers only if
    /// `SpiTransfer::cs_change` is set, and deasserts after the last transfer.
    fn transfer_one_message(
        &self,
        cs_index: u8,
        mode:     SpiMode,
        speed_hz: u32,
        msg:      &mut SpiMessage<'_>,
    ) -> Result<(), KernelError>;

    /// Maximum supported clock speed in Hz (hardware limit).
    fn max_speed_hz(&self) -> u32;

    /// Number of native chip selects (additional CS via GPIO is handled above
    /// this layer).
    fn num_chipselect(&self) -> u8;

    /// Bitmask of supported bits-per-word values. Bit N is set if
    /// `bits_per_word = N+1` is supported. Bit 7 set means 8-bit words are
    /// supported.
    fn bits_per_word_mask(&self) -> u32;
}

/// Handle to an SPI peripheral at a specific CS on a specific controller.
pub struct SpiDevice {
    /// Underlying controller.
    pub controller:    Arc<dyn SpiController>,
    /// Chip select index on the controller.
    pub cs_index:      u8,
    /// SPI mode (polarity + phase).
    pub mode:          SpiMode,
    /// Maximum clock speed for this device in Hz.
    pub max_speed_hz:  u32,
    /// Bits per word (usually 8).
    pub bits_per_word: u8,
}

impl SpiDevice {
    /// Full-duplex transfer: send `tx`, receive into `rx` simultaneously.
    /// Returns `EINVAL` if `tx` and `rx` have different lengths.
    pub fn transfer(&self, tx: &[u8], rx: &mut [u8]) -> Result<(), KernelError> {
        if tx.len() != rx.len() {
            return Err(KernelError::EINVAL);
        }
        let mut transfer = SpiTransfer {
            tx_buf:        Some(tx),
            rx_buf:        Some(rx),
            len:           tx.len(),
            speed_hz:      self.max_speed_hz,
            bits_per_word: self.bits_per_word,
            delay_us:      0,
            cs_change:     false,
            word_delay_ns: 0,
        };
        let mut msg = SpiMessage {
            transfers: core::slice::from_mut(&mut transfer),
            status:    None,
        };
        self.controller.transfer_one_message(
            self.cs_index, self.mode, self.max_speed_hz, &mut msg,
        )
    }

    /// Write only (discard MISO).
    pub fn write(&self, data: &[u8]) -> Result<(), KernelError> {
        let mut transfer = SpiTransfer {
            tx_buf:        Some(data),
            rx_buf:        None,
            len:           data.len(),
            speed_hz:      self.max_speed_hz,
            bits_per_word: self.bits_per_word,
            delay_us:      0,
            cs_change:     false,
            word_delay_ns: 0,
        };
        let mut msg = SpiMessage {
            transfers: core::slice::from_mut(&mut transfer),
            status:    None,
        };
        self.controller.transfer_one_message(
            self.cs_index, self.mode, self.max_speed_hz, &mut msg,
        )
    }

    /// Write a register address, then read response (2-segment, CS held asserted
    /// between segments).
    pub fn write_then_read(&self, cmd: &[u8], rx: &mut [u8]) -> Result<(), KernelError> {
        let mut t0 = SpiTransfer {
            tx_buf:        Some(cmd),
            rx_buf:        None,
            len:           cmd.len(),
            speed_hz:      self.max_speed_hz,
            bits_per_word: self.bits_per_word,
            delay_us:      0,
            cs_change:     false,
            word_delay_ns: 0,
        };
        let mut t1 = SpiTransfer {
            tx_buf:        None,
            rx_buf:        Some(rx),
            len:           rx.len(),
            speed_hz:      self.max_speed_hz,
            bits_per_word: self.bits_per_word,
            delay_us:      0,
            cs_change:     false,
            word_delay_ns: 0,
        };
        let mut msg = SpiMessage { transfers: &mut [t0, t1], status: None };
        self.controller.transfer_one_message(
            self.cs_index, self.mode, self.max_speed_hz, &mut msg,
        )
    }
}

Tier classification: SPI controller drivers are Tier 1. SPI peripheral drivers (sensors, transceivers, display controllers) follow their own tier classification based on function. spidev is Tier 2.

Device enumeration: SPI devices are enumerated from ACPI (SPISerialBus resource) or device-tree (spi-bus compatible node with reg property for CS index). The bus manager matches each ACPI/DT node to a registered SPI device driver by compatible string or ACPI HID.

CS GPIO: Many boards use GPIO pins as additional chip selects beyond what the hardware SPI controller provides natively. GPIO CS abstraction is handled in the bus manager layer: SpiController::transfer_one_message receives the already-resolved hardware CS index; the bus manager handles GPIO assertion and deassertion for GPIO-based CS lines before and after each controller call.
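As an illustration of the bits_per_word_mask convention, a hypothetical helper (not part of the trait) that a bus manager could use to validate a requested word size against a controller's capabilities:

```rust
/// Check a requested bits-per-word value against an SpiController's
/// bits_per_word_mask(): bit N set means N+1 bits per word is supported.
/// (Hypothetical helper; the mask convention matches the trait doc.)
fn supports_bits_per_word(mask: u32, bits_per_word: u8) -> bool {
    (1..=32).contains(&bits_per_word) && (mask >> (bits_per_word - 1)) & 1 != 0
}
```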

13.20.2 spidev — Userspace SPI Access

spidev exposes SPI devices to userspace via /dev/spidev<bus>.<cs> (e.g., /dev/spidev0.0). This allows userspace drivers and test tools to communicate with SPI peripherals without a kernel driver, using the same ioctl interface as Linux.

/// SPI transfer descriptor for the SPI_IOC_MESSAGE ioctl.
/// Layout matches Linux `struct spi_ioc_transfer` for ABI compatibility.
#[repr(C)]
pub struct SpiIocTransfer {
    /// Userspace pointer to TX data buffer (0 for RX-only transfers).
    pub tx_buf:           u64,
    /// Userspace pointer to RX data buffer (0 for TX-only transfers).
    pub rx_buf:           u64,
    /// Transfer length in bytes.
    pub len:              u32,
    /// Transfer clock speed override in Hz (0 = device default).
    pub speed_hz:         u32,
    /// Inter-transfer delay in microseconds.
    pub delay_usecs:      u16,
    /// Bits per word override (0 = device default).
    pub bits_per_word:    u8,
    /// If non-zero, deassert CS after this transfer before the next.
    pub cs_change:        u8,
    /// Dual/quad SPI TX mode (0 = standard).
    pub tx_nbits:         u8,
    /// Dual/quad SPI RX mode (0 = standard).
    pub rx_nbits:         u8,
    /// Inter-word delay in microseconds.
    pub word_delay_usecs: u8,
    /// Reserved; must be zero.
    pub _pad:             u8,
}
const_assert!(size_of::<SpiIocTransfer>() == 32);
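When spidev translates an SpiIocTransfer into a kernel SpiTransfer, the u8 microsecond word delay becomes the u16 nanosecond field; since 255 us exceeds the 65535 ns range, the conversion must saturate. A sketch of that conversion (function name assumed):

```rust
/// spidev word-delay conversion: userspace microseconds (u8) to the kernel's
/// SpiTransfer::word_delay_ns (u16), saturating at u16::MAX for requests
/// above ~65 us.
fn word_delay_us_to_ns(word_delay_usecs: u8) -> u16 {
    u16::try_from(word_delay_usecs as u32 * 1000).unwrap_or(u16::MAX)
}
```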

ioctls on /dev/spidevN.M:

ioctl                     Direction  Description
SPI_IOC_RD_MODE           Read       Get SPI mode byte (SPI_MODE_0…3 + flags)
SPI_IOC_WR_MODE           Write      Set SPI mode byte
SPI_IOC_RD_MODE32         Read       Get mode with extended flags (32-bit)
SPI_IOC_WR_MODE32         Write      Set mode with extended flags
SPI_IOC_RD_LSB_FIRST      Read       Get bit order (0 = MSB first)
SPI_IOC_WR_LSB_FIRST      Write      Set bit order
SPI_IOC_RD_BITS_PER_WORD  Read       Get bits per word
SPI_IOC_WR_BITS_PER_WORD  Write      Set bits per word
SPI_IOC_RD_MAX_SPEED_HZ   Read       Get maximum speed in Hz
SPI_IOC_WR_MAX_SPEED_HZ   Write      Set maximum speed in Hz
SPI_IOC_MESSAGE(n)        Write      Transfer n SpiIocTransfer structs in one CS assertion

Linux compatibility: identical ioctl codes and SpiIocTransfer struct layout to Linux spidev. Userspace programs using <linux/spi/spidev.h> compile and run without modification.
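Because SPI_IOC_MESSAGE(n) encodes the total payload size into the request number (n x 32 bytes, Linux _IOW with magic 'k', nr 0), the ioctl code varies with n. A sketch of the encoding, checked against the well-known Linux value for n = 1:

```rust
/// SPI_IOC_MESSAGE(n): Linux _IOW('k', 0, n * sizeof(struct spi_ioc_transfer)).
/// _IOC layout: dir (2 bits) | size (14 bits) | type (8 bits) | nr (8 bits).
const fn spi_ioc_message(n: u32) -> u32 {
    const IOC_WRITE: u32 = 1;
    let size = (n * 32) & 0x3FFF; // the size field is 14 bits wide
    (IOC_WRITE << 30) | (size << 16) | ((b'k' as u32) << 8)
}
```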


13.21 rfkill — RF Kill Switch Framework

rfkill manages radio transmitters across all wireless technologies (WiFi, Bluetooth, UWB, WWAN, NFC, GPS). A "kill" can be hardware-initiated (a physical slide switch or button) or software-initiated (NetworkManager, airplane mode toggle, userspace rfkill tool). The framework tracks per-device block state, enforces the invariant that hard-blocked radios cannot be unblocked by software, and exposes current state to userspace via /dev/rfkill and sysfs.

13.21.1 Data Structures

/// Radio technology type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
pub enum RfkillType {
    /// Meta-type: affects all radios when used in RFKILL_OP_CHANGE_ALL.
    All       = 0,
    /// IEEE 802.11 WiFi.
    Wlan      = 1,
    /// Bluetooth (BR/EDR and LE).
    Bluetooth = 2,
    /// Ultra-Wideband (deprecated in new hardware).
    Uwb       = 3,
    /// WiMAX.
    Wimax     = 4,
    /// WWAN / cellular modem (LTE, 5G NR).
    Wwan      = 5,
    /// GPS receiver.
    Gps       = 6,
    /// FM radio.
    Fm        = 7,
    /// Near-Field Communication.
    Nfc       = 8,
}

/// Aggregated block state for a single rfkill device.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RfkillState {
    /// Radio may transmit.
    Unblocked,
    /// Blocked by software (rfkill_block_soft). Overridable.
    SoftBlocked,
    /// Blocked by hardware kill switch. Cannot be overridden by software.
    HardBlocked,
}

/// An rfkill device registered by a wireless driver.
///
/// **KABI interaction**: Although `RfkillOps` is described as a "kernel-internal
/// Rust trait" (line 67), WiFi and Bluetooth drivers are Tier 1 (hardware-isolated,
/// separately compiled). The rfkill core manipulates `soft_blocked` on behalf of
/// Tier 1 drivers via `/dev/rfkill` writes and sysfs. Therefore `soft_blocked`
/// effectively crosses a KABI boundary and must use `AtomicU8` (not `AtomicBool`)
/// per CLAUDE.md rule 8.
pub struct RfkillDevice {
    /// Unique index (auto-assigned at registration, 0-based, never reused).
    pub idx:          u32,
    /// Radio technology.
    pub type_:        RfkillType,
    /// Human-readable name (e.g., "phy0", "hci0", "wwan0"). NUL-terminated.
    pub name:         [u8; 32],
    /// Current software block state. 0 = unblocked, 1 = soft-blocked.
    /// Uses `AtomicU8` (not `AtomicBool`) because this field is accessed across
    /// the KABI boundary by Tier 1 wireless drivers. `AtomicBool` wraps `bool`
    /// which has a validity invariant (must be 0 or 1); a separately-compiled
    /// driver storing any other byte value triggers undefined behavior.
    /// Same pattern as `WatchdogDev.magic_close_armed`.
    pub soft_blocked: AtomicU8,  // 0 = unblocked, 1 = soft-blocked
    /// Driver operations table. Arc<dyn> is used because rfkill drivers may be
    /// dynamically loaded kernel modules; for static-only drivers, &'static dyn
    /// would avoid refcount overhead. This is a cold-path allocation (device
    /// registration only).
    pub ops:          Arc<dyn RfkillOps>,
    /// Handle for sysfs uevent and netlink event notification.
    pub event_handle: RfkillEventHandle,
}

/// Operations implemented by the wireless driver.
/// RfkillOps is a kernel-internal Rust trait, not a KABI vtable. The rfkill
/// core is compiled together with wireless drivers and dispatches via Rust
/// vtable. `bool` parameters are safe for Rust-to-Rust dispatch.
pub trait RfkillOps: Send + Sync {
    /// Apply the soft block state to the hardware.
    ///
    /// `blocked = true` → shut down transmitter; `blocked = false` → enable
    /// transmitter. The driver must not transmit while blocked and may power-gate
    /// the radio hardware.
    fn set_block(&self, blocked: bool);

    /// Query the hardware kill switch state.
    ///
    /// Returns true if the hardware kill switch is asserted (hard-blocked).
    /// Called periodically and on GPIO interrupt to refresh hard-block state.
    /// Default: no hardware kill switch present.
    fn query_hardware(&self) -> bool {
        false
    }
}
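The aggregation rule (hard block always wins; soft block is reported only when no hard block is present) can be sketched as a pure function. The enum is duplicated here for self-containedness, and the function name is an assumption:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RfkillState {
    Unblocked,
    SoftBlocked,
    HardBlocked,
}

/// Fold the two independent block inputs into the reported state. Hard block
/// takes precedence: software can never unblock a hard-blocked radio.
fn aggregate_state(soft_blocked: bool, hard_blocked: bool) -> RfkillState {
    if hard_blocked {
        RfkillState::HardBlocked
    } else if soft_blocked {
        RfkillState::SoftBlocked
    } else {
        RfkillState::Unblocked
    }
}
```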

13.21.2 /dev/rfkill — Userspace Interface

The /dev/rfkill character device (major 10, minor 242) provides a unified interface for monitoring and controlling all registered rfkill devices. It is the interface used by NetworkManager, ConnMan, iwd, and the rfkill(8) userspace tool.

/// rfkill event structure exchanged with userspace via /dev/rfkill.
/// Layout matches Linux `struct rfkill_event` (8 bytes) for ABI compatibility.
#[repr(C, packed)]
pub struct RfkillEvent {
    /// Device index (matches RfkillDevice::idx).
    pub idx:   u32,
    /// Radio type (RfkillType as u8).
    pub type_: u8,
    /// Operation code: one of the RFKILL_OP_* constants below.
    pub op:    u8,
    /// Software block state: 1 = soft-blocked, 0 = not soft-blocked.
    pub soft:  u8,
    /// Hardware block state: 1 = hard-blocked, 0 = not hard-blocked.
    pub hard:  u8,
}

/// Extended rfkill event structure (Linux 5.11+, 9 bytes).
/// systemd 247+ writes `rfkill_event_ext`. The kernel detects the
/// read/write buffer size to determine which struct is in use:
///   - buffer >= RFKILL_EVENT_SIZE_V1_EXT (9): use extended struct
///   - buffer >= RFKILL_EVENT_SIZE_V1 (8):     use base struct
#[repr(C, packed)]
pub struct RfkillEventExt {
    /// Base fields (identical layout to RfkillEvent).
    pub idx:                u32,
    pub type_:              u8,
    pub op:                 u8,
    pub soft:               u8,
    pub hard:               u8,
    /// Bitmask of hard-block reasons (Linux 5.11+).
    /// Bit 0: signal from a platform rfkill provider (e.g., SAR sensor).
    pub hard_block_reasons: u8,
}

const_assert!(size_of::<RfkillEvent>() == 8);
const_assert!(size_of::<RfkillEventExt>() == 9);

pub const RFKILL_EVENT_SIZE_V1:     usize = 8;
pub const RFKILL_EVENT_SIZE_V1_EXT: usize = 9;
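
The buffer-size detection described above can be sketched as a pure function. This is an illustrative sketch, not the kernel's actual code; `EventLayout` and `select_layout` are hypothetical names.

```rust
/// Which event layout a userspace read/write buffer selects.
#[derive(Debug, PartialEq)]
enum EventLayout { V1, V1Ext }

fn select_layout(buf_len: usize) -> Option<EventLayout> {
    if buf_len >= 9 {        // RFKILL_EVENT_SIZE_V1_EXT: extended struct
        Some(EventLayout::V1Ext)
    } else if buf_len >= 8 { // RFKILL_EVENT_SIZE_V1: base struct
        Some(EventLayout::V1)
    } else {
        None                 // short buffer: rejected with EINVAL
    }
}
```

A systemd 247+ client passing a 9-byte buffer thus receives `hard_block_reasons`; an older rfkill(8) passing 8 bytes keeps working unchanged.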

/// New device registered; sent once per device on first open.
pub const RFKILL_OP_ADD:        u8 = 0;
/// Device unregistered (driver unloaded or hardware removed).
pub const RFKILL_OP_DEL:        u8 = 1;
/// Block state changed for a specific device.
pub const RFKILL_OP_CHANGE:     u8 = 2;
/// Block state changed for all devices of the given type.
pub const RFKILL_OP_CHANGE_ALL: u8 = 3;

Read: Returns one event per call: 8 bytes (RfkillEvent) or 9 bytes (RfkillEventExt), selected by the caller's buffer size as described above. The first read after open() returns one RFKILL_OP_ADD event per currently registered rfkill device (device enumeration), then subsequent reads block until a state change occurs. Supports O_NONBLOCK + poll()/epoll().

Write: Write an RfkillEvent with op = RFKILL_OP_CHANGE to block (soft=1) or unblock (soft=0) a specific device identified by idx. Write with op = RFKILL_OP_CHANGE_ALL to block or unblock all devices of the given type_.
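
The write-side request above can be sketched from the userspace perspective. This is a minimal sketch of building the 8-byte wire image per the `RfkillEvent` layout; `encode_soft_block` is an illustrative helper, not part of any API.

```rust
fn encode_soft_block(idx: u32, type_: u8, block: bool) -> [u8; 8] {
    let mut e = [0u8; 8];
    e[0..4].copy_from_slice(&idx.to_ne_bytes()); // idx: native-endian u32
    e[4] = type_;                                // RfkillType as u8
    e[5] = 2;                                    // op = RFKILL_OP_CHANGE
    e[6] = block as u8;                          // soft: 1 = block, 0 = unblock
    e[7] = 0;                                    // hard: ignored on write
    e
}
```

Writing the returned bytes to /dev/rfkill soft-blocks (or unblocks) the device with the given idx.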

sysfs: Each rfkill device is exposed under /sys/class/rfkill/rfkill<N>/:

File Access Description
name ro Device name string
type ro Technology name ("wlan", "bluetooth", "wwan", etc.)
state ro Aggregate state: "0" = blocked, "1" = unblocked
hard ro Hardware block: "0" or "1"
soft rw Software block: "0" or "1" (write to change)
uevent Generates uevent on any state change

13.21.3 rfkill-input: Hardware Kill Switch

When a hardware kill switch (GPIO or ACPI button device) changes state, it notifies rfkill-input, which calls rfkill_set_hw_state() on all rfkill devices of the associated type (typically RfkillType::All for a physical airplane-mode switch). The WiFi driver and Bluetooth driver each register their own rfkill devices; the single switch event propagates to all of them simultaneously through the framework.
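
The fan-out above can be modeled as a pure function over the registered device list. This is an illustrative sketch only: `Radio`, `RadioType`, and `set_hw_state` are hypothetical names, and real devices carry more state than the hard-block bit.

```rust
#[derive(Clone, Copy, PartialEq)]
enum RadioType { All, Wlan, Bluetooth }

struct Radio { type_: RadioType, hard_blocked: bool }

/// Apply a hardware kill-switch transition to every matching device.
/// Returns the number of devices whose state changed (one
/// RFKILL_OP_CHANGE event would be emitted per transition).
fn set_hw_state(devs: &mut [Radio], switch_type: RadioType, blocked: bool) -> usize {
    let mut changed = 0;
    for d in devs.iter_mut() {
        // RadioType::All (a physical airplane-mode switch) matches every radio.
        if switch_type == RadioType::All || d.type_ == switch_type {
            if d.hard_blocked != blocked {
                d.hard_blocked = blocked;
                changed += 1;
            }
        }
    }
    changed
}
```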

Linux compatibility: same /dev/rfkill ABI; same sysfs layout; same ioctl codes. rfkill(8) from util-linux, NetworkManager, ConnMan, and iwd all work without modification.


13.22 MTD — Memory Technology Device Framework

MTD (Memory Technology Device) provides a uniform kernel interface to flash memory: NOR flash (bit-erasable, supports random read and byte-granular 1→0 bit writes, erases entire sectors to all-ones), NAND flash (page-write with ECC, block erase, sequential access pattern required), and eMMC in raw partition mode. MTD is used for bootloader storage, firmware update partitions, and embedded root filesystems (UBIFS on NAND, JFFS2 on NOR). The MTD layer sits below filesystems and above hardware flash controller drivers.

13.22.1 MtdInfo and MtdDevice

/// MTD device type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
pub enum MtdType {
    /// No device / slot unused.
    Absent    = 0,
    /// RAM (volatile, no erase needed).
    Ram       = 1,
    /// Read-only NOR (firmware ROM).
    Rom       = 2,
    /// NOR flash: bit-alterable (can flip 1→0 without erase), sector-erase.
    NorFlash  = 3,
    /// NAND flash: page write, block erase, ECC required.
    NandFlash = 4,
    /// Atmel DataFlash (SPI, power-of-two pages).
    DataFlash = 6,
    /// UBI logical volume over NAND.
    UbiVolume = 7,
    /// Multi-level cell NAND.
    MlcNand   = 8,
}

bitflags! {
    /// MTD device capability flags.
    pub struct MtdFlags: u32 {
        /// Device supports write operations.
        const WRITEABLE     = 0x400;
        /// Individual bits may be set (NOR: can flip 1→0 without erase).
        const BIT_WRITEABLE = 0x800;
        /// No erase needed (RAM, ROM).
        const NO_ERASE      = 0x1000;
        /// OOB (Out-Of-Band / spare) area accessible via read_oob/write_oob.
        const OOB           = 0x2000;
        /// Hardware ECC engine present and active.
        const ECC           = 0x4000;
        /// Continuous (linearly-addressed) memory space (NOR).
        const MAPPED        = 0x8000;
    }
}

/// Static MTD device descriptor returned by MtdDevice::info().
pub struct MtdInfo {
    /// Device type.
    pub type_:          MtdType,
    /// Capability flags.
    pub flags:          MtdFlags,
    /// Total device size in bytes.
    pub size:           u64,
    /// Minimum erase unit in bytes (erase block size).
    /// NOR: typically 64 KiB or 128 KiB.
    /// NAND: typically 128 KiB or 256 KiB.
    /// u32 range: max 4 GiB — largest flash erase blocks are ~4 MiB.
    pub erasesize:      u32,
    /// Minimum write unit in bytes.
    /// NAND: page size (512 B, 2 KiB, 4 KiB). NOR: 1 (bit-alterable).
    /// u32 range: max 4 GiB — largest flash page size is 16 KiB.
    pub writesize:      u32,
    /// OOB (spare) bytes per page (NAND only; typically 64 or 128).
    /// u32 range: max 4 GiB — largest OOB is ~1 KiB per page.
    pub oobsize:        u32,
    /// OOB bytes per page available for filesystem use (after ECC overhead).
    /// Always <= oobsize.
    pub oobavail:       u32,
    /// Device model name (e.g., "mx25l25635f"). NUL-terminated.
    pub name:           [u8; 64],
    /// MTD device index N (the /dev/mtdN character device uses minor 2N;
    /// see Section 13.22.3).
    pub index:          u32,
    /// ECC strength: correctable bits per ecc_step_size bytes.
    pub ecc_strength:   u32,
    /// ECC step size in bytes (typically 512 or 1024).
    pub ecc_step_size:  u32,
}
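
The `erasesize`/`writesize` geometry above determines how a byte offset decomposes into NAND addressing units. A minimal pure-function sketch (`nand_addr` is an illustrative name):

```rust
/// Translate a byte offset into (erase block index, page index within
/// the block, byte offset within the page) for the given geometry.
fn nand_addr(offset: u64, erasesize: u32, writesize: u32) -> (u64, u32, u32) {
    let block    = offset / erasesize as u64;
    let in_block = (offset % erasesize as u64) as u32;
    (block, in_block / writesize, in_block % writesize)
}
```

For 128 KiB erase blocks and 2 KiB pages, offset 0x42800 falls in block 2, page 5, byte 0 of the page.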

/// MTD device trait (kernel-internal, Tier 0). Implemented by flash controller
/// drivers. This is a Rust trait, not a KABI vtable — no #[repr(C)],
/// no vtable_size/kabi_version.
pub trait MtdDevice: Send + Sync {
    /// Return static MTD info for this device.
    fn info(&self) -> &MtdInfo;

    /// Read `buf.len()` bytes from device offset `from` into `buf`.
    ///
    /// Returns `(bytes_read, max_bit_flips)` where `max_bit_flips` is the
    /// maximum number of bit errors corrected in any single ECC step during
    /// the read (0 if no ECC or no errors).
    fn read(&self, from: u64, buf: &mut [u8]) -> Result<(usize, u32), MtdError>;

    /// Write `data` to device offset `to`. Must be aligned to writesize.
    ///
    /// NOR: can only write 0-bits (1→0 only); cannot flip 0→1 without erase.
    /// NAND: must write entire pages; partial-page writes are not supported.
    fn write(&self, to: u64, data: &[u8]) -> Result<usize, MtdError>;

    /// Erase one erase block starting at `addr`. Must be aligned to erasesize.
    fn erase(&self, addr: u64) -> Result<(), MtdError>;

    /// Read page data and OOB area simultaneously (NAND only).
    fn read_oob(
        &self,
        from:     u64,
        data_buf: &mut [u8],
        oob_buf:  &mut [u8],
    ) -> Result<(), MtdError>;

    /// Write page data and OOB area simultaneously (NAND only).
    fn write_oob(
        &self,
        to:   u64,
        data: &[u8],
        oob:  &[u8],
    ) -> Result<(), MtdError>;

    /// Check whether a NAND block is bad (factory-marked or runtime-marked).
    fn block_isbad(&self, ofs: u64) -> Result<bool, MtdError>;

    /// Mark a NAND block as bad after an unrecoverable ECC error.
    fn block_markbad(&self, ofs: u64) -> Result<(), MtdError>;
}

/// MTD-specific error codes.
#[derive(Debug)]
pub enum MtdError {
    /// Generic I/O error.
    Io(KernelError),
    /// Uncorrectable ECC error: bit flip count exceeded ECC strength.
    EccError,
    /// Byte offset or length exceeds device bounds.
    OutOfBounds,
    /// Write or erase targeting a bad block (NAND).
    BadBlock,
    /// Device is write-protected (WP# pin asserted or software lock active).
    WriteProtected,
}

13.22.2 MTD Partitions

Raw flash devices are divided into named partitions analogous to disk partitions. Each partition is a contiguous subrange of the parent MTD device and appears as its own /dev/mtdN node.

/// An MTD partition: a named subrange of a parent MTD device.
///
/// **Kernel-internal type** — NOT a wire format or KABI type. This struct is
/// constructed at probe time from device tree or command-line parsing and
/// consumed only by the kernel's MTD partition registration path. `bool`
/// fields are safe here (no ABI boundary crossing, no ring buffer storage).
pub struct MtdPartition {
    /// Partition name (e.g., "bootloader", "kernel", "rootfs"). NUL-terminated.
    pub name:   [u8; 64],
    /// Byte offset within the parent MTD device. Must be erasesize-aligned.
    pub offset: u64,
    /// Partition size in bytes. Must be a multiple of erasesize.
    pub size:   u64,
    /// True if this partition is read-only (erase and write are rejected).
    pub ro:     bool,
}

Partition source priority (highest first):

  1. Kernel command line (mtdparts= parameter via cmdlinepart driver): mtdparts=spi0.0:512k(bootloader),1m(kernel),-(rootfs)
  2. Device tree (partitions subnode with compatible = "fixed-partitions", child nodes with reg and label properties)
  3. RedBoot FIS table (self-describing NOR flash partition table at a known offset)

Each partition appears as its own MTD device: the parent is /dev/mtd0; partitions are /dev/mtd1, /dev/mtd2, etc., in registration order.
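
The size grammar in the mtdparts= example above ("512k", "1m", "-") can be sketched as a small parser. Error handling is reduced to Option; `parse_size` is an illustrative name, not the cmdlinepart driver's actual code.

```rust
/// Parse one mtdparts size token: "512k"/"1m"/"2g" are scaled sizes,
/// "-" means "all remaining space on the device".
fn parse_size(tok: &str, remaining: u64) -> Option<u64> {
    if tok == "-" {
        return Some(remaining); // "-" consumes the rest of the device
    }
    let (digits, mult) = match tok.bytes().last()? {
        b'k' | b'K' => (&tok[..tok.len() - 1], 1u64 << 10),
        b'm' | b'M' => (&tok[..tok.len() - 1], 1u64 << 20),
        b'g' | b'G' => (&tok[..tok.len() - 1], 1u64 << 30),
        _           => (tok, 1),
    };
    digits.parse::<u64>().ok().map(|n| n * mult)
}
```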

13.22.3 Character Devices: /dev/mtdN and /dev/mtdblockN

/dev/mtdN (major 90, minor 2N): raw MTD character device. Supports sequential read()/write() with lseek(). ioctls:

ioctl Description
MEMGETINFO Returns MtdInfo for this device
MEMERASE Erase blocks (struct erase_info_user { start, length })
MEMREAD Read with OOB data
MEMWRITE Write with OOB data
MEMGETBADBLOCK Query bad block at given offset
MEMSETBADBLOCK Mark block at given offset as bad
MEMGETOOBSEL Get OOB layout (ECC byte positions, free byte positions)
MEMLOCK Write-lock sectors (NOR flash with hardware lock bits)
MEMUNLOCK Write-unlock sectors

/dev/mtdblockN (major 31, minor N): block device interface over MTD. Translates block layer read/write requests into MTD read/erase/write sequences. Suitable for FAT filesystems on NOR flash. Not suitable for NAND (use UBI + UBIFS instead — the block interface performs destructive random writes that destroy NAND without wear leveling).
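
The destructive pattern mentioned above can be made concrete with an in-memory model of the read-modify-write cycle mtdblock performs for every sub-block write. This is an illustrative model (`FlashModel` is hypothetical; a real driver caches the shadow block); the point is that each small write costs a full block erase.

```rust
struct FlashModel { data: Vec<u8>, erasesize: usize, erase_count: u64 }

impl FlashModel {
    /// Write `src` at `offset`; must not cross an erase-block boundary.
    fn block_write(&mut self, offset: usize, src: &[u8]) {
        let base = offset / self.erasesize * self.erasesize;
        // 1. Read the whole erase block into RAM.
        let mut shadow = self.data[base..base + self.erasesize].to_vec();
        // 2. Apply the modification to the shadow copy.
        shadow[offset - base..offset - base + src.len()].copy_from_slice(src);
        // 3. Erase the block: all bits return to 1.
        self.data[base..base + self.erasesize].fill(0xFF);
        self.erase_count += 1; // each small write costs a full erase
        // 4. Program the shadow copy back.
        self.data[base..base + self.erasesize].copy_from_slice(&shadow);
    }
}
```

On NOR this is merely slow; on NAND the unmanaged erase churn concentrates wear on a few blocks, which is why UBI's wear leveling is required instead.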

13.22.4 UBI (Unsorted Block Images)

UBI is a wear-leveling and bad-block management layer that sits between raw NAND and UBIFS. It maintains a volume table, distributes erases evenly across all physical erase blocks, and transparently remaps bad blocks.

/// UBI volume type.
pub enum UbiVolumeType {
    /// Writable and erasable (standard data partition).
    Dynamic,
    /// Read-only after finalization; integrity verified by ECC on every read.
    Static,
}

/// Maximum number of UBI volumes per UBI device. Matches Linux's
/// `UBI_MAX_VOLUMES` (128). Volume IDs are in the range [0, UBI_MAX_VOLUMES).
pub const UBI_MAX_VOLUMES: u32 = 128;

/// A UBI logical volume.
/// Kernel-internal type — NOT a wire format or KABI type. The actual UBI ioctl
/// structures (`struct ubi_mkvol_req`, `struct ubi_volume_desc`) are defined
/// separately with `#[repr(C)]` matching Linux `include/uapi/linux/ubi-user.h`.
pub struct UbiVolume {
    /// Volume ID (0 to UBI_MAX_VOLUMES-1).
    pub vol_id:    u32,
    /// Volume type.
    pub type_:     UbiVolumeType,
    /// Volume name. NUL-terminated.
    pub name:      [u8; 128],
    /// Volume size in bytes (multiple of leb_size).
    pub size:      u64,
    /// Logical erase block size = MTD erasesize − UBI per-block overhead (~64 B).
    pub leb_size:  u32,
    /// Logical erase block data alignment in bytes (usually 1).
    pub alignment: u32,
}

UBI exposes volumes as /dev/ubiN_M (UBI device N, volume M). UBIFS mounts directly on a UBI volume (mount -t ubifs ubi0:rootfs /).

Linux compatibility: same mtd-utils commands (flash_erase, flashcp, nandwrite, nanddump, ubiformat, ubimkvol, ubinfo, ubirename) work without modification. Identical ioctl codes, same /dev/mtdN and /dev/mtdblockN node layout.


13.23 IPMI — Intelligent Platform Management Interface

IPMI (Intelligent Platform Management Interface, version 2.0) enables out-of-band system monitoring and management via the Baseboard Management Controller (BMC). The kernel communicates with the BMC through one of four system interfaces: KCS (Keyboard Controller Style), SMIC, BT (Block Transfer), or SSIF (SMBus System Interface over I2C). Capabilities provided include: temperature, voltage, and fan telemetry via Sensor Data Records (SDR); remote power control; system event log (SEL) access; serial-over-LAN (SOL); and hardware watchdog. IPMI is universally present on server-class hardware and is required for IPMI-aware management frameworks (ipmitool, freeipmi, Redfish BMC integration).

13.23.1 IPMI Message

/// IPMI message: a request sent to or a response received from the BMC.
pub struct IpmiMsg {
    /// Network Function (NetFn). Even = request, odd = response.
    /// Common values: 0x04 Sensor/Event, 0x06 Application, 0x0A Storage,
    /// 0x2C Group Extension, 0x30–0x3F OEM/Site-specific.
    pub netfn:    u8,
    /// Command code within the NetFn.
    pub cmd:      u8,
    /// Completion code: 0x00 = success; non-zero = error (in responses).
    pub ccode:    u8,
    /// Valid bytes in `data`.
    ///
    /// **INVARIANT**: `data_len <= data.len()`. All constructors enforce this.
    /// Direct field writes MUST validate this invariant.
    ///
    /// Transport-specific maximum payload size:
    /// - KCS / BT: `data_len <= 64`
    /// - SSIF (SMBus): `data_len <= 32` (SMBus block transfer limit)
    ///
    /// All constructors (`IpmiMsg::new()`, `IpmiMsg::for_ssif()`) validate
    /// this invariant and return `Err(IpmiError::InvalidDataLen)` on
    /// violation. The `data_len` field is `u8` (0-255) but the `data`
    /// array is only 64 bytes; without validation, an out-of-bounds
    /// `data_len` would cause buffer overflows in serialization paths.
    pub data_len: u8,
    /// Message payload (max 64 bytes for KCS; 32 bytes for SSIF due to SMBus
    /// block transfer limit).
    pub data:     [u8; 64],
}

impl IpmiMsg {
    /// Create a new IPMI message for KCS or BT interfaces.
    /// Returns `Err(IpmiError::InvalidDataLen)` if `data_len > 64`.
    pub fn new(netfn: u8, cmd: u8, data: &[u8]) -> Result<Self, IpmiError> {
        if data.len() > 64 {
            return Err(IpmiError::InvalidDataLen);
        }
        let mut msg = Self::default();
        msg.netfn = netfn;
        msg.cmd = cmd;
        msg.data_len = data.len() as u8;
        msg.data[..data.len()].copy_from_slice(data);
        Ok(msg)
    }

    /// Create a new IPMI message for the SSIF (SMBus) interface.
    /// Returns `Err(IpmiError::InvalidDataLen)` if `data_len > 32`.
    pub fn for_ssif(netfn: u8, cmd: u8, data: &[u8]) -> Result<Self, IpmiError> {
        if data.len() > 32 {
            return Err(IpmiError::InvalidDataLen);
        }
        let mut msg = Self::default();
        msg.netfn = netfn;
        msg.cmd = cmd;
        msg.data_len = data.len() as u8;
        msg.data[..data.len()].copy_from_slice(data);
        Ok(msg)
    }
}

impl Default for IpmiMsg {
    fn default() -> Self {
        Self { netfn: 0, cmd: 0, ccode: 0, data_len: 0, data: [0u8; 64] }
    }
}

/// IPMI Logical Unit Number (sub-channel within a network function).
#[repr(u8)]
pub enum IpmiLun {
    /// BMC hardware.
    Bmc      = 0,
    /// OEM channel 1.
    Oem1     = 1,
    /// IPMB channel 0.
    IpmbChan = 2,
    /// OEM channel 2.
    Oem2     = 3,
}

13.23.2 System Interface Drivers

/// IPMI system interface trait. Implemented by KCS, SMIC, BT, and SSIF drivers.
pub trait IpmiSi: Send + Sync {
    /// Send an IPMI request and receive the BMC's response synchronously.
    ///
    /// Blocks until the BMC response is ready or `timeout_ms` elapses.
    fn send_recv(
        &self,
        request:    &IpmiMsg,
        response:   &mut IpmiMsg,
        timeout_ms: u32,
    ) -> Result<(), IpmiError>;

    /// Short name identifying this interface type (e.g., "kcs", "ssif", "bt").
    fn interface_type(&self) -> &'static str;
}

/// IPMI system interface error codes.
pub enum IpmiError {
    /// BMC did not respond within the timeout.
    Timeout,
    /// BMC returned a NACK (SMBus) or error completion code.
    Nack,
    /// Malformed response data.
    InvalidData,
    /// BMC busy; retry.
    DeviceBusy,
    /// `data_len` exceeds the transport-specific maximum (64 for KCS/BT,
    /// 32 for SSIF). Returned by `IpmiMsg::new()` and `IpmiMsg::for_ssif()`.
    InvalidDataLen,
    /// Underlying I/O error.
    Io(KernelError),
}

KCS (Keyboard Controller Style): The most common system interface. Uses two I/O-port register pairs: DATA_IN/DATA_OUT and STATUS/CMD. The driver implements the KCS state machine (KCS_IDLE → KCS_WRITE_START → KCS_WRITE_DATA → KCS_READ → KCS_IDLE), polling at 100 µs intervals. Switches to interrupt-driven operation if the BMC asserts a system IRQ.

SSIF (SMBus System Interface): IPMI over SMBus (I2C). Maximum payload 32 bytes per SMBus block transfer. Messages exceeding 32 bytes use multi-part transactions. Implemented by IpmiSsif on top of the I2cBus trait (Section 13.13).

BT (Block Transfer): Three I/O-port registers; supports BMC-initiated interrupts to the host. Deprecated in new platform designs.

13.23.3 /dev/ipmiN Character Device

Each IPMI interface creates a /dev/ipmiN node (major 239, dynamic minor per kernel assignment):

/// Userspace request structure for IPMICTL_SEND_COMMAND.
/// Layout matches Linux `struct ipmi_req` for ABI compatibility.
/// Note: `addr` is pointer-sized and `msgid` is C `long` (4 bytes on ILP32,
/// 8 bytes on LP64). On 64-bit kernels, a 32-bit compat ioctl handler
/// must translate compat_uptr_t and compat_long_t.
#[repr(C)]
pub struct IpmiReq {
    /// Pointer to IPMI address structure (kernel reads via copy_from_user).
    pub addr:     *const IpmiAddrT,
    /// Size of the address structure.
    pub addr_len: u32,
    /// Caller-assigned message ID; returned unchanged with the response.
    /// C `long` — use KernelLong for correct width on ILP32 vs LP64.
    pub msgid:    KernelLong,
    /// Message header: netfn, cmd, data length, and pointer to data buffer.
    pub msg:      IpmiMsgHdr,
}

#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<IpmiReq>() == 40);
// 32-bit (ILP32): addr(ptr=4) + addr_len(u32=4) + msgid(KernelLong=4) + msg(IpmiMsgHdr).
// IpmiMsgHdr is pointer-width-dependent (contains *const u8 data pointer).
// Exact size depends on IpmiMsgHdr definition. The compat ioctl layer
// translates between 32-bit compat_ipmi_req and the kernel's native layout.
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<IpmiReq>() == 24);

// ioctl command codes — computed from _IOR/_IOW/_IOWR macros at compile
// time with correct sizeof for the target architecture. The values shown
// here are for LP64 targets. On ILP32 targets (ARMv7, PPC32), sizeof
// differs for pointer-containing structs (ipmi_req, ipmi_recv) and the
// compat ioctl layer translates using compat_ipmi_req/compat_ipmi_recv
// (matching Linux drivers/char/ipmi/ipmi_devintf.c:compat_ipmi_ioctl()).
// Formula: dir(2) << 30 | size(14) << 16 | type(8) << 8 | nr(8)
// IPMI_IOC_MAGIC = 'i' = 0x69
pub const IPMICTL_SEND_COMMAND:           u32 = 0x8028690D; // _IOR('i', 13, ipmi_req) [LP64]
pub const IPMICTL_RECEIVE_MSG_TRUNC:      u32 = 0xC030690B; // _IOWR('i', 11, ipmi_recv) [LP64]
pub const IPMICTL_RECEIVE_MSG:            u32 = 0xC030690C; // _IOWR('i', 12, ipmi_recv) [LP64]
pub const IPMICTL_REGISTER_FOR_CMD:       u32 = 0x8002690E; // _IOR('i', 14, ipmi_cmdspec(2 bytes))
pub const IPMICTL_UNREGISTER_FOR_CMD:     u32 = 0x8002690F; // _IOR('i', 15, ipmi_cmdspec(2 bytes))
pub const IPMICTL_SET_MY_CHANNEL_ADDRESS: u32 = 0x80046983;
pub const IPMICTL_GET_MY_CHANNEL_ADDRESS: u32 = 0x40046984;
pub const IPMICTL_SET_TIMING_PARMS:       u32 = 0x80106985;
pub const IPMICTL_GET_TIMING_PARMS:       u32 = 0x40106986;
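
The dir/size/type/nr formula above can be verified mechanically. A sketch (the `ioc` helper and direction constants are illustrative names, not kernel API):

```rust
const IOC_READ: u32 = 2; // _IOR direction bits
const IOC_RDWR: u32 = 3; // _IOWR direction bits

/// Assemble an ioctl code: dir(2) << 30 | size(14) << 16 | type(8) << 8 | nr(8).
fn ioc(dir: u32, size: u32, ty: u32, nr: u32) -> u32 {
    (dir << 30) | (size << 16) | (ty << 8) | nr
}
```

With LP64 sizeof(ipmi_req) = 40 and sizeof(ipmi_recv) = 48, `ioc(IOC_READ, 40, 0x69, 13)` reproduces IPMICTL_SEND_COMMAND and `ioc(IOC_RDWR, 48, 0x69, 12)` reproduces IPMICTL_RECEIVE_MSG.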

select()/poll()/epoll() on /dev/ipmiN: becomes readable when a response or an asynchronous event message from the BMC is available. Multiple processes may open /dev/ipmiN simultaneously; responses are demultiplexed by the msgid field.

13.23.4 Platform Event / Panic Notifier

On kernel panic, UmkaOS sends a Platform Event Message to the BMC so it can log the event, alert the management network, or trigger an automatic power cycle after a configurable delay.

/// Sends an IPMI OS Critical Stop event to the BMC on kernel panic.
pub struct IpmiPanicNotifier {
    /// IPMI system interface to use.
    pub si: Arc<dyn IpmiSi>,
}

impl PanicNotifier for IpmiPanicNotifier {
    fn notify_panic(&self, _msg: &str) {
        // Platform Event Message: NetFn=0x04 (Sensor/Event), Cmd=0x02
        let mut req = IpmiMsg::default();
        req.netfn    = 0x04;
        req.cmd      = 0x02;
        req.data_len = 8;
        req.data[0]  = 0x41; // Generator ID: 0x41 = system software (BIOS/SMS, IPMI Table 5-4)
        req.data[1]  = 0x04; // EvMRev: IPMI 2.0 Platform Event format (0x04 per Table 29-5)
        req.data[2]  = 0x20; // Sensor Type: OS Critical Stop
        req.data[3]  = 0xFF; // Sensor Number: OS critical stop (0xFF = unspecified)
        req.data[4]  = 0x6F; // Event Dir: assertion; Event Type: sensor-specific
        req.data[5]  = 0x01; // Event Data 1: run-time critical stop (bit 0 = offset 01h)
        req.data[6]  = 0xFF; // Event Data 2: unspecified
        req.data[7]  = 0xFF; // Event Data 3: unspecified
        // Ignore errors: BMC may not respond during a panic.
        let _ = self.si.send_recv(&req, &mut IpmiMsg::default(), 500);
    }
}

Linux compatibility: identical /dev/ipmiN ioctl interface; ipmitool, freeipmi, and OpenIPMI userspace libraries work without modification. ACPI _HID = "IPI0001" device detection and ipmi_si PnP IDs are supported.


13.24 UIO — Userspace I/O

UIO (Userspace I/O) allows complete device drivers to be implemented in userspace. A minimal kernel stub registers the device, maps device memory regions (MMIO BARs, reserved RAM) into the process address space via mmap() on /dev/uioN, and delivers hardware interrupts to userspace via a blocking read(). This is appropriate for FPGAs, industrial I/O cards, custom hardware with no existing kernel driver, and legacy proprietary hardware where a vendor supplies a userspace driver binary. Kernel code for a UIO device is minimal: it only provides the UioDevice trait implementation; the rest of the driver lives in userspace.

13.24.1 UioDevice Trait

/// Kernel stub trait for a UIO device. Implemented once per device type.
pub trait UioDevice: Send + Sync {
    /// Device name shown in /sys/class/uio/uioN/name.
    fn name(&self) -> &str;

    /// Driver version string shown in /sys/class/uio/uioN/version.
    fn version(&self) -> &str;

    /// Memory regions to expose via mmap. Maximum UIO_MAX_MAPS (5) regions.
    fn mem_regions(&self) -> &[UioMem];

    /// Called when userspace writes 1 to /dev/uioN to re-enable the interrupt
    /// after it has been delivered. Prevents interrupt storms before userspace
    /// has finished processing.
    fn irq_control(&self, enable: bool);

    /// Called in interrupt context when the hardware asserts the IRQ.
    ///
    /// The implementation must disable the interrupt at the hardware level
    /// (to prevent re-entry) and return true to wake all blocked readers on
    /// /dev/uioN.
    fn irq_handler(&self) -> bool;
}

/// A physical or virtual memory region exposed via mmap on /dev/uioN.
pub struct UioMem {
    /// Physical base address of the region (for MMIO BARs or reserved RAM).
    pub addr:  u64,
    /// Size of the region in bytes. Must be a multiple of PAGE_SIZE.
    /// u64 for platform independence — some FPGA BAR regions exceed 4 GiB
    /// (not representable in usize on 32-bit platforms).
    pub size:  u64,
    /// Memory type: determines how the mmap mapping is established.
    pub type_: UioMemType,
    /// Region name shown in sysfs maps/mapN/name. NUL-terminated.
    pub name:  [u8; 32],
}

/// How a UIO memory region is physically mapped into userspace.
pub enum UioMemType {
    /// Slot is unused (padding to preserve index of later slots).
    None,
    /// Physically contiguous memory (device MMIO or reserved RAM).
    /// mmap returns an uncached (write-combining or device) mapping.
    PhysContiguous,
    /// Kernel virtual memory (vmalloc area).
    /// mmap remaps the kernel virtual pages into the user VMA.
    Virtual,
    /// Kernel logical memory (struct page array, contiguous in kernel VA).
    /// mmap uses remap_pfn_range over the page frames.
    Logical,
}

13.24.2 /dev/uioN Character Device

mmap(): Each UioMem region is mapped at a fixed file offset: region 0 at offset 0, region 1 at offset 1 * PAGE_SIZE, region N at offset N * PAGE_SIZE (where PAGE_SIZE equals getpagesize() and UIO_MAX_MAPS = 5). Example: mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, region_index * getpagesize()). Userspace accesses hardware registers directly via the returned virtual address.

read(): Blocks until the UIO interrupt fires. Returns a u32 (4 bytes) containing the cumulative interrupt count since the device was opened. Supports O_NONBLOCK + select()/poll()/epoll().

write(): Write the value 1u32 (4 bytes) to re-enable the hardware interrupt after processing. This is required before the next read() will block again on a new interrupt edge.

sysfs under /sys/class/uio/uioN/:

Path Description
name Device name (from UioDevice::name())
version Driver version (from UioDevice::version())
event Current interrupt count (mirrors read())
maps/map0/addr Physical address of region 0 (hex)
maps/map0/size Size of region 0 (hex)
maps/map0/name Name of region 0
maps/map0/offset mmap offset for region 0 (always 0)
maps/mapN/… Same fields for regions 1–4

13.24.3 uio_pdrv_genirq

uio_pdrv_genirq is a generic kernel component that turns any platform device with an IRQ into a UIO device without any device-specific kernel code. The interrupt handler disables the IRQ line and wakes userspace readers; the userspace driver re-enables the IRQ via write(1). This is the primary mechanism for FPGA and custom I/O card support in UmkaOS.

Linux compatibility: same /dev/uioN ABI; same sysfs layout; Linux UIO userspace libraries (libuio) and drivers written for Linux UIO work without modification.


13.25 NVMEM — Non-Volatile Memory Framework

The NVMEM (Non-Volatile Memory) framework provides a unified kernel interface to small non-volatile storage cells: I2C EEPROMs (AT24C series), SPI EEPROMs (AT25 series), OTP fuses (silicon eFuse banks), NVRAM cells inside RTC chips, battery-backed SRAM, and U-Boot environment variable storage. The framework decouples providers (EEPROM/fuse drivers that know how to read and write bytes) from consumers (e.g., Ethernet drivers needing a programmed MAC address, audio drivers needing factory calibration constants, clock drivers needing trim values). Consumers reference cells by symbolic name via device-tree or ACPI declarations rather than by raw byte offset.

13.25.1 Data Structures

/// Maximum number of named cells per NVMEM device (cold-path probe-time bound).
pub const NVMEM_MAX_CELLS: usize = 256;

/// An NVMEM provider device (one EEPROM chip, one OTP bank, etc.).
///
/// **Tier**: Tier 0 (kernel-internal). NVMEM provider drivers are compiled into
/// the kernel. `NvmemOps` is a Rust trait, not a KABI vtable.
pub struct NvmemDevice {
    /// Device model name (e.g., "at24c256", "imx-ocotp"). NUL-terminated.
    pub name:      [u8; 64],
    /// Total addressable size in bytes. u64 for platform independence —
    /// largest known NVMEM devices (SPI NOR flash) are ~256 MiB, well within
    /// u32 range, but u64 avoids 32-bit platform truncation for future devices.
    pub size:      u64,
    /// True if the device or the current software policy prohibits writes.
    pub read_only: bool,
    /// True if this is a One-Time-Programmable device (bits can only be
    /// written once and cannot be erased).
    pub otp:       bool,
    /// Named cells registered for this device.
    /// Bounded: typically 4-32 cells per NVMEM device (MAC address, serial
    /// number, calibration data). Max NVMEM_MAX_CELLS. Populated at device
    /// probe (cold path). If the device tree declares more than NVMEM_MAX_CELLS
    /// cells, probe logs an FMA warning and truncates.
    // Collection policy: Vec on cold path (probe).  Capacity bounded by
    // NVMEM_MAX_CELLS (256).  Typical: 4-8 cells.  ArrayVec<NvmemCell, 256>
    // would waste ~20 KiB per device for the common 4-cell case.
    pub cells:     Vec<NvmemCell>,
    /// Read/write operations implemented by the provider driver.
    pub ops:       Arc<dyn NvmemOps>,
}

/// A named data cell within an NVMEM device.
pub struct NvmemCell {
    /// Cell name as declared in device-tree or ACPI (e.g., "mac-address",
    /// "calibration-data", "serial-number"). NUL-terminated.
    pub name:       [u8; 64],
    /// Byte offset of the cell's first byte within the NVMEM device.
    /// u32 range: max 4 GiB — sufficient for all known NVMEM devices
    /// (largest EEPROM is ~256 KiB, largest SPI NOR is ~256 MiB).
    pub offset:     u32,
    /// Cell size in bits. Cells smaller than 8 bits use `bit_offset`.
    pub nbits:      u32,
    /// Bit offset within the byte at `offset` for sub-byte cells
    /// (e.g., a 4-bit trim value packed into the upper nibble of a byte).
    pub bit_offset: u8,
    /// True if the cell may be written. False for OTP cells already programmed
    /// or cells in read-only regions.
    pub writable:   bool,
}

/// NVMEM provider operations trait.
pub trait NvmemOps: Send + Sync {
    /// Read `buf.len()` bytes starting at byte `offset` within the NVMEM device.
    fn read(&self, offset: u32, buf: &mut [u8]) -> Result<(), KernelError>;

    /// Write `data.len()` bytes starting at byte `offset`.
    ///
    /// Returns `Err(EROFS)` if `NvmemDevice::read_only` is true.
    /// Returns `Err(EPERM)` if OTP cell is already programmed.
    fn write(&self, offset: u32, data: &[u8]) -> Result<(), KernelError>;
}
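
The `bit_offset`/`nbits` convention of NvmemCell can be sketched for cells that fit inside a single byte (`read_sub_byte_cell` is an illustrative helper; `bit_offset` counts from the least-significant bit, as in the trim-nibble example above):

```rust
/// Extract an nbits-wide field starting at bit_offset from a raw byte.
fn read_sub_byte_cell(raw: u8, bit_offset: u8, nbits: u32) -> u8 {
    debug_assert!(bit_offset as u32 + nbits <= 8);
    let mask = ((1u16 << nbits) - 1) as u8; // e.g. nbits=4 -> 0x0F
    (raw >> bit_offset) & mask
}
```

A 4-bit trim value packed into the upper nibble of raw byte 0xA5 reads as 0xA.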

13.25.2 Consumer API

Consumer drivers call these functions from umka-core/src/nvmem/consumer.rs:

/// Look up an NVMEM cell handle by consumer device node and cell name.
///
/// `consumer` is the device node of the driver consuming the cell (used to
/// resolve the `nvmem-cells` + `nvmem-cell-names` DT properties).
/// `cell_name` is the symbolic cell name (e.g., "mac-address").
pub fn nvmem_cell_get(
    consumer:  &DeviceNode,
    cell_name: &str,
) -> Result<NvmemCellHandle, KernelError>;

/// Read the entire contents of `handle`'s cell into `buf`.
///
/// Returns the number of bytes written to `buf`.
pub fn nvmem_cell_read(
    handle: &NvmemCellHandle,
    buf:    &mut [u8],
) -> Result<usize, KernelError>;

/// Convenience wrapper: read a 6-byte MAC address from the cell named
/// "mac-address" on `consumer`. Handles big-endian byte order if the cell
/// is stored MSB-first (as is conventional in EEPROM MAC storage).
pub fn nvmem_cell_read_mac_address(
    consumer: &DeviceNode,
) -> Result<[u8; 6], KernelError>;

/// Write `data` to `handle`'s cell.
///
/// Returns `Err(EROFS)` if the device or cell is read-only.
/// Returns `Err(EPERM)` if the OTP bit is already set.
pub fn nvmem_cell_write(
    handle: &NvmemCellHandle,
    data:   &[u8],
) -> Result<(), KernelError>;

13.25.3 sysfs Interface

/sys/bus/nvmem/devices/<name>/
├── nvmem              rw- (r-- if the device is read_only)
│                      Raw byte access: supports lseek() + read()/write()
│                      with byte offset mapping directly to NVMEM address space
└── cells/
    └── <cell_name>    r--  Raw bytes of the named cell

Linux compatibility: same device-tree bindings (nvmem-cells, nvmem-cell-names, #nvmem-cell-cells); same sysfs layout; same consumer API function names. nvmem-tools userspace utilities work without modification.


13.26 SoundWire Bus Framework

SoundWire (MIPI Alliance SoundWire Specification version 1.2) is a two-wire (clock + data) serial audio bus used on Intel Tiger Lake, Alder Lake, Meteor Lake, and Raptor Lake SoCs to connect digital audio peripherals (codecs, amplifiers, DMIC arrays) via the PCH High-Definition Audio Multi-Link (HDAML) controller. SoundWire replaces the parallel HDA pin connections used by previous generations of external codecs. The UmkaOS SoundWire framework lives in umka-kernel/src/drivers/soundwire/ and integrates with the ASoC framework (Section 21.4) for stream management.

13.26.1 Bus Architecture

SoC PCH
├── Intel HDAML controller  (soundwire-intel driver)
│   ├── SoundWire link 0    (master, 48 MHz ref clock, 12.288 Mbit/s)
│   │   ├── Peripheral 0: RT712 codec      (Realtek, dev_num 1)
│   │   └── Peripheral 1: RT715 DMIC array (Realtek, dev_num 2)
│   └── SoundWire link 1    (master, second codec pair)
│       └── Peripheral 0: CS35L45 amplifier (Cirrus Logic, dev_num 1)
└── Legacy HDA controller   (for internal speakers / HDA codecs)

Each SoundWire link is a separate logical bus. Peripherals are automatically enumerated by the master during link startup: each peripheral responds with its MIPI manufacturer ID, part ID, class code, and firmware version.

13.26.2 Data Structures

/// A discovered SoundWire peripheral (codec, amplifier, or microphone array).
pub struct SdwPeripheral {
    /// SoundWire unique address assigned during enumeration (1-14; 0 and 15
    /// are reserved per MIPI SoundWire 1.2 spec). During enumeration, if a
    /// peripheral reports dev_num 0 or 15, the bus driver rejects it with
    /// an FMA warning (`"SoundWire: reserved dev_num {N} on link {link_id}"`)
    /// and does not create a device node.
    pub dev_num:    u8,
    /// MIPI-registered manufacturer ID (e.g., 0x025D = Realtek).
    pub mfr_id:     u16,
    /// Manufacturer-assigned part identifier.
    pub part_id:    u16,
    /// MIPI device class code (0x01 = audio codec, 0x02 = amplifier,
    /// 0x03 = microphone).
    pub class_code: u8,
    /// Peripheral firmware revision number.
    pub version:    u8,
}

/// PCM audio stream configuration for a SoundWire link.
/// `#[repr(C)]`: crosses KABI boundary via `*const SdwStream` in `SdwPeripheralOps`.
#[repr(C)]
pub struct SdwStream {
    /// Human-readable name for debug output (e.g., "playback", "capture").
    /// NUL-terminated.
    pub name:            [u8; 32],   // 32 bytes (offset 0)
    /// Number of audio channels (e.g., 2 for stereo, 8 for surround).
    pub num_channels:    u8,         // 1 byte   (offset 32)
    /// Explicit padding for u32 alignment of `sample_rate`.
    pub _pad0:           [u8; 3],    // 3 bytes  (offset 33)
    /// PCM sample rate in Hz (e.g., 44100, 48000, 96000, 192000).
    pub sample_rate:     u32,        // 4 bytes  (offset 36)
    /// Bit depth per sample (16, 20, 24, or 32).
    pub bits_per_sample: u8,         // 1 byte   (offset 40)
    /// SoundWire frame shape: number of rows per audio frame.
    /// Valid values: 48, 50, 60, 64, 72, 75, 80, 125, 147, 192, 250.
    pub frame_rows:      u8,         // 1 byte   (offset 41)
    /// SoundWire frame shape: number of columns per audio frame (2–16).
    pub frame_cols:      u8,         // 1 byte   (offset 42)
    /// Explicit padding for u32 alignment of `stream_id`.
    pub _pad1:           u8,         // 1 byte   (offset 43)
    /// Stream identifier assigned by the bus manager during `stream_enable`.
    /// Used by `stream_disable` to identify the stream to tear down.
    pub stream_id:       u32,        // 4 bytes  (offset 44)
    /// Data port assignments: which SoundWire data ports carry this stream.
    /// Bounded: max 14 data ports per MIPI SoundWire spec (DP1-DP14).
    /// Fixed array with explicit count for KABI safety (not ArrayVec).
    pub ports:           [SdwPortConfig; 14], // 168 bytes (offset 48)
    /// Number of valid entries in `ports`.
    pub num_ports:       u8,         // 1 byte   (offset 216)
    /// Explicit trailing padding to u32 struct alignment.
    pub _pad2:           [u8; 3],    // 3 bytes  (offset 217)
    // Total: 32+1+3+4+1+1+1+1+4+168+1+3 = 220 bytes.
}
const_assert!(size_of::<SdwStream>() == 220);

/// Mapping of a stream channel group to a SoundWire data port.
/// `#[repr(C)]`: embedded in `SdwStream` which crosses KABI boundary.
#[repr(C)]
pub struct SdwPortConfig {
    /// Data port number on the peripheral (1–14).
    pub port_num:  u8,              // 1 byte  (offset 0)
    /// Explicit padding for u32 alignment of `ch_mask`.
    pub _pad0: [u8; 3],            // 3 bytes (offset 1)
    /// Bitmask of channels assigned to this port within the stream.
    pub ch_mask:   u32,             // 4 bytes (offset 4)
    /// Data port mode: 0 = isochronous (default), 1 = tx controlled,
    /// 2 = rx controlled, 3 = simplified.
    pub port_mode: u8,              // 1 byte  (offset 8)
    /// Explicit trailing padding to u32 struct alignment.
    pub _pad1: [u8; 3],            // 3 bytes (offset 9)
    // Total: 1 + 3 + 4 + 1 + 3 = 12 bytes.
}
const_assert!(size_of::<SdwPortConfig>() == 12);

/// KABI vtable for a SoundWire peripheral driver.
///
/// Peripheral drivers (codec, amplifier) implement this vtable. The
/// SoundWire bus manager calls into it for register access, stream lifecycle,
/// and interrupt handling.
#[repr(C)]
pub struct SdwPeripheralOps {
    // SAFETY: `ctx` in all vtable functions is the peripheral driver's
    // private data pointer, valid for the lifetime of the device registration
    // (from `sdw_register_peripheral()` until `sdw_unregister_peripheral()`).
    // Owned by the peripheral driver. The bus manager never dereferences `ctx`
    // except through these vtable functions.

    /// Bounds-safety check: byte count of this vtable. Must be `u64` (not `usize`)
    /// per [Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle) Rule 3: vtable_size is part of the stable KABI and must
    /// have the same width on 32-bit and 64-bit targets.
    pub vtable_size: u64,
    /// Primary version discriminant: `KabiVersion::as_u64()`. See [Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle) Rule 6.
    pub kabi_version: u64,
    /// Read a SoundWire register at `addr` (32-bit address, 8-bit value
    /// per SoundWire spec section 10). Returns value in low 8 bits; high
    /// bits are zero on success, 0xFFFF_FFFF on bus error.
    pub read_reg: unsafe extern "C" fn(
        ctx:  *mut c_void,
        addr: u32,
    ) -> u32,
    /// Write `value` (8 bits) to SoundWire register at `addr`.
    /// The bus manager passes only the low 8 bits; peripheral drivers
    /// SHOULD ignore bits [31:8] but MUST NOT rely on them being zero.
    /// This matches the defensive pattern: bus manager masks to `value & 0xFF`
    /// before calling; driver ignores high bits.
    pub write_reg: unsafe extern "C" fn(
        ctx:   *mut c_void,
        addr:  u32,
        value: u32,
    ),
    /// Prepare and enable a PCM stream on this peripheral.
    ///
    /// `dir`: 0 = capture (peripheral → host), 1 = playback (host → peripheral).
    /// Returns 0 on success, negative errno on error.
    pub stream_enable: unsafe extern "C" fn(
        ctx:    *mut c_void,
        stream: *const SdwStream,
        dir:    u8,
    ) -> i32,
    /// Disable and release the stream identified by `stream_id`.
    pub stream_disable: unsafe extern "C" fn(
        ctx:       *mut c_void,
        stream_id: u32,
    ),
    /// Handle a SoundWire interrupt delivered to this peripheral.
    ///
    /// `status` is the INTSTAT register value. The driver clears the
    /// interrupt source and returns.
    pub interrupt: unsafe extern "C" fn(
        ctx:    *mut c_void,
        status: u32,
    ),
}
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<SdwPeripheralOps>() == 56);
// 32-bit: vtable_size(8) + kabi_version(8) + 5 fn ptrs(4 each) = 36 bytes.
// Struct alignment = 8 (from u64 fields on ARMv7/PPC32), so 36 padded to 40.
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<SdwPeripheralOps>() == 40);
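The `vtable_size` field exists so the bus manager can refuse to dispatch through a vtable that is too small for the function pointer it is about to use. A minimal userspace model of that check, using a truncated stand-in struct (`OpsStub` and `bus_read_reg` are hypothetical names, not framework API):

```rust
use std::ffi::c_void;
use std::mem::size_of;

/// Stand-in for the first three SdwPeripheralOps fields, used only to
/// illustrate the vtable_size bounds check.
#[repr(C)]
struct OpsStub {
    vtable_size: u64,
    kabi_version: u64,
    read_reg: unsafe extern "C" fn(*mut c_void, u32) -> u32,
}

unsafe extern "C" fn stub_read_reg(_ctx: *mut c_void, addr: u32) -> u32 {
    // Pretend register file: every register reads back its low address byte.
    addr & 0xFF
}

/// Bus-manager-side dispatch: reject vtables smaller than the layout we
/// compiled against (e.g., from an older, incompatible driver build).
fn bus_read_reg(ops: &OpsStub, ctx: *mut c_void, addr: u32) -> Option<u32> {
    if (ops.vtable_size as usize) < size_of::<OpsStub>() {
        return None;
    }
    Some(unsafe { (ops.read_reg)(ctx, addr) })
}

fn main() {
    let ops = OpsStub {
        vtable_size: size_of::<OpsStub>() as u64,
        kabi_version: 1,
        read_reg: stub_read_reg,
    };
    assert_eq!(bus_read_reg(&ops, std::ptr::null_mut(), 0x0150), Some(0x50));
}
```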

13.26.3 Power States

SoundWire defines three standardized power management states:

State        Clock              Description
D0           Full speed         Active; streams running normally.
ClockStop    Stopped (gated)    Inactive; all peripherals maintain register state across the stop. The master gates the clock pin after the ClockStop Prepare handshake.
ClockStop2   Stopped (gated)    Deepest sleep; peripherals may discard volatile register state. Non-volatile configuration (e.g., OTP-based defaults) is preserved.

Transition into ClockStop: master broadcasts ClockStop Prepare command → all peripherals ACK → master asserts ClockStop status → master gates clock. Wake: master restarts clock → peripherals detect clock activity → bus enumeration runs again → streams re-established from saved state.

13.26.4 Integration with ASoC (ALSA SoC)

SoundWire peripherals register as ASoC codec components. The sdw_master_device (the Intel HDAML controller) binds each SoundWire link to the ASoC machine driver. Stream bring-up sequence:

  1. ASoC DAPM (Dynamic Audio Power Management) resolves the active audio route.
  2. The machine driver identifies the SoundWire data ports involved.
  3. SdwPeripheralOps::stream_enable is called on each peripheral on the path.
  4. The Intel HDAML hardware programs the SoundWire frame shape (frame_rows × frame_cols) and asserts the SoundWire clock to start isochronous data transfer.
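The frame-shape arithmetic in step 4 can be checked with a short calculation. Assuming one bit slot per (row, column) cell, the frame rate is the link bit rate divided by frame_rows × frame_cols; this sketch (the helper name is illustrative) uses the 12.288 Mbit/s link rate from Section 13.26.1:

```rust
/// Frames per second for a given link bit rate and frame shape, assuming
/// one bit slot per (row, column) cell: each frame carries rows * cols bits.
fn frame_rate_hz(bit_rate: u32, frame_rows: u8, frame_cols: u8) -> u32 {
    bit_rate / (frame_rows as u32 * frame_cols as u32)
}

fn main() {
    // A 64-row x 4-column frame on a 12.288 Mbit/s link yields exactly
    // 48000 frames/s: one frame per sample period of a 48 kHz stream.
    assert_eq!(frame_rate_hz(12_288_000, 64, 4), 48_000);
    // A 48-row x 2-column frame yields 128000 frames/s.
    assert_eq!(frame_rate_hz(12_288_000, 48, 2), 128_000);
}
```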

Linux compatibility: UmkaOS's SoundWire implementation follows the MIPI SoundWire 1.2 specification. Peripheral devices supported by the Linux soundwire-intel driver (Realtek RT711, RT712, RT715; Cirrus Logic CS35L45; Maxim MAX98373) work on UmkaOS using the same ACPI firmware tables. The sdw_stream_config register layout and MIPI frame-shape encoding are spec-compliant and identical to Linux. ASoC machine driver DT/ACPI bindings are compatible.


13.27 Regulator Framework

Tier: Tier 0 (in-kernel). PMIC drivers are boot-critical (they power other devices) and cannot be contained by the driver isolation layer; a PMIC failure is therefore unrecoverable. All regulator framework types are kernel-internal and do not cross KABI boundaries.

Embedded SoCs and mobile platforms contain multiple independently-controllable power rails, each managed by a PMIC (Power Management Integrated Circuit) connected via I2C, SPI, or SPMI. Device drivers need to:
- Enable or disable their power supply rail when the device is powered on or off
- Request a specific operating voltage (e.g., core logic at 0.9 V, I/O at 1.8 V)
- Share a rail with other devices (the rail stays enabled as long as any consumer is active)
- Limit current draw to within the PMIC's rated range

Without a regulator framework, each driver accesses PMIC registers directly, duplicating protocol code, creating concurrent access races on shared I2C buses, and making it impossible to track which drivers are holding a rail enabled.

Cross-references: clock framework (Section 2.24), device probe sequence (Section 11.6), per-device runtime PM (Section 7.5).


13.27.1 Design: Voltage Voting Model

Multiple consumers can simultaneously request different voltages from the same rail. UmkaOS uses a voting model: each consumer declares an acceptable [min_uv, max_uv] range. The regulator core sets the rail voltage to max(all consumer min_uv), provided this is within min(all consumer max_uv). This satisfies all consumers simultaneously at the lowest voltage that meets all minimum requirements — saving power.

UmkaOS vs Linux regulator framework:
- Linux: struct regulator_dev with void *driver_data, struct regulator_ops table
- UmkaOS: RegulatorOps is a Rust trait (type-safe); RegulatorConsumer auto-disables on drop (RAII: forgetting to call regulator_disable is impossible)


13.27.2 Core Types

/// Description of one voltage regulator provided by a PMIC driver.
/// Registered by the PMIC driver at probe time.
/// Kernel-internal. Does not cross KABI boundaries.
pub struct RegulatorDesc {
    /// Unique name (used by consumers via regulator_get).
    pub name:        &'static str,
    /// Name of this regulator's power supply (parent regulator or fixed supply).
    /// Forms a rail dependency tree; the parent is enabled before this regulator.
    pub supply_name: &'static str,
    /// Number of discrete voltage steps (used with step_uv).
    pub n_voltages:  u32,
    /// Minimum voltage in microvolts. 0 if voltage is not configurable.
    pub min_uv:      u32,
    /// Maximum voltage in microvolts.
    pub max_uv:      u32,
    /// Voltage step in microvolts. 0 = use volt_table instead.
    pub step_uv:     u32,
    /// Explicit voltage table for PMICs with non-linear voltage steps.
    /// If Some: volt_table[register_value] gives voltage in µV.
    pub volt_table:  Option<&'static [u32]>,
    /// PMIC hardware operations.
    pub ops:         &'static dyn RegulatorOps,
    /// PMIC-private data (I2C address, register offsets, etc.).
    pub driver_data: *mut (),
}

// SAFETY: All RegulatorOps methods are called with the RegulatorInstance's
// hw_lock held (serialized access).  Drivers MUST NOT retain or access
// driver_data outside of ops callbacks.  The lock ensures no concurrent
// access to the pointed-to data.
unsafe impl Send for RegulatorDesc {}
unsafe impl Sync for RegulatorDesc {}
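The two voltage encodings in RegulatorDesc (linear `step_uv` vs explicit `volt_table`) imply a selector-to-microvolts mapping inside the PMIC driver. A hedged sketch of that mapping (the helper name `selector_to_uv` is illustrative, not part of the spec):

```rust
/// Map a PMIC register selector to microvolts, following RegulatorDesc's
/// two encodings: linear (min_uv + sel * step_uv, bounded by n_voltages),
/// or an explicit table for PMICs with non-linear voltage steps.
fn selector_to_uv(
    min_uv: u32,
    step_uv: u32,
    n_voltages: u32,
    volt_table: Option<&[u32]>,
    sel: u32,
) -> Option<u32> {
    match volt_table {
        Some(table) => table.get(sel as usize).copied(),
        None if sel < n_voltages => Some(min_uv + sel * step_uv),
        None => None, // selector out of range
    }
}

fn main() {
    // Linear regulator: 0.80 V base, 12.5 mV steps, 64 selectors.
    assert_eq!(selector_to_uv(800_000, 12_500, 64, None, 8), Some(900_000));
    // Table-driven regulator with non-linear steps.
    let table = [1_800_000, 2_500_000, 3_300_000];
    assert_eq!(selector_to_uv(0, 0, 3, Some(&table), 2), Some(3_300_000));
    assert_eq!(selector_to_uv(0, 0, 3, Some(&table), 5), None);
}
```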

/// Hardware operations implemented by each PMIC driver for each regulator.
/// All methods are called with the regulator instance lock held.
pub trait RegulatorOps: Send + Sync {
    fn enable(&self, data: *mut ()) -> Result<(), KernelError>;
    fn disable(&self, data: *mut ()) -> Result<(), KernelError>;
    fn is_enabled(&self, data: *mut ()) -> Result<bool, KernelError>;
    fn get_voltage_uv(&self, data: *mut ()) -> Result<u32, KernelError>;
    /// Set voltage within [min_uv, max_uv]. Returns the actually-set voltage.
    fn set_voltage_uv(
        &self,
        data:   *mut (),
        min_uv: u32,
        max_uv: u32,
    ) -> Result<u32, KernelError>;
}

/// Internal state for one physical regulator instance.
pub struct RegulatorInstance {
    pub desc:         &'static RegulatorDesc,
    /// Count of consumers that have called enable(). Hardware is enabled while > 0.
    /// **Invariant**: `enable_count >= 0` at all times. `disable()` checks
    /// `enable_count.load(Acquire) > 0` before decrementing; if already 0,
    /// returns `Err(KernelError::InvalidState)` and logs an FMA warning
    /// (`"regulator {name}: disable() called with enable_count == 0"`).
    /// Using `AtomicI32` (not `AtomicU32`) allows detecting underflow bugs
    /// at runtime rather than silently wrapping.
    pub enable_count: AtomicI32,
    /// Currently voted and set voltage in microvolts.
    pub voted_uv:     AtomicU32,
    /// Fast-path lock for enable_count and voted_uv updates (atomic-only operations).
    state_lock:       SpinLock<()>,
    /// Slow-path lock for hardware access (set_voltage_uv, enable/disable via I2C/SPI).
    /// Mutex because I2C/SPI transactions are blocking (~100-500μs) and cannot be
    /// held under a SpinLock. Acquired AFTER state_lock is released (never nested).
    ///
    /// **TOCTOU mitigation**: After acquiring hw_lock, the caller MUST re-read
    /// `voted_uv` and recompute the satisfiable range if it changed since
    /// state_lock was released. This eliminates the race where a concurrent
    /// consumer vote arrives between state_lock release and hw_lock acquisition,
    /// which could momentarily program a voltage outside a consumer's valid
    /// range. The re-read check costs one extra atomic load (~1 cycle) on the
    /// slow I2C/SPI path (~100-500μs) — negligible overhead.
    ///
    /// Worst-case TOCTOU window without re-read: one I2C/SPI transaction
    /// duration (~100-500μs). With re-read: eliminated entirely (the
    /// programmed voltage always reflects the latest voted_uv).
    hw_lock:          Mutex<()>,
}
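The enable_count underflow guard described above can be modeled in isolation. This sketch uses std atomics in place of the kernel's types; `try_disable` is a hypothetical stand-in for the framework-internal decrement, not a spec API:

```rust
use std::sync::atomic::{AtomicI32, Ordering};

/// Decrement enable_count only if it is > 0. Returns Ok(true) when the
/// count reached 0 (hardware should now be disabled), Ok(false) when
/// other consumers keep the rail on, and Err for the FMA-warning case:
/// disable() called with enable_count == 0 (detected, not wrapped).
fn try_disable(enable_count: &AtomicI32) -> Result<bool, &'static str> {
    match enable_count.fetch_update(Ordering::AcqRel, Ordering::Acquire, |c| {
        if c > 0 { Some(c - 1) } else { None }
    }) {
        Ok(prev) => Ok(prev == 1),
        Err(_) => Err("disable() called with enable_count == 0"),
    }
}

fn main() {
    let count = AtomicI32::new(2);
    assert_eq!(try_disable(&count), Ok(false)); // one consumer still active
    assert_eq!(try_disable(&count), Ok(true));  // last consumer: disable hw
    assert!(try_disable(&count).is_err());      // underflow caught at runtime
}
```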

/// A consumer's handle to one regulator. Multiple devices may hold handles
/// to the same physical rail; the rail stays enabled while any is enabled.
///
/// On drop: auto-disables if this handle had enabled the regulator.
/// Rail leaks from forgotten regulator_disable() calls are impossible.
pub struct RegulatorConsumer {
    regulator:    Arc<RegulatorInstance>,
    enabled:      bool,
    /// This consumer's minimum voltage vote in µV. 0 = no preference.
    requested_uv: u32,
}

13.27.3 API

/// Register a regulator from a PMIC driver. Called at PMIC probe time.
///
/// All regulators must be registered before any consumer drivers probe.
/// The boot sequence ensures this (PMIC drivers probe at phase 5,
/// consumer drivers at phase 6).
pub fn regulator_register(
    desc: &'static RegulatorDesc,
) -> Result<Arc<RegulatorInstance>, KernelError>;

/// Get a consumer handle to the regulator with the given supply name.
///
/// `supply` matches the `name` field of a RegulatorDesc.
/// The handle is device-scoped: a device that probes while its regulator
/// is not yet registered receives `Err(KernelError::EprobeDefer)`. This
/// causes the driver framework to re-queue the probe attempt for later,
/// after the regulator's PMIC driver has had a chance to register. This
/// matches Linux's `-EPROBE_DEFER` behavior for regulator dependency
/// ordering and avoids permanent probe failures due to non-deterministic
/// driver init ordering.
pub fn regulator_get(
    dev:    &DeviceNode,
    supply: &str,
) -> Result<RegulatorConsumer, KernelError>;

impl RegulatorConsumer {
    /// Enable this regulator. Increments enable_count; hardware is enabled
    /// on the first enable() call. Idempotent for this handle.
    pub fn enable(&mut self) -> Result<(), KernelError>;

    /// Disable this regulator. Decrements enable_count; hardware is disabled
    /// when count reaches 0. Idempotent for this handle.
    pub fn disable(&mut self) -> Result<(), KernelError>;

    /// Get the currently-set hardware voltage in microvolts.
    pub fn get_voltage_uv(&self) -> Result<u32, KernelError>;

    /// Request a voltage for this consumer. The regulator core applies the
    /// voltage voting algorithm across all active consumers (see §13.27.4).
    ///
    /// Returns the actually-set voltage, which may be higher than `min_uv`
    /// if another consumer has a higher minimum vote.
    pub fn set_voltage_uv(
        &mut self,
        min_uv: u32,
        max_uv: u32,
    ) -> Result<u32, KernelError>;

    /// True if this consumer has called enable() and not yet called disable().
    pub fn is_enabled(&self) -> bool;
}

impl Drop for RegulatorConsumer {
    fn drop(&mut self) {
        // Auto-disable on drop: prevents rails staying on after driver unload.
        if self.enabled {
            let _ = self.disable();
        }
    }
}

13.27.4 Voltage Voting Algorithm

When any consumer calls set_voltage_uv(min_uv, max_uv):

  1. Record this consumer's vote: (min_uv, max_uv).

  2. Compute the satisfiable range across all consumers:
     vote_min = max(all consumer min_uv)   // Highest of all minimums
     vote_max = min(all consumer max_uv)   // Lowest of all maximums

  3. If vote_min > vote_max:
     // Votes are conflicting; no single voltage satisfies all consumers.
     Return Err(KernelError::InvalidArgument).
     The regulator voltage is NOT changed.
     The caller should widen its acceptable range or coordinate with other consumers.

  4. Else: acquire hw_lock. Re-read voted_uv and recompute the satisfiable
     range if it changed since state_lock was released (TOCTOU re-check).
     Call ops.set_voltage_uv(data, vote_min, vote_max).
     The hardware sets the voltage to the nearest achievable value within [vote_min, vote_max].
     Update voted_uv with the returned value. Release hw_lock.
     Return the actually-set voltage.

Example:
  Consumer A: [1800000, 1900000] µV
  Consumer B: [1700000, 2000000] µV
  vote_min = max(1800000, 1700000) = 1800000
  vote_max = min(1900000, 2000000) = 1900000
  → Set to 1800000 µV (minimum of satisfiable range; saves power vs 1900000).
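Steps 2-3 reduce to a fold over the recorded votes. A minimal sketch (the function name is illustrative), reproducing the worked example above:

```rust
/// Compute the satisfiable [vote_min, vote_max] range across all consumer
/// votes, or None if the votes conflict (step 3 of the algorithm).
fn satisfiable_range(votes: &[(u32, u32)]) -> Option<(u32, u32)> {
    let vote_min = votes.iter().map(|&(min_uv, _)| min_uv).max()?;
    let vote_max = votes.iter().map(|&(_, max_uv)| max_uv).min()?;
    if vote_min <= vote_max { Some((vote_min, vote_max)) } else { None }
}

fn main() {
    // The worked example: A = [1.8 V, 1.9 V], B = [1.7 V, 2.0 V].
    let votes = [(1_800_000, 1_900_000), (1_700_000, 2_000_000)];
    assert_eq!(satisfiable_range(&votes), Some((1_800_000, 1_900_000)));
    // Conflicting votes: no single voltage satisfies both consumers.
    let conflict = [(1_800_000, 1_900_000), (1_000_000, 1_200_000)];
    assert_eq!(satisfiable_range(&conflict), None);
}
```

The hardware is then programmed to vote_min, the lowest voltage in the satisfiable range.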

13.27.5 Linux External ABI

/sys/class/regulator/regulator.N/
  name           : regulator name (from RegulatorDesc.name)
  status         : "enabled" | "disabled"
  microvolts     : current voltage in µV (read-only)
  min_microvolts : hardware minimum in µV (read-only)
  max_microvolts : hardware maximum in µV (read-only)
  opmode         : "normal" | "idle" | "standby" (read-only)
  num_users      : current enable_count (read-only)
  type           : "voltage" | "current" (always "voltage" unless current-only rail)

These files are read by userspace power tools (e.g., regulator-info from the Linux kernel tools directory) and by systemd for boot-time power rail validation.


13.27.6 Multi-Architecture Notes

x86-64 servers: Typically no PMIC accessible from the OS. Server voltage regulators (CPU VR, DRAM VR) are controlled via SVID (Serial VID) directly by the CPU or BMC; the kernel does not participate. The regulator framework is relevant for x86 only on:
- Single-board computers with PMICs (UP Board/AXP288, LattePanda/AXP288)
- Intel Elkhart Lake/Jasper Lake embedded platforms (GPIO-controlled supplies)

AArch64/ARMv7: Rich PMIC ecosystem:
- Qualcomm: PMIC9xxx via SPMI (System Power Management Interface)
- Rockchip: RK808, RK809, RK817, RK818 via I2C
- NXP: PF8100, PF9100 via I2C
- Texas Instruments: TPS65xxx family via I2C
- Mediatek: MT6358, MT6360 via I2C

Typical platform: 10–40 individual regulators for core voltages, I/O rails, camera supplies, display supplies, and peripheral power domains.

RISC-V, PPC32, PPC64LE: PMIC access via I2C is common on embedded boards.

Initialization sequence: PMIC drivers must probe during boot phase 5 (bus initialization). The probe_ordering = "early" device tree property (or ACPI _DEP dependency) ensures PMIC initialization before consumer drivers. All regulators must be registered before any consumer driver's probe() is called (boot phase 6).


13.28 RTC Subsystem

The Real-Time Clock (RTC) is a hardware counter powered by a dedicated battery or supercapacitor that continues running during system power-off. The kernel RTC subsystem provides:
- A standardized driver interface (RtcDevice trait) for any RTC hardware
- Boot-time wall clock initialization from the RTC
- A Linux-compatible character device (/dev/rtc0) with the full ioctl API
- Wake-on-alarm support (system wakeup from S3/S5 at a scheduled time)
- Multiple RTC priority: the highest-priority battery-backed RTC is the "system RTC"

Cross-references: clock framework (Section 2.24), ACPI S3/S5 wake events (Section 2.4), timekeeping (Section 7.8).


13.28.1 RtcDevice Trait

/// Implemented by each RTC hardware driver.
///
/// All methods may block (I2C/SPI reads are involved); they run in process context.
/// None of these methods are called from IRQ handlers.
pub trait RtcDevice: Send + Sync {
    /// Read the current time from the RTC hardware.
    ///
    /// Returns seconds since Unix epoch as u64 (Y2K38-safe: u64 does not
    /// overflow until year 584,542,046,090 AD).
    fn read_time(&self) -> Result<u64, KernelError>;

    /// Set the RTC hardware to the given Unix timestamp.
    fn set_time(&self, unix_secs: u64) -> Result<(), KernelError>;

    /// Read the pending alarm time. Returns None if no alarm is set.
    fn read_alarm(&self) -> Result<Option<u64>, KernelError>;

    /// Set an alarm to fire at the given Unix timestamp.
    ///
    /// When the alarm fires, the RTC asserts an interrupt (if supported) and,
    /// if the system is in S3/S5, asserts the WAKEUP signal to resume.
    /// The alarm fires only once; it does not repeat automatically.
    fn set_alarm(&self, unix_secs: u64) -> Result<(), KernelError>;

    /// Clear any pending alarm. After this call, no alarm is set.
    fn clear_alarm(&self) -> Result<(), KernelError>;

    /// True if this RTC supports hardware alarm interrupts and wake-from-sleep.
    fn has_alarm(&self) -> bool;

    /// True if this RTC is battery-backed and survives power cycles.
    ///
    /// A non-persistent RTC (e.g., on-chip RTC without external battery) loses
    /// its time value when the system loses power; it is not suitable for use
    /// as the system RTC for boot-time wall clock initialization.
    fn is_persistent(&self) -> bool;

    /// Short descriptive name (appears in /sys/class/rtc/rtcN/name).
    /// Examples: "pcf8563", "rv3028", "hctosys-ds3231"
    fn name(&self) -> &str;
}

13.28.2 RTC Registry

/// Global registry of all registered RTC devices.
/// The RTC alarm interrupt handler receives the device pointer directly
/// from the driver's stored reference and does not access RtcRegistry.
/// All registry access is from process context (probe, hctosys, ioctl).
pub struct RtcRegistry {
    /// All registered RTCs, indexed by device number (0, 1, ...).
    devices:    RwLock<ArrayVec<(u8, Arc<dyn RtcDevice>), RTC_MAX_DEVICES>>,
    /// The system RTC: the first battery-backed RTC to register.
    /// Used for rtc_hctosys() at boot and rtc_systohc() at shutdown.
    system_rtc: OnceCell<Arc<dyn RtcDevice>>,
}

/// Maximum number of concurrently registered RTC devices.
pub const RTC_MAX_DEVICES: usize = 8;

pub static RTC_REGISTRY: OnceCell<RtcRegistry> = OnceCell::new();

/// Register a new RTC device. Returns the assigned device index (0, 1, ...).
///
/// The first persistent (battery-backed) RTC to register becomes the system RTC.
/// Non-persistent RTCs are registered but do not become the system RTC unless
/// no persistent RTC is available.
///
/// Returns Err(AlreadyExists) if RTC_MAX_DEVICES are already registered.
pub fn rtc_register(dev: Arc<dyn RtcDevice>) -> Result<u8, KernelError>;

/// Set the kernel's CLOCK_REALTIME from the system RTC.
///
/// Called during boot phase 6 after all RTC drivers have probed.
/// If the RTC time appears invalid (year before 2000 or after 2100 for a
/// new RTC without battery), logs KERN_WARNING and leaves CLOCK_REALTIME
/// at the default boot time (Jan 1, 2024).
pub fn rtc_hctosys() -> Result<(), KernelError>;

/// Write the kernel's current CLOCK_REALTIME to the system RTC.
///
/// Called at clean shutdown or when hwclock -w is executed.
pub fn rtc_systohc() -> Result<(), KernelError>;

13.28.3 Linux External ABI — /dev/rtcN

The /dev/rtcN character device provides the full Linux-compatible RTC ioctl interface. All ioctl values are binary-identical to Linux (from linux/rtc.h).

/// Structure for RTC time ioctls. Binary-identical to Linux struct rtc_time.
/// Must be repr(C) with exactly these field types and order.
#[repr(C)]
pub struct RtcTime {
    pub tm_sec:   i32,   // Seconds: 0-60 (60 for leap second)
    pub tm_min:   i32,   // Minutes: 0-59
    pub tm_hour:  i32,   // Hours: 0-23
    pub tm_mday:  i32,   // Day of month: 1-31
    pub tm_mon:   i32,   // Month: 0-11 (January = 0)
    pub tm_year:  i32,   // Years since 1900 (e.g., 2024 = 124)
    pub tm_wday:  i32,   // Day of week: 0-6 (Sunday = 0)
    pub tm_yday:  i32,   // Day of year: 0-365
    pub tm_isdst: i32,   // Daylight saving: >0 yes, 0 no, <0 unknown
}

/// Structure for wake alarm ioctls. Binary-identical to Linux struct rtc_wkalrm.
#[repr(C)]
pub struct RtcWkalrm {
    pub enabled: u8,     // 1 if alarm is enabled
    pub pending: u8,     // 1 if alarm is pending (has fired but not acknowledged)
    _pad:        [u8; 2],
    pub time:    RtcTime,
}
const_assert!(size_of::<RtcTime>() == 36);
const_assert!(size_of::<RtcWkalrm>() == 40);
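The conversion between an RtcTime-style broken-down UTC time and the u64 Unix seconds used by RtcDevice can be sketched with the standard civil-calendar formula. This is a hypothetical helper; the spec does not mandate a particular algorithm:

```rust
/// Unix seconds for a broken-down UTC time, using the days-from-civil
/// algorithm (proleptic Gregorian calendar). Field conventions match
/// struct rtc_time: tm_year is years since 1900, tm_mon is 0-based.
fn rtc_time_to_unix(tm_year: i32, tm_mon: i32, tm_mday: i32,
                    tm_hour: i32, tm_min: i32, tm_sec: i32) -> i64 {
    let (y, m, d) = (1900 + tm_year as i64, (tm_mon + 1) as i64, tm_mday as i64);
    let y = if m <= 2 { y - 1 } else { y };       // year starts in March
    let era = y.div_euclid(400);
    let yoe = y - era * 400;                      // year of era [0, 399]
    let mp = (m + 9) % 12;                        // shifted month, Mar = 0
    let doy = (153 * mp + 2) / 5 + d - 1;         // day of year [0, 365]
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    let days = era * 146_097 + doe - 719_468;     // days since 1970-01-01
    days * 86_400 + (tm_hour as i64) * 3_600 + (tm_min as i64) * 60 + tm_sec as i64
}

fn main() {
    // 2024-01-01 00:00:00 UTC (tm_year = 124, tm_mon = 0, tm_mday = 1).
    assert_eq!(rtc_time_to_unix(124, 0, 1, 0, 0, 0), 1_704_067_200);
    // The Unix epoch itself.
    assert_eq!(rtc_time_to_unix(70, 0, 1, 0, 0, 0), 0);
}
```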

Supported ioctls (values from linux/rtc.h, binary ABI is stable). RTC ioctl values are architecture-independent because RtcTime and RtcWkalrm contain only fixed-width integer fields (no pointers or long types), making the _IOR/_IOW size-encoding identical across all 8 supported architectures:

ioctl constant    Value        Direction  Description
RTC_RD_TIME       0x80247009   read       Read current time → RtcTime
RTC_SET_TIME      0x4024700a   write      Set RTC time ← RtcTime (requires CAP_SYS_TIME)
RTC_ALM_READ      0x80247008   read       Read alarm time → RtcTime
RTC_ALM_SET       0x40247007   write      Set alarm time ← RtcTime
RTC_AIE_ON        0x00007001   none       Enable alarm interrupt
RTC_AIE_OFF       0x00007002   none       Disable alarm interrupt
RTC_UIE_ON        0x00007003   none       Enable 1-Hz update interrupt
RTC_UIE_OFF       0x00007004   none       Disable 1-Hz update interrupt
RTC_IRQP_READ     0x8004700b   read       Read periodic interrupt rate → u32 (Hz)
RTC_IRQP_SET      0x4004700c   write      Set periodic interrupt rate ← u32 (2–8192 Hz)
RTC_EPOCH_READ    0x8004700d   read       Read epoch → u32 (always returns 1970)
RTC_EPOCH_SET     0x4004700e   write      Set epoch (accepted, no-op; UmkaOS uses Unix epoch)
RTC_WKALM_RD      0x80287010   read       Read wake alarm → RtcWkalrm
RTC_WKALM_SET     0x4028700f   write      Set wake alarm ← RtcWkalrm

13.28.4 sysfs Interface

/sys/class/rtc/rtcN/
  name              : driver name (e.g., "ds3231", "hctosys-builtin")
  date              : current date in YYYY-MM-DD format (read-only)
  time              : current time in HH:MM:SS format (read-only)
  since_epoch       : current time as Unix seconds u64 (read-only)
  max_user_freq     : maximum periodic interrupt rate in Hz (default: 64)
  hctosys           : "1" if this is the system RTC; "0" otherwise
  wakealarm         : Unix timestamp of next scheduled wake alarm (read-write)
                      Write 0 to clear the alarm.
                      Read 0 = no alarm set.
  dev               : major:minor device numbers (e.g., "250:0")

13.28.5 Boot Sequence and Y2K38

Boot sequence:

Phase 5: RTC hardware drivers probe.
  RTC drivers call rtc_register() as part of their probe function.
  The first is_persistent() == true RTC becomes the system RTC.

Phase 6: rtc_hctosys() is called.
  Reads time from system RTC.
  Validity check: if year < 2000 or year > 2100:
    Log KERN_WARNING "rtc: time appears invalid (year %d), ignoring"
    Leave CLOCK_REALTIME unchanged (default: Jan 1, 2024 00:00:00 UTC).
  Else: set CLOCK_REALTIME from RTC time.

Phase 7+: NTP daemon (systemd-timesyncd, chrony) syncs from NTP server.
  After successful NTP sync: daemon calls rtc_systohc() (or ioctl RTC_SET_TIME)
  to store the accurate NTP time in the RTC hardware.

Y2K38 handling:
  Internal RTC timestamps use u64 (Unix seconds). Overflow at year 584 billion.

  Hardware RTCs with 2-digit BCD year storage require epoch interpretation:
  - Years 00-69 → 2000-2069
  - Years 70-99 → 1970-1999
  This follows the standard Y2K convention. The RTC driver is responsible for
  this conversion; the rtc_hctosys() / rtc_systohc() interface uses u64 Unix
  seconds throughout.

  RTCs with 4-digit year registers (e.g., RV-3032-C7, DS3231M) store
  years 1970-2099 directly and do not require epoch interpretation.
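The 2-digit-year windowing above can be sketched directly, including the BCD decode an RTC driver performs on the raw year register (the helper name is hypothetical):

```rust
/// Interpret a hardware RTC's 2-digit BCD year register as a full year,
/// using the standard Y2K windowing convention:
/// 00-69 => 2000-2069, 70-99 => 1970-1999.
fn bcd_year_to_full(bcd: u8) -> u32 {
    let two_digit = ((bcd >> 4) * 10 + (bcd & 0x0F)) as u32; // BCD decode
    if two_digit < 70 { 2000 + two_digit } else { 1900 + two_digit }
}

fn main() {
    assert_eq!(bcd_year_to_full(0x24), 2024); // BCD 0x24 -> 24 -> 2024
    assert_eq!(bcd_year_to_full(0x99), 1999); // BCD 0x99 -> 99 -> 1999
    assert_eq!(bcd_year_to_full(0x69), 2069); // top of the 20xx window
    assert_eq!(bcd_year_to_full(0x70), 1970); // bottom of the 19xx window
}
```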

  RTC drift: Hardware RTCs drift ±2–10 ppm (≈1–5 min/year). Drift compensation
  is handled entirely by userspace (hwclock --adjust stores drift in /etc/adjtime).
  The kernel does not perform drift correction.

Multiple RTC priority:
  rtc0 = system RTC (highest-priority, battery-backed, probed first)
  rtc1, rtc2, ... = secondary RTCs (read-only access for hwclock -D)
  hwclock(8) uses /dev/rtc0 by default; --rtc=/dev/rtc1 selects others.

13.29 USB Device Forwarding Service Provider

Provider model: USB device forwarding is a host-proxy service — the host's USB stack manages the physical USB device and forwards URBs to/from remote peers. A USB host controller (xHCI) with Tier M firmware could also act as a device-native provider, forwarding URBs directly without host CPU involvement. In either case, the wire protocol (UrbWire/UrbCompletion) is identical. Sharing model: exclusive (one peer binds per USB device at a time).

A node with locally attached USB devices can export them as cluster services via the peer protocol (see Section 5.7). Any peer in the cluster can discover and bind to the exported USB device, using it as if it were locally attached. Forwarding operates at the URB (USB Request Block) level — the universal unit of USB I/O — so it works for all USB device classes (mass storage, HID, audio, video, serial, printer, etc.) without per-class forwarding code. This is an instantiation of the capability service provider model defined in Section 5.7.

13.29.1 Service Identity and Discovery

ServiceId: ServiceId("usb_device", 1)

PeerCapFlags: USB_DEVICE (bit 10). A node advertising USB device forwarding sets this bit in its PeerCapFlags. Peers filter discovery queries by this flag to find nodes that export USB devices.

PeerServiceDescriptor.properties (32 bytes):

#[repr(C)]
pub struct UsbDeviceProperties {
    pub vendor_id: u16,
    pub product_id: u16,
    pub device_class: u8,      // USB class code (e.g., 0x08 = mass storage, 0x03 = HID)
    pub device_subclass: u8,
    pub device_protocol: u8,
    pub speed: u8,             // 0 = low, 1 = full, 2 = high, 3 = super, 4 = super+
    pub num_interfaces: u8,
    pub _pad: [u8; 7],
    pub serial: [u8; 16],      // USB serial string (truncated, informational)
}
// Layout: 2+2+1+1+1+1+1+7+16 = 32 bytes.
const_assert!(size_of::<UsbDeviceProperties>() == 32);

13.29.2 Wire Protocol

Five opcodes define the URB-level forwarding protocol:

#[repr(u16)]
pub enum UsbServiceOpcode {
    SubmitUrb    = 0x0001,  // Client -> provider: submit USB request block
    UrbComplete  = 0x0002,  // Provider -> client: URB completion
    ResetDevice  = 0x0010,  // Client -> provider: USB device reset
    SetConfig    = 0x0011,  // Client -> provider: set USB configuration
    UnlinkUrb    = 0x0012,  // Client -> provider: cancel pending URB
}

UrbWire (64-byte fixed header, variable-length data via bulk transfer or inline):

#[repr(C, align(8))]
pub struct UrbWire {
    /// Unique ID for matching this URB to its completion.
    pub urb_id: Le64,             // 8 bytes  (offset 0)
    /// USB endpoint address. Bit 7 = direction (0=OUT, 1=IN).
    /// Bits 3:0 = endpoint number (0-15).
    pub endpoint: u8,             // 1 byte   (offset 8)
    /// Transfer type: 0=control, 1=isochronous, 2=bulk, 3=interrupt.
    pub transfer_type: u8,        // 1 byte   (offset 9)
    /// Flags: bit 0 = short_not_ok, bit 1 = zero_packet, bit 2 = iso_asap.
    pub flags: u8,                // 1 byte   (offset 10)
    /// Number of isochronous frame descriptors following the header.
    /// Zero for control, bulk, and interrupt transfers.
    pub num_iso_packets: u8,      // 1 byte   (offset 11)
    /// Total data length in bytes.
    pub transfer_length: Le32,    // 4 bytes  (offset 12)
    /// Polling interval for interrupt/isochronous endpoints (microframes).
    pub interval: Le32,           // 4 bytes  (offset 16)
    /// Isochronous start frame number (when iso_asap flag is not set).
    pub start_frame: Le32,        // 4 bytes  (offset 20)
    /// USB setup packet (8 bytes, for control transfers only; zero otherwise).
    pub setup_packet: [u8; 8],    // 8 bytes  (offset 24)
    /// Offset within the pre-registered data region where payload data
    /// starts. Zero means payload follows inline after the header (for
    /// small transfers that fit in the ring entry).
    pub data_region_offset: Le64, // 8 bytes  (offset 32)
    /// Length of payload data at the region offset. For bulk transfers
    /// this may exceed ring entry size. For inline transfers this equals
    /// transfer_length.
    pub data_length: Le32,        // 4 bytes  (offset 40)
    /// Reserved, must be zero.
    pub _reserved: [u8; 20],      // 20 bytes (offset 44)
    // Total: 44 + 20 = 64 bytes.
}
// Layout: 8+1+1+1+1+4+4+4+8+8+4+20 = 64 bytes. align(8) satisfied.
const_assert!(size_of::<UrbWire>() == 64);

UrbCompletion (64-byte fixed header, returned by provider for each completed URB):

/// Completion response for a submitted URB. Sent from provider to client
/// via `UsbServiceOpcode::UrbComplete`. For IN transfers, received data
/// is at the data region offset specified in the original UrbWire (provider
/// pushes data to the client's region before sending this completion). For
/// OUT transfers, this confirms transmission. For isochronous transfers,
/// `num_iso_packets` IsoFrameDesc entries follow this header.
#[repr(C, align(8))]
pub struct UrbCompletion {
    /// Matches the urb_id from the original UrbWire submission.
    pub urb_id: Le64,             // 8 bytes  (offset 0)
    /// USB status code: 0 = success, negative = error (Le32 on wire).
    /// Maps to Linux URB status: -ENOENT (cancelled), -ECONNRESET (unlinked),
    /// -ESHUTDOWN (device removed), -EPROTO (bitstuff/protocol error),
    /// -EILSEQ (CRC mismatch), -ETIME (no response), -EPIPE (stall).
    pub status: Le32,             // 4 bytes  (offset 8)
    /// Actual number of bytes transferred (may be less than transfer_length
    /// for short packets). For isochronous transfers, this is the sum of
    /// all per-frame actual_length values.
    pub actual_length: Le32,      // 4 bytes  (offset 12)
    /// Number of isochronous frames with errors. Zero for non-isoc transfers.
    pub error_count: Le32,        // 4 bytes  (offset 16)
    /// Number of isochronous frame descriptors following this header.
    /// Zero for control, bulk, and interrupt transfers.
    pub num_iso_packets: Le32,    // 4 bytes  (offset 20)
    /// Start frame number (echoed from submission, or actual frame for iso_asap).
    pub start_frame: Le32,        // 4 bytes  (offset 24)
    /// Length of IN data written at the data region offset.
    pub data_length: Le32,        // 4 bytes  (offset 28)
    /// Offset within the client's data region where IN data was written.
    /// Zero for OUT transfers or if data was inline.
    pub data_region_offset: Le64, // 8 bytes  (offset 32)
    /// Reserved, must be zero.
    pub _reserved: [u8; 24],      // 24 bytes (offset 40)
    // Total: 40 + 24 = 64 bytes.
}
// Layout: 8+4+4+4+4+4+4+8+24 = 64 bytes. align(8) satisfied.
const_assert!(size_of::<UrbCompletion>() == 64);

IsoFrameDesc (16-byte per-frame descriptor for isochronous transfers):

/// Per-frame descriptor for isochronous URBs. An array of num_iso_packets
/// IsoFrameDesc entries follows the UrbWire (submission) or UrbCompletion
/// (completion) header. Each frame carries one microframe of audio/video data.
#[repr(C)]
pub struct IsoFrameDesc {
    /// Byte offset of this frame's data within the transfer buffer.
    pub offset: Le32,             // 4 bytes  (offset 0)
    /// Expected (submit) or actual (complete) data length for this frame.
    pub length: Le32,             // 4 bytes  (offset 4)
    /// Actual bytes transferred (completion only; zero on submit).
    pub actual_length: Le32,      // 4 bytes  (offset 8)
    /// Per-frame status: 0 = success, negative errno on error (Le32 on wire).
    /// Common: -EPROTO, -EILSEQ, -EXDEV (partial frame), -EOVERFLOW.
    pub status: Le32,             // 4 bytes  (offset 12)
    // Total: 16 bytes per frame.
}
// Layout: 4+4+4+4 = 16 bytes.
const_assert!(size_of::<IsoFrameDesc>() == 16);

Isochronous data layout in wire messages:

For SubmitUrb with transfer_type = isochronous:

[UrbWire header (64 bytes)]
[IsoFrameDesc × num_iso_packets (16 bytes each)]

Total control message size: 64 + (num_iso_packets × 16) bytes. Maximum num_iso_packets: 255 (limited by the u8 UrbWire.num_iso_packets field). Maximum message: 64 + (255 × 16) = 4144 bytes. This exceeds the single ring entry inline threshold (~256 bytes), so isochronous control messages are written into the provider's registered data region via push_page(), followed by a lightweight doorbell notification via ring pair send (4-byte sequence number).

For UrbCompletion with isochronous:

[UrbCompletion header (64 bytes)]
[IsoFrameDesc × num_iso_packets (16 bytes each, with actual_length
 and status filled in by provider)]

Actual audio/video data is always transferred via bulk push (too large for inline). The IsoFrameDesc.offset fields describe byte offsets within the data region where each frame's data begins.
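The message-size arithmetic above can be checked with a short sketch (constants mirror the layouts in this section; the inline threshold is the approximate value quoted above):

```rust
const URB_WIRE_SIZE: usize = 64;        // fixed UrbWire header
const ISO_FRAME_DESC_SIZE: usize = 16;  // per-frame descriptor
const INLINE_THRESHOLD: usize = 256;    // approximate ring entry inline capacity

/// Control message size for an isochronous SubmitUrb:
/// header plus one IsoFrameDesc per frame.
fn iso_submit_msg_size(num_iso_packets: u8) -> usize {
    URB_WIRE_SIZE + (num_iso_packets as usize) * ISO_FRAME_DESC_SIZE
}

fn main() {
    assert_eq!(iso_submit_msg_size(0), 64);      // control/bulk/interrupt
    assert_eq!(iso_submit_msg_size(8), 192);     // small iso URB: fits inline
    assert_eq!(iso_submit_msg_size(255), 4144);  // u8 maximum: needs push_page()
    assert!(iso_submit_msg_size(255) > INLINE_THRESHOLD);
}
```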

Short packet semantics:

A short packet (actual_length < transfer_length) is reported when the device sends less data than requested. This is normal for:

  • Bulk IN: short packet indicates end of data (e.g., USB mass storage CSW). actual_length reflects bytes received. status = 0 (success).
  • Control IN: short packet in data phase. status = 0.
  • Interrupt IN: actual_length may vary per poll. status = 0.

Short packets are NOT errors unless the original URB had URB_SHORT_NOT_OK flag set (encoded in UrbWire.flags bit 0). When this flag is set and a short packet occurs, status = -EREMOTEIO.
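A minimal sketch of this rule (the -EREMOTEIO errno value of 121 is assumed to match Linux):

```rust
const EREMOTEIO: i32 = 121; // assumed Linux-compatible errno value

/// Completion status for a possibly-short transfer, per the semantics above:
/// short packets succeed unless SHORT_NOT_OK (UrbWire.flags bit 0) was set.
fn short_packet_status(actual_length: u32, transfer_length: u32, flags: u8) -> i32 {
    let short_not_ok = flags & 0x01 != 0;
    if actual_length < transfer_length && short_not_ok {
        -EREMOTEIO
    } else {
        0
    }
}

fn main() {
    assert_eq!(short_packet_status(13, 512, 0x00), 0);    // bulk IN CSW: success
    assert_eq!(short_packet_status(13, 512, 0x01), -121); // SHORT_NOT_OK set
    assert_eq!(short_packet_status(512, 512, 0x01), 0);   // full-length: fine
}
```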

UrbWire.flags bit encoding:

  Bit  Name          Meaning
  0    SHORT_NOT_OK  Fail on short packet (-EREMOTEIO)
  1    ISO_ASAP      Schedule isochronous ASAP (ignore start_frame)
  2    ZERO_PACKET   Send zero-length packet after full-size transfer

Connection handshake — exchanged at ServiceBind time before URB forwarding begins:

/// Sent by client to provider immediately after ServiceBind succeeds.
/// Negotiates the connection parameters for URB forwarding.
#[repr(C)]
pub struct UsbConnectRequest {
    /// Size of client's pre-registered IN data region in bytes.
    /// Provider pushes IN transfer data into this region before sending
    /// UrbCompletion. The region is established at ServiceBind time;
    /// the transport binding handles addressing.
    pub client_in_region_size: Le32, // 4 bytes  (offset 0)
    /// Reserved (was client_in_rkey; region token is in ServiceBind params).
    pub _reserved_rkey: Le32,     // 4 bytes  (offset 4)
    /// Maximum packet size the client VHCI can handle for EP0. Typically
    /// 8 (low-speed), 64 (full/high), or 512 (super-speed).
    pub max_packet_size_ep0: Le16, // 2 bytes  (offset 8)
    /// Requested USB device speed. Must not exceed UsbDeviceProperties.speed.
    /// Provider may grant a lower speed if the device was reconfigured.
    pub requested_speed: u8,      // 1 byte   (offset 10)
    pub _pad: u8,                 // 1 byte   (offset 11)
    /// Reserved.
    pub _reserved: [u8; 20],      // 20 bytes (offset 12)
    // Total: 4 + 4 + 2 + 1 + 1 + 20 = 32 bytes. No padding holes.
}
const_assert!(size_of::<UsbConnectRequest>() == 32);

/// Sent by provider to client in response to UsbConnectRequest.
#[repr(C)]
pub struct UsbConnectResponse {
    /// Granted speed (may be less than requested if device was reconfigured).
    pub granted_speed: u8,
    pub _pad: [u8; 3],
    /// Size of provider's pre-registered OUT data region in bytes.
    /// Client pushes OUT transfer data into this region before sending
    /// SubmitUrb. The region is established at ServiceBind time.
    pub provider_out_region_size: Le32,
    /// Reserved (was provider_out_rkey; region token is in ServiceBind params).
    pub _reserved_rkey: Le32,
    /// USB device descriptor (18 bytes, per USB 2.0 spec Table 9-8).
    /// Cached from the physical device so the client can begin enumeration
    /// without an extra round trip.
    pub device_descriptor: [u8; 18],
    pub _pad2: [u8; 2],
    // Total: 1 + 3 + 4 + 4 + 18 + 2 = 32 bytes.
}
const_assert!(size_of::<UsbConnectResponse>() == 32);

Connection handshake sequence:

  1. Client completes ServiceBind and receives ServiceBindAck with transport queue pair information.
  2. Client sends UsbConnectRequest via ring pair send on the bound queue.
  3. Provider validates:
     a. requested_speed <= device's actual speed. If exceeded: grant actual speed (downgrade silently).
     b. client_in_region_size >= device's max_packet_size × 2. If not: reject with granted_speed = 0xFF (rejection sentinel) in UsbConnectResponse.
  4. Provider sends UsbConnectResponse with:
     - granted_speed: min(requested, actual), or 0xFF on rejection.
     - provider_out_region_size: size of provider's pre-registered OUT data region.
     - device_descriptor: cached 18-byte USB device descriptor.
  5. On success (granted_speed != 0xFF): client signals port connect to USB core via root hub status change. Enumeration begins (GET_DESCRIPTOR, SET_ADDRESS, SET_CONFIGURATION).
  6. On failure: client logs error, sends ServiceUnbind, returns -ENODEV to the initiating bind operation.

Timeout: client waits up to 5 seconds for UsbConnectResponse. If no response: ServiceUnbind + retry (up to 3 attempts with exponential backoff: 1s, 2s, 4s). After 3 failures: report -ETIMEDOUT and abandon.
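The retry schedule above (1 s, 2 s, 4 s, then give up) can be expressed as:

```rust
/// Exponential backoff for UsbConnectResponse retries. Attempts are
/// 0-indexed; after 3 failed attempts the bind is abandoned and the
/// caller reports -ETIMEDOUT.
fn connect_retry_delay_secs(failed_attempts: u32) -> Option<u64> {
    if failed_attempts < 3 {
        Some(1u64 << failed_attempts) // 1s, 2s, 4s
    } else {
        None // abandon: report -ETIMEDOUT
    }
}

fn main() {
    assert_eq!(connect_retry_delay_secs(0), Some(1));
    assert_eq!(connect_retry_delay_secs(1), Some(2));
    assert_eq!(connect_retry_delay_secs(2), Some(4));
    assert_eq!(connect_retry_delay_secs(3), None);
}
```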

13.29.3 Client-Side Integration (VHCI)

On the consuming peer, the USB forwarding service registers a virtual USB host controller (VHCI) with the kernel's USB subsystem. The VHCI presents the remote device as a locally attached USB device. The kernel's standard USB enumeration proceeds through the VHCI — the USB stack reads descriptors, selects a configuration, and binds matching class drivers (mass storage, HID, audio, video, serial) automatically. Applications see /dev/sda, /dev/input/event*, /dev/snd/pcmC*, etc. The forwarding is completely transparent to userspace.

/// Virtual USB Host Controller. One instance per bound remote USB device.
/// Registered with the USB subsystem as an HCD; intercepts URBs submitted
/// by class drivers and forwards them to the remote provider.
///
/// Tier assignment: Tier 1 (runs in the USB subsystem isolation domain).
pub struct VhciController {
    /// ServiceBind connection to the remote USB device provider.
    connection: ServiceBindHandle,
    /// Peer providing the USB device.
    peer_id: PeerId,
    /// Negotiated connection parameters from UsbConnectResponse.
    granted_speed: UsbSpeed,
    /// Data region sizes for bulk data transfer (negotiated at connect time).
    client_in_region_size: u32,
    provider_out_region_size: u32,
    /// In-flight URB tracking. Maps urb_id to the kernel URB that was
    /// submitted by a class driver and is awaiting completion from the
    /// provider. XArray for O(1) lookup on completion (integer-keyed,
    /// hot path per the collection policy in [Section 3.6](03-concurrency.md#lock-free-data-structures)).
    inflight: XArray<UrbInflight>,
    /// Next URB ID. Monotonically increasing u64 — no wrap concern
    /// (at 10M URBs/s, wraps after 58,000 years).
    next_urb_id: AtomicU64,
    /// USB root hub emulation state (port status, connection status).
    root_hub: VhciRootHub,
    /// Device connected flag. Set after UsbConnectResponse received,
    /// cleared on ServiceUnbind/drain/error.
    connected: AtomicBool,
}

/// Tracks a single in-flight URB from submission to completion.
struct UrbInflight {
    /// Original kernel URB pointer (returned to USB stack on completion).
    /// SAFETY: valid while the URB is in-flight; the USB stack guarantees
    /// the URB is not freed until the HCD completes or unlinks it.
    kernel_urb: *mut UsbUrb,
    /// Timestamp of submission (for timeout detection).
    submit_time: Instant,
    /// Transfer type (cached from the original URB for fast dispatch).
    transfer_type: u8,
}

/// Emulated root hub state for the VHCI. Tracks port status for the
/// single virtual port exposed to the USB core.
struct VhciRootHub {
    /// Port status register (USB_PORT_STAT_* flags).
    /// bit 0: CONNECTION, bit 1: ENABLE, bit 2: SUSPEND, etc.
    port_status: AtomicU32,
    /// Port status change flags (cleared after hub_status_data read).
    port_change: AtomicU32,
    /// Wait queue for hub status polling (khubd wakes on change).
    hub_wq: WaitQueue,
}

VhciRootHub implements the USB HCD root hub interface expected by the USB core:

  • hub_status_data(): Returns port_change bits. USB core polls this (via the hub thread) to detect connection/disconnection events. Returns 1 if any change bits are set, 0 otherwise.
  • hub_control(): Handles hub class requests from USB core:
    - GET_PORT_STATUS: returns port_status.load(Acquire).
    - SET_PORT_FEATURE(PORT_RESET): resets the virtual port. Sends ResetDevice to the remote provider via the service connection, waits for completion (up to 5 seconds), then sets PORT_ENABLE in port_status.
    - CLEAR_PORT_FEATURE(C_PORT_CONNECTION): clears the connection change bit in port_change (atomic AND).
    - SET_PORT_FEATURE(PORT_SUSPEND): sets SUSPEND in port_status. Does not forward to provider (USB suspend is local policy).
  • urb_enqueue(): The main URB submission path (hot path). Converts the kernel UsbUrb to UrbWire, assigns urb_id from next_urb_id.fetch_add(1, Relaxed), inserts into inflight XArray, sends via ring pair. No allocation — pre-allocated entries from a per-queue pool. Returns -ESHUTDOWN if connected is false.
  • urb_dequeue(): Cancel a pending URB. Sends UnlinkUrb to the provider (best-effort — provider may have already completed it), removes from inflight XArray, completes the kernel URB with -ECONNRESET.

Hot-plug emulation: When ServiceBind succeeds and the connection handshake completes, the VHCI's root hub signals a port-status change (connection detected) to the USB core. The USB core's standard hub thread (khubd equivalent) detects the new device and begins enumeration: GET_DESCRIPTOR, SET_ADDRESS, SET_CONFIGURATION. All enumeration URBs are forwarded through the VHCI to the remote provider. On ServiceUnbind or provider disconnect (ServiceDrainNotify), the VHCI signals port disconnect. The USB core removes the device, unbinds class drivers, and cleans up.

Bulk transfer data path: For bulk OUT transfers larger than the ring entry payload capacity (~224 bytes), the client pushes the data into the provider's pre-registered OUT data region (via the peer transport's push_page()), then sends the UrbWire control message with data_region_offset pointing to the written location. For bulk IN transfers, the provider pushes received data into the client's pre-registered IN data region, then sends UrbCompletion with data_region_offset/data_length pointing to the data. Small transfers (control setup, short interrupt data) use inline payload after the header for lower latency (one fewer transport operation).
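The inflight bookkeeping in urb_enqueue()/completion handling can be modeled in userspace (a sketch only: HashMap stands in for the kernel XArray, and -ESHUTDOWN is assumed to be the Linux value 108):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

const ESHUTDOWN: i32 = 108; // assumed Linux-compatible errno value

struct VhciModel {
    next_urb_id: AtomicU64,
    inflight: HashMap<u64, String>, // stand-in for XArray<UrbInflight>
    connected: bool,
}

impl VhciModel {
    /// Enqueue: reject if disconnected, else assign a monotonic urb_id
    /// and track the URB until its completion arrives.
    fn urb_enqueue(&mut self, desc: &str) -> Result<u64, i32> {
        if !self.connected {
            return Err(-ESHUTDOWN);
        }
        let id = self.next_urb_id.fetch_add(1, Ordering::Relaxed);
        self.inflight.insert(id, desc.to_string());
        // ...here the real VHCI builds UrbWire and sends via ring pair...
        Ok(id)
    }

    /// Completion: remove from the inflight table. A late completion for
    /// an already-unlinked URB is silently discarded (returns None).
    fn on_completion(&mut self, urb_id: u64) -> Option<String> {
        self.inflight.remove(&urb_id)
    }
}

fn main() {
    let mut vhci = VhciModel {
        next_urb_id: AtomicU64::new(0),
        inflight: HashMap::new(),
        connected: true,
    };
    let id = vhci.urb_enqueue("bulk IN ep1").unwrap();
    assert_eq!(vhci.on_completion(id).as_deref(), Some("bulk IN ep1"));
    assert_eq!(vhci.on_completion(id), None); // late duplicate: discarded
    vhci.connected = false;
    assert_eq!(vhci.urb_enqueue("x"), Err(-108));
}
```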

13.29.4 Exclusive Access

USB devices are single-client: one peer binds to a given device at a time. A second peer attempting to bind receives EBUSY. Multiple interfaces of the same composite USB device (e.g., a headset with audio + HID interfaces) can be used concurrently by the same client.

13.29.5 Isochronous Transfer Support

Isochronous transfers (audio, video) are supported but latency-sensitive. For USB Audio Class devices at 1 ms polling intervals, the RDMA RTT (~3-5 us) is negligible and does not affect audio quality. For USB Video Class devices at 125 us microframe intervals, the VHCI buffers 2-3 microframes to absorb RDMA latency jitter. The buffering depth is configurable via UsbDeviceProperties.speed-based heuristics.

Frame gap detection: The VHCI tracks expected vs actual frame numbers using start_frame in UrbCompletion. If start_frame in the completion differs from expected by more than the buffering depth (2-3 frames), a gap is detected:

  • Missing frames are reported with IsoFrameDesc.status = -EXDEV.
  • actual_length for missing frames = 0.
  • Three consecutive gaps (>= 9 missing frames at 125 us intervals) trigger an FMA link quality warning (Section 20.1).
  • The VHCI does NOT attempt retransmission — isochronous data is time-sensitive and retransmission would cause worse artifacts than silence/blank frames.

The expected frame counter is initialized from the first UrbCompletion received after connect and advances by num_iso_packets per completion. On device reset or reconnect, the counter is re-initialized from the next completion.

13.29.6 Performance

Bulk transfers (mass storage): throughput is limited by the USB device itself, not the network. A USB 3.0 device at 5 Gbps is far below typical cluster fabric bandwidth (~100 Gbps on RDMA). The forwarding overhead is a single ring pair send/recv per URB.

Control and interrupt transfers: each URB incurs one transport round trip (~3-5 us on RDMA, ~50-200 us on TCP). At USB polling intervals of 1-8 ms, this latency is invisible. High-frequency interrupt endpoints (e.g., 125 us for gaming mice) add at most ~3% to the polling interval.

13.29.7 Security

Only the interface owner (the peer that successfully bound the device) can submit URBs. The capability system gates access: the client must hold CAP_USB_REMOTE to bind a remote USB device. USB devices with security-sensitive functions — FIDO2 authenticators, smartcard readers, TPM tokens — require an additional device-specific capability check (CAP_USB_SECURITY_DEVICE), and the provider node's policy may restrict which peers are allowed to bind such devices.

Setup packet filtering: The provider filters USB control transfers (setup packets) to prevent dangerous operations from remote clients.

  • Blocked always: SET_FEATURE(TEST_MODE). Puts device in permanent electrical test mode. Returns -EPERM.
  • Blocked for security devices: vendor-specific requests to class codes 0xFE (application-specific) and 0xFF (vendor-specific). Requires CAP_USB_SECURITY_DEVICE; without it: -EPERM.
  • Allowed: GET_DESCRIPTOR (all types). Read-only, safe.
  • Allowed: SET_CONFIGURATION. Needed for enumeration.
  • Allowed: SET_INTERFACE. Needed for alternate settings.
  • Allowed: CLEAR_FEATURE(ENDPOINT_HALT). Needed for error recovery.
  • Allowed: all class-specific requests for non-security devices. No restriction.

Blocked requests return UrbCompletion with status = -EPERM. The filter runs on the provider side before the setup packet reaches the physical device, so blocked operations never touch hardware. The filter inspects UrbWire.setup_packet bytes: bmRequestType (byte 0), bRequest (byte 1), and wValue (bytes 2-3) per USB 2.0 spec Table 9-2.
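A sketch of the filter is below. It is simplified in one respect: it rejects all vendor-type requests lacking the capability, whereas the actual policy applies that rule only to security-sensitive devices. Constants follow the USB 2.0 spec (bRequest SET_FEATURE = 3, TEST_MODE feature selector = 2); -EPERM is assumed to be the Linux value 1.

```rust
const EPERM: i32 = 1;              // assumed Linux-compatible errno value
const REQ_SET_FEATURE: u8 = 0x03;  // USB 2.0 Table 9-4
const FEAT_TEST_MODE: u16 = 0x02;  // USB 2.0 Table 9-6

/// Provider-side setup packet filter, per the policy list above.
/// Inspects bmRequestType (byte 0), bRequest (byte 1), wValue (bytes 2-3).
fn filter_setup_packet(setup: &[u8; 8], has_security_cap: bool) -> Result<(), i32> {
    let bm_request_type = setup[0];
    let b_request = setup[1];
    let w_value = u16::from_le_bytes([setup[2], setup[3]]);
    let req_type = (bm_request_type >> 5) & 0x03; // 0=standard, 1=class, 2=vendor

    // Blocked always: standard SET_FEATURE(TEST_MODE).
    if req_type == 0 && b_request == REQ_SET_FEATURE && w_value == FEAT_TEST_MODE {
        return Err(-EPERM);
    }
    // Simplification: block all vendor-specific requests without the cap.
    if req_type == 2 && !has_security_cap {
        return Err(-EPERM);
    }
    Ok(())
}

fn main() {
    // SET_FEATURE(TEST_MODE): bmRequestType=0x00, bRequest=0x03, wValue=0x0002.
    let test_mode = [0x00, 0x03, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00];
    assert_eq!(filter_setup_packet(&test_mode, true), Err(-1));
    // GET_DESCRIPTOR (device): bmRequestType=0x80, bRequest=0x06. Allowed.
    let get_desc = [0x80, 0x06, 0x00, 0x01, 0x00, 0x00, 0x12, 0x00];
    assert_eq!(filter_setup_packet(&get_desc, false), Ok(()));
    // Vendor request without CAP_USB_SECURITY_DEVICE.
    let vendor = [0x40, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00];
    assert_eq!(filter_setup_packet(&vendor, false), Err(-1));
}
```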

13.29.8 Drain and Disconnect

On graceful provider shutdown, ServiceDrainNotify is sent to the bound client. The client's VHCI signals a device disconnection event to the USB stack, which triggers standard USB unplug handling: open file descriptors receive errors, mounted filesystems are force-unmounted, and class drivers release their interfaces. The sequence mirrors physical USB cable removal.

13.29.9 URB Timeout Handling

URBs that receive no completion within a deadline are failed locally:

  • Control/bulk transfers: 30-second timeout (configurable via sysfs at /sys/bus/usb/devices/<vhci_dev>/timeout_ms). On timeout, the VHCI sends UnlinkUrb to cancel the remote URB and completes the local URB with -ETIME. If the provider later sends a completion for the cancelled URB, it is silently discarded (urb_id no longer in inflight XArray).

  • Interrupt transfers: no timeout (periodic, re-submitted by the USB stack). If the service connection is lost, the VHCI's root hub signals disconnect, which naturally stops interrupt polling.

  • Isochronous transfers: no per-URB timeout. Missing frames are reported as -EXDEV (partial frame) in the IsoFrameDesc. Three consecutive missed frames trigger a link quality warning via FMA (Section 20.1).

The timeout scanner runs as a periodic timer (fires every 5 seconds) that walks the inflight XArray and checks submit_time against the deadline.
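One scanner pass can be modeled as follows (a sketch: a HashMap of submission timestamps in milliseconds stands in for the XArray of UrbInflight entries):

```rust
use std::collections::HashMap;

/// One timeout scanner pass: collect urb_ids whose submission timestamp is
/// older than the deadline. The caller then sends UnlinkUrb for each and
/// completes the corresponding local URB with -ETIME.
fn scan_expired(inflight: &HashMap<u64, u64>, now_ms: u64, timeout_ms: u64) -> Vec<u64> {
    let mut expired: Vec<u64> = inflight
        .iter()
        .filter(|(_, &submit_ms)| now_ms.saturating_sub(submit_ms) >= timeout_ms)
        .map(|(&id, _)| id)
        .collect();
    expired.sort_unstable(); // deterministic order for the caller
    expired
}

fn main() {
    let mut inflight = HashMap::new();
    inflight.insert(1u64, 0);      // submitted 31s ago: expired
    inflight.insert(2u64, 31_000); // just submitted: fresh
    inflight.insert(3u64, 5_000);  // 26s old: fresh
    assert_eq!(scan_expired(&inflight, 31_000, 30_000), vec![1]);
}
```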

13.29.10 Comparison with Linux USB/IP

Linux USB/IP (in-tree since Linux 2.6.28) forwards URBs over TCP between a userspace daemon (usbipd) and a kernel VHCI driver. UmkaOS USB device forwarding differs in three ways: (1) transport is the peer protocol (RDMA or PCIe ring buffers) rather than TCP, yielding lower latency and higher throughput; (2) discovery and access control are integrated with the capability service framework — no separate daemon or manual usbip attach commands; (3) lifecycle events (drain, failover) follow the standard service provider model, enabling coordinated cluster-wide USB device migration.


13.30 Auxiliary Device Subsystems

This section specifies six device subsystems that are individually small but collectively essential for hardware platform support: the auxiliary bus (multi-function device decomposition), devfreq (non-CPU frequency scaling), LED, PWM, backlight, and power_supply. Each is a self-contained mini-framework with its own trait, registry, and sysfs interface. They share common design principles:

  • RAII: Consumer handles auto-release on drop — no leaked enables, no orphaned PWM channels.
  • Trait-based drivers: Each hardware driver implements a Rust trait (type-safe, no void *).
  • Linux-compatible sysfs: All sysfs paths and semantics match Linux exactly for userspace tool compatibility (e.g., tlp, brightnessctl, upower, systemd-backlight).
  • Tier 1 by default: These subsystems are small, hardware-coupled, and latency-sensitive. Drivers run in Tier 1 (Section 11.3) unless the device is untrusted.

Cross-references: device model (Section 11.4), regulator framework (Section 13.27), system power management (Section 7.4), sysfs exposure (Section 20.5).


13.30.1 Auxiliary Bus Framework

The auxiliary bus allows a single PCI or platform device to expose multiple sub-functions as independent devices, each with its own driver. This is used by multi-function NICs (Intel ice exposes RDMA, switchdev, and PTP as separate auxiliary devices), sound cards, and composite sensor hubs. Without the auxiliary bus, these sub-functions must be managed by a single monolithic driver, coupling unrelated subsystems and preventing independent fault recovery.

UmkaOS vs Linux auxiliary bus: - Linux: struct auxiliary_device with void * private data, module refcount management. - UmkaOS: AuxiliaryDevice is a proper Rust struct with Arc<dyn Device> parent reference; matching is by KString name (zero-copy, no strcmp chains); driver binding is tracked by the device registry (Section 11.4).

13.30.1.1 Core Types

/// An auxiliary device: a sub-function of a parent device, exposed as an
/// independent device on the auxiliary bus. The parent driver creates these
/// during its own probe to decompose a multi-function device.
///
/// Naming convention: "<parent_modname>.<func_name>.<id>"
/// Example: "ice.rdma.0", "ice.switchdev.0", "snd_hda_intel.dsp.0"
pub struct AuxiliaryDevice {
    /// Parent device that created this auxiliary device.
    pub parent: Arc<dyn Device>,
    /// Device name: "<parent_modname>.<func_name>.<id>"
    pub name: KString,
    /// Unique ID within the parent's set of auxiliary devices.
    /// Practical limit: parents typically create 2-8 auxiliary devices
    /// (e.g., ice.rdma.0, ice.switchdev.0). u32 is per-parent, not global.
    pub id: u32,
    /// Device node in the device registry.
    pub dev: DeviceNode,
}

/// Identifier for matching auxiliary drivers to auxiliary devices.
/// Matching is by the "<modname>.<func_name>" prefix of the device name.
/// Kernel-internal match structure. Does not cross KABI boundaries.
pub struct AuxiliaryDeviceId {
    /// Match string: "<modname>.<func_name>" (without the trailing ".<id>").
    pub name: &'static str,
}

/// Trait implemented by drivers that bind to auxiliary devices.
pub trait AuxiliaryDriver: Send + Sync {
    /// Device IDs this driver supports. The auxiliary bus matches by
    /// comparing each entry's `name` against the device's name prefix.
    fn id_table(&self) -> &[AuxiliaryDeviceId];

    /// Probe: initialize the auxiliary device. Called when the bus matches
    /// this driver to a newly-registered auxiliary device.
    ///
    /// The parent device is guaranteed to be bound and operational when
    /// probe is called. If the parent is removed, the auxiliary device is
    /// removed first (child-before-parent ordering).
    fn probe(&self, adev: &AuxiliaryDevice) -> Result<(), KernelError>;

    /// Remove: tear down the auxiliary device. Called when the device is
    /// unregistered (parent removal or explicit auxiliary_device_delete).
    fn remove(&self, adev: &AuxiliaryDevice);
}
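The matching rule ("<modname>.<func_name>" against "<modname>.<func_name>.<id>") can be sketched as a standalone function (hypothetical name; the bus core would apply this per id_table() entry):

```rust
/// Match an AuxiliaryDeviceId name against a device name by stripping the
/// trailing ".<id>" component, per the naming convention above.
fn auxiliary_match(id_name: &str, dev_name: &str) -> bool {
    match dev_name.rsplit_once('.') {
        Some((prefix, id)) => prefix == id_name && id.parse::<u32>().is_ok(),
        None => false,
    }
}

fn main() {
    assert!(auxiliary_match("ice.rdma", "ice.rdma.0"));
    assert!(auxiliary_match("snd_hda_intel.dsp", "snd_hda_intel.dsp.3"));
    assert!(!auxiliary_match("ice.rdma", "ice.switchdev.0"));
    assert!(!auxiliary_match("ice.rdma", "ice.rdma")); // no ".<id>" suffix
}
```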

13.30.1.2 API

/// Register an auxiliary device. Called by the parent driver during its probe.
///
/// The device is added to the auxiliary bus and driver matching is triggered.
/// If a matching AuxiliaryDriver is already registered, its probe() is called
/// synchronously before this function returns.
///
/// The parent driver must ensure the auxiliary device's resources remain valid
/// until auxiliary_device_delete() is called or the parent is removed.
pub fn auxiliary_device_add(
    adev: &mut AuxiliaryDevice,
) -> Result<(), KernelError>;

/// Unregister an auxiliary device. If a driver is bound, its remove() is
/// called before the device is removed from the bus.
pub fn auxiliary_device_delete(adev: &mut AuxiliaryDevice);

/// Register an auxiliary driver. Triggers matching against all existing
/// unbound auxiliary devices on the bus.
pub fn auxiliary_driver_register(
    drv: Arc<dyn AuxiliaryDriver>,
) -> Result<(), KernelError>;

13.30.1.3 sysfs

/sys/bus/auxiliary/
  devices/
    ice.rdma.0 -> ../../../devices/.../ice.rdma.0
    ice.switchdev.0 -> ../../../devices/.../ice.switchdev.0
  drivers/
    irdma/               # RDMA driver bound to ice.rdma.*
    ice_switchdev/       # switchdev driver bound to ice.switchdev.*

13.30.2 devfreq — Device Frequency Scaling

devfreq provides dynamic frequency and voltage scaling for non-CPU devices: GPUs, memory controllers, interconnect buses, and DSPs. It is the device-class analogue of cpufreq. Each managed device reports utilization statistics; a governor uses these to select an appropriate operating frequency, balancing performance against power consumption.

Cross-references: regulator framework (Section 13.27) for coupled voltage scaling, system power management (Section 7.4) for system-wide power budgeting.

UmkaOS vs Linux devfreq: - Linux: struct devfreq with void *data, governor ops as function pointers. - UmkaOS: DevfreqDeviceOps and DevfreqGovernor are Rust traits; QoS constraints are enforced atomically; polling uses the workqueue framework (Section 3.11) instead of a private kthread.

13.30.2.1 Core Types

/// A device managed by devfreq: frequency is adjusted at runtime based
/// on utilization and QoS constraints.
pub struct DevfreqDevice {
    /// The device being frequency-managed.
    pub dev: Arc<dyn Device>,
    /// Available frequency table (Hz), sorted ascending. Hardware-discovered
    /// at probe time. Maximum 32 discrete OPPs (operating performance points).
    pub freq_table: ArrayVec<u64, 32>,
    /// Current target frequency (Hz).
    pub cur_freq: AtomicU64,
    /// Active governor. Arc<dyn> chosen because governors may be dynamically
    /// loaded. For static-only governors, &'static dyn would avoid refcount
    /// overhead. The governor is read every polling_ms (warm path); the Arc
    /// clone (~50ns) on a 50ms interval is <0.0001% overhead.
    pub governor: RcuCell<Arc<dyn DevfreqGovernor>>,
    /// Device-specific operations.
    pub ops: Arc<dyn DevfreqDeviceOps>,
    /// Polling interval in milliseconds. 0 = event-driven (no polling).
    pub polling_ms: u32,
    /// Per-device frequency transition statistics.
    pub stats: SpinLock<DevfreqStats>,
    /// QoS constraints (min/max frequency bounds).
    pub qos: DevfreqQos,
}

/// Device-specific operations implemented by the hardware driver.
pub trait DevfreqDeviceOps: Send + Sync {
    /// Read the current device frequency from hardware.
    fn get_cur_freq(&self) -> Result<u64, KernelError>;

    /// Set the target device frequency. The hardware selects the nearest
    /// available OPP >= the requested frequency. Returns the actually-set
    /// frequency.
    fn set_target_freq(&self, freq: u64) -> Result<u64, KernelError>;

    /// Read device utilization since the last call. The governor uses
    /// these statistics to compute the next target frequency.
    fn get_dev_status(&self) -> Result<DevfreqStatus, KernelError>;
}

/// Utilization statistics for one measurement window.
pub struct DevfreqStatus {
    /// Total elapsed time in the measurement window (ns).
    pub total_time: u64,
    /// Time the device was actively processing (ns).
    pub busy_time: u64,
    /// Frequency during this measurement window (Hz).
    pub current_frequency: u64,
}

/// QoS constraints: bounds on allowable frequency range.
pub struct DevfreqQos {
    /// Minimum allowed frequency (Hz). Set by PM QoS or sysfs min_freq.
    pub min_freq: AtomicU64,
    /// Maximum allowed frequency (Hz). Set by PM QoS or sysfs max_freq.
    pub max_freq: AtomicU64,
}

/// Frequency transition statistics, updated on every frequency change.
pub struct DevfreqStats {
    /// Total number of frequency transitions.
    pub total_transitions: u64,
    /// Time spent at each frequency index (ns), indexed by freq_table position.
    pub time_in_state: ArrayVec<u64, 32>,
    /// Timestamp of last frequency change (ns, monotonic).
    pub last_update: u64,
}
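The time_in_state bookkeeping can be sketched as a pure helper: on each frequency change, the interval since last_update is credited to the slot of the frequency being left. The function name record_transition is illustrative, not part of the framework API.

```rust
/// Credit the elapsed interval to the outgoing frequency's time_in_state
/// slot and return the new last_update timestamp. Hypothetical helper
/// modeling the DevfreqStats accounting described above.
fn record_transition(
    time_in_state: &mut [u64],
    old_freq_idx: usize, // freq_table index of the frequency being left
    last_update_ns: u64, // DevfreqStats::last_update
    now_ns: u64,         // current monotonic time
) -> u64 {
    time_in_state[old_freq_idx] += now_ns.saturating_sub(last_update_ns);
    now_ns // caller stores this back into last_update
}
```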

/// Governor trait: determines when and how to change device frequency.
/// DevfreqGovernor is a kernel-internal policy trait. Governors are compiled
/// into the kernel, not loaded as KABI drivers. `&str` return is safe.
pub trait DevfreqGovernor: Send + Sync {
    /// Governor name (e.g., "simple_ondemand", "performance").
    fn name(&self) -> &str;

    /// Compute the target frequency given current utilization and QoS bounds.
    /// Called either on the polling timer or in response to a device event.
    fn get_target_freq(
        &self,
        status: &DevfreqStatus,
        qos: &DevfreqQos,
    ) -> u64;
}

13.30.2.2 Standard Governors

Governor         Algorithm
simple_ondemand  Scale up when utilization > 90% of current capacity; scale down when < 50%. Selects the minimum frequency that provides headroom above current load.
performance      Always select the maximum frequency in freq_table.
powersave        Always select the minimum frequency in freq_table.
userspace        Frequency is set directly via sysfs target_freq; governor does not poll.
passive          Follow another devfreq device's frequency (coupled scaling). Used when a memory controller's clock must track the GPU.
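The simple_ondemand rule can be sketched as a pure function over one DevfreqStatus window. The 90%/50% thresholds come from the table above; the ~70% post-scaling load target and the function name are illustrative assumptions, not the kernel implementation.

```rust
/// Sketch of simple_ondemand target selection. Assumes freq_table is
/// sorted ascending and non-empty, and min_freq <= max_freq.
fn simple_ondemand_target(
    busy_ns: u64,
    total_ns: u64,
    cur_freq: u64,
    freq_table: &[u64],
    min_freq: u64,
    max_freq: u64,
) -> u64 {
    let load = if total_ns == 0 { 0 } else { busy_ns * 100 / total_ns };
    // Absolute demand in Hz: the portion of cur_freq actually consumed.
    let demand = cur_freq / 100 * load;
    let desired = if load > 90 || load < 50 {
        // Re-target so the device would run at ~70% utilization
        // (illustrative headroom factor).
        demand * 100 / 70
    } else {
        cur_freq // inside the hysteresis band: keep the current frequency
    };
    // Clamp to QoS bounds, then snap up to the nearest available OPP.
    let clamped = desired.clamp(min_freq, max_freq);
    *freq_table
        .iter()
        .find(|&&f| f >= clamped)
        .unwrap_or(freq_table.last().expect("non-empty freq_table"))
}
```

At 95% load on a 400 MHz OPP this jumps to the next OPP with headroom; at 20% load on 800 MHz it drops down; loads in the 50-90% band leave the frequency unchanged.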

13.30.2.3 sysfs Interface

/sys/class/devfreq/<device>/
  governor               : current governor name (read-write)
  available_governors    : space-separated list of governors (read-only)
  cur_freq               : current hardware frequency in Hz (read-only)
  target_freq            : governor's target frequency in Hz (read-only)
  available_frequencies  : space-separated list of OPPs in Hz (read-only)
  min_freq               : QoS minimum frequency in Hz (read-write)
  max_freq               : QoS maximum frequency in Hz (read-write)
  polling_interval       : governor polling interval in ms (read-write)
  trans_stat             : frequency transition table (read-only)

These files are used by tlp, power-profiles-daemon, and thermald for dynamic power management on laptops and embedded systems.


13.30.3 LED Subsystem

The LED subsystem provides a unified interface for controlling indicator LEDs: keyboard LEDs (capslock, numlock), power/activity indicators, network link LEDs, and notification LEDs on mobile devices. Each LED supports brightness control and optional software or hardware-accelerated blink patterns via a trigger mechanism.

13.30.3.1 Core Types

/// An LED device registered with the LED subsystem.
///
/// LED naming convention: "<devicename>:<color>:<function>"
/// Examples: "input0::capslock", "platform::power", "phy0-0::link"
pub struct LedDevice {
    /// Device node in the device registry.
    pub dev: DeviceNode,
    /// LED name following the naming convention above.
    pub name: KString,
    /// Maximum brightness value (hardware-dependent; typically 1 or 255).
    pub max_brightness: u32,
    /// Current brightness. 0 = off, max_brightness = full on.
    pub brightness: AtomicU32,
    /// Hardware-accelerated blink support (None if software-only).
    pub blink_ops: Option<Arc<dyn LedBlinkOps>>,
    /// Active trigger. RCU-protected for lockless read from IRQ context
    /// (disk-activity trigger fires in block I/O completion path).
    pub trigger: RcuCell<Option<Arc<dyn LedTrigger>>>,
}

/// Hardware operations for LED brightness and blink control.
pub trait LedBlinkOps: Send + Sync {
    /// Set LED brightness. 0 = off, max_brightness = full on.
    /// Called from process context; may involve I2C/GPIO transactions.
    fn set_brightness(&self, brightness: u32) -> Result<(), KernelError>;

    /// Configure hardware blink. `on_ms` and `off_ms` are in/out: the driver
    /// may adjust them to the nearest hardware-supported values.
    /// Returns Ok(()) if hardware blink is configured; Err if the driver
    /// cannot blink at the requested rate (software fallback will be used).
    fn blink_set(
        &self,
        on_ms: &mut u32,
        off_ms: &mut u32,
    ) -> Result<(), KernelError>;
}

/// LED trigger: a named pattern that drives LED brightness changes.
///
/// Triggers are registered globally and can be attached to any LED.
/// When attached, the trigger's activate() is called; when detached,
/// deactivate() is called. The trigger calls led_set_brightness()
/// on the LED to change its state.
pub trait LedTrigger: Send + Sync {
    /// Trigger name (e.g., "heartbeat", "disk-activity", "netdev").
    fn name(&self) -> &str;

    /// Called when this trigger is attached to an LED.
    fn activate(&self, led: &LedDevice) -> Result<(), KernelError>;

    /// Called when this trigger is detached from an LED.
    fn deactivate(&self, led: &LedDevice);
}

13.30.3.2 Standard Triggers

Trigger        Behavior
none           LED stays at its current brightness; no automatic changes.
default-on     LED is turned to max_brightness when the trigger is activated.
heartbeat      Double-blink heartbeat pattern (100ms on, 100ms off, 100ms on, 700ms off). Indicates the kernel is alive.
timer          Configurable blink: delay_on / delay_off in ms, set via sysfs.
disk-activity  Blinks on block I/O completion. Fires from IRQ context (uses atomic brightness update).
netdev         Blinks on network activity for a configured interface.
panic          Turns on during kernel panic for visual indication.
backlight      Mirrors the backlight subsystem state (on when backlight > 0).
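The heartbeat pattern maps cleanly onto elapsed time modulo its 1000 ms cycle. This helper is illustrative (not part of the trigger API), but it encodes exactly the 100/100/100/700 ms pattern from the table:

```rust
/// True if the heartbeat LED should be lit at the given elapsed time.
/// Pattern: 100ms on, 100ms off, 100ms on, 700ms off (1000ms cycle).
/// Hypothetical helper for illustration.
fn heartbeat_is_on(elapsed_ms: u64) -> bool {
    match elapsed_ms % 1000 {
        0..=99 => true,    // first pulse
        100..=199 => false,
        200..=299 => true, // second pulse
        _ => false,        // 700 ms rest
    }
}
```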

13.30.3.3 sysfs Interface

/sys/class/leds/<name>/
  brightness      : current brightness (read-write, 0 to max_brightness)
  max_brightness  : maximum brightness value (read-only)
  trigger         : list triggers (read), select trigger (write)
  delay_on        : blink on-time in ms (timer trigger, read-write)
  delay_off       : blink off-time in ms (timer trigger, read-write)

13.30.4 PWM — Pulse Width Modulation

The PWM subsystem manages hardware PWM controllers that generate periodic digital waveforms with configurable period and duty cycle. PWM signals drive backlights, LEDs (with analog-like dimming), fans, motors, buzzers, and servo controls. Each PWM controller (chip) provides one or more independent channels.

13.30.4.1 Core Types

/// A PWM controller chip providing one or more PWM output channels.
pub struct PwmChip {
    /// Device node in the device registry.
    pub dev: DeviceNode,
    /// Number of PWM channels this chip provides.
    pub npwm: u32,
    /// Hardware operations.
    pub ops: Arc<dyn PwmChipOps>,
}

/// Hardware operations implemented by each PWM controller driver.
/// All methods run in process context (may involve register writes
/// or I2C transactions).
/// Kernel-internal Rust trait, not a KABI vtable. PWM drivers are compiled
/// into the kernel; dispatch is via Rust vtable.
pub trait PwmChipOps: Send + Sync {
    /// Configure a PWM channel's period and duty cycle.
    /// Both values are in nanoseconds. duty_ns must be <= period_ns.
    fn config(
        &self,
        channel: u32,
        duty_ns: u64,
        period_ns: u64,
    ) -> Result<(), KernelError>;

    /// Enable a PWM channel: start generating the waveform.
    fn enable(&self, channel: u32) -> Result<(), KernelError>;

    /// Disable a PWM channel: output goes to the inactive level.
    fn disable(&self, channel: u32) -> Result<(), KernelError>;

    /// Set output polarity. Normal: high during duty portion.
    /// Inversed: low during duty portion.
    fn set_polarity(
        &self,
        channel: u32,
        polarity: PwmPolarity,
    ) -> Result<(), KernelError>;

    /// Read the current hardware state for a channel.
    /// Used to synchronize kernel state with hardware at probe time
    /// (hardware may have been configured by the bootloader).
    fn get_state(&self, channel: u32) -> Result<PwmState, KernelError>;
}

/// PWM output polarity.
pub enum PwmPolarity {
    /// Output is high during the duty portion, low during the rest.
    Normal,
    /// Output is low during the duty portion, high during the rest.
    Inversed,
}

/// Complete state of one PWM channel.
pub struct PwmState {
    /// Waveform period in nanoseconds.
    pub period_ns: u64,
    /// Active (high or low, depending on polarity) duration in nanoseconds.
    pub duty_ns: u64,
    /// Output polarity.
    pub polarity: PwmPolarity,
    /// True if the channel is currently generating output.
    pub enabled: bool,
}

/// Consumer handle to a single PWM channel. Obtained via pwm_get().
/// On drop: disables the channel if this consumer enabled it.
pub struct PwmConsumer {
    chip: Arc<PwmChip>,
    channel: u32,
    enabled: bool,
}

impl Drop for PwmConsumer {
    fn drop(&mut self) {
        if self.enabled {
            let _ = self.chip.ops.disable(self.channel);
        }
    }
}

13.30.4.2 API

/// Look up and claim a PWM channel by device tree / ACPI reference.
///
/// `dev` is the consumer device. `con_id` is the connection identifier
/// from the device tree `pwms` property or ACPI `_DSD` entry.
/// Returns a consumer handle that auto-disables on drop.
pub fn pwm_get(
    dev: &DeviceNode,
    con_id: &str,
) -> Result<PwmConsumer, KernelError>;
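A typical consumer (e.g., a PWM backlight) translates a brightness percentage into the duty_ns it passes to PwmChipOps::config(). The helper below is a hypothetical sketch of that computation; the name and the 0-100 scale are assumptions, but the clamping reflects the duty_ns <= period_ns requirement stated in the trait.

```rust
/// Convert a brightness percentage into duty_ns for a given period,
/// clamping so duty_ns <= period_ns as config() requires.
/// Hypothetical consumer-side helper.
fn percent_to_duty_ns(percent: u32, period_ns: u64) -> u64 {
    let p = u64::from(percent.min(100));
    period_ns * p / 100
}
```

For a 20 kHz backlight PWM, period_ns is 50_000, so 50% brightness yields duty_ns = 25_000, which the consumer then passes to config() before enable().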

13.30.4.3 sysfs Interface

/sys/class/pwm/pwmchipN/
  npwm            : number of channels (read-only)
  export          : write channel number to expose as pwmM (write-only)
  unexport        : write channel number to remove pwmM (write-only)

/sys/class/pwm/pwmchipN/pwmM/
  period          : waveform period in nanoseconds (read-write)
  duty_cycle      : active-time in nanoseconds (read-write)
  polarity        : "normal" or "inversed" (read-write)
  enable          : "1" to start output, "0" to stop (read-write)

13.30.5 Backlight Subsystem

The backlight subsystem controls display panel brightness. It provides a sysfs interface consumed by systemd-backlight, brightnessctl, desktop environment brightness sliders, and ACPI brightness hotkeys. Each display panel has one backlight device.

13.30.5.1 Core Types

/// A display backlight device.
pub struct BacklightDevice {
    /// Device node in the device registry.
    pub dev: DeviceNode,
    /// Device name (e.g., "intel_backlight", "acpi_video0", "panel0-backlight").
    pub name: KString,
    /// Backlight type: determines priority when multiple backlights exist
    /// for the same display (raw > platform > firmware).
    pub bl_type: BacklightType,
    /// Maximum brightness value (hardware-dependent; typically 100 or 65535).
    pub max_brightness: u32,
    /// Current brightness. Updated on every successful set operation.
    pub brightness: AtomicU32,
    /// Hardware operations.
    pub ops: Arc<dyn BacklightOps>,
}

/// Hardware operations for backlight control.
pub trait BacklightOps: Send + Sync {
    /// Set the backlight brightness. The value is in the range
    /// [0, max_brightness]. 0 = panel off (backlight disabled).
    /// Called from process context; may involve I2C, GPIO, or ACPI methods.
    fn update_status(&self, brightness: u32) -> Result<(), KernelError>;

    /// Read the current hardware brightness. Used to synchronize kernel
    /// state with hardware (e.g., if firmware changed brightness at resume).
    fn get_brightness(&self) -> Result<u32, KernelError>;
}

/// Backlight type: determines selection priority.
///
/// When multiple backlight devices exist for the same display panel
/// (common on x86 laptops with both ACPI and GPU-native backlight),
/// userspace tools select the highest-priority type.
#[repr(u8)]
pub enum BacklightType {
    /// Direct hardware register control (GPU PWM register).
    /// Highest priority — most accurate and responsive.
    Raw = 1,
    /// Platform firmware control (ACPI _BCM/_BCL methods, DT backlight node).
    /// Second priority.
    Platform = 2,
    /// Legacy firmware-assisted control (deprecated ACPI video backlight).
    /// Lowest priority — used only when Raw and Platform are unavailable.
    Firmware = 3,
}
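The priority rule follows directly from the discriminant values: the lowest number wins. A minimal sketch of the selection userspace tools perform, using a local stand-in enum (the real BacklightType lives in the kernel):

```rust
/// Local stand-in for BacklightType; discriminants match the spec
/// (Raw=1 beats Platform=2 beats Firmware=3).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum BlType {
    Raw = 1,
    Platform = 2,
    Firmware = 3,
}

/// Pick the highest-priority backlight among devices targeting the same
/// panel: the entry with the lowest BlType discriminant.
fn pick_backlight<'a>(devs: &[(&'a str, BlType)]) -> Option<&'a str> {
    devs.iter().min_by_key(|&&(_, t)| t).map(|&(name, _)| name)
}
```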

13.30.5.2 sysfs Interface

/sys/class/backlight/<name>/
  brightness         : set/get brightness (read-write, 0 to max_brightness)
  max_brightness     : maximum value (read-only)
  actual_brightness  : current hardware brightness (read-only, calls get_brightness)
  bl_power           : power state: 0 = on, 4 = off (read-write, FB_BLANK values)
  type               : "raw", "platform", or "firmware" (read-only)

systemd-backlight reads /sys/class/backlight/*/brightness at shutdown and restores it at boot. brightnessctl and desktop environments use the same sysfs interface. The ACPI brightness hotkey handler calls update_status() directly and emits a uevent so userspace can update its slider.


13.30.6 power_supply Subsystem

The power_supply subsystem models batteries, AC adapters, USB chargers, and wireless chargers. It provides property-based introspection via sysfs and notifies userspace of state changes (plug/unplug, charge level, temperature) via uevent. This is the interface consumed by upower, systemd-logind, and battery indicator widgets.

Cross-references: system power management (Section 7.4) for suspend-on-low-battery policy.

13.30.6.1 Core Types

/// A power supply device (battery, AC adapter, USB charger, etc.).
pub struct PowerSupplyDevice {
    /// Device node in the device registry.
    pub dev: DeviceNode,
    /// Device name (e.g., "BAT0", "AC0", "ucsi-source-psy-0").
    pub name: KString,
    /// Power supply type.
    pub ps_type: PowerSupplyType,
    /// Properties this supply reports. Static — set at registration time.
    pub properties: &'static [PowerSupplyProperty],
    /// Hardware operations.
    pub ops: Arc<dyn PowerSupplyOps>,
    /// Notifier chain for state changes. Listeners (upower daemon via uevent,
    /// the suspend policy engine) register here.
    pub notifier: NotifierChain,
}

/// Power supply classification. Values match Linux `enum power_supply_type`
/// (include/linux/power_supply.h) for sysfs and uevent compatibility.
#[repr(u32)]
pub enum PowerSupplyType {
    /// Type not yet determined.
    Unknown = 0,
    /// Rechargeable battery.
    Battery = 1,
    /// Uninterruptible power supply.
    Ups = 2,
    /// AC mains adapter.
    Mains = 3,
    /// USB Standard Downstream Port (500 mA max).
    Usb = 4,
    /// USB Dedicated Charging Port (1.5 A max).
    UsbDcp = 5,
    /// USB Charging Downstream Port (1.5 A max + data).
    UsbCdp = 6,
    /// USB Accessory Charger Adapter.
    UsbAca = 7,
    /// USB Type-C (up to 3 A at 5 V without PD).
    UsbTypec = 8,
    /// USB Power Delivery (up to 240 W with EPR).
    UsbPd = 9,
    /// USB PD Dual Role Port (source + sink).
    UsbPdDrp = 10,
    /// Apple-specific charger identification (via D+/D- signaling).
    AppleBrickId = 11,
    /// Wireless charging (Qi, AirFuel).
    Wireless = 12,
}

/// Properties a power supply can report. Each property maps to one sysfs file.
/// Not all properties apply to all supply types (e.g., Capacity is meaningless
/// for an AC adapter).
pub enum PowerSupplyProperty {
    /// Charging status: Charging, Discharging, Full, Not charging, Unknown.
    Status,
    /// Health: Good, Overheat, Dead, OverVoltage, UnspecifiedFailure, Cold.
    Health,
    /// 1 if the supply is physically present (battery inserted).
    Present,
    /// 1 if the supply is online (AC plugged in, USB connected).
    Online,
    /// Battery chemistry: NiMH, LiIon, LiPoly, LiFe, NiCd, LiMn.
    Technology,
    /// Current voltage in microvolts.
    VoltageNow,
    /// Instantaneous current draw in microamps (positive = charging, negative = discharging).
    CurrentNow,
    /// Remaining capacity as a percentage (0–100).
    Capacity,
    /// Coarse capacity level: Unknown, Critical, Low, Normal, High, Full.
    CapacityLevel,
    /// Battery temperature in 0.1 degrees Celsius.
    Temp,
    /// Design charge capacity in microamp-hours.
    ChargeFullDesign,
    /// Last-measured full charge capacity in microamp-hours.
    ChargeFull,
    /// Current charge in microamp-hours.
    ChargeNow,
    /// Design energy capacity in microwatt-hours.
    EnergyFullDesign,
    /// Last-measured full energy capacity in microwatt-hours.
    EnergyFull,
    /// Current stored energy in microwatt-hours.
    EnergyNow,
    /// Battery charge cycle count.
    CycleCount,
    /// Estimated time to empty in seconds.
    TimeToEmpty,
    /// Estimated time to full charge in seconds.
    TimeToFull,
    /// Battery serial number string.
    SerialNumber,
    /// Battery manufacturer string.
    Manufacturer,
    /// Battery model name string.
    ModelName,
}
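A driver that reports both Capacity and CapacityLevel derives the coarse level from the percentage. The threshold values below are illustrative policy choices, not framework-mandated constants:

```rust
/// Map a Capacity percentage to a CapacityLevel string, as a battery
/// driver might report it. Thresholds are assumed for illustration.
fn capacity_level(percent: u32) -> &'static str {
    match percent {
        0..=4 => "Critical",
        5..=19 => "Low",
        20..=79 => "Normal",
        80..=99 => "High",
        _ => "Full", // 100 and above
    }
}
```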

/// Value returned by get_property / passed to set_property.
pub enum PowerSupplyValue {
    /// Integer value (voltage, current, capacity, temperature, cycle count, time).
    Int(i64),
    /// String value (serial number, manufacturer, model name).
    /// String properties are fixed at probe time; implementations should
    /// cache them in the device struct and return clones. Polling interval
    /// is ~2 seconds (upower); heap allocation pressure is negligible.
    Str(KString),
}

/// Hardware operations for reading and writing power supply properties.
pub trait PowerSupplyOps: Send + Sync {
    /// Read a property value from the hardware.
    fn get_property(
        &self,
        prop: PowerSupplyProperty,
    ) -> Result<PowerSupplyValue, KernelError>;

    /// Write a property value to the hardware. Not all properties are writable;
    /// call property_is_writeable() to check before calling this.
    fn set_property(
        &self,
        prop: PowerSupplyProperty,
        val: PowerSupplyValue,
    ) -> Result<(), KernelError>;

    /// Returns true if the given property can be written via set_property().
    /// Typically only charge current limits and voltage limits are writable.
    fn property_is_writeable(&self, prop: PowerSupplyProperty) -> bool;
}

13.30.6.2 State Change Notification

When a power supply state changes (battery level crosses a threshold, charger is plugged/unplugged, temperature exceeds a limit), the driver calls power_supply_changed(), which:

  1. Fires all registered notifier callbacks synchronously.
  2. Emits a uevent (KOBJ_CHANGE) on the device's sysfs node.
  3. Userspace daemons (upower, systemd-logind) receive the uevent via netlink and re-read the changed properties from sysfs.

/// Notify the power_supply core that this device's state has changed.
/// Called by the hardware driver from process or IRQ context.
///
/// Debounced: if called more than once within 500 ms, only one uevent
/// is emitted (coalesced via a workqueue delayed work item).
pub fn power_supply_changed(psy: &PowerSupplyDevice);
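The 500 ms coalescing can be modeled as: a call while a delayed emission is pending is absorbed; otherwise it schedules a new emission. This is a behavioral sketch, not the workqueue implementation; the function name is hypothetical.

```rust
/// Count uevents emitted for a sorted (ascending) sequence of
/// power_supply_changed() call times, coalescing calls that arrive
/// while a delayed emission is still pending. Illustrative model.
fn coalesced_uevents(call_times_ms: &[u64], window_ms: u64) -> u64 {
    let mut emitted = 0u64;
    let mut pending_until: Option<u64> = None;
    for &t in call_times_ms {
        match pending_until {
            // A delayed work item is already queued: this call coalesces.
            Some(deadline) if t < deadline => {}
            // No pending emission: schedule one and count it.
            _ => {
                emitted += 1;
                pending_until = Some(t + window_ms);
            }
        }
    }
    emitted
}
```

With calls at 0, 100, 200, 600, and 1200 ms and a 500 ms window, the calls at 100 and 200 ms coalesce into the first emission, yielding three uevents total.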

13.30.6.3 sysfs Interface

/sys/class/power_supply/<name>/
  type                : "Battery", "Mains", "USB", etc. (read-only)
  status              : "Charging", "Discharging", "Full", "Not charging" (read-only)
  health              : "Good", "Overheat", "Dead", etc. (read-only)
  present             : "1" or "0" (read-only)
  online              : "1" or "0" (read-only, AC/USB supplies)
  technology          : "Li-ion", "Li-poly", etc. (read-only, batteries)
  voltage_now         : current voltage in µV (read-only)
  current_now         : current current in µA (read-only)
  capacity            : 0-100 percent (read-only)
  capacity_level      : "Normal", "Low", "Critical", etc. (read-only)
  temp                : temperature in 0.1°C (read-only)
  charge_full_design  : design capacity in µAh (read-only)
  charge_full         : last full capacity in µAh (read-only)
  charge_now          : current charge in µAh (read-only)
  energy_full_design  : design energy in µWh (read-only)
  energy_full         : last full energy in µWh (read-only)
  energy_now          : current energy in µWh (read-only)
  cycle_count         : charge cycle count (read-only)
  time_to_empty_avg   : estimated seconds to empty (read-only)
  time_to_full_avg    : estimated seconds to full (read-only)
  serial_number       : battery serial number (read-only)
  manufacturer        : battery manufacturer (read-only)
  model_name          : battery model name (read-only)

Uevent properties emitted on state change (KOBJ_CHANGE):

POWER_SUPPLY_NAME=BAT0
POWER_SUPPLY_TYPE=Battery
POWER_SUPPLY_STATUS=Discharging
POWER_SUPPLY_CAPACITY=72


13.30.7 Multi-Architecture Notes

x86-64: All six subsystems are relevant. Backlight is heavily used (Intel/AMD GPU backlight, ACPI video backlight). power_supply is essential for laptops. devfreq is used for discrete GPU frequency scaling. PWM controls CPU fans on some motherboards. The auxiliary bus is used by Intel ice (RDMA, switchdev), Mellanox mlx5 (RDMA, vDPA), and Intel HDA (DSP offload).

AArch64/ARMv7: Primary consumers of devfreq (Mali GPU, memory bus), PWM (backlight, fan, motor control on embedded boards), LED (board status LEDs), and power_supply (battery management on phones and tablets). Backlight is PWM-driven on most ARM SoCs. The auxiliary bus is used by Qualcomm ath11k (WiFi + BT decomposition).

RISC-V: PWM controllers available on SiFive and Allwinner D1 SoCs. LED and power_supply follow the same model as ARM embedded boards. devfreq support depends on the SoC's frequency scaling capability (available on StarFive JH7110).

PPC32/PPC64LE: LED subsystem is used for system attention indicators on IBM Power servers (via RTAS/OPAL firmware calls). power_supply is relevant for UPS monitoring. devfreq and PWM are less common but follow the same framework.

Initialization sequence: These subsystems initialize during boot phase 6 (after bus enumeration and regulator/clock framework init in phase 5). The auxiliary bus initializes in phase 6 alongside the parent driver's probe; auxiliary device matching occurs immediately when auxiliary_device_add() is called.