
Chapter 21: AI/ML and Accelerators

Unified accelerator framework, accelerator memory/P2P DMA, isolation/scheduling, in-kernel inference, accelerator networking, unified compute model


21.1 Unified Accelerator Framework

This section defines kernel-level AI/ML infrastructure: unified accelerator management, heterogeneous memory, peer-to-peer DMA, multi-tenant isolation, in-kernel inference, and distributed ML networking. These are kernel-internal capabilities — training frameworks and model serving remain in userspace.

Design Constraints:

  1. Drop-in compatibility: Existing Linux GPU/accelerator userspace (CUDA runtime, ROCm, OpenCL, Vulkan compute, /dev/dri, /dev/nvidia*) must work through the compatibility layer. Existing tools must not break.
  2. Superset, not replacement: UmkaOS exposes additional accelerator management capabilities beyond what Linux provides. Userspace software that wants better scheduling, isolation, or memory management can opt in. Software that doesn't care sees standard Linux behavior.
  3. IP-clean: Built from public specifications (PCIe, VFIO, DRM/KMS interfaces), academic algorithms, and open standards. No proprietary API reimplementation.

21.1.1 Motivation

21.1.1.1 The Current State

In 2026, accelerators (GPUs, NPUs, TPUs, FPGAs, custom ASICs) are as fundamental to computing as network interfaces. Every cloud instance, every phone, every laptop has at least one. AI/ML workloads are the dominant growth driver in datacenter compute.

Yet the Linux kernel treats accelerators as dumb peripherals. The kernel's relationship to a GPU is roughly:

Linux kernel:
  - Maps PCI BARs (MMIO)
  - Sets up IOMMU
  - Routes interrupts
  - That's it. The driver owns everything else.

GPU driver (e.g., nvidia.ko):
  - Owns the command queue
  - Owns memory management (VRAM allocation, page tables)
  - Owns scheduling (which process runs on which compute unit)
  - Owns isolation (who can see whose memory)
  - Owns power management (GPU clock, thermal)
  - The kernel knows nothing about any of this.

This is equivalent to how operating systems managed CPUs before preemptive multitasking — the hardware vendor's runtime owns the resource, the kernel is blind.

21.1.1.2 What This Causes

| Problem | Impact |
| --- | --- |
| No kernel-visible GPU scheduling | Two containers sharing a GPU have no fairness guarantees. One can starve the other. |
| No cgroup integration | cpu.max limits CPU time. Nothing limits GPU time. Kubernetes has no way to enforce GPU QoS. |
| No memory accounting | GPU VRAM usage is invisible to the OOM killer. A GPU memory leak crashes the GPU driver, not just the leaking process. |
| No preemption | A long-running GPU kernel (training batch) blocks interactive inference. Linux has solved this for CPUs since 1991. |
| No unified memory management | CUDA UVM, AMD SVM, Intel SVM — each vendor's "unified memory" is driver-internal. The kernel's page fault handler doesn't know about GPU page tables. |
| No isolation | GPU memory between processes is isolated by the driver, not the kernel. Driver bugs = security holes. |
| Driver crash = system crash | NVIDIA driver hang requires full system reboot. No crash recovery. |
| No P2P DMA management | GPUDirect is vendor-specific. No kernel facility for "DMA from NVMe directly to GPU VRAM." |

21.1.1.3 The Opportunity

A kernel designed from scratch in 2026 can treat accelerators as first-class scheduled resources, the same way it treats CPUs, memory, and I/O bandwidth. This is not about running PyTorch in the kernel — it is about the kernel managing accelerator hardware the way it manages all other hardware: scheduling, memory, isolation, recovery.

UmkaOS's existing architecture is uniquely suited for this:

  - KABI: Stable driver ABI means GPU driver updates are decoupled from kernel updates
  - Crash recovery: GPU driver crashes can be survived. Driver binary reload takes ~50-150ms; total recovery including GPU hardware reset is ~100ms–5s, varying by hardware (simple GPU reset ~100-500ms; datacenter GPUs with GSP firmware reload, such as NVIDIA H100, may take 2-5s)
  - Capability system: Natural fit for fine-grained accelerator access control
  - Device registry: Models accelerator topology (GPU → engines, VRAM, display, encode)
  - cgroup integration: CPU bandwidth guarantees (Section 6.3) extend naturally to accelerator time
  - Zero-copy I/O paths: Generalize to device-to-device DMA


21.1.2 Unified Accelerator Framework

21.1.2.1 Design: umka-accel

A new KABI interface family for accelerator devices. Every accelerator driver — GPU, NPU, TPU, FPGA, DSP, custom ASIC — implements the same base interface. Hardware-specific capabilities are exposed through versioned extension vtables.

                         umka-accel (KABI interface)
                                    |
                    +---------------+---------------+
                    |               |               |
               AccelBase       AccelCompute    AccelDisplay
               (all accel)     (compute-      (display/render
                               capable)        capable)
                    |               |               |
            +-------+-----+    +---+---+       +---+---+
            |       |     |    |       |       |       |
          nvidia  amdgpu  xe  nvidia  amdgpu   nvidia  amdgpu
           GPU     GPU   GPU   GPU     GPU      GPU     GPU
                                |
                            intel_npu
                                |
                            custom_tpu

21.1.2.2 AccelBaseVTable — The Universal Interface

Every accelerator driver provides this vtable. It covers what the kernel needs to manage the device, not what userspace needs to program it.

/// Base vtable that every accelerator driver must implement.
/// This is the kernel-management interface, not the compute API.
#[repr(C)]
pub struct AccelBaseVTable {
    /// Total size of this vtable struct in bytes. Allows the kernel to detect
    /// when a driver provides a newer (larger) vtable than expected and safely
    /// ignore trailing fields it does not understand.
    pub vtable_size: u64,

    /// KABI version encoded as (major << 16) | minor.
    /// - **Major version** (upper 16 bits): incremented for breaking changes
    ///   (removed fields, changed semantics, reordered vtable entries).
    ///   Kernel and driver must agree on the major version or registration fails.
    /// - **Minor version** (lower 16 bits): incremented for backward-compatible
    ///   additions (new Optional fields appended to the vtable, new enum variants).
    ///   A driver with minor version N works with a kernel expecting minor <= N.
    ///
    /// **Negotiation protocol during driver registration:**
    /// 1. Driver calls `register_driver()` with its vtable (including `version`
    ///    and `vtable_size`).
    /// 2. Kernel extracts the major version: `driver_major = version >> 16`.
    /// 3. If `driver_major != kernel_major`, registration fails with
    ///    `KABI_VERSION_MISMATCH`. The driver must be rebuilt against a
    ///    compatible kernel header.
    /// 4. If `driver_major == kernel_major`, the kernel accepts the vtable.
    ///    Any `Option<fn>` fields beyond the driver's `vtable_size` are treated
    ///    as `None` (not present). Any required (non-Option) fields beyond the
    ///    driver's `vtable_size` cause registration failure.
    /// 5. The negotiated version is recorded in the device registry node for
    ///    diagnostic purposes.
    ///
    /// This protocol applies identically to AccelComputeVTable,
    /// AccelDisplayVTable, RdmaDeviceVTable, and all other KABI vtables
    /// in this section.
    pub version: u32,

    // === Device Info ===

    /// Return device capabilities and resource inventory.
    pub get_info: unsafe extern "C" fn(
        ctx: *mut c_void,
        out_info: *mut AccelDeviceInfo,
    ) -> IoResultCode,

    // === Context Management ===

    /// Create an execution context (one per process/tenant).
    /// Returns an opaque context handle.
    pub create_context: unsafe extern "C" fn(
        ctx: *mut c_void,
        owner_pid: u32,
        priority: AccelPriority,
        limits: *const AccelContextLimits,
        out_context: *mut AccelContextHandle,
    ) -> IoResultCode,

    /// Destroy an execution context and free all its resources.
    pub destroy_context: unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
    ) -> IoResultCode,

    // === Command Submission ===

    /// Submit a command buffer for execution.
    /// The kernel calls this after validating capabilities and
    /// scheduling the submission according to policy.
    ///
    /// **Note**: Command buffers are pre-created via `create_cmd_buffer` and
    /// recorded via `ACCEL_IOCTL_CMD_RECORD` before submission. The driver
    /// receives the opaque `AccelCmdBufferHandle` (not raw command data).
    /// This allows the driver to manage command buffer memory and validation
    /// independently of the submission path.
    pub submit_commands: unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        cmd_buffer: AccelCmdBufferHandle,  // Opaque handle; length is implicit in handle
        fences: *const AccelFence,
        fence_count: u32,
        out_submission: *mut AccelSubmissionHandle,
    ) -> IoResultCode,

    /// Poll for completion of a submitted command buffer.
    pub poll_completion: unsafe extern "C" fn(
        ctx: *mut c_void,
        submission: AccelSubmissionHandle,
        out_status: *mut AccelCompletionStatus,
    ) -> IoResultCode,

    /// Request preemption or cooperative yield of a running context.
    ///
    /// Behavior depends on `AccelDeviceInfo::preemption_granularity`:
    /// - `InstructionLevel` or `DrawBoundary`: True preemption. The device saves
    ///   context state and stops the running workload within ~50μs-10ms
    ///   (depends on preemption granularity and workload; instruction-level
    ///   preemption on modern compute GPUs is typically 50-100μs, while
    ///   draw-boundary preemption can take milliseconds).
    ///   The context can be resumed later via a new `submit_commands`.
    /// - `CommandBoundary`: The device finishes the current command buffer
    ///   but does not start the next queued one. Latency is bounded by
    ///   the longest in-flight command buffer.
    /// - `None`: Cooperative yield only. The driver stops submitting new
    ///   command buffers after the current one completes. Cannot interrupt
    ///   a running dispatch.
    ///
    /// Returns `IO_OK` if the preemption/yield was initiated (completion
    /// is asynchronous — the scheduler polls for the context to become
    /// idle). Returns `IO_NOT_SUPPORTED` if the driver does not implement
    /// any form of preemption or yield.
    pub preempt_context: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        reason: PreemptReason,
    ) -> IoResultCode>,

    // === Memory Management ===

    /// Allocate device-local memory (VRAM, local SRAM, etc.).
    pub alloc_device_memory: unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        size: u64,
        alignment: u64,
        page_size: AccelPageSize,
        flags: AccelMemFlags,
        out_handle: *mut AccelMemHandle,
    ) -> IoResultCode,

    /// Free device-local memory.
    pub free_device_memory: unsafe extern "C" fn(
        ctx: *mut c_void,
        handle: AccelMemHandle,
    ) -> IoResultCode,

    /// Map device memory into a CPU-visible virtual address.
    pub map_device_memory: unsafe extern "C" fn(
        ctx: *mut c_void,
        handle: AccelMemHandle,
        offset: u64,
        size: u64,
        out_cpu_addr: *mut u64,
    ) -> IoResultCode,

    /// Migrate pages between CPU RAM and device memory.
    /// Direction determined by flags.
    pub migrate_pages: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        pages: *const AccelMigrationEntry,
        page_count: u32,
        flags: MigrationFlags,
    ) -> IoResultCode>,

    // === Utilization Reporting ===

    /// Report current device utilization to the kernel scheduler.
    pub get_utilization: unsafe extern "C" fn(
        ctx: *mut c_void,
        out_util: *mut AccelUtilization,
    ) -> IoResultCode,

    // === Power/Thermal ===

    /// Get current power and thermal state.
    pub get_power_state: unsafe extern "C" fn(
        ctx: *mut c_void,
        out_state: *mut AccelPowerState,
    ) -> IoResultCode,

    /// Set performance level / clock frequency.
    pub set_performance_level: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        level: AccelPerfLevel,
    ) -> IoResultCode>,

    // === Initialization and Reset ===

    /// Called once after device enumeration to perform hardware initialization.
    /// The driver must configure firmware, set up internal queues, and return
    /// only after the device is ready to accept commands.
    ///
    /// Also called by HROT ([Section 21.3.7.3](#21373-watchdog-implementation))
    /// after a full device reset to restore the device to a clean, fully operational
    /// state equivalent to a fresh driver load.
    ///
    /// Returns 0 on success, negative errno on failure. On failure the device
    /// is placed in `AccelDeviceState::Error` and must be re-probed by the operator.
    ///
    /// **KABI version note**: Added in KABI version 1.1. Callers must check
    /// `vtable_size >= offset_of!(AccelBaseVTable, device_init) + size_of::<Option<fn>>()`
    /// before invoking; treat as `None` if not present.
    pub device_init: Option<unsafe extern "C" fn(dev: *mut AccelDevice) -> i32>,

    /// Reset a single execution context (not the whole device).
    pub reset_context: unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
    ) -> IoResultCode,

    /// Full device reset.
    pub reset_device: unsafe extern "C" fn(
        ctx: *mut c_void,
    ) -> IoResultCode,

    // === Completion Notification ===

    /// Register a kernel callback for command completion notification.
    /// Replaces polling for latency-sensitive contexts.
    pub register_completion_callback: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        callback: unsafe extern "C" fn(
            context: AccelContextHandle,
            submission: AccelSubmissionHandle,
            status: AccelCompletionStatus,
        ),
    ) -> IoResultCode>,

    // === Vendor-Private Passthrough ===

    /// Handle vendor-private ioctls (range 0xC0-0xFF). The kernel validates
    /// buffer bounds and capability permissions, then passes the raw buffer
    /// to this handler. Returns 0 on success, negative errno on failure.
    /// Drivers that do not support vendor-private ioctls set this to `None`.
    pub vendor_ioctl: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        ioctl_nr: u8,
        user_buf: *mut u8,
        buf_len: u32,
    ) -> IoResultCode>,
}
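The major/minor negotiation described in the `version` field's doc comment can be sketched as pure integer logic. This is a hypothetical, simplified helper (the names `encode_version`, `negotiate`, and the choice to record the lower minor as the negotiated version are illustrative assumptions; the real registration path also validates `vtable_size` against required fields):

```rust
/// Stand-in for the KABI_VERSION_MISMATCH error code (hypothetical value).
const KABI_VERSION_MISMATCH: i32 = -1;

/// Encode a KABI version as (major << 16) | minor, per the vtable doc comment.
fn encode_version(major: u16, minor: u16) -> u32 {
    ((major as u32) << 16) | minor as u32
}

/// Sketch of the registration-time check: majors must match exactly;
/// minors are compatible because fields beyond the smaller `vtable_size`
/// are treated as absent. Recording the lower minor as the negotiated
/// version is an assumption for this sketch.
fn negotiate(kernel_version: u32, driver_version: u32) -> Result<u32, i32> {
    let (kernel_major, kernel_minor) = (kernel_version >> 16, kernel_version & 0xFFFF);
    let (driver_major, driver_minor) = (driver_version >> 16, driver_version & 0xFFFF);
    if driver_major != kernel_major {
        // Breaking change: the driver must be rebuilt against matching headers.
        return Err(KABI_VERSION_MISMATCH);
    }
    Ok((kernel_major << 16) | kernel_minor.min(driver_minor))
}

fn main() {
    // Major mismatch fails registration.
    assert!(negotiate(encode_version(2, 0), encode_version(1, 9)).is_err());
    // Same major, different minors: negotiated minor is the lower one.
    assert_eq!(
        negotiate(encode_version(1, 4), encode_version(1, 2)),
        Ok(encode_version(1, 2))
    );
    println!("negotiation ok");
}
```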

Callback restrictions: Completion callbacks execute in interrupt context (hardirq on x86, IRQ on ARM). They MUST NOT:

  - Allocate memory (no Box, Vec, or slab allocation)
  - Acquire sleeping locks (no Mutex, only SpinLock with try_lock)
  - Perform I/O (no disk, network, or MMIO beyond the accelerator's own registers)
  - Call schedule() or any function that may sleep

Callbacks that need to perform complex work (e.g., chaining dependent dispatches, updating shared data structures) must defer to a per-accelerator workqueue thread via accel_defer(closure). The workqueue runs in process context with full kernel capabilities. The completion callback's job is limited to: (1) recording completion status, (2) waking the workqueue if deferred work is pending, and (3) updating per-context statistics (fence value, timing).
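The split between the minimal interrupt-context callback and the deferred workqueue path can be modeled in userspace. In this illustration (not kernel code), std threads and an mpsc channel stand in for `accel_defer` and the per-accelerator workqueue thread:

```rust
use std::sync::mpsc;
use std::thread;

/// Work item deferred from the completion callback to the workqueue.
enum Deferred {
    /// Chain a dependent dispatch for the given submission handle.
    ChainDispatch(u64),
}

/// Model of the interrupt-context callback: record completion, enqueue
/// deferred work, return. No heavy work, no sleeping, no I/O.
fn completion_callback(submission: u64, defer_tx: &mpsc::Sender<Deferred>) {
    // (1) recording completion status / fence value would happen here
    // (2) wake the workqueue by handing it the deferred work
    let _ = defer_tx.send(Deferred::ChainDispatch(submission));
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Model of the per-accelerator workqueue thread (process context).
    let workqueue = thread::spawn(move || {
        let mut chained = 0u32;
        for msg in rx {
            let Deferred::ChainDispatch(_submission) = msg;
            chained += 1; // complex work (chaining dispatches, etc.) runs here
        }
        chained
    });

    for submission in 0..4 {
        completion_callback(submission, &tx);
    }
    drop(tx); // no more completions; workqueue drains and exits

    assert_eq!(workqueue.join().unwrap(), 4);
    println!("deferred 4 completions to the workqueue");
}
```

The design point the model captures: the callback's only synchronization with the worker is a wait-free enqueue plus a wakeup, which keeps the interrupt path within the restrictions above.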

21.1.2.3 Key Data Types

/// Device capability and resource inventory.
#[repr(C)]
pub struct AccelDeviceInfo {
    /// Device class.
    pub device_class: AccelDeviceClass,

    /// Number of hardware compute units in this device.
    /// The definition of "compute unit" is device-class-specific:
    ///   - GPU: Streaming Multiprocessors (SMs) for NVIDIA, Compute Units (CUs) for AMD
    ///   - NPU: neural processing cores
    ///   - FPGA: configurable logic blocks
    ///   - DSP: DSP cores
    /// This is the count reported by the driver via `get_info()` and used by the
    /// AccelScheduler for proportional resource accounting. For MIG/partitioned
    /// devices, this is the compute units assigned to the partition, not the
    /// full device total.
    pub compute_units: u32,

    /// Device-local memory size (bytes). 0 if no local memory.
    pub local_memory_bytes: u64,

    /// Maximum concurrent execution contexts.
    pub max_contexts: u32,

    /// Hardware preemption support.
    pub preemption_granularity: AccelPreemptionGranularity,

    /// Maximum command buffer size (bytes).
    pub max_cmd_buffer_size: u32,

    /// Supported memory types.
    pub memory_types: AccelMemTypeFlags,

    /// PCIe atomics support (needed for fine-grained SVM).
    /// Uses `u8` (0=false, 1=true) instead of `bool` for stable KABI:
    /// C `bool` has implementation-defined size and alignment, making it
    /// unsuitable for cross-compilation-unit ABI boundaries.
    pub pcie_atomics: u8,

    /// Peer-to-peer DMA capability.
    /// `u8` for KABI stability (see `pcie_atomics` comment).
    pub p2p_capable: u8,

    /// Unified virtual addressing (CPU and device share address space).
    /// `u8` for KABI stability (see `pcie_atomics` comment).
    pub unified_addressing: u8,

    /// Hardware page fault support (device can fault on unmapped pages).
    /// `u8` for KABI stability (see `pcie_atomics` comment).
    pub hw_page_faults: u8,

    pub _pad: [u8; 32],
}

#[repr(u32)]
pub enum AccelDeviceClass {
    /// General-purpose GPU (compute + graphics + display).
    Gpu             = 0,
    /// Compute-only GPU (no display, e.g., datacenter SKUs).
    GpuCompute      = 1,
    /// Neural Processing Unit (fixed-function inference).
    Npu             = 2,
    /// Tensor Processing Unit / AI ASIC.
    Tpu             = 3,
    /// FPGA with compute overlay.
    Fpga            = 4,
    /// Digital Signal Processor.
    Dsp             = 5,
    /// Media processor (video encode/decode).
    MediaProcessor  = 6,
    /// Computational storage device with embedded compute capability.
    ComputeStorage  = 7,
    /// Other / vendor-specific.
    Other           = 255,
}

/// See `AccelPreemptionGranularity` (Section 21.1.2.6a) for full enum definition with design rationale.
pub type PreemptionGranularity = AccelPreemptionGranularity;

/// Reason for requesting preemption of a running accelerator context.
/// Passed to `preempt_context` so the driver can log/report the cause
/// and, on devices that support it, choose an appropriate preemption
/// strategy (e.g., urgent drain vs. graceful yield).
#[repr(u32)]
pub enum PreemptReason {
    /// A higher-priority submission is waiting behind this context.
    /// The scheduler detected priority inversion and is preempting the
    /// lower-priority context to unblock the higher-priority one.
    PriorityInversion = 0,
    /// The context has exceeded its CBS bandwidth budget for the current
    /// scheduling period. Fairness enforcement requires yielding the
    /// device so other contexts can use their guaranteed bandwidth.
    FairnessTimeout   = 1,
    /// The context's current submission has exceeded its
    /// `AccelContextLimits::max_execution_us` timeout. The scheduler is
    /// preempting to enforce the per-submission execution time limit.
    ExecutionTimeout  = 2,
    /// Administrative eviction: the context is being forcibly removed
    /// from the device. Causes include driver unload, device reset,
    /// cgroup removal, or process exit cleanup.
    AdminEvict        = 3,
    /// A soft watchdog timer fired because the submission has been
    /// executing longer than `AccelDeviceHrotCaps::soft_timeout_ms`.
    /// This is a warning-level preemption request: on preemptible
    /// hardware the driver should attempt a graceful yield; if the
    /// submission does not complete within the hard timeout window
    /// the kernel escalates to a full device reset via `accel_hard_reset`.
    /// Distinct from `ExecutionTimeout` (which enforces the per-submission
    /// `AccelContextLimits::max_execution_us` CBS budget); this variant
    /// is the HROT watchdog path.
    WatchdogSoftTimeout = 4,
}

/// Per-context resource limits (enforced by kernel).
#[repr(C)]
pub struct AccelContextLimits {
    /// Maximum device memory this context can allocate (bytes).
    /// 0 = no limit (subject to device capacity).
    pub max_memory_bytes: u64,

    /// Maximum compute time per submission (microseconds).
    /// 0 = no limit. Submissions exceeding this are preempted.
    pub max_execution_us: u64,

    /// Bandwidth guarantee: guaranteed microseconds of compute
    /// per scheduling period. See Section 21.3 (cgroup integration).
    /// 0 = best-effort (no guarantee).
    pub guaranteed_bandwidth_us: u64,

    /// Scheduling period for bandwidth accounting (microseconds).
    /// Default: 1_000_000 (1 second).
    pub bandwidth_period_us: u64,

    pub _pad: [u8; 32],
}
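The interaction of `guaranteed_bandwidth_us` and `bandwidth_period_us` can be illustrated with a minimal per-period budget tracker. This is a hypothetical sketch (the `BandwidthAccount` type and `charge` helper are invented for illustration; the actual CBS accounting is scheduler-internal, Section 21.3):

```rust
/// Minimal sketch of per-period bandwidth accounting for one context.
struct BandwidthAccount {
    guaranteed_us: u64,   // AccelContextLimits::guaranteed_bandwidth_us
    period_us: u64,       // AccelContextLimits::bandwidth_period_us
    consumed_us: u64,     // compute time used in the current period
    period_start_us: u64, // start of the current accounting period
}

impl BandwidthAccount {
    /// Charge `exec_us` of compute time at time `now_us`. Returns true
    /// while the context is still within its guaranteed budget; false
    /// once the budget is exhausted and the context runs best-effort
    /// (or is preempted with PreemptReason::FairnessTimeout).
    fn charge(&mut self, now_us: u64, exec_us: u64) -> bool {
        if now_us.saturating_sub(self.period_start_us) >= self.period_us {
            // New period: refill the budget.
            self.period_start_us = now_us;
            self.consumed_us = 0;
        }
        self.consumed_us += exec_us;
        self.consumed_us <= self.guaranteed_us
    }
}

fn main() {
    // 100ms guaranteed per 1s period (the default period from the struct).
    let mut acct = BandwidthAccount {
        guaranteed_us: 100_000,
        period_us: 1_000_000,
        consumed_us: 0,
        period_start_us: 0,
    };
    assert!(acct.charge(10_000, 60_000));    // 60ms used: within budget
    assert!(!acct.charge(20_000, 50_000));   // 110ms used: over budget
    assert!(acct.charge(1_000_000, 30_000)); // next period: budget refilled
    println!("bandwidth accounting ok");
}
```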

/// Scheduling priority for accelerator contexts.
#[repr(u32)]
pub enum AccelPriority {
    /// Background / batch. Lowest priority. Preempted by all others.
    Background  = 0,
    /// Normal interactive workload.
    Normal      = 1,
    /// High priority (e.g., real-time inference with SLO).
    High        = 2,
    /// Realtime. Highest priority. Preempts all others immediately.
    Realtime    = 3,
}

/// Entry describing a page to migrate between CPU and device memory.
/// Used by migrate_pages() in AccelBaseVTable.
#[repr(C)]
pub struct AccelMigrationEntry {
    /// Virtual address of the page to migrate (must be page-aligned).
    pub vaddr: u64,
    /// Current location: 0 = CPU RAM, 1 = device memory.
    pub current_location: u8,
    /// Desired location after migration.
    pub target_location: u8,
    // 2 bytes implicit padding (repr(C) alignment for i32).
    /// Result of migration (filled by driver): 0 = success, negative errno on failure.
    pub result: i32,
}

bitflags! {
    /// Flags controlling page migration behavior.
    #[repr(C)]
    pub struct MigrationFlags: u32 {
        /// Migrate from CPU RAM to device memory.
        const TO_DEVICE   = 1 << 0;
        /// Migrate from device memory to CPU RAM.
        const TO_CPU      = 1 << 1;
        /// Force migration even if page is hot/pinned.
        const FORCE       = 1 << 2;
        /// Asynchronous migration (return immediately, complete via callback).
        const ASYNC       = 1 << 3;
        /// Prefetch hint (migrate pages likely to be accessed soon).
        const PREFETCH    = 1 << 4;
    }
}
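Since `TO_DEVICE` and `TO_CPU` describe opposite directions, a `migrate_pages` call presumably must set exactly one of them. A minimal validity check, with plain `u32` constants standing in for the `bitflags!` type (the `direction_is_valid` helper is hypothetical, not part of the KABI):

```rust
// Direction and modifier bits from MigrationFlags, as plain u32s for this sketch.
const TO_DEVICE: u32 = 1 << 0;
const TO_CPU: u32 = 1 << 1;
const ASYNC: u32 = 1 << 3;

/// Hypothetical validation: exactly one direction bit must be set.
/// XOR is true iff one side is set and the other is not.
fn direction_is_valid(flags: u32) -> bool {
    (flags & TO_DEVICE != 0) ^ (flags & TO_CPU != 0)
}

fn main() {
    assert!(direction_is_valid(TO_DEVICE | ASYNC));
    assert!(!direction_is_valid(TO_DEVICE | TO_CPU)); // ambiguous direction
    assert!(!direction_is_valid(ASYNC));              // no direction at all
    println!("migration flag check ok");
}
```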

/// Device utilization report.
#[repr(C)]
pub struct AccelUtilization {
    /// Compute utilization 0-100%.
    pub compute_percent: u32,
    /// Memory bandwidth utilization 0-100%.
    pub memory_bw_percent: u32,
    /// Memory usage (bytes allocated).
    pub memory_used_bytes: u64,
    /// Memory total (bytes).
    pub memory_total_bytes: u64,
    /// Number of active contexts.
    pub active_contexts: u32,
    /// Current temperature (millidegrees Celsius).
    pub temperature_mc: u32,
    /// Current power draw (milliwatts).
    pub power_mw: u32,
    /// Current clock frequency (MHz).
    pub clock_mhz: u32,
}

/// Bitflags indicating supported memory types for an accelerator device.
/// Used in `AccelDeviceInfo::memory_types` to describe what kinds of memory
/// the device can allocate or access.
#[repr(transparent)]
pub struct AccelMemTypeFlags(pub u32);

impl AccelMemTypeFlags {
    /// Device-local VRAM (GPU-attached HBM, GDDR, etc.).
    pub const DEVICE_LOCAL: Self   = Self(1 << 0);
    /// Host-visible memory (CPU-accessible, uncached on device).
    pub const HOST_VISIBLE: Self   = Self(1 << 1);
    /// Host-coherent memory (no explicit flush/invalidate needed).
    pub const HOST_COHERENT: Self  = Self(1 << 2);
    /// System memory accessible via SVM/unified addressing.
    pub const SYSTEM_SVM: Self     = Self(1 << 3);
    /// Peer-to-peer accessible memory (other devices can DMA directly).
    pub const PEER_ACCESSIBLE: Self = Self(1 << 4);
}

/// Current power and thermal state of an accelerator device.
/// Returned by `AccelBaseVTable::get_power_state()`.
#[repr(C)]
pub struct AccelPowerState {
    /// Current power consumption in milliwatts.
    pub power_mw: u32,
    /// Current temperature in millidegrees Celsius (e.g., 75000 = 75.0°C).
    pub temperature_mc: u32,
    /// Current ACPI-style device power state.
    pub device_state: AccelDevicePowerLevel,
    /// Thermal throttling active (0 = no, 1 = yes).
    /// `u8` for KABI stability (C `bool` has implementation-defined size).
    pub throttled: u8,
    pub _pad: [u8; 3],
}

/// ACPI-style device power levels for accelerators.
#[repr(u32)]
pub enum AccelDevicePowerLevel {
    /// D0: Fully operational.
    D0Active    = 0,
    /// D1: Low-power idle (fast resume, ~microseconds).
    D1LowPower  = 1,
    /// D2: Deeper sleep (slower resume, ~milliseconds).
    D2Standby   = 2,
    /// D3: Off (full re-initialization required on resume).
    D3Off       = 3,
}

/// Performance level hint passed to `AccelBaseVTable::set_performance_level()`.
/// The driver maps this to device-specific clock/voltage settings.
#[repr(u32)]
pub enum AccelPerfLevel {
    /// Minimum clocks — lowest power, suitable for idle or light desktop compositing.
    Low       = 0,
    /// Balanced — driver chooses mid-range clocks.
    Medium    = 1,
    /// Maximum clocks — full performance, highest power/thermal.
    High      = 2,
    /// Boost — temporary overclock above nominal max (thermal-limited duration).
    Boost     = 3,
    /// Adaptive — driver uses internal DVFS; kernel does not pin a level.
    Adaptive  = 4,
}
// Opaque handle types (all u64 newtypes, #[repr(transparent)] for
// zero-cost FFI — the newtype has the same ABI as the inner u64).
#[repr(transparent)]
pub struct AccelContextHandle(pub u64);
#[repr(transparent)]
pub struct AccelMemHandle(pub u64);
#[repr(transparent)]
pub struct AccelSubmissionHandle(pub u64);
#[repr(transparent)]
pub struct P2pMappingHandle(pub u64);
/// Fence for synchronizing command submissions.
// The (device_id, context_id, value) tuple uniquely identifies a fence point.
// Cross-device fence signaling requires both devices to be registered in the
// same AccelFenceRegistry (Section 21.1.2.4).
#[repr(C)]
pub struct AccelFence {
    pub fence_type: AccelFenceType,
    pub value: u64,
    pub device_id: DeviceNodeId,  // Owning device (prevents cross-device forgery)
    pub context_id: u32,          // Owning context within the device
}

#[repr(u32)]
pub enum AccelFenceType {
    /// Device-local fence (GPU timeline semaphore).
    DeviceLocal  = 0,
    /// Cross-device fence (for P2P synchronization).
    CrossDevice  = 1,
    /// CPU-signalable fence (host-side event).
    CpuSignal    = 2,
}

#[repr(u32)]
pub enum AccelCompletionStatus {
    /// Command completed successfully.
    Success       = 0,
    /// Command failed with device error.
    Error         = 1,
    /// Command timed out (exceeded max_execution_us).
    Timeout       = 2,
    /// Command was preempted by higher-priority context.
    Preempted     = 3,
    /// Partial error: some work completed, some failed.
    PartialError  = 4,
}

/// Global registry for cross-device fence synchronization.
/// Devices must be registered in the same AccelFenceRegistry to share
/// cross-device fences. The registry is kernel-internal; userspace
/// interacts through /dev/umka-accel-N ioctls.
///
/// There is one global AccelFenceRegistry per accelerator topology domain
/// (typically one per NUMA node or PCIe domain). Devices in different
/// domains cannot share cross-device fences.
///
/// **Performance optimization (D2 fix):** The device map uses a hybrid
/// BTreeMap/flat-array design. BTreeMap provides O(log n) lookup flexibility
/// for device registration, but the hot path (cross-device fence polling)
/// uses a cached flat array for O(1) lookup:
pub struct AccelFenceRegistry {
    /// Registered devices and their per-device fence state.
    ///
    /// The device map changes only on device register/unregister (rare —
    /// hot-plug events). RCU-protected: cross-device fence lookups read
    /// the map lock-free under `rcu_read_lock()`; registration mutations
    /// clone-and-swap under `devices_update_lock`.
    devices: RcuPtr<Arc<BTreeMap<DeviceNodeId, Arc<DeviceFenceState>>>>,

    /// Cached flat array of device states for O(1) lookup during fence polling.
    /// This array is rebuilt on every device registration/unregistration (cold path).
    /// Index in the array corresponds to `device_id.0` for device IDs < 64.
    /// For device IDs >= 64, the BTreeMap fallback is used.
    ///
    /// Rationale: In multi-GPU systems, device IDs are typically assigned
    /// sequentially starting from 0. An 8-GPU system has device IDs 0-7,
    /// so only the first 8 of the 64 slots are populated. Fence polling
    /// becomes O(1) array indexing instead of O(log n) BTreeMap lookup.
    device_array: RcuPtr<Arc<[Option<Arc<DeviceFenceState>>; 64]>>,

    /// Serializes device registration / unregistration (cold path only).
    devices_update_lock: Mutex<()>,
}

/// Maximum device ID that can use the O(1) array lookup path.
/// Device IDs >= MAX_DEVICE_ARRAY_ID fall back to BTreeMap lookup.
/// This limit is chosen to balance array size (64 × 8 bytes = 512 bytes)
/// against coverage (covers 8-GPU systems with room for other accelerators).
pub const MAX_DEVICE_ARRAY_ID: u64 = 64;

/// Maximum concurrent waiters on a single GPU fence.
/// Exceeding this limit returns EAGAIN to the caller.
pub const MAX_FENCE_WAITERS: u32 = 64;

/// Per-device fence state. Sharded by device: cross-device fence polling
/// only contends with operations targeting the *same* device, not all
/// devices globally. An 8-GPU training job has 8 independent fence
/// tables instead of one global bottleneck.
pub struct DeviceFenceState {
    /// Fence protocol capabilities for this device.
    pub protocol: FenceProtocolSupport,

    /// Active fences exported by this device, keyed by (context_id, value).
    /// Per-device RwLock: polling a fence on GPU-A takes GPU-A's read lock
    /// only — GPU-B's fence operations are uncontended.
    pub fences: spin::RwLock<BTreeMap<(u32, u64), Arc<AtomicBool>>>,

    /// Waiters registered for this device's fences. Callback-based:
    /// when a fence signals, the owning device iterates its waiter list.
    pub waiters: spin::RwLock<ArrayVec<FenceWaiterEntry, MAX_FENCE_WAITERS>>,
}

// **Cross-device fence polling path** (hot path for multi-GPU):
//   Optimized for O(1) lookup when device_id < MAX_DEVICE_ARRAY_ID:
//   1. rcu_read_lock() — lock-free
//   2. If device_id.0 < MAX_DEVICE_ARRAY_ID: array index — O(1)
//      Else: BTreeMap lookup — O(log n) fallback for large device IDs
//   3. Arc::clone the DeviceFenceState — one atomic increment
//   4. rcu_read_unlock()
//   5. Take per-device `fences.read()` — only contends with same-device ops
//   6. Look up (context_id, value) — O(log m) where m = fences on that device
//   7. Read AtomicBool — single atomic load
//
// For typical 8-GPU systems (device IDs 0-7), step 2 is O(1) array indexing.
// The BTreeMap fallback handles edge cases (device IDs >= 64, sparse IDs).
// Total contention: per-device only. No global serialization.
//
// Performance note: with 8 registered devices (each holding up to ~1M
// fences), the previous design's global BTreeMap keyed by device ID cost
// O(log 8) ≈ 3 comparisons per device poll; the array optimization makes
// the device lookup O(1). Fence lookup within a device remains O(log m)
// (m = fences on that device), but that cost is dominated by the
// per-device RwLock anyway. Estimated speedup: ~2-3× for cross-device
// fence polling in large systems.
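The hybrid array/BTreeMap lookup in step 2 can be sketched with std types standing in for the kernel's RCU and per-device locks. The registry shape and field names here are illustrative, not the kernel's actual definitions:

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

// Assumed cutoff for the dense fast path (matches the comment above).
const MAX_DEVICE_ARRAY_ID: usize = 64;

// Stand-in for the per-device fence state; only the id matters here.
struct DeviceFenceState {
    id: u64,
}

struct FenceRegistry {
    // Fast path: small device IDs index directly into this array.
    dense: [Option<Arc<DeviceFenceState>>; MAX_DEVICE_ARRAY_ID],
    // Fallback: device IDs >= MAX_DEVICE_ARRAY_ID go through the map.
    sparse: BTreeMap<u64, Arc<DeviceFenceState>>,
}

impl FenceRegistry {
    fn new() -> Self {
        Self {
            dense: std::array::from_fn(|_| None),
            sparse: BTreeMap::new(),
        }
    }

    // O(1) array index when possible, O(log n) map lookup otherwise.
    // Arc::clone is a single atomic increment, matching step 3 above.
    fn get_device(&self, device_id: u64) -> Option<Arc<DeviceFenceState>> {
        if (device_id as usize) < MAX_DEVICE_ARRAY_ID {
            self.dense[device_id as usize].clone()
        } else {
            self.sparse.get(&device_id).cloned()
        }
    }
}
```

The cloned `Arc` is what lets step 4 drop the RCU read lock before touching the per-device fence table.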

/// Fence protocol capabilities for a registered device.
#[repr(C)]
pub struct FenceProtocolSupport {
    /// Device supports timeline semaphores (monotonically increasing value).
    /// u8 (0=false, 1=true) for KABI stability (bool size is not guaranteed
    /// across compiler versions; see Section 21.1.2.3 KABI rules).
    pub timeline_semaphores: u8,
    /// Device supports cross-device signaling via hardware sync objects.
    pub hw_cross_device: u8,
    /// Device supports CPU-signaling of device fences (for host wait).
    pub cpu_signal: u8,
    /// Explicit padding for u64 alignment.
    pub _pad: [u8; 5],
    /// Maximum fence value before wrap (0 = no limit).
    pub max_fence_value: u64,
}
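The KABI rule motivating the u8 flags can be enforced at build time. A sketch mirroring the struct above, assuming the 16-byte layout implied by its fields (const-context asserts on size and alignment are stable Rust):

```rust
// Mirror of FenceProtocolSupport with compile-time layout checks that
// fail the build if a field edit silently changes the KABI.
#[repr(C)]
pub struct FenceProtocolSupport {
    pub timeline_semaphores: u8,
    pub hw_cross_device: u8,
    pub cpu_signal: u8,
    pub _pad: [u8; 5],
    pub max_fence_value: u64,
}

// 3 flag bytes + 5 padding bytes + one u64 = 16 bytes, 8-byte aligned.
const _: () = assert!(core::mem::size_of::<FenceProtocolSupport>() == 16);
const _: () = assert!(core::mem::align_of::<FenceProtocolSupport>() == 8);
```

If a maintainer swaps a `u8` flag for `bool` or reorders fields, the build fails instead of shipping a silent ABI break.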

/// Entry in the fence waiter table.
struct FenceWaiterEntry {
    /// Fence being waited on.
    fence: AccelFence,
    /// Device waiting for the fence.
    waiter_device: DeviceNodeId,
    /// Callback to invoke when fence signals (driver-provided).
    callback: unsafe extern "C" fn(device_id: DeviceNodeId, fence: AccelFence),
}

Cross-device fence registration authentication (E3 fix):

To prevent unauthorized cross-device fence waiting (which could leak information about other devices' workloads), the kernel requires the ACCEL_P2P capability (0x0102) for both devices involved in a cross-device fence wait:

/// Register a cross-device fence waiter.
///
/// This allows `waiter_device` to wait on a fence owned by `fence.device_id`.
/// Authentication is required to prevent information leakage and fence forgery.
///
/// **Security requirement**: The caller must hold `ACCEL_P2P` capability (0x0102)
/// for BOTH the waiter_device AND the fence.device_id. This ensures that only
/// processes authorized for peer-to-peer access can establish cross-device
/// synchronization.
///
/// Returns:
/// - `IO_OK` on success
/// - `EACCES` if caller lacks ACCEL_P2P for either device
/// - `ENOENT` if the fence does not exist
/// - `EAGAIN` if the waiter table is full
fn register_cross_device_fence_wait(
    registry: &AccelFenceRegistry,
    fence: &AccelFence,
    waiter_device: DeviceNodeId,
    callback: unsafe extern "C" fn(DeviceNodeId, AccelFence),
    caller_caps: &CapabilitySet,  // Caller's current capabilities
) -> IoResultCode {
    // Authentication check (E3 fix):
    // Require ACCEL_P2P for BOTH devices to prevent unauthorized cross-device sync.
    // Error constants follow the IO_OK naming (IO_EACCES, IO_ENOENT, IO_EAGAIN).
    if !caller_caps.has_cap_for_device(Capability::ACCEL_P2P, fence.device_id) {
        return IO_EACCES;  // Not authorized for fence.device_id
    }
    if !caller_caps.has_cap_for_device(Capability::ACCEL_P2P, waiter_device) {
        return IO_EACCES;  // Not authorized for waiter_device
    }

    // Look up the target device's fence state
    let Some(target_device_state) = registry.get_device(fence.device_id) else {
        return IO_ENOENT;  // Device not registered
    };

    // Verify the fence exists. Existence check only: the waiter entry stores
    // the fence by value, so no reference into the read-locked map is retained.
    let fence_key = (fence.context_id, fence.value);
    if !target_device_state.fences.read().contains_key(&fence_key) {
        return IO_ENOENT;  // Fence does not exist
    }

    // Add waiter (with capacity check)
    let mut waiters = target_device_state.waiters.write();
    if waiters.len() >= MAX_FENCE_WAITERS as usize {
        return IO_EAGAIN;  // Waiter table full
    }

    waiters.push(FenceWaiterEntry {
        fence: *fence,
        waiter_device,
        callback,
    });

    IO_OK
}

Rationale for dual-device capability check:

  • Without the check: A compromised Tier 1 driver for device A could register waiters on device B's fences (owned by a different process/cgroup), learning when device B completes work (timing side-channel) or potentially signaling fences it doesn't own.
  • With the check: Only processes that already have ACCEL_P2P for both devices can establish cross-device synchronization. This is consistent with the P2P DMA authorization model (Section 21.2.2.5).
  • Capability binding: The ACCEL_P2P capability is bound to specific device pairs at grant time, preventing replay attacks (same as P2P DMA ACL anti-replay).

21.1.2.4 Kernel-Side Accelerator Scheduler

The kernel maintains an accelerator scheduler that sits between userspace submissions and the driver's submit_commands. This is the core innovation — the kernel sees and controls accelerator time.

// umka-core/src/accel/scheduler.rs (kernel-internal)

/// Per-accelerator scheduler.
pub struct AccelScheduler {
    /// Device this scheduler manages.
    device_id: DeviceNodeId,

    /// Device capabilities (cached from get_info).
    device_info: AccelDeviceInfo,

    /// Active contexts, ordered by priority and deadline.
    /// Uses a fixed-capacity sorted array (no heap allocation).
    /// Capacity is set to `device_info.max_contexts` at initialization.
    contexts: FixedSortedArray<AccelContextHandle, AccelContextState>,

    /// Per-context CBS bandwidth servers (same algorithm as CPU scheduler,
    /// see Section 6.3). Pre-allocated at initialization with capacity
    /// equal to `device_info.max_contexts`.
    bandwidth_servers: FixedVec<AccelCbsServer>,

    /// Hardware command queue depth (how many submissions are in-flight).
    hw_queue_depth: AtomicU32,

    /// Maximum concurrent in-flight submissions.
    max_inflight: u32,

    /// Scheduling policy.
    policy: AccelSchedPolicy,
}

pub struct AccelContextState {
    /// Owner process/cgroup.
    owner_pid: u32,
    cgroup_id: u64,

    /// Priority class.
    priority: AccelPriority,

    /// Resource limits.
    limits: AccelContextLimits,

    /// Index into `AccelScheduler::bandwidth_servers` (if guaranteed
    /// bandwidth is set). The CBS server state is owned by the scheduler's
    /// `bandwidth_servers` array — storing it here as well would create
    /// a consistency hazard (two copies of the same server state). The
    /// scheduler uses this index to look up the context's CBS server
    /// during scheduling decisions.
    cbs_server_index: Option<u32>,

    /// Pending command buffers waiting to be submitted to hardware.
    /// Fixed-capacity ring buffer, pre-allocated at context creation.
    pending_queue: FixedRingBuffer<PendingSubmission>,

    /// In-flight submissions (on hardware).
    /// Fixed-capacity array, sized to `max_inflight` at initialization.
    inflight: FixedVec<InflightSubmission>,

    /// Accounting: total compute time consumed (nanoseconds).
    total_compute_ns: AtomicU64,

    /// Accounting: total memory allocated (bytes).
    total_memory_bytes: AtomicU64,
}

/// Pending submission waiting to be dispatched to hardware.
/// Stored in the context's `pending_queue` until the scheduler submits it.
#[repr(C)]
pub struct PendingSubmission {
    /// Handle to the command buffer (driver-allocated).
    pub cmd_buffer: AccelCmdBufferHandle,

    /// Submission ID assigned by the driver for tracking.
    pub driver_submit_id: u64,

    /// Timestamp when submission was queued (for timeout detection).
    pub queued_timestamp_ns: u64,

    /// Number of fences this submission depends on (0 = immediate submit).
    pub dependency_count: u32,

    /// Whether `completion_semaphore` is valid (1 = present, 0 = none).
    /// Separate flag used instead of `Option<T>` because `Option` has no
    /// stable layout guarantee in repr(C) structs.
    pub has_completion_semaphore: u8,
    /// Padding for alignment.
    pub _pad: [u8; 7],
    /// Semaphore to signal on completion. Valid only when
    /// `has_completion_semaphore == 1`.
    pub completion_semaphore: AccelSemaphoreHandle,
}

/// Opaque handle identifying a fence for synchronization at the KABI boundary.
/// Drivers receive this handle when submitting work and poll or wait on it
/// to determine when the hardware has completed the submission.
/// Matches `AccelFence.id` on the kernel side.
pub type AccelFenceHandle = u64;

/// An in-flight submission tracked by the scheduler until completion.
/// Stored in the context's `inflight` array. Not to be confused with
/// `AccelSubmissionHandle` (the opaque u64 KABI handle returned to drivers).
#[repr(C)]
pub struct InflightSubmission {
    /// Driver-assigned submission ID (matches PendingSubmission::driver_submit_id).
    pub driver_submit_id: u64,

    /// Timestamp when submitted to hardware (for timeout detection).
    pub submitted_timestamp_ns: u64,

    /// Fence that will be signaled on completion.
    pub completion_fence: AccelFenceHandle,
}

#[repr(u32)]
pub enum AccelSchedPolicy {
    /// Simple round-robin between contexts. Default for NPU/FPGA.
    RoundRobin  = 0,
    /// Priority-based with preemption. Default for GPU.
    Priority    = 1,
    /// CBS bandwidth guarantee + priority. Default when cgroup limits set.
    Guaranteed  = 2,
}

/// CBS (Constant Bandwidth Server) state for per-context bandwidth enforcement.
/// Uses the same algorithm as the CPU scheduler (Section 6.3), adapted for
/// accelerator time. Each context with a bandwidth guarantee gets its own
/// AccelCbsServer instance in the scheduler's bandwidth_servers array.
#[repr(C)]
pub struct AccelCbsServer {
    /// Context this server is associated with.
    pub context_id: AccelContextHandle,

    /// Maximum bandwidth (nanoseconds of accelerator time per period).
    pub bandwidth_ns: u64,

    /// Period in nanoseconds (e.g., 10ms = 10_000_000).
    pub period_ns: u64,

    /// Current runtime consumed in this period (nanoseconds).
    pub runtime_consumed: AtomicU64,

    /// Absolute deadline (nanoseconds since boot). Computed as
    /// period_start + period_ns when the context is activated.
    pub deadline: AtomicU64,

    /// Start of the current period (nanoseconds since boot).
    pub period_start: AtomicU64,
}

Allocation discipline: All scheduler data structures use pre-allocated, fixed-capacity storage. No heap allocation occurs on the scheduling fast path. FixedSortedArray, FixedVec, and FixedRingBuffer are kernel-internal types that allocate their backing storage once at initialization (when the device is registered or a context is created) and never resize. This ensures that submit_commands, poll_completion, and context scheduling are deterministic-latency operations with no allocator contention.

Fixed-capacity container types:

// umka-core/src/collections/fixed.rs (kernel-internal, not part of KABI)

/// Fixed-capacity sorted array with O(log n) lookup and O(n) insert.
/// Backing storage is allocated once at initialization and never resized.
/// Capacity is set at construction; attempting to insert beyond capacity
/// returns an error rather than allocating.
///
/// Generic parameters:
/// - `K`: Key type (must be `Ord + Copy`)
/// - `V`: Value type (stored inline)
///
/// The array is kept sorted by key on every insert. Use when:
/// - Capacity is small (< 1024 entries)
/// - Lookups dominate inserts
/// - Deterministic latency is required
pub struct FixedSortedArray<K: Ord + Copy, V> {
    /// Pointer to pre-allocated storage: [(K, V); capacity]
    data: NonNull<(K, V)>,
    /// Current number of elements (<= capacity).
    len: AtomicUsize,
    /// Maximum capacity (fixed at construction).
    capacity: usize,
}

impl<K: Ord + Copy, V> FixedSortedArray<K, V> {
    /// Construct a new fixed sorted array with given capacity.
    /// The backing storage is allocated immediately and zero-initialized.
    /// Returns `None` if allocation fails.
    pub fn new(capacity: usize) -> Option<Self>;

    /// Insert a (key, value) pair, keeping the array sorted by key.
    /// Returns `Err(())` if the array is full (no allocation on error).
    /// Time complexity: O(n) due to element shifting.
    pub fn insert(&self, key: K, value: V) -> Result<(), ()>;

    /// Remove an element by key.
    /// Returns `Some(value)` if found, `None` otherwise.
    /// Time complexity: O(n) due to element shifting.
    pub fn remove(&self, key: K) -> Option<V>;

    /// Get a reference to the value with the given key.
    /// Time complexity: O(log n) via binary search.
    pub fn get(&self, key: K) -> Option<&V>;

    /// Get the minimum element (first in sorted order).
    /// Returns `None` if the array is empty.
    /// Time complexity: O(1).
    pub fn min(&self) -> Option<&(K, V)>;

    /// Get the maximum element (last in sorted order).
    /// Returns `None` if the array is empty.
    /// Time complexity: O(1).
    pub fn max(&self) -> Option<&(K, V)>;

    /// Return the current number of elements.
    pub fn len(&self) -> usize;

    /// Return whether the array is empty.
    pub fn is_empty(&self) -> bool {
        self.len() == 0
    }

    /// Return the maximum capacity.
    pub fn capacity(&self) -> usize {
        self.capacity
    }

    /// Iterate over all elements in sorted order.
    /// The iterator is valid even if elements are removed during iteration
    /// (removed elements are skipped).
    pub fn iter(&self) -> FixedSortedIter<'_, K, V>;
}
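The insert/lookup semantics can be modeled with a Vec stand-in (capacity enforced manually). This is illustrative only: the kernel type uses pre-allocated NonNull storage and an atomic length, not a Vec:

```rust
/// Vec-backed model of FixedSortedArray semantics: sorted insert with a
/// hard capacity cap and binary-search lookup.
struct SortedModel<K: Ord + Copy, V> {
    data: Vec<(K, V)>,
    capacity: usize,
}

impl<K: Ord + Copy, V> SortedModel<K, V> {
    fn new(capacity: usize) -> Self {
        // with_capacity models the one-time allocation at construction.
        Self { data: Vec::with_capacity(capacity), capacity }
    }

    /// O(n) insert: binary-search the slot, shift the tail. Err(()) when full.
    fn insert(&mut self, key: K, value: V) -> Result<(), ()> {
        if self.data.len() == self.capacity {
            return Err(()); // full: no allocation, no mutation
        }
        let pos = self.data.partition_point(|(k, _)| *k < key);
        self.data.insert(pos, (key, value));
        Ok(())
    }

    /// O(log n) lookup via binary search on the sorted keys.
    fn get(&self, key: K) -> Option<&V> {
        self.data
            .binary_search_by_key(&key, |(k, _)| *k)
            .ok()
            .map(|i| &self.data[i].1)
    }

    /// O(1): first element in sorted order.
    fn min(&self) -> Option<&(K, V)> {
        self.data.first()
    }
}
```

The capacity check before mutation mirrors the kernel's no-allocation-on-error guarantee.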

/// Fixed-capacity vector (dynamic array) with O(1) push/pop and O(1) indexing.
/// Backing storage is allocated once at initialization and never resized.
/// Capacity is set at construction; attempting to push beyond capacity
/// returns an error rather than allocating.
///
/// Generic parameters:
/// - `T`: Element type (stored inline)
///
/// Use when:
/// - Random access is required
/// - Push/pop at end dominates
/// - Deterministic latency is required
pub struct FixedVec<T> {
    /// Pointer to pre-allocated storage: [T; capacity]
    data: NonNull<T>,
    /// Current number of elements (<= capacity).
    len: AtomicUsize,
    /// Maximum capacity (fixed at construction).
    capacity: usize,
}

impl<T> FixedVec<T> {
    /// Construct a new fixed vector with given capacity.
    /// The backing storage is allocated immediately and zero-initialized.
    /// Returns `None` if allocation fails.
    pub fn new(capacity: usize) -> Option<Self>;

    /// Push an element to the end of the vector.
    /// Returns `Err(element)` if the vector is full (no mutation on error).
    /// Time complexity: O(1).
    pub fn push(&self, value: T) -> Result<(), T>;

    /// Pop an element from the end of the vector.
    /// Returns `None` if the vector is empty.
    /// Time complexity: O(1).
    pub fn pop(&self) -> Option<T>;

    /// Get a reference to the element at the given index.
    /// Returns `None` if index is out of bounds.
    /// Time complexity: O(1).
    pub fn get(&self, index: usize) -> Option<&T>;

    /// Get a mutable reference to the element at the given index.
    /// Requires exclusive access (`&mut self`) to ensure no aliasing.
    /// Returns `None` if index is out of bounds.
    /// Time complexity: O(1).
    pub fn get_mut(&mut self, index: usize) -> Option<&mut T>;

    /// Return the current number of elements.
    pub fn len(&self) -> usize;

    /// Return whether the vector is empty.
    pub fn is_empty(&self) -> bool {
        self.len() == 0
    }

    /// Return the maximum capacity.
    pub fn capacity(&self) -> usize {
        self.capacity
    }

    /// Iterate over all elements.
    pub fn iter(&self) -> FixedVecIter<'_, T>;

    /// Iterate over all elements mutably.
    pub fn iter_mut(&mut self) -> FixedVecIterMut<'_, T>;
}

/// Fixed-capacity single-producer single-consumer (SPSC) ring buffer.
/// Backing storage is allocated once at initialization and never resized.
/// Capacity is set at construction; enqueue returns `None` if full,
/// dequeue returns `None` if empty.
///
/// Generic parameters:
/// - `T`: Element type (stored inline)
///
/// Use when:
/// - FIFO ordering is required
/// - Bounded buffering is acceptable
/// - Lock-free SPSC operation is desired
///
/// **Note**: `FixedRingBuffer` is SPSC by default. For MPSC scenarios,
/// wrap access in a `SpinLock` or use multiple SPSC rings (one per producer).
/// The scheduler's per-context `pending_queue` is SPSC because only the
/// context owner enqueues and only the scheduler dequeues.
pub struct FixedRingBuffer<T> {
    /// Pointer to pre-allocated storage: [T; capacity]
    data: NonNull<T>,
    /// Capacity (fixed at construction).
    capacity: usize,
    /// Head index (write position, producer-owned).
    head: AtomicUsize,
    /// Tail index (read position, consumer-owned).
    tail: AtomicUsize,
}

impl<T> FixedRingBuffer<T> {
    /// Construct a new fixed ring buffer with given capacity.
    /// Capacity must be a power of two (enforced at construction).
    /// The backing storage is allocated immediately and zero-initialized.
    /// Returns `None` if allocation fails or capacity is not a power of two.
    pub fn new(capacity: usize) -> Option<Self>;

    /// Enqueue an element at the tail (producer operation).
    /// Returns `Err(element)` if the buffer is full (no mutation on error).
    /// Time complexity: O(1). Safe for concurrent producer/consumer.
    pub fn enqueue(&self, value: T) -> Result<(), T>;

    /// Dequeue an element from the head (consumer operation).
    /// Returns `None` if the buffer is empty.
    /// Time complexity: O(1). Safe for concurrent producer/consumer.
    pub fn dequeue(&self) -> Option<T>;

    /// Peek at the next element without removing it.
    /// Returns `None` if the buffer is empty.
    /// Time complexity: O(1).
    pub fn peek(&self) -> Option<&T>;

    /// Return the current number of elements in the buffer.
    /// Note: this is a snapshot and may be stale in concurrent usage.
    pub fn len(&self) -> usize;

    /// Return whether the buffer is empty.
    pub fn is_empty(&self) -> bool {
        self.len() == 0
    }

    /// Return whether the buffer is full.
    pub fn is_full(&self) -> bool {
        self.len() == self.capacity - 1  // One slot wasted to distinguish full/empty
    }

    /// Return the maximum capacity.
    pub fn capacity(&self) -> usize {
        self.capacity
    }

    /// Drain all elements from the buffer, calling `f` on each.
    /// This is a consumer operation that empties the buffer.
    pub fn drain<F>(&self, mut f: F)
    where
        F: FnMut(T),
    {
        while let Some(elem) = self.dequeue() {
            f(elem);
        }
    }
}
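The one-slot-wasted full/empty convention can be modeled with plain indices. This sketch strips the atomics; the kernel type publishes head/tail with Acquire/Release ordering for cross-CPU SPSC use:

```rust
/// Index-arithmetic model of FixedRingBuffer: power-of-two capacity with
/// one slot wasted to distinguish full from empty. Illustrative only.
struct RingModel<T> {
    data: Vec<Option<T>>,
    head: usize, // write position (producer-owned)
    tail: usize, // read position (consumer-owned)
}

impl<T> RingModel<T> {
    fn new(capacity: usize) -> Option<Self> {
        if !capacity.is_power_of_two() {
            return None; // mirrors the kernel constructor's capacity check
        }
        Some(Self {
            data: (0..capacity).map(|_| None).collect(),
            head: 0,
            tail: 0,
        })
    }

    fn mask(&self) -> usize {
        self.data.len() - 1 // power-of-two capacity makes '&' a cheap modulo
    }

    fn enqueue(&mut self, value: T) -> Result<(), T> {
        let next = (self.head + 1) & self.mask();
        if next == self.tail {
            return Err(value); // full: the slot before tail stays unused
        }
        self.data[self.head] = Some(value);
        self.head = next;
        Ok(())
    }

    fn dequeue(&mut self) -> Option<T> {
        if self.head == self.tail {
            return None; // empty
        }
        let v = self.data[self.tail].take();
        self.tail = (self.tail + 1) & self.mask();
        v
    }

    fn len(&self) -> usize {
        self.head.wrapping_sub(self.tail) & self.mask()
    }

    fn is_full(&self) -> bool {
        self.len() == self.data.len() - 1
    }
}
```

A capacity-4 ring therefore holds at most 3 elements, which is exactly what `is_full` in the kernel API above reports.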

Memory safety notes:

  • All three types use NonNull to store the data pointer, ensuring the non-null pointer optimization applies (no extra discriminant needed for Option).
  • The backing storage is allocated using PageAllocator::alloc() (4KB granularity) and is never freed until the owning object (device, context) is destroyed.
  • FixedVec and FixedSortedArray use AtomicUsize for len to allow concurrent reads without locking. Mutations (push, insert, remove) require external synchronization (typically a SpinLock or scheduler-level lock).
  • FixedRingBuffer is lock-free for single-producer single-consumer usage via atomic head/tail updates with appropriate memory ordering (Ordering::AcqRel for enqueue/dequeue).

Scheduling flow:

Userspace submits work (via /dev/umka-accel-N or DRM ioctl compat):
    |
    v
UmkaOS Core validates capabilities
  - Does this process have an AccelContext for this device?
  - Does this process's cgroup allow more accelerator time?
  - Does this context have memory budget for the command buffer?
    |
    v
AccelScheduler queues the submission
  - Assigns priority based on context priority + cgroup policy
  - Checks CBS bandwidth server: is this context within its guarantee?
    |
    v
AccelScheduler picks next submission to dispatch to hardware
  - Priority order: Realtime > High > Normal > Background
  - Within same priority: CBS server with earliest deadline first
  - If higher-priority work arrives and device supports preemption:
    preempt current context via driver's preempt_context()
    |
    v
Driver's submit_commands() sends work to hardware
    |
    v
Hardware completion interrupt
    |
    v
Driver's poll_completion() reports result
    |
    v
AccelScheduler updates accounting (compute time, memory)
    |
    v
Userspace receives completion notification

Scheduling algorithm specification:

The AccelScheduler uses a multi-level scheduling algorithm that combines priority-based scheduling with CBS (Constant Bandwidth Server) bandwidth guarantees:

// Scheduling decision pseudocode (Section 21.1.2.4)

/// Pick the next submission to dispatch to hardware.
/// Returns the context ID and submission index, or None if nothing is runnable.
/// Time complexity: O(n) where n = number of active contexts (linear scan
/// per priority class; n is typically < 64).
fn pick_next_submission(&self) -> Option<(AccelContextHandle, SubmissionIndex)> {
    // Level 1: Priority classes are strictly ordered.
    // Realtime > High > Normal > Background
    // A higher-priority context is always scheduled before a lower-priority one.

    for priority in [Priority::Realtime, Priority::High, Priority::Normal, Priority::Background] {
        // Level 2: Within the same priority class, use CBS deadline ordering.
        // Each context with a bandwidth guarantee has a CBS server with a deadline.
        // The context with the earliest deadline wins (Earliest Deadline First).

        let candidates = self.contexts.iter()
            .filter(|ctx| ctx.priority == priority && ctx.has_pending_work());

        if let Some(best) = candidates.min_by_key(|ctx| {
            ctx.cbs_server_index.map(|i| {
                self.bandwidth_servers[i].deadline.load(Ordering::Relaxed)
            }).unwrap_or(u64::MAX)  // No CBS = infinite deadline
        }) {
            // Tie-breaker: contexts are iterated in handle order and
            // min_by_key returns the first minimum, so identical deadlines
            // resolve to the lowest handle (deterministic, avoids starvation).
            return Some((best.handle, best.next_submission_index()));
        }
    }

    None  // No runnable contexts
}

/// Check if preemption is warranted when a higher-priority submission arrives.
/// Returns `true` if the scheduler should preempt the currently running context.
fn should_preempt(&self, running_ctx: &AccelContextState, higher_prio_ctx: &AccelContextState) -> bool {
    // Preemption is expensive (~50μs-10ms). Only preempt if:

    // 1. Running context is lower priority AND device supports preemption.
    if running_ctx.priority < higher_prio_ctx.priority
        && self.device_info.preemption_granularity >= AccelPreemptionGranularity::DrawBoundary
    {
        return true;
    }

    // 2. Running context has exceeded its CBS budget (fairness violation).
    if let Some(server_idx) = running_ctx.cbs_server_index {
        let server = &self.bandwidth_servers[server_idx];
        if server.runtime_consumed.load(Ordering::Relaxed) > server.bandwidth_ns {
            return true;  // Budget exceeded, preempt to give other contexts their share
        }
    }

    // 3. Running context has exceeded max_execution_us (timeout violation).
    if running_ctx.current_submission_exceeded_timeout() {
        return true;
    }

    // Otherwise: let the running context continue (avoid thrashing).
    false
}

Tie-breaking and determinism:

  • When two contexts have identical priority and identical CBS deadline, the scheduler uses the context handle value as a deterministic tie-breaker (lower handle wins). This ensures reproducible scheduling decisions across runs.
  • The scheduler does NOT use random tie-breaking (unlike some Linux CFS optimizations) because determinism is valued over fairness fine-tuning in accelerator scheduling.

Complexity:

  • pick_next_submission(): O(n) where n = number of active contexts (typically < 64). The FixedSortedArray keeps contexts sorted by handle, so iteration is cache-friendly.
  • should_preempt(): O(1) — simple comparisons and atomic loads.
  • CBS server update (on completion): O(1) — atomic increment of runtime_consumed, periodic deadline renewal.

Preemption interaction with sorted data structure:

  • Preemption does NOT require re-sorting the FixedSortedArray. The array is sorted by handle (static), not by dynamic state.
  • The scheduling decision (pick_next_submission) performs a linear scan filtered by priority, then finds the minimum deadline among candidates. This is O(n), but n is small (< 64 contexts typical).
  • For systems with many contexts (> 256), the scheduler can maintain a separate priority-indexed skip list for O(log n) lookup, but this is not implemented in the base design.


**CBS budget replenishment and submission charging (Section 21.1.2.4a):**

The pseudocode above omits the two side-effecting steps that make CBS work: replenishment of
exhausted budgets at the start of each scheduling decision, and budget charging after a
command buffer is selected. Both steps are mandatory; omitting either breaks CBS fairness.

```rust
// umka-core/src/accel/scheduler.rs — CBS replenishment and charging detail

/// Replenish CBS budgets for all contexts whose deadline has passed.
/// Called at the top of every `pick_next_submission` invocation, O(n) but n ≤ 64.
///
/// CBS replenishment rule (from Section 6.3):
///   if context.runtime_consumed >= context.bandwidth_ns
///      AND current_time >= context.deadline:
///     context.runtime_consumed = 0
///     context.period_start     = current_time
///     context.deadline         = current_time + context.period_ns
///
/// Contexts that have consumed their budget but whose deadline has NOT yet
/// passed remain ineligible until their deadline arrives (no early replenishment).
/// This enforces the bandwidth ceiling: a context cannot borrow future budget.
fn replenish_expired_budgets(sched: &mut AccelScheduler, now_ns: u64) {
    for ctx in sched.contexts.iter() {
        let Some(server_idx) = ctx.cbs_server_index else { continue };
        let server = &sched.bandwidth_servers[server_idx];

        let consumed = server.runtime_consumed.load(Ordering::Relaxed);
        let deadline = server.deadline.load(Ordering::Relaxed);

        if consumed >= server.bandwidth_ns && now_ns >= deadline {
            // Budget exhausted AND period has ended: replenish.
            server.runtime_consumed.store(0, Ordering::Relaxed);
            server.period_start.store(now_ns, Ordering::Relaxed);
            server.deadline.store(now_ns + server.period_ns, Ordering::Relaxed);
        }
        // If consumed < bandwidth_ns: budget not yet exhausted, no action.
        // If consumed >= bandwidth_ns but now_ns < deadline: still in penalty
        //   phase, do not replenish early.
    }
}

/// Charge a command buffer's estimated cost to its context's CBS budget.
/// Called immediately after pick_next_submission selects a command buffer.
///
/// Cost estimation: the kernel uses cmd_buffer's `estimated_ns` field, which
/// is set by userspace at submit time (via `AccelSubmitParams::estimated_ns`).
/// The estimate is clamped to `period_ns / 4` to prevent a single oversized
/// command from consuming more than one quarter of the CBS period in one shot.
/// (The CBS_MIN_BUDGET floor is enforced separately in the debt carry-forward
/// path — see Section 21.3.4 for the full debt model.)
///
/// After charging: if `runtime_consumed >= bandwidth_ns`, the context is
/// over-budget and will not be selected by `pick_next_submission` until its
/// next replenishment period.
fn charge_submission_cost(
    sched:   &mut AccelScheduler,
    ctx:     &AccelContextState,
    cmd:     &PendingSubmission,
) {
    let Some(server_idx) = ctx.cbs_server_index else { return };
    let server = &sched.bandwidth_servers[server_idx];

    // Retrieve the context's current period to compute the cost cap.
    let period_ns = server.period_ns;
    let cost_cap  = period_ns / 4;

    // Zero means "unknown": substitute the cap. Otherwise clamp the
    // estimate to the cap, per the doc comment above.
    let cost_ns = if cmd.estimated_ns == 0 {
        cost_cap
    } else {
        cmd.estimated_ns.min(cost_cap)
    };

    // cost_ns is capped at period_ns / 4 above, so this add cannot
    // realistically overflow the u64 runtime counter.
    server.runtime_consumed.fetch_add(cost_ns, Ordering::Relaxed);
}
```

estimated_ns field in PendingSubmission:

The PendingSubmission struct requires an additional field for the CBS charging path:

pub struct PendingSubmission {
    // (existing fields as above)

    /// Userspace-provided estimate of GPU execution time (nanoseconds).
    /// Set via `AccelSubmitParams::estimated_ns` at submit time.
    /// Used by `charge_submission_cost` for CBS budget accounting.
    /// Zero means "unknown" — the scheduler substitutes `period_ns / 4`.
    /// Capped at `period_ns / 4` regardless of userspace value.
    pub estimated_ns: u64,
}

Integrated pick_next_submission with replenishment and charging:

The full CBS-correct scheduling decision is thus:

pick_next_submission():
  1. replenish_expired_budgets(now)         // O(n): update any newly-eligible contexts
  2. For each priority level (Realtime → Background):
       a. Find all contexts at this priority with pending work AND
          runtime_consumed < bandwidth_ns (i.e., still within budget).
       b. Among eligible contexts, select the one with the earliest `deadline`
          in its CBS server (Earliest Deadline First within the priority class).
       c. Contexts without a CBS server (AccelSchedPolicy::RoundRobin) are treated
          as having deadline = u64::MAX and are selected in handle order as tie-breaker.
  3. If a context was found:
       a. Dequeue its next PendingSubmission (pop_front from pending_queue).
       b. charge_submission_cost(ctx, cmd)  // O(1): update CBS runtime_consumed
       c. Return (context_handle, submission).
  4. If no eligible context at any priority: return None (hardware queue stays idle).

This replaces the simpler pseudocode above, which did not specify replenishment timing or budget charging. The two are equivalent for contexts that never exceed their budget; the CBS machinery only activates when a context reaches its bandwidth ceiling.
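Steps 1-3 can be exercised end-to-end with a reduced model: a single priority class, plain fields instead of atomics, and no pending queues. All names here are illustrative, not the kernel's definitions:

```rust
/// Reduced CBS model: EDF among in-budget contexts of one priority class.
struct CbsCtx {
    handle: u32,
    bandwidth_ns: u64,
    period_ns: u64,
    runtime_consumed: u64,
    deadline: u64,
    pending: u32, // number of queued submissions
}

/// Step 1: replenish any exhausted budget whose deadline has passed.
/// No early replenishment — a context cannot borrow future budget.
fn replenish(ctxs: &mut [CbsCtx], now_ns: u64) {
    for c in ctxs.iter_mut() {
        if c.runtime_consumed >= c.bandwidth_ns && now_ns >= c.deadline {
            c.runtime_consumed = 0;
            c.deadline = now_ns + c.period_ns;
        }
    }
}

/// Steps 2-3: EDF among in-budget contexts with pending work, then charge
/// the clamped estimated cost. Returns the chosen context's handle.
fn pick_and_charge(ctxs: &mut [CbsCtx], now_ns: u64, estimated_ns: u64) -> Option<u32> {
    replenish(ctxs, now_ns);
    let idx = ctxs
        .iter()
        .enumerate()
        .filter(|(_, c)| c.pending > 0 && c.runtime_consumed < c.bandwidth_ns)
        .min_by_key(|(_, c)| (c.deadline, c.handle))? // handle = tie-break
        .0;
    let c = &mut ctxs[idx];
    let cost_cap = c.period_ns / 4;
    // Zero estimate means "unknown": substitute the cap; otherwise clamp.
    let cost = if estimated_ns == 0 { cost_cap } else { estimated_ns.min(cost_cap) };
    c.runtime_consumed += cost;
    c.pending -= 1;
    Some(c.handle)
}
```

Running this model shows the penalty phase directly: once a context's consumed runtime reaches its bandwidth, it is skipped until its deadline passes, at which point replenishment makes it eligible again.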

AccelScheduler vs Firmware Scheduling Boundary:

Modern GPUs have their own internal schedulers (e.g., NVIDIA's GPC/TPC scheduler, AMD's ACE/HWS). The AccelScheduler does NOT replace or duplicate firmware scheduling. The boundary is clear:

AccelScheduler (kernel, software):
  Controls WHICH contexts get to submit work and WHEN.
  Enforces fairness, cgroup limits, bandwidth guarantees, priorities.
  Decides the ORDER of submissions to the hardware queue.
  Granularity: per-submission (command buffer level).

Firmware scheduler (device, hardware/firmware):
  Controls HOW submitted work is mapped to hardware execution units.
  Distributes warps/wavefronts across SMs/CUs.
  Manages context switching on the device.
  Handles workgroup scheduling and occupancy.
  Granularity: per-instruction-group (warp/wavefront level).

Interaction:
  AccelScheduler submits command buffer → firmware takes over.
  The kernel does not see or control internal GPU scheduling.
  The kernel CAN preempt at context boundaries (via preempt_context()),
  but CANNOT preempt mid-kernel on hardware that doesn't support it.
  Preemption capability is reported by get_info().preemption_granularity.
  Devices with AccelPreemptionGranularity::CommandBoundary do not support mid-dispatch preemption;
  the scheduler uses cooperative yield for those devices.

  When the device has hardware preemption (NVIDIA compute preemption,
  AMD MES): AccelScheduler can request preemption, and the firmware
  will drain the current workgroup and save context. Modern GPUs
  (Ampere+, CDNA2+) complete preemption in ~50μs-10ms including
  context save/restore, depending on preemption granularity and
  in-flight workload (instruction-level compute preemption is
  typically 50-100μs; draw/dispatch-level can be milliseconds).
  This is expensive compared to CPU context switches (~1μs) but
  bounded. The AccelScheduler therefore avoids preemption
  on the fast path (submissions are ordered by priority before dispatch)
  and triggers preemption only when necessary: priority inversion
  (a Realtime context is blocked behind a Background dispatch),
  fairness enforcement (a context has exceeded its CBS bandwidth
  budget), or timeout (max_execution_us exceeded).

  Older GPUs (pre-Ampere NVIDIA, pre-CDNA2 AMD) may report
  AccelPreemptionGranularity::CommandBoundary, where preemption
  latency is bounded by the longest in-flight command buffer
  (potentially hundreds of milliseconds for large compute dispatches).
  The scheduler treats these as cooperative-yield devices (see below).

Non-preemptible GPU limitation and cooperative yield:

Many GPUs (older hardware, NPUs, FPGAs, and some mid-range consumer GPUs) report AccelPreemptionGranularity::CommandBoundary. On these devices, the AccelScheduler CANNOT interrupt a running workload mid-dispatch. The scheduler can only enforce time slicing at submission boundaries — between command buffer submissions — not within a single dispatch. This is coarse-grained time slicing.

Consequences for non-preemptible devices:

  • max_execution_us is enforced at submission granularity, not instruction granularity. A single long-running dispatch that exceeds max_execution_us cannot be aborted; the scheduler must wait for it to complete naturally before taking corrective action.
  • Time guarantees for other contexts degrade proportionally to the longest single dispatch in the queue. Workloads with large dispatches should set small max_cmd_buffer_size limits (enforced by the kernel at submission time) to bound worst-case dispatch duration.

Cooperative yield mechanism for non-preemptible devices:

  • The driver implements preempt_context() as a cooperative yield: after the current command buffer completes, it does not submit the next one, and it signals the scheduler. This is NOT true preemption; the driver cannot interrupt the GPU mid-dispatch. It stops feeding new work at the next natural command buffer boundary.
  • The preempt_context vtable entry serves double duty: on preemptible devices it triggers true hardware preemption (~50μs-10ms); on non-preemptible devices it triggers cooperative yield (latency bounded by the longest in-flight command buffer). The scheduler checks preemption_granularity to know which behavior to expect.
  • Drivers for non-preemptible devices MUST report AccelPreemptionGranularity::CommandBoundary in get_info(). Misreporting this field is a driver correctness violation.
  • The AccelScheduler logs a warning when a non-preemptible context exceeds max_execution_us, records the overrun duration in AccelContextState::total_compute_ns, and applies backpressure (delaying the next submission) to amortize the overrun across the cgroup's next scheduling period.
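The branch between true preemption and cooperative yield can be sketched as follows. This is an illustrative model, not the scheduler implementation; the variant names other than CommandBoundary are assumed labels for the instruction-level and draw/dispatch-level granularities discussed above.

```rust
// Sketch: how the scheduler's preempt-request path branches on the
// preemption granularity reported by get_info().
#[derive(Clone, Copy, PartialEq, Debug)]
enum AccelPreemptionGranularity {
    Instruction,     // mid-dispatch hardware preemption (~50-100us)
    DrawDispatch,    // preempt between draws/dispatches (milliseconds)
    CommandBoundary, // no mid-dispatch preemption: cooperative yield only
}

#[derive(PartialEq, Debug)]
enum PreemptAction {
    HardwarePreempt,  // invoke preempt_context(); firmware drains and saves state
    CooperativeYield, // stop feeding work at the next command buffer boundary
}

fn choose_preempt_action(g: AccelPreemptionGranularity) -> PreemptAction {
    match g {
        AccelPreemptionGranularity::Instruction
        | AccelPreemptionGranularity::DrawDispatch => PreemptAction::HardwarePreempt,
        AccelPreemptionGranularity::CommandBoundary => PreemptAction::CooperativeYield,
    }
}
```

Either way the same preempt_context vtable entry is invoked; only the latency expectation differs.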

21.1.2.5 AccelComputeVTable — Compute-Specific Extensions

For devices with programmable compute (GPUs, some NPUs):

#[repr(C)]
pub struct AccelComputeVTable {
    pub vtable_size: u64,
    pub version: u32,

    /// Query supported compute APIs (Vulkan compute, OpenCL, CUDA compat).
    pub get_compute_caps: unsafe extern "C" fn(
        ctx: *mut c_void,
        out_caps: *mut ComputeCapabilities,
    ) -> IoResultCode,

    /// Set up a shared virtual address space between CPU and device.
    /// Enables SVM (Shared Virtual Memory) / unified addressing.
    pub enable_svm: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        process_page_table: u64,    // CR3 / TTBR0 of the owning process
    ) -> IoResultCode>,

    /// Handle a device-initiated page fault.
    /// Called by the driver when the device faults on an unmapped address.
    /// The kernel resolves the fault (allocate page, migrate, map) and
    /// tells the driver to retry.
    pub handle_device_fault: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        fault_addr: u64,
        fault_flags: u32,
        out_resolution: *mut FaultResolution,
    ) -> IoResultCode>,

    /// Get performance counters for a context.
    pub get_perf_counters: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        counters: *mut AccelPerfCounters,
    ) -> IoResultCode>,
}

/// Resolution of a device-initiated page fault, returned by handle_device_fault.
/// Tells the kernel how to proceed after a GPU/NPU fault on a shared virtual
/// memory address.
#[repr(C)]
pub struct FaultResolution {
    /// Action the driver should take.
    pub action: FaultAction,
    /// If action is MapLocal or MapRemote, the physical page to map.
    /// Ignored for other actions.
    pub page_paddr: u64,
    /// If action is MapLocal or MapRemote, the page table entry flags
    /// (readable/writable/executable, caching attributes).
    pub pte_flags: u64,
}

#[repr(u32)]
pub enum FaultAction {
    /// Retry the faulting access — kernel resolved it (e.g., allocated page).
    Retry = 0,
    /// Map the provided physical page locally (device-local memory).
    MapLocal = 1,
    /// Map the provided physical page as remote (system memory, peer device).
    MapRemote = 2,
    /// Deliver SIGBUS or SIGSEGV to the faulting context's owner process.
    Signal = 3,
    /// Kill the context (unrecoverable device error).
    KillContext = 4,
}
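A minimal sketch of a resolution path that could sit behind handle_device_fault. Here `mapped_ranges` stands in for the process's SVM VMA lookup and `alloc_page` for the physical allocator; both, and the PTE flag bits, are assumptions for illustration, not the UmkaOS API.

```rust
// Mirrors the FaultAction/FaultResolution KABI types above.
#[repr(u32)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum FaultAction {
    Retry = 0,
    MapLocal = 1,
    MapRemote = 2,
    Signal = 3,
    KillContext = 4,
}

#[repr(C)]
struct FaultResolution {
    action: FaultAction,
    page_paddr: u64,
    pte_flags: u64,
}

const PTE_PRESENT: u64 = 1 << 0;
const PTE_WRITABLE: u64 = 1 << 1;

fn resolve_device_fault(
    fault_addr: u64,
    mapped_ranges: &[(u64, u64)], // (start, len) of valid SVM ranges
    mut alloc_page: impl FnMut() -> Option<u64>,
) -> FaultResolution {
    let valid = mapped_ranges
        .iter()
        .any(|&(start, len)| fault_addr >= start && fault_addr < start + len);
    if !valid {
        // Fault outside any SVM mapping: deliver SIGBUS/SIGSEGV to the owner.
        return FaultResolution { action: FaultAction::Signal, page_paddr: 0, pte_flags: 0 };
    }
    match alloc_page() {
        // Back the faulting page with device-local memory; device retries.
        Some(paddr) => FaultResolution {
            action: FaultAction::MapLocal,
            page_paddr: paddr,
            pte_flags: PTE_PRESENT | PTE_WRITABLE,
        },
        // Allocation failure: the context cannot make progress.
        None => FaultResolution { action: FaultAction::KillContext, page_paddr: 0, pte_flags: 0 },
    }
}
```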

/// Performance counters for an accelerator context.
/// Returned by get_perf_counters() for monitoring and accounting.
#[repr(C)]
pub struct AccelPerfCounters {
    /// Total GPU/accelerator cycles executed for this context.
    pub cycles_executed: u64,
    /// Total bytes read from device memory.
    pub bytes_read: u64,
    /// Total bytes written to device memory.
    pub bytes_written: u64,
    /// Number of kernels/shaders executed.
    pub kernel_count: u64,
    /// Number of page faults (SVM/Unified Memory).
    pub page_faults: u64,
    /// Time spent stalled on memory (nanoseconds).
    pub memory_stall_ns: u64,
    /// Time spent executing (nanoseconds).
    pub execution_ns: u64,
    /// Device-specific counters (vendor-defined).
    pub vendor_counters: [u64; 8],
    pub _pad: [u8; 32],
}

#[repr(C)]
pub struct ComputeCapabilities {
    /// Shader/kernel ISA generation (vendor-specific version number).
    pub isa_version: u32,
    /// Maximum work group / thread block size.
    pub max_workgroup_size: u32,
    /// Maximum shared / local memory per work group (bytes).
    pub max_local_memory: u32,
    /// Number of shader/compute cores.
    pub shader_cores: u32,
    /// FLOPS (single precision, billions).
    pub fp32_gflops: u32,
    /// FLOPS (half precision, billions, 0 if unsupported).
    pub fp16_gflops: u32,
    /// INT8 TOPS (billions of 8-bit integer ops, for inference).
    pub int8_tops: u32,
    /// Tensor core / matrix unit count (0 if none).
    pub tensor_cores: u32,
    pub _pad: [u8; 32],
}

21.1.2.6 AccelDisplayVTable — Display-Capable Extensions

For devices with display output (GPUs with connectors). This vtable covers KMS mode-setting and framebuffer management; rendering/compositing acceleration is handled through AccelComputeVTable (compute shaders) or vendor-specific extensions. Advanced display features (HDR metadata, content-adaptive sync policies, display stream compression) are reserved for Phase 4 (Section 6.1-roadmap).

/// Display vtable for accelerators with display output.
/// Covers mode setting, framebuffer management, and hotplug.
#[repr(C)]
pub struct AccelDisplayVTable {
    pub vtable_size: u64,
    pub version: u32,

    /// Enumerate display connectors on this device.
    pub get_connectors: unsafe extern "C" fn(
        ctx: *mut c_void,
        out_connectors: *mut AccelConnectorInfo,
        max_connectors: u32,
        out_count: *mut u32,
    ) -> IoResultCode,

    /// Get supported display modes for a connector.
    pub get_modes: unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        out_modes: *mut AccelDisplayMode,
        max_modes: u32,
        out_count: *mut u32,
    ) -> IoResultCode,

    /// Set the active display mode for a connector.
    pub set_mode: unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        mode: *const AccelDisplayMode,
    ) -> IoResultCode,

    /// Create a framebuffer object from device memory.
    pub create_framebuffer: unsafe extern "C" fn(
        ctx: *mut c_void,
        mem: AccelMemHandle,
        width: u32,
        height: u32,
        format: u32,
        out_fb: *mut AccelFramebufferHandle,
    ) -> IoResultCode,

    /// Schedule a page flip (swap front/back buffer).
    pub page_flip: unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        fb: AccelFramebufferHandle,
        out_fence: *mut AccelFence,
    ) -> IoResultCode,

    /// Set display power management state.
    pub set_dpms_state: unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        state: AccelDpmsState,
    ) -> IoResultCode,

    /// Get hotplug status for a connector.
    // KABI: u8 instead of bool for stable ABI (0=false, nonzero=true)
    pub get_hotplug_status: unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        out_connected: *mut u8,
    ) -> IoResultCode,

    /// Read EDID data from a connector.
    pub read_edid: unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        out_edid: *mut u8,
        max_size: u32,
        out_size: *mut u32,
    ) -> IoResultCode,

    /// Set hardware cursor image and position.
    pub set_cursor: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        image: *const u8,
        width: u32,
        height: u32,
        hot_x: u32,
        hot_y: u32,
    ) -> IoResultCode>,

    /// Enable/disable variable refresh rate (FreeSync/G-Sync).
    pub set_vrr_enabled: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        enabled: u8,
    ) -> IoResultCode>,
}

/// Display connector information.
#[repr(C)]
pub struct AccelConnectorInfo {
    /// Unique connector ID within this device.
    pub connector_id: u32,
    /// Connector type: HDMI=0, DisplayPort=1, DVI=2, VGA=3, eDP=4, Virtual=5.
    pub connector_type: u16,
    /// Explicit padding to align max_width to a 4-byte boundary
    /// (KABI structs carry no implicit compiler padding).
    pub _pad0: [u8; 2],
    /// Maximum resolution width supported (0 if unknown).
    pub max_width: u32,
    /// Maximum resolution height supported (0 if unknown).
    pub max_height: u32,
    /// Currently connected (1) or disconnected (0).
    pub connected: u8,
    /// Currently active (has a mode set).
    pub active: u8,
    pub _pad: [u8; 6],
}

/// Display mode (resolution, refresh rate, timing).
#[repr(C)]
pub struct AccelDisplayMode {
    /// Horizontal resolution in pixels.
    pub width: u32,
    /// Vertical resolution in lines.
    pub height: u32,
    /// Refresh rate in millihertz (e.g., 59950 for 59.95 Hz).
    pub refresh_millihz: u32,
    /// Clock frequency in kHz.
    pub clock_khz: u32,
    /// Horizontal front porch in pixels.
    pub hfp: u16,
    /// Horizontal sync pulse width in pixels.
    pub hsw: u16,
    /// Horizontal back porch in pixels.
    pub hbp: u16,
    /// Vertical front porch in lines.
    pub vfp: u16,
    /// Vertical sync pulse width in lines.
    pub vsw: u16,
    /// Vertical back porch in lines.
    pub vbp: u16,
    /// Preferred mode flag (monitor's preferred resolution).
    pub preferred: u8,
    pub _pad: [u8; 7],
}
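The refresh rate is fully determined by the timing fields: refresh = pixel clock / (htotal × vtotal), where htotal = width + hfp + hsw + hbp and vtotal = height + vfp + vsw + vbp. A small worked example (the helper is ours, not part of the vtable):

```rust
// Derive refresh_millihz from an AccelDisplayMode's timing fields.
fn refresh_millihz(
    clock_khz: u32,
    width: u32, hfp: u16, hsw: u16, hbp: u16,
    height: u32, vfp: u16, vsw: u16, vbp: u16,
) -> u32 {
    let htotal = width as u64 + hfp as u64 + hsw as u64 + hbp as u64;
    let vtotal = height as u64 + vfp as u64 + vsw as u64 + vbp as u64;
    // kHz -> Hz is *1000; Hz -> millihertz is *1000 again.
    ((clock_khz as u64) * 1_000_000 / (htotal * vtotal)) as u32
}
```

With the standard CEA-861 1080p60 timings (148.5 MHz clock, htotal 2200, vtotal 1125) this yields 60000 millihertz, i.e., exactly 60 Hz.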

/// Framebuffer handle (opaque).
#[repr(transparent)]
pub struct AccelFramebufferHandle(pub u64);

/// Command buffer handle (opaque, driver-allocated).
/// The driver allocates command buffer memory and returns this handle to the kernel
/// for tracking. The handle is used in `PendingSubmission` to identify which command
/// buffer to execute.
/// Created via `AccelBaseVTable::create_cmd_buffer`; destroyed via
/// `AccelBaseVTable::destroy_cmd_buffer`. See command buffer creation path below.
#[repr(transparent)]
pub struct AccelCmdBufferHandle(pub u64);

/// Semaphore handle (opaque, driver-allocated).
/// Used for cross-context synchronization. A submission can optionally signal a
/// semaphore on completion, allowing other contexts or userspace to wait on it.
#[repr(transparent)]
pub struct AccelSemaphoreHandle(pub u64);

/// Display power management state.
#[repr(u32)]
pub enum AccelDpmsState {
    /// Display is on.
    On = 0,
    /// Display is in standby (minimal power, fast resume).
    Standby = 1,
    /// Display is suspended (low power, slower resume).
    Suspend = 2,
    /// Display is off.
    Off = 3,
}

21.1.2.6a Command Buffer Creation Path

AccelCmdBufferHandle identifies a recorded command buffer. Handles are created and destroyed via two vtable entries appended to AccelBaseVTable:

// Appended to AccelBaseVTable (after register_completion_callback and vendor_ioctl).
// Callers must check vtable_size before invoking per Section 11.1.4 versioning rules.

/// Create a command buffer for recording driver commands.
///
/// `size_hint`: estimated number of commands to be recorded (drivers may ignore).
/// Returns an `AccelCmdBufferHandle` opaque u64 identifying the buffer.
/// The buffer is initially empty; commands are added via driver-specific
/// `ACCEL_IOCTL_CMD_RECORD` ioctl calls before submission.
pub create_cmd_buffer: Option<unsafe extern "C" fn(
    ctx:        *mut c_void,
    context_id: AccelContextHandle,
    size_hint:  u32,
    out_handle: *mut AccelCmdBufferHandle,
) -> IoResultCode>,

/// Destroy a command buffer and free its resources.
/// Must not be called while the buffer is submitted (in-flight).
pub destroy_cmd_buffer: Option<unsafe extern "C" fn(
    ctx: *mut c_void,
    buf: AccelCmdBufferHandle,
) -> IoResultCode>,

Userspace ioctl path:

  • ACCEL_IOCTL_CREATE_CMD_BUFFER (ioctl number 0xA2): Takes {context_id: u64, size_hint: u32} → returns {handle: u64}.
  • ACCEL_IOCTL_DESTROY_CMD_BUFFER (ioctl number 0xA3): Takes {handle: u64}.
  • Commands are recorded into the buffer via ACCEL_IOCTL_CMD_RECORD (driver-specific; layout is opaque to the KABI layer and passed through vendor_ioctl).

Lifecycle:

create_cmd_buffer(context, size_hint)
  → record commands via ACCEL_IOCTL_CMD_RECORD (driver-specific)
  → submit_commands(context, cmd_buffer_handle, ...)
  → wait for completion via AccelFenceHandle / poll_completion
  → optionally resubmit (re-execute) without recreating the buffer
  → destroy_cmd_buffer(handle)

Command buffers may be resubmitted (re-executed) by calling submit_commands again without recreating — the recorded command sequence is preserved. The driver is responsible for ensuring that the buffer is not modified while an execution is in-flight.

Complete handle lifecycle specification:

// umka-core/src/accel/handles.rs (kernel-internal)

/// Maximum command buffers per context.
/// Limits memory usage for command buffer tracking structures.
/// With 1024 buffers × 4KB average size = 4MB per context.
pub const MAX_CMD_BUFFERS_PER_CONTEXT: usize = 1024;

/// Maximum semaphores per context.
/// Limits memory usage for semaphore tracking structures.
/// With 2048 semaphores × 64 bytes = 128KB per context.
pub const MAX_SEMAPHORES_PER_CONTEXT: usize = 2048;

AccelCmdBufferHandle lifecycle:

| State | Description | Valid Operations |
|-------|-------------|------------------|
| ALLOCATED | Buffer created via create_cmd_buffer, empty | record_commands, destroy |
| RECORDING | Commands being recorded via ioctl | record_commands, submit, destroy |
| SUBMITTED | Buffer submitted to hardware via submit_commands | poll_completion (none until complete) |
| COMPLETED | Hardware completed, fence signaled | resubmit, destroy |
| INVALID | Destroyed via destroy_cmd_buffer | (none) |

Lifecycle rules:

1. create_cmd_buffer returns a new handle in ALLOCATED state.
2. record_commands transitions the buffer to RECORDING and appends commands.
3. submit_commands transitions it to SUBMITTED and increments the context's in_flight_count.
4. On completion (fence signaled), the state transitions to COMPLETED.
5. destroy_cmd_buffer:
   - If SUBMITTED: returns EBUSY; the caller must wait for completion first.
   - In any other state: frees driver resources and transitions to INVALID.
6. Context destruction with pending buffers:
   - All SUBMITTED buffers are marked as COMPLETED with error status.
   - All buffers are implicitly destroyed (no explicit destroy_cmd_buffer needed).
   - The process receives ECONTEXTKILLED on any pending operations.
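The transition table above can be encoded as a small state machine. This is an illustrative model, not kernel code; the error strings stand in for the error codes in the summary table below.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum CmdBufState {
    Allocated,
    Recording,
    Submitted,
    Completed,
    Invalid,
}

#[derive(Clone, Copy, Debug)]
enum CmdBufOp {
    Record,
    Submit,
    Complete, // fence signaled by hardware
    Resubmit,
    Destroy,
}

/// Returns the next state, or the error code for an invalid transition.
fn step(state: CmdBufState, op: CmdBufOp) -> Result<CmdBufState, &'static str> {
    use CmdBufOp::*;
    use CmdBufState::*;
    match (state, op) {
        (Allocated, Record) | (Recording, Record) => Ok(Recording),
        (Recording, Submit) => Ok(Submitted),
        (Submitted, Complete) => Ok(Completed),
        (Completed, Resubmit) => Ok(Submitted),
        (Submitted, Destroy) => Err("EBUSY"), // rule 5: wait for completion first
        (Invalid, _) => Err("EINVAL"),        // destroyed handles accept nothing
        (_, Destroy) => Ok(Invalid),
        _ => Err("EINVAL"),
    }
}
```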

AccelSemaphoreHandle lifecycle:

| State | Description | Valid Operations |
|-------|-------------|------------------|
| ALLOCATED | Semaphore created via create_semaphore | signal, wait, destroy |
| SIGNAL_PENDING | Signal operation in progress | wait (blocks), destroy (returns EBUSY) |
| SIGNALED | Semaphore has been signaled | wait (immediate return), reset, destroy |
| INVALID | Destroyed via destroy_semaphore | (none) |

Lifecycle rules:

1. create_semaphore returns a new handle in ALLOCATED state (unsignaled).
2. signal_semaphore transitions the semaphore to SIGNAL_PENDING during the asynchronous signal, then to SIGNALED.
3. wait_semaphore:
   - If SIGNALED: returns immediately.
   - If SIGNAL_PENDING or ALLOCATED: blocks on an internal wait queue.
   - A timeout option is available (returns ETIMEOUT on expiry).
4. reset_semaphore transitions SIGNALED → ALLOCATED (unsignaled).
5. destroy_semaphore:
   - If SIGNAL_PENDING: returns EBUSY; wait for signal completion first.
   - In any other state: wakes all waiters with ESEMDESTROYED and frees resources.
6. Context destruction with pending semaphores:
   - All semaphores are implicitly destroyed.
   - All waiters wake with ECONTEXTKILLED.

Error handling summary:

| Error | Meaning | Recovery |
|-------|---------|----------|
| EBUSY | Handle is in use (submitted/signaling) | Wait for completion, retry |
| EINVAL | Invalid handle or operation | Programming error; fix the caller |
| ENOENT | Handle does not exist | Already destroyed; ignore |
| ECONTEXTKILLED | Owning context was destroyed | Recreate context, resubmit |
| ESEMDESTROYED | Semaphore destroyed while waiting | Retry with a new semaphore |
| ETIMEOUT | Wait operation timed out | Retry or report error |

21.1.2.6b In-Flight Limits and Semaphore Dependency Validation

The lifecycle tables above describe per-handle state transitions. This section specifies the bounded limits enforced at submit_commands time and the kernel-side tracking structures backing semaphore validation.

In-flight command buffer limit per context:

// umka-core/src/accel/limits.rs (kernel-internal)

/// Maximum command buffers in flight (submitted to hardware, not yet completed)
/// per AccelContext. This is distinct from MAX_CMD_BUFFERS_PER_CONTEXT (which
/// limits the total number of allocated, not necessarily submitted, buffers).
///
/// Backpressure behavior:
///   - Blocking submit (default): calling userspace thread sleeps until a slot
///     frees, woken by the completion interrupt handler.
///   - Non-blocking submit (ACCEL_SUBMIT_NONBLOCK flag): returns EAGAIN
///     immediately so the caller can poll or use a completion callback.
///
/// Rationale for 256: matches typical GPU hardware queue depths (e.g., NVIDIA
/// GR engine ring size = 256 entries) and ensures the kernel tracking array
/// (InflightSubmission × 256 ≈ 12KB per context) fits in a single huge page.
pub const ACCEL_MAX_INFLIGHT_PER_CTX: usize = 256;

/// Maximum semaphore dependencies (wait + signal combined) per command buffer.
///
/// Enforced at submit_commands() time: if
///   wait_semaphores.len() + signal_semaphores.len() > ACCEL_MAX_SEMAPHORE_DEPS
/// the submission is rejected with KabiError::TooManyDeps.
///
/// Rationale: dependency graph resolution at submit time is O(D) where D = dep
/// count. Bounding D at 64 keeps the worst-case submit path deterministic and
/// prevents pathological O(n²) behaviour when many contexts are waiting on the
/// same large semaphore fan-in.
pub const ACCEL_MAX_SEMAPHORE_DEPS: usize = 64;

These limits are checked in the submit_commands fast path:

// Enforced at submit_commands() entry, before any driver vtable call:
fn validate_submission_limits(
    ctx:    &AccelContextState,
    params: &AccelSubmitParams,
) -> Result<(), KabiError> {
    // Limit 1: in-flight queue depth.
    if ctx.inflight.len() >= ACCEL_MAX_INFLIGHT_PER_CTX {
        return Err(KabiError::QueueFull);
    }

    // Limit 2: semaphore dependency count.
    let total_deps = params.wait_semaphores.len()
        .saturating_add(params.signal_semaphores.len());
    if total_deps > ACCEL_MAX_SEMAPHORE_DEPS {
        return Err(KabiError::TooManyDeps);
    }

    Ok(())
}

Kernel-side semaphore tracking structure:

// umka-core/src/accel/semaphore.rs (kernel-internal)

/// Kernel-side state for a single AccelSemaphoreHandle within one AccelDevice.
/// Stored in AccelDevice::semaphore_table, keyed by the handle's u64 value.
pub struct SemaphoreState {
    /// Whether the semaphore has been signaled.
    /// Set to true atomically by the completion interrupt handler when the
    /// signaling command buffer completes on hardware.
    pub signaled: AtomicBool,

    /// The command buffer currently designated to signal this semaphore.
    /// `None` if no pending command buffer has listed this semaphore in its
    /// `signal_semaphores`. Set at submit_commands() time; cleared on completion.
    /// Weak reference: the AccelCmdBuffer may be freed (e.g., context destroyed)
    /// without invalidating this pointer — the Weak upgrade check handles that.
    pub signal_cmd: Mutex<Option<Weak<AccelCmdBuffer>>>,

    /// Number of command buffers currently waiting on this semaphore.
    /// Incremented at submit_commands() when the semaphore appears in
    /// `wait_semaphores`; decremented on each waiter's completion or cancellation.
    /// destroy_semaphore() is rejected with KabiError::Busy if this is > 0.
    pub wait_count: AtomicU32,

    /// Whether userspace has called destroy_semaphore() on this handle.
    /// Once true, any command buffers still waiting on it will be aborted
    /// with KabiError::SemaphoreDestroyed on their next scheduling attempt.
    pub destroyed: AtomicBool,
}

Semaphore validation in submit_commands():

After validate_submission_limits() passes, the following checks are applied to every semaphore handle listed in the submission (O(D) where D ≤ ACCEL_MAX_SEMAPHORE_DEPS):

For each handle h in params.wait_semaphores:
  1. Look up h in AccelDevice::semaphore_table.
     → Not found: return Err(KabiError::InvalidHandle)  // handle was never created
  2. If semaphore_table[h].destroyed == true:
     → return Err(KabiError::InvalidHandle)              // already destroyed
  3. If semaphore_table[h].signaled == true:
     → treat as satisfied; do NOT add a hardware dependency for h.
       (Avoids stalling hardware on an already-completed signal.)
  4. Else: increment semaphore_table[h].wait_count by 1.
     (The decrement happens in the completion handler when this cmd buffer finishes.)

For each handle h in params.signal_semaphores:
  1. Look up h in AccelDevice::semaphore_table.
     → Not found: return Err(KabiError::InvalidHandle)
  2. If semaphore_table[h].destroyed == true:
     → return Err(KabiError::InvalidHandle)
  3. If semaphore_table[h].signal_cmd (locked) is Some(_):
     → return Err(KabiError::AlreadySignaling)           // another cmd already owns this semaphore
       (One signaler per semaphore; two concurrent signalers would produce undefined ordering.)
  4. Else: set semaphore_table[h].signal_cmd = Weak::from(this cmd buffer).

If any check fails, the entire submission is rejected (no partial side-effects — wait_counts that were incremented in earlier iterations are decremented before returning the error).
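The wait-side loop with rollback can be sketched as below. `Sem` mirrors the relevant SemaphoreState fields without atomics, and the names are illustrative; the point is that wait_count increments taken by earlier iterations are undone before an error is returned, so a rejected submission leaves no partial side-effects.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Sem {
    signaled: bool,
    destroyed: bool,
    wait_count: u32,
}

fn validate_waits(
    table: &mut HashMap<u64, Sem>,
    waits: &[u64],
) -> Result<(), &'static str> {
    let mut incremented: Vec<u64> = Vec::new();
    for &h in waits {
        let ok = match table.get_mut(&h) {
            None => false,                   // handle never created
            Some(s) if s.destroyed => false, // already destroyed
            Some(s) if s.signaled => true,   // satisfied: no hw dependency added
            Some(s) => {
                s.wait_count += 1;           // decremented in the completion handler
                incremented.push(h);
                true
            }
        };
        if !ok {
            // Roll back wait_counts taken by earlier iterations of this loop.
            for &p in &incremented {
                table.get_mut(&p).unwrap().wait_count -= 1;
            }
            return Err("InvalidHandle");
        }
    }
    Ok(())
}
```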

Context destruction with outstanding semaphores:

On destroy_context() with semaphores still in the semaphore table for that context:

  1. For every semaphore owned by the context, set destroyed = true atomically.
  2. Any command buffer in another context that is waiting on such a semaphore will, on its next scheduling check, find destroyed == true and be aborted with KabiError::SemaphoreDestroyed rather than blocking indefinitely.
  3. The kernel does NOT automatically call destroy_semaphore for each handle — that would require enumerating all pending command buffers across all contexts, which is O(C × D) and not bounded-latency. Instead, the lazy-destroy mechanism (the destroyed flag check in the scheduling path) handles cleanup at O(1) per check.
  4. After context destruction, wait_count may transiently remain > 0 on destroyed semaphores until their waiters are aborted. The semaphore entries are freed from the table only after wait_count reaches 0 (checked in the waiter-abort path).

21.1.2.7 Scheduler Integration

The AccelScheduler interacts with the CPU scheduler to manage threads that are waiting for GPU work completion:

  • GPU-blocked threads: Threads waiting on poll_completion or a completion callback are placed in interruptible sleep (same state as I/O wait). They do not consume CPU time while waiting.
  • Completion wakeup: When the completion notification callback fires (see register_completion_callback), the waiting thread is woken directly from the interrupt handler — no polling required.
  • CPU vruntime accounting: Time spent blocked on GPU completion is treated like I/O wait: it does not accrue CPU debt. On wakeup, the thread receives a latency bonus similar to I/O-bound tasks in CFS, ensuring responsive scheduling for interactive GPU workloads.
  • Realtime priority contexts: For AccelPriority::Realtime contexts, the completion interrupt is handled by a dedicated high-priority IRQ thread to minimize wakeup latency.

21.1.3 Integration with UmkaOS Architecture

21.1.3.1 Device Registry Integration

Accelerators are modeled in the device registry (Section 10.5) with rich sub-device structure:

pci0000:00
  +-- 0000:41:00.0 (NVIDIA A100 GPU)
      +-- Properties:
      |     device-class: "gpu"
      |     compute-units: 108
      |     local-memory-bytes: 85899345920 (80 GB HBM2e)
      |     tensor-cores: 432
      |     p2p-capable: true
      |     preemption: "instruction"
      +-- Services published:
      |     "accel-compute" (AccelComputeVTable)
      |     "rdma-target" (for GPUDirect RDMA)
      +-- Children:
          +-- partition0 (MIG 3g.40gb)  # A100-80GB supports up to 2x 3g.40gb
          +-- partition1 (MIG 3g.40gb)
          +-- nvenc0 (video encoder sub-device)
          +-- nvdec0 (video decoder sub-device)

Example: NVIDIA RTX 4090 (desktop GPU with display output):

pci0000:00
  +-- 0000:01:00.0 (NVIDIA RTX 4090)
      +-- Properties:
      |     device-class: "gpu"
      |     compute-units: 128
      |     local-memory-bytes: 25769803776 (24GB GDDR6X)
      |     tensor-cores: 512
      |     p2p-capable: false
      |     preemption: "instruction"
      +-- Services published:
      |     "accel-compute" (AccelComputeVTable)
      |     "accel-display" (AccelDisplayVTable)
      +-- Children:
          +-- nvenc0 (video encoder sub-device)
          +-- nvdec0 (video decoder sub-device)

Power Management Integration:

On system suspend, the AccelScheduler drains in-flight submissions (with a timeout matching the device registry's suspend timeout, Section 10.5.5.3). Pending queue entries are preserved across suspend/resume and resubmitted after the device resumes. The AccelScheduler reports idle/busy state to the device registry for runtime power management decisions. AccelPowerState maps to the registry's PowerState as follows: D0Active ↔ active, D1LowPower ↔ idle.
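The suspend-time drain can be sketched as follows. Names and the polling budget are assumptions (the budget stands in for the registry's suspend timeout); the pending, not-yet-dispatched queue is deliberately untouched, since it is preserved and resubmitted after resume.

```rust
#[derive(Debug, PartialEq)]
enum SuspendResult {
    Drained,
    TimedOut,
}

/// Poll in-flight submissions to completion under a bounded budget.
/// The pending queue is not passed in: it is left intact for resume.
fn drain_for_suspend(
    inflight: &mut Vec<u64>, // in-flight submission ids
    mut poll_complete: impl FnMut(u64) -> bool,
    budget_polls: u32,
) -> SuspendResult {
    for _ in 0..budget_polls {
        inflight.retain(|&id| !poll_complete(id));
        if inflight.is_empty() {
            return SuspendResult::Drained;
        }
    }
    SuspendResult::TimedOut
}
```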

21.1.3.2 Crash Recovery

GPU driver crashes are currently catastrophic in Linux. In UmkaOS:

1. GPU driver (Tier 1, domain-isolated) faults.
2. UmkaOS Core detects the fault.
3. Device registry transitions GPU node to Recovering.
4. AccelScheduler:
   a. Marks all active contexts as "interrupted."
   b. Completes all pending submissions with error status.
   c. Notifies all processes waiting on completions.
5. Registry orchestrates crash recovery (Section 10.5.10):
   a. Revoke driver's isolation domain.
   b. GPU device reset (PCIe FLR or driver-specific reset).
   c. Reload driver binary.
   d. Fresh KABI vtable exchange.
6. AccelScheduler recreates contexts for processes that are still running.
   - Process state is on CPU side (command buffers). Only the GPU-side state is lost.
   - Processes receive an error on their in-flight submissions and can retry.
7. Total recovery time: ~100ms–5s (dominated by GPU hardware reset and driver reload; varies by hardware).
   **Note**: This is *reset* latency, which is distinct from *preemption* latency:
   - **Preemption**: ~50μs-10ms — saving context state to allow another workload to run (reversible).
   - **Reset**: ~100ms-5s — full hardware reinitialization, firmware reload, PCIe FLR (destructive).
   Simple GPU resets (PCIe FLR + driver reload) typically complete in ~100-500ms.
   Datacenter GPUs with GSP firmware reload (e.g., NVIDIA H100) may take 2-5s.
   Preemption is used for fairness enforcement; reset is used for crash recovery and HROT.
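Step 4 of the sequence above can be sketched as a single pass over active contexts. Structure and names are illustrative, not the AccelScheduler implementation.

```rust
// Step 4 of crash recovery: mark contexts interrupted and error-complete
// their in-flight submissions so waiting processes are notified.
struct Ctx {
    interrupted: bool,
    pending: Vec<u64>, // in-flight submission ids
}

fn fail_all_inflight(ctxs: &mut [Ctx], mut complete_with_error: impl FnMut(u64)) {
    for c in ctxs.iter_mut() {
        c.interrupted = true; // 4a: mark context "interrupted"
        for &sub in &c.pending {
            complete_with_error(sub); // 4b/4c: error-complete and wake waiters
        }
        c.pending.clear();
    }
}
```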

Compare Linux: full system reboot (30-60 seconds), loss of all work.

21.1.3.3 GPU Firmware as Cluster Member (Future)

Modern GPUs run sophisticated firmware that manages scheduling, memory, and P2P transfers. Both NVIDIA's GSP firmware and AMD's platform microcode (PSP/SMU/MES) ship as signed proprietary binaries, even where the host drivers themselves are open source. This firmware is effectively a specialized OS running on the GPU's control processor.

In UmkaOS's distributed kernel model (Section 5.1), GPU firmware can potentially participate as a first-class cluster member:

Current state (Phase 1-3):

  • Host kernel controls the GPU via a driver (Tier 1).
  • GPU firmware is passive: it responds to host commands but does not initiate.
  • All scheduling decisions, memory allocation, and work dispatch are host-driven.

Future state (Phase 5+):

  • GPU firmware implements the UmkaOS cluster membership protocol.
  • The GPU registers as a cluster node with a NodeId and exposes VRAM as remote-accessible memory in the device registry.
  • GPU firmware participates in DSM (Distributed Shared Memory) and DLM (Distributed Lock Manager), allowing GPU-initiated RDMA transfers and GPU-to-GPU locking without CPU involvement.
  • Work can be scheduled across the cluster transparently: a cgroup's workload could span CPUs on node0, GPUs on node0 and node1, and a DPU on node2.

Benefits:

  • GPU-initiated transfers: GPU firmware can directly request pages via DSM without waking the CPU driver, reducing latency for fine-grained data access patterns.
  • GPU-to-GPU coordination: two GPUs on different nodes can synchronize via DLM without routing through their respective host CPUs, which is critical for distributed training with NCCL/RCCL.
  • Heterogeneous scheduling: the kernel scheduler sees CPUs and GPUs as a unified pool of compute resources, not separate domains.

Requirements:

  • GPU firmware must implement UmkaOS's inter-kernel messaging protocol (see Section 5.1.2.2 "Device-local kernels as cluster members" for the detailed protocol specification).
  • Three implementation paths: (A) run full UmkaOS on the GPU's control processor, (B) a firmware shim translating UmkaOS messages to native GPU operations, or (C) a host-side proxy driver.
  • Requires firmware modifications by NVIDIA/AMD/Intel, or cooperation from the open-source GPU driver stacks (e.g., Nouveau, AMDGPU) and their firmware interfaces.
  • The security model must prevent malicious GPU firmware from compromising host integrity: cluster membership is opt-in per device and requires signed firmware verification.

This is not required for UmkaOS to function. It is an optional future enhancement that treats the "multikernel" model (one kernel per hardware domain) as a first-class design pattern rather than a hack. See Section 5.1.2.2 "Device-local kernels as cluster members" for SmartNIC/DPU equivalents.

21.1.3.4 FMA Integration

The FMA engine (Section 19.1) monitors accelerator health:

// Accelerator-specific health events
HealthEventClass::Accelerator  // New class

// Health data for accelerators
// Struct is exactly 64 bytes (one cache line) for optimal cache efficiency.
// Field accounting (all offsets in bytes):
//   temperature_mc:      4 (u32)     running: 4
//   power_mw:            4 (u32)     running: 8
//   _pad1:               4 ([u8;4])  running: 12 (align u64 to offset 16)
//   ecc_correctable:     8 (u64)     running: 16
//   ecc_uncorrectable:   8 (u64)     running: 24
//   throttle_count:      4 (u32)     running: 28
//   error_code:          4 (u32)     running: 32
//   pcie_replay_count:   4 (u32)     running: 36
//   _pad2:              28 ([u8;28]) running: 64
#[repr(C)]
pub struct AccelHealthData {
    /// GPU temperature (millidegrees Celsius).
    /// u32 required: GPU operating temperatures of 80–95°C = 80,000–95,000
    /// millidegrees, exceeding i16/u16 range (u16 max = 65,535).
    pub temperature_mc: u32,
    /// Power draw (milliwatts).
    pub power_mw: u32,
    /// Explicit padding to align ecc_correctable to 8-byte boundary.
    pub _pad1: [u8; 4],
    /// ECC error count (VRAM), correctable.
    pub ecc_correctable: u64,
    /// ECC error count (VRAM), uncorrectable.
    pub ecc_uncorrectable: u64,
    /// Thermal throttling events.
    pub throttle_count: u32,
    /// XID error code (NVIDIA) or equivalent vendor error code.
    pub error_code: u32,
    /// PCIe replay count (indicates link instability).
    pub pcie_replay_count: u32,
    /// Padding to reach exactly 64 bytes (one cache line).
    pub _pad2: [u8; 28],
}
// static_assert!(size_of::<AccelHealthData>() == 64);
// static_assert!(align_of::<AccelHealthData>() == 8);

FMA diagnosis rules for accelerators:

Rule Threshold Action
VRAM ECC degradation 100 correctable / hour Alert + schedule maintenance
VRAM uncorrectable error 1 event DisableDevice + Alert
Thermal throttling 10 events / hour Alert (may indicate cooling issue)
PCIe link unstable 50 replays / minute DemoteTier + Alert
Repeated driver crashes 3 in 1 hour DemoteTier (move to Tier 2)
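
The threshold rules above reduce to counting events inside a sliding time window. A minimal sketch, with illustrative types (`ThresholdRule`, `FmaAction` are hypothetical names, not the actual FMA engine types from Section 19.1):

```rust
/// Illustrative sliding-window threshold rule (hypothetical types).
#[derive(Debug, PartialEq)]
pub enum FmaAction {
    Alert,
    DisableDevice,
    DemoteTier,
}

pub struct ThresholdRule {
    pub threshold: u32, // events required within the window to fire
    pub window_ns: u64, // observation window, e.g. one hour for ECC rules
    pub action: FmaAction,
}

impl ThresholdRule {
    /// Fire when at least `threshold` events fall within [now - window, now].
    pub fn fires(&self, event_times_ns: &[u64], now_ns: u64) -> bool {
        let start = now_ns.saturating_sub(self.window_ns);
        let in_window = event_times_ns
            .iter()
            .filter(|&&t| t >= start && t <= now_ns)
            .count();
        in_window as u32 >= self.threshold
    }
}
```

A rule like "VRAM ECC degradation: 100 correctable / hour" is then `ThresholdRule { threshold: 100, window_ns: 3_600_000_000_000, action: FmaAction::Alert }` evaluated against the device's event timestamps.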

21.1.3.5 Stable Tracepoints

New stable tracepoints for accelerator observability (Section 19.2):

Tracepoint Arguments Description
umka_tp_stable_accel_submit device_id, context, cmd_size, priority Command submitted
umka_tp_stable_accel_complete device_id, context, latency_ns, error Command completed
umka_tp_stable_accel_preempt device_id, preempted_ctx, preempting_ctx Context preempted
umka_tp_stable_accel_migrate device_id, direction, pages, bytes Memory migration
umka_tp_stable_accel_fault device_id, context, fault_addr Device page fault
umka_tp_stable_accel_oom device_id, requested, available Device memory exhaustion
umka_tp_stable_accel_p2p src_device, dst_device, bytes, latency_ns P2P DMA transfer

21.1.3.6 Object Namespace

Accelerators appear in the unified object namespace (Section 19.4):

\Devices\pci0000:00\0000:41:00.0   # GPU device object
\Accelerators\gpu0                   # Symlink to device
\Accelerators\gpu0\Contexts\        # Active execution contexts
\Accelerators\gpu0\Memory\          # Memory allocation tracking
\Accelerators\gpu0\Partitions\      # MIG partitions

Browsable via umkafs:

cat /mnt/umka/Accelerators/gpu0/Memory
# # type: AcceleratorMemory
# # total: 40000000000 (40 GB)
# # used: 27000000000 (27 GB)
# # contexts:
# #   ctx[pid=1234]: 16106127360 (15 GB) - training job
# #   ctx[pid=5678]: 8589934592 (8 GB) - inference server
# #   ctx[pid=9012]: 4294967296 (4 GB) - development

21.1.3.7 Partial Failure Handling

Single-context error: If a single execution context encounters an error (shader hang, invalid command), the AccelScheduler calls reset_context() on the affected context only. Other contexts on the same device continue unaffected. The affected process receives an error status on its next poll_completion call or via the completion callback.

ECC uncorrectable error: When device memory reports an uncorrectable ECC error, the affected page is retired (analogous to CPU page retirement, Section 19.1.6). The owning context is notified, and if possible, data is migrated to a healthy page. If the page content is unrecoverable, the owning process receives SIGBUS.

Timeout without crash: The AccelScheduler enforces max_execution_us via a kernel timer. When a submission exceeds its time limit:
  • If the device supports instruction-level preemption: preempt the overdue context, save its state, and return AccelCompletionStatus::Preempted.
  • If preemption is not supported: wait until the current command buffer boundary, then fail the overdue submission with AccelCompletionStatus::Timeout.
  • In either case, the context remains valid and the process can submit new work.
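
A minimal sketch of this deadline check, assuming a simplified verdict enum (`TimeoutVerdict` and `check_deadline` are hypothetical names; the real scheduler returns AccelCompletionStatus values to the submitter):

```rust
/// Hypothetical verdict summarizing the timeout policy.
#[derive(Debug, PartialEq)]
pub enum TimeoutVerdict {
    /// Still within max_execution_us — leave it running.
    Running,
    /// Overdue and the device supports preemption — save state, preempt.
    Preempt,
    /// Overdue without preemption support — fail at the next command
    /// buffer boundary.
    FailAtBoundary,
}

pub fn check_deadline(
    elapsed_us: u64,
    max_execution_us: u64,
    supports_preemption: bool,
) -> TimeoutVerdict {
    if elapsed_us <= max_execution_us {
        TimeoutVerdict::Running
    } else if supports_preemption {
        TimeoutVerdict::Preempt
    } else {
        TimeoutVerdict::FailAtBoundary
    }
}
```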

21.1.3.8 Resolved Design Decisions

1. /dev/umka-accel-N ioctl specification.

Each ioctl struct starts with a u32 version header; the kernel checks version compatibility before dispatching. Ioctl command numbers use the following range assignments:

Range Purpose
0x00–0x1F Device query (caps, topology, memory info, clock info)
0x20–0x3F Context management (create, destroy, set priority)
0x40–0x5F Memory management (alloc, free, map, migrate)
0x60–0x7F Command submission (submit, wait, fence create/signal)
0x80–0x9F Display/KMS (mode set, framebuffer, hotplug) — routed to AccelDisplayVTable
0xA0–0xBF Health/telemetry (temperature, ECC, utilization query)
0xC0–0xFF Vendor-private (opaque passthrough to driver)

The vendor-private range 0xC0–0xFF passes the raw buffer to AccelBaseVTable::vendor_ioctl() — the kernel does not interpret it, only validates buffer bounds and capability permissions. This gives vendors (NVIDIA, AMD, Intel) a stable passthrough channel without polluting the common ioctl space.
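
As a sketch, range-based routing reduces to a match on the command's low byte (the nr field of the ioctl number). `IoctlClass` and `classify_ioctl` are illustrative names, not the actual UmkaOS dispatch code:

```rust
/// Illustrative handler group, one per range in the table above.
#[derive(Debug, PartialEq)]
pub enum IoctlClass {
    DeviceQuery,
    Context,
    Memory,
    Submission,
    Display,
    Health,
    VendorPrivate,
}

/// Classify an ioctl by its nr byte, per the range assignments above.
pub fn classify_ioctl(nr: u8) -> IoctlClass {
    match nr {
        0x00..=0x1F => IoctlClass::DeviceQuery,
        0x20..=0x3F => IoctlClass::Context,
        0x40..=0x5F => IoctlClass::Memory,
        0x60..=0x7F => IoctlClass::Submission,
        0x80..=0x9F => IoctlClass::Display, // routed to AccelDisplayVTable
        0xA0..=0xBF => IoctlClass::Health,
        0xC0..=0xFF => IoctlClass::VendorPrivate, // opaque passthrough
    }
}
```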

Security note (E1 fix): Vendor-private ioctls make the vendor driver part of the Trusted Computing Base (TCB) for any process with CAP_ACCEL access. A vendor driver can implement arbitrary privileged operations within the passthrough range. This is an accepted tradeoff: GPU command submission inherently requires ring 0 memory mapping (GART/GTT) and DMA that cannot be fully mediated without unacceptable overhead. The mitigation is the Tier 1 MPK isolation domain — a vendor driver bug corrupts only its own domain, not umka-core.

Vendor-private ioctl buffer validation:

To prevent buffer overflow attacks via vendor-private ioctls, the kernel enforces:

  1. Size prefix requirement: The first 4 bytes of any vendor-private ioctl buffer must contain the buffer size (little-endian u32). This is the standard Linux ioctl pattern (_IOC_SIZE bits). The kernel reads the size prefix before passing the buffer to vendor_ioctl.

  2. Maximum buffer size: Vendor-private ioctl buffers are capped at 4KB (MAX_VENDOR_IOCTL_SIZE = 4096). Buffers larger than this are rejected with -EINVAL. This bounds the attack surface and prevents drivers from receiving arbitrarily large buffers that could overflow internal structures.

  3. Capability requirement: Vendor-private ioctls require CAP_ACCEL capability. The ACCEL_COMPUTE capability (0x0100) is the minimum; certain vendor ioctls may require additional capabilities (e.g., ACCEL_P2P for P2P-related vendor extensions).

  4. Kernel validation before passthrough:

```rust
/// Validate a vendor-private ioctl buffer before passing to driver.
/// Returns the validated buffer size, or an error code.
fn validate_vendor_ioctl_buf(user_buf: *const u8, buf_len: u32) -> Result<u32, i32> {
    // 1. Check basic bounds
    if buf_len < 4 {
        return Err(-EINVAL); // Too small to contain size prefix
    }

    // 2. Read size prefix from userspace
    let size_prefix = read_user_u32(user_buf as *const u32)?; // -EFAULT on bad access

    // 3. Verify size prefix matches provided length
    if size_prefix != buf_len {
        return Err(-EINVAL); // Size mismatch = potential attack
    }

    // 4. Enforce maximum size
    if size_prefix > MAX_VENDOR_IOCTL_SIZE {
        return Err(-EINVAL); // Buffer too large
    }

    // 5. Verify buffer is readable/writable by caller
    if !access_ok(user_buf, size_prefix as usize) {
        return Err(-EFAULT);
    }

    Ok(size_prefix)
}
```

  5. Driver receives validated buffer: By the time vendor_ioctl is called, the kernel has already validated the buffer size and bounds. The driver receives user_buf: *mut u8 and buf_len: u32 with the guarantee that buf_len <= 4096 and the buffer is accessible.

Versioning scheme: ioctl structs are append-only. New fields are added at the end. The version header tells the kernel which fields are present. The kernel zero-fills any fields beyond the caller's version (forward compat) and ignores trailing fields beyond its own version (backward compat). This mirrors the KABI vtable versioning protocol (Section 11.1.3).
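
The zero-fill/ignore behavior can be sketched as a plain byte copy (`copy_versioned` is a hypothetical helper for illustration; real ioctl structs carry the u32 version header rather than relying on length alone):

```rust
/// Copy an append-only versioned struct between caller and kernel layouts.
/// - Older caller (smaller src): trailing new fields are zero-filled.
/// - Newer caller (larger src): trailing unknown fields are ignored.
pub fn copy_versioned(src: &[u8], kernel_struct_size: usize) -> Vec<u8> {
    let mut out = vec![0u8; kernel_struct_size]; // forward compat: zero-fill
    let n = src.len().min(kernel_struct_size);   // backward compat: truncate
    out[..n].copy_from_slice(&src[..n]);
    out
}
```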

2. Context save/restore for mid-shader preemption: driver-reported capability.

Context save/restore state size and latency are fundamentally hardware-defined. The kernel cannot dictate GPU context size. The driver reports preemption capabilities via a new optional vtable method:

// Appended to AccelBaseVTable per Section 11.1.4 versioning rules (Option<fn> = backward compat).
// Drivers compiled against older KABI versions see this field as None (vtable_size check).
//     pub preemption_caps: Option<unsafe extern "C" fn(
//         device: DeviceHandle,
//         out: *mut AccelPreemptionCaps,
//     ) -> IoResultCode>,

/// Preemption capability descriptor reported by the driver.
#[repr(C)]
pub struct AccelPreemptionCaps {
    /// Maximum context save state size in bytes (driver-reported).
    /// The kernel allocates this per-context from device memory.
    pub context_state_size: u64,
    /// Worst-case save latency (nanoseconds). The scheduler uses this
    /// to decide whether preemption is worth the cost for short timeslices.
    pub save_latency_ns: u64,
    /// Worst-case restore latency (nanoseconds).
    pub restore_latency_ns: u64,
    /// Preemption granularity supported by the hardware.
    pub granularity: AccelPreemptionGranularity,
    /// Whether this device supports mid-shader preemption.
    /// u8 (0=false, 1=true) for KABI stability (see Section 21.1.2.3 KABI rules).
    /// Placed after u32 granularity to avoid padding holes in #[repr(C)] layout.
    pub supports_mid_shader: u8,
    /// Explicit padding to ensure deterministic layout and zero-initialized bytes.
    /// This 3-byte array fills the gap between supports_mid_shader (u8 at offset 28)
    /// and the trailing _pad array, ensuring no compiler-inserted uninitialized padding.
    /// The entire struct is zero-initialized by the driver before filling fields.
    pub _reserved: [u8; 3],
    /// Reserved for future expansion. Must be zero-initialized.
    pub _pad: [u8; 32],
}
// Layout verification:
//   context_state_size:   0-7   (8 bytes)
//   save_latency_ns:      8-15  (8 bytes)
//   restore_latency_ns:  16-23  (8 bytes)
//   granularity:         24-27  (4 bytes, u32)
//   supports_mid_shader: 28     (1 byte)
//   _reserved:           29-31  (3 bytes)
//   _pad:                32-63  (32 bytes)
// Total: 64 bytes (one cache line)

/// Preemption granularity levels, from coarsest to finest.
#[repr(u32)]
pub enum AccelPreemptionGranularity {
    /// No mid-dispatch preemption — must wait for the entire current command buffer
    /// to complete before the scheduler can reclaim the device. Latency is bounded
    /// by the longest in-flight command buffer, which may be hundreds of milliseconds
    /// for large compute dispatches. The AccelScheduler treats devices reporting this
    /// level as cooperative-yield devices.
    CommandBoundary  = 0,
    /// Preemption point checked after each draw call or compute dispatch boundary.
    /// Medium granularity: latency is bounded by the longest single draw call or
    /// dispatch, typically 1-10ms on current hardware. Supported by most modern
    /// discrete GPUs (post-Maxwell NVIDIA, post-GCN3 AMD).
    DrawBoundary     = 1,
    /// Preemption point checked at triangle/pixel boundaries (graphics) or wavefront
    /// boundaries (compute). Finer than `DrawBoundary` but coarser than
    /// `InstructionLevel`. Latency is typically sub-millisecond.
    PixelBoundary    = 2,
    /// Preemption point checked after every GPU instruction — highest overhead, used
    /// only for latency-critical contexts such as AR/VR rendering and real-time
    /// inference. The device saves full shader register state and can resume
    /// mid-thread. Save+restore latency is typically 50-100μs on compute GPUs
    /// (e.g., NVIDIA Ampere+ with compute preemption enabled).
    InstructionLevel = 3,
}

Kernel responsibilities:
  • Allocate a save-state buffer of context_state_size bytes in device memory per AccelContext.
  • When preempting, call AccelBaseVTable::preempt_context() (Section 21.1.2.2).
  • The scheduler uses save_latency_ns + restore_latency_ns to compute the break-even point: do not preempt if the remaining timeslice is shorter than the save+restore cost.

Driver responsibilities:
  • Report accurate worst-case values (over-reporting causes conservative scheduling; under-reporting causes deadline misses).
  • Perform the actual hardware state save/restore when preempt_context() is called.
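
The break-even test follows directly from the AccelPreemptionCaps latency fields. A minimal sketch (`PreemptionCost` and `should_preempt` are hypothetical helper names):

```rust
/// Mirrors the save_latency_ns / restore_latency_ns fields of AccelPreemptionCaps.
pub struct PreemptionCost {
    pub save_latency_ns: u64,
    pub restore_latency_ns: u64,
}

/// Preempting is only worthwhile if the remaining timeslice exceeds the
/// round-trip save+restore cost; otherwise let the context run out.
pub fn should_preempt(remaining_timeslice_ns: u64, cost: &PreemptionCost) -> bool {
    let round_trip = cost.save_latency_ns.saturating_add(cost.restore_latency_ns);
    remaining_timeslice_ns > round_trip
}
```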

3. Multi-GPU unified memory coherence granularity: 64KB default, per-device configurable.

The default coherence unit is 64KB. Rationale:

  • 4KB is too fine-grained. GPU memory controllers are optimized for 64KB+ transfers. 4KB coherence causes 16× more page-fault interrupts and migration metadata overhead. False sharing is rare at 64KB in real ML workloads (tensors are almost always >64KB).
  • 64KB matches GPU hardware. NVIDIA and AMD GPU page tables operate at 64KB granularity. GPU-side page faults (via ATS/PRI or vendor-specific mechanisms) already use this unit. CPU-side overhead is acceptable — a 64KB migration over PCIe Gen4 x16 takes ~3µs (64KB / 25.6 GB/s raw bandwidth + TLP header and DMA setup overhead).
  • 2MB is too coarse. Causes excessive data transfer for partial access patterns (e.g., accessing one element of a large tensor page). Acceptable for pre-fetching but harmful for demand-paging.

The coherence granularity is a per-device property reported by the driver in AccelCapabilities:

pub struct AccelCapabilities {
    // ...existing fields...

    /// Minimum coherence granularity in bytes (must be power of 2, >= 4096).
    /// The unified memory system does not track coherence at finer granularity.
    pub coherence_granularity: u64,

    // ...
}

The kernel's HMM (heterogeneous memory management) layer rounds up all migration operations to this granularity. For multi-GPU scenarios where devices have different granularities, the kernel uses the LCM (least common multiple) of all participating devices' granularities. In practice this is 64KB since all major GPU vendors use it.
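
Because granularities are required to be powers of two, the LCM collapses to a maximum. A sketch, assuming a 64KB fallback when no devices participate (`effective_granularity` is an illustrative name):

```rust
/// Effective coherence granularity across participating devices.
/// For power-of-two values, lcm(a, b) == max(a, b), so the multi-device
/// LCM is simply the maximum over the set.
pub fn effective_granularity(device_granularities: &[u64]) -> u64 {
    debug_assert!(device_granularities
        .iter()
        .all(|g| g.is_power_of_two() && *g >= 4096));
    device_granularities
        .iter()
        .copied()
        .max()
        .unwrap_or(64 * 1024) // assumed 64KB default when the set is empty
}
```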

4. HealthEventClass::Accelerator formal taxonomy.

Event codes within the Accelerator class (stored in the event_code: u32 field of HealthEvent, Section 19.1):

Code Name Default Severity Recommended Action
0x0001 VRAM_ECC_CORRECTABLE Info (single), Warning (≥100/hr) Log. At threshold: schedule maintenance.
0x0002 VRAM_ECC_UNCORRECTABLE Critical Quiesce device. Kill affected contexts. Alert admin.
0x0003 SRAM_ECC_CORRECTABLE Info Log. L1/L2/register file errors are usually self-healing.
0x0004 SRAM_ECC_UNCORRECTABLE Critical Reset compute engine. Fail affected contexts.
0x0010 THERMAL_THROTTLE Warning Log. At threshold (≥10/hr): alert admin (cooling issue).
0x0011 THERMAL_SHUTDOWN Critical Device self-protected. Mark offline. Alert admin.
0x0012 POWER_BRAKE Warning Device hit power limit. Log for trending.
0x0020 GPU_HANG_DETECTED Degraded Attempt engine reset. If reset fails → Critical.
0x0021 ENGINE_RESET Warning Log which engine was reset and which contexts were lost.
0x0022 DEVICE_RESET Degraded Full device reset. All contexts lost. Alert admin.
0x0023 DEVICE_LOST Critical Device unresponsive after reset. Mark offline. DisableDevice.
0x0030 PCIE_REPLAY_THRESHOLD Warning (≥50/min) Link instability. DemoteTier. Suggest reseat/replace.
0x0031 PCIE_LINK_DEGRADED Degraded Link retrained at lower speed/width. Log + alert.
0x0040 VRAM_USAGE_HIGH Info (>80%), Warning (>95%) For trending / OOM avoidance.
0x0041 CONTEXT_OOM Warning A specific context's allocation failed.
0x0050 DRIVER_CRASH Degraded Driver process died. At threshold (≥3/hr): DemoteTier.
0x0051 FW_ERROR Degraded Device firmware reported an error (e.g., NVIDIA XID).
0x0060 ENCODER_ERROR Warning Video encode/decode engine error.
0x0070 NVLINK_CRC_ERROR Warning (threshold) Interconnect link error. Log for trending.
0x0071 NVLINK_DOWN Degraded Interconnect lane failed. Multi-GPU perf degraded.
0xFF00–0xFFFF VENDOR_SPECIFIC (driver-defined) Opaque vendor event — logged as-is.

Each event code maps to a specific subset of AccelHealthData fields (Section 21.1.3.4) that are meaningful for that event (e.g., THERMAL_THROTTLEtemperature_mc and throttle_count; VRAM_ECC_CORRECTABLEecc_correctable). The vendor-specific range 0xFF00–0xFFFF lets drivers report hardware-specific events without requiring a kernel update for each new error type.
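
The code-to-severity mapping above can be sketched as a single match. The `Severity` enum is illustrative, and threshold-based escalation (e.g., ≥100/hr correctable ECC, >95% VRAM usage) is applied by the FMA rules, not by this static mapping:

```rust
/// Illustrative default severity levels for Accelerator-class events.
#[derive(Debug, PartialEq)]
pub enum Severity {
    Info,
    Warning,
    Degraded,
    Critical,
    VendorSpecific,
}

/// Default severity per event code, following the taxonomy table above.
pub fn default_severity(event_code: u32) -> Severity {
    match event_code {
        0x0001 | 0x0003 | 0x0040 => Severity::Info,
        0x0010 | 0x0012 | 0x0021 | 0x0030 | 0x0041 | 0x0060 | 0x0070 => Severity::Warning,
        0x0020 | 0x0022 | 0x0031 | 0x0050 | 0x0051 | 0x0071 => Severity::Degraded,
        0x0002 | 0x0004 | 0x0011 | 0x0023 => Severity::Critical,
        0xFF00..=0xFFFF => Severity::VendorSpecific, // opaque vendor events
        _ => Severity::Info, // unknown codes: log at Info
    }
}
```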

FMA diagnosis rules (Section 19.1) map these codes to automated actions:

Rule Trigger Action
VRAM ECC degradation VRAM_ECC_CORRECTABLE ≥ 100/hr Alert + schedule maintenance
VRAM uncorrectable VRAM_ECC_UNCORRECTABLE × 1 DisableDevice + Alert
Thermal throttling THERMAL_THROTTLE ≥ 10/hr Alert (cooling issue)
PCIe link unstable PCIE_REPLAY_THRESHOLD ≥ 50/min DemoteTier + Alert
Repeated driver crashes DRIVER_CRASH ≥ 3/hr DemoteTier (move to Tier 2)
Device lost DEVICE_LOST × 1 DisableDevice + Alert

21.1.4 Implementation Phasing

Component Phase Dependencies Notes
AccelBaseVTable KABI definition Phase 3 Driver SDK, device registry Define the interface first
Accelerator scheduler (basic) Phase 3-4 AccelBase, cgroups Round-robin, then priority
DRM/KMS compatibility shim Phase 3-4 AccelBase Required for desktop GPU
Heterogeneous memory (basic) Phase 4 Memory manager, AccelBase CPU-device migration
P2P DMA Phase 4 IOMMU, AccelBase Requires IOMMU support
Cgroup accel controller Phase 4 AccelScheduler, cgroups Compute time + memory limits
CBS bandwidth guarantees Phase 4-5 AccelScheduler, CBS Guaranteed compute time
Hardware preemption Phase 5 AccelScheduler, GPU driver Driver-dependent
In-kernel inference (basic) Phase 4 Decision trees only Start with page prefetching
In-kernel inference (neural) Phase 5 Quantized INT8 runtime Requires validation
RDMA KABI Phase 5 Network drivers Separate from GPU
GPUDirect RDMA Phase 5+ RDMA + P2P DMA Cross-driver
NVIDIA KABI driver (basic init) Phase 3-4 AccelBase, KABI, ioctl compat nvidia-smi + simple CUDA (Section 21.5.2.3)
NVIDIA KABI driver (UVM) Phase 4-5 UmkaOS HMM, NVIDIA basic cudaMallocManaged + multi-GPU (Section 21.5.2.3)
NVIDIA KABI driver (production) Phase 5 All NVIDIA components Full CUDA stack + crash recovery
Device partitioning Phase 5+ AccelScheduler, registry Hardware-dependent
Topology-assisted collectives Phase 5+ Registry, RDMA, P2P Optimization layer

21.1.4.1 Priority Rationale

Phase 3-4 (Real Workloads): Basic accelerator framework + DRM compat + simple scheduling. This is the minimum for "GPU works on UmkaOS."

Phase 4-5 (Production Ready): Cgroup integration, memory management, P2P DMA, in-kernel inference basics. This is when UmkaOS becomes better than Linux for AI/ML workloads.

Phase 5+ (Ecosystem): Advanced scheduling, RDMA, collectives, partitioning. This is the competitive advantage phase — features that Linux fundamentally cannot provide due to architectural constraints.


21.1.5 Licensing Summary

Component IP Source Risk
AccelBase KABI Original design (vtable pattern from existing KABI) None
Accelerator scheduler CBS (academic, 1998) + priority scheduling (textbook) None
Heterogeneous memory Academic (HMM concepts, published research) None
P2P DMA PCIe spec (public), IOMMU spec (public) None
DRM/KMS compat Linux DRM (GPLv2 interface, ioctl numbers are facts) None
In-kernel inference Original design, standard ML algorithms (public) None
RDMA InfiniBand spec (public), RoCE spec (public) None
Cgroup controller Linux cgroup v2 interface (filesystem paths are facts) None
NVIDIA KABI driver New code inspired by MIT/GPLv2 open-source nvidia.ko None (MIT-compatible, OKLF-clean)

All vendor-specific GPU features (MIG, NVLink, etc.) are accessed through the AccelBase KABI vtable — the driver implements them, not the kernel. The kernel provides the framework; vendors fill in the hardware-specific logic through KABI.


21.2 Accelerator Memory and P2P DMA

21.2.1 Heterogeneous Memory Management

21.2.1.1 Problem

AI models are growing exponentially. A 70B parameter model at FP16 is ~140GB. GPU VRAM is typically 24-80GB. Models larger than VRAM need transparent memory management — pages migrating between CPU RAM and GPU VRAM based on access patterns.

Linux has hmm (Heterogeneous Memory Management) and mmu_notifiers, but they are bolted onto the existing MM and poorly integrated. Each vendor (NVIDIA UVM, AMD SVM, Intel SVM) implements its own page migration logic in the driver.

21.2.1.2 Design: Accelerator Memory as NUMA Nodes

The memory manager (Section 4.1) already has NUMA awareness — per-node buddy allocators, NUMA-local page allocation, NUMA-aware page cache placement.

Key insight: Accelerator memory is just another NUMA node. A GPU with 24GB VRAM is NUMA node N+1, with different performance characteristics (higher bandwidth, higher latency from CPU, not directly CPU-accessible).

Existing NUMA topology (Section 4.1):

  NUMA Node 0 (CPU 0-15)     NUMA Node 1 (CPU 16-31)
  64GB DDR5                   64GB DDR5
  Per-CPU page caches         Per-CPU page caches
  Buddy allocator 0           Buddy allocator 1

Extended with accelerator memory:

  NUMA Node 2 (GPU 0 VRAM)   NUMA Node 3 (GPU 1 VRAM)
  24GB GDDR6X                 24GB GDDR6X
  Managed by GPU driver       Managed by GPU driver
  via AccelBase vtable        via AccelBase vtable

21.2.1.3 Memory Node Types

// umka-core/src/mem/numa.rs (extend existing)

pub enum NumaNodeType {
    /// Standard CPU-attached DDR memory.
    CpuMemory,

    /// Accelerator device-local memory (VRAM, HBM).
    /// Managed through the AccelBase KABI vtable.
    AcceleratorMemory {
        device_id: DeviceNodeId,
        bandwidth_gbs: u32,        // GB/s per stack. e.g., 819 for HBM3 (per-stack at
                                    //       6.4 Gbps/pin × 1024 pins; JEDEC JESD238A).
                                    //       1229 for HBM3E (JEDEC max at 9.8 Gbps/pin;
                                    //       actual products: ~1000-1200 per stack)
        latency_ns: u32,           // e.g., 500-1000 for PCIe GPU
        cpu_visible: bool,         // Can CPU access this memory via BAR?
        cpu_visible_size: u64,     // Size of CPU-visible window (may be < total)
        coherent: bool,            // CPU-device cache coherent? (CXL = yes, PCIe = no)
    },

    /// CXL-attached memory (high-capacity, CPU-accessible, higher latency).
    CxlMemory {
        latency_ns: u32,
        /// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
        bandwidth_gbs: u32,
    },

    /// CXL 3.0 shared memory pool visible to multiple nodes.
    /// See Section 5.1.13 for CXL fabric integration details.
    CxlSharedPool {
        latency_ns: u32,
        /// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
        bandwidth_gbs: u32,
        /// Nodes that share this pool (bitfield).
        sharing_nodes: u64,
        /// Hardware coherence protocol version.
        coherence_version: CxlCoherenceVersion,
    },
}

21.2.1.4 Transparent Page Migration

When a device with hw_page_faults = 1 (modern GPUs with ATS/PRI support) faults on an unmapped address, the following occurs:

1. Device encounters page fault at virtual address VA.
2. Device sends ATS (Address Translation Services) fault to IOMMU.
3. IOMMU routes fault to UmkaOS Core's device fault handler.
4. UmkaOS Core looks up VA in the owning process's VMA tree:
   a. If VA maps to a CPU-resident page:
      - Migrate the page from CPU NUMA node to device NUMA node
        (via driver's migrate_pages callback)
      - Update CPU page tables (unmap from CPU)
      - Update device page tables (map on device)
      - Resume device execution
   b. If VA is unmapped:
      - Allocate a fresh page on the device's NUMA node
      - Map in device page tables
      - Resume device execution
   c. If VA maps to another device's memory:
      - P2P migration (see Section 21.2)
5. Track this page's location in the unified page tracking structure.

For devices WITHOUT hardware page faults (older GPUs, most NPUs): the kernel uses software-assisted migration. Before submitting a command buffer, the scheduler identifies which pages the command buffer references and pre-migrates them. This is less efficient but functionally equivalent.
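
The dispatch in step 4 is a three-way decision on the page's current state. A sketch, where `PageState` and `FaultAction` are illustrative stand-ins for the real VMA/page-tracking types:

```rust
/// Illustrative state of the faulting virtual address (step 4 lookup result).
#[derive(Debug, PartialEq)]
pub enum PageState {
    CpuResident, // 4a: page lives in CPU memory
    Unmapped,    // 4b: no backing page yet
    OtherDevice, // 4c: page lives in another device's memory
}

/// Illustrative resolution action taken by the device fault handler.
#[derive(Debug, PartialEq)]
pub enum FaultAction {
    MigrateToDevice,  // migrate CPU page to faulting device, remap both sides
    AllocateOnDevice, // allocate fresh page on the device's NUMA node
    P2pMigrate,       // P2P migration between devices (Section 21.2)
}

pub fn resolve_device_fault(state: PageState) -> FaultAction {
    match state {
        PageState::CpuResident => FaultAction::MigrateToDevice,
        PageState::Unmapped => FaultAction::AllocateOnDevice,
        PageState::OtherDevice => FaultAction::P2pMigrate,
    }
}
```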

21.2.1.5 Page Location Tracking

// umka-core/src/mem/hmm.rs (kernel-internal)

/// Tracks where each page of a shared address space currently resides.
pub struct PageLocationTracker {
    /// Per-page location (indexed by virtual page number).
    /// Each entry records which NUMA node currently holds this page.
    locations: RadixTree<PageLocation>,

    /// Per-page migration epoch counter (indexed by virtual page number).
    /// Incremented each time the page transitions to `Migrating` state.
    /// Used to detect stale migration completions after concurrent recovery.
    /// Each entry is a single `AtomicU64`, stored in a separate radix tree
    /// to avoid bloating the `PageLocation` enum (which is already 24 bytes).
    migration_epochs: RadixTree<AtomicU64>,

    /// Side table for in-flight migrations. Slab-allocated, bounded by
    /// `MAX_CONCURRENT_MIGRATIONS` (default 4096). The `Migrating` variant
    /// stores only the slab index (`migration_id: u32`), keeping the
    /// per-page `PageLocation` entry small.
    active_migrations: Slab<MigrationRecord>,
}

/// Full source/target metadata for an in-flight page migration.
/// Stored in `PageLocationTracker::active_migrations`, referenced by
/// `PageLocation::Migrating { migration_id }`.
pub struct MigrationRecord {
    pub source_kind: PageLocationKind,
    pub source_node: u8,
    pub source_device: DeviceNodeId,
    pub source_addr: u64,
    pub target_kind: PageLocationKind,
    pub target_node: u8,
    pub target_device: DeviceNodeId,
    pub target_addr: u64,
}

/// Maximum concurrent page migrations system-wide.
///
/// This limit bounds memory usage for the `active_migrations` slab and prevents
/// excessive contention on the per-page spinlocks during migration-heavy workloads.
/// The value 4096 is chosen to support:
/// - ~35M pages for a 70B parameter model with 4KB pages
/// - 1% active migration rate = 350K pages in transit
/// - With 4096 concurrent migrations and ~100μs per migration, aggregate
///   throughput approaches ~40M migrations/sec (4096 / 100μs); even a single
///   serialized DMA engine sustains ~10K/sec, sufficient for most workloads
///
/// If the limit is exceeded, new migration requests block on a global wait queue
/// until a slot becomes available. This is a backpressure mechanism that prevents
/// memory exhaustion and ensures forward progress.
pub const MAX_CONCURRENT_MIGRATIONS: usize = 4096;

/// Backoff strategy for per-page spinlock during migration metadata updates.
///
/// The per-page spinlock in `PageLocationTracker` is held only during metadata
/// updates (state transitions, epoch increments), which complete in tens of
/// nanoseconds. However, under high contention (e.g., 8 GPUs faulting on the same
/// page simultaneously in tensor parallelism), exponential backoff reduces
/// contention and power consumption.
///
/// The backoff sequence is:
/// 1. Try to acquire spinlock immediately.
/// 2. If locked, pause for 10 cycles (via `PAUSE` on x86, `YIELD` on ARM).
/// 3. Retry up to 10 times with exponential backoff (10, 20, 40, ..., 5120 cycles).
/// 4. After 10 failed retries, transition to sleep-based wait:
///    - Set a flag in the page metadata indicating "waiters pending".
///    - Block on a per-page wait queue (not spinning).
///    - The current migrator will wake all waiters after releasing the lock.
///
/// This hybrid approach (spin-then-sleep) optimizes for the common case where
/// the lock is held briefly, while avoiding wasted CPU cycles under sustained
/// contention.
pub const MIGRATION_LOCK_BACKOFF_BASE_CYCLES: u32 = 10;
pub const MIGRATION_LOCK_BACKOFF_MAX_RETRIES: u32 = 10;

/// Stored per-page in `PageLocationTracker::locations` (one entry per virtual page).
/// `#[repr(u8)]` keeps the discriminant to one byte; the full variant data is in the
/// enum payload. The radix tree stores the enum inline, so every extra byte of
/// discriminant wastes space proportional to the mapped address-space size.
#[repr(u8)]
pub enum PageLocation {
    /// Page is in CPU memory on this NUMA node.
    CpuNode(u8),

    /// Page is in accelerator device-local memory.
    DeviceLocal {
        device_id: DeviceNodeId,
        device_addr: u64,  // Device-side physical/virtual address
    },

    /// Page is in transit (being migrated).
    /// Stores only a side-table index to avoid bloating every per-page entry.
    /// The `MigrationRecord` (source, target, addresses) is stored in
    /// `PageLocationTracker::active_migrations` — a slab of fixed capacity
    /// bounded by `MAX_CONCURRENT_MIGRATIONS`. Migrations are transient
    /// (milliseconds), so the number of in-flight entries is small relative
    /// to total pages. This keeps `PageLocation` at 24 bytes (driven by
    /// `RemoteNode` / `RemoteDevice`) instead of ~48 bytes.
    Migrating {
        migration_id: u32,
    },

    /// Page is not yet allocated (demand-paged).
    NotPresent,

    /// Page is in compressed pool (Section 4.2).
    Compressed,

    /// Page is in swap.
    Swapped,

    // === Distributed memory locations (see Section 5.1.6 for the DSM protocol) ===

    /// Page is on a remote node's CPU memory, accessible via RDMA.
    RemoteNode {
        node_id: NodeId,
        remote_phys_addr: u64,
        dsm_state: DsmPageState,
    },

    /// Page is on a remote node's accelerator memory (GPUDirect RDMA).
    RemoteDevice {
        node_id: NodeId,
        device_id: DeviceNodeId,
        device_addr: u64,
    },

    /// Page is in CXL-attached memory pool (hardware-coherent).
    /// See Section 5.1.13 for CXL fabric integration details.
    CxlPool {
        pool_id: u32,
        pool_offset: u64,
    },
}

/// Discriminant for `PageLocation` variants, used in `MigrationRecord`
/// to record the source/target kind of an in-flight migration.
#[repr(u8)]
pub enum PageLocationKind {
    CpuNode      = 0,
    DeviceLocal  = 1,
    NotPresent   = 2,
    Compressed   = 3,
    Swapped      = 4,
    RemoteNode   = 5,
    RemoteDevice = 6,
    CxlPool      = 7,
}
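
The spin-then-sleep sequence documented alongside MIGRATION_LOCK_BACKOFF_* above can be sketched in standalone Rust, substituting an AtomicBool for the per-page spinlock and a caller-supplied closure for the per-page wait-queue block (a sketch under those assumptions, not the kernel implementation):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Acquire `lock` with exponential spin backoff (10, 20, 40, ..., 5120 cycles),
/// falling back to `sleep_wait` after 10 failed retries. Returns the number of
/// spin retries used on the successful attempt. The caller releases the lock
/// with a Release store of `false`.
pub fn lock_with_backoff(lock: &AtomicBool, mut sleep_wait: impl FnMut()) -> u32 {
    let mut retries = 0u32;
    loop {
        // Try to acquire immediately (step 1 of the documented sequence).
        if lock
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            return retries;
        }
        if retries >= 10 {
            // Step 4: transition to sleep-based wait instead of spinning.
            sleep_wait(); // stands in for blocking on the per-page wait queue
            retries = 0;
            continue;
        }
        // Steps 2-3: exponential backoff via PAUSE/YIELD hints.
        let pause_cycles = 10u32 << retries; // base 10 cycles, doubling
        for _ in 0..pause_cycles {
            std::hint::spin_loop();
        }
        retries += 1;
    }
}
```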

21.2.1.6 Migration Policy

// umka-core/src/mem/hmm.rs

pub struct MigrationPolicy {
    /// Migrate on first device fault (aggressive, more migration traffic
    /// but lower steady-state fault rate).
    pub migrate_on_first_fault: bool,

    /// Migrate only after N faults on the same page within a window
    /// (conservative, less migration traffic, higher fault rate).
    pub fault_threshold: u32,
    pub fault_window_ms: u32,

    /// Batch migration: when migrating, also prefetch nearby pages
    /// (spatial locality heuristic).
    pub prefetch_pages: u32,    // 0 = no prefetch, 16 = prefetch 16 neighbors

    /// Maximum pages in-flight (being migrated) at once.
    pub max_inflight_migrations: u32,

    /// Eviction policy when device memory is full:
    /// LRU (least recently accessed on device) is the default.
    pub eviction_policy: EvictionPolicy,
}

pub enum EvictionPolicy {
    /// Evict least recently accessed pages back to CPU memory.
    Lru,
    /// Evict pages with lowest access frequency.
    Lfu,
    /// Use learned policy (in-kernel inference, Section 21.4).
    Learned,
}
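
The fault-threshold path of the policy reduces to another sliding-window count. A sketch whose field names mirror MigrationPolicy (`PolicySketch` and `should_migrate` are illustrative):

```rust
/// Illustrative subset of MigrationPolicy driving the migrate-or-not decision.
pub struct PolicySketch {
    pub migrate_on_first_fault: bool,
    pub fault_threshold: u32,
    pub fault_window_ms: u32,
}

impl PolicySketch {
    /// Migrate immediately in aggressive mode; otherwise only once
    /// `fault_threshold` faults on this page land within `fault_window_ms`.
    pub fn should_migrate(&self, fault_times_ms: &[u64], now_ms: u64) -> bool {
        if self.migrate_on_first_fault {
            return true;
        }
        let start = now_ms.saturating_sub(self.fault_window_ms as u64);
        let recent = fault_times_ms.iter().filter(|&&t| t >= start).count() as u32;
        recent >= self.fault_threshold
    }
}
```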

21.2.1.7 Memory Oversubscription

When a model is larger than device VRAM:

Total model: 140GB (70B params, FP16)
GPU VRAM: 24GB
CPU RAM available: 256GB

Strategy:
  1. Most-accessed layers (attention heads, active KV cache) → GPU VRAM
  2. Less-accessed layers (early embedding, output projection) → CPU RAM
  3. Migration on demand: when GPU needs a page in CPU RAM, migrate it
     and evict a cold page from VRAM back to CPU RAM

The kernel manages this transparently. The ML framework sees a unified
140GB address space. Page migration is handled by the kernel's device
fault handler, exactly like CPU demand paging handles more-virtual-memory-
than-physical-RAM.

This is the accelerator equivalent of virtual memory. The kernel has been managing memory oversubscription since the 1960s. Now it does it for GPU VRAM too.
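The hot/cold split in steps 1-2 can be sketched as a greedy hottest-first pass. This is illustrative only — the kernel does this incrementally via fault-driven migration, and `place_in_vram` is a hypothetical helper, not a kernel API:

```rust
/// Given (size_bytes, access_count) regions, return the indices placed in
/// VRAM, hottest first, until `vram_bytes` is exhausted. Everything else
/// stays in CPU RAM and migrates on demand.
pub fn place_in_vram(regions: &[(u64, u64)], vram_bytes: u64) -> Vec<usize> {
    let mut order: Vec<usize> = (0..regions.len()).collect();
    // Sort indices by access count, descending (hottest first).
    order.sort_by(|&a, &b| regions[b].1.cmp(&regions[a].1));
    let mut used = 0u64;
    let mut in_vram = Vec::new();
    for i in order {
        if used + regions[i].0 <= vram_bytes {
            used += regions[i].0;
            in_vram.push(i);
        }
    }
    in_vram
}
```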

21.2.1.8 Huge Page Support for Tensors

ML tensors benefit from huge pages (fewer TLB misses, better DMA performance):

/// Page size selection for accelerator memory allocations.
/// This is a separate enum (not bitflags) because page sizes are mutually
/// exclusive — a single allocation uses exactly one page size.
#[repr(u32)]
pub enum AccelPageSize {
    /// Standard 4KB pages.
    Standard     = 0,
    /// 2MB huge pages.
    HugePage2M   = 1,
    /// 1GB huge pages.
    HugePage1G   = 2,
    /// Device-optimal page size (device driver selects the best size).
    DeviceOptimal = 3,
}

/// Accelerator memory allocation modifier flags (OR-combinable).
/// Combined with a separate AccelPageSize field to fully describe
/// an allocation request.
bitflags::bitflags! {
    #[repr(transparent)]
    pub struct AccelMemFlags: u32 {
        /// Non-migratable: pin in device memory, never evict to CPU.
        const PINNED          = 0x100;
        /// CPU-visible: map into CPU address space via BAR.
        const CPU_VISIBLE     = 0x200;
        /// Coherent: CPU-device coherent (requires CXL or similar).
        const COHERENT        = 0x400;
    }
}
// Usage: page_size = AccelPageSize::HugePage2M,
//        flags = AccelMemFlags::PINNED | AccelMemFlags::CPU_VISIBLE
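A minimal sketch of how the two fields travel together in an allocation request, written with plain constants instead of the `bitflags` macro so it stands alone. `AllocRequest` and the `has` helper are hypothetical names for this example:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
#[repr(u32)]
pub enum AccelPageSize { Standard = 0, HugePage2M = 1, HugePage1G = 2, DeviceOptimal = 3 }

// Flag constants matching the AccelMemFlags values above.
pub const PINNED: u32      = 0x100;
pub const CPU_VISIBLE: u32 = 0x200;
pub const COHERENT: u32    = 0x400;

/// An allocation request carries exactly one page size plus OR-combined flags.
pub struct AllocRequest {
    pub page_size: AccelPageSize,
    pub flags: u32,
}

impl AllocRequest {
    pub fn has(&self, flag: u32) -> bool { self.flags & flag != 0 }
}
```

Usage: `AllocRequest { page_size: AccelPageSize::HugePage2M, flags: PINNED | CPU_VISIBLE }` — the enum makes conflicting page sizes unrepresentable, while the flags remain freely combinable.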

21.2.1.9 Multi-Device Memory Coherence

When multiple GPUs access the same virtual address range (e.g., tensor parallelism across GPUs on the same machine), the kernel must manage coherence between device memories.

The PageLocation enum already includes DeviceLocal for single-device pages. For read-only copies replicated across multiple GPUs, the kernel uses a single-owner coherence protocol with explicit synchronization:

  • Single-owner with read sharing: At any time, one device owns the writable copy. Other devices may hold read-only copies. This mirrors the DSM protocol in Section 5.1.6.

  • Per-page ownership lock and wait queue: Each shared page has a per-page spinlock in the PageLocationTracker that serializes ownership transitions, plus a per-page wait queue for faulters that arrive during a migration. The spinlock protects only the page metadata (state, owner, reader list) and is held for tens of nanoseconds — never across I/O; actual data migration releases the lock. When a device faults on a page, the fault handler acquires this lock to inspect or modify the page's PageLocation state. If the page is in Migrating state, the faulter releases the spinlock and blocks on the per-page wait queue (not spinning) until the migration completes and wakes all waiters. The lock is per-page (not per-device or global), so two threads faulting different pages never contend. For hot shared tensors where multiple threads fault the same page simultaneously, only the first faulter performs the migration — subsequent faulters block on the wait queue.

  • Explicit ownership transfer protocol: When a device writes to a page owned by another device:

    1. The fault handler acquires the per-page ownership spinlock.
    2. A MigrationRecord (source and target) is allocated in the side table and the page state transitions to PageLocation::Migrating { migration_id }. The PageLocationEntry's migration_epoch counter is incremented; the migrator captures this epoch value before releasing the spinlock.
    3. The per-page ownership spinlock is released. The Migrating state prevents other faulters from modifying the page — they see Migrating and block on the per-page wait queue until the migration completes.
    3a. Pre-revocation fence: before revoking the source mapping (step 4), the kernel issues a device-specific quiesce command via preempt_context() on the source device's active context. This ensures the source device has completed all in-flight writes to the page before the DMA copy reads the data. The fence is verified via AccelFence completion before proceeding to step 5.
    4. The owning device's mapping is revoked (device page table unmap via driver callback). This is a slow operation (microseconds) and runs outside the spinlock.
    5. P2P DMA transfers the page data to the new owner (Section 21.2.2). This may take tens to hundreds of microseconds and also runs outside the spinlock.
    6. All devices holding read-only copies are invalidated via the AccelScheduler (which tracks all active contexts for a given address space). Each invalidation unmaps the page from the device's page table via the driver's callback.
    7. The fault handler re-acquires the per-page ownership spinlock. Stale migration detection: the migrator checks that the page's migration_epoch still matches the epoch captured at step 2. If the epoch has changed (because a crash recovery or concurrent migration reset the page state), the migration is stale — the migrator discards the DMA result and returns EAGAIN to trigger a fresh fault. This prevents stale migration completions from corrupting page state after concurrent recovery paths have already resolved the page.
    8. The new owner's mapping is installed and the page state transitions to DeviceLocal.
    9. The per-page ownership spinlock is released.
    10. All waiters on the per-page wait queue are woken. They re-fault and see the new DeviceLocal state.

Steps 4-6 (revocation, DMA transfer, and reader invalidation) run outside the spinlock, so they do not block other faulters on unrelated pages or cause spin-time proportional to I/O latency. Faulters that arrive during steps 4-6 see Migrating state and sleep on the wait queue instead of spinning. Steps 4-6 must complete before step 8 (new mapping). This is a write-invalidate protocol — the same model used by CPU cache coherence (MOESI) and the DSM protocol in Section 5.1.6, adapted for device memory.

migration_epoch field: Each page tracking entry in PageLocationTracker includes a migration_epoch: u64 counter. This counter is incremented at step 2 each time a page transitions to Migrating. The migrator captures the epoch value after the increment and before releasing the spinlock. At step 7, after re-acquiring the spinlock, the migrator compares the current migration_epoch against the captured value. A mismatch indicates that a device crash + recovery or a concurrent migration has intervened during the lock-free DMA window and has already resolved (or restarted) the page migration independently. In that case the DMA result is discarded and EAGAIN is returned; the faulting device will re-fault and enter a fresh migration sequence.
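The capture-and-compare sequence can be modeled with a stripped-down entry holding only the epoch counter. Names here are illustrative, not the real `PageLocationTracker` layout:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simplified page entry: only the migration epoch counter.
pub struct EpochEntry { pub migration_epoch: AtomicU64 }

/// Step 2: bump the epoch and capture it while the per-page spinlock is held.
pub fn begin_migration(e: &EpochEntry) -> u64 {
    e.migration_epoch.fetch_add(1, Ordering::AcqRel) + 1
}

/// Step 7: after re-acquiring the spinlock, a stale migration is detected by
/// comparing the captured epoch with the current one. A mismatch means a
/// recovery path or concurrent migration intervened during the lock-free
/// DMA window.
pub fn is_stale(e: &EpochEntry, captured: u64) -> bool {
    e.migration_epoch.load(Ordering::Acquire) != captured
}
```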

  • Migration error and timeout handling: Migration is a multi-step I/O operation (device page table manipulation + DMA transfer) that can fail at any point. Without explicit recovery, a page can be stuck permanently in Migrating state, blocking all faulters on its wait queue indefinitely.

Timeouts: Every migration has a deadline: 100ms for intra-node transfers (devices on the same PCIe root complex or NVLink fabric), 1s for cross-node transfers (remote devices via RDMA or CXL). The fault handler starts a kernel timer when it transitions the page to Migrating (step 2 above). If the timer fires before the migration completes (step 8), the timeout handler initiates rollback.

Migration result:

```rust
/// Outcome of a page migration attempt.
#[repr(u8)]
pub enum MigrationResult {
    /// Migration completed successfully. Page is now owned by the target device.
    Success,

    /// Migration failed (DMA timeout, device error, IOMMU fault). Page has been
    /// rolled back to its original location on the source device. Waiters are
    /// woken to re-fault and will find the page at its original location.
    RollbackToSource,

    /// Source device crashed or was reset during migration. The page data in
    /// device memory is lost. Page is marked `NotPresent` so the next fault
    /// reloads from backing store (CPU memory copy or disk).
    SourceLost,
}
```

Failure recovery by case:

  1. DMA timeout or destination device error (most common — e.g., P2P DMA times out, destination device returns an error during page write):

    • Cancel or abandon the in-flight DMA transfer.
    • The source device still holds the original page data (revocation at step 4 only unmaps the page from the source's page tables; the backing VRAM is not freed until the migration commits).
    • Re-acquire the per-page ownership spinlock.
    • Transition page state back from Migrating to its original DeviceLocal or CpuNode state (the MigrationRecord in the side table, looked up via the Migrating variant's migration_id, records exactly where to roll back to).
    • Release the per-page ownership spinlock.
    • Wake all waiters on the per-page wait queue. They re-fault and find the page at its original location.
    • Increment migration_failure_count in the PageLocationTracker stats.
    • Result: MigrationResult::RollbackToSource.
  2. Source device crash during migration (rare — the source device is reset or its driver crashes while the page is in Migrating state):

    • The page data in device memory is physically lost (device reset destroys VRAM contents — see Section 21.1.3.2).
    • The DMA transfer (if in progress) is aborted by the device reset.
    • Re-acquire the per-page ownership spinlock.
    • Transition page state to PageLocation::NotPresent.
    • Release the per-page ownership spinlock.
    • Wake all waiters on the per-page wait queue with an error indication. Waiters re-fault and trigger a fresh page-in from backing store: if the page has a CPU memory copy (e.g., it was migrated to the device from CPU RAM and the CPU copy was preserved or the page is file-backed), the fault handler loads from the backing store. If no backing store exists (anonymous page that was only in device memory), the owning process receives SIGBUS.
    • Increment migration_failure_count and source_lost_count in stats.
    • Result: MigrationResult::SourceLost.
  3. Waiter notification on failure: Waiters on the per-page wait queue are woken with an error code (MigrationFailed or SourceDeviceLost) embedded in the wait queue wake event. On wakeup, each waiter re-enters the fault handler, re-acquires the per-page spinlock, and inspects the current page state. For RollbackToSource, the page is back at its original location and the waiter proceeds normally (read-share or initiate a fresh migration attempt). For SourceLost, the page is NotPresent and the waiter triggers a fresh fault resolution (page-in from backing store or SIGBUS).

  4. Repeated migration failures: If a page accumulates more than 3 migration failures within a 60-second window (tracked per-page in PageLocationTracker), the page is marked as non-migratable (pinned at its current location) for a cooldown period (5 minutes). This prevents pathological retry loops on pages that consistently fail to migrate (e.g., due to a flaky P2P DMA path).

Fault handler behavior by MigrationResult:

| MigrationResult | Fault Handler Action | Waiter Wakeup | Retry Policy |
|---|---|---|---|
| Success | Install new mapping, update TLB, return to user | Wake with MIGRATION_SUCCESS | N/A — migration complete |
| RollbackToSource | Discard partial migration, restore source mapping, retry from backing store if needed | Wake with MIGRATION_ROLLBACK — waiter re-faults and finds page at original location | Automatic retry on next fault (up to 3 times within 60s) |
| SourceLost | Mark page NotPresent, signal SIGBUS if no backing store exists | Wake with MIGRATION_SOURCE_LOST — waiter re-faults from backing store or receives SIGBUS | No automatic retry — page must be replenished from backing store first |

Retry details:
  • RollbackToSource triggers an automatic retry on the next page fault from the same device. This is transparent to userspace.
  • After 3 RollbackToSource failures within 60 seconds, the page is marked NON_MIGRATABLE and pinned at its current location.
  • SourceLost does NOT trigger automatic retry — the page must be replenished from backing store (CPU RAM or disk) before migration can be attempted again.
  • Userspace receives SIGBUS only if: (1) SourceLost occurs, AND (2) no backing store exists (anonymous page that existed only in device memory).

/// Per-page migration failure tracking.
pub struct PageLocationTracker {
    // ...existing fields...
}

impl PageLocationTracker {
    /// Per-page failure count (upper 2 bits of the metadata field, values 0-3).
    /// Reset on successful migration or when the failure window expires.
    fn migration_failure_count(&self) -> u8 { ... }

    /// Increment failure count; returns true if the threshold is exceeded
    /// and the page has been marked non-migratable.
    fn record_failure(&self) -> bool {
        if self.failure_window_expired() {
            self.set_migration_failure_count(0);  // stale failures don't count
        }
        let count = self.migration_failure_count();
        if count >= 3 {
            self.mark_non_migratable();
            return true;
        }
        self.set_migration_failure_count(count + 1);
        false
    }

    /// Reset failure count after a successful migration.
    fn record_success(&self) {
        self.set_migration_failure_count(0);
        self.last_success_time.store(now(), Relaxed);
    }

    /// Check if the failure window has expired (60 seconds).
    fn failure_window_expired(&self) -> bool {
        let last = self.last_success_time.load(Relaxed);
        now() - last > 60_000_000_000  // 60 seconds in ns
    }
}

  • Read-sharing acquisition: When a device reads a page owned by another device, the fault handler acquires the per-page spinlock and checks the page state. If the page is in Migrating state, the faulter releases the spinlock and blocks on the per-page wait queue (same as for write faults). Otherwise, the faulter records the additional reader in the page metadata, releases the spinlock, and then performs the P2P DMA copy to create a read-only replica on the requesting device (outside the spinlock). No ownership change occurs.

  • Distributed cluster coherence: For multi-device coherence across machines (not just local GPUs), the per-page spinlock is replaced by the DSM directory protocol (Section 5.1.6, 05-distributed.md). The DSM protocol already provides single-owner / multi-reader semantics with a home-node directory that serializes ownership transitions across network boundaries — the same semantics as the local per-page spinlock, but designed for cross-node operation. Local multi-GPU coherence uses the lightweight per-page spinlock and wait queue; cross-machine coherence uses the DSM directory. The DLM (Section 5.1) is for filesystem-level distributed locking, not page-level coherence. The write-invalidate coherence protocol is identical in both cases — only the ownership serialization mechanism differs (local spinlock vs. DSM directory).

/// Per-page coherence tracking entry for multi-GPU systems.
/// Tracks ownership and sharing state for a single page across GPUs.
// Field accounting (all offsets in bytes):
//   owner_gpu:  1 (u8)         running: 1
//   state:      1 (GpuPageState = u8)  running: 2
//   _pad:       2 ([u8; 2], aligns u32 to offset 4)   running: 4
//   sharers:    4 (u32)        running: 8
//   version:    8 (u64)        running: 16
//   phys_addr:  8 (u64)        running: 24
//   lock:       4 (AtomicU32)  running: 28
//   _pad2:     36 ([u8; 36])   running: 64
// Total: 64 bytes, one cache line
#[repr(C)]
pub struct GpuCoherenceEntry {
    /// GPU that currently owns this page (exclusive write access).
    /// 0xFF = no owner (shared read state).
    pub owner_gpu: u8,
    /// State: Invalid, Shared, or Modified (the M/S/I subset of MESI).
    pub state: GpuPageState,
    /// Explicit padding to align `sharers` to a 4-byte boundary.
    pub _pad: [u8; 2],
    /// Bitmask of GPU indices that have this page in their TLBs.
    /// Bit i set = GPU i has a valid mapping. Supports up to 32 GPUs.
    /// For systems with >32 GPUs, use GpuCoherenceEntryLarge.
    pub sharers: u32,
    /// Version counter (incremented on each ownership transfer).
    pub version: u64,
    /// Physical address of the page in the owning GPU's VRAM.
    pub phys_addr: u64,
    /// Lock for ownership-state transitions (spinlock; held only while
    /// updating this entry, never across the data transfer itself).
    pub lock: AtomicU32,
    /// Padding to reach exactly 64 bytes (one cache line).
    pub _pad2: [u8; 36],
}
// static_assert!(size_of::<GpuCoherenceEntry>() == 64);

Huge page requirement for multi-GPU coherence: The per-page coherence tracking structures (GpuCoherenceEntry, ~64 bytes per page) mandate huge pages (2 MB) when tracking >4 GPUs simultaneously. For a 4-GPU system with 256 GB VRAM total (64 GB per GPU, 16M pages per GPU), the coherence table requires ~1 GB. Using 4 KB pages for this allocation would consume 256K page table entries and cause TLB thrashing during coherence lookups. The allocator automatically promotes coherence table allocations to 2 MB huge pages when gpu_count >= 4. On systems without huge page support, the coherence table is capped at 4 GPUs; attempting to register a 5th GPU returns -ENOMEM.
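The promotion rule above reduces to a small decision function. This sketch uses a hypothetical helper name and assumes the errno value ENOMEM = 12:

```rust
/// Choose the page size for the coherence table allocation, per the rule
/// above: >= 4 GPUs promotes to 2 MB huge pages; without huge page support
/// the table is capped at 4 GPUs and a 5th registration fails with -ENOMEM.
pub fn coherence_table_page_size(gpu_count: u32, huge_pages: bool) -> Result<u64, i32> {
    const KB4: u64 = 4096;
    const MB2: u64 = 2 * 1024 * 1024;
    if gpu_count >= 4 && huge_pages {
        Ok(MB2)            // promote to 2 MB huge pages
    } else if gpu_count <= 4 {
        Ok(KB4)            // small systems: 4 KB pages are acceptable
    } else {
        Err(-12)           // -ENOMEM: >4 GPUs require huge page support
    }
}
```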

For the common case of gradient all-reduce (all GPUs read the same weights but write different gradients), the kernel keeps weights as shared-read on all devices and only migrates gradient pages on write. Since each GPU writes to its own gradient partition (non-overlapping address ranges), per-page locks are uncontended in the common case.
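The sharers-bitmask bookkeeping described for `GpuCoherenceEntry` amounts to two bit operations: adding a reader, and computing the invalidation set on an ownership change. A sketch with free functions (hypothetical helpers, not the in-kernel API):

```rust
/// Record GPU `gpu` as holding a read-only mapping (bit i = GPU i,
/// up to 32 GPUs, matching the `sharers: u32` field).
pub fn add_sharer(sharers: u32, gpu: u8) -> u32 {
    debug_assert!(gpu < 32);
    sharers | (1u32 << gpu)
}

/// On a write by `new_owner`, every other sharer's mapping must be
/// invalidated (write-invalidate protocol); the returned mask lists
/// the GPUs whose page-table entries need unmapping.
pub fn invalidation_mask(sharers: u32, new_owner: u8) -> u32 {
    sharers & !(1u32 << new_owner)
}
```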


21.2.2 Peer-to-Peer DMA

21.2.2.1 Problem

AI workloads need data to flow between devices without CPU involvement:

  • Storage → GPU: Load training data directly from NVMe to GPU VRAM (GPUDirect Storage)
  • Network → GPU: Receive model weights from remote node directly to GPU (RDMA + GPUDirect)
  • GPU → GPU: Gradient exchange between GPUs in multi-GPU training (NVLink, xGMI, PCIe P2P)
  • GPU → Network: Send inference results directly from GPU to network (bypassing CPU copy)

Linux supports these as vendor-specific driver features. There is no general kernel mechanism.

21.2.2.2 Design: Generalized P2P DMA

Extend the KABI with a general mechanism for device-to-device DMA transfers:

// Appended to KernelServicesVTable

/// Set up a peer-to-peer DMA mapping between two devices.
/// The kernel programs the IOMMU to allow device A to DMA to device B's
/// memory region.
pub p2p_dma_map: Option<unsafe extern "C" fn(
    src_device: DeviceHandle,
    dst_device: DeviceHandle,
    dst_mem: AccelMemHandle,    // Or DmaBufferHandle for non-accel devices
    dst_offset: u64,
    size: u64,
    out_iova: *mut u64,         // IOVA that src_device can use to reach dst memory
    out_mapping: *mut P2pMappingHandle,
) -> IoResultCode>,

/// Tear down a P2P DMA mapping.
pub p2p_dma_unmap: Option<unsafe extern "C" fn(
    mapping: P2pMappingHandle,
) -> IoResultCode>,

/// Initiate a DMA transfer between two devices.
/// The kernel coordinates between the two drivers.
pub p2p_dma_transfer: Option<unsafe extern "C" fn(
    mapping: P2pMappingHandle,
    src_offset: u64,
    dst_offset: u64,
    size: u64,
    out_fence: *mut AccelFence,
) -> IoResultCode>,

21.2.2.3 IOMMU Integration

P2P DMA requires careful IOMMU programming:

Without P2P:
  Device A → IOMMU → CPU RAM ← IOMMU ← Device B
  (two DMA transfers, CPU RAM as bounce buffer)

With P2P:
  Device A → IOMMU → Device B's BAR / memory
  (one DMA transfer, no CPU involvement)

The kernel validates:
  1. Both devices support P2P (PCIe ACS, correct topology)
  2. The IOMMU can map across devices (same IOMMU domain, or the IOMMU supports cross-device mapping)
  3. The requesting driver has the P2P_DMA capability
  4. The target memory region is within the bounds of the P2P mapping

If hardware P2P is not possible (devices behind different root complexes without P2P bridge support), the kernel falls back to CPU-mediated copy transparently. The driver API is the same either way.

21.2.2.4 Topology-Aware Placement

The device registry (Section 10.5) provides topology information. The kernel uses this for P2P path selection:

PCIe Topology:
  Root Complex 0
    +-- Root Port 0
    |   +-- GPU 0 (VRAM: 24GB)
    |   +-- NVMe 0
    +-- Root Port 1
    |   +-- GPU 1 (VRAM: 24GB)
    |   +-- NIC (100GbE)

P2P capability matrix:
  GPU 0 ↔ NVMe 0: Direct P2P (same root port) — optimal
  GPU 0 ↔ GPU 1:  P2P via root complex — good
  GPU 0 ↔ NIC:    P2P via root complex — good
  GPU 1 ↔ NVMe 0: P2P via root complex — good

The scheduler uses this topology when placing workloads: a training job that loads data from NVMe 0 is preferentially scheduled on GPU 0 (same root port = best P2P path).
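The capability matrix above implies a simple ranking. A sketch assuming devices are identified by hypothetical (root_complex, root_port) pairs, which the device registry could derive from PCIe topology:

```rust
/// P2P path quality, ordered worst to best (derived Ord follows
/// declaration order).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub enum P2pPath {
    CrossComplex = 0,   // different root complexes — may need CPU-mediated fallback
    SameComplex  = 1,   // P2P via root complex — good
    SamePort     = 2,   // direct P2P under one root port — optimal
}

/// Rank the P2P path between two devices given their (root_complex,
/// root_port) coordinates.
pub fn p2p_path(a: (u32, u32), b: (u32, u32)) -> P2pPath {
    if a == b {
        P2pPath::SamePort
    } else if a.0 == b.0 {
        P2pPath::SameComplex
    } else {
        P2pPath::CrossComplex
    }
}
```

In the example topology, GPU 0 and NVMe 0 share root port (0, 0), so they rank `SamePort`; GPU 0 and GPU 1 share only root complex 0, ranking `SameComplex`.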

21.2.2.5 P2P Access Control

Peer-to-peer DMA allows devices to directly access each other's memory without CPU involvement. Without proper access control, a malicious or compromised device could read or corrupt another device's memory. This section specifies the mandatory access control mechanisms for all P2P DMA operations.

Authorization Model

Every P2P DMA mapping requires explicit authorization from both the source and target device owners:

P2P DMA Authorization Flow:
  1. Client (process or driver) requests P2P mapping: src_device → dst_device memory
  2. Kernel checks:
     a. Client holds ACCEL_P2P capability (0x0102) for BOTH devices
     b. Both devices are in the client's allowed device set (cgroup accel.devices)
     c. Target memory region is within the client's allocated AccelMemHandle
     d. Source device's cgroup has not exceeded accel.p2p.max limit (if set)
  3. Kernel programs IOMMU with P2P mapping (never direct device-to-device without IOMMU)
  4. P2pMappingHandle is bound to the requesting context — not transferable
  5. Mapping is recorded in both devices' P2P ACL tables

Capability-Based Device ACLs

Each device maintains a P2P Access Control List (ACL) that specifies which other devices are authorized to initiate P2P DMA to its memory:

/// P2P ACL entry — one per authorized (src_device, dst_device) pair.
/// Stored in the device registry node for the target device.
#[repr(C)]
pub struct P2pAclEntry {
    /// Source device that is authorized to initiate P2P DMA.
    pub src_device_id: DeviceNodeId,
    /// Capability token that authorized this ACL entry.
    /// Must be valid for the ACL entry to be valid.
    pub authorizing_cap: CapabilityToken,
    /// Maximum bytes that can be transferred via this ACL entry.
    /// Optional: 0 means unlimited (subject to cgroup limits).
    pub max_bytes: u64,
    /// Bytes transferred so far (reset on ACL refresh or revocation).
    pub bytes_transferred: AtomicU64,
    /// Creation timestamp (for auditing and expiration).
    pub created_ns: u64,
    /// Expiration timestamp (0 = no expiration).
    pub expires_ns: u64,
}

/// Maximum P2P ACL entries per device.
///
/// This limit bounds memory usage for ACL tables and ensures O(1) or O(log n)
/// lookup latency. The value 1024 is chosen to support:
/// - 32 GPUs in a single server × 31 peer devices = 992 entries (full mesh)
/// - Headroom for future expansion or per-cgroup ACLs
///
/// For systems requiring more than 1024 entries per device, use cgroup-based
/// ACL partitioning (each cgroup has its own ACL table).
pub const MAX_P2P_ACL_ENTRIES_PER_DEVICE: usize = 1024;

/// Maximum total P2P ACL entries system-wide.
///
/// This limit prevents unbounded memory growth in systems with many devices.
/// With 100K entries and 64 bytes per entry, total ACL memory is ~6.4MB.
pub const MAX_P2P_ACL_ENTRIES_SYSTEM: usize = 100_000;

P2P ACL lookup algorithm:

The P2P ACL is implemented as a fixed-capacity hash table with open addressing:

// umka-core/src/accel/p2p_acl.rs (kernel-internal)

/// P2P ACL table for a single device.
pub struct P2pAclTable {
    /// Hash table entries (capacity = power of 2, e.g., 1024).
    /// Uses linear probing for collision resolution.
    entries: SpinLock<[Option<P2pAclEntry>; CAPACITY]>,
    /// Number of active entries (for monitoring).
    len: AtomicUsize,
}

impl P2pAclTable {
    /// Look up the slot index of an ACL entry by (src_device, authorizing_cap).
    /// Returns `Some(slot)` if a valid, unexpired entry exists, `None` otherwise.
    /// The caller re-locks `entries` and indexes by slot to read or update the
    /// entry (e.g., `bytes_transferred` accounting) — returning a reference
    /// directly would outlive the lock guard.
    /// Time complexity: O(1) average case with a good hash function.
    pub fn lookup(&self, src_device: DeviceNodeId, cap: &CapabilityToken) -> Option<usize> {
        let hash = p2p_acl_hash(src_device, cap) as usize;
        let mask = CAPACITY - 1;  // CAPACITY is a power of 2
        let mut idx = hash & mask;

        let entries = self.entries.lock();  // lock once, not per probe
        for _ in 0..CAPACITY {
            match entries[idx] {
                Some(ref entry)
                    if entry.src_device_id == src_device && entry.authorizing_cap == *cap =>
                {
                    // Check expiration
                    if entry.expires_ns != 0 && entry.expires_ns < get_time_ns() {
                        return None;  // Expired
                    }
                    return Some(idx);
                }
                None => return None,  // Empty slot = not found
                _ => {}  // Collision, continue probing
            }
            idx = (idx + 1) & mask;  // Linear probing
        }

        None  // Table is full and entry not found
    }

    /// Insert a new ACL entry.
    /// Returns `Err(())` if the table is full (no allocation on error).
    /// Time complexity: O(1) average case.
    pub fn insert(&self, entry: P2pAclEntry) -> Result<(), ()> {
        if self.len.load(Ordering::Relaxed) >= MAX_P2P_ACL_ENTRIES_PER_DEVICE {
            return Err(());  // Table full
        }

        let hash = p2p_acl_hash(entry.src_device_id, &entry.authorizing_cap) as usize;
        let mask = CAPACITY - 1;
        let mut idx = hash & mask;

        let mut entries = self.entries.lock();
        for _ in 0..CAPACITY {
            if entries[idx].is_none() {
                entries[idx] = Some(entry);
                self.len.fetch_add(1, Ordering::Relaxed);
                return Ok(());
            }
            idx = (idx + 1) & mask;
        }

        Err(())  // Table full (no empty slot found)
    }

    /// Remove an ACL entry by (src_device, authorizing_cap).
    /// Returns `Some(entry)` if found and removed, `None` otherwise.
    /// NOTE: a full implementation must leave a tombstone in the vacated slot
    /// (see "ACL entry expiration" below) so that probe chains of colliding
    /// entries are not broken; this sketch simply clears the slot.
    /// Time complexity: O(1) average case.
    pub fn remove(&self, src_device: DeviceNodeId, cap: &CapabilityToken) -> Option<P2pAclEntry> {
        let hash = p2p_acl_hash(src_device, cap) as usize;
        let mask = CAPACITY - 1;
        let mut idx = hash & mask;

        let mut entries = self.entries.lock();
        for _ in 0..CAPACITY {
            let matches = match entries[idx] {
                Some(ref entry) =>
                    entry.src_device_id == src_device && entry.authorizing_cap == *cap,
                None => return None,  // Empty slot = not found
            };
            if matches {
                let removed = entries[idx].take().unwrap();
                self.len.fetch_sub(1, Ordering::Relaxed);
                return Some(removed);
            }
            idx = (idx + 1) & mask;  // Collision, continue probing
        }

        None  // Not found
    }
}

/// Hash function for P2P ACL lookups.
/// Combines src_device and capability token into a 64-bit hash.
fn p2p_acl_hash(src_device: DeviceNodeId, cap: &CapabilityToken) -> u64 {
    use umka_core::util::hash::fast_hash64;
    fast_hash64(&[
        src_device.0 as u64,
        cap.low as u64,
        cap.high as u64,
    ])
}
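`fast_hash64` is kernel-internal and its implementation is not specified here. As a stand-in, any well-mixing hash over the three 64-bit words works — for example FNV-1a, sketched below (the function name mirrors the one referenced above, but this body is an assumption, not the real implementation):

```rust
/// FNV-1a over a slice of 64-bit words, byte by byte. Deterministic and
/// order-sensitive; adequate as a bucket-spreading hash for a fixed-capacity
/// open-addressing table.
pub fn fast_hash64(words: &[u64]) -> u64 {
    const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
    const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut h = FNV_OFFSET;
    for w in words {
        for byte in w.to_le_bytes() {
            h ^= byte as u64;
            h = h.wrapping_mul(FNV_PRIME);
        }
    }
    h
}
```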

Memory layout and scaling:

  • Each P2pAclEntry is 64 bytes (1 cache line).
  • Per-device ACL table: 1024 entries × 64 bytes = 64KB.
  • System-wide ACL memory (32 GPUs): 32 × 64KB = 2MB.
  • For systems with 100 cgroups, each cgroup can have its own ACL table (64KB each), totaling ~6.4MB.

Lookup complexity:

  • Average case: O(1) with linear probing and load factor < 0.7.
  • Worst case (pathological hash collisions): O(n); the keyed hash function and the capacity headroom (load factor kept below ~0.7) make this unlikely in practice.
  • The hash table is NOT dynamically resized (to avoid allocation during lookup). Instead, the fixed capacity is chosen to be large enough for all expected use cases.

ACL entry expiration:

  • Entries with expires_ns != 0 are automatically considered expired after the deadline.
  • Expired entries are lazily removed on next lookup or insert (tombstone behavior).
  • A periodic ACL sweep (every 10 seconds) removes expired entries to prevent table pollution.

The ACL is consulted on every `p2p_dma_map` call. The mapping is denied if:
- No ACL entry exists for (src_device, dst_device)
- The authorizing capability has been revoked
- The byte limit has been exceeded
- The ACL entry has expired

**Anti-replay property**: The `ACCEL_P2P` capability token (0x0102) in `authorizing_cap`
is bound to the specific (src_device, dst_device) pair at ACL creation time. A process
holding `ACCEL_P2P` for devices A and B cannot replay the same capability token to
authorize P2P between devices A and C — step 2a of the authorization flow (above)
requires the client to hold `ACCEL_P2P` for **both** the source and target device, and
the resulting ACL entry records the specific pair. If the `authorizing_cap` is revoked
(e.g., the process loses access to one device via cgroup migration), all ACL entries
referencing that token are invalidated at lookup time (lazy revocation), and the
corresponding IOMMU P2P mappings are torn down on the next `p2p_dma_map` denial or
periodic ACL sweep (every 10 seconds).

**IOMMU Mediation**

All P2P DMA transactions are mediated by the IOMMU. Direct device-to-device DMA
without IOMMU translation is **never permitted**:

Security-critical design decision:

  P2P DMA path:  Device A → IOMMU → Device B's memory
  NOT:           Device A → Device B (direct PCIe P2P without IOMMU)

Rationale:
  1. The IOMMU provides address translation and access validation on every transaction
  2. The IOMMU can revoke access instantly by invalidating the IOVA mapping
  3. The IOMMU provides fault isolation if a device goes rogue
  4. The IOMMU enables per-transaction logging for security auditing


The IOMMU is programmed with per-device IOVA page tables. A P2P mapping creates an
IOVA entry in the source device's page table that translates to the target device's
physical BAR address. When the mapping is revoked, the IOVA entry is invalidated.

**Revocation Protocol**

P2P access must be revoked when:
- The authorizing capability is revoked (process exit, cgroup removal)
- A device is removed from the system (hot-unplug)
- A device is quarantined due to error or security event ([Section 21.1.3.2](#21132-crash-recovery))
- The cgroup limit is exceeded
- An explicit unmap request is made

Revocation is synchronous and blocking:

P2P Revocation Sequence:
  1. Kernel receives revocation trigger (capability revoke, device quarantine, etc.)
  2. Kernel looks up all P2P mappings involving the affected device(s)
  3. For each active mapping:
     a. Set mapping state to REVOKING
     b. Issue IOMMU TLB invalidation for the IOVA range
     c. Wait for in-flight transactions to complete (see below)
     d. Remove the mapping from both devices' ACL tables
     e. Free the P2pMappingHandle
  4. Signal completion to revocation requester


**In-Flight Transaction Handling**

The critical challenge is handling P2P DMA transactions that are in-flight when
revocation is requested. The kernel uses a quiesce-then-revoke protocol:

```rust
/// P2P mapping state machine.
#[repr(u8)]
pub enum P2pMappingState {
    /// Mapping is active and can be used for new transfers.
    Active = 0,
    /// Mapping is being revoked. No new transfers allowed.
    /// In-flight transactions are completing.
    Revoking = 1,
    /// Mapping is fully revoked. IOVA is invalid.
    Revoked = 2,
}

/// In-flight transaction tracking.
/// Each P2P mapping tracks pending transfers.
pub struct P2pMapping {
    pub handle: P2pMappingHandle,
    pub src_device: DeviceNodeId,
    pub dst_device: DeviceNodeId,
    pub state: AtomicU8,  // P2pMappingState
    pub iova_base: u64,
    pub iova_size: u64,
    /// Number of in-flight transfers using this mapping.
    /// Incremented before p2p_dma_transfer, decremented after fence completion.
    pub in_flight_count: AtomicU32,
    /// Wait queue for revocation to wait on in-flight completion.
    pub revoke_wait_queue: WaitQueue,
}
```

Revocation waits for in-flight transactions:

In-Flight Handling During Revocation:
  1. Set state to REVOKING (atomic store)
  2. New p2p_dma_transfer calls fail with EREVOKED
  3. Wait for in_flight_count to reach zero:
     - Each transfer increments in_flight_count before submission
     - Each transfer decrements after fence signals completion
     - Timeout: 100ms for local P2P, 1s for cross-node RDMA
  4. If timeout expires (in_flight_count still > 0):
     a. **Escalation Level 1: Device quiesce via preempt_context**
        - Call `driver.preempt_context(context, PreemptReason::P2pRevocation)`
        - Wait up to 10ms for preemption to complete (via fence poll)
        - Re-check in_flight_count
     b. **Escalation Level 2: Force IOMMU TLB invalidation**
        - Issue IOMMU TLB invalidation for the IOVA range
        - Wait for IOMMU invalidation completion (typically <100μs)
        - The IOMMU blocks new translations; in-flight DMA may complete or fault
     c. **Escalation Level 3: Device reset via PCIe FLR**
        - If in_flight_count remains > 0 after 100ms of forced invalidation:
        - Call `driver.device_reset(context)` to perform a context-level reset
        - If device_reset returns error or times out: escalate to full device reset
        - Perform PCIe Function Level Reset (FLR) on the source device
        - Wait for FLR completion (typically 100-500ms)
     d. Log the escalation level reached:
        - Level 1 success: "P2P revocation completed via cooperative quiesce"
        - Level 2 success: "P2P revocation completed via forced TLB invalidation"
        - Level 3 success: "P2P revocation completed via device reset"
        - Level 3 failure: "P2P revocation FAILED — device unresponsive, marking faulted"
  5. If escalation succeeded (in_flight_count == 0 or device reset completed):
     - Set state to REVOKED
     - Remove mapping from ACL tables
     - Free P2pMappingHandle
     - Wake any waiters on revoke_wait_queue
  6. If escalation failed (device unresponsive after FLR):
     - Set state to REVOKED (forcefully, even if in_flight_count > 0)
     - Mark device as Faulted in FMA ([Section 21.1.3.4](#21134-fma-integration))
     - Trigger crash recovery path ([Section 21.1.3.2](#21132-crash-recovery))
     - Remove mapping from ACL tables (best effort)
     - Free P2pMappingHandle

Rationale for escalation ladder:

  • Level 1 (preempt_context): Cooperative quiesce is fastest (~1-10ms) and least disruptive. Works on well-behaved devices that respect preemption requests.
  • Level 2 (TLB invalidation): Forces IOMMU to block new translations. In-flight DMAs may complete with data or trigger device errors. Fast (~100μs) but may cause data corruption if DMA was mid-transfer.
  • Level 3 (device reset): Nuclear option. Guarantees device quiesce by resetting all internal state. Expensive (100ms-5s) but necessary for unresponsive devices. May affect other contexts on the same device.

Data corruption handling:

If Level 2 or Level 3 escalation is used, the target memory may contain partial or corrupted data. The kernel does NOT attempt to repair the data — that is the responsibility of the application (which should use checksums or other integrity verification for critical data). The kernel's job is to ensure system stability and prevent the hung mapping from blocking revocation indefinitely.
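The in-flight counting discipline that makes quiesce-then-revoke safe can be modeled in a few lines. This is an illustrative sketch with hypothetical names, not the kernel's implementation — the real path uses `P2pMapping`'s atomics plus `revoke_wait_queue` with the timeouts and escalation ladder described above:

```rust
use std::sync::atomic::{AtomicU32, AtomicU8, Ordering};

const ACTIVE: u8 = 0;
const REVOKING: u8 = 1;
const REVOKED: u8 = 2;

/// Simplified model of the quiesce-then-revoke protocol.
pub struct MappingModel {
    state: AtomicU8,
    in_flight: AtomicU32,
}

impl MappingModel {
    pub fn new() -> Self {
        Self { state: AtomicU8::new(ACTIVE), in_flight: AtomicU32::new(0) }
    }

    /// Take an in-flight reference while the mapping is Active.
    /// Returns false (EREVOKED in the real protocol) once revocation begins.
    pub fn begin_transfer(&self) -> bool {
        // Optimistically count ourselves in-flight, then re-check state.
        // If revocation raced with us, undo the increment and fail.
        self.in_flight.fetch_add(1, Ordering::AcqRel);
        if self.state.load(Ordering::Acquire) != ACTIVE {
            self.in_flight.fetch_sub(1, Ordering::AcqRel);
            return false;
        }
        true
    }

    /// Called when the fence for a transfer signals completion.
    pub fn end_transfer(&self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }

    /// Revocation step 1-3: block new transfers, then check drain.
    /// The real kernel blocks on revoke_wait_queue with a timeout and
    /// escalates if the count never reaches zero; this model just
    /// reports whether the mapping is already quiesced.
    pub fn revoke(&self) -> bool {
        self.state.store(REVOKING, Ordering::Release);
        let drained = self.in_flight.load(Ordering::Acquire) == 0;
        if drained {
            self.state.store(REVOKED, Ordering::Release);
        }
        drained
    }
}
```

The increment-then-recheck ordering in `begin_transfer` is what closes the race window: a transfer that loses the race against the `REVOKING` store backs out its count, so the revoker never waits on a transfer that was refused.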

P2P Memory Accounting

All P2P DMA memory usage is accounted to the originating cgroup:

/sys/fs/cgroup/<group>/accel.p2p.current
# Current bytes transferred via P2P DMA (read-only)
# Sum across all devices this cgroup can access

/sys/fs/cgroup/<group>/accel.p2p.max
# Maximum bytes that can be transferred via P2P in the current period
# Format: "bytes period_us"
# Example: "10737418240 1000000" (10GB per second)
# Default: "max" (unlimited)

/sys/fs/cgroup/<group>/accel.p2p.stat
# P2P statistics (read-only):
#   total_bytes <cumulative bytes transferred>
#   mappings <current active P2P mappings>
#   revocations <times P2P was revoked>
#   timeout_revocations <revocations that timed out waiting for in-flight>

The byte counter is incremented on each p2p_dma_transfer completion (fence signaled). If a transfer fails or is aborted, the bytes are not counted.
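The "bytes period_us" quota format can be implemented as a simple per-period budget. The sketch below is illustrative (hypothetical names, single-threaded model); the kernel's accounting is charged on fence completion as described above:

```rust
/// Sketch of the per-period P2P byte budget from accel.p2p.max
/// ("bytes period_us"). "max" (unlimited) is modeled as u64::MAX.
pub struct P2pByteBudget {
    max_bytes: u64,
    period_us: u64,
    period_start_us: u64,
    used_bytes: u64,
}

impl P2pByteBudget {
    pub fn new(max_bytes: u64, period_us: u64) -> Self {
        Self { max_bytes, period_us, period_start_us: 0, used_bytes: 0 }
    }

    /// Charge a completed transfer (fence signaled). Returns false if the
    /// transfer would exceed this period's quota and must wait.
    pub fn try_charge(&mut self, now_us: u64, bytes: u64) -> bool {
        // Roll forward to the current period; usage resets each period.
        if now_us - self.period_start_us >= self.period_us {
            self.period_start_us =
                now_us - (now_us - self.period_start_us) % self.period_us;
            self.used_bytes = 0;
        }
        if self.used_bytes.saturating_add(bytes) > self.max_bytes {
            return false;
        }
        self.used_bytes += bytes;
        true
    }
}
```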

Device Quarantine and P2P

When a device is quarantined (Section 21.1.3.2), all P2P mappings involving that device are immediately revoked using the protocol above. This prevents a quarantined device from:
- Initiating new P2P DMA to other devices
- Receiving P2P DMA from other devices
- Corrupting or exfiltrating data via P2P paths

The quarantine process blocks until all P2P revocations complete. If any revocation times out (in-flight transactions do not complete), the device is force-reset via PCIe FLR or driver-specific reset, which guarantees termination of all DMA activity.

Security Audit Trail

All P2P authorization and revocation events are logged to the kernel audit subsystem:

Audit Events (logged to /sys/kernel/debug/umka/audit.log):
  P2P_MAP:    src_device=<id> dst_device=<id> size=<bytes> cap=<token> cgroup=<path>
  P2P_UNMAP:  mapping=<handle> reason=<explicit|cap_revoke|cgroup_exit|quarantine>
  P2P_REVOKE: mapping=<handle> src_device=<id> dst_device=<id> in_flight=<count> timed_out=<bool>
  P2P_ACL_ADD:    src_device=<id> dst_device=<id> cap=<token> max_bytes=<bytes>
  P2P_ACL_REMOVE: src_device=<id> dst_device=<id> reason=<cap_revoke|expired|quarantine>

These audit events are available for security monitoring and forensics.


21.3 Accelerator Isolation and Scheduling

21.3.1 Capability-Based Access Control

Every accelerator context is gated by the UmkaOS capability system:

// Extend existing cap_id constants in umka-driver-sdk/src/capability.rs

pub const ACCEL_COMPUTE: u32    = 0x0100;  // Submit compute work
pub const ACCEL_MEMORY: u32     = 0x0101;  // Allocate device memory
pub const ACCEL_P2P: u32        = 0x0102;  // P2P DMA transfers
pub const ACCEL_PREEMPT: u32    = 0x0103;  // Preempt other contexts (admin)
pub const ACCEL_PERF: u32       = 0x0104;  // Read performance counters
pub const ACCEL_POWER: u32      = 0x0105;  // Change power/clock state (admin)
pub const ACCEL_CONTEXT_RT: u32 = 0x0106;  // Create realtime-priority contexts

A container gets a capability token that says "50% compute time on GPU 0, 8GB VRAM limit." The kernel enforces this through the AccelScheduler and memory accounting.

21.3.2 Cgroup Integration

New cgroup controller: accel.

/sys/fs/cgroup/<group>/accel.devices
# Which accelerators this cgroup can access (by device index)
# Format: "0 1 3" (devices 0, 1, 3 allowed)

/sys/fs/cgroup/<group>/accel.memory.max
# Maximum total device memory across all accelerators (bytes)
# Format: "8589934592" (8GB)

/sys/fs/cgroup/<group>/accel.memory.current
# Current device memory usage (read-only)

/sys/fs/cgroup/<group>/accel.compute.guarantee
# Guaranteed compute bandwidth (microseconds per second, per device)
# Format: "device_idx quota period"
# Example: "0 500000 1000000" (50% of GPU 0 guaranteed)

/sys/fs/cgroup/<group>/accel.compute.max
# Maximum compute bandwidth ceiling
# Format: same as guarantee

/sys/fs/cgroup/<group>/accel.compute.weight
# Relative share of excess compute time (like cpu.weight)
# Default: 100

/sys/fs/cgroup/<group>/accel.priority
# Default priority for contexts created by this cgroup
# "background", "normal", "high"
# ("realtime" requires ACCEL_CONTEXT_RT capability)

/sys/fs/cgroup/<group>/accel.stat
# Usage statistics (read-only):
#   compute_time_us <total compute microseconds>
#   memory_current <current bytes>
#   memory_peak <peak bytes>
#   submissions <total command submissions>
#   preemptions <times preempted by higher priority>
#   faults <device page faults>
#   migrations <pages migrated>

21.3.3 Memory Isolation

Each AccelContext has its own device-side page table (or partition of the device's address space). The kernel ensures:

  1. Context A cannot access Context B's device memory (separate page tables / address spaces).
  2. Context A cannot exceed its accel.memory.max cgroup limit.
  3. When Context A is destroyed, all its device memory is freed immediately.
  4. The OOM killer is aware of device memory. If a process is consuming excessive device memory and the device is under pressure, the OOM killer can target it.

Device Memory and the OOM Killer:

Device memory is additive to a process's OOM score — device_bytes is counted as an RSS equivalent. Before the OOM killer selects a victim, the kernel attempts soft reclaim: evicting device pages back to CPU RAM (using the migrate_pages vtable call). Per-device memory pressure is tracked with high/low watermarks; when the high watermark is breached, the AccelScheduler proactively triggers eviction of cold pages from device memory to CPU RAM. If a process exceeds its cgroup accel.memory.max limit, the allocation returns ENOMEM rather than triggering OOM kill — the process must handle the allocation failure gracefully.
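The "device memory is RSS-equivalent" scoring rule can be sketched as follows. This is an illustrative model with hypothetical names (the real OOM killer also weighs oom_score_adj and other factors not shown here):

```rust
/// Sketch of device-memory-aware OOM scoring: device bytes are counted
/// as an RSS equivalent, additive to the process's CPU-side footprint.
pub struct ProcMemUsage {
    pub rss_bytes: u64,
    pub device_bytes: u64,
}

/// Badness = total footprint as a fraction of reclaimable memory,
/// scaled to the conventional 0..1000 range (higher = better victim).
pub fn oom_badness(p: &ProcMemUsage, total_reclaimable_bytes: u64) -> u64 {
    let footprint = p.rss_bytes + p.device_bytes; // device memory is additive
    footprint.saturating_mul(1000) / total_reclaimable_bytes.max(1)
}
```

A process holding 6GB of VRAM and 2GB of RSS thus scores the same as a pure-CPU process with an 8GB RSS, which is what makes soft reclaim (evicting device pages first) worthwhile before any kill decision.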

21.3.4 Compute Time Isolation

The AccelScheduler enforces compute time limits per context:

  • accel.compute.max: Hard ceiling. If a context has used its quota for this period, its submissions are queued until the next period. (Same semantics as cpu.max.)
  • accel.compute.guarantee: Minimum bandwidth via CBS server. (Same algorithm as cpu.guarantee from Section 6.3.)
  • accel.compute.weight: Proportional sharing of compute time not covered by guarantees. (Same semantics as cpu.weight.)
  • Preemption: If a high-priority context's CBS server becomes runnable, the scheduler preempts the current context (if hardware supports it). Otherwise, the current submission runs to completion and the high-priority context gets the next slot.
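The accel.compute.max ceiling (first bullet above) reduces to a replenish-and-throttle check per period. A minimal sketch, with illustrative names that do not match kernel internals:

```rust
/// Sketch of the accel.compute.max hard ceiling (cpu.max semantics):
/// once a context exhausts its quota, submissions queue until the
/// quota replenishes at the next period boundary.
pub struct ComputeCeiling {
    quota_us: u64,
    period_us: u64,   // assumed nonzero
    used_us: u64,
    period_index: u64,
}

impl ComputeCeiling {
    pub fn new(quota_us: u64, period_us: u64) -> Self {
        Self { quota_us, period_us, used_us: 0, period_index: 0 }
    }

    /// Account completed compute time and decide whether the context may
    /// submit more work in the current period or must wait for the next.
    pub fn may_submit(&mut self, now_us: u64, just_ran_us: u64) -> bool {
        let idx = now_us / self.period_us;
        if idx != self.period_index {
            self.period_index = idx; // new period: quota replenished
            self.used_us = 0;
        }
        self.used_us += just_ran_us;
        self.used_us < self.quota_us // throttle once the ceiling is hit
    }
}
```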

21.3.4.1 Context Preemption Memory Policy

When a context is preempted, its VRAM allocations cannot simply be abandoned — the device memory contains live computation state and data that may be needed when the context is rescheduled. This subsection specifies what happens to a context's VRAM allocations when it is preempted.

Memory residency state: Each AccelContextState carries a memory residency field:

/// Tracks where a preempted context's device-local memory currently resides.
/// Only meaningful when the context is not actively running on hardware.
#[repr(u32)]
pub enum AccelMemResidency {
    /// Memory is in device VRAM. Context can resume immediately upon
    /// rescheduling (subject to command-queue setup overhead only).
    InVram = 0,

    /// Memory has been migrated to CPU RAM via HMM (Section 21.2.1.4).
    /// Context must migrate pages back to VRAM before resuming.
    EvictedToCpuRam = 1,

    /// Memory is in CPU RAM and has been compressed by the memory subsystem
    /// (Section 4.2). Context must decompress and migrate back before resuming.
    /// Transparent to the AccelScheduler — HMM handles decompression on demand.
    EvictedAndCompressed = 2,
}

Three-tier policy: When the AccelScheduler preempts a context, it applies one of three memory handling strategies depending on system pressure at the time of preemption:

Tier 1 — Leave in place (default, zero cost)

If VRAM utilization across the device is below vram_pressure_threshold (a tunable parameter, default 85%), the preempted context's VRAM allocations are left untouched. The context is descheduled from the hardware queue but its memory remains resident. The AccelMemResidency state is InVram. On rescheduling, the context resumes with no migration cost (<1μs overhead for command-queue re-setup only).

Tier 2 — Migrate to CPU RAM (VRAM pressure)

If vram_utilization > vram_pressure_threshold at the time of preemption, the kernel invokes HMM migration to move the preempted context's VRAM pages to CPU RAM (Section 21.2.1.4). The driver's migrate_pages() vtable entry (Section 21.1.2.2) is called with MigrationFlags::TO_CPU. Migration cost is approximately 10–100μs per 4MB depending on PCIe generation and link utilization. The context's AccelMemResidency transitions to EvictedToCpuRam.

The vram_pressure_threshold parameter is registered with the tunable parameter store (Section 22.1.3) under the name "accel.vram_pressure_threshold_pct" (default 85, range 50–99).

Tier 3 — Compress in CPU RAM (system memory pressure)

If CPU RAM is also under memory pressure after Tier 2 migration, the memory subsystem's LZ4 compression path (Section 4.2) transparently compresses the migrated pages and places them in the compressed pool. This is not triggered by the AccelScheduler directly — it occurs automatically when the memory subsystem reclaims pages from contexts in EvictedToCpuRam state. The AccelMemResidency transitions to EvictedAndCompressed. The HMM layer handles decompression transparently when pages are needed for migration back to VRAM.

Priority-driven eviction: A new high-priority context that cannot fit its VRAM allocation in available device memory MAY trigger eviction of an already-preempted lower-priority context's CPU RAM pages (Tier 3 compression). Selection criteria:

  1. Only contexts with AccelMemResidency::EvictedToCpuRam are eligible for forced Tier 3 compression (contexts that are InVram are preempted first per normal scheduling policy before their memory is migrated).
  2. Among eligible contexts, select the one with the lowest CBS priority (highest deadline value).
  3. Among ties in priority: select the largest VRAM footprint, maximizing the space freed per eviction operation.

Higher-priority contexts (lower CBS deadline) never have their memory evicted to satisfy lower-priority allocations. The invariant is: a context's AccelMemResidency can only be advanced toward EvictedAndCompressed by a request from an equal-or-higher-priority context that cannot otherwise make progress.
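The three selection criteria can be expressed as a single filter-then-max pass. A sketch with hypothetical types (the real scheduler operates on AccelContextState, not this simplified struct):

```rust
#[derive(Clone, Copy, PartialEq)]
pub enum Residency { InVram, EvictedToCpuRam, EvictedAndCompressed }

pub struct Ctx {
    pub id: u32,
    pub residency: Residency,
    pub cbs_deadline: u64, // higher deadline = lower CBS priority
    pub vram_footprint: u64,
}

/// Pick the context whose CPU RAM pages should be force-compressed
/// (Tier 3) to make room for a new high-priority allocation.
pub fn pick_tier3_victim(ctxs: &[Ctx]) -> Option<u32> {
    ctxs.iter()
        // Rule 1: only already-evicted contexts are eligible.
        .filter(|c| c.residency == Residency::EvictedToCpuRam)
        // Rule 2: lowest priority (highest deadline) first;
        // Rule 3: break ties by largest VRAM footprint.
        .max_by(|a, b| {
            a.cbs_deadline.cmp(&b.cbs_deadline)
                .then(a.vram_footprint.cmp(&b.vram_footprint))
        })
        .map(|c| c.id)
}
```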

Resume latency summary:

| Residency state      | Resume cost                       | Trigger condition               |
|----------------------|-----------------------------------|---------------------------------|
| InVram               | <1μs                              | VRAM utilization < threshold    |
| EvictedToCpuRam      | 10–100μs per 4MB (PCIe transfer)  | VRAM utilization ≥ threshold    |
| EvictedAndCompressed | 50–500μs (decompress + PCIe)      | CPU RAM pressure after eviction |

Resume latency is added to the context's AccelContextState::total_compute_ns accounting as wait time, not compute time, so CBS bandwidth budgets are not penalized for memory residency delays caused by other contexts' pressure.


21.3.4.2 Non-Preemptible Budget Overspend Handling

Unlike CPU CBS where a task can be preempted mid-quantum, accelerator operations often cannot be interrupted once submitted to hardware. When a CBS context's budget is exhausted mid-operation:

Detection: The CBS accounting thread (runs on each scheduler tick, 1ms resolution) detects overspend when runtime_consumed > bandwidth_ns for the current period (see AccelCbsServer fields above). It checks whether the context is currently executing a non-preemptible command (AccelCmdFlags::NON_PREEMPTIBLE set by the driver at submission, or preemption_granularity == PreemptionGranularity::None for the device).

Response to overspend:

  1. Non-preemptible operation in progress: The current command is allowed to complete. The overspend amount (runtime_consumed - bandwidth_ns) is recorded as debt_us in the CBS state.

  2. Debt carry-forward: In the next CBS period, the context's effective budget is reduced by debt_us:

         effective_budget_next = max(CBS_MIN_BUDGET, period_budget - debt_us)
         CBS_MIN_BUDGET = 100μs   // prevents complete starvation from one oversized command

  3. Debt cap: Accumulated debt cannot exceed 3× the period budget. Excess debt is forgiven (prevents a single oversized command from penalizing a context across many future periods).

  4. Suspension: If a context accumulates debt equal to 5× its period budget within a 10-second window (indicating systematic abuse), it is suspended for one full CBS period and the process is sent SIGXCPU (compatible with Linux's CPU bandwidth enforcement signal for user-facing tooling).

  5. Accounting precision: Overspend is measured using hardware completion timestamps (from the device's completion event ring) for preemptible commands, and estimated from submitted_timestamp_ns + avg_cmd_duration_us for non-preemptible commands (hardware does not always report sub-command timing).
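Rules 1-3 above are pure arithmetic and can be sketched directly. Illustrative only; the names need not match the kernel's AccelCbsServer fields:

```rust
const CBS_MIN_BUDGET_US: u64 = 100;

/// Sketch of the CBS debt carry-forward arithmetic (rules 1-3).
pub struct CbsDebt {
    pub period_budget_us: u64,
    pub debt_us: u64,
}

impl CbsDebt {
    /// Record overspend at period end (rule 1) with the 3x cap (rule 3).
    pub fn record_overspend(&mut self, runtime_us: u64) {
        let over = runtime_us.saturating_sub(self.period_budget_us);
        self.debt_us = (self.debt_us + over).min(3 * self.period_budget_us);
    }

    /// Effective budget for the next period (rule 2): debt is repaid,
    /// but never below the starvation floor.
    pub fn effective_budget_next(&self) -> u64 {
        self.period_budget_us
            .saturating_sub(self.debt_us)
            .max(CBS_MIN_BUDGET_US)
    }
}
```

Note how the 3x cap interacts with the floor: a single enormous command drives debt to at most 3x the budget, which pins the next few periods at CBS_MIN_BUDGET rather than zeroing them out indefinitely.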

FMA reporting: Systematic overspend (debt > period budget for 10+ consecutive periods) triggers a cbs_budget_violation FMA event (Section 19.1) with the context ID, device, and debt statistics. Operators can use this to identify and cap runaway accelerator workloads.


21.3.5 Device Partitioning

For hardware that supports partitioning (NVIDIA MIG, AMD spatial partitioning):

/// Device partition descriptor.
#[repr(C)]
pub struct AccelPartition {
    /// Partition index.
    pub index: u32,
    /// Compute units assigned to this partition (same semantics as
    /// `AccelDeviceInfo::compute_units` — see Section 21.1.2.3).
    pub compute_units: u32,
    /// Memory assigned to this partition (bytes).
    pub memory_bytes: u64,
    /// Unique partition ID (for cgroup binding).
    pub partition_id: u64,
}

The device registry models partitions as child nodes of the GPU device node:

pci0000:00
  +-- 0000:41:00.0 (GPU, NVIDIA A100)
      +-- partition0 (MIG 2g.20gb: 28 SMs, 20GB)
      +-- partition1 (MIG 2g.20gb: 28 SMs, 20GB)
      +-- partition2 (MIG 3g.40gb: 42 SMs, 40GB)
      -- (108 total SMs: 98 available for MIG partitions, fully allocated by the three instances above; 10 reserved)

Each partition is an independently schedulable and isolatable unit. Cgroups can be bound to specific partitions: echo "0000:41:00.0/partition0" > accel.devices.

21.3.6 GPU Virtualization Modes

GPUs support multiple virtualization approaches. The kernel accommodates all four without imposing one model.

Note: VFIO passthrough (mode 1) and SR-IOV (mode 2) are general-purpose PCIe virtualization technologies — widely used for NICs, NVMe controllers, FPGAs, and other devices. The general IOMMU/VFIO mechanism is described in Section 10.5.3.8. This section describes their specific application to GPUs and accelerators. Modes 3 (mdev) and 4 (MIG) are GPU/accelerator-specific.

1. PCIe Passthrough (VFIO):
   Entire GPU assigned to a single VM via IOMMU.
   Kernel role: IOMMU group management, VFIO device file (/dev/vfio/N).
   Performance: near-native. No sharing between VMs.
   Use case: dedicated GPU per VM (HPC, gaming).
   (This is the same VFIO mechanism used for NIC passthrough, NVMe
   passthrough, etc. — the interface is device-agnostic.)

2. SR-IOV (Single Root I/O Virtualization):
   Hardware creates Virtual Functions (VFs), each a separate PCIe
   device with its own BARs and IOMMU mapping.
   Kernel role: enumerate VFs, create device registry nodes for each,
   assign VFs to VMs via VFIO. AccelScheduler is NOT involved — each VF
   is an independent device scheduled by its own firmware.
   GPU support: Intel Data Center GPU Max, some AMD MI-series.
   Not supported by: NVIDIA consumer GPUs, most AMD consumer GPUs.
   (SR-IOV is more common for NICs — e.g., Intel E810, Mellanox
   ConnectX — where each VF provides an independent network interface.
   The kernel handles GPU VFs and NIC VFs identically at the PCIe level.)

3. Mediated Passthrough (mdev / vGPU):
   Software-defined GPU partitions, managed by the GPU driver.
   NVIDIA vGPU (GRID) and Intel GVT-g use this model.
   (AMD MxGPU uses SR-IOV, not mdev — see #2 above.)
   Kernel role:
     - mdev framework: /sys/bus/mdev/ device lifecycle (same as Linux).
     - Each mdev appears as a VFIO device to the VM (same API as #1).
     - The GPU driver (Tier 1) handles the actual time-slicing and
       memory partitioning internally.
     - AccelScheduler can enforce per-mdev cgroup limits if the driver
       exposes per-mdev utilization via get_utilization().
   Trade-off: more flexible than SR-IOV, but relies on driver quality.

4. MIG (Multi-Instance GPU) — NVIDIA A100/H100:
   Hardware partitioning into isolated GPU instances, each with
   dedicated SMs, memory controllers, and L2 cache.
   Modeled as child nodes in device registry (Section 21.3.5 above).
   Can be combined with VFIO: each MIG instance can be passed through
   to a separate VM.

The KABI AccelBase vtable supports all four modes — the kernel sees devices (physical, SR-IOV VF, MIG partition, or mdev) through the same interface. The virtualization mode is a configuration choice, not an architectural difference.

21.3.7 Hardware Reset-on-Timeout (HROT)

Problem: Many GPU and AI accelerator architectures (older CUDA generations, most NPUs, custom ASICs) do not support fine-grained preemption of long-running compute kernels. Once a kernel is submitted to hardware, it runs to completion — or hangs forever. A single misbehaving or hostile workload can render the accelerator unusable for all other processes.

Section 21.3.4 describes compute time isolation via CBS bandwidth servers and cooperative preemption, but these mechanisms depend on the device eventually yielding control. preempt_context() (Section 21.1.2.2) can only stop feeding new work at the next command buffer boundary — it cannot interrupt a running dispatch. If the dispatch itself hangs (infinite loop shader, firmware deadlock, hardware fault), no amount of cooperative scheduling recovers the device. HROT is the escalation path for this failure mode.

21.3.7.1 HROT State Machine

Every AccelContext managed by the AccelScheduler has a watchdog timer. The state machine governs the lifecycle of a submission from the kernel's perspective:

IDLE ──submit──> RUNNING ──complete──> IDLE
                    │
                    └── soft_timeout ──> WATCHDOG_EXPIRED
                                            │          │
                                 preempt succeeds    hard_timeout
                                            │          │
                                            ▼          ▼
                                          IDLE      RESETTING ──reset_complete──> IDLE
                                                    (context destroyed if owner still hung)

State transitions:

  • IDLE -> RUNNING: The AccelScheduler calls submit_commands() on the driver. The watchdog timer is armed with hrot.soft_timeout_ms.
  • RUNNING -> IDLE (normal): The driver signals completion via the fence mechanism (Section 21.1.2.3, AccelFence). The watchdog is disarmed.
  • RUNNING -> WATCHDOG_EXPIRED: The soft timeout fires. The kernel attempts driver-level preemption via preempt_context(). If the device supports preemption (AccelPreemptionGranularity is not CommandBoundary), this may succeed within ~1ms. A hard timeout timer is armed with hrot.hard_timeout_ms.
  • WATCHDOG_EXPIRED -> RESETTING: The hard timeout fires without the device completing or yielding. The kernel escalates to a hardware reset according to hrot.hard_action.
  • RESETTING -> IDLE: The device reset completes, the driver re-initializes (same flow as crash recovery in Section 21.1.3.2), and the context is destroyed or recreated depending on whether the owning process is still alive.

21.3.7.2 HROT Configuration

System-level tunable defaults: The default HROT timeout values are registered as KernelTunableParam entries (Section 22.1.3) rather than compile-time constants. This allows operators to adjust them at runtime for workloads with legitimately long execution times (e.g., large model compilation, offline batch inference) without recompiling the kernel.

// umka-core/src/accel/hrot.rs

/// Soft watchdog timeout default (milliseconds).
/// On soft timeout expiry, the kernel sends PreemptReason::WatchdogSoftTimeout
/// to the driver. The hard timeout timer is then armed.
/// Registered in KernelParamStore at boot via register_param!().
pub const ACCEL_SOFT_TIMEOUT_DEFAULT_MS: u64 = 5_000;

/// Hard watchdog timeout default (milliseconds).
/// On hard timeout expiry, the kernel escalates to accel_hard_reset().
/// Registered in KernelParamStore at boot via register_param!().
pub const ACCEL_HARD_TIMEOUT_DEFAULT_MS: u64 = 30_000;

// The params are registered at boot as:
//
//   register_param!("accel.watchdog.soft_timeout_ms",
//       default: ACCEL_SOFT_TIMEOUT_DEFAULT_MS,
//       min: 1_000,       // 1 second minimum — prevents spurious resets
//       max: 300_000,     // 5 minutes maximum
//       decay_period_ms: 0,  // manual only, no auto-decay
//       subsystem: SubsystemId::Accel,
//   );
//
//   register_param!("accel.watchdog.hard_timeout_ms",
//       default: ACCEL_HARD_TIMEOUT_DEFAULT_MS,
//       min: 2_000,
//       max: 600_000,     // 10 minutes maximum
//       decay_period_ms: 0,
//       subsystem: SubsystemId::Accel,
//   );

Invariant: hard_timeout_ms > soft_timeout_ms. The kernel MUST enforce this on every write to either parameter. Attempts to set hard_timeout_ms <= soft_timeout_ms are rejected with EINVAL. Attempts to raise soft_timeout_ms to a value that would violate the invariant against the current hard_timeout_ms are similarly rejected. Operators updating both parameters must therefore order the writes (hard_timeout_ms first when increasing both, soft_timeout_ms first when decreasing both) to avoid transient invariant violations.
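The per-write invariant check can be sketched as a pair of validated setters. Names here are illustrative, not the KernelParamStore API:

```rust
const EINVAL: i32 = 22;

/// Sketch of the hard_timeout_ms > soft_timeout_ms invariant, enforced
/// on every write to either watchdog tunable.
pub struct WatchdogParams {
    pub soft_timeout_ms: u64,
    pub hard_timeout_ms: u64,
}

impl WatchdogParams {
    pub fn set_soft(&mut self, v: u64) -> Result<(), i32> {
        // Soft timeout may never reach or exceed the current hard timeout.
        if v >= self.hard_timeout_ms {
            return Err(EINVAL);
        }
        self.soft_timeout_ms = v;
        Ok(())
    }

    pub fn set_hard(&mut self, v: u64) -> Result<(), i32> {
        // Hard timeout must stay strictly above the current soft timeout.
        if v <= self.soft_timeout_ms {
            return Err(EINVAL);
        }
        self.hard_timeout_ms = v;
        Ok(())
    }
}
```

This is also why write ordering matters for operators: with defaults (5s, 30s), moving to (60s, 120s) succeeds only if hard_timeout_ms is raised first.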

Per-context override: Individual contexts may specify shorter timeouts via HrotConfig (embedded in AccelContextLimits, described below). The effective timeout applied to any submission is:

effective_soft_timeout_ms = min(HrotConfig::soft_timeout_ms,
                                 param_store["accel.watchdog.soft_timeout_ms"])

effective_hard_timeout_ms = min(HrotConfig::hard_timeout_ms,
                                 param_store["accel.watchdog.hard_timeout_ms"])

A per-context timeout of 0 means "use the system tunable." A per-context timeout greater than the system tunable is silently clamped to the system tunable — individual contexts cannot exceed the system-wide ceiling.
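The zero-means-default and silent-clamp rules collapse to a one-line helper (hypothetical name, shown for clarity):

```rust
/// Sketch of the effective-timeout computation: a per-context value of 0
/// selects the system tunable, and nonzero values are silently clamped
/// so no context can exceed the system-wide ceiling.
pub fn effective_timeout_ms(per_context_ms: u32, system_tunable_ms: u32) -> u32 {
    if per_context_ms == 0 {
        system_tunable_ms
    } else {
        per_context_ms.min(system_tunable_ms)
    }
}
```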

HrotConfig is embedded in AccelContextLimits and configurable per-context at creation time. The kernel enforces minimum values to prevent userspace from disabling the watchdog entirely (a process requesting hard_timeout_ms = 0 gets the system tunable default).

/// Hardware Reset-on-Timeout configuration for an accelerator context.
/// Embedded in AccelContextLimits. Governs how the kernel responds when
/// a submitted command does not complete within the expected time.
/// Timeout values here are per-context overrides; the system-level defaults
/// are governed by the "accel.watchdog.soft_timeout_ms" and
/// "accel.watchdog.hard_timeout_ms" KernelTunableParam entries
/// (see KernelParamStore, Section 22.1.3).
#[repr(C)]
pub struct HrotConfig {
    /// Soft timeout override (milliseconds). 0 = use system tunable default.
    /// Clamped to [1_000, system_tunable_soft_timeout_ms].
    /// Warn + attempt driver-level preemption via preempt_context() on expiry.
    pub soft_timeout_ms: u32,

    /// Hard timeout override (milliseconds). 0 = use system tunable default.
    /// Must be > soft_timeout_ms (kernel rejects with EINVAL otherwise).
    /// Clamped to [soft_timeout_ms + 1_000, system_tunable_hard_timeout_ms].
    pub hard_timeout_ms: u32,

    /// Action to take when the hard timeout fires.
    pub hard_action: HrotAction,

    /// Explicit padding for repr(C) stability.
    pub _pad: [u8; 28],
}

/// Action taken when HROT hard timeout fires.
#[repr(u8)]
pub enum HrotAction {
    /// Destroy the offending context, deliver SIGKILL to the submitting
    /// process, reset the accelerator engine, and allow other contexts
    /// to continue. This is the default and preferred action.
    /// Requires AccelDeviceHrotCaps::supports_context_reset == true.
    /// If the device does not support per-context reset, the kernel
    /// automatically falls back to ResetDevice.
    KillContextAndReset = 0,

    /// Destroy ALL contexts on the device and perform a full device reset
    /// (PCIe FLR or vendor-specific reset sequence). Used when per-context
    /// reset is not supported by the hardware, or when a per-context reset
    /// has already failed.
    ResetDevice = 1,

    /// Do not reset — just log a warning and deliver SIGKILL. Use when
    /// the hardware does not support reset (e.g., some FPGAs without
    /// partial reconfiguration). The device remains unavailable until
    /// the process exits and the driver reinitializes.
    LogAndKill = 2,
}

21.3.7.3 Watchdog Implementation

The watchdog runs on the kernel's timer subsystem (Section 6.5), not on the accelerator. One timer per active submission, managed by the AccelScheduler.

/// Called by the AccelScheduler's periodic tick (typically 1ms resolution).
/// Checks whether the current submission on the given context has exceeded
/// its timeout thresholds.
fn accel_watchdog_tick(ctx: &AccelContext) {
    let elapsed = now() - ctx.last_submit_time.load(Acquire);

    if elapsed > ctx.hrot.soft_timeout_ms as u64 * 1_000_000 {
        if ctx.soft_reset_attempted.compare_exchange(
            false, true, AcqRel, Acquire
        ).is_ok() {
            // Attempt driver-level preemption (send stop command to
            // hardware queue). If the hardware supports it, this
            // completes within ~1ms. On non-preemptible hardware,
            // this is a cooperative yield request — it has no effect
            // if the dispatch is truly hung.
            ctx.driver.vtable.preempt_context(
                ctx.handle,
                PreemptReason::WatchdogSoftTimeout,
            );
            // Log to FMA telemetry ring (Section 21.1.3.4).
            fma_report(FmaEvent::AccelSoftTimeout {
                device: ctx.device_id,
                context: ctx.id,
                elapsed_ms: (elapsed / 1_000_000) as u32,
            });
        }
    }

    if elapsed > ctx.hrot.hard_timeout_ms as u64 * 1_000_000 {
        // Hard timeout — escalate to hardware reset.
        accel_hard_reset(ctx);
    }
}

/// Escalation path when soft preemption has failed and the hard timeout
/// fires. This is the last resort — the device is assumed hung.
fn accel_hard_reset(ctx: &AccelContext) {
    match ctx.hrot.hard_action {
        HrotAction::KillContextAndReset => {
            // 1. Mark context as zombie (reject new submissions).
            ctx.state.store(ContextState::Zombie, Release);
            // 2. Send SIGKILL to the owning process.
            signal_send(ctx.owner_pid, SIGKILL);
            // 3. Call driver reset. The driver calls PCIe FLR or a
            //    vendor-specific reset sequence (e.g., NVIDIA GSP
            //    reset, AMD SDMA drain). The reset is synchronous
            //    from the kernel's perspective — the driver returns
            //    when the device is ready for re-initialization.
            ctx.driver.vtable.device_reset(ctx.handle);
            // 4. Re-initialize the device for other contexts.
            //    Same flow as crash recovery (Section 21.1.3.2 step 5).
            ctx.driver.vtable.device_init(ctx.handle);
            // 5. Log the event via FMA.
            fma_report(FmaEvent::AccelHardReset {
                device: ctx.device_id,
                context: ctx.id,
            });
        }
        HrotAction::ResetDevice => {
            // Reset affects ALL contexts on the device. Each context's
            // owning process receives an error on pending submissions.
            // This reuses the crash recovery path (Section 21.1.3.2).
            accel_reset_all_contexts(ctx.device);
        }
        HrotAction::LogAndKill => {
            // Cannot reset the device. Kill the process and hope the
            // driver can reinitialize when the process's context is
            // dropped. This is the worst case — the device may remain
            // hung until the driver is reloaded or the system reboots.
            signal_send(ctx.owner_pid, SIGKILL);
            fma_report(FmaEvent::AccelHung {
                device: ctx.device_id,
            });
        }
    }
}

device_reset and device_init vtable entries (F3 fix): HROT requires two new entries in AccelBaseVTable, extending the base vtable defined in Section 21.1.2.2:

// Additions to AccelBaseVTable:

/// Reset a single context's hardware state. Called when HROT fires with
/// KillContextAndReset. The driver must drain the device's command queue
/// for this context, discard any in-flight work, and leave the device in
/// a state where other contexts can continue submitting.
/// Returns IO_NOT_SUPPORTED if per-context reset is not possible (kernel
/// falls back to ResetDevice).
pub device_reset: Option<unsafe extern "C" fn(
    ctx: *mut c_void,
    context: AccelContextHandle,
) -> IoResultCode>,

/// Full device re-initialization after a reset. Called after device_reset
/// when per-context reset was used, or after PCIe FLR when full device
/// reset was used. The driver must restore the device to a clean state
/// equivalent to a fresh driver load.
pub device_init: Option<unsafe extern "C" fn(
    ctx: *mut c_void,
) -> IoResultCode>,

Error handling and retry policy (F3 fix):

| Operation | Error Return | Kernel Action | Retry Count |
| --- | --- | --- | --- |
| `device_reset` returns error | `EIO`, `ETIMEDOUT`, etc. | Escalate to full device reset (PCIe FLR) immediately | 0 retries — escalate immediately |
| `device_init` returns error | Any error | Mark device as Faulted, trigger crash recovery path | Up to 3 retries with exponential backoff (100ms, 200ms) |
| `device_init` fails 3× | — | Give up, mark device permanently faulted, alert administrator | N/A — maximum retries exceeded |

Retry details for device_init:

/// Attempt to initialize a device after reset, with retry logic.
/// Returns IO_OK on success, or an error code after all retries exhausted.
fn hrot_device_init_with_retry(
    driver: &AccelDriver,
    ctx: *mut c_void,
) -> IoResultCode {
    const MAX_INIT_RETRIES: u32 = 3;
    const INIT_RETRY_DELAY_MS: u64 = 100;

    let mut last_err = IO_OK;
    for attempt in 0..MAX_INIT_RETRIES {
        match unsafe { driver.vtable.device_init(ctx) } {
            IO_OK => return IO_OK,

            err => {
                last_err = err;
                fma_report(FmaEvent::AccelDeviceInitFailed {
                    device: driver.device_id,
                    attempt,
                    error_code: err,
                });

                if attempt < MAX_INIT_RETRIES - 1 {
                    // Wait before retry (exponential backoff: 100ms, 200ms)
                    sleep_ms(INIT_RETRY_DELAY_MS << attempt);
                }
            }
        }
    }

    // All retries exhausted — surface the last driver error to the caller.
    last_err
}

/// HROT escalation path when device_reset or device_init fails.
fn hrot_reset_failure_escalation(ctx: &AccelContext) {
    // Step 1: Attempt per-context reset first (least disruptive).
    let reset_result = unsafe { ctx.driver.vtable.device_reset(ctx.handle) };
    if reset_result != IO_OK {
        // Per-context reset failed or is not supported — escalate to a
        // full device reset via PCIe FLR, then re-initialize.
        fma_report(FmaEvent::AccelContextResetFailed {
            device: ctx.device_id,
            context: ctx.id,
            error_code: reset_result,
        });

        // Step 2: Full device re-initialization (affects ALL contexts).
        //         hrot_device_init_with_retry performs up to 3 attempts
        //         with exponential backoff.
        let init_result = hrot_device_init_with_retry(&ctx.driver, ctx.handle);
        if init_result != IO_OK {
            // All retries exhausted — the device is permanently faulted.
            fma_report(FmaEvent::AccelDevicePermanentlyFaulted {
                device: ctx.device_id,
                reason: "device_init failed after 3 retries",
                error_code: init_result,
            });

            // Transition device to Faulted state.
            ctx.device.mark_faulted();

            // Trigger crash recovery path (driver unload/reload).
            trigger_crash_recovery(ctx.device);
        }
    }
}

State transitions on error:

device_reset error → device_init (full device re-init)
                       ↓
                  device_init success → Device state: Active, contexts: re-established
                       ↓
                  device_init error (attempt 1) → Retry after 100ms
                       ↓
                  device_init error (attempt 2) → Retry after 200ms
                       ↓
                  device_init error (attempt 3) → Device state: Faulted
                                                   Trigger crash recovery
                                                   Alert administrator

Crash recovery trigger:

When device_init fails after all retries, the kernel triggers the crash recovery path (Section 21.1.3.2):

  1. Mark device as Faulted in FMA.
  2. Unload the driver module (if possible).
  3. Perform PCIe FLR to reset the device to a known state.
  4. Reload the driver module.
  5. Attempt device_init one more time (fresh driver load).
  6. If still failing, keep device offline and alert administrator.

21.3.7.4 Per-Submission Timeout

In addition to the context-level watchdog, individual command submissions can carry a deadline. This is useful for ML inference workloads with SLO (Service Level Objective) requirements — a single inference that exceeds its deadline should be killed without waiting for the full HROT hard timeout.

/// Per-submission deadline. Carried in the submission descriptor passed
/// to submit_commands() (Section 21.1.2.2). If the submission does not
/// complete by the deadline, the context is treated as hung and HROT
/// triggers immediately (bypassing the normal soft/hard timeout sequence).
pub struct AccelSubmissionDeadline {
    /// Absolute time (monotonic clock, nanoseconds since boot) by which
    /// this submission must complete. None = use context-level HROT config.
    pub deadline_ns: Option<u64>,
}

The per-submission deadline interacts with the context-level watchdog as follows:

  • If deadline_ns is set and earlier than the context-level soft timeout, the watchdog fires at deadline_ns instead.
  • If deadline_ns is set but later than the context-level hard timeout, the context-level hard timeout takes precedence (the kernel never allows a single submission to extend the overall HROT window).
  • Per-submission deadlines do not affect other submissions on the same context. Each submission's watchdog is independent.
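The deadline-selection rules above reduce to a small pure computation. A minimal sketch, with illustrative names (the actual timer plumbing lives in the AccelScheduler):

```rust
/// Compute the (soft, hard) watchdog fire times for one submission.
/// `soft_ms`/`hard_ms` come from the context-level HROT config;
/// `deadline_ns` from AccelSubmissionDeadline.
fn watchdog_fire_times(
    now_ns: u64,
    soft_ms: u64,
    hard_ms: u64,
    deadline_ns: Option<u64>,
) -> (u64, u64) {
    let ctx_soft = now_ns + soft_ms * 1_000_000;
    let ctx_hard = now_ns + hard_ms * 1_000_000;
    let soft = match deadline_ns {
        // An earlier per-submission deadline pulls the first trigger in.
        Some(d) => d.min(ctx_soft),
        None => ctx_soft,
    };
    // The hard timeout is never extended by a per-submission deadline.
    (soft, ctx_hard)
}
```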

21.3.7.5 Hardware Capability Detection

The AccelBase KABI requires drivers to advertise their reset and preemption capabilities so the kernel can select the appropriate HROT action automatically. This is reported via a new capabilities structure returned by get_info():

/// HROT-related device capabilities, reported by the driver at
/// registration time via get_info().
#[repr(C)]
pub struct AccelDeviceHrotCaps {
    /// Device supports per-context reset without affecting other contexts.
    /// If false, KillContextAndReset automatically falls back to ResetDevice.
    /// u8 instead of bool for stable repr(C) ABI (0 = false, 1 = true).
    pub supports_context_reset: u8,

    /// Device supports driver-level soft preemption of running kernels
    /// (via preempt_context). If false, the soft timeout phase is skipped
    /// and HROT proceeds directly to the hard timeout.
    /// u8 instead of bool for stable repr(C) ABI (0 = false, 1 = true).
    pub supports_preemption: u8,

    /// Minimum reset latency in microseconds. The device is unavailable
    /// for new submissions during this time. The AccelScheduler uses this
    /// to estimate recovery time when deciding whether to migrate contexts
    /// to another device vs. waiting for the reset to complete.
    /// 0 = unknown (kernel assumes 500ms as conservative default).
    pub reset_latency_us: u32,

    /// Maximum number of consecutive resets before the kernel marks the
    /// device as permanently faulted (via FMA). Prevents reset storms
    /// caused by hardware defects.
    /// 0 = no limit (not recommended; default: 5).
    pub max_consecutive_resets: u32,

    /// Explicit padding for repr(C) stability.
    pub _pad: [u8; 20],
}

The kernel uses these capabilities to make automatic decisions:

| supports_context_reset | supports_preemption | Kernel behavior |
| --- | --- | --- |
| true | true | Full HROT: soft preempt → context reset → device reset (escalation) |
| true | false | Skip soft timeout, go directly to hard timeout → context reset |
| false | true | Soft preempt → full device reset on hard timeout |
| false | false | Hard timeout → full device reset (worst case: affects all contexts) |
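This capability matrix maps directly onto a selection function. A sketch with illustrative names (the enum is not part of the KABI):

```rust
/// Escalation plan derived from AccelDeviceHrotCaps.
#[derive(Debug, PartialEq)]
enum HrotPlan {
    SoftPreemptThenContextReset,
    ContextResetOnly,
    SoftPreemptThenDeviceReset,
    DeviceResetOnly,
}

fn select_hrot_plan(supports_context_reset: bool, supports_preemption: bool) -> HrotPlan {
    match (supports_context_reset, supports_preemption) {
        (true, true) => HrotPlan::SoftPreemptThenContextReset,
        (true, false) => HrotPlan::ContextResetOnly,
        (false, true) => HrotPlan::SoftPreemptThenDeviceReset,
        (false, false) => HrotPlan::DeviceResetOnly,
    }
}
```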

21.3.7.6 Interaction with Live Migration

If a VM with a GPU context is being live-migrated (Section 17.1), the HROT watchdog must account for the checkpoint/restore window during which the device is quiesced but not hung. Without adjustment, a migration that takes longer than hrot.hard_timeout_ms would trigger a false reset.

Rules:

  • When a live migration begins for a VM that owns accelerator contexts, the AccelScheduler enters migration hold for those contexts. The HROT watchdog is suspended (timers are paused, not cancelled).
  • The migration hold extends the effective hard timeout by the migration duration, up to a maximum extension of hrot.hard_timeout_ms (i.e., the total allowed time is at most 2 * hrot.hard_timeout_ms). This prevents indefinite watchdog suspension in the case of a migration that itself hangs.
  • If the migration completes successfully, the watchdog resumes with the remaining time from before the hold.
  • If the migration fails or is cancelled, the watchdog resumes immediately. Any time already elapsed before the hold still counts toward the timeout.
  • The migration hold is logged via FMA so administrators can correlate HROT events with migration activity.
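The capped-extension rule is a one-line deadline adjustment. A sketch, assuming the AccelScheduler measures the hold duration:

```rust
/// New hard-timeout deadline after a migration hold completes.
/// The extension is capped at one hard-timeout interval, so the total
/// allowed time is at most 2 * hrot.hard_timeout_ms.
fn deadline_after_migration(
    original_hard_deadline_ns: u64,
    migration_duration_ns: u64,
    hard_timeout_ns: u64,
) -> u64 {
    original_hard_deadline_ns + migration_duration_ns.min(hard_timeout_ns)
}
```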

21.3.7.7 Interaction with Crash Recovery

HROT and the crash recovery path (Section 21.1.3.2) share the same device reset mechanism but are triggered by different conditions:

  • Crash recovery: triggered by a driver fault (segfault, panic, domain violation in Tier 1 isolation). The driver itself is broken.
  • HROT: triggered by a device hang (the driver is healthy but the hardware is not responding). The driver is still running and can perform the reset.

When HROT triggers a ResetDevice, it reuses the crash recovery path from step 5 onward (PCIe FLR, driver reload, vtable re-exchange). The difference is that in the HROT case, the driver is asked to perform the reset itself first (device_reset vtable call). Only if the driver's reset fails (returns an error or times out) does the kernel fall back to the full crash recovery path with driver unload/reload.

Reset storm protection: If a device triggers HROT more than AccelDeviceHrotCaps::max_consecutive_resets times within a 60-second window, the kernel marks the device as permanently faulted via FMA (Section 21.1.3.4), transitions its device registry node to Faulted, and refuses new context creation. This prevents a hardware defect from causing an infinite reset loop that degrades system performance. An administrator can manually clear the fault via umkafs (Section 19.4).
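A sketch of the reset-storm counter, assuming per-device reset timestamps are tracked in a sliding 60-second window (structure and method names are illustrative):

```rust
use std::collections::VecDeque;

/// Per-device reset-storm guard. `max_resets` corresponds to
/// AccelDeviceHrotCaps::max_consecutive_resets.
struct ResetStormGuard {
    window_ns: u64,
    max_resets: usize,
    recent: VecDeque<u64>, // timestamps of resets inside the window
}

impl ResetStormGuard {
    fn new(max_resets: usize) -> Self {
        Self {
            window_ns: 60_000_000_000, // 60-second window
            max_resets,
            recent: VecDeque::new(),
        }
    }

    /// Record a reset at `now_ns`. Returns true if the device exceeded the
    /// budget and should be marked permanently faulted via FMA.
    fn record_reset(&mut self, now_ns: u64) -> bool {
        // Evict resets that fell out of the sliding window.
        while let Some(&t) = self.recent.front() {
            if now_ns - t > self.window_ns {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        self.recent.push_back(now_ns);
        self.recent.len() > self.max_resets
    }
}
```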


21.4 In-Kernel Inference Engine

21.4.1 Rationale

The kernel makes millions of decisions per second: which page to evict, which I/O to schedule next, which task to migrate, whether a health metric is anomalous. Today these decisions use hand-tuned heuristics. Machine learning can do better for pattern-dependent decisions.

21.4.2 Constraints

In-kernel inference is not general-purpose ML. It must satisfy:

  1. Deterministic execution time: Every inference call must complete within a bounded number of cycles. No data-dependent loops, no dynamic allocation. For multi-layer models (TinyNeuralNet), inference is preemptible at layer boundaries: the inference loop checks need_resched() between layers and yields if a higher-priority task or interrupt is pending. This ensures CPU inference (which may run for ~1-20μs) does not cause scheduling latency spikes on latency-sensitive systems.
  2. No FPU or SIMD registers: Kernel inference must not use FPU or SIMD registers. The kernel does not save/restore FPU state on entry (doing so requires kernel_fpu_begin/end which holds a mutex and cannot be used in all interrupt contexts). All arithmetic uses scalar INT8/INT32 only. Rationale: The TinyNeuralNet architecture is bounded to 4 layers × 64 neurons × 64 neurons = 16K MACs. At ~4 scalar INT8 MACs/cycle on a 4 GHz core, this is ~4K cycles ≈ 1μs — feasible without SIMD. Weights fit in L1 cache (4 × 64 × 64 × 1 byte = 16KB). Architectures without SIMD (RISC-V, ARMv7 without NEON) execute at the same speed.
  3. Bounded memory: Model size is fixed at load time. No dynamic allocation during inference.
  4. Safe fallback: If the model produces nonsensical output, the system falls back to a traditional heuristic. The model is advisory, never authoritative for safety-critical decisions.
  5. Offline training: Models are trained in userspace (with full floating-point, GPU acceleration, unlimited time). Only the trained, quantized model is loaded into the kernel.

21.4.3 Supported Model Types

// umka-core/src/inference/mod.rs (kernel-internal)

pub enum KernelModelType {
    /// Decision tree / random forest.
    /// Bounded depth (max 32), bounded node count (max 64K).
    /// Inference = walk tree from root to leaf, O(depth) comparisons.
    /// Best for: classification decisions with <50 features.
    DecisionTree,

    /// Quantized lookup table.
    /// Input is quantized to N bits, output is a table lookup.
    /// O(1) inference. Best for: 1-2 dimensional functions.
    LookupTable,

    /// Quantized linear model.
    /// weights: [i16; N], bias: i32, threshold: i32.
    /// Inference = dot product + compare. O(N) multiplications.
    /// Best for: simple binary classification / regression.
    LinearModel,

    /// Quantized tiny neural network.
    /// Fixed architecture: input → hidden(s) → output.
    /// All weights INT8, activations INT8, accumulation INT32.
    /// Maximum: 4 layers, 64 neurons per layer.
    /// Weights fit in L1 cache (4 × 64 × 64 × 1 byte = 16 KB).
    /// Scalar-only arithmetic — no SIMD required or permitted in kernel context.
    /// Inference time bounded by architecture constants (~1-20μs).
    TinyNeuralNet,
}
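As an illustration of why DecisionTree inference is trivially bounded, here is a sketch of the root-to-leaf walk with an explicit depth cap. The node layout is illustrative, not the on-disk model format:

```rust
/// Illustrative tree node (indices into a flat node array).
struct TreeNode {
    feature: u16,   // input feature index to compare
    threshold: i32, // go left if input[feature] <= threshold
    left: u16,
    right: u16,
    is_leaf: bool,
    leaf_value: i32,
}

/// Root-to-leaf walk, O(depth) comparisons. Returns None if the model is
/// malformed (the load-time validator should make that impossible).
fn tree_infer(nodes: &[TreeNode], input: &[i32], max_depth: u32) -> Option<i32> {
    let mut idx = 0usize;
    for _ in 0..=max_depth { // bounded loop: at most max_depth + 1 steps
        let n = nodes.get(idx)?;
        if n.is_leaf {
            return Some(n.leaf_value);
        }
        let feature = *input.get(n.feature as usize)?;
        idx = if feature <= n.threshold { n.left as usize } else { n.right as usize };
    }
    None // depth bound exceeded
}
```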

21.4.4 Model Loading and Lifecycle

Models are trained in userspace and loaded into the kernel via a sysfs interface:

/sys/kernel/umka/inference/models/
    page_prefetch/
        model.bin       # Write: load model binary. Read: model metadata.
        active          # "1" to enable, "0" to disable, "heuristic" to fallback
        accuracy        # Read: online accuracy estimate (correct predictions / total)
        latency_ns      # Read: average inference latency
        invocations     # Read: total inference calls
        fallbacks       # Read: times fell back to heuristic
    io_scheduler/
        model.bin
        active
        accuracy
        latency_ns
        ...
    fma_anomaly/
        model.bin
        active
        ...
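Userspace loads and monitors a model by writing and reading these files. A minimal sketch of the load flow; the helper names are hypothetical and the base path is parameterized so the logic can be exercised against any directory:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Load a trained, quantized model binary and activate it. `base` is the
/// model's sysfs directory, e.g. /sys/kernel/umka/inference/models/page_prefetch.
fn load_and_activate(base: &Path, model_bytes: &[u8]) -> io::Result<()> {
    fs::write(base.join("model.bin"), model_bytes)?; // kernel validates on write
    fs::write(base.join("active"), b"1")?;           // enable inference
    Ok(())
}

/// Read the online accuracy estimate and compare against a caller-chosen
/// threshold (unparseable reads count as "below threshold" for safety).
fn accuracy_below(base: &Path, threshold: f64) -> io::Result<bool> {
    let s = fs::read_to_string(base.join("accuracy"))?;
    Ok(s.trim().parse::<f64>().map(|a| a < threshold).unwrap_or(true))
}
```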
// umka-core/src/inference/model.rs (kernel-internal)

pub struct KernelModel {
    /// Model name (matches the sysfs directory, e.g. "page_prefetch").
    /// Used in diagnostics (log_warning! in infer_safe).
    pub name: &'static str,

    /// Model type determines the inference algorithm.
    pub model_type: KernelModelType,

    /// Model parameters (weights, tree nodes, lookup table).
    /// Allocated as a single contiguous block, fixed at load time.
    pub params: &'static [u8],

    /// Input feature count.
    pub input_features: u32,

    /// Output count (1 for regression, N for N-class classification).
    pub outputs: u32,

    /// Maximum inference latency in nanoseconds (pre-computed from model size).
    pub max_latency_ns: u64,

    /// Whether this model is currently active.
    pub active: AtomicBool,

    /// Whether hardware offload is enabled (Section 21.4.6). Cleared after
    /// repeated hardware timeouts.
    pub hw_offload_enabled: AtomicBool,

    /// Consecutive hardware-offload timeout count (Section 21.4.6).
    pub hw_timeout_count: AtomicU32,

    /// Online accuracy tracking.
    pub stats: ModelStats,
}

pub struct ModelStats {
    pub total_invocations: AtomicU64,
    pub correct_predictions: AtomicU64,  // When ground truth is available
    pub total_latency_ns: AtomicU64,
    pub fallback_count: AtomicU64,
}

run_inference method specification (H2 fix):

// umka-core/src/inference/model.rs (kernel-internal)

/// Result of an inference operation.
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum InferenceResult {
    /// Inference completed successfully.
    Success = 0,
    /// Output validation failed (NaN, Inf, or out-of-range values).
    /// Caller should use fallback heuristic.
    OutputInvalid = 1,
    /// Execution exceeded max_latency_ns (should not happen with
    /// load-time validation, but possible on preempted execution).
    Timeout = 2,
}

impl KernelModel {
    /// Run inference on the given input, writing output to the provided buffer.
    ///
    /// **Parameters:**
    /// - `input`: Slice of input features (length must equal `self.input_features`).
    /// - `output`: Mutable slice for output values (length must equal `self.outputs`).
    /// - `yielded`: Out-parameter set to `true` if the scheduler yielded
    ///   during execution (TinyNeuralNet only), `false` otherwise.
    ///
    /// **Returns:**
    /// - `InferenceResult::Success` on successful completion.
    /// - `InferenceResult::OutputInvalid` if output contains invalid values.
    /// - `InferenceResult::Timeout` if execution exceeded `max_latency_ns`.
    ///
    /// **Behavior by model type:**
    ///
    /// - `DecisionTree`: Walk tree from root to leaf using input features as
    ///   comparison values. O(depth) comparisons. Maximum depth is 32.
    ///   Does not check `need_resched()` — completes in <200ns.
    ///
    /// - `LookupTable`: Quantize input to N bits, use as table index.
    ///   O(1) table lookup. Does not check `need_resched()`.
    ///
    /// - `LinearModel`: Compute dot product of input and weights, add bias,
    ///   compare to threshold. O(N) multiplications where N = input_features.
    ///   Uses INT16 arithmetic with INT32 accumulator. Does not check
    ///   `need_resched()` — completes in <500ns for typical N < 256.
    ///
    /// - `TinyNeuralNet`: Forward propagation through layers.
    ///   - All weights and activations are INT8; accumulation is INT32.
    ///   - Maximum 4 layers, 64 neurons per layer.
    ///   - Between each layer, checks `need_resched()` and yields if
    ///     a higher-priority task is pending. Sets `*yielded = true` if yielded.
    ///   - Total compute: ~16K MACs for max architecture (4×64×64).
    ///   - Scalar INT8 arithmetic only (no SIMD): ~1-20μs typical.
    ///   - Weights (16 KB max) fit in L1 cache on all supported architectures.
    ///
    /// **Output validation:**
    /// After inference completes, the output is validated:
    /// - For classification: all outputs must be in range [0, num_classes).
    /// - For regression: outputs must be finite (no NaN/Inf).
    /// - If validation fails, returns `OutputInvalid` and zeroes the output.
    ///
    /// **Security note:**
    /// The model parameters are read-only and validated at load time.
    /// Inference cannot modify kernel state beyond updating `stats`.
    pub fn run_inference(
        &self,
        input: &[i32],
        output: &mut [i32],
        yielded: &mut bool,
    ) -> InferenceResult {
        // Validate input/output lengths
        if input.len() != self.input_features as usize
            || output.len() != self.outputs as usize
        {
            return InferenceResult::OutputInvalid;
        }

        *yielded = false;

        let result = match self.model_type {
            KernelModelType::DecisionTree => {
                self.run_decision_tree(input, output)
            }
            KernelModelType::LookupTable => {
                self.run_lookup_table(input, output)
            }
            KernelModelType::LinearModel => {
                self.run_linear_model(input, output)
            }
            KernelModelType::TinyNeuralNet => {
                self.run_neural_net(input, output, yielded)
            }
        };

        // Validate output (defense-in-depth)
        if result == InferenceResult::Success {
            if !self.validate_output(output) {
                output.fill(0);
                self.stats.fallback_count.fetch_add(1, Relaxed);
                return InferenceResult::OutputInvalid;
            }
        }

        result
    }

    /// Validate output values are within expected range.
    fn validate_output(&self, output: &[i32]) -> bool {
        for &val in output {
            // Saturated sentinel values indicate accumulator overflow —
            // the integer analogue of a NaN/Inf check.
            if val == i32::MIN || val == i32::MAX {
                return false;
            }
        }
        true
    }
}
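run_neural_net is referenced above but not shown. The following self-contained sketch shows the forward pass (scalar INT8 MACs, INT32 accumulation, ReLU, and a yield check at each layer boundary); the requantization shift and the scheduler hooks are stand-ins for the kernel's actual versions:

```rust
/// Forward pass for a TinyNeuralNet. `layers` holds (weights[out][in],
/// bias[out]) per layer; `need_resched` stands in for the kernel hook.
fn tnn_forward(
    layers: &[(Vec<Vec<i8>>, Vec<i32>)],
    input: &[i8],
    need_resched: impl Fn() -> bool,
    yielded: &mut bool,
) -> Vec<i8> {
    let mut act: Vec<i8> = input.to_vec();
    for (weights, bias) in layers {
        let mut next = Vec::with_capacity(weights.len());
        for (row, b) in weights.iter().zip(bias) {
            // Scalar MAC loop: INT8 x INT8 products accumulated in INT32.
            let mut acc: i32 = *b;
            for (w, a) in row.iter().zip(&act) {
                acc += (*w as i32) * (*a as i32);
            }
            // ReLU, then requantize with saturation back to INT8
            // (the >> 8 scale factor is illustrative).
            next.push((acc.max(0) >> 8).min(127) as i8);
        }
        act = next;
        // Layer-boundary preemption point.
        if need_resched() {
            *yielded = true; // the kernel would yield to the scheduler here
        }
    }
    act
}
```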

21.4.5 Use Cases

Use case 1: Learned Page Prefetching

Replace Linux's simple sequential readahead with a learned prefetcher:

Input features (per page fault):
  - Faulting virtual address (quantized to page range)
  - Previous N fault addresses (pattern detector)
  - Process ID (different processes have different patterns)
  - Time since last fault
  - Memory region type (heap, stack, mmap, file-backed)

Output:
  - Next K pages to prefetch (page offsets relative to current fault)

Model type: TinyNeuralNet (2 hidden layers, 64 neurons each, INT8)
Inference time: ~1-5μs
Benefit: 20-40% reduction in page fault rate for workloads with learnable patterns
Fallback: Standard sequential readahead (Linux default)

Ground truth for online accuracy: track whether prefetched pages are actually accessed within a time window. If accuracy drops below threshold, fall back to heuristic.
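A sketch of that ground-truth tracker, assuming prefetch and fault events carry monotonic timestamps (structure and method names are illustrative):

```rust
use std::collections::HashMap;

/// Remember each prefetched page for a window; a fault on that page within
/// the window counts as a correct prediction.
struct PrefetchTracker {
    window_ns: u64,
    pending: HashMap<u64, u64>, // page -> prefetch timestamp
    correct: u64,
    total: u64,
}

impl PrefetchTracker {
    fn new(window_ns: u64) -> Self {
        Self { window_ns, pending: HashMap::new(), correct: 0, total: 0 }
    }

    fn on_prefetch(&mut self, page: u64, now_ns: u64) {
        self.pending.insert(page, now_ns);
        self.total += 1;
    }

    fn on_access(&mut self, page: u64, now_ns: u64) {
        if let Some(t) = self.pending.remove(&page) {
            if now_ns - t <= self.window_ns {
                self.correct += 1; // prefetch was useful
            }
        }
    }

    /// Accuracy in parts-per-thousand; the caller falls back to sequential
    /// readahead when this drops below its threshold.
    fn accuracy_permille(&self) -> u64 {
        if self.total == 0 { 1000 } else { self.correct * 1000 / self.total }
    }
}
```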

Use case 2: I/O Scheduling

Optimize I/O queue ordering based on learned workload patterns:

Input features (per I/O request):
  - LBA (sector address, quantized)
  - Request size
  - Read vs write
  - Queue depth
  - Device utilization
  - Recent I/O pattern (last N requests, summarized)

Output:
  - Priority score (determines queue ordering)

Model type: DecisionTree (depth 16, ~4K nodes)
Inference time: ~200ns
Benefit: 5-15% IOPS improvement for mixed workloads
Fallback: mq-deadline heuristic

Use case 3: FMA Anomaly Detection

Detect correlated hardware degradation that threshold rules miss:

Input features (per health telemetry window):
  - ECC error rate (current window)
  - ECC error rate (historical baseline)
  - Temperature delta from baseline
  - PCIe correctable error rate
  - SMART attribute trends (multiple attributes)
  - Device age / power-on hours

Output:
  - Anomaly score (0.0 - 1.0 in fixed-point)
  - Predicted failure class (memory, storage, bus, thermal)

Model type: DecisionTree (depth 20, ~8K nodes)
Inference time: ~300ns
Benefit: Detect multi-signal degradation patterns that simple thresholds miss
Fallback: Threshold rules (Section 19.1.5)

Use case 4: Accelerator Memory Migration Policy

Decide which pages to migrate between CPU and GPU memory:

Input features (per page):
  - Access count on device (recent window)
  - Access count on CPU (recent window)
  - Time since last access
  - Page size
  - Current device memory pressure

Output:
  - Migrate to device / keep on CPU / evict from device

Model type: LinearModel (fast, simple)
Inference time: ~50ns per page
Benefit: Better migration decisions than fixed LRU
Fallback: LRU eviction

21.4.6 Safety Guarantees

/// Maximum number of operations per inference batch submission, and the
/// upper bound on a loaded model's statically computed operation count
/// (see the load-time validation rules below).
/// Batches exceeding this limit are rejected with EINVAL.
/// Rationale: prevents a single tenant from monopolizing the inference
/// queue with an unbounded batch, ensuring fair scheduling across
/// multiple inference contexts (Section 21.3 CBS bandwidth servers).
pub const INFERENCE_MAX_OPS_PER_BATCH: u32 = 65536;

The primary safety mechanism is mandatory structural validation at model load time (Section 21.4.9). The load-time validator statically proves that every accepted model terminates in bounded time before it is ever invoked:

  • Decision trees: Verify tree depth <= max_depth and that the tree is acyclic (DAG check). A tree of bounded depth with no cycles has a statically known maximum number of comparisons per inference.
  • Linear models: Verify the number of input features and outputs matches the declared dimensions. Inference is a single matrix-vector multiply with bounded operations.
  • Neural networks: Verify the layer count is fixed (no recurrent connections that could loop), and that the total multiply-accumulate (MAC) operations <= INFERENCE_MAX_OPS_PER_BATCH. A feedforward network with bounded layers and bounded dimensions has a statically known operation count.

Any model that cannot be statically proven to terminate in bounded time is rejected at load time and never reaches the inference path. This is the preemptive guarantee.
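A sketch of the TinyNeuralNet branch of the load-time validator, restating the architecture constants for self-containment (the function name and error strings are illustrative):

```rust
const INFERENCE_MAX_OPS_PER_BATCH: u32 = 65536;
const MAX_LAYERS: usize = 4;
const MAX_NEURONS: usize = 64;

/// Validate a feedforward network structurally: bounded layer count,
/// bounded layer width, and a statically computed MAC budget. Returns the
/// total MAC count on success; models failing any check are rejected
/// before they can ever be invoked.
fn validate_tnn(layer_dims: &[(usize, usize)]) -> Result<u32, &'static str> {
    if layer_dims.is_empty() || layer_dims.len() > MAX_LAYERS {
        return Err("layer count out of bounds");
    }
    let mut macs: u32 = 0;
    for &(inputs, outputs) in layer_dims {
        if inputs == 0 || outputs == 0 || inputs > MAX_NEURONS || outputs > MAX_NEURONS {
            return Err("layer dimension out of bounds");
        }
        macs = macs
            .checked_add((inputs * outputs) as u32)
            .ok_or("MAC count overflow")?;
    }
    if macs > INFERENCE_MAX_OPS_PER_BATCH {
        return Err("MAC budget exceeded");
    }
    Ok(macs)
}
```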

The post-hoc cycle check in infer_safe below remains as a defense-in-depth assertion — if the load-time validator has a bug and admits a model that runs longer than expected, the cycle check catches it and disables the model. But the cycle check is not the primary safety mechanism, because it measures cycles after run_inference completes; it cannot interrupt an infinite loop.

// umka-core/src/inference/safety.rs

/// Every model invocation goes through this wrapper.
/// The primary termination guarantee comes from load-time structural
/// validation (Section 21.4.9). This wrapper provides defense-in-depth:
/// post-hoc cycle checking, output validation, and fallback.
pub fn infer_safe<const MAX_NS: u64>(
    model: &KernelModel,
    input: &[i32],
    output: &mut [i32],
    fallback: impl FnOnce(&[i32], &mut [i32]),
) {
    // 1. Check model is active
    if !model.active.load(Ordering::Acquire) {
        fallback(input, output);
        model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
        return;
    }

    // 2. Run inference with post-hoc cycle measurement.
    //    Termination is guaranteed by load-time structural validation
    //    (bounded tree depth / bounded layer count / bounded MAC ops).
    //    The cycle check is defense-in-depth only.
    //
    //    For TinyNeuralNet models, run_inference() checks need_resched()
    //    between layers. If a higher-priority task is pending, it yields
    //    and resumes after rescheduling. This adds ~5-20ns per layer
    //    boundary but prevents CPU inference from causing scheduling
    //    latency spikes. DecisionTree and LinearModel complete in <200ns
    //    and do not need preemption checks.
    //
    //    Because run_inference() may yield to the scheduler (via
    //    need_resched()), wall-clock TSC cycles include time spent
    //    sleeping/waiting — not actual model execution time. On a busy
    //    system, scheduler latency during a yield could push the
    //    wall-clock measurement far beyond MAX_NS even though the model
    //    executed correctly within its budget. To avoid false positives,
    //    we track whether a yield occurred during inference and skip the
    //    post-hoc cycle check in that case. The load-time structural
    //    validator (Section 21.4.9) is the primary termination guarantee;
    //    the cycle check is defense-in-depth that is only meaningful
    //    when the model ran without interruption.
    let mut yielded = false;
    let start = arch::current::cpu::read_cycle_counter();
    let result = model.run_inference(input, output, &mut yielded);
    let elapsed = arch::current::cpu::read_cycle_counter() - start;

    // 3. Defense-in-depth: check execution time was within bounds.
    //    This should never fire if the load-time validator is correct.
    //    If it does fire, it means the validator has a bug — disable
    //    the model and fall back to the heuristic.
    //    Skip the check if a yield occurred during inference, because
    //    the elapsed TSC cycles include scheduler/sleep time that is
    //    not attributable to the model. The load-time validator remains
    //    the primary safety mechanism regardless.
    if !yielded && elapsed > tsc_cycles_from_ns(MAX_NS) {
        model.active.store(false, Ordering::Release);
        fallback(input, output);
        log_warning!("inference model {} exceeded time bound (validator bug?), disabled", model.name);
        return;
    }

    // 4. Sanity-check output (model-specific validation)
    if !model.validate_output(output) {
        fallback(input, output);
        model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
        return;
    }

    // 5. Update statistics
    model.stats.total_invocations.fetch_add(1, Ordering::Relaxed);
    model.stats.total_latency_ns.fetch_add(
        tsc_ns_from_cycles(elapsed), Ordering::Relaxed
    );
}

Execution time bounding: Static analysis (infer_safe) cannot prove termination for arbitrary GPU kernels (this is equivalent to the halting problem). Instead, UmkaOS uses a hardware watchdog approach:

  • Every accelerator dispatch includes a max_execution_us: u64 timeout (default: 5,000,000 μs = 5 seconds for compute kernels, 16,667 μs = 1/60th second for graphics).
  • The driver programs the GPU's hardware watchdog timer (GFX_TIMEOUT on AMD, TDR on NVIDIA via VGPU, or a software timer for accelerators without a hardware watchdog) before submitting the kernel.
  • If the kernel exceeds max_execution_us, the GPU engine is reset (per-engine reset, not full GPU reset, where hardware supports it) and the dispatch returns ETIMEDOUT to the submitter. Other engines and other contexts are not affected.
  • infer_safe provides a BEST-EFFORT execution time estimate when possible (e.g., bounded loop counts × estimated per-iteration cost), which is used to set a tighter watchdog timeout. Kernels that infer_safe cannot analyze use the default timeout.

Hardware-accelerated inference timeout (NPU/DSP offload):

The infer_safe wrapper above covers CPU-side inference only. When a TinyNeuralNet model is offloaded to a hardware accelerator (e.g., an NPU or DSP via the AccelBase KABI), the post-hoc cycle counter cannot bound execution: once a command buffer is submitted to the device, the CPU cannot observe or interrupt mid-execution (for non-preemptible devices, see Section 21.1.2.4).

For hardware-offloaded inference, the scheduler enforces a hardware command timeout using the max_execution_us field in AccelContextLimits (Section 21.1.2.3). The inference subsystem creates a dedicated AccelContext for each model with a tight timeout derived from KernelModel::max_latency_ns:

// umka-core/src/inference/hw_offload.rs (kernel-internal)

/// Number of consecutive hardware timeouts before the accelerator
/// driver disables the device and reports a permanent fault to FMA.
pub const HW_TIMEOUT_DISABLE_THRESHOLD: u32 = 16;

/// Multiplier applied to `model.max_latency_ns` to derive the hardware
/// command timeout. Allows for device load variance while still bounding
/// worst-case lock-up time to 4x the declared model latency.
pub const HW_TIMEOUT_MULTIPLIER: u64 = 4;

/// Submit an inference request to a hardware accelerator with a hard timeout.
/// If the device does not complete within `model.max_latency_ns * HW_TIMEOUT_MULTIPLIER`,
/// the AccelScheduler cancels the submission and `infer_hw` falls back to the CPU path.
pub fn infer_hw(
    model: &KernelModel,
    ctx: &AccelContextHandle,
    input: &[i32],
    output: &mut [i32],
    fallback: impl FnOnce(&[i32], &mut [i32]),
) {
    // 1. Submit inference command buffer to hardware.
    //    AccelContextLimits::max_execution_us is set to
    //    (model.max_latency_ns * HW_TIMEOUT_MULTIPLIER) / 1000 at context creation.
    //    The AccelScheduler arms a kernel timer on submission; if the device does not
    //    complete before the timer fires, it calls preempt_context() or, for
    //    non-preemptible devices, marks the submission Timeout at the next boundary.
    let submit_result = accel_submit_inference(ctx, model, input);
    if submit_result.is_err() {
        fallback(input, output);
        model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
        return;
    }

    // 2. Wait for completion with the same timeout bound.
    //    poll_completion() returns Timeout if max_execution_us elapsed.
    match accel_poll_completion(ctx, model.max_latency_ns * HW_TIMEOUT_MULTIPLIER) {
        AccelCompletionStatus::Success => { /* copy results from device buffer to output */ }
        AccelCompletionStatus::Timeout | AccelCompletionStatus::Error => {
            // Device did not complete in time — fall back to CPU heuristic.
            // Disable hardware offload for this model after repeated timeouts.
            model.hw_timeout_count.fetch_add(1, Ordering::Relaxed);
            if model.hw_timeout_count.load(Ordering::Relaxed) > HW_TIMEOUT_DISABLE_THRESHOLD {
                model.hw_offload_enabled.store(false, Ordering::Release);
                log_warning!("inference model {} exceeded hw timeout {} times, disabling hw offload",
                    model.name, HW_TIMEOUT_DISABLE_THRESHOLD);
            }
            fallback(input, output);
            model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
        }
        AccelCompletionStatus::Preempted => {
            // Higher-priority work displaced the inference — fall back to CPU.
            fallback(input, output);
            model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
        }
    }
}

The AccelContext for in-kernel inference is created at model load time with:

  • max_execution_us = (model.max_latency_ns * HW_TIMEOUT_MULTIPLIER) / 1000
  • priority = AccelPriority::Background (inference is advisory; it never starves user workloads)
  • max_memory_bytes = model parameter size + a fixed scratch buffer (no dynamic allocation)
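Concretely, the limit derivation might be sketched as follows (the struct, helper name, and scratch-buffer size are illustrative, not the actual AccelContextLimits API):

```rust
/// Illustrative limits derived at model load time, per the rules above.
pub struct InferenceCtxLimits {
    pub max_execution_us: u64,
    pub max_memory_bytes: u64,
}

/// Hypothetical fixed scratch size; the text specifies "a fixed scratch
/// buffer" without giving a number.
pub const SCRATCH_BYTES: u64 = 64 * 1024;

pub fn inference_ctx_limits(max_latency_ns: u64, param_bytes: u64) -> InferenceCtxLimits {
    InferenceCtxLimits {
        // max_execution_us = (max_latency_ns * HW_TIMEOUT_MULTIPLIER) / 1000,
        // with HW_TIMEOUT_MULTIPLIER = 4 as defined earlier
        max_execution_us: max_latency_ns.saturating_mul(4) / 1000,
        // params + fixed scratch, no dynamic allocation
        max_memory_bytes: param_bytes + SCRATCH_BYTES,
    }
}
```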

This ensures that a misbehaving model or faulty NPU firmware cannot lock the accelerator indefinitely. The kernel timer in the AccelScheduler provides the hard bound; the CPU fallback path ensures the kernel continues operating correctly even if the hardware accelerator is unresponsive.

21.4.7 Adversarial Robustness

Section 21.4.6 addresses runtime safety (cycle budgets, fallback, output clamping). This section addresses a different threat: adversarial inputs — workload patterns deliberately crafted to exploit learned kernel models.

Threat model — an unprivileged attacker crafts memory access patterns, I/O sequences, or scheduling behavior designed to:

  1. Degrade performance: trick the page prefetcher into evicting hot pages, or trick the I/O scheduler into making sub-optimal ordering decisions.
  2. Denial of service: cause the model to consistently produce worst-case outputs, degrading system throughput for co-tenants on the same machine.
  3. Information leakage: infer information about other processes' behavior by observing how the model's decisions change in response to probing inputs.

Why kernel models are partially resistant — unlike ML models in adversarial ML research (image classifiers, NLP), kernel models operate on aggregate statistics (page fault rate over the last 1000 faults, I/O queue depth histogram), not on raw inputs from a single source. An attacker controls only their own process's behavior, which is one signal among many feeding into the model.

Mitigations:

  1. Per-process model state isolation: page prefetch and I/O scheduling models maintain per-process (per-cgroup) input features. Attacker process A's access pattern cannot directly influence model decisions for victim process B. The model sees A's features and B's features independently.

  2. Output clamping (already in Section 21.4.6): model output is clamped to a safe range. Even the worst possible model output (prefetch the maximally wrong pages, schedule I/O in the worst possible order) produces bounded degradation — the system falls back to LRU/FIFO behavior, which is the same as having no model at all. The adversary's best attack reduces performance to the no-model baseline, not below it.

  3. Anomaly detection on model inputs: the infer_safe wrapper tracks input feature distributions. If input features drift significantly from the training distribution (measured by simple range checks and mean/variance tracking), the model is automatically disabled for that process/cgroup and the heuristic fallback activates. This prevents an attacker from driving the model into an untrained region of the input space.

  4. Model-decision audit tracing: all model decisions are available via stable tracepoints (Section 19.2). Security-sensitive deployments can monitor for anomalous model behavior (e.g., prefetch hit rate dropping below a threshold for a specific cgroup) and trigger investigation.

  5. Side-channel hardening: the model's decision is not directly observable by unprivileged processes. An attacker cannot call "what did the prefetcher decide?" — they can only observe timing (whether a page fault occurred or not). This is equivalent to the existing cache side-channel problem, not a new attack surface. Standard cache partitioning mitigations (CAT, MBA) apply equally to model-influenced cache behavior.
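Mitigation 3's input-distribution check can be sketched with an integer-only per-feature tracker (struct and thresholds are illustrative; the kernel avoids floating point):

```rust
/// Integer-only running range/mean tracker for one input feature,
/// sketching the drift check described in mitigation 3.
pub struct FeatureTracker {
    pub min_seen: i32,
    pub max_seen: i32,
    pub sum: i64,
    pub count: u64,
}

impl FeatureTracker {
    pub fn new() -> Self {
        FeatureTracker { min_seen: i32::MAX, max_seen: i32::MIN, sum: 0, count: 0 }
    }

    pub fn observe(&mut self, v: i32) {
        self.min_seen = self.min_seen.min(v);
        self.max_seen = self.max_seen.max(v);
        self.sum += v as i64;
        self.count += 1;
    }

    /// True if the observed mean or range has drifted outside the
    /// training range declared for this feature (no floating point).
    pub fn drifted(&self, train_lo: i32, train_hi: i32) -> bool {
        if self.count == 0 {
            return false;
        }
        let mean = (self.sum / self.count as i64) as i32;
        mean < train_lo || mean > train_hi
            || self.min_seen < train_lo || self.max_seen > train_hi
    }
}
```

When `drifted` returns true, the model is disabled for that cgroup and the heuristic fallback activates, as described above.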

Accepted risk — a sophisticated attacker with co-tenant access can degrade their own performance or (with difficulty) slightly degrade co-tenant prefetch accuracy, but cannot cause worse-than-baseline behavior (output clamping), cannot crash the kernel (cycle watchdog), and cannot corrupt other processes' data (model outputs are advisory — they influence which pages to prefetch, not which pages are accessible). This risk profile is comparable to existing cache pollution attacks, which are accepted in shared environments.

21.4.8 Fallback Mode Safety Specification

When in-kernel inference encounters errors, timeouts, or hardware failures, the system must transition to a safe fallback mode without compromising system integrity. This section specifies the safety guarantees that MUST be maintained even when fallback is active.

21.4.8.1 Fallback Trigger Conditions

Fallback mode is triggered by any of the following conditions:

| Condition | Scope | Automatic Recovery |
|---|---|---|
| Model disabled (active=false) | Per-model | Yes, when admin re-enables |
| Cycle budget exceeded | Per-model | No, requires admin review |
| Hardware offload timeout (NPU/DSP) | Per-model + per-device | Yes, after cooldown period |
| Output validation failure | Per-model, per-invocation | Yes, model auto-disables after threshold |
| Input distribution anomaly | Per-cgroup, per-model | Yes, after distribution normalizes |
| Accelerator device failure | Per-device | No, requires device reset |
| Model binary signature failure | Per-model | No, requires valid signed model |

Critical invariant: Fallback mode NEVER disables safety-critical kernel checks. The inference engine provides advisory decisions only (page prefetch hints, I/O ordering hints). The following safety guarantees remain active unconditionally:

  • Memory access validation (page tables, permission checks)
  • Capability checks for all resource accesses
  • I/O command validation (DMA bounds, register ranges)
  • Interrupt masking and priority enforcement
  • Scheduler fairness and preemption guarantees
  • Watchdog timers for all hardware operations

21.4.8.2 Fallback Scope and Isolation

Fallback is NEVER system-wide. The isolation granularity is:

  1. Per-model: Each inference model (page_prefetch, io_scheduler, etc.) operates independently. A failure in one model does not affect others.

  2. Per-cgroup for input anomaly: If a specific cgroup's input features drift outside the training distribution, fallback activates for that cgroup only. Other cgroups continue using the model normally.

  3. Per-device for hardware offload: If NPU A times out, models offloaded to NPU A fall back to CPU inference. NPU B continues operating normally.

// umka-core/src/inference/fallback.rs (kernel-internal)

/// Fallback state for a single model instance.
/// This is tracked independently per (model, cgroup) pair.
pub struct FallbackState {
    /// Why fallback is active (None = not in fallback)
    pub reason: Option<FallbackReason>,

    /// Timestamp when fallback mode was entered (monotonic clock)
    pub entered_at: Option<u64>,

    /// Number of consecutive fallback invocations
    pub consecutive_count: u64,

    /// Maximum consecutive fallbacks before escalation
    pub escalation_threshold: u64,
}

/// Reasons for entering fallback mode
pub enum FallbackReason {
    /// Model explicitly disabled by admin
    ModelDisabled,

    /// Cycle budget exceeded (validator bug suspected)
    CycleBudgetExceeded,

    /// Hardware accelerator timeout
    HwTimeout { device_id: u32, timeout_us: u64 },

    /// Output failed model-specific validation
    OutputValidationFailed,

    /// Input features outside training distribution
    InputDistributionAnomaly {
        feature_index: u32,
        observed_value: i32,
        expected_range: (i32, i32),
    },

    /// Accelerator device reported error
    DeviceError { device_id: u32, error_code: u32 },
}

21.4.8.3 Fallback Duration and Escalation

Fallback mode has bounded duration with automatic escalation:

| Duration | Action |
|---|---|
| 0-60 seconds | Automatic recovery: continue using heuristic fallback, periodically retry model |
| 60-300 seconds | Escalation: emit kernel warning event, log to audit trail |
| 300+ seconds | Critical escalation: emit high-priority alert to system management daemon |
| 3600+ seconds (configurable) | Model auto-disables: set active=false, require admin to re-enable |

/// Fallback duration thresholds (configurable via sysfs)
pub const FALLBACK_RETRY_INTERVAL_SECS: u64 = 10;    // Retry model every 10s
pub const FALLBACK_WARNING_SECS: u64 = 60;           // Log warning after 60s
pub const FALLBACK_ESCALATION_SECS: u64 = 300;       // Alert after 5 minutes
pub const FALLBACK_AUTO_DISABLE_SECS: u64 = 3600;    // Disable model after 1 hour
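Mapping time-in-fallback to an escalation action is a straightforward threshold check; a minimal sketch (the enum and function names are illustrative):

```rust
pub const FALLBACK_WARNING_SECS: u64 = 60;
pub const FALLBACK_ESCALATION_SECS: u64 = 300;
pub const FALLBACK_AUTO_DISABLE_SECS: u64 = 3600;

#[derive(Debug, PartialEq)]
pub enum EscalationLevel {
    Retry,       // keep retrying silently
    Warning,     // log kernel warning
    Critical,    // alert system management daemon
    AutoDisable, // set active=false, require admin
}

/// Map seconds spent in fallback to the escalation level from the
/// duration table above.
pub fn escalation_level(fallback_secs: u64) -> EscalationLevel {
    if fallback_secs >= FALLBACK_AUTO_DISABLE_SECS {
        EscalationLevel::AutoDisable
    } else if fallback_secs >= FALLBACK_ESCALATION_SECS {
        EscalationLevel::Critical
    } else if fallback_secs >= FALLBACK_WARNING_SECS {
        EscalationLevel::Warning
    } else {
        EscalationLevel::Retry
    }
}
```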

Recovery attempts: While in fallback mode, the kernel periodically attempts to use the model again (every FALLBACK_RETRY_INTERVAL_SECS). If the model succeeds three consecutive times, fallback mode exits and normal operation resumes. This handles transient hardware glitches without operator intervention.
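The three-consecutive-successes rule can be sketched as a small counter (names are illustrative; a failed retry resets the streak):

```rust
/// Recovery gate: exit fallback only after three consecutive
/// successful model retries, per the text above.
pub const RECOVERY_SUCCESS_STREAK: u32 = 3;

pub struct RecoveryGate {
    streak: u32,
}

impl RecoveryGate {
    pub fn new() -> Self {
        RecoveryGate { streak: 0 }
    }

    /// Record one retry result; returns true when fallback may be exited.
    pub fn record_retry(&mut self, success: bool) -> bool {
        if success {
            self.streak += 1;
        } else {
            self.streak = 0; // any failure restarts the streak
        }
        self.streak >= RECOVERY_SUCCESS_STREAK
    }
}
```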

Escalation path:

Fallback entered
    ↓
[Retry every 10s, up to 60s]
    ↓ (still failing)
Log warning: "Model X in fallback for 60s, reason=Y"
    ↓
[Continue retries, up to 300s]
    ↓ (still failing)
Emit event: umka_inference_fallback(model="X", duration_s=300, reason="Y")
    ↓
[Continue retries, up to 3600s]
    ↓ (still failing)
Auto-disable model, emit critical event
    ↓
Require admin to re-enable via sysfs write

21.4.8.4 Admin Intervention Requirements

Certain fallback conditions require explicit admin action to exit fallback mode:

Requires admin intervention:

  • Model disabled due to cycle budget exceeded (indicates a validator bug)
  • Accelerator device failure (requires device reset or replacement)
  • Model binary signature verification failure

Automatic recovery allowed:

  • Hardware timeout (transient NPU load)
  • Input distribution anomaly (workload shifted temporarily)
  • Output validation failure (if under threshold)

The sysfs interface exposes recovery control:

/sys/kernel/umka/inference/models/page_prefetch/
    fallback_reason      # Read: current fallback reason, empty if not in fallback
    fallback_duration_ms # Read: milliseconds since fallback entered
    fallback_auto_recover  # Write: "1" to attempt immediate recovery, "0" to stay in fallback
    require_admin_reset  # Read: "1" if admin action required to exit fallback

To exit a fallback state requiring admin intervention:

# Check current state
cat /sys/kernel/umka/inference/models/page_prefetch/fallback_reason
# Output: "cycle_budget_exceeded"

# This condition requires admin review
cat /sys/kernel/umka/inference/models/page_prefetch/require_admin_reset
# Output: "1"

# After reviewing logs and validating the model is correct, the admin resets:
echo 1 > /sys/kernel/umka/inference/models/page_prefetch/active

21.4.8.5 Audit Logging

All fallback state transitions are logged to the kernel audit subsystem:

/// Audit event emitted on fallback state changes
pub struct FallbackAuditEvent {
    /// Timestamp (monotonic nanoseconds since boot)
    pub timestamp_ns: u64,

    /// Model identifier (e.g., "page_prefetch")
    pub model_name: &'static str,

    /// Cgroup ID (0 if fallback is model-wide)
    pub cgroup_id: u64,

    /// Device ID (0 if not device-related)
    pub device_id: u32,

    /// Event type
    pub event_type: FallbackEventType,

    /// Reason for the event
    pub reason: FallbackReason,
}

pub enum FallbackEventType {
    /// Entered fallback mode
    Entered,

    /// Automatic recovery attempt (retrying model)
    RecoveryAttempt,

    /// Successfully recovered, exiting fallback
    Recovered,

    /// Escalation: warning logged
    EscalationWarning,

    /// Escalation: critical alert emitted
    EscalationCritical,

    /// Model auto-disabled due to prolonged fallback
    AutoDisabled,

    /// Admin re-enabled model after manual review
    AdminReset,
}

Audit log entries (visible via tracefs and forwarded to system audit daemon):

[12345.678901] UmkaOS-INFERENCE: model=page_prefetch event=entered reason=hw_timeout device=4 timeout_us=2000
[12355.678901] UmkaOS-INFERENCE: model=page_prefetch event=recovery_attempt
[12355.679001] UmkaOS-INFERENCE: model=page_prefetch event=recovered
[12405.678901] UmkaOS-INFERENCE: model=io_scheduler event=entered reason=input_anomaly cgroup=1234 feature=2
[12465.678901] UmkaOS-INFERENCE: model=io_scheduler event=escalation_warning duration_s=60

Audit retention: Fallback audit events are retained for a minimum of 30 days (configurable) to support post-incident analysis. Security-sensitive deployments MAY configure longer retention or external log forwarding.

21.4.8.6 Fallback Heuristic Specification

When fallback is active, the system MUST use a well-defined heuristic that provides correct (if suboptimal) behavior:

| Model | Fallback Heuristic |
|---|---|
| Page prefetch | Sequential readahead: next 4 pages from current fault address |
| I/O scheduling | mq-deadline: order by LBA, 50ms deadline for reads, 500ms for writes |
| FMA anomaly | Static thresholds: trigger alert if any single metric exceeds hard limit |
| Memory migration | Access-count based: migrate if device_access_count > cpu_access_count × 2 |
| Power budget | Equal distribution: allocate total_power / num_domains to each domain |
Critical safety property: Fallback heuristics MUST NOT bypass any safety checks. For example, the sequential readahead fallback still validates that:

  • The target pages fall within the process's virtual address space
  • The underlying physical pages are accessible (not protected, not corrupt)
  • The readahead does not exceed the process's memory limits

The fallback heuristic is simply a decision algorithm for which pages to prefetch or which I/O to prioritize — it does NOT bypass the memory management or I/O subsystem's safety validation layers.
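The sequential-readahead fallback from the table above is simple enough to sketch directly (page size, helper name, and the use of Vec are illustrative; real kernel code would fill a fixed-size array):

```rust
pub const PAGE_SHIFT: u32 = 12; // 4 KiB pages (illustrative)
pub const FALLBACK_READAHEAD_PAGES: u64 = 4;

/// Sequential readahead fallback: the next 4 page-aligned addresses
/// after the faulting address, clipped to the end of the mapping.
/// The VMA bound check is the decision-level part of the safety
/// validation described above; permission checks happen elsewhere.
pub fn fallback_readahead(fault_addr: u64, vma_end: u64) -> Vec<u64> {
    let base = (fault_addr >> PAGE_SHIFT) << PAGE_SHIFT;
    (1..=FALLBACK_READAHEAD_PAGES)
        .map(|i| base + (i << PAGE_SHIFT))
        .filter(|&a| a < vma_end) // never prefetch past the mapping
        .collect()
}
```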

21.4.9 Model Binary Format

In-kernel models use a simple binary format for loading:

Header (4790 bytes — larger than original 256 bytes due to ML-DSA signature):
  magic:          u32    = 0x49534C45 (ASCII "ISLE")  //   4 bytes
  version:        u32    = 1                        //   4 bytes
  model_type:     u32    (KernelModelType discrim.) //   4 bytes
  input_features: u32                               //   4 bytes
  outputs:        u32                               //   4 bytes
  param_size:     u64    (bytes of parameter data)  //   8 bytes
  max_latency_ns: u64    (worst-case inference ns)  //   8 bytes
  sha256_hash:    [u8; 32]  (SHA-256 over entire model file: header fields above
                              [magic through max_latency_ns] concatenated with parameter
                              data — integrity check covering both structure and weights)
                                                    //  32 bytes
  ed25519_sig:    [u8; 64]  (Ed25519 signature over entire model — authentication)
                                                    //  64 bytes
  mldsa_sig_len:  u16       (actual ML-DSA sig length; ML-DSA-65 = 3309, ML-DSA-87 = 4627)
                                                    //   2 bytes
  mldsa_sig:      [u8; 4627] (ML-DSA signature — post-quantum authentication;
                              sized for largest variant ML-DSA-87; actual length in mldsa_sig_len)
                                                    // 4627 bytes
  _reserved:      [u8; 29]  (must be zero)          //  29 bytes
                                                    // ────────────
                                                    // Total: 4790

Parameter data: [u8; param_size]
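The byte layout above can be checked mechanically by expressing the header as a packed struct; the field sizes sum to exactly 4790 bytes:

```rust
/// The 4790-byte model header from the layout above, as a packed
/// C-compatible struct (packed because the u16 before the ML-DSA
/// signature would otherwise introduce alignment padding).
#[repr(C, packed)]
pub struct ModelHeader {
    pub magic: u32,             //    4
    pub version: u32,           //    4
    pub model_type: u32,        //    4
    pub input_features: u32,    //    4
    pub outputs: u32,           //    4
    pub param_size: u64,        //    8
    pub max_latency_ns: u64,    //    8
    pub sha256_hash: [u8; 32],  //   32
    pub ed25519_sig: [u8; 64],  //   64
    pub mldsa_sig_len: u16,     //    2
    pub mldsa_sig: [u8; 4627],  // 4627 (sized for ML-DSA-87)
    pub _reserved: [u8; 29],    //   29  (must be zero)
}                               // Total: 4790
```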

Model binaries are verified using the same hybrid signature scheme as kernel modules (Section 8.2). The SHA-256 hash covers the entire model file (header + parameters), providing integrity over both the model structure and weights — an attacker cannot modify the header to reinterpret parameter data without invalidating the hash. The Ed25519 + ML-DSA signatures over the entire model provide authentication. Both signatures must verify against the kernel's built-in model signing public keys. Unsigned or incorrectly signed models are rejected unless the system is booted with accel.allow_unsigned=1 (disabled by default, requires CAP_ACCEL_ADMIN). This prevents an attacker from loading arbitrary model binaries even if they can construct one with a matching SHA-256 hash.

Load-time validation:

  1. Verify magic bytes (0x49534C45).
  2. Check param_size against maximum allowed model size (configurable, default 1 MB).
  3. Verify SHA-256 hash of entire model file (header fields magic through max_latency_ns concatenated with parameter data) matches sha256_hash.
  4. Signature verification — verify both Ed25519 and ML-DSA signatures over the entire model file (header + parameters) against the kernel's built-in model signing public keys. The ML-DSA signature uses the length indicated by mldsa_sig_len. Reject if either signature fails (unless accel.allow_unsigned=1).
  5. Structural termination proof — the validator statically proves bounded execution based on model type:
     • Decision trees: verify tree depth <= max_depth and that the tree is acyclic (DAG check via topological sort). Reject if any cycle is found or the depth exceeds the configured maximum.
     • Linear models: verify that input/output dimensions match the declared input_features and outputs fields. A single matrix-vector multiply is inherently bounded.
     • Neural networks: verify that the layer count is fixed (no recurrent connections) and that total multiply-accumulate operations <= INFERENCE_MAX_OPS_PER_BATCH (computed from layer dimensions). Reject any model with recurrent or self-referencing layers. Any model that cannot be statically proven to terminate in bounded time is rejected. This is the primary safety mechanism — see Section 21.4.6.
  6. Compute worst-case latency from the proven model structure (tree depth, layer count, MAC operations) and reject if it exceeds the target subsystem's latency budget.
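The feed-forward MAC bound for neural networks is a simple sum over adjacent layer dimensions; a sketch with overflow-safe arithmetic (the ceiling value here is illustrative, not the configured default):

```rust
/// Illustrative ceiling; the real value is the configured
/// INFERENCE_MAX_OPS_PER_BATCH.
pub const INFERENCE_MAX_OPS_PER_BATCH: u64 = 100_000;

/// `layer_dims` = [input, hidden..., output]; dense layer i contributes
/// dims[i] * dims[i+1] multiply-accumulates. Returns None on overflow,
/// which the validator treats as a rejection.
pub fn total_macs(layer_dims: &[u64]) -> Option<u64> {
    layer_dims.windows(2).try_fold(0u64, |acc, w| {
        acc.checked_add(w[0].checked_mul(w[1])?)
    })
}

/// True if the model's total MAC count is statically within budget.
pub fn mac_bound_ok(layer_dims: &[u64]) -> bool {
    matches!(total_macs(layer_dims), Some(n) if n <= INFERENCE_MAX_OPS_PER_BATCH)
}
```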

Atomic replacement: Models are replaced via double-buffer. The new model is loaded into a shadow slot while the current model continues serving. Once the new model is fully loaded and validated, an atomic pointer swap activates it. The old model is freed after a grace period (RCU-style) to ensure no in-flight inference uses stale data.

21.4.10 Model Drift Detection and Retraining Pipeline

Models trained on one workload profile may degrade when hardware changes (new storage devices with different latency characteristics), workload shifts (database server repurposed as a build server), or system configuration changes (memory added, CPUs hotplugged). This section addresses how UmkaOS detects and responds to model drift.

Online accuracy tracking — every model maintains a correct_predictions counter (Section 21.4.4 ModelStats). For the page prefetcher, a "correct prediction" means the prefetched page was actually accessed within a configurable window (~100ms). For the I/O scheduler, it means the predicted optimal ordering actually reduced tail latency. The kernel continuously computes a rolling accuracy rate:

accuracy = correct_predictions / total_invocations (over last 60 seconds)
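Since the kernel is integer-only, the rolling rate is naturally tracked in fixed point; a sketch in per-mille (the 0.60 default threshold becomes 600‰; function names are illustrative):

```rust
/// Default accuracy threshold (0.60) expressed in per-mille.
pub const ACCURACY_THRESHOLD_MILLI: u64 = 600;

/// Rolling accuracy in integer per-mille over the current window.
pub fn accuracy_milli(correct: u64, total: u64) -> u64 {
    if total == 0 {
        return 1000; // no data yet: assume healthy, don't disable
    }
    correct.saturating_mul(1000) / total
}

/// True when the model should be disabled in favor of the heuristic.
pub fn drift_detected(correct: u64, total: u64) -> bool {
    accuracy_milli(correct, total) < ACCURACY_THRESHOLD_MILLI
}
```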

Drift detection thresholds — the model is automatically disabled (falling back to heuristic) when accuracy drops below a configurable threshold:

/sys/kernel/umka/inference/models/page_prefetch/
    accuracy              # Current rolling accuracy (read-only)
    accuracy_threshold    # Minimum acceptable accuracy (read/write, default: 0.60)
    drift_detected        # "1" if accuracy < threshold (read-only)
    auto_disable          # "1" to auto-disable on drift (read/write, default: "1")

When drift_detected transitions to 1:

  1. The model is disabled and the heuristic fallback activates immediately.
  2. A kernel event is emitted: umka_inference_drift(model="page_prefetch", accuracy=0.52).
  3. The event is visible via stable tracepoints (Section 19.2) and sysfs.

Retraining pipeline — model retraining is deliberately out-of-kernel. The kernel collects training data; userspace trains models; the kernel loads the result:

┌────────────────┐   trace data    ┌───────────────┐  model.bin   ┌────────────────┐
│ UmkaOS Kernel  │ ──────────────→ │ Userspace     │ ───────────→ │ UmkaOS Kernel  │
│ (inference)    │  sysfs/tracefs  │ Trainer       │ sysfs write  │ (inference)    │
│                │                 │ (umka-mltool) │              │                │
│ Emits:         │                 │ - Reads trace │              │ Atomic swap:   │
│ - features     │                 │ - Trains model│              │ new model      │
│ - outcomes     │                 │ - Quantizes   │              │ replaces old   │
│ - accuracy     │                 │ - Validates   │              │                │
└────────────────┘                 └───────────────┘              └────────────────┘

The umka-mltool userspace utility (shipped with UmkaOS) automates the pipeline:

  1. Reads training features from the tracefs ring buffer.
  2. Trains a decision tree or lookup table using the collected features + outcomes.
  3. Quantizes to int8/int16.
  4. Validates against a holdout set.
  5. Writes the new model to sysfs (atomic replacement).

This can run as a cron job, a systemd timer, or triggered by the drift detection event. The kernel never trains a model itself — training requires floating-point, unbounded memory, and access to training libraries, all of which belong in userspace.

Why not online learning in-kernel? — Online learning (updating model weights incrementally as new data arrives) would eliminate the retraining round-trip. However:

  • Online learning requires floating-point arithmetic (the kernel uses integer-only inference).
  • Convergence guarantees are weaker (adversarial inputs could steer the model — see Section 21.4.7).
  • Deterministic execution is impossible to guarantee with continuous weight updates.
  • The kernel's cycle budget (~500-5000 cycles per inference) leaves no room for gradient computation.

The offline-train / online-infer split is deliberate: the kernel is a fast, dumb inference engine; intelligence lives in userspace where it can be debugged, validated, and rolled back.

21.4.11 Tier 2 Inference Services

Section 21.4 defines in-kernel inference: tiny integer-only models running in the kernel's cycle budget (~500-5000 cycles) for per-decision hot paths. But many kernel optimization decisions don't need per-I/O latency — they're slow-path, strategic decisions made every few seconds or minutes. These can benefit from much more powerful models running in Tier 2 userspace drivers.

The architectural fit — Tier 2 drivers (Section 10.2) run as isolated userspace processes communicating with umka-core via ring buffers. They can:

  • Use floating-point and SIMD (no kernel FP restrictions)
  • Allocate unlimited heap memory
  • Link against full ML frameworks (ONNX Runtime, XGBoost, scikit-learn, PyTorch)
  • Access GPU/NPU accelerators via umka-accel (Section 21.1)
  • Crash without affecting the kernel (full process isolation)

This makes Tier 2 the natural home for advanced AI capabilities that exceed what the in-kernel inference engine can provide.

Tier 2 inference service model:

┌─────────────────────────────────────────────────────────┐
│ UmkaOS Core (Ring 0)                                    │
│                                                         │
│  In-kernel inference (21.4):     Tier 2 IPC client:     │
│  - Decision trees (fast)         - Ring buffer query →  │
│  - Lookup tables (fast)          - ← Ring buffer reply  │
│  - ~500-5000 cycles              - ~50-200 μs round-trip│
│  - Hot path (per-I/O)            - Warm path (periodic) │
│                                                         │
└──────────────────────┬──────────────────────────────────┘
                       │ Ring buffer IPC
                       │ (shared memory, zero-copy)
┌──────────────────────▼──────────────────────────────────┐
│ Tier 2 Inference Service (Userspace process)            │
│                                                         │
│  - Full FP, GPU access, unlimited memory                │
│  - GBM / Random Forest / small transformer models       │
│  - Online learning (incremental weight updates)         │
│  - Training data ingest from kernel ring buffer         │
│  - Model export back to in-kernel engine                │
│                                                         │
└─────────────────────────────────────────────────────────┘

Use cases for Tier 2 inference services:

| Decision | Frequency | In-kernel (Section 21.4) | Tier 2 service |
|---|---|---|---|
| Page prefetch (which pages next?) | Per fault (~μs) | Decision tree | N/A — too latency-sensitive |
| I/O scheduling (reorder queue) | Per I/O (~μs) | Lookup table | N/A — too latency-sensitive |
| NUMA rebalancing (move pages?) | Every ~10s | Too complex | GBM model: predict migration benefit from memory access histograms |
| Compression algorithm selection | Per cgroup, every ~30s | N/A | Random forest: select LZ4/Zstd/none based on cgroup's page entropy distribution |
| Swap tier selection (Section 5.1) | Every ~5s | N/A | Regression model: predict optimal local-vs-remote swap ratio from RDMA latency measurements |
| Anomaly detection (FMA, Section 19.1) | Every ~1s | N/A | Autoencoder or isolation forest: detect anomalous device behavior patterns |
| Power budget optimization (Section 6.4) | Every ~1s | N/A | RL agent: learn optimal RAPL power limits given current workload mix |
| Driver crash prediction | Every ~5s | N/A | Time-series model: predict imminent driver failure from error rate trends |

Online learning — safely in Tier 2:

The in-kernel inference engine cannot do online learning (Section 21.4.10 explains why: no floating-point, no convergence guarantees, adversarial risk). But a Tier 2 service can, because:

  1. Full FP available: Tier 2 runs in userspace with standard libm, SSE/AVX, GPU
  2. Crash-safe: if the online learning algorithm diverges or crashes, the Tier 2 process restarts. The kernel falls back to its in-kernel heuristic. No kernel impact.
  3. Adversarial resistance: the Tier 2 service can implement proper input validation, outlier detection, and gradient clipping — techniques too expensive for in-kernel but standard practice in ML engineering
  4. Auditable: the Tier 2 service's model weights, training data, and decisions can be logged, inspected, and rolled back using standard userspace debugging tools

The online learning loop:

1. Kernel emits training features to Tier 2 service via ring buffer
   (page fault patterns, I/O latencies, scheduling decisions + outcomes)
2. Tier 2 service updates model incrementally (mini-batch SGD, online RF, etc.)
3. Periodically, Tier 2 service quantizes a snapshot of its model to int8/int16
4. Tier 2 service writes the quantized model to the kernel via sysfs
   (atomic model replacement — Section 21.4.4)
5. In-kernel inference engine picks up the new model instantly
6. Kernel continues fast, deterministic, integer-only inference with fresh weights

This gives the system continuous adaptation — the in-kernel model is refreshed every few minutes with weights learned from the actual running workload — without any of the risks of in-kernel online learning. The Tier 2 service absorbs all the complexity (floating-point, convergence, validation), and the kernel sees only a stream of pre-validated, quantized model snapshots.
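Step 3 of the loop — quantizing a float snapshot for the integer-only kernel — can be sketched as symmetric int8 quantization (this runs in the Tier 2 userspace service, so f32 is fine; names and the 16.16 fixed-point scale export are illustrative):

```rust
/// Symmetric int8 quantization of a float weight snapshot.
/// Returns (scale as 16.16 fixed point, quantized weights); the kernel
/// can then dequantize with integer multiplies and shifts only.
pub fn quantize_snapshot_i8(weights: &[f32]) -> (u32, Vec<i8>) {
    // Scale so the largest-magnitude weight maps to ±127.
    let max_abs = weights.iter().fold(0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    ((scale * 65536.0).round() as u32, q)
}
```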

Tier 2 inference service KABI:

/// KABI interface for Tier 2 inference services.
/// Registered via the standard Tier 2 driver mechanism.
#[repr(C)]
pub struct InferenceServiceVTable {
    pub vtable_size: u64,
    pub version: u32,

    /// Kernel sends a query to the inference service.
    /// Returns immediately — result arrives asynchronously via ring buffer.
    ///
    /// # Safety
    /// Caller must ensure `service` is a valid context pointer and
    /// `features` points to at least `feature_count` valid i32 values.
    pub submit_query: unsafe extern "C" fn(
        service: *mut ServiceContext,
        query_type: InferenceQueryType,
        features: *const i32,
        feature_count: u32,
    ) -> KabiResult,

    /// Kernel retrieves the latest model snapshot (int8/int16 quantized).
    /// Called periodically to update the in-kernel inference engine.
    ///
    /// # Safety
    /// Caller must ensure `service` is a valid context pointer and
    /// `buffer` points to at least `buffer_len` bytes of writable memory.
    pub get_model_snapshot: unsafe extern "C" fn(
        service: *mut ServiceContext,
        model_type: KernelModelType,
        buffer: *mut u8,
        buffer_len: usize,
    ) -> KabiResult,

    /// Kernel reports ground truth (outcome of a previous prediction).
    /// Enables online accuracy tracking in the Tier 2 service.
    ///
    /// # Safety
    /// Caller must ensure `service` is a valid context pointer.
    pub report_outcome: unsafe extern "C" fn(
        service: *mut ServiceContext,
        query_id: u64,
        outcome: i32,
    ) -> KabiResult,
}

#[repr(u32)]
pub enum InferenceQueryType {
    /// Should we rebalance NUMA memory for this cgroup?
    NumaRebalance = 0,
    /// Which compression algorithm for this cgroup?
    CompressionSelect = 1,
    /// Optimal swap tier ratio (local vs remote)?
    SwapTierSelect = 2,
    /// Anomaly score for this device's health metrics?
    DeviceAnomaly = 3,
    /// Optimal power budget for current workload mix?
    PowerBudget = 4,
}

Latency budget — Tier 2 IPC round-trip is ~50-200μs (ring buffer submission + context switch + userspace inference + ring buffer reply). This is acceptable for decisions made every 1-30 seconds. For any decision needed per-I/O or per-fault, the in-kernel inference engine (Section 21.4) remains the only option.

Fallback hierarchy:

Tier 2 inference service available?
├── Yes → Use Tier 2 for slow-path decisions, in-kernel for hot-path
│         Tier 2 periodically refreshes in-kernel model weights
│
└── No  → In-kernel inference engine with static model
          ├── Model loaded? → Use model
          └── No model?     → Heuristic fallback (LRU, FIFO, static thresholds)

The system is fully functional at every level of the fallback hierarchy. Tier 2 inference services are a performance optimization, not a requirement. A system with no ML models at all behaves identically to a traditional kernel using standard heuristics.

Full framework: The Tier 2 inference services described here are the inference execution layer. The complete closed-loop framework — kernel observation bus, tunable parameter store, per-subsystem parameter catalogs, big-model integration pattern, and security model — is specified in Section 22.1.

Shipped Tier 2 services — UmkaOS ships the following reference Tier 2 inference services (optional, loaded on demand):

Service            Model type                     Purpose
umka-ml-numa       Gradient-boosted trees         NUMA page placement and migration decisions
umka-ml-compress   Random forest                  Per-cgroup compression algorithm selection
umka-ml-anomaly    Isolation forest               FMA device anomaly detection
umka-ml-power      Online RL (contextual bandit)  RAPL power budget optimization

Each is a standalone Tier 2 driver (~500-2000 LOC) that implements the InferenceServiceVTable. They are independently upgradable, independently crashable (Tier 2 process restart in ~10ms), and independently optional.


21.5 Accelerator Networking, RDMA, and Linux GPU Compatibility

21.5.1 RDMA and Collective Operations

21.5.1.1 Problem

Distributed ML training on N GPUs across M machines requires:

  • All-Reduce: After computing gradients on each GPU, average them across all GPUs. This is the single most important collective operation in distributed training.
  • All-Gather: Assemble distributed tensor shards.
  • Reduce-Scatter: Reduce and distribute results.
  • Point-to-point: Direct GPU-to-GPU transfer across network.

These operations need RDMA (Remote Direct Memory Access) for low latency and high throughput. Linux provides RDMA through the rdma-core / libibverbs userspace stack, but the kernel's role is minimal (IB/verbs kernel interface is thin).

21.5.1.2 Design: Kernel-Assisted Collectives

The kernel can optimize collectives by understanding the topology:

8-GPU training cluster:
  Machine A: GPU 0, GPU 1 (NVLink connected)
  Machine B: GPU 2, GPU 3 (NVLink connected)
  Machine C: GPU 4, GPU 5 (NVLink connected)
  Machine D: GPU 6, GPU 7 (NVLink connected)
  All machines connected via 100GbE RDMA

Optimal all-reduce strategy:
  1. Intra-machine: GPU 0 ↔ GPU 1 via NVLink (200 GB/s)
  2. Inter-machine: GPU 0 ↔ GPU 2 ↔ GPU 4 ↔ GPU 6 via RDMA (12.5 GB/s)
  3. Intra-machine: Broadcast result GPU 0 → GPU 1, etc.

The kernel knows the topology (device registry + network fabric).
Userspace NCCL/RCCL libraries currently discover this themselves.
The kernel can provide it as a service.
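The benefit of the hierarchical strategy can be estimated with the standard ring all-reduce cost model. A sketch using the bandwidth numbers from the example topology above; the function name is invented here:

```rust
/// Ring all-reduce over n endpoints moves 2*(n-1)/n of the data across
/// each link, and the slowest link dominates.
/// `size_gb` in GB, `link_gbps` in GB/s, result in seconds.
pub fn ring_allreduce_secs(size_gb: f64, n: usize, link_gbps: f64) -> f64 {
    2.0 * (n as f64 - 1.0) / n as f64 * size_gb / link_gbps
}
```

For 1 GB of gradients, a flat 8-GPU ring bounded by the 12.5 GB/s RDMA links takes roughly 0.14 s, while the hierarchical version (pair reduce over NVLink, 4-leader ring over RDMA, pair broadcast over NVLink) takes roughly 0.13 s — and the gap widens as the per-machine GPU count grows, because the hierarchical strategy sends fewer bytes over the slow inter-machine links.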

21.5.1.3 RDMA KABI Interface

/// RDMA device vtable (extends AccelBase for RDMA-capable NICs).
#[repr(C)]
pub struct RdmaDeviceVTable {
    pub vtable_size: u64,
    pub version: u32,

    /// Create a protection domain (PD).
    pub create_pd: unsafe extern "C" fn(
        ctx: *mut c_void,
        out_pd: *mut RdmaPdHandle,
    ) -> IoResultCode,

    /// Register a memory region for RDMA.
    /// The registered region can be targeted by remote DMA.
    pub register_mr: unsafe extern "C" fn(
        ctx: *mut c_void,
        pd: RdmaPdHandle,
        addr: u64,         // Virtual or physical address
        size: u64,
        access: RdmaAccessFlags,
        out_mr: *mut RdmaMrHandle,
        out_rkey: *mut u32,
    ) -> IoResultCode,

    /// Register device memory (GPU VRAM) for RDMA.
    /// Enables GPUDirect RDMA: remote machine can DMA directly to/from GPU.
    pub register_device_mr: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        pd: RdmaPdHandle,
        accel_mem: AccelMemHandle,
        offset: u64,
        size: u64,
        access: RdmaAccessFlags,
        out_mr: *mut RdmaMrHandle,
        out_rkey: *mut u32,
    ) -> IoResultCode>,

    /// Post a send work request. See `RdmaSendWr` below.
    pub post_send: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
        wr: *const RdmaSendWr,
    ) -> IoResultCode,

    /// Post a receive work request. See `RdmaRecvWr` below.
    pub post_recv: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
        wr: *const RdmaRecvWr,
    ) -> IoResultCode,

    /// Poll completion queue.
    pub poll_cq: unsafe extern "C" fn(
        ctx: *mut c_void,
        cq: RdmaCqHandle,
        max_entries: u32,
        out_wc: *mut RdmaWorkCompletion,
        out_count: *mut u32,
    ) -> IoResultCode,

    /// Create a queue pair (QP) for RDMA communication.
    pub create_qp: unsafe extern "C" fn(
        ctx: *mut c_void,
        pd: RdmaPdHandle,
        qp_type: RdmaQpType,
        send_cq: RdmaCqHandle,
        recv_cq: RdmaCqHandle,
        out_qp: *mut RdmaQpHandle,
    ) -> IoResultCode,

    /// Destroy a queue pair.
    pub destroy_qp: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
    ) -> IoResultCode,

    /// Create a completion queue (CQ).
    pub create_cq: unsafe extern "C" fn(
        ctx: *mut c_void,
        min_entries: u32,
        out_cq: *mut RdmaCqHandle,
    ) -> IoResultCode,

    /// Destroy a completion queue.
    pub destroy_cq: unsafe extern "C" fn(
        ctx: *mut c_void,
        cq: RdmaCqHandle,
    ) -> IoResultCode,

    /// Deregister a memory region.
    pub deregister_mr: unsafe extern "C" fn(
        ctx: *mut c_void,
        mr: RdmaMrHandle,
    ) -> IoResultCode,

    /// Destroy a protection domain.
    pub destroy_pd: unsafe extern "C" fn(
        ctx: *mut c_void,
        pd: RdmaPdHandle,
    ) -> IoResultCode,

    // === Connection Management (RC queue pairs) ===

    /// Resolve a route to a remote RDMA destination and initiate connection.
    /// Transitions the QP from INIT → RTR → RTS (for RC transport).
    /// `dest` contains the remote node's LID/GID, QP number, and path info.
    pub connect_qp: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
        dest: *const RdmaConnParams,
    ) -> IoResultCode,

    /// Bind a QP to a local port and listen for incoming RC connections.
    /// The QP must be in the INIT state. When a remote peer calls
    /// `connect_qp` targeting this endpoint, the kernel invokes the
    /// `conn_request_callback` registered via `set_conn_callback`.
    pub listen_qp: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
        port: u8,
    ) -> IoResultCode,

    /// Accept an incoming RC connection request on a listening QP.
    /// Transitions the QP to RTS. `request_id` is the identifier
    /// provided by the `conn_request_callback`.
    pub accept_conn: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
        request_id: u64,
        resp_params: *const RdmaConnParams,
    ) -> IoResultCode,

    /// Gracefully disconnect an RC queue pair. Sends a DREQ to the
    /// remote peer and transitions the QP to the SQD (Send Queue Drained)
    /// and then ERROR state. Outstanding work requests are flushed with
    /// error completions. After disconnect, the QP can be destroyed.
    pub disconnect_qp: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
    ) -> IoResultCode,

    /// Register a callback for incoming connection requests on a listening QP.
    /// The callback receives a `request_id` that can be passed to `accept_conn`
    /// or ignored (to reject the connection).
    pub set_conn_callback: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
        callback: unsafe extern "C" fn(
            qp: RdmaQpHandle,
            request_id: u64,
            remote_params: *const RdmaConnParams,
        ),
    ) -> IoResultCode>,
}

/// RDMA send work request. Describes a send, RDMA write, or RDMA read operation.
#[repr(C)]
pub struct RdmaSendWr {
    /// Work request ID (returned in completion).
    pub wr_id: u64,
    /// Opcode: Send, SendWithImm, RdmaWrite, RdmaWriteWithImm, RdmaRead.
    pub opcode: RdmaSendOpcode,
    /// Send flags (signaled, solicited, inline, fence).
    pub flags: u32,
    /// Scatter-gather list (pointer to array of RdmaSge entries).
    pub sg_list: *const RdmaSge,
    /// Number of scatter-gather entries.
    pub num_sge: u32,
    /// Remote key (for RDMA Read/Write operations).
    pub rkey: u32,
    /// Remote virtual address (for RDMA Read/Write operations).
    pub remote_addr: u64,
    /// Immediate data (for SendWithImm / RdmaWriteWithImm).
    pub imm_data: u32,
    pub _pad: [u8; 4],
}

/// RDMA receive work request. Posts a receive buffer for incoming messages.
#[repr(C)]
pub struct RdmaRecvWr {
    /// Work request ID (returned in completion).
    pub wr_id: u64,
    /// Scatter-gather list (receive buffer descriptors).
    pub sg_list: *const RdmaSge,
    /// Number of scatter-gather entries.
    pub num_sge: u32,
    pub _pad: [u8; 4],
}

/// Scatter-gather entry for RDMA operations.
#[repr(C)]
pub struct RdmaSge {
    /// Virtual address of the buffer.
    pub addr: u64,
    /// Length of the buffer in bytes.
    pub length: u32,
    /// Local key for the memory region containing this buffer.
    pub lkey: u32,
}

/// Connection parameters for RC queue pair connection management.
#[repr(C)]
pub struct RdmaConnParams {
    /// Remote QP number (24 bits, upper 8 bits reserved).
    pub remote_qpn: u32,
    /// Remote LID (local identifier, for InfiniBand subnets).
    pub remote_lid: u16,
    /// Service level (QoS, 0-15).
    pub service_level: u8,
    /// Global routing: remote GID (for RoCE and cross-subnet IB).
    pub remote_gid: [u8; 16],
    /// Explicit padding: `remote_gid` ends at offset 22; `path_mtu` (u32)
    /// needs 4-byte alignment, so 1 byte of padding aligns it to offset 24.
    pub _pad1: u8,
    /// Path MTU (256, 512, 1024, 2048, 4096 bytes).
    pub path_mtu: u32,
    /// Retry count for RC transport (0-7).
    pub retry_count: u8,
    /// RNR (Receiver Not Ready) retry count (0-7).
    pub rnr_retry: u8,
    /// Private data for application-level connection negotiation (up to 56 bytes).
    pub private_data: [u8; 56],
    /// Length of valid private data bytes.
    pub private_data_len: u8,
    /// Tail padding: brings the struct size to 92 bytes, a multiple of
    /// its 4-byte alignment, with no implicit compiler-inserted padding.
    pub _pad: [u8; 5],
}
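The padding arithmetic in RdmaConnParams can be verified at build time. Below is a standalone copy of the struct for a layout check (with the tail pad sized at five bytes so the 92-byte repr(C) size contains no implicit padding), annotated with the expected offset of each field:

```rust
use std::mem::{align_of, offset_of, size_of};

// Standalone copy of RdmaConnParams for layout verification.
#[repr(C)]
pub struct RdmaConnParams {
    pub remote_qpn: u32,         // offset 0
    pub remote_lid: u16,         // offset 4
    pub service_level: u8,       // offset 6
    pub remote_gid: [u8; 16],    // offsets 7..=22 (align 1)
    pub _pad1: u8,               // offset 23, aligns path_mtu to 24
    pub path_mtu: u32,           // offset 24
    pub retry_count: u8,         // offset 28
    pub rnr_retry: u8,           // offset 29
    pub private_data: [u8; 56],  // offsets 30..=85
    pub private_data_len: u8,    // offset 86
    pub _pad: [u8; 5],           // offsets 87..=91, explicit tail pad
}
```

With a three-byte tail pad the compiler would still round the size up to 92, but two of those bytes would be implicit — undesirable for a wire-adjacent KABI struct.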

21.5.1.4 Topology Export

The kernel exports accelerator and network topology so that collective libraries (NCCL, RCCL, Gloo, etc.) can make optimal routing decisions:

/sys/kernel/umka/topology/
    accelerators        # List of all accelerators with NUMA/PCIe location
    p2p_matrix          # NxN matrix of P2P bandwidth between accelerators
    rdma_links          # RDMA network links with bandwidth/latency
    collective_groups   # Pre-computed optimal collective groups

$ cat /sys/kernel/umka/topology/p2p_matrix
# Bandwidth in GB/s (0 = no direct path)
#       GPU0    GPU1    GPU2    GPU3
GPU0    -       200     12.5    12.5
GPU1    200     -       12.5    12.5
GPU2    12.5    12.5    -       200
GPU3    12.5    12.5    200     -

This is additive — NCCL currently discovers topology through a combination of nvidia-smi topo, sysfs traversal, and trial-and-error. A kernel-provided topology file is faster and more reliable.
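A consumer of this file needs only a trivial parser. A sketch assuming the format shown above ('#' comment lines, device name in the first column, '-' on the diagonal):

```rust
/// Parse the p2p_matrix text format: lines starting with '#' are
/// comments, the first column names the device, and "-" (self-to-self)
/// reads as 0. Returns (device name, bandwidth row in GB/s) pairs.
pub fn parse_p2p_matrix(text: &str) -> Vec<(String, Vec<f64>)> {
    text.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .map(|l| {
            let mut cols = l.split_whitespace();
            let name = cols.next().expect("device name").to_string();
            let row = cols
                .map(|c| if c == "-" { 0.0 } else { c.parse().expect("bandwidth") })
                .collect();
            (name, row)
        })
        .collect()
}
```

Compare this with NCCL's current approach: parsing nvidia-smi output and walking sysfs, both of which vary by driver version.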


21.5.2 Linux Compatibility Layer

21.5.2.1 DRM/KMS Compatibility

Linux's Direct Rendering Manager (DRM) is the standard GPU kernel interface. Userspace graphics stacks (Mesa, Vulkan, Wayland compositors) use it via /dev/dri/card* and /dev/dri/renderD*.

UmkaOS provides DRM compatibility through umka-compat:

Userspace (Vulkan, OpenGL, CUDA, etc.)
    |
    | Standard DRM/KMS ioctls
    v
umka-compat/src/drm/
    |
    | Translates DRM ioctls to umka-accel KABI calls
    v
umka-accel scheduler + AccelBase/AccelCompute vtable
    |
    | KABI vtable calls
    v
GPU driver (Tier 1, domain-isolated)

DRM ioctls translated to umka-accel operations:

DRM ioctl                         umka-accel equivalent
DRM_IOCTL_MODE_*                  Display subsystem (AccelDisplayVTable, not covered here)
DRM_IOCTL_*_CTX_CREATE/DESTROY    create_context / destroy_context
DRM_IOCTL_*_GEM_CREATE            alloc_device_memory
DRM_IOCTL_*_GEM_MMAP              map_device_memory
DRM_IOCTL_*_EXECBUFFER            submit_commands (via AccelScheduler)
DRM_IOCTL_*_WAIT                  poll_completion
DRM_IOCTL_PRIME_*                 P2P DMA handle export/import

21.5.2.2 NVIDIA Compatibility

NVIDIA's userspace stack (CUDA runtime, cuDNN, TensorRT) communicates with the kernel driver through proprietary ioctls on /dev/nvidia*. The compatibility approach:

Option A: NVIDIA ships a KABI-native kernel interface layer
  - NVIDIA already has a clean internal split between their proprietary
    compute core and the "kernel interface layer" (nvidia.ko)
  - UmkaOS provides a KABI implementation of this interface layer
  - NVIDIA's compute core links against our KABI implementation
  - Same approach described in Section 23.1.4

Option B: ioctl compatibility shim
  - Translate NVIDIA's /dev/nvidia* ioctls to umka-accel KABI calls
  - More fragile (NVIDIA changes their ioctl interface between releases)
  - Stopgap until Option A is available

Option A is strongly preferred and is the medium-term plan.

21.5.2.3 Feasibility Analysis: Porting Open-Source NVIDIA Drivers

NVIDIA open-sourced their kernel modules (nvidia.ko, nvidia-uvm.ko, nvidia-modeset.ko) in 2022 under dual MIT/GPLv2 license. This section analyzes the feasibility of writing a new UmkaOS-native driver based on this open-source code, preserving binary compatibility with NVIDIA's proprietary userspace stack (CUDA, cuDNN, TensorRT, etc.).

Why this matters: If unmodified libcuda.so, libnvidia-ml.so, cuDNN, TensorRT, and the entire CUDA toolkit work on UmkaOS without recompilation, the kernel is immediately viable for the entire GPU computing ecosystem.

21.5.2.3.1 NVIDIA's Architecture (Post-2022 Open-Source Release)

The modern NVIDIA driver stack has a clean three-layer architecture:

Layer 3: Proprietary Userspace (binary-only, NOT recompiled)
  ┌─────────────────────────────────────────────────────┐
  │  libcuda.so (CUDA runtime)                          │
  │  libnvidia-ml.so (NVML monitoring)                  │
  │  cuDNN, TensorRT, NCCL                              │
  │  Vulkan/OpenGL ICD (libnvidia-glcore.so)            │
  │  NVENC/NVDEC (video encode/decode)                  │
  └──────────────────────┬──────────────────────────────┘
                         │ ioctl() on /dev/nvidia*
                         │ (stable-ish interface, versioned by driver release)
                         v
Layer 2: Open-Source Kernel Module (MIT/GPLv2, THIS IS WHAT WE PORT)
  ┌─────────────────────────────────────────────────────┐
  │  nvidia.ko                                          │
  │  ├── OS interface layer (nv-linux.c, nv-pat.c,     │
  │  │   nv-mmap.c, nv-i2c.c, nv-acpi.c, etc.)        │
  │  │   → Linux kernel API calls (PCI, DMA, IRQ, MM)  │
  │  │   → THIS is what we rewrite for UmkaOS       │
  │  │                                                  │
  │  ├── RM (Resource Manager) core                     │
  │  │   → Hardware-agnostic resource management        │
  │  │   → Talks DOWN to GSP firmware via RM RPC        │
  │  │   → Talks UP to userspace via control ioctls     │
  │  │   → We can reuse this largely unchanged          │
  │  │                                                  │
  │  └── Entry points: module_init/exit, ioctl dispatch │
  │      → Rewrite for KABI driver lifecycle             │
  │                                                     │
  │  nvidia-uvm.ko (Unified Virtual Memory)             │
  │  ├── Page fault handler, migration engine           │
  │  ├── Uses Linux HMM (mmu_notifiers)                 │
  │  └── Integrate with umka-core HMM (Section 21.2)       │
  │                                                     │
  │  nvidia-modeset.ko (Display/KMS)                    │
  │  ├── Mode setting, display management               │
  │  └── Map to AccelDisplayVTable                       │
  └──────────────────────┬──────────────────────────────┘
                         │ RM RPC (register read/write commands)
                         v
Layer 1: GSP Firmware (runs ON the GPU, opaque, NOT our concern)
  ┌─────────────────────────────────────────────────────┐
  │  GPU System Processor firmware                      │
  │  - Loaded by kernel module during initialization    │
  │  - Handles actual hardware programming              │
  │  - Context switching, power management, ECC, etc.   │
  │  - Communicates with kernel via shared memory + IRQ │
  │  - Binary blob, but runs on GPU, not on CPU         │
  │  - We just need to load it and talk RM RPC to it    │
  └─────────────────────────────────────────────────────┘

Key insight: nvidia.ko is NOT a traditional monolithic driver. On modern GPUs (Turing and newer, i.e., everything from 2018+), the kernel module is primarily a thin translation layer between Linux kernel APIs and the GSP firmware running on the GPU itself. The "intelligence" is in the GSP firmware, not in the kernel module.

21.5.2.3.2 Component-by-Component Porting Analysis

Component 1: OS Interface Layer → KABI Translation (Mechanical, ~60% of work)

Linux API Used                      UmkaOS Equivalent                                Difficulty
pci_register_driver()               KABI device registry match + register_driver()   Trivial
request_irq() / free_irq()          KABI request_irq() / free_irq()                  Trivial
dma_alloc_coherent() / dma_map_*()  KABI dma_alloc() / dma_map()                     Trivial
ioremap() / iounmap()               KABI ioremap() / iounmap()                       Trivial
alloc_pages() / __free_pages()      KABI alloc_pages()                               Trivial
mutex_lock() / spin_lock()          KABI mutex / spinlock (or Rust equivalents)      Trivial
timer_setup() / schedule_work()     KABI timer / workqueue equivalents               Easy
pci_enable_msi() / MSI-X            KABI MSI/MSI-X support                           Easy
sysfs_create_group()                KABI property export (device registry)           Easy
ACPI methods (power management)     KABI power state callbacks                       Easy
vm_area_struct / vm_operations      KABI memory mapping interface                    Medium
mmu_notifier_*() (for UVM)          UmkaOS HMM PageLocationTracker (Section 21.2)    Medium
/proc/driver/nvidia/                /sys/kernel/umka/accel/ (Section 21.5.2.5)       Easy

These mappings cover roughly 90 files in the open-source nvidia.ko that implement Linux kernel API calls. The work is mechanical: each Linux API call maps to its KABI equivalent. There are no algorithmic decisions — it's pure API translation.

Component 2: ioctl Interface → /dev/nvidia* Compatibility (Critical)

The binary userspace communicates exclusively through ioctls on:

  • /dev/nvidia0, /dev/nvidia1, ... (per-GPU control)
  • /dev/nvidiactl (system-level control)
  • /dev/nvidia-uvm (unified virtual memory)
  • /dev/nvidia-uvm-tools (profiling)
  • /dev/nvidia-modeset (display)

Approach:
  umka-compat provides /dev/nvidia* character devices.
  Each ioctl is dispatched to the UmkaOS NVIDIA driver (Tier 1).
  The driver's RM core processes the ioctl exactly as it does on Linux.

  The critical point: we do NOT reinterpret ioctls.
  We pass them through to the same RM core code (open-source).
  The RM core talks to GSP firmware, which does the actual work.
  The ioctl ABI is preserved byte-for-byte.

The ioctl numbers and structures are defined in NVIDIA's open-source headers (nvidia-uvm/uvm_linux_ioctl.h, nvidia/nv-ioctl-numbers.h). Since the RM core is open-source and handles ioctl dispatch internally, we preserve the exact same ioctl interface. Binary libcuda.so cannot distinguish UmkaOS from Linux.
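Because the RM core dispatches on these numbers itself, the shim only routes raw command words. For reference, those command words use the standard Linux _IOC() bit layout (a generic Linux convention, decoded here as an illustration):

```rust
/// Decode the standard Linux _IOC() ioctl-number encoding:
/// bits 0-7 = command nr, bits 8-15 = magic ("type"),
/// bits 16-29 = argument size, bits 30-31 = direction
/// (on x86: 1 = write, 2 = read, 3 = read/write).
/// Returns (dir, magic, nr, size).
pub fn ioc_decode(cmd: u32) -> (u32, u32, u32, u32) {
    let nr = cmd & 0xff;
    let magic = (cmd >> 8) & 0xff;
    let size = (cmd >> 16) & 0x3fff;
    let dir = (cmd >> 30) & 0x3;
    (dir, magic, nr, size)
}
```

The compat layer never needs to decode argument payloads — it forwards the command word and the user pointer to the RM core's dispatcher unchanged.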

Component 3: UVM (Unified Virtual Memory) → UmkaOS HMM Integration (Hardest Part)

nvidia-uvm.ko implements CUDA Unified Memory. On Linux, it hooks deeply into the memory management subsystem:

Linux UVM hooks:
  - mmu_notifier (track CPU page table changes)
  - hmm_range_fault() (resolve CPU page faults for GPU access)
  - migrate_vma() (migrate pages between CPU and GPU)
  - Fault handler for GPU page faults (ATS/PRI)

UmkaOS's advantage: Section 21.2 (Heterogeneous Memory Management) provides exactly these primitives as first-class kernel features. The mapping is:

Linux UVM mechanism                   UmkaOS equivalent
mmu_notifier_register()               PageLocationTracker subscription
hmm_range_fault()                     UmkaOS HMM handle_device_fault() callback
migrate_vma_setup/pages/finalize()    AccelBase migrate_pages()
fault_handler() for ATS               UmkaOS device fault handler (Section 21.2.1.4)
Per-process GPU page tables           AccelContext memory management
UVM counters / access tracking        UmkaOS memory accounting (cgroup accel.memory.*)

This is the most complex component because UVM deeply interweaves with memory management internals. However, UmkaOS's HMM design (Section 21.2) was designed with exactly this use case in mind, so the mapping is cleaner than on Linux.

Component 4: GSP Firmware Loading and Communication (Straightforward)

The kernel module loads GSP firmware from the filesystem (/lib/firmware/nvidia/) into GPU VRAM, then communicates via shared memory regions and interrupts. This is self-contained within the RM core and does not depend on Linux APIs beyond basic DMA and interrupt handling — both of which map trivially to KABI.

Component 5: Display / KMS (nvidia-modeset.ko) (Standard)

Maps to AccelDisplayVTable. Standard DRM/KMS translation (Section 21.5.2.1) handles most of this. NVIDIA's modeset module is relatively thin compared to the compute path.

21.5.2.3.3 Difficulty Rating Summary

Component                                Difficulty  Notes
OS interface layer rewrite               3/10        Mechanical API translation
ioctl passthrough (/dev/nvidia*)         2/10        Character device + dispatch
GSP firmware loading                     2/10        Self-contained in RM core
PCI/DMA/IRQ setup                        2/10        Trivial KABI mapping
UVM → UmkaOS HMM integration             8/10        Deepest integration point (~40K lines in upstream UVM; essentially a rewrite of the MM integration layer)
Display / modeset                        4/10        Standard DRM/KMS path
Power management (ACPI/PCIe)             3/10        Device registry PM callbacks
Testing + binary userspace validation    5/10        Must test full CUDA stack

21.5.2.3.4 Binary Userspace Compatibility Verification

The following must work without recompilation on UmkaOS:

Component                                     Interface Used                  Compatibility Path
libcuda.so (CUDA runtime)                     /dev/nvidia* ioctls             ioctl passthrough to RM core
libnvidia-ml.so (NVML)                        /dev/nvidiactl ioctls + sysfs   ioctl passthrough + sysfs compat
cuDNN / TensorRT                              Links against libcuda.so        Transitive (libcuda works → these work)
nvidia-smi                                    NVML library                    Transitive
NCCL (multi-GPU)                              libcuda + /dev/nvidia-uvm       UVM compat + P2P DMA support
Vulkan ICD                                    DRM + /dev/nvidia*              DRM compat (Section 21.5.2.1) + ioctl passthrough
NVENC/NVDEC                                   /dev/nvidia* ioctls             ioctl passthrough
Container runtime (nvidia-container-toolkit)  cgroup + device files           cgroup compat + device file compat

21.5.2.3.5 What UmkaOS Does Better Than Linux for NVIDIA GPUs

Capability                  Linux Behavior                         UmkaOS Behavior
GPU crash recovery          System reboot required                 Driver reload in ~100ms–5s (Section 21.1.3.2)
GPU scheduling              Driver-internal, invisible             Kernel-managed, cgroup-integrated (Section 21.1.2.4)
GPU memory limits           None (driver tracks, no enforcement)   cgroup accel.memory.max (Section 21.3.2)
GPU compute QoS             None                                   cgroup accel.compute.guarantee (Section 21.3.4)
GPU memory in OOM killer    Invisible                              Full visibility, OOM-killable (Section 21.3.3)
Multi-tenant isolation      MIG only (hardware-dependent)          Software scheduling + MIG (Section 21.3)
GPU observability           nvidia-smi (polling)                   Stable tracepoints + eBPF (Section 21.1.3.4)
UVM performance             Bolted-on HMM, driver-specific         First-class HMM, kernel-managed (Section 21.2)
P2P DMA (GPUDirect)         NVIDIA-specific API                    Generalized KABI P2P (Section 21.2)
Power management            Driver-internal                        Topology-driven, device-registry-integrated

21.5.2.3.6 Risk Assessment

Risk                                         Likelihood  Impact    Mitigation
NVIDIA changes ioctl ABI between releases    Medium      High      Pin to a specific driver release initially; NVIDIA's ioctl ABI is versioned and backwards-compatible in practice
UVM integration bugs cause data corruption   Low         Critical  Extensive testing with CUDA memory stress tests; UVM has well-defined semantics
GSP firmware version incompatibility         Low         High      Support specific firmware versions matching the open-source driver release
Performance regression vs Linux              Medium      Medium    Profile early; most performance is in GSP firmware and userspace, not the kernel module
NVIDIA open-source license compliance        None        N/A       MIT/GPLv2 dual license; our OKLF is compatible with both

21.5.2.3.7 Implementation Strategy

Phase 1: Basic GPU Initialization
  - Port OS interface layer to KABI
  - GSP firmware loading
  - /dev/nvidia* character devices with ioctl passthrough
  - PCI, DMA, IRQ setup via KABI
  - Goal: nvidia-smi shows GPU info

Phase 2: Compute Workloads
  - Full RM core integration
  - CUDA simple programs work (vector add, matmul)
  - Basic context management through AccelBase
  - Goal: CUDA samples compile and run

Phase 3: UVM and Multi-GPU
  - UVM integration with UmkaOS HMM
  - Unified memory (cudaMallocManaged) works
  - P2P DMA for multi-GPU (NVLink, PCIe)
  - NCCL multi-GPU collectives
  - Goal: PyTorch distributed training works

Phase 4: Production Hardening
  - Display / modeset (AccelDisplayVTable)
  - Full cgroup integration
  - Crash recovery testing
  - Performance parity validation
  - Goal: Production-ready for datacenter + desktop

21.5.2.3.8 Licensing Note

NVIDIA's open-source kernel modules are dual-licensed MIT/GPLv2. Since we are writing a new driver inspired by the open-source code (not copy-pasting it), and our KABI interface layer is original work, there are no licensing concerns. The ioctl numbers and structures are functional interfaces (facts, not copyrightable expression). The GSP firmware is a binary blob loaded onto the GPU (not linked into our kernel). This is fully compatible with OKLF v1.3.

21.5.2.4 VFIO Passthrough

For VMs that need direct device access. VFIO is a general-purpose mechanism (see Section 10.5.3.8) — it works identically for GPUs, NICs, NVMe controllers, and any other PCIe device. The GPU-specific example:

/dev/vfio/ interface (Section 10.4, Tier 2 driver path)
    |
    v
umka-kvm (KABI Tier 1 driver)
    |
    | Programs IOMMU for VM isolation
    v
PCIe device (entire device assigned to VM, VM's guest driver manages it)

VFIO passthrough works unchanged from Linux. The device is assigned to the VM at the IOMMU group level (Section 10.5.3.8).

21.5.2.5 UmkaOS-Specific Interfaces (Superset)

Beyond Linux compatibility, new interfaces for UmkaOS-aware software:

/dev/umka-accel-0           # UmkaOS-native accelerator access
/dev/umka-accel-1           # (one per accelerator device)

/sys/kernel/umka/accel/
    devices/
        0/
            info            # AccelDeviceInfo (JSON-formatted)
            utilization     # Current utilization stats
            contexts        # Active context list
            memory          # Memory usage breakdown
            power           # Power/thermal/clock state
            topology        # PCIe/NVLink/xGMI connections
            partitions/     # MIG/SR-IOV partitions (if available)
                0/
                    info
                    bind    # Bind to cgroup
    scheduler/
        policy              # Global scheduling policy
        stats               # Global scheduling statistics
    topology/
        p2p_matrix          # Bandwidth matrix
        rdma_links          # Network links
        numa_map            # Accelerator-to-NUMA mapping
    inference/
        models/             # In-kernel model management (Section 21.4)

Existing Linux tools (nvidia-smi, rocm-smi, intel_gpu_top) continue to work through the DRM/sysfs compatibility layer. UmkaOS-specific tools (umka-accel-top, umka-gpu-smi) can use the richer /sys/kernel/umka/accel/ interface for more detailed information and control.

21.5.2.6 Display Stack: Wayland and Buffer Sharing

The modern Linux display stack centers on Wayland compositors consuming DRM/KMS (Section 21.5.2.1). UmkaOS provides the full kernel infrastructure these compositors require, with crash-recovery advantages that Linux cannot offer.

DMA-BUF (Cross-Device Buffer Sharing)

DMA-BUF is the kernel primitive for sharing memory buffers between devices — GPU to display controller, GPU to video encoder, camera to GPU, GPU to network (RDMA). In Linux, DMA-BUF is a global namespace with weak access control. In UmkaOS, each DMA-BUF is a capability-protected kernel object:

DMA-BUF lifecycle:
  1. Exporter (e.g., GPU driver) creates a DMA-BUF from device memory
  2. Returns a capability token (not a raw file descriptor)
  3. Importer (e.g., display driver) receives the capability via IPC
  4. Importer maps the DMA-BUF into its device's address space
  5. Zero-copy path: both devices access the same physical memory

For Linux compatibility, DMA-BUF capabilities are presented as file descriptors through the compat layer, so existing userspace (Mesa, Wayland compositors, GStreamer) works unmodified.

Explicit Synchronization (dma_fence / sync_file)

GPU and display operations are asynchronous. Explicit sync primitives coordinate them:

  • dma_fence: kernel-internal synchronization point representing GPU work completion. Created by the GPU driver when work is submitted, signaled when the GPU finishes.
  • sync_file: userspace-visible wrapper around one or more dma_fences. Exported as a file descriptor. Wayland compositors use these to know when a client's rendering is complete before presenting.
  • Timeline semaphores: Vulkan-style monotonic sync objects. More efficient than binary fences for pipelined workloads (render frame N while displaying frame N-1).

UmkaOS implements the Linux-compatible sync_file ioctl interface (SYNC_IOC_MERGE, SYNC_IOC_FILE_INFO) so existing Wayland compositors and Vulkan drivers work without modification.
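As a semantic sketch (not kernel code): a merged sync_file collects the fences of both inputs and is signaled only when every constituent fence has signaled, which is what SYNC_IOC_MERGE guarantees. The type and field names below are invented for illustration:

```rust
/// Toy model of sync_file merge semantics. The real kernel object
/// tracks dma_fence references, not booleans.
#[derive(Clone)]
pub struct SyncFileModel {
    pub fences: Vec<bool>, // true = fence has signaled
}

impl SyncFileModel {
    /// SYNC_IOC_MERGE: the merged file contains all fences of both inputs.
    pub fn merge(a: &SyncFileModel, b: &SyncFileModel) -> SyncFileModel {
        SyncFileModel {
            fences: a.fences.iter().chain(b.fences.iter()).copied().collect(),
        }
    }

    /// The merged file signals only when every constituent fence has.
    pub fn signaled(&self) -> bool {
        self.fences.iter().all(|&f| f)
    }
}
```

This "all must signal" rule is why a compositor can merge the fences of several client buffers into one object and wait on it before an atomic commit.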

GBM (Generic Buffer Manager)

GBM is the userspace buffer allocation library that Wayland compositors use to allocate scanout-capable buffers. It talks to DRM via ioctl. UmkaOS's DRM compatibility layer (Section 21.5.2.1) ensures GBM works unmodified — gbm_create_device(), gbm_bo_create(), and related functions issue standard DRM ioctls that UmkaOS handles.

Render Nodes (/dev/dri/renderD*)

Render nodes provide unprivileged GPU compute and render access without modesetting capability. Any user can open a render node to submit GPU work (3D rendering, compute shaders, video encode/decode) without needing root or DRM master status. UmkaOS exposes render nodes via the DRM compat layer, mapping each to an umka-accel context creation with appropriate capability restrictions (no modesetting, no display control).

DRM Leases

DRM leases allow a client to take exclusive control of a display connector — the primary consumer is VR headsets (SteamVR uses DRM leases to drive the HMD display independently from the desktop compositor). UmkaOS supports the lease ioctl family (DRM_IOCTL_MODE_CREATE_LEASE, DRM_IOCTL_MODE_LIST_LESSEES, etc.) through the DRM compatibility layer.

KMS Atomic Modesetting

The modern display pipeline API. Replaces legacy drmModeSetCrtc with atomic commits that update multiple display properties (resolution, refresh, gamma, position) as a single transaction — either the entire commit succeeds or nothing changes. UmkaOS's DRM layer implements the full atomic commit interface (DRM_IOCTL_MODE_ATOMIC) including:

  • TEST_ONLY mode (compositor can test configurations without applying them)
  • Non-blocking commits (compositor doesn't stall waiting for vblank)
  • Property-based interface (all display state expressed as properties)

Multi-GPU Display (PRIME)

Render on GPU A, scanout on GPU B via DMA-BUF sharing (known as PRIME in the DRM ecosystem). The P2P DMA infrastructure (Section 21.2) handles the underlying data transfer. For the display case specifically:

  • DMA-BUF export from render GPU → DMA-BUF import on display GPU
  • If GPUs share a PCIe switch, transfer is direct P2P (no CPU copy)
  • If GPUs are on different NUMA nodes, transfer goes through host memory

HDR and Wide Color Gamut

Modern displays support HDR10, Dolby Vision, and wide color gamuts (DCI-P3, Rec. 2020). Kernel support is exposed via KMS atomic properties:

  • HDR_OUTPUT_METADATA: HDR static/dynamic metadata per connector
  • COLORSPACE: signal color space to the display (BT.709, DCI-P3, BT.2020)
  • MAX_BPC: maximum bits-per-channel for the connector (8, 10, 12, 16)
  • COLOR_ENCODING / COLOR_RANGE: YCbCr encoding and quantization range

These are standard KMS properties exposed through atomic modesetting. Wayland compositors (wlroots, Mutter, KWin) use these to negotiate HDR output with the display.

Crash Recovery Advantage

If the DRM/GPU driver crashes (Tier 1 reload, Section 10.8):

  1. The driver binary is reloaded in ~50-150ms
  2. DMA-BUF metadata (size, format, sharing graph, capability tokens) survives the reload — this metadata is managed by umka-core's memory subsystem, not the driver. Buffer handles remain valid. However, VRAM contents are lost on device reset — a GPU reset physically reinitializes the device memory, destroying all framebuffer data, texture contents, and compute buffers stored in VRAM. Applications must re-upload data to the GPU after recovery.
  3. Sync objects are re-created in signaled state (conservative — forces resubmission of any pending work, but doesn't deadlock the compositor)
  4. The Wayland compositor sees a brief stall (~100ms–5s, the full recovery window) and must re-render its buffers (VRAM contents are lost), but does not lose its buffer handles or capability tokens, does not crash, and does not need to re-negotiate with clients. The compositor's CPU-side state (window tree, client list, layout) is fully preserved.
  5. Running applications still hold valid DMA-BUF capabilities. They receive an ENODATA error on the next access to a VRAM-backed buffer, indicating that the buffer contents are stale and must be re-uploaded. Applications that maintain CPU-side copies of their rendering data (the common case for double-buffered Wayland clients) can re-upload and resume rendering immediately. Applications that relied solely on VRAM-resident data with no CPU-side copy must regenerate or reload from disk.

In Linux, a GPU driver crash typically kills the entire X/Wayland session, requiring all graphical applications to restart. UmkaOS's display stack crash recovery preserves the sharing graph and buffer metadata, allowing rapid re-rendering rather than full session teardown. This is a direct consequence of the capability-managed DMA-BUF design and Tier 1 driver reload.
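The ENODATA re-upload path described above can be modeled with a toy sketch — GpuBuffer, read_buffer, and access_with_recovery are illustrative stand-ins for an application's own helpers, not a real API:

```rust
/// Toy model of the post-recovery buffer access pattern. After a Tier 1
/// driver reload, VRAM-backed buffers return ENODATA (modeled as NoData)
/// until the client re-uploads its CPU-side copy.
#[derive(Debug, PartialEq)]
enum AccessError {
    NoData, // kernel's ENODATA surfaced on a stale VRAM-backed buffer
}

struct GpuBuffer {
    vram_valid: bool,  // false after a device reset wiped VRAM
    contents: Vec<u8>, // stand-in for the VRAM-resident data
}

fn read_buffer(buf: &GpuBuffer) -> Result<&[u8], AccessError> {
    if buf.vram_valid {
        Ok(&buf.contents)
    } else {
        Err(AccessError::NoData)
    }
}

/// Double-buffered clients (the common Wayland case) keep a CPU-side copy
/// and recover by re-uploading it; the DMA-BUF handle itself stays valid.
fn access_with_recovery<'a>(buf: &'a mut GpuBuffer, cpu_copy: &[u8]) -> &'a [u8] {
    if read_buffer(buf).is_err() {
        buf.contents = cpu_copy.to_vec(); // re-upload after device reset
        buf.vram_valid = true;
    }
    &buf.contents
}
```

Applications with no CPU-side copy have no `cpu_copy` to fall back on, which is exactly the regenerate-or-reload-from-disk case described in step 5.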


21.6 Unified Compute Model

21.6.1 The Convergence Problem

The architecture currently treats CPU scheduling (Section 6.1) and accelerator scheduling (Section 21.1.2.4) as separate worlds:

World 1 — CPU Scheduler (Section 6.1):
  Input:     threads (instruction streams)
  Resources: CPU cores (P-core, E-core, RISC-V harts)
  Decision:  which core runs this thread?
  Cgroup:    cpu.max, cpu.weight

World 2 — Accelerator Scheduler (Section 21.1.2.4):
  Input:     command buffers (GPU kernels, inference requests)
  Resources: accelerator contexts (GPU CUs, NPU engines)
  Decision:  which context gets device time?
  Cgroup:    accel.compute.max, accel.compute.weight

These worlds share no abstraction. The kernel cannot answer: "given a fixed power budget and a matrix workload, is it more efficient to run on CPU-AMX, GPU, or NPU?" It cannot balance compute load across device types. It cannot make holistic energy decisions.

Meanwhile, the hardware is converging:

Trend                          Example                                            Implication
─────────────────────────────  ─────────────────────────────────────────────────  ──────────────────────────────────────
CPU gains matrix ops           Intel AMX (P-cores), ARM SME                       CPU can do what GPUs used to do
CPU+GPU share memory           AMD APU, Apple M-series, Grace Hopper NVLink-C2C   No DMA copy between CPU↔GPU
Heterogeneous ISA within CPUs  RISC-V: some harts have Vector, some don't (6.1.5.9)  "CPU with Vector" vs "GPU CU" is a matter of degree
CXL 3.0 shared memory          Samsung CMM-H, Intel Ponte Vecchio + CXL           Hardware-coherent memory shared by CPU and accelerator
On-die NPU                     Intel Meteor Lake NPU, Qualcomm Hexagon            NPU is as close to CPU as an E-core

The conceptual leap from Section 6.1.5.9 (RISC-V harts with different ISA extensions) to "GPU CU as another compute unit type" is small. Both are compute resources with different capability profiles and power characteristics.

21.6.2 Design Principle: Overlay, Not Replacement

Critical constraint: This must work Day 1 as a Linux drop-in replacement. NVIDIA's proprietary userspace (libcuda, libnvidia-ml, cuDNN) runs unmodified. CUDA applications explicitly target the GPU — the kernel does NOT redirect them. AMD ROCm, Intel oneAPI, all work as-is.

The unified compute model is an advisory overlay on top of the existing separate schedulers:

                    ┌──────────────────────────────┐
                    │  Unified Compute Topology    │  ← NEW (advisory)
                    │  Multi-dimensional capacity  │
                    │  Cross-device energy model   │
                    └───────┬───────────────┬──────┘
                            │               │
                    ┌───────▼─────┐ ┌───────▼─────────────┐
                    │CPU Scheduler│ │Accel Scheduler      │  ← UNCHANGED
                    │CFS + EAS    │ │CBS + Prio           │
                    │(Section 6.1)│ │(Section 21.1.2.4)   │
                    └─────────────┘ └─────────────────────┘
  • Existing schedulers continue to make execution decisions independently.
  • The unified layer provides topology, capacity, and energy data that both schedulers and userspace runtimes can consume.
  • No vendor must rewrite anything. Benefits accrue from kernel-side visibility.

21.6.3 Multi-Dimensional Compute Capacity

Section 6.1.5.1 defines CpuCapacity as a single scalar (0–1024). This works for heterogeneous CPUs because all cores execute the same type of work (general-purpose instructions) at different speeds.

Once accelerators enter the picture, capacity becomes a vector — different compute units excel at different workload types:

// umka-core/src/compute/capacity.rs

/// Multi-dimensional capacity profile for any compute unit.
///
/// Values are ABSOLUTE (device-intrinsic), not normalized to the system.
/// Each dimension is in hardware-specific units that do not change when
/// devices are hot-plugged. The kernel maintains a per-system max for
/// each dimension (updated lazily on device arrival/departure) and
/// normalizes to 0–1024 ONLY for sysfs display (compute.capacity_normalized).
///
/// Why absolute: normalizing internally means hot-plugging a faster GPU
/// would silently change every other device's capacity values, breaking
/// comparisons across snapshots and racing with concurrent readers.
///
/// Used for:
///   - Power budgeting: informed cross-device throttling (Section 6.4)
///   - Intent-based management: workload-to-device advisory (Section 6.7)
///   - Userspace runtime hints: exposed via sysfs for OpenCL/SYCL
///
/// NOT used for: actual scheduling decisions (those remain in
/// CPU scheduler and AccelScheduler respectively).
pub struct ComputeCapacityProfile {
    /// Scalar integer throughput (million instructions per second, MIPS-equivalent).
    /// CPU P-core: ~50000. GPU CU: ~200. NPU: 0.
    pub scalar: u32,

    /// Vector/SIMD throughput (GFLOPS single-precision equivalent).
    /// GPU CU: ~2000. CPU P-core with AVX-512: ~300. NPU: 0.
    pub vector: u32,

    /// Matrix throughput (TOPS, tera-operations per second for matmul).
    /// NPU: ~40. GPU tensor core: ~300. CPU AMX: ~5. CPU scalar: ~0.
    pub matrix: u32,

    /// Memory bandwidth (GB/s).
    /// GPU HBM3: ~3000. CPU DDR5: ~100. NPU on-chip SRAM: ~500.
    pub memory_bw: u32,

    /// Launch overhead (inverse, microseconds to first useful work).
    /// CPU: 1 (thread wakeup ~1μs). GPU: 30 (kernel launch ~30μs).
    /// NPU: 200 (model load + inference setup).
    /// Lower = better. Determines crossover: small tasks favor CPU.
    pub launch_overhead_us: u32,
}

Example profiles for three systems:

Grace Hopper system (ARM CPU + H100 GPU):
CPU Grace core (ARM Neoverse):
  scalar=45000  vector=300  matrix=2    memory_bw=100  launch_overhead_us=1

GPU H100 SM:
  scalar=200    vector=2000 matrix=300  memory_bw=3000 launch_overhead_us=30

Intel Alder Lake system with discrete GPU and NPU:
CPU P-core:   scalar=50000  vector=300  matrix=5   memory_bw=80  launch_overhead_us=1
CPU E-core:   scalar=25000  vector=150  matrix=2   memory_bw=80  launch_overhead_us=1
iGPU EU:      scalar=100    vector=600  matrix=10  memory_bw=50  launch_overhead_us=20
Discrete GPU: scalar=200    vector=2000 matrix=250 memory_bw=900 launch_overhead_us=30
NPU:          scalar=0      vector=0    matrix=40  memory_bw=50  launch_overhead_us=200

RISC-V SoC with heterogeneous harts + accelerators:
Hart (RV64GC):   scalar=15000  vector=0    matrix=0   memory_bw=30  launch_overhead_us=1
Hart (RV64GCV):  scalar=15000  vector=200  matrix=1   memory_bw=30  launch_overhead_us=1
Custom ML hart:  scalar=3000   vector=100  matrix=20  memory_bw=60  launch_overhead_us=5
Attached NPU:    scalar=0      vector=0    matrix=10  memory_bw=40  launch_overhead_us=200

(These values are illustrative — actual values are populated at device registration
time from driver-reported specs and hardware capability queries. Sysfs normalizes
each dimension to 0-1024 for display, where best-in-system = 1024.)

Key property: CPU cores already have entries (derived from CpuCapacity in Section 6.1.5.1). Accelerators get profiles through the AccelBase vtable (Section 21.1.2.2), extended with a new get_capacity_profile entry. This is a minor KABI extension — one new function pointer that returns static data.

21.6.4 Unified Compute Topology

The device registry (Section 10.5) already models all devices in one tree. The unified compute layer adds a compute view that flattens this into a map of compute units with their capabilities, power profiles, and memory domains:

// umka-core/src/compute/topology.rs

/// A compute unit in the unified topology.
/// Can be a CPU core, GPU SM, NPU engine, DSP core, etc.
pub struct ComputeUnit {
    /// Device registry node ID.
    pub device_id: DeviceNodeId,

    /// What kind of compute unit this is.
    pub unit_type: ComputeUnitType,

    /// Multi-dimensional capacity profile.
    pub capacity: ComputeCapacityProfile,

    /// Which memory domain is local to this compute unit?
    /// CPU cores → system RAM NUMA node.
    /// GPU SMs → VRAM NUMA node (Section 21.2).
    /// APU GPU → same NUMA node as CPU (shared memory).
    pub memory_domain: MemoryDomainId,

    /// Is memory shared with CPU without DMA copy?
    /// true for: APU, Apple M-series, Grace Hopper, CXL-attached accelerator.
    /// false for: discrete PCIe GPU (data must be explicitly transferred).
    pub memory_unified_with_cpu: bool,

    /// Energy model: OPP table (same format as Section 6.1.5.2).
    /// GPU OPPs come from the driver via AccelBase get_utilization/set_performance_level.
    pub energy_model: Option<EnergyModelRef>,

    /// Current utilization (0–1024), updated periodically.
    /// CPU: from PELT (Section 6.1.5.4).
    /// Accelerator: from AccelBase get_utilization (Section 21.1.2.2).
    pub utilization: AtomicU32,
}

#[derive(Clone, Copy, PartialEq, Eq)]
#[repr(u32)]
pub enum ComputeUnitType {
    /// General-purpose CPU core. Managed by CPU scheduler (Section 6.1).
    CpuCore         = 0,
    /// GPU compute unit (SM/CU/EU). Managed by AccelScheduler (Section 21.1.2.4).
    GpuCompute      = 1,
    /// Neural processing unit. Managed by AccelScheduler.
    NpuEngine       = 2,
    /// Digital signal processor. Managed by AccelScheduler.
    DspCore         = 3,
    /// FPGA reconfigurable logic. Managed by AccelScheduler.
    FpgaSlot        = 4,
    /// Computational storage processor (Section 14.8). Managed by AccelScheduler.
    CsdProcessor    = 5,
    /// No single dominant domain. Never assigned to a physical compute unit;
    /// used only as a derived WorkloadProfile classification (Section 21.6.6).
    Mixed           = 6,
}

Population: The topology is built at boot and updated on hot-plug:

1. CPU cores:    discovered by existing CPU topology (Section 6.1.5.10).
                 ComputeCapacityProfile derived from CpuCapacity + IsaCapabilities.

2. Accelerators: discovered by device registry (Section 10.5) when AccelBase driver loads.
                 Driver provides ComputeCapacityProfile via get_capacity_profile().
                 If driver doesn't implement it (legacy, compat):
                   → kernel estimates from AccelDeviceClass + get_utilization().
                   → NVIDIA compat driver: AccelDeviceClass::GpuCompute,
                     bandwidth/utilization from nvidia-smi equivalent queries.
                   → No NVIDIA code change needed. Kernel reads existing telemetry.

21.6.5 Cross-Device Energy Optimization

The power budgeting system (Section 6.4) currently reads power per domain (CPU package, DRAM, Accelerator) and throttles independently. With unified topology, it can make informed cross-device decisions:

Current (independent throttling):
  Container exceeds power.max.
  → Throttle CPU (reduce frequency).
  → Throttle GPU (reduce clock).
  → Both throttled equally. Dumb.

With unified compute awareness:
  Container exceeds power.max.
  → Kernel reads workload profile:
      80% of compute is matrix ops (GPU-bound).
      20% is scalar (CPU, mostly waiting for GPU).
  → Informed decision:
      Keep GPU at high clock (it's doing the useful work).
      Aggressively throttle CPU (it's mostly idle-waiting anyway).
      Save more power with less performance loss.

This requires no change to throttle mechanisms — just better information for PowerBudgetEnforcer (Section 6.4.4) to decide WHICH domain to throttle.

// Extension to umka-core/src/power/budget.rs

impl PowerBudgetEnforcer {
    /// When a cgroup exceeds its power budget, decide which domain to throttle.
    /// Uses unified compute topology to understand where useful work is happening.
    fn select_throttle_target(
        &self,
        cgroup: CgroupId,
        excess_mw: u32,
    ) -> ArrayVec<ThrottleAction, MAX_POWER_DOMAINS> {
        let topology = unified_compute_topology();
        let workload = cgroup_workload_profile(cgroup);

        // Score each domain by "usefulness" = domain's contribution to
        // the cgroup's primary workload type.
        // Throttle the LEAST useful domain first.
        // domains is bounded by MAX_POWER_DOMAINS (no heap allocation).
        let mut domains: ArrayVec<_, MAX_POWER_DOMAINS> = self.domains.iter()
            .filter(|d| d.cgroup_attribution(cgroup) > 0)
            .map(|d| (d, workload.usefulness_score(d, &topology)))
            .collect();

        // Sort: least useful first (throttle first).
        domains.sort_by_key(|(_, score)| *score);

        // Apply throttle actions starting from least useful domain
        // until excess_mw is recovered.
        self.build_throttle_plan(&domains, excess_mw)
    }
}
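The usefulness_score referenced above is not pinned down by this section; one plausible sketch scores a domain by the dot product of the cgroup's workload mix and the domain's capacity strengths (types simplified for illustration — the real kernel types live in compute/topology.rs):

```rust
/// Sketch of the "usefulness" scoring used by select_throttle_target().
/// A domain is useful to a cgroup in proportion to how much of the
/// cgroup's workload matches what the domain is good at.
struct DomainStrength {
    scalar: u32, // per-mille share of system scalar capacity in this domain
    vector: u32,
    matrix: u32,
}

struct Fractions {
    scalar: u32, // per-mille of the cgroup's compute demand (0-1000)
    vector: u32,
    matrix: u32,
}

/// Higher score = more useful = throttle LAST.
fn usefulness_score(workload: &Fractions, domain: &DomainStrength) -> u64 {
    // Dot product of workload mix and domain strength. u64 avoids overflow.
    workload.scalar as u64 * domain.scalar as u64
        + workload.vector as u64 * domain.vector as u64
        + workload.matrix as u64 * domain.matrix as u64
}
```

For the 80%-matrix workload in the example above, a GPU domain (high matrix strength) scores far above the CPU domain, so the CPU domain sorts first and is throttled first.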

21.6.6 Workload Profile Classification

Intel Thread Director (Section 6.1.5.6) classifies CPU workloads by instruction mix. UmkaOS generalizes this to a system-wide workload profile that covers all compute domains:

// umka-core/src/compute/classify.rs

/// System-wide workload classification for a cgroup or process.
/// Updated periodically (~1 second) from multiple sources.
///
/// Fractions are fixed-point: 0–1000 representing 0.0%–100.0%.
/// No floating-point in kernel data structures (kernel does not use FPU).
///
/// Invariant: `scalar_fraction + vector_fraction + matrix_fraction <= 1000`.
/// These three fields partition the compute demand; their sum must not exceed 1000.
/// `accel_wait_fraction` and `memory_bound_fraction` are independent blocking-time
/// metrics and are not included in the partition sum.
/// Values that would push the sum above 1000 are saturated at 1000 by the
/// classifier before storing; the dominant fraction absorbs any excess.
pub struct WorkloadProfile {
    /// Fraction of compute demand that is scalar (0–1000).
    /// Source: PELT utilization on CPU cores + ITD hints.
    pub scalar_fraction: u32,

    /// Fraction of compute demand that is vector/SIMD (0–1000).
    /// Source: hardware performance counters (SIMD instruction retired).
    pub vector_fraction: u32,

    /// Fraction of compute demand that is matrix/tensor (0–1000).
    /// Source: AMX/SME counters on CPU, utilization reports from AccelBase.
    pub matrix_fraction: u32,

    /// Fraction of time spent waiting for accelerator completion (0–1000).
    /// Source: CPU scheduler (time in interruptible sleep waiting for accel).
    /// High value = GPU-bound workload.
    pub accel_wait_fraction: u32,

    /// Fraction of time spent waiting for memory/I/O (0–1000).
    /// Source: hardware counters (LLC miss rate, stall cycles).
    pub memory_bound_fraction: u32,

    /// Dominant compute domain for this workload.
    /// Derived from the fractions above.
    pub dominant_domain: ComputeUnitType,
}

dominant_domain derivation algorithm: The dominant_domain field is computed from scalar_fraction, vector_fraction, and matrix_fraction using the following rule:

/// Compute the dominant compute domain from the workload fractions.
/// All fractions are per-mille (0–1000 = 0.0%–100.0%).
///
/// A domain is "dominant" only if its fraction is at least 400 (40.0%).
/// If no single domain clears the threshold, `ComputeUnitType::Mixed` is returned.
/// This prevents spurious dominant-domain assignments when work is evenly distributed.
fn compute_dominant_domain(profile: &WorkloadProfile) -> ComputeUnitType {
    let candidates = [
        (ComputeUnitType::CpuCore,    profile.scalar_fraction),
        (ComputeUnitType::GpuCompute, profile.vector_fraction),
        (ComputeUnitType::NpuEngine,  profile.matrix_fraction),
    ];
    let (domain, max_frac) = candidates
        .into_iter()
        .max_by_key(|&(_, frac)| frac)
        .unwrap(); // candidates is non-empty
    // Must reach 40% (400 per-mille) to be considered dominant; otherwise Mixed.
    if max_frac >= 400 {
        domain
    } else {
        ComputeUnitType::Mixed
    }
}

This function is called at the end of each profile update cycle (after EMA smoothing and saturation) to refresh dominant_domain. Callers must not set dominant_domain directly — the field is always derived, never policy-driven input.

Fractions invariant and validation: The sum scalar_fraction + vector_fraction + matrix_fraction must not exceed 1000. accel_wait_fraction and memory_bound_fraction are independent blocking-time metrics and are not included in the partition sum; each may range 0–1000 independently.

If a policy update would cause the partition sum to exceed 1000, the kernel rejects the update with EINVAL and logs a KERN_WARNING. Callers must normalize fractions before submitting. The validate() method enforces this invariant; it is called by the policy consumer vtable dispatch before applying any profile:

impl WorkloadProfile {
    /// Validate the fraction invariant.
    ///
    /// Returns `Ok(())` if `scalar_fraction + vector_fraction + matrix_fraction <= 1000`
    /// and all fields are within [0, 1000].
    /// Returns `Err(KernelError::InvalidArgument)` otherwise.
    pub fn validate(&self) -> Result<(), KernelError> {
        let partition_sum = self.scalar_fraction
            .saturating_add(self.vector_fraction)
            .saturating_add(self.matrix_fraction);
        if partition_sum > 1000 {
            return Err(KernelError::InvalidArgument);
        }
        if self.accel_wait_fraction > 1000 || self.memory_bound_fraction > 1000 {
            return Err(KernelError::InvalidArgument);
        }
        Ok(())
    }
}
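Since the kernel rejects profiles whose partition sum exceeds 1000, callers normalize first. A caller-side normalization pass might look like this — `normalize_partition` is an illustrative helper, not part of the documented interface:

```rust
/// Rescale raw counter-derived fractions so the partition invariant
/// (scalar + vector + matrix <= 1000) holds before submitting a profile.
fn normalize_partition(scalar: u32, vector: u32, matrix: u32) -> (u32, u32, u32) {
    // u64 sum guards against overflow on raw counter-derived inputs.
    let sum = scalar as u64 + vector as u64 + matrix as u64;
    if sum <= 1000 {
        return (scalar, vector, matrix);
    }
    // Proportional rescale; integer division may leave the sum slightly
    // under 1000, which still satisfies the invariant.
    (
        (scalar as u64 * 1000 / sum) as u32,
        (vector as u64 * 1000 / sum) as u32,
        (matrix as u64 * 1000 / sum) as u32,
    )
}
```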

AccelContext WorkloadProfile Classification Algorithm:

The accelerator scheduler assigns a behavioral WorkloadProfile to each AccelContext based on runtime signals. Classification runs every 100ms as a low-overhead background task, separate from the CPU-level WorkloadProfile classification (described below), which runs at sched_latency_ns.

Signals sampled per context (all available from the scheduler's per-context counters):
  - avg_cmd_duration_us: Exponential moving average of command completion time.
  - cmd_stddev_us: Standard deviation of completion times (variability).
  - queue_depth: Average number of commands in flight.
  - memory_bandwidth_gbps: DMA bandwidth used (from IOMMU counters).
  - compute_utilization_pct: Fraction of device time active (from device query).

Classification rules (applied in priority order):

if avg_cmd_duration_us < 50 AND cmd_stddev_us < 10:
    → AccelContextProfile::LowLatency     // Interactive (inference serving, video encode frames)

elif avg_cmd_duration_us > 5000 AND queue_depth >= 4:
    → AccelContextProfile::HighThroughput // Batch compute (training, bulk encode)

elif compute_utilization_pct < 30 AND memory_bandwidth_gbps > 50:
    → AccelContextProfile::MemoryBound    // Memory-bandwidth-limited (attention layers, large embed)

elif cmd_stddev_us > avg_cmd_duration_us * 0.5:
    → AccelContextProfile::Bursty         // Irregular submission patterns (game engines, mixed)

else:
    → AccelContextProfile::Balanced       // Default

Profile → scheduling policy mapping:
  - LowLatency: Preempt HighThroughput workloads; shorter time slices; wake immediately on submission.
  - HighThroughput: Longer time slices; batch mode enabled; yield to LowLatency.
  - MemoryBound: Colocate with compute-bound contexts (memory bandwidth and compute don't contend on the same hardware resources).
  - Bursty: Reserve 10% excess queue depth for burst absorption.
  - Balanced: Standard round-robin with CBS bandwidth enforcement.

Profile stability: Profile changes are subject to a minimum hold time of 500ms to prevent thrashing. A profile change is applied only if the new profile has been consistently indicated for 500ms or 5 consecutive classifications.
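The hold rule can be sketched as a small state machine (illustrative; the real tracker would live alongside the per-context counters). Since classification runs every 100ms, 5 consecutive identical results correspond to the 500ms minimum hold time:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum AccelContextProfile { LowLatency, HighThroughput, MemoryBound, Bursty, Balanced }

struct ProfileTracker {
    applied: AccelContextProfile,   // profile currently in effect
    candidate: AccelContextProfile, // pending change being debounced
    streak: u32,                    // consecutive classifications of `candidate`
}

impl ProfileTracker {
    const HOLD_COUNT: u32 = 5; // 5 × 100ms classification period = 500ms

    fn new() -> Self {
        Self {
            applied: AccelContextProfile::Balanced,
            candidate: AccelContextProfile::Balanced,
            streak: 0,
        }
    }

    /// Feed one classification result; returns the profile actually in effect.
    fn observe(&mut self, classified: AccelContextProfile) -> AccelContextProfile {
        if classified == self.applied {
            self.streak = 0; // no change pending
        } else if classified == self.candidate {
            self.streak += 1;
            if self.streak >= Self::HOLD_COUNT {
                self.applied = classified; // held long enough: apply
                self.streak = 0;
            }
        } else {
            self.candidate = classified; // new candidate: restart debounce
            self.streak = 1;
        }
        self.applied
    }
}
```

A single-sample blip (e.g. one Bursty classification in a LowLatency stream) resets the debounce rather than flipping the policy, which is the anti-thrashing property the hold time exists to provide.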


CPU-level WorkloadProfile classification: The scheduler updates WorkloadProfile every sched_latency_ns (default: 6ms on server, 4ms on desktop) using a PELT-style exponential moving average:

  1. Read hardware counters (per-CPU, sampled at context switch):
     - scalar_fraction: (retired_instructions - simd_retired - amx_retired) / total_retired × 1000
     - vector_fraction: simd_retired / total_retired × 1000
     - matrix_fraction: amx_retired / total_retired × 1000 (or AccelBase utilization report for GPU/NPU)
     - memory_bound_fraction: stall_cycles_memory / total_cycles × 1000
     - accel_wait_fraction: time_in_accel_wait / total_wall_time × 1000
  2. EMA smoothing: new = (alpha × sample) + ((1 - alpha) × old), where alpha = 1/4 (fast response).
  3. Saturation: if scalar + vector + matrix > 1000, the dominant fraction absorbs the excess.
  4. Dominant domain: call compute_dominant_domain(profile) and store the result in dominant_domain.

On architectures without per-instruction-class counters (e.g., some ARM cores), vector_fraction and matrix_fraction are estimated from the accelerator utilization report (Section 21.1) and ITD (Intel Thread Director) hints where available.
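The EMA smoothing and saturation steps above can be sketched in fixed-point arithmetic (illustrative helpers, not the kernel's actual code):

```rust
/// Fixed-point EMA step with alpha = 1/4, as used by the profile update:
/// new = sample/4 + old*3/4. No floating point.
fn ema_update(old: u32, sample: u32) -> u32 {
    (sample + 3 * old) / 4
}

/// Saturation step: if the partition sum exceeds 1000, the dominant
/// (largest) fraction absorbs the excess.
fn saturate(mut scalar: u32, mut vector: u32, mut matrix: u32) -> (u32, u32, u32) {
    let sum = scalar + vector + matrix;
    if sum > 1000 {
        let excess = sum - 1000;
        // Subtract the excess from whichever fraction is largest.
        if scalar >= vector && scalar >= matrix {
            scalar -= excess.min(scalar);
        } else if vector >= matrix {
            vector -= excess.min(vector);
        } else {
            matrix -= excess.min(matrix);
        }
    }
    (scalar, vector, matrix)
}
```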

Where this data is used:

  1. Power budgeting (Section 6.4): Which domain to throttle (Section 21.6.5 above).
  2. Intent-based management (Section 6.7): When intent.efficiency = 80 (prefer efficiency), and the workload is matrix-dominant, the optimizer suggests moving from GPU to NPU (lower power per matrix op).
  3. Userspace runtimes: Exposed via sysfs for consumption by OpenCL/SYCL/oneAPI runtimes that make device selection decisions.

/sys/fs/cgroup/<group>/compute.profile
# Read-only. Current workload classification (0-1000 = 0.0%-100.0%):
#   scalar: 150
#   vector: 50
#   matrix: 700
#   accel_wait: 600
#   memory_bound: 100
#   dominant: gpu_compute

/sys/kernel/umka/compute/topology
# Read-only. JSON: all compute units with capacity profiles.
# Consumed by userspace runtimes for device selection.

/sys/kernel/umka/compute/unit/<device_id>/capacity
# Read-only. Per-unit absolute capacity: "scalar=50000 vector=300 matrix=5 ..."

21.6.7 Unified Cgroup Compute Budget (Optional)

An optional cgroup knob that expresses total compute need abstractly, leaving device selection to the kernel:

/sys/fs/cgroup/<group>/compute.weight
# Proportional share of total system compute (across ALL devices).
# Default: 0 (disabled — use existing cpu.weight + accel.compute.weight).
# When set: kernel adjusts cpu.weight and accel.compute.weight internally
# to optimize for the cgroup's workload profile.
#
# Example: two cgroups, both compute.weight=100.
# Cgroup A is GPU-bound → kernel gives A more GPU time, less CPU.
# Cgroup B is CPU-bound → kernel gives B more CPU time, less GPU.
# Both get "equal compute" in terms of actual useful work done.

Implementation: compute.weight is an orchestration knob. The kernel's intent optimizer (Section 6.7) reads compute.weight + compute.profile and adjusts the existing per-domain knobs (cpu.weight, accel.compute.weight) every ~1 second. No new scheduling fast path. No change to CPU scheduler or AccelScheduler.

When compute.weight is 0 (default): existing separate knobs work exactly as they do on Linux. Zero overhead. Full backward compatibility.
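One way the ~1 second rebalancing step could split a unified weight is in proportion to where the cgroup's useful work happens — `split_compute_weight` and its policy are illustrative; the real policy lives in the intent optimizer (Section 6.7):

```rust
/// Sketch of one compute.weight orchestration step for a single cgroup:
/// split the unified weight between cpu.weight and accel.compute.weight.
/// `accel_wait_fraction` is per-mille (0-1000) from compute.profile.
/// A cgroup that spends most of its time waiting on the accelerator is
/// accelerator-bound: shift weight toward accel.compute.weight.
fn split_compute_weight(compute_weight: u32, accel_wait_fraction: u32) -> (u32, u32) {
    let accel_share = accel_wait_fraction.min(1000);
    let accel_weight = compute_weight * accel_share / 1000;
    let cpu_weight = compute_weight - accel_weight;
    (cpu_weight, accel_weight)
}
```

With both example cgroups at compute.weight=100: the GPU-bound cgroup (accel_wait ≈ 600) is steered toward accelerator time, the CPU-bound one (accel_wait ≈ 0) keeps its weight on the CPU knob.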

21.6.8 Unified Memory Domain Tracking

When CPU and accelerator share physical memory (no DMA copy boundary), the memory manager should understand this for page placement:

// umka-core/src/compute/memory.rs

/// Memory domain descriptor in the unified compute topology.
pub struct MemoryDomain {
    /// NUMA node ID (integrates with existing memory manager, Section 4.1).
    pub numa_node: u8,

    /// Which compute units have local access to this memory?
    /// On an APU: both CPU cores and GPU CUs list the same domain.
    /// On discrete GPU: GPU CUs list VRAM domain, CPU lists DDR domain.
    /// Populated at boot/hot-plug (cold path, heap available). The count is
    /// bounded by the number of compute units in the system.
    pub local_compute_units: Vec<DeviceNodeId>,  // heap: cold-path only, after allocator init

    /// Is this domain coherent across all local compute units?
    /// true: APU shared memory, CXL 3.0 coherent pool.
    /// false: discrete GPU VRAM (requires explicit flush/invalidate).
    pub hardware_coherent: bool,

    /// Bandwidth and latency from each compute unit type.
    /// Used by page placement decisions.
    /// Populated at boot/hot-plug (cold path, heap available).
    pub access_costs: Vec<MemoryAccessCost>,  // heap: cold-path only, after allocator init
}

pub struct MemoryAccessCost {
    pub from_unit: DeviceNodeId,
    pub latency_ns: u32,
    pub bandwidth_gbs: u32,
}

What this enables:

Discrete GPU (PCIe, separate memory):
  CPU cores → DDR NUMA node 0 (latency: 80ns, BW: 50 GB/s)
  GPU SMs   → VRAM NUMA node 2 (latency: 100ns, BW: 900 GB/s)
  CPU→VRAM: latency 500ns, BW 25 GB/s (PCIe)
  → Page migration between CPU and GPU is expensive.
  → Applications MUST explicitly manage data placement (cudaMemcpy).
  → Kernel's role: NUMA-aware allocation. Same as Linux.

APU (shared memory):
  CPU cores → DDR NUMA node 0 (latency: 80ns, BW: 50 GB/s)
  GPU CUs   → DDR NUMA node 0 (latency: 90ns, BW: 45 GB/s)  ← SAME NODE
  → No page migration needed. CPU and GPU see the same pages.
  → Kernel can optimize page placement within the shared domain
    (e.g., cache-line alignment for GPU access patterns).
  → Workload migration CPU↔GPU is a scheduling decision only, no data movement.

Grace Hopper (NVLink-C2C unified memory):
  CPU cores → LPDDR5X NUMA node 0 (latency: 80ns, BW: 500 GB/s)
  GPU SMs   → HBM3 NUMA node 1 (latency: 100ns, BW: 3000 GB/s)
  CPU↔GPU:  NVLink-C2C (latency: 200ns, BW: 900 GB/s, COHERENT)
  → Hardware-coherent. Kernel can migrate pages transparently.
  → Hot pages accessed by GPU → migrate to HBM3 (faster).
  → Cold pages → migrate to LPDDR5X (more capacity).
  → Same mechanism as NUMA balancing between CPU sockets, extended to GPU.

This is an extension of the existing NUMA-aware page placement (Section 4.1, Section 21.2), not a new mechanism. The PageLocationTracker (Section 21.2) already tracks which NUMA node pages belong to. Unified memory domains just ensure accelerator-local memory is correctly represented as a NUMA node with proper distance/bandwidth metadata.
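A minimal sketch of the resulting hot-page placement rule (types simplified for illustration; the real decision consumes the full MemoryAccessCost matrix):

```rust
/// Candidate memory domain as seen from the unit accessing a hot page.
struct DomainCost {
    numa_node: u8,
    hardware_coherent: bool,
    bandwidth_gbs: u32, // bandwidth from the unit that accesses the page most
}

/// Pick a migration target for a hot page: only hardware-coherent domains
/// are candidates for transparent migration (non-coherent discrete VRAM
/// requires explicit application-managed transfers); among those, prefer
/// the highest-bandwidth domain.
fn pick_migration_target(domains: &[DomainCost]) -> Option<u8> {
    domains
        .iter()
        .filter(|d| d.hardware_coherent)
        .max_by_key(|d| d.bandwidth_gbs)
        .map(|d| d.numa_node)
}
```

On the Grace Hopper example this selects the HBM3 node for GPU-hot pages; on the discrete-GPU example the non-coherent VRAM node is never chosen automatically.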

21.6.9 NVIDIA Compatibility: No Changes Required

The unified compute model is specifically designed to NOT require driver changes:

NVIDIA driver stack (discrete GPU, PCIe):

  Userspace (closed-source, binary compat):
    libcuda.so         — CUDA runtime        → unchanged
    libnvidia-ml.so    — management library   → unchanged
    libnvcuvid.so      — video decode         → unchanged
    All communicate via ioctl to kernel driver → unchanged

  Kernel driver (open-source nvidia.ko, ported per Section 21.1 KABI):
    Implements AccelBase vtable:
      get_utilization() → kernel reads GPU utilization, power, clock
      submit_commands() → kernel sees command flow
      set_performance_level() → kernel can request clock changes

  What the unified compute layer reads (no new driver code):
    1. GPU utilization % → from get_utilization() (already required by AccelBase)
    2. GPU power draw mW → from get_utilization() (already required)
    3. GPU clock MHz     → from get_utilization() (already required)
    4. Memory bandwidth  → from get_utilization() or static spec data

  What the unified compute layer estimates (kernel-side, no driver involvement):
    5. ComputeCapacityProfile → derived from AccelDeviceClass::GpuCompute +
       known GPU specs (SM count, tensor core presence, memory type).
       Spec database in kernel, keyed by PCI device ID. Same approach as
       Linux's GPU frequency tables.

  Optional future enhancement (minor KABI extension):
    6. get_capacity_profile() → driver provides precise profile.
       Not required. Kernel estimate works without it.

CUDA applications continue to explicitly target the GPU. The kernel does NOT intercept CUDA calls or redirect compute. The benefit is:
  - Better power budgeting (kernel knows GPU is the useful domain)
  - Better cgroup fairness (compute.weight distributes across CPU+GPU)
  - Better topology data for orchestrators (Kubernetes reads sysfs)

21.6.10 What the Kernel Does NOT Do

To be explicit about boundaries — the kernel does NOT:

  1. Automatically redirect CUDA/ROCm/oneAPI workloads between devices. Applications that explicitly target a device continue to target that device. The kernel respects explicit choices.

  2. Implement a compute compiler that translates CPU code to GPU kernels or vice versa. That's a userspace runtime concern (OpenCL, SYCL, Vulkan Compute).

  3. Require drivers to expose internal scheduling decisions. GPU drivers still schedule internally. The kernel provides cross-device orchestration data.

  4. Add overhead to the compute submission hot path. Command submission (submit_commands) goes through AccelScheduler exactly as before. The unified topology is a background advisory system consulted at ~1 second intervals.

  5. Break on systems with no accelerators. When only CPUs are present, the unified compute topology contains only CPU entries. It degrades to exactly the Section 6.1.5 CpuCapacity model. Zero overhead.

21.6.11 Sysfs Interface for Userspace Runtimes

The key practical benefit: userspace runtimes (OpenCL, SYCL, oneAPI, future CUDA alternatives) can query the kernel for topology + workload data instead of each runtime re-discovering hardware independently:

/sys/kernel/umka/compute/
    topology.json                    # Full compute topology (all units)
    unit_count                       # Number of compute units

/sys/kernel/umka/compute/unit/<id>/
    type                             # "cpu_core", "gpu_compute", "npu_engine", ...
    capacity                         # Absolute: "scalar=50000 vector=400 matrix=5 ..."
    capacity_normalized              # Normalized 0-1024: "scalar=1024 vector=390 ..."
    memory_domain                    # NUMA node ID
    memory_unified                   # "1" if shared with CPU, "0" if separate
    utilization                      # Current utilization (0-1024)
    energy_model                     # OPP table (freq, capacity, power)

/sys/fs/cgroup/<group>/
    compute.profile                  # Workload classification (read-only)
    compute.weight                   # Unified compute budget (optional, default 0)

Use case: A SYCL runtime deciding between CPU and GPU for a kernel launch:

1. Read /sys/kernel/umka/compute/unit/*/capacity
2. Read /sys/kernel/umka/compute/unit/*/memory_unified
3. Know: GPU has matrix=800, memory_unified=1 (APU).
4. Decision: matrix workload + shared memory → GPU (no copy cost).
   vs. on discrete GPU: memory_unified=0 → compare launch overhead
   + data transfer cost vs GPU throughput gain.

Today on Linux, each runtime does its own hardware discovery via vendor-specific APIs (nvml, rocm-smi, level-zero). The kernel provides no unified view. UmkaOS's sysfs topology eliminates redundant discovery and gives runtimes data the kernel already has (NUMA distances, power state, utilization).

21.6.12 Linux Compatibility

No existing Linux interfaces are affected. All new interfaces are additive:

Existing (preserved):
  /sys/devices/system/cpu/cpuN/*           — CPU topology, unchanged
  /sys/class/drm/card0/*                   — GPU sysfs, unchanged
  /dev/nvidia*, /dev/dri/*                 — device nodes, unchanged
  sched_setattr(), ioctl(GPU_SUBMIT, ...)  — syscalls, unchanged

New (additive):
  /sys/kernel/umka/compute/*               — unified compute topology
  /sys/fs/cgroup/<group>/compute.profile   — workload classification
  /sys/fs/cgroup/<group>/compute.weight    — optional unified budget

Applications unaware of new interfaces see standard Linux behavior.

21.6.13 Convergence Path: Accelerators as Peer Kernel Nodes

The unified compute topology (Section 21.6.4) treats accelerators as opaque compute units behind AccelBase vtables. This works Day 1 with existing proprietary firmware. But the architecture already contains the design for the next step.

Observation: every modern accelerator already has its own processor and runs its own kernel or firmware:

Device                 Processor            Runs today           Transport
─────────────────────  ───────────────────  ───────────────────  ─────────
NVIDIA GPU (Ada+)      RISC-V (GSP cores)   Proprietary μkernel  PCIe/NVLink
NVIDIA BlueField DPU   ARM A78              Full Linux kernel    PCIe
Intel Gaudi NPU        Custom cores         Firmware             PCIe
AMD Instinct           Embedded μctrl       Firmware             PCIe/xGMI
CXL memory expander    Management proc      Firmware             CXL
Crypto coprocessor     Dedicated core       Firmware/RTOS        PCIe/SPI
Future RISC-V accel    RV64 harts           Firmware             PCIe/CXL/custom

The distributed kernel (Section 5.1) already solves "multiple UmkaOS instances sharing memory and capabilities across a transport." SmartNIC/DPU offload (Section 5.2) already says "a DPU is a close remote node connected via PCIe."

The convergence: any device with its own processor is a potential peer kernel node. If a vendor replaces proprietary firmware with a UmkaOS-lite instance, that device becomes a full participant in the distributed kernel fabric — its memory becomes DSM-managed, its compute is visible to the cluster scheduler, capabilities flow across the interconnect.

21.6.13.1 The Three-Stage Adoption Path

Naming note: The "stages" below describe the accelerator integration maturity model within Section 21.6. They are NOT the project-wide implementation phases (Phase 1-5) defined in Section 23.2 (23-roadmap.md). The mapping is: Stage A (Opaque) ships with Phase 3-4 (Real Workloads / Production Ready). Stage B (Advisory) ships with Phase 4-5 (Production Ready / Ecosystem). Stage C (Peer) targets Phase 5+ (Ecosystem and beyond, vendor-driven).

Stage A — Opaque (Day 1, drop-in Linux replacement):
  ┌───────────────┐  AccelBase vtable   ┌───────────────┐
  │  UmkaOS       │ ──── (ioctl) ─────► │  Proprietary  │
  │  (host CPU)   │                     │  firmware     │
  └───────────────┘                     └───────────────┘
  Kernel submits commands, reads telemetry. Device is a black box.
  Works with existing NVIDIA, AMD, Intel stacks. No vendor changes.

Stage B — Advisory topology (this section):
  Same as Stage A, plus:
  - Kernel builds multi-dimensional capacity profiles.
  - Workload classification drives power budgeting and intent optimization.
  - Sysfs exposes topology data for userspace runtimes.
  Still opaque device firmware. Benefits from kernel-side intelligence.

Stage C — Peer kernel node (vendor adoption):
  ┌───────────────┐  Section 5.1        ┌───────────────┐
  │  UmkaOS       │ ── distributed ───► │  UmkaOS       │
  │  (host CPU)   │  kernel protocol    │  (device)     │
  └───────────────┘  (PCIe/NVLink/CXL)  └───────────────┘
  Device runs UmkaOS-lite. Becomes a node in the distributed fabric:
  - Device memory → DSM-managed (Section 5.1.6). Transparent page sharing.
  - Device compute → visible to cluster scheduler (Section 5.1.9).
  - Capabilities → network-portable across host↔device (Section 5.1.10).
  - Crash recovery → kernel restart on device, state preserved (Section 10.8).

  The application API is UNCHANGED between Stage A and Stage C.
  A CUDA app, an OpenCL app, a custom accelerator app — all work at every stage.
  Stage C just makes the device more deeply integrated and manageable.

21.6.13.2 Transport Unification

The architecture currently has two separate transport abstractions:

KernelTransport (Section 5.1.4):     RDMA-only, for inter-node distributed kernel.
OffloadTransport (Section 5.2):      PCIe/SharedMemory/RDMA, for DPU offload.

These should converge into a single NodeTransport that covers all interconnects:

// umka-core/src/transport/mod.rs

/// Unified transport between kernel nodes.
/// Covers all interconnect types: network (RDMA), local bus (PCIe),
/// chip-to-chip (NVLink, xGMI, CXL), and future interconnects.
pub enum NodeTransport {
    /// RDMA (InfiniBand, RoCE). Inter-node across network.
    /// Existing KernelTransport functionality.
    Rdma {
        device: DeviceNodeId,
        connection: RdmaConnection,
    },

    /// PCIe BAR-mapped shared memory. Host↔device on same machine.
    /// For DPUs, discrete GPUs, add-in accelerators.
    PcieBar {
        device: DeviceNodeId,
        bar_base: PhysAddr,
        bar_size: u64,
        mailbox: Option<PcieMailbox>,
    },

    /// NVLink / NVLink-C2C. Chip-to-chip, hardware-coherent.
    /// For GPU↔GPU or CPU↔GPU (Grace Hopper).
    NvLink {
        device: DeviceNodeId,
        link_id: u32,
        coherent: bool,
        bandwidth_gbs: u32,
    },

    /// CXL 3.0. Hardware-coherent shared memory.
    /// For CXL-attached accelerators, memory expanders, composable infrastructure.
    Cxl {
        device: DeviceNodeId,
        cxl_port: u32,
        coherent: bool,
        bandwidth_gbs: u32,
    },

    /// TCP/IP fallback. For non-RDMA networks.
    /// Existing fallback in KernelTransport.
    TcpFallback {
        addr: IpAddr,
        port: u16,
    },
}

Fence semantics: Each transport variant must implement a common ordering model so the distributed kernel protocol can reason about consistency without knowing the underlying interconnect:

/// Transport-agnostic memory operations.
/// Every NodeTransport variant implements this trait.
pub trait TransportOps {
    /// One-sided remote read. Returns data without interrupting remote CPU.
    /// Semantics: read is atomic for naturally-aligned loads up to 8 bytes.
    fn read(&self, remote_addr: u64, buf: &mut [u8]) -> Result<(), TransportError>;

    /// One-sided remote write. Writes data without interrupting remote CPU.
    /// Semantics: write is atomic for naturally-aligned stores up to 8 bytes.
    fn write(&self, remote_addr: u64, data: &[u8]) -> Result<(), TransportError>;

    /// Fence: all preceding operations via this transport are visible to the
    /// remote side before any subsequent operation.
    /// RDMA:  For ordering after RDMA Writes, RC QP in-order delivery
    ///        guarantees are sufficient — no explicit fence is needed.
    ///        For ordering after RDMA Reads or Atomics, post a zero-length
    ///        Send with IBV_SEND_FENCE and poll the CQ for completion.
    ///        The fence() implementation issues the appropriate mechanism
    ///        based on the preceding operation type.
    /// PCIe:  SFENCE + read-back from BAR (flush posted writes).
    /// NVLink/CXL (coherent): hardware-coherent, fence is a no-op.
    /// NVLink (non-coherent): GPU membar.sys instruction.
    /// TCP:   implicit (TCP is ordered).
    fn fence(&self) -> Result<(), TransportError>;

    /// Send a message (interrupts remote CPU). For control plane.
    fn send_message(&self, msg: &[u8]) -> Result<(), TransportError>;

    /// Is this transport hardware-coherent? If true, fence() is a no-op
    /// and the DSM directory can skip invalidation for this node pair.
    fn is_coherent(&self) -> bool;
}
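To make the coherent no-op fence concrete, here is a minimal host-side mock of TransportOps (restated locally so the snippet is self-contained, and using std, which the in-kernel version would not): a CXL-style coherent window whose "remote" memory is a locally shared buffer. The `CoherentWindow` name and buffer backing are illustrative.

```rust
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
pub enum TransportError { ConnectionLost, Timeout, InvalidAddress, PermissionDenied, DeviceError }

pub trait TransportOps {
    fn read(&self, remote_addr: u64, buf: &mut [u8]) -> Result<(), TransportError>;
    fn write(&self, remote_addr: u64, data: &[u8]) -> Result<(), TransportError>;
    fn fence(&self) -> Result<(), TransportError>;
    fn send_message(&self, msg: &[u8]) -> Result<(), TransportError>;
    fn is_coherent(&self) -> bool;
}

/// Mock of a hardware-coherent (CXL / NVLink-C2C style) memory window.
/// Loads and stores hit a shared buffer, so they are coherent by construction.
pub struct CoherentWindow { mem: Mutex<Vec<u8>> }

impl CoherentWindow {
    pub fn new(size: usize) -> Self { Self { mem: Mutex::new(vec![0; size]) } }
}

impl TransportOps for CoherentWindow {
    fn read(&self, remote_addr: u64, buf: &mut [u8]) -> Result<(), TransportError> {
        let mem = self.mem.lock().unwrap();
        let start = remote_addr as usize;
        let end = start.checked_add(buf.len()).filter(|&e| e <= mem.len())
            .ok_or(TransportError::InvalidAddress)?;
        buf.copy_from_slice(&mem[start..end]);
        Ok(())
    }
    fn write(&self, remote_addr: u64, data: &[u8]) -> Result<(), TransportError> {
        let mut mem = self.mem.lock().unwrap();
        let start = remote_addr as usize;
        let end = start.checked_add(data.len()).filter(|&e| e <= mem.len())
            .ok_or(TransportError::InvalidAddress)?;
        mem[start..end].copy_from_slice(data);
        Ok(())
    }
    /// Hardware keeps the fabric coherent, so fence() has nothing to do.
    fn fence(&self) -> Result<(), TransportError> { Ok(()) }
    /// Control-plane message delivery is out of scope for this mock.
    fn send_message(&self, _msg: &[u8]) -> Result<(), TransportError> { Ok(()) }
    fn is_coherent(&self) -> bool { true }
}
```

An RDMA or PcieBar implementation would differ only in `fence()` (posting a fenced zero-length Send, or SFENCE + BAR read-back) and in `is_coherent()` returning false; callers are unchanged.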

TransportError Enum:

/// Errors from transport operations (used by NodeTransport trait).
#[repr(u32)]
pub enum TransportError {
    /// Connection to remote node lost (link down, node crashed).
    ConnectionLost  = 0,
    /// Operation timed out (no response within deadline).
    Timeout         = 1,
    /// Invalid remote address (unmapped, out of range).
    InvalidAddress  = 2,
    /// Permission denied (rkey mismatch, capability revoked).
    PermissionDenied = 3,
    /// Device error (NIC failure, PCIe error, CXL protocol error).
    DeviceError     = 4,
}

NodeTransport Hot-Unplug Handling:

When a device backing a NodeTransport is removed (PCIe hot-unplug, NVLink failure, CXL device removal):

  1. The device registry emits DeviceEvent::Removed for the transport device.
  2. All in-flight operations on that transport return TransportError::ConnectionLost.
  3. The distributed kernel protocol (if active) downgrades the affected node:
     - DSM pages owned by that node become read-only on all other nodes.
     - Capabilities issued by that node are marked suspect (cannot be renewed).
  4. If the device was a GPU running UmkaOS-lite (Stage C peer), its compute units are removed from the unified topology and its memory domains are marked unavailable.
  5. Processes with active contexts on the removed device receive SIGBUS.

Key property: The distributed kernel protocol (DSM coherence, capability exchange, heartbeat, cluster join) is transport-agnostic. It sends messages and reads/writes remote memory. Whether that goes over RDMA, PCIe BAR, NVLink, or CXL is an implementation detail. The protocol layer doesn't change.

This means a GPU running UmkaOS-lite joins the distributed kernel using the same protocol as a remote server — just over NVLink instead of RDMA. The cluster scheduler, DSM directory, and capability system see it as another node.

21.6.13.3 Any Accelerator, Any Interconnect

This model is deliberately generic. It applies to any device with a processor, regardless of function:

Device type          Compute capacity profile (Section 21.6.3 fields)  When it becomes a peer node
───────────────────  ────────────────────────────────────────────────  ──────────────────────────────
GPU                  vector=2000, matrix=300, scalar=200               Vendor ships UmkaOS firmware
NPU                  matrix=40, vector=0, scalar=0                     Vendor ships UmkaOS firmware
DPU/SmartNIC         scalar=5000, vector=100, memory_bw=200            Already runs Linux → easy port
Crypto coprocessor   scalar=1000, vector=0, matrix=0                   Vendor ships UmkaOS firmware
FPGA                 variable (depends on bitstream)                   FPGA shell runs UmkaOS
DSP                  vector=500, scalar=2000, matrix=0                 Vendor ships UmkaOS firmware
CSD (comp. storage)  scalar=3000, memory_bw=500                        NVMe controller runs UmkaOS
Future RISC-V accel  scalar+vector (implementation defined)            Naturally runs UmkaOS (RISC-V)

All device types express their capacity using the five ComputeCapacityProfile fields defined in Section 21.6.3 (scalar, vector, matrix, memory_bw, launch_overhead_us). Specialized workload categories (inference, network offload, crypto, signal processing) map to combinations of these base dimensions. For example, NPU inference throughput is captured by matrix (the dominant operation), and DPU network offload is captured by scalar + memory_bw.
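The "specialized categories map to base dimensions" claim can be sketched with the five-field profile. This is illustrative: the numeric values come from the table above, the `launch_overhead_us` figures are assumed, and `dominant()` is a hypothetical helper, not a kernel API.

```rust
/// The five base capacity dimensions (Section 21.6.3).
#[derive(Debug, Clone, Copy)]
pub struct ComputeCapacityProfile {
    pub scalar: u32,
    pub vector: u32,
    pub matrix: u32,
    pub memory_bw: u32,
    pub launch_overhead_us: u32,
}

impl ComputeCapacityProfile {
    /// Which base dimension dominates this unit (ignoring launch overhead).
    /// NPU "inference throughput" shows up as matrix; DPU "network offload"
    /// shows up as scalar + memory_bw.
    pub fn dominant(&self) -> &'static str {
        let dims = [("scalar", self.scalar), ("vector", self.vector),
                    ("matrix", self.matrix), ("memory_bw", self.memory_bw)];
        dims.iter().max_by_key(|(_, v)| *v).map(|(n, _)| *n).unwrap()
    }
}

// Illustrative profiles matching the table; launch_overhead_us is assumed.
pub fn npu() -> ComputeCapacityProfile {
    ComputeCapacityProfile { scalar: 0, vector: 0, matrix: 40, memory_bw: 10, launch_overhead_us: 100 }
}
pub fn dpu() -> ComputeCapacityProfile {
    ComputeCapacityProfile { scalar: 5000, vector: 100, matrix: 0, memory_bw: 200, launch_overhead_us: 20 }
}
```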

RISC-V accelerators are the most natural fit: UmkaOS has RISC-V as a first-class target architecture (Section 2.2), so a RISC-V-based accelerator can run the same kernel binary (with a different device tree and minimal board support).

The architecture doesn't need to predict which device types will exist. It only needs to provide:

  1. A generic compute unit model (Section 21.6.3, Section 21.6.4) — works for any device.
  2. A transport-agnostic distributed kernel protocol (Section 5.1) — works over any interconnect.
  3. An adoption path that doesn't require vendors to change anything until they choose to.

Mixed-Coherence Cluster Optimization:

In a system with both coherent (CXL, NVLink-C2C) and non-coherent (RDMA, PCIe BAR) transports, the DSM protocol can skip invalidation messages for node pairs connected by coherent transports. The is_coherent() method on TransportOps enables per-pair optimization:

  • Coherent pair (e.g., CPU ↔ CXL GPU): fence() is a no-op. No invalidation messages needed — hardware maintains coherence automatically.
  • Non-coherent pair (e.g., CPU ↔ remote RDMA node): standard DSM invalidation protocol applies (explicit messages + RDMA fencing).
  • Mixed cluster: The DSM directory tracks coherence per node-pair. Invalidation fanout skips coherent pairs, reducing message traffic in heterogeneous clusters.
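The per-pair skip is a filter over a page's sharer set. A minimal sketch, where `NodeId` and the coherence predicate are illustrative (in the real system the predicate would be backed by TransportOps::is_coherent for the relevant node pair):

```rust
/// Node identifier in the distributed kernel fabric (illustrative).
pub type NodeId = u32;

/// Given a DSM page's sharer set, return only the nodes that need an
/// explicit invalidation message. Nodes reached over a hardware-coherent
/// transport (CXL, NVLink-C2C) are skipped: the fabric invalidates for us.
pub fn invalidation_targets(
    sharers: &[NodeId],
    coherent_with: impl Fn(NodeId) -> bool,
) -> Vec<NodeId> {
    sharers.iter().copied().filter(|&n| !coherent_with(n)).collect()
}
```

In a cluster where most sharers sit behind coherent links, this reduces the invalidation fanout from O(sharers) messages to only the non-coherent remainder.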

21.6.14 Performance Impact

Systems without accelerators:
  Unified topology contains only CPU entries.
  Overhead: one additional struct per CPU core (~64 bytes).
  Runtime overhead: zero. Advisory system has nothing extra to advise on.

Systems with accelerators (steady state):
  Topology update: ~1μs per accelerator per second (read get_utilization).
  Workload classification: ~2μs per cgroup per second (read perf counters).
  Cross-device energy optimization: ~1μs per cgroup per power budget tick.
  Total: ~4μs per second per cgroup. Fraction of a percent.

  Same data is already being read by AccelScheduler (Section 21.1.2.4) and
  PowerBudgetEnforcer (Section 6.4). Unified topology reuses those readings.
  Marginal overhead: near zero.

Compute submission hot path: UNCHANGED.
  submit_commands() → AccelScheduler → driver. No new code in this path.

Benefit: better power budgeting decisions (Section 21.6.5) save more power than
  the microseconds spent on workload classification.

§22.1 AI/ML Policy Framework has been moved to 22-ml-policy.md for clarity. The section number (§22.1) and all anchor links are unchanged.