
Chapter 3: Concurrency Model

Locking strategy, lock-free structures, PerCpu, RCU, atomic operations, memory ordering, interrupt handling


Concurrency in UmkaOS is built on Rust ownership and lock-free primitives. There is no Big Kernel Lock. Hot paths use per-CPU state (CpuLocal via a dedicated register), RCU for read-mostly data, and lock-free rings. Spinlocks protect short critical sections; mutexes are reserved for sleepable paths. Memory ordering is explicit and validated per architecture.

3.1 Rust Ownership for Lock-Free Paths

Rust's ownership system provides compile-time guarantees that replace many of Linux's runtime-only checking tools (lockdep, KASAN, KCSAN).

/// Guard returned by `preempt_disable()`. Held reference prevents the
/// current task from being preempted and migrated to another CPU.
/// Dropped by `preempt_enable()` (implicit on scope exit).
/// Cannot be sent across threads (`!Send`).
pub struct PreemptGuard {
    _not_send: PhantomData<*const ()>,
}

/// Disable preemption on the current CPU. Returns a guard; preemption
/// is re-enabled when the guard is dropped. Nesting is tracked via
/// `CpuLocal.preempt_count` — the counter increments on acquire and
/// decrements on drop. Interrupts are **not** disabled by this call;
/// use `local_irq_save()` for interrupt masking.
///
/// Safe to call from any context: process, softirq, hardirq, NMI.
/// `preempt_count` is a nesting depth counter — nested calls from
/// interrupt context are the correct and expected behavior. (UmkaOS uses
/// separate `irq_count` / `softirq_count` / `preempt_count` fields in
/// CpuLocalBlock, not a packed bitfield with a "high bit" discriminator.)
pub fn preempt_disable() -> PreemptGuard {
    // SAFETY: single-instruction increment of preempt_count via CpuLocal.
    unsafe { arch::current::cpu::preempt_count_inc() }
    PreemptGuard { _not_send: PhantomData }
}

impl Drop for PreemptGuard {
    fn drop(&mut self) {
        // SAFETY: matching decrement; guard lifetime ensures no underflow.
        unsafe { arch::current::cpu::preempt_count_dec() }
    }
}

impl PreemptGuard {
    /// CPU the guard pins the caller to (used by `PerCpu::get`/`get_mut`).
    /// Stable for the guard's lifetime because preemption, and therefore
    /// migration, is disabled while it is held.
    pub fn cpu_id(&self) -> usize {
        // SAFETY: single read of the current CPU id via CpuLocal; the value
        // cannot change while preemption is disabled. (Accessor assumed to
        // live alongside preempt_count_inc/dec in arch::current::cpu.)
        unsafe { arch::current::cpu::id() }
    }
}

/// Per-CPU data: compile-time prevention of cross-CPU access.
/// Access requires a proof token that can only be obtained with
/// preemption disabled on the current CPU.
///
/// The backing storage is dynamically sized at boot based on the actual
/// CPU count discovered from firmware tables (ACPI MADT / device tree).
/// There is no compile-time CPU count limit — MAX_CPUS (4096) is a static
/// array capacity for link-time allocation, not a runtime constraint.
/// Allocation uses the boot allocator (Section 4.1) during early init,
/// before the general-purpose allocator is available.
pub struct PerCpu<T> {
    /// Pointer to dynamically allocated array of per-CPU slots.
    /// Length = num_possible_cpus(), discovered at boot.
    data: *mut UnsafeCell<T>,
    /// Number of CPU slots (set once at boot, never changes).
    count: usize,
    /// Per-slot borrow state tracking for runtime aliasing detection.
    /// Length = count, parallel to data array. Each slot tracks:
    ///   - 0 = free (no active borrows)
    ///   - 1..=MAX-1 = that many active read borrows
    ///   - u32::MAX = mutably borrowed
    /// This array enables detection of aliased borrows across multiple
    /// PreemptGuard instances, which the type system cannot prevent alone.
    borrow_state: *mut AtomicU32,
}

impl<T> PerCpu<T> {
    /// Construct a new `PerCpu<T>`, allocating one slot per CPU.
    ///
    /// **Bootstrap ordering**:
    /// All allocations go through `alloc::alloc::alloc()`, which routes to
    /// the currently-registered `#[global_allocator]`. During boot phases 0-7,
    /// the global allocator IS the boot allocator ([Section 4.1](04-memory.md#boot-allocator)).
    /// After phase 8, the slab allocator replaces it. The code uses a single
    /// allocation path — the global allocator abstraction handles the routing
    /// transparently, so no boot-vs-slab branching is needed here.
    ///
    /// The `borrow_state` array (debug-only tracking) is always allocated
    /// alongside the data array from the same allocator, initialized to zero
    /// (`AtomicU32::new(0)` per slot — all slots start as "free").
    ///
    /// # Safety
    ///
    /// Must be called exactly once per `PerCpu<T>` instance, during boot on the
    /// BSP with no concurrent access. The `count` parameter must equal
    /// `num_possible_cpus()` discovered from firmware tables.
    pub unsafe fn new(count: usize, init: impl Fn() -> T) -> Self {
        let layout_data = core::alloc::Layout::array::<UnsafeCell<T>>(count).unwrap();
        let data = alloc::alloc::alloc(layout_data) as *mut UnsafeCell<T>;
        assert!(!data.is_null(), "PerCpu: data allocation failed");
        for i in 0..count {
            core::ptr::write(data.add(i), UnsafeCell::new(init()));
        }

        let layout_borrow = core::alloc::Layout::array::<AtomicU32>(count).unwrap();
        let borrow_state = alloc::alloc::alloc_zeroed(layout_borrow) as *mut AtomicU32;
        assert!(!borrow_state.is_null(), "PerCpu: borrow_state allocation failed");
        // alloc_zeroed guarantees all bytes are zero, which is the correct
        // representation for AtomicU32::new(0) on all architectures.

        Self { data, count, borrow_state }
    }

    /// Returns a reference to the borrow_state atomic for the given CPU slot.
    /// # Panics
    /// Panics if cpu >= count (bounds check).
    fn borrow_state(&self, cpu: usize) -> &'static AtomicU32 {
        assert!(cpu < self.count, "PerCpu: cpu {} out of range (count {})", cpu, self.count);
        // SAFETY: borrow_state was allocated with `count` elements at boot
        // (in new()), and cpu < count is verified above. The pointer remains
        // valid for the kernel's lifetime (static allocation). The returned
        // reference is 'static because the borrow_state array outlives all
        // PerCpu usages.
        unsafe { &*self.borrow_state.add(cpu) }
    }

    /// # Safety contract
    ///
    /// `PerCpu<T>` is a global singleton per data type, accessed via a static
    /// reference. Soundness depends on preventing aliased mutable references
    /// to the same CPU's slot. Three distinct hazards must be addressed:
    ///
    /// 1. **Cross-thread aliasing**: Two threads on different CPUs access
    ///    *different* slots, so no aliasing occurs. A thread cannot migrate
    ///    mid-access because preemption is disabled by the guard.
    ///
    /// 2. **Interrupt aliasing**: Interrupt handlers (including NMI) may fire
    ///    while preemption is disabled. If an interrupt handler accesses the
    ///    same per-CPU variable mutably, this creates aliased `&mut T`
    ///    references — undefined behavior in Rust. Therefore:
    ///    - **Read-only access via `get()`** is safe with only preemption
    ///      disabled, because multiple `&T` references are permitted.
    ///      `get()` returns a `PerCpuRefGuard` that only disables preemption.
    ///    - **Mutable access via `get_mut()`** requires BOTH preemption
    ///      disabled AND local interrupts disabled. `get_mut()` returns a
    ///      `PerCpuMutGuard` that disables local interrupts (via
    ///      `local_irq_save`/`local_irq_restore`, defined in `arch::current::interrupts`)
    ///      for the duration of the borrow, in addition to disabling
    ///      preemption. This prevents any maskable interrupt handler from
    ///      observing or mutating the slot while the caller holds `&mut T`.
    ///
    /// 3. **NMI constraint**: NMIs (non-maskable interrupts) cannot be
    ///    disabled by `local_irq_save`. An NMI that fires during a
    ///    `get_mut()` borrow will see `borrow_state == u32::MAX`. If the
    ///    NMI handler calls `get()` on the **same** PerCpu variable, the
    ///    borrow_state check detects the conflict and panics. This is the
    ///    intended safety mechanism — panicking is preferable to silent
    ///    data corruption.
    ///
    ///    **NMI handler rules**:
    ///    - NMI handlers MUST NOT call `get_mut()` on any PerCpu variable.
    ///    - NMI handlers SHOULD avoid calling `get()` on PerCpu variables
    ///      that are commonly accessed via `get_mut()` elsewhere, because
    ///      the conflict causes a panic (which in NMI context halts the
    ///      CPU).
    ///    - NMI handlers that need per-CPU state MUST use dedicated
    ///      NMI-specific PerCpu variables that are only read (never
    ///      mutated) from non-NMI contexts, or use raw atomic fields
    ///      outside the PerCpu abstraction (e.g., the pre-allocated
    ///      per-CPU crash buffer used by the panic NMI handler).
    ///    - UmkaOS's NMI handlers (panic coordinator, watchdog, perf sampling)
    ///      follow this rule: they write only to dedicated pre-allocated
    ///      per-CPU buffers and never access the general PerCpu<T> API.
    ///
    /// The `PreemptGuard` is `!Send`, `!Clone`, and pin-bound to the issuing
    /// CPU, proving that the caller is on the correct CPU. The borrow checker
    /// prevents calling `get()` and `get_mut()` on the same guard
    /// simultaneously, preventing aliased `&T` + `&mut T` within the same
    /// non-interrupt context.
    ///
    /// **Aliasing safety**: The `&mut PreemptGuard` borrow prevents aliasing
    /// through the *same* guard, but nothing in the type system prevents a
    /// caller from creating two separate `PreemptGuard`s and using them to
    /// obtain aliased references (e.g., `&T` from one guard and `&mut T` from
    /// another, or two `&mut T`). To close this hole, `PerCpu<T>` maintains a
    /// per-slot `borrow_state: AtomicU32` with the following encoding:
    ///   - `0`         = slot is free (no active borrows)
    ///   - `1..=MAX-1` = slot has that many active read borrows (`get()` in use)
    ///   - `u32::MAX`  = slot is mutably borrowed (`get_mut()` in use)
    ///
    /// `get()` atomically increments the reader count (fails if currently
    /// `u32::MAX`, i.e., a writer is active). `get_mut()` atomically
    /// transitions from `0` to `u32::MAX` (fails if any readers OR another
    /// writer are active). Both `PerCpuRefGuard` and `PerCpuMutGuard` restore
    /// the borrow_state on drop. In debug builds, violations panic
    /// unconditionally, catching aliasing bugs during development and testing.
    /// In release builds, the borrow-state CAS is elided for performance
    /// (Section 3.1.3) — the structural invariants (preemption disabled, IRQs
    /// disabled for `get_mut()`) are sufficient to prevent aliased access.
    /// For the hottest per-CPU fields, `CpuLocal` (Section 3.1.2) bypasses
    /// `PerCpu<T>` entirely, using architecture-specific registers for
    /// ~1-10 cycle access.
    ///
    /// **Comparison with Linux**: Linux's `this_cpu_read/write` on x86-64 compiles
    /// to a single `gs:`-prefixed instruction (no atomic, no preempt_disable needed
    /// for the single-instruction case). UmkaOS addresses this gap with a two-tier
    /// per-CPU model (Section 3.1.2-c):
    ///
    /// - **CpuLocal** (Section 3.1.2): Register-based access (~1-10 cycles, arch-dependent)
    ///   for the hottest ~10 fields (current_task, runqueue, slab magazines, etc.).
    ///   Matches Linux's per-CPU register pattern on all eight architectures.
    /// - **PerCpu\<T\>** (this struct): Generic abstraction for all other per-CPU data.
    ///   Borrow-state CAS is debug-only in release builds (Section 3.1.3), reducing cost
    ///   from ~20-30 cycles to ~3-8 cycles. The CAS catches aliasing bugs during
    ///   development; release builds trust the structural invariants (preemption
    ///   disabled + IRQs disabled = exclusive access).
    ///
    /// See `get()` and `get_mut()` documentation below.

    /// Read-only access to the current CPU's slot. Takes `&PreemptGuard`,
    /// which only requires preemption to be disabled. Multiple `&T` references
    /// are sound because `&T` is `Sync`-like — interrupt handlers may also
    /// hold `&T` to the same slot without causing UB.
    /// Returns a `PerCpuRefGuard<T>` that derefs to `&T`.
    ///
    /// The lifetime `'g` ties the returned guard to the `PreemptGuard`, not
    /// to `&self`. This prevents dropping the `PreemptGuard` (re-enabling
    /// preemption and allowing migration) while still holding a reference
    /// to this CPU's slot — which would be use-after-migrate unsoundness.
    ///
    /// The `T: Sync` bound is required because interrupt handlers may
    /// concurrently hold `&T` to the same slot. Without `Sync`, types like
    /// `Cell<u32>` would allow data races between the caller and interrupt
    /// handlers sharing the same per-CPU slot.
    ///
    /// **Borrow-state protocol**: `get()` atomically increments the per-slot
    /// `borrow_state` counter, provided the current value is not `u32::MAX`
    /// (which indicates an active mutable borrow). If the slot is currently
    /// mutably borrowed, `get()` panics. This prevents the `&T` + `&mut T`
    /// aliasing UB that would arise if a caller obtained an immutable borrow
    /// via one `PreemptGuard` while another context held a mutable borrow via
    /// a second `PreemptGuard`. The `PerCpuRefGuard` decrements the counter
    /// on drop, returning the slot to a lower reader count or back to 0.
    pub fn get<'g>(&self, guard: &'g PreemptGuard) -> PerCpuRefGuard<'g, T>
    where
        T: Sync,
    {
        let cpu = guard.cpu_id();
        assert!(cpu < self.count, "PerCpu: cpu_id {} out of range (count {})", cpu, self.count);
        // Atomically increment the reader count. Fail if a writer holds the
        // slot (borrow_state == u32::MAX). Use a compare-exchange loop so we
        // can distinguish "writer active" from a successful increment.
        // Active in debug builds only; elided in release (Section 3.1.3).
        //
        // Memory ordering: AcqRel on success ensures the borrow_state write
        // (Release) is visible to NMI handlers or concurrent debug checks
        // before we access the slot data (Acquire). Failure uses Acquire to
        // see the writer's Release store.
        #[cfg(debug_assertions)]
        {
            let state = self.borrow_state(cpu);
            let prev = state.fetch_update(Ordering::AcqRel, Ordering::Acquire, |v| {
                if v == u32::MAX { None } else { Some(v + 1) }
            });
            if prev.is_err() {
                panic!("PerCpu: slot {} already mutably borrowed — cannot take shared borrow", cpu);
            }
        }
        // SAFETY: PreemptGuard guarantees we are on cpu_id() and cannot
        // migrate. In debug builds, the borrow_state increment above ensures
        // no mutable borrow is active. In release builds, the structural
        // invariant (preemption disabled = CPU pinned) is sufficient for
        // read-only access. T: Sync guarantees shared references are sound
        // across concurrent contexts (caller + interrupt handlers).
        unsafe {
            PerCpuRefGuard {
                value: &*self.data.add(cpu).as_ref().unwrap().get(),
                #[cfg(debug_assertions)]
                borrow_state: self.borrow_state(cpu),
                _guard: PhantomData,
            }
        }
    }

    /// Mutable access to the current CPU's slot. Takes `&mut PreemptGuard`
    /// to prevent aliasing with any concurrent `get()` or `get_mut()` on
    /// the same guard in the caller's context. Additionally, the returned
    /// `PerCpuMutGuard` disables local interrupts (via `local_irq_save`)
    /// to prevent interrupt handlers from accessing this slot while `&mut T`
    /// is live. Interrupts are restored when the guard is dropped.
    ///
    /// The lifetime `'g` ties the returned guard to the `PreemptGuard`,
    /// preventing use-after-migrate (same rationale as `get()`).
    ///
    /// This two-layer protection is necessary because preemption-disable
    /// alone does NOT prevent interrupt handlers from firing and potentially
    /// accessing the same per-CPU variable.
    ///
    /// **Aliasing safety**: The `&mut PreemptGuard` borrow prevents aliasing
    /// through the *same* guard, but callers could theoretically create a
    /// second `PreemptGuard` and call `get()` or `get_mut()` again on the
    /// same slot, producing `&T` + `&mut T` or two `&mut T` — both are UB.
    /// To close this hole, `get_mut()` atomically transitions the per-slot
    /// `borrow_state` from exactly `0` (free) to `u32::MAX` (writer) using
    /// a compare-exchange. If the slot has any active readers (count > 0) or
    /// another writer (count == u32::MAX), the CAS fails and `get_mut()`
    /// panics.
    ///
    /// **Debug vs release**: In debug builds (`cfg(debug_assertions)`), the
    /// CAS is always active — catching aliasing bugs during development. In
    /// release builds, the CAS is elided (Section 3.1.3): the structural invariants
    /// (preemption disabled + IRQs disabled) are sufficient to guarantee
    /// exclusive access. This reduces `get_mut()` cost from ~20-30 cycles to
    /// ~3-8 cycles. The `borrow_state` array is still allocated for binary
    /// compatibility with debug-built modules.
    ///
    /// `PerCpu<T>` is designed to be used with a single `PreemptGuard` per
    /// critical section — creating multiple guards and using them to access
    /// the same `PerCpu<T>` is a logic error, detected at runtime in debug
    /// builds.
    pub fn get_mut<'g>(&self, guard: &'g mut PreemptGuard) -> PerCpuMutGuard<'g, T> {
        let cpu = guard.cpu_id();
        assert!(cpu < self.count, "PerCpu: cpu_id {} out of range (count {})", cpu, self.count);
        // SAFETY: local_irq_save() MUST be called BEFORE updating borrow_state.
        // If we update borrow_state first and an interrupt fires before IRQs
        // are disabled, an interrupt handler calling get() on the same PerCpu
        // variable would see borrow_state == u32::MAX and panic. Disabling IRQs
        // first ensures no interrupt handler can observe or race with the
        // borrow_state transition.
        let saved_flags = local_irq_save();
        // Runtime borrow-state check: active in debug builds
        // (`cfg(debug_assertions)`); elided in release builds for performance
        // (see Section 3.1.3). Atomically transitions borrow_state from 0
        // (free) to u32::MAX (writer). Fails if any reader (count > 0) or
        // another writer (u32::MAX) is active. Fires both for &mut T + &mut T
        // (two writers) and for &T + &mut T (reader + writer) aliasing.
        // This check is safe from interrupt-handler races because IRQs are
        // already disabled above.
        #[cfg(debug_assertions)]
        {
            let state = self.borrow_state(cpu);
            if state.compare_exchange(0, u32::MAX, Ordering::AcqRel, Ordering::Acquire).is_err() {
                local_irq_restore(saved_flags);
                panic!("PerCpu: slot {} already borrowed (reader or writer active)", cpu);
                // Note: The kernel is compiled with `panic = "abort"` (no unwinding),
                // so borrow_state cannot leak: a panic immediately halts the core.
                // For Tier 1 drivers sharing the address space, driver panics are caught
                // by the driver fault handler (Section 11.7) which resets the driver's state
                // including any held per-CPU borrows and any Core kernel locks acquired
                // through KABI calls (see Section 11.7 ISOLATE step: KABI lock registry).
            }
        }
        // SAFETY: PreemptGuard prevents migration. local_irq_save() prevents
        // interrupt handlers from accessing this slot. In debug builds, the CAS
        // above additionally verifies no aliased borrows exist. In release builds,
        // we trust the structural invariants. Together, these guarantee exclusive
        // access to this CPU's slot.
        unsafe {
            PerCpuMutGuard {
                value: &mut *self.data.add(cpu).as_ref().unwrap().get(),
                saved_flags,
                #[cfg(debug_assertions)]
                borrow_state: self.borrow_state(cpu),
                _guard: PhantomData,
            }
        }
    }
}

/// Guard for read-only per-CPU access. Only requires preemption disabled.
/// Implements `Deref<Target = T>` for ergonomic read access.
///
/// The lifetime `'a` is tied to the `PreemptGuard`, not to the `PerCpu<T>`
/// container. This ensures the guard cannot outlive the preemption-disabled
/// critical section, preventing use-after-migrate.
///
/// On creation, the per-slot `borrow_state` reader count was incremented by
/// `get()`. On drop, `PerCpuRefGuard` decrements the reader count, returning
/// the slot to its prior borrow state. This is necessary so that `get_mut()`
/// can detect when readers are no longer active and safely transition to
/// writer mode.
pub struct PerCpuRefGuard<'a, T> {
    value: &'a T,
    /// Reference to the per-slot borrow_state counter so Drop can decrement
    /// the reader count. Only present in debug builds (`#[cfg(debug_assertions)]`);
    /// in release builds, the borrow-state CAS is elided (Section 3.1.3).
    #[cfg(debug_assertions)]
    borrow_state: &'a AtomicU32,
    /// Ties this guard's lifetime to the `PreemptGuard`, not to `PerCpu<T>`.
    _guard: PhantomData<&'a PreemptGuard>,
}

/// `PerCpuRefGuard` must NOT be sent to another CPU/thread. It holds a
/// reference to a per-CPU slot that is only valid on the CPU where the
/// `PreemptGuard` was obtained. Sending it to another thread (which runs
/// on a different CPU) would allow reading another CPU's slot without any
/// synchronization, violating the per-CPU data invariant.
impl<T> !Send for PerCpuRefGuard<'_, T> {}

impl<'a, T> core::ops::Deref for PerCpuRefGuard<'a, T> {
    type Target = T;
    fn deref(&self) -> &T { self.value }
}

impl<'a, T> Drop for PerCpuRefGuard<'a, T> {
    fn drop(&mut self) {
        // Decrement the reader count with underflow detection.
        // Only present in debug builds; release builds elide borrow tracking
        // (Section 3.1.3).
        #[cfg(debug_assertions)]
        {
            // CORRECTNESS: fetch_sub on 0 wraps to u32::MAX (the writer sentinel),
            // which would corrupt the borrow state. We detect this case after the
            // fact and panic rather than silently corrupting state.
            //
            // Normal case: borrow_state is 1..=u32::MAX-1 (one or more readers).
            // Underflow case: borrow_state is 0 (no readers — this is a bug).
            // Writer-corruption case: borrow_state is u32::MAX (impossible if get()
            //   correctly checked for writer before incrementing).
            //
            // We panic on error because it indicates a logic error in the PerCpu
            // borrow tracking, and continuing would corrupt state. The recovery
            // store attempts to restore a consistent state for debugging, but
            // the kernel will halt anyway (panic = "abort" for kernel code).
            //
            // **Why no IRQ protection?** An interrupt handler that fires between
            // fetch_sub and the panic check, and calls get() on the same PerCpu,
            // will see borrow_state == u32::MAX (writer sentinel) and correctly
            // fail its get() call. The handler does NOT proceed with corrupted
            // state — it safely errors out. Adding IRQ save/restore to every guard
            // drop (hot path) to protect a code path that already panics is
            // unnecessary overhead.
            let old = self.borrow_state.fetch_sub(1, Ordering::Release);
            if old == 0 {
                // Underflow: no reader was registered. This is a kernel bug.
                // The fetch_sub wrapped to u32::MAX. Restore to 0 and panic.
                self.borrow_state.store(0, Ordering::Release);
                panic!("PerCpuRefGuard::drop: borrow_state underflow — double-drop or missing get()");
            }
            if old == u32::MAX {
                // Was in writer mode — this should be impossible since get() fails
                // when a writer is active. If we reach here, state is corrupted.
                // The fetch_sub left u32::MAX - 1 (no wraparound here); restore
                // the writer sentinel and panic.
                self.borrow_state.store(u32::MAX, Ordering::Release);
                panic!("PerCpuRefGuard::drop: borrow_state was u32::MAX (writer sentinel) — corrupted state");
            }
            }
            // Normal case: old was 1..=u32::MAX-1, now decremented successfully.
            // No further action needed.
        }
    }
}

/// Guard for mutable per-CPU access. Disables local interrupts on creation,
/// restores them on drop. This prevents interrupt handlers from creating
/// aliased references to the same per-CPU slot.
///
/// The lifetime `'a` is tied to the `PreemptGuard` (via `PhantomData`),
/// not to the `PerCpu<T>` container. Same rationale as `PerCpuRefGuard`.
///
/// On creation, the per-slot `borrow_state` was set to `u32::MAX` (writer
/// sentinel) by `get_mut()`. On drop, `PerCpuMutGuard` resets `borrow_state`
/// to `0` (free) before restoring local IRQs, allowing subsequent `get()` or
/// `get_mut()` calls on this slot.
pub struct PerCpuMutGuard<'a, T> {
    value: &'a mut T,
    saved_flags: usize,
    /// Reference to the per-slot borrow_state counter so Drop can reset it
    /// to 0 (free), enabling detection of concurrent borrows in subsequent
    /// callers. Only present in debug builds (`#[cfg(debug_assertions)]`);
    /// in release builds, the borrow-state CAS is elided entirely
    /// (Section 3.1.3) and the structural invariants (preemption disabled +
    /// IRQs disabled) are sufficient to guarantee exclusive access.
    #[cfg(debug_assertions)]
    borrow_state: &'a AtomicU32,
    /// Ties this guard's lifetime to the `PreemptGuard`, not to `PerCpu<T>`.
    _guard: PhantomData<&'a mut PreemptGuard>,
}

/// `PerCpuMutGuard` must NOT be sent to another CPU/thread. The `saved_flags`
/// field contains the interrupt state of the originating CPU (saved via
/// `local_irq_save`). Restoring these flags on a different CPU would corrupt
/// that CPU's interrupt state — potentially enabling interrupts that should
/// be disabled or vice versa, leading to missed interrupts or unsafe re-entry.
impl<T> !Send for PerCpuMutGuard<'_, T> {}

impl<'a, T> core::ops::Deref for PerCpuMutGuard<'a, T> {
    type Target = T;
    fn deref(&self) -> &T { self.value }
}

impl<'a, T> core::ops::DerefMut for PerCpuMutGuard<'a, T> {
    fn deref_mut(&mut self) -> &mut T { self.value }
}

impl<'a, T> Drop for PerCpuMutGuard<'a, T> {
    fn drop(&mut self) {
        // Reset borrow_state from u32::MAX (writer) back to 0 (free) BEFORE
        // restoring IRQs. This ensures that any interrupt handler firing
        // immediately after IRQ restoration sees the slot as free and can
        // call get() or get_mut() without a false-positive panic.
        // Only present in debug builds; release builds elide borrow tracking.
        //
        // NMI window note: Between borrow_state.store(0) and drop() completion,
        // an NMI handler could see borrow_state == 0 and obtain &T while this
        // guard still technically holds &mut T. This is safe in practice because:
        // (1) no writes occur after the borrow_state reset — the mutation is
        //     complete, and (2) the remaining operation (local_irq_restore) does
        //     not access the PerCpu slot data. In release builds, borrow_state
        //     tracking is entirely compiled out, so no observable state transition
        //     exists. The reverse order (IRQ restore before borrow_state reset)
        //     would cause IRQ handlers to see borrow_state == MAX and panic.
        #[cfg(debug_assertions)]
        {
            self.borrow_state.store(0, Ordering::Release);
        }
        local_irq_restore(self.saved_flags);
    }
}

3.1.1 PerCpuCounter: Batched Per-CPU Counter for Warm Paths

For warm-path counters where approximate reads are acceptable (dirty page counts, free block counts, superblock writer counts), `PerCpuCounter` provides batched per-CPU accumulation with an approximate global view. Each CPU maintains a local counter that periodically folds into a global sum when a batch threshold is exceeded.

This is NOT for hot-path counters like RSS — use `PerCpu<AtomicI64>` for those (zero batch overhead, one `fetch_add` per update). See design decision AI-036 Option C in Section 4.8 (MmStruct.rss documentation) for the full rationale distinguishing hot-path vs warm-path per-CPU counters.

| Counter type | Abstraction | Update cost | Read cost | Use case |
|---|---|---|---|---|
| Hot path | `PerCpu<AtomicI64>` | ~1-3 cycles (one `fetch_add`) | O(num_cpus) sum | RSS (per page fault) |
| Warm path | `PerCpuCounter` | ~3-8 cycles (local add + threshold check) | ~1 cycle (read global) | dirty pages, free blocks, SbWriters |

Linux equivalent: `struct percpu_counter` (`include/linux/percpu_counter.h`, `lib/percpu_counter.c`). Linux uses `s32 __percpu *counters` + `s64 count` + `raw_spinlock_t`. UmkaOS uses `PerCpu<AtomicI64>` for the local slots (avoiding the need for IRQ disable on the increment path) and a `SpinLock` for the global fold.

/// Batched per-CPU counter for warm-path statistics.
///
/// Each CPU accumulates increments/decrements in a local `AtomicI64` slot.
/// When the local value's absolute magnitude reaches `batch`, the local
/// value is folded into the global `count` under a spinlock and the local
/// slot is reset to zero. This amortises lock acquisition over `batch`
/// updates.
///
/// **Approximate reads** via `read_approximate()` return `count` without
/// summing per-CPU slots — O(1) but may drift by up to ±(batch * num_cpus).
/// **Exact reads** via `read_slow()` acquire the lock and sum all CPU slots
/// — O(num_cpus) but precise.
///
/// # When to use
///
/// Use `PerCpuCounter` for counters updated at moderate frequency (mount,
/// write, dirty-page tracking) where:
/// - Updates must be scalable (no cross-CPU contention on the common path).
/// - Reads are infrequent relative to updates (freeze/thaw, balance_dirty_pages).
/// - Approximate reads are acceptable for most consumers.
///
/// Do NOT use for:
/// - **Hot-path counters** (RSS, packet counts): use `PerCpu<AtomicI64>` —
///   zero batch overhead, one atomic op per update.
/// - **Counters read on every update**: the batch threshold check adds ~3-8
///   cycles that a raw atomic avoids.
///
/// # Batch threshold
///
/// Default: `max(32, num_online_cpus() * 2)` — matches Linux's
/// `compute_batch_value()`. Scaled dynamically on CPU hotplug. Callers
/// may override via `new_with_batch()` for counters with known access
/// patterns (e.g., superblock writer counts use a smaller batch because
/// freeze must drain quickly).
///
/// # Drift analysis
///
/// Maximum drift from true value when reading `count` without summing:
/// ±(`batch` * `num_possible_cpus()`). With default batch=64 on a 256-CPU
/// system: ±16384. For dirty page tracking (typical system has millions of
/// dirty pages), this drift is negligible. For counters where drift matters
/// (e.g., "is this counter exactly zero?"), use `read_slow()`.
///
/// # CPU hotplug
///
/// On CPU offline, the dying CPU's local slot is folded into `count` under
/// the lock (registered via the CPU hotplug callback in
/// [Section 3.2](#cpulocal-register-based-per-cpu-fast-path)). On CPU online, the
/// new CPU's slot starts at zero. The `batch` value is recomputed via
/// `max(32, num_online_cpus() * 2)`.
pub struct PerCpuCounter {
    /// Global approximate count. Updated when a per-CPU slot overflows
    /// the batch threshold. Protected by `lock` for writes; may be read
    /// without the lock for approximate reads (Relaxed load).
    count: AtomicI64,
    /// Per-CPU local accumulators. Each slot is independently updated
    /// with `fetch_add` (no IRQ disable needed — `AtomicI64` is safe
    /// against interrupt-handler races on the same CPU because the
    /// worst case is a slightly delayed fold, not data corruption).
    local: PerCpu<AtomicI64>,
    /// Lock protecting `count` during fold operations and `read_slow()`.
    /// Also serialises concurrent folds from different CPUs (rare — only
    /// happens when two CPUs hit the batch threshold simultaneously).
    lock: SpinLock<()>,
    /// Current batch threshold. Recomputed on CPU hotplug.
    /// Read with `Relaxed` on the increment path (stale value is benign —
    /// a slightly larger or smaller batch just shifts the fold timing).
    batch: AtomicI32,
}

impl PerCpuCounter {
    /// Create a new counter initialised to `initial_value`, with the
    /// default batch threshold `max(32, num_online_cpus() * 2)`.
    ///
    /// # Safety
    /// Must be called after the per-CPU allocator is available (boot phase 8+).
    pub unsafe fn new(initial_value: i64) -> Self {
        let nr_cpus = num_online_cpus() as i32;
        Self {
            count: AtomicI64::new(initial_value),
            local: PerCpu::new(num_possible_cpus(), || AtomicI64::new(0)),
            lock: SpinLock::new(()),
            batch: AtomicI32::new(core::cmp::max(32, nr_cpus * 2)),
        }
    }

    /// Create with a caller-specified batch threshold.
    pub unsafe fn new_with_batch(initial_value: i64, batch: i32) -> Self {
        Self {
            count: AtomicI64::new(initial_value),
            local: PerCpu::new(num_possible_cpus(), || AtomicI64::new(0)),
            lock: SpinLock::new(()),
            batch: AtomicI32::new(batch),
        }
    }

    /// Increment by 1. Common-case wrapper for `add(1)`.
    #[inline(always)]
    pub fn inc(&self) {
        self.add(1);
    }

    /// Decrement by 1. Common-case wrapper for `add(-1)`.
    #[inline(always)]
    pub fn dec(&self) {
        self.add(-1);
    }

    /// Add `amount` to the counter (may be negative).
    ///
    /// **Fast path** (~3-8 cycles): preempt_disable, fetch_add on the
    /// local CPU's slot, check if |local| >= batch. If not, return.
    ///
    /// **Slow path** (batch overflow): acquire `lock`, fold local slot
    /// into `count`, reset local to zero, release lock.
    ///
    /// On architectures with `cmpxchg` (x86-64, AArch64 via CASAL),
    /// the local read-add-check can use a single `fetch_add` followed
    /// by a branch on the result. On architectures without (PPC32),
    /// the local slot is read and written under preempt_disable.
    pub fn add(&self, amount: i64) {
        let guard = preempt_disable();
        let local_ref = self.local.get(&guard);
        let new_local = local_ref.fetch_add(amount, Ordering::Relaxed) + amount;
        let batch = self.batch.load(Ordering::Relaxed) as i64;
        if new_local.abs() >= batch {
            // Fold into global count.
            let _lock = self.lock.lock();
            // Re-read local: another fold may have raced (rare, requires
            // interrupt handler also hitting the batch on this CPU).
            let current = local_ref.swap(0, Ordering::Relaxed);
            self.count.fetch_add(current, Ordering::Relaxed);
        }
        // guard dropped — preemption re-enabled.
    }

    /// Read the approximate global value. O(1), no lock, no CPU iteration.
    /// May drift from the true value by up to ±(batch * num_cpus).
    ///
    /// Suitable for: dirty page ratio estimation, free-block heuristics,
    /// cgroup memory.stat reporting where exactness is not required.
    #[inline]
    pub fn read_approximate(&self) -> i64 {
        self.count.load(Ordering::Relaxed)
    }

    /// Read the exact global value. O(num_cpus), acquires lock.
    /// Folds all per-CPU slots into `count` and returns the result.
    ///
    /// Suitable for: freeze/thaw drain checks ("is writer count exactly
    /// zero?"), OOM scoring, `/proc` reporting where precision matters.
    pub fn read_slow(&self) -> i64 {
        let _lock = self.lock.lock();
        let mut total = self.count.load(Ordering::Relaxed);
        for cpu in 0..num_possible_cpus() {
            // SAFETY: iterating all possible CPUs. Offline CPUs have their
            // slots folded during hotplug-down, so their values are zero.
            // Online CPUs may be concurrently incrementing — the race is
            // benign (we read a slightly stale value for that CPU).
            let slot = unsafe { self.local.get_cpu(cpu) };
            total += slot.load(Ordering::Relaxed);
        }
        total
    }

    /// Read the exact value, clamped to non-negative. Equivalent to
    /// `max(0, read_slow())`. Used by filesystem free-block reporting
    /// where a transiently-negative value (due to concurrent decrements)
    /// must not be exposed to userspace.
    pub fn read_positive_slow(&self) -> i64 {
        core::cmp::max(0, self.read_slow())
    }

    /// Called from CPU hotplug callback when a CPU goes offline.
    /// Folds the dying CPU's local slot into the global count.
    pub fn cpu_dead(&self, cpu: usize) {
        let _lock = self.lock.lock();
        let slot = unsafe { self.local.get_cpu(cpu) };
        let val = slot.swap(0, Ordering::Relaxed);
        self.count.fetch_add(val, Ordering::Relaxed);
        // Recompute batch for new CPU count.
        let nr = num_online_cpus() as i32;
        self.batch.store(core::cmp::max(32, nr * 2), Ordering::Relaxed);
    }
}
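The fast-path/fold split above can be exercised in a userspace model. This is a sketch, not the kernel type: std's `Mutex` stands in for `SpinLock`, a `Vec` of slots for `PerCpu<AtomicI64>`, an explicit `cpu` argument for the `PreemptGuard`-pinned CPU id, and a deliberately tiny batch so the fold is visible in a short run.

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicI64, Ordering};

const BATCH: i64 = 4; // tiny batch for demonstration (kernel default is dynamic)

/// Userspace model of the batched per-CPU counter.
struct CounterModel {
    count: AtomicI64,       // global approximate count
    slots: Vec<AtomicI64>,  // stands in for PerCpu<AtomicI64>
    lock: Mutex<()>,        // stands in for the fold SpinLock
}

impl CounterModel {
    fn new(ncpus: usize) -> Self {
        Self {
            count: AtomicI64::new(0),
            slots: (0..ncpus).map(|_| AtomicI64::new(0)).collect(),
            lock: Mutex::new(()),
        }
    }

    /// Fast path: local fetch_add; fold into `count` on batch overflow.
    fn add(&self, cpu: usize, amount: i64) {
        let slot = &self.slots[cpu];
        let new_local = slot.fetch_add(amount, Ordering::Relaxed) + amount;
        if new_local.abs() >= BATCH {
            let _g = self.lock.lock().unwrap();
            let folded = slot.swap(0, Ordering::Relaxed);
            self.count.fetch_add(folded, Ordering::Relaxed);
        }
    }

    /// O(1) approximate read: global count only; drift is at most BATCH * ncpus.
    fn read_approximate(&self) -> i64 {
        self.count.load(Ordering::Relaxed)
    }

    /// Exact read: take the lock, sum the global count and every slot.
    fn read_slow(&self) -> i64 {
        let _g = self.lock.lock().unwrap();
        self.count.load(Ordering::Relaxed)
            + self.slots.iter().map(|s| s.load(Ordering::Relaxed)).sum::<i64>()
    }
}

fn main() {
    let c = CounterModel::new(2);
    for _ in 0..3 {
        c.add(0, 1); // three increments on "cpu 0": still below BATCH, no fold
    }
    assert_eq!(c.read_approximate(), 0); // drift: global count not updated yet
    assert_eq!(c.read_slow(), 3);        // exact read sums the slots
    c.add(0, 1);                         // fourth increment hits BATCH: fold
    assert_eq!(c.read_approximate(), 4);
}
```

The model preserves the key property: approximate reads never touch other CPUs' cache lines, and the exact read is the only path that iterates slots.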

get_cpu() note: PerCpu::get_cpu(cpu) is a raw slot accessor used by read_slow(), cpu_dead(), and the OOM scorer's RSS summation — it bypasses the PreemptGuard requirement because the caller is either iterating all CPUs under a lock (or with read-only atomic loads) or handling a hotplug event for a specific CPU. This is the only escape hatch from the proof-token discipline, and it is marked unsafe to reflect that the caller must ensure the access is sound (no concurrent mutable borrow on that slot).

impl<T> PerCpu<T> {
    /// Raw slot accessor for a specific CPU. Bypasses the `PreemptGuard`
    /// requirement — the caller is responsible for ensuring that no concurrent
    /// mutable borrow exists on the returned slot.
    ///
    /// # Intended callers
    /// - `PerCpuCounter::read_slow()` (iterating all CPUs under a SpinLock).
    /// - `PerCpuCounter::cpu_dead()` (hotplug callback for a specific CPU).
    /// - `mm_sum_rss()` in the OOM scorer ([Section 4.5](04-memory.md#oom-killer)): iterating all
    ///   possible CPUs to sum RSS counters (read-only, `Relaxed` loads).
    ///
    /// # Safety
    /// - `cpu` must be < `self.count` (num_possible_cpus()).
    /// - The caller must ensure no concurrent `get_mut()` borrow exists on
    ///   the same `cpu` slot. Typically satisfied by: (a) holding a lock
    ///   that serializes all writers, (b) the CPU being offline (no code
    ///   running on it), or (c) the access being read-only on an Atomic type
    ///   (no `&mut T` aliasing concern).
    pub unsafe fn get_cpu(&self, cpu: usize) -> &T {
        debug_assert!(cpu < self.count, "PerCpu::get_cpu: cpu {} >= count {}", cpu, self.count);
        &*(*self.data.add(cpu)).get()
    }
}
/// Per-CPU locks: type-safe sharded locking without array-indexing aliasing hazards.
///
/// A common pattern in scalable kernels is to give each CPU its own lock protecting
/// a shard of data, avoiding contention on a single global lock. Naively implementing
/// this as `[SpinLock<T>; N]` creates a type-safety hole: Rust's aliasing rules permit
/// holding `&mut T` from `locks[i]` and `&mut T` from `locks[j]` simultaneously if
/// `i != j`, but nothing in the type system prevents accessing `locks[other_cpu]`
/// when the caller intended to access only `locks[this_cpu]`. Furthermore, if two
/// CPUs simultaneously hold their respective locks, the `&mut T` references are
/// derived from the same array allocation, which can violate LLVM's noalias assumptions
/// in ways that are difficult to reason about.
///
/// UmkaOS provides `PerCpuLock<T>` to enforce the per-CPU locking invariant at the
/// type level:
///
/// 1. **Separate allocations per CPU**: Each CPU's lock+data is a separate slab
///    allocation (one `PerCpuLockSlot<T>` per CPU), NOT an array element. This ensures
///    that `&mut T` references from different CPUs are derived from disjoint
///    allocations, satisfying Rust's aliasing rules without relying on the
///    "different array indices" reasoning that LLVM may not honor.
///
/// 2. **Access restricted to current CPU**: `PerCpuLock::lock()` takes a
///    `&PreemptGuard` (the same proof token used by `PerCpu<T>`) and only returns
///    a guard for the current CPU's lock. There is no API to access another CPU's
///    lock — the type system makes it impossible.
///
/// 3. **Cache-line aligned**: Each `PerCpuLockSlot<T>` is `#[repr(align(64))]` to
///    prevent false sharing. The alignment is part of the type, not a runtime hint.
///
/// 4. **Safe composition with `PerCpu<T>`**: For cases where per-CPU data needs
///    both a lock and lock-free access, use `PerCpu<UnsafeCell<T>>` with explicit
///    `PerCpuMutGuard` for writes, or wrap the data in `PerCpu<SpinLock<T>>` where
///    the lock itself is per-CPU. The key invariant is that `PerCpuLock<T>` never
///    exposes `&mut T` from one CPU while another CPU's lock is held — each CPU's
///    lock guards only that CPU's data shard.
///
/// **Use case**: Per-CPU statistics counters that need occasional atomic updates
/// across all CPUs (e.g., network Rx packet counts). Each CPU locks its own slot
/// for local updates; aggregating across all CPUs requires iterating without holding
/// any locks (the counters are `AtomicU64`, so reads are consistent).
pub struct PerCpuLock<T> {
    /// Array of pointers to independently-allocated lock slots.
    /// Each pointer points to a separately-allocated `PerCpuLockSlot<T>`.
    /// Length = num_possible_cpus(), discovered at boot.
    /// The indirection ensures each slot is a separate allocation for aliasing purposes.
    ///
    /// **Allocation timing**: `PerCpuLock<T>` is initialized during phase 2 of boot
    /// (after the slab allocator is available, Section 4.3). Each slot is allocated
    /// via the slab allocator (not `Box`/global allocator). The pointer array itself
    /// is allocated from the boot bump allocator during early init.
    slots: *mut *mut PerCpuLockSlot<T>,
    /// Number of CPU slots (set once at boot, never changes).
    count: usize,
}

/// A single CPU's lock slot, cache-line aligned to prevent false sharing.
/// Each slot is allocated independently (via the slab allocator) for each CPU at boot.
///
/// `#[repr(align(64))]` guarantees:
/// - The struct's starting address is 64-byte aligned.
/// - The compiler pads the struct to a multiple of 64 bytes automatically.
/// - `size_of::<PerCpuLockSlot<T>>()` >= 64 regardless of `SpinLock<T>` size.
///
/// No explicit `_pad` field is needed — the compiler handles padding.
/// This works correctly for any `SpinLock<T>` size: small types get padded
/// up to 64 bytes, large types round up to the next 64-byte boundary.
#[repr(align(64))]
struct PerCpuLockSlot<T> {
    /// The spinlock protecting this CPU's data shard.
    lock: SpinLock<T>,
}

impl<T> PerCpuLock<T> {
    /// Lock the current CPU's data shard.
    ///
    /// # Safety contract
    ///
    /// - Takes `&PreemptGuard` to prove the caller is pinned to a specific CPU.
    /// - Returns `PerCpuLockGuard<'_, T>` that derefs to `&T` and `&mut T`.
    /// - The guard holds the spinlock for this CPU's slot.
    /// - No API exists to lock another CPU's shard — the only access path is
    ///   through the current CPU, enforced by the `PreemptGuard` proof token.
    ///
    /// **Aliasing safety**: Because each slot is a separate heap allocation,
    /// holding `&mut T` on CPU 0 and `&mut T` on CPU 1 simultaneously is sound —
    /// the references are derived from disjoint allocations, not from different
    /// indices of the same array. This satisfies Rust's aliasing rules and
    /// LLVM's noalias semantics without subtle reasoning about array indexing.
    ///
    /// **Interrupt safety**: `SpinLock::lock()` disables preemption for the
    /// critical section. If the caller needs to also disable interrupts
    /// (to prevent interrupt handlers from deadlocking on the same lock),
    /// they must wrap the call in `local_irq_save()`/`local_irq_restore()`.
    /// For per-CPU locks, this is rarely needed because interrupt handlers
    /// typically access different data or use lock-free patterns.
    pub fn lock<'g>(&self, guard: &'g PreemptGuard) -> PerCpuLockGuard<'g, T> {
        let cpu = guard.cpu_id();
        assert!(cpu < self.count, "PerCpuLock: cpu_id {} out of range", cpu);
        // SAFETY: slots was allocated with count elements at boot. cpu is in bounds.
        let slot_ptr = unsafe { *self.slots.add(cpu) };
        // SAFETY: slot_ptr was obtained from slab allocation at boot and is never
        // freed during kernel operation. It points to a valid PerCpuLockSlot<T>.
        let slot = unsafe { &*slot_ptr };
        PerCpuLockGuard {
            inner: slot.lock.lock(),
            _cpu_pin: PhantomData,
        }
    }

    /// Try to lock the current CPU's data shard without blocking.
    ///
    /// Returns `Some(PerCpuLockGuard)` if the lock was acquired, `None` if
    /// the lock is currently held (e.g., by an interrupt handler on this CPU).
    ///
    /// Same safety contract as `lock()`, but non-blocking.
    pub fn try_lock<'g>(&self, guard: &'g PreemptGuard) -> Option<PerCpuLockGuard<'g, T>> {
        let cpu = guard.cpu_id();
        assert!(cpu < self.count, "PerCpuLock: cpu_id {} out of range", cpu);
        let slot_ptr = unsafe { *self.slots.add(cpu) };
        let slot = unsafe { &*slot_ptr };
        slot.lock.try_lock().map(|inner| PerCpuLockGuard {
            inner,
            _cpu_pin: PhantomData,
        })
    }

    /// Access all slots for cross-CPU aggregation (read-only, no locks held).
    ///
    /// This is the ONLY way to access another CPU's slot, and it only provides
    /// read-only access to the lock structure itself — NOT to the protected data.
    /// The typical use case is iterating over all CPUs' atomic counters.
    ///
    /// # Safety
    ///
    /// The caller must ensure no CPU is currently mutating its data shard through
    /// a `PerCpuLockGuard`. For atomic counter aggregation, this is safe because
    /// the counters are read atomically without holding the lock. For non-atomic
    /// data, the caller must use external synchronization (e.g., pause all CPUs
    /// via IPI) before calling this method.
    ///
    /// Returns an iterator over `&SpinLock<T>` for each CPU. The caller can
    /// read the lock state or use `try_lock()` on each, but cannot obtain `&mut T`
    /// through this path without holding the lock.
    pub unsafe fn iter_slots(&self) -> impl Iterator<Item = &'_ SpinLock<T>> {
        (0..self.count).map(move |i| {
            let slot_ptr = *self.slots.add(i);
            &(*slot_ptr).lock
        })
    }
}

/// Guard for a per-CPU lock. Implements `Deref`/`DerefMut` to access the protected data.
///
/// The guard holds a `SpinLockGuard` derived from the current CPU's lock slot.
/// The `PhantomData<&'a PreemptGuard>` ties the guard's lifetime to the CPU pin,
/// preventing use-after-migrate if the caller drops the `PreemptGuard` while
/// still holding the lock guard.
pub struct PerCpuLockGuard<'a, T> {
    inner: SpinLockGuard<'a, T>,
    _cpu_pin: PhantomData<&'a PreemptGuard>,
}

impl<'a, T> core::ops::Deref for PerCpuLockGuard<'a, T> {
    type Target = T;
    fn deref(&self) -> &T { &*self.inner }
}

impl<'a, T> core::ops::DerefMut for PerCpuLockGuard<'a, T> {
    fn deref_mut(&mut self) -> &mut T { &mut *self.inner }
}

/// `PerCpuLockGuard` must NOT be sent to another CPU/thread.
/// The guard holds a `SpinLockGuard` from a specific CPU's lock slot.
/// Sending it to another thread would allow that thread to access a lock
/// that may be concurrently acquired by the original CPU (e.g., in an
/// interrupt handler), causing deadlock or data corruption.
impl<T> !Send for PerCpuLockGuard<'_, T> {}

Why separate allocations matter for type soundness:

The naive [SpinLock<T>; N] approach has a subtle aliasing problem. When CPU 0 holds locks[0].lock() and CPU 1 holds locks[1].lock(), both CPUs have derived their &mut T references from the same array allocation. While Rust's reference rules permit disjoint array element access, LLVM's noalias attribute and the optimizer's alias analysis may not distinguish between "different indices of the same array" and "same allocation." In practice, this is unlikely to cause miscompilation with current LLVM, but the UmkaOS architecture takes a conservative approach: each CPU's lock+data is a separate heap allocation, guaranteeing that the &mut T references are truly disjoint at the allocation level.

This design also simplifies reasoning about memory reclamation: if a per-CPU lock slot needs to be freed (e.g., during CPU hot-unplug), the individual slab allocation can be returned without affecting other CPUs' slots.
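The separate-allocation argument can be demonstrated with a userspace sketch (hypothetical names; `Box` stands in for the kernel slab allocation and std's `Mutex` for `SpinLock`):

```rust
use std::sync::Mutex;

/// One independently boxed, cache-line-aligned slot per CPU.
/// `#[repr(align(64))]` makes the alignment part of the type, as in
/// `PerCpuLockSlot<T>`; the payload here is just a u64 counter.
#[repr(align(64))]
struct Slot {
    lock: Mutex<u64>,
}

/// Build the slot table: a vector of pointers to separate allocations,
/// not an array of inline slots.
fn build_slots(ncpus: usize) -> Vec<Box<Slot>> {
    (0..ncpus).map(|_| Box::new(Slot { lock: Mutex::new(0) })).collect()
}

fn main() {
    let slots = build_slots(4);

    // Alignment is part of the type: every slot starts on a 64-byte
    // boundary and the struct is padded to a multiple of 64 bytes.
    assert_eq!(std::mem::size_of::<Slot>() % 64, 0);
    for s in &slots {
        assert_eq!((&**s as *const Slot as usize) % 64, 0);
    }

    // Because each slot is its own heap allocation, exclusive access to
    // two different slots yields references into disjoint allocations,
    // which is the property the kernel wants for &mut T held by two CPUs
    // at once.
    let mut g0 = slots[0].lock.lock().unwrap();
    let mut g1 = slots[1].lock.lock().unwrap();
    *g0 += 1;
    *g1 += 2;
    assert_eq!((*g0, *g1), (1, 2));
}
```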


3.1.2 ArcSwap — Lock-Free Atomic Arc<T> Replacement

ArcSwap<T> provides lock-free read access and atomic swap of Arc<T> values. It is the kernel-internal equivalent of the userspace arc-swap crate, designed for cases where a shared resource is read frequently on hot paths and replaced rarely on cold paths (credential updates, cgroup migration, mm replacement during exec, namespace changes via setns/unshare).

Key properties:

  • Read path (load()): Returns an ArcSwapGuard<T> that derefs to &T and extends the lifetime of the inner Arc<T> for the guard's duration. No locks, no atomic increment on the Arc refcount — the guard holds a snapshot. Cost: one AtomicPtr::load(Acquire) + hazard pointer registration (~3-5 cycles on x86-64). On the scheduler hot path this is significantly cheaper than Arc::clone() (which requires a fetch_add on the refcount — ~15-25 cycles on contended cache lines).
  • Write path (store(), swap()): Atomically replaces the inner Arc<T>. The old Arc<T> is reclaimed after all existing ArcSwapGuards that reference it have been dropped. Uses hazard-pointer-based deferred reclamation (not RCU) because ArcSwap swap sites are not always within RCU read-side critical sections. Cost: one AtomicPtr::swap(AcqRel) + deferred reclamation of the old value.
  • Interior mutability: store() and swap() take &self, not &mut self. This is the primary reason ArcSwap exists — it allows atomic replacement of Arc<T> through shared references (e.g., Task.nsproxy, Task.cgroup, Process.mm). The caller provides synchronization for write-side serialization (typically an external lock or single-writer guarantee).

/// Atomic `Arc<T>` container — lock-free reads, atomic swap.
///
/// Generic replacement for patterns where `Option<Arc<T>>` or `Arc<T>` must
/// be mutated through a shared reference. All methods take `&self`.
///
/// # Usage in UmkaOS
///
/// | Field | Read path | Write path | Serialization |
/// |---|---|---|---|
/// | `Task.nsproxy` | syscall dispatch (every syscall) | `setns(2)`, `unshare(2)` | `task.task_lock()` |
/// | `Task.cgroup` | scheduler tick, resource charge | cgroup migration | `cgroup_threadgroup_rwsem` |
/// | `Process.mm` | page fault, /proc reads | `exec()` | single-thread at PNR |
/// | `Process.cred` | capability checks | `commit_creds()` | `Process::cred_lock` |
/// | `Task.files` | fd operations | `close_on_exec` in exec | FdTable internal lock |
/// | `Task.fs` | path resolution | `chdir()`, `chroot()`, `unshare(CLONE_FS)` | FsStruct internal RwLock |
///
/// # Memory ordering
///
/// - `load()`: `Acquire` on the internal `AtomicPtr`. Pairs with the `Release`
///   in `store()`/`swap()`. Ensures all writes to the `T` inside the `Arc` are
///   visible to the reader after loading the pointer.
/// - `store()`/`swap()`: `AcqRel` on the internal `AtomicPtr`. The `Release`
///   side ensures writes to the new `Arc<T>`'s contents are visible to future
///   `load()` callers. The `Acquire` side ensures the old value is fully read
///   before the pointer is replaced.
///
/// # Reclamation
///
/// The old `Arc<T>` returned by `swap()` (or implicitly freed by `store()`)
/// must not be dropped while any `ArcSwapGuard` from a prior `load()` still
/// references it. The implementation uses a per-CPU hazard pointer array
/// (one slot per CPU, ~8 bytes × num_cpus) to track active guards.
/// `store()` spins briefly on CPUs whose hazard pointer matches the old value
/// (bounded by the guard's scope — typically a few instructions). In practice,
/// contention is near-zero because swaps are rare (credential changes,
/// cgroup migration) and guards are short-lived (a few hundred cycles).
///
/// # Comparison with alternatives
///
/// | Alternative | Problem |
/// |---|---|
/// | `RwLock<Arc<T>>` | Read lock on hot path (~20-40 cycles, contention under load) |
/// | `Mutex<Arc<T>>` | Exclusive lock on reads (unacceptable for concurrent readers) |
/// | `RcuCell<Arc<T>>` | Requires RCU read-side critical section; ArcSwap is usable outside RCU context |
/// | `AtomicPtr<T>` | No lifetime management; caller must manually manage Arc refcounts |
/// | Bare `Arc<T>` + `Arc::clone()` | `fetch_add` on refcount per read (~15-25 cycles contended) |
pub struct ArcSwap<T> {
    /// Raw pointer produced by `Arc::into_raw` (it points at the `T` payload
    /// inside the `ArcInner`). Loaded with `Acquire`, stored with `Release`.
    ptr: AtomicPtr<T>,
}

impl<T> ArcSwap<T> {
    /// Create a new `ArcSwap` holding the given `Arc<T>`.
    ///
    /// Consumes the `Arc` (incrementing its strong count is not needed — the
    /// `ArcSwap` takes ownership of the reference count).
    pub fn new(val: Arc<T>) -> Self {
        let ptr = Arc::into_raw(val) as *mut T;
        Self { ptr: AtomicPtr::new(ptr) }
    }

    /// Construct from a value directly (allocates a new `Arc`).
    pub fn from_pointee(val: T) -> Self {
        Self::new(Arc::new(val))
    }

    /// Load the current value. Returns an `ArcSwapGuard` that derefs to `&T`.
    ///
    /// Lock-free. The guard registers a hazard pointer on the current CPU to
    /// prevent the referenced `Arc<T>` from being reclaimed while the guard
    /// is live. The hazard pointer is cleared on guard drop.
    ///
    /// # Performance
    /// ~3-5 cycles on x86-64 (one atomic load + one per-CPU hazard store).
    /// No `Arc` refcount increment.
    pub fn load(&self) -> ArcSwapGuard<'_, T> {
        // 1. Register hazard pointer (per-CPU slot).
        // 2. Load ptr with Acquire.
        // 3. Verify ptr matches hazard (retry if swap raced).
        // 4. Return guard holding &T derived from the loaded ptr.
        // SAFETY: The hazard pointer prevents reclamation of the Arc
        // while the guard is live. The Acquire ordering ensures all
        // writes to T are visible.
        todo!("implementation in umka-core/src/sync/arc_swap.rs")
    }

    /// Atomically replace the stored `Arc<T>` with `new_val`.
    ///
    /// The old `Arc<T>` is dropped after all existing guards that reference
    /// it have been dropped (hazard pointer scan). Takes `&self` — this is
    /// the interior mutability entry point.
    ///
    /// # Write serialization
    /// Multiple concurrent `store()` calls are serialized by the atomic swap
    /// on `ptr`. However, the caller should provide external serialization
    /// (a lock or single-writer guarantee) to ensure that the "old" vs "new"
    /// value has a well-defined order. Without external serialization, two
    /// concurrent stores may complete in either order.
    pub fn store(&self, new_val: Arc<T>) {
        let old = self.swap(new_val);
        // old is the previous Arc<T>. defer_drop via hazard scan.
        drop(old); // Arc::drop decrements refcount; actual free if last ref.
    }

    /// Atomically swap and return the old `Arc<T>`.
    ///
    /// The returned `Arc<T>` is safe to drop immediately — the hazard pointer
    /// mechanism ensures no `ArcSwapGuard` is still referencing it by the time
    /// the caller observes the return value.
    pub fn swap(&self, new_val: Arc<T>) -> Arc<T> {
        let new_ptr = Arc::into_raw(new_val) as *mut T;
        let old_ptr = self.ptr.swap(new_ptr, Ordering::AcqRel);
        // Wait for any hazard pointers matching old_ptr to clear.
        // Then reconstruct the Arc.
        // SAFETY: old_ptr was previously stored via Arc::into_raw,
        // and all guards have released their hazard pointers.
        unsafe { Arc::from_raw(old_ptr) }
    }
}

/// Guard returned by `ArcSwap::load()`. Derefs to `&T`.
///
/// Holds a hazard pointer registration that prevents the referenced `Arc<T>`
/// from being reclaimed. The hazard pointer is cleared on drop.
///
/// Short-lived by convention: hold for the duration of a single field access
/// or computation, not across blocking points or context switches.
pub struct ArcSwapGuard<'a, T> {
    /// Reference to the inner T, valid for the guard's lifetime.
    value: &'a T,
    /// Hazard pointer slot index (per-CPU). Cleared on drop.
    _hazard: HazardSlot,
}

impl<T> core::ops::Deref for ArcSwapGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T { self.value }
}

// ArcSwapGuard must not be sent across threads — the hazard pointer is
// registered on the current CPU's slot. Sending to another thread would
// leave the hazard pointer on the wrong CPU, failing to protect the
// referenced Arc from reclamation during a swap on the original CPU.
impl<T> !Send for ArcSwapGuard<'_, T> {}

ArcSwap<T> vs RcuCell<Arc<T>>: Both provide lock-free reads with deferred reclamation. The key difference is the reclamation scope:

  • RcuCell defers reclamation to the next RCU grace period. Requires the reader to be in an RCU read-side critical section (implicit in preempt-disabled kernel code). Best for data that is always read on fast kernel paths where RCU context is guaranteed (credential checks, routing table lookups, dentry cache).
  • ArcSwap defers reclamation via hazard pointers (per-CPU, cleared on guard drop). Does NOT require an RCU read-side context. Best for data that may be read outside RCU context (e.g., from userspace-triggered paths like setns() that swap nsproxy, or from /proc reads that are not in RCU context).

Both are correct; the choice depends on whether the read site is guaranteed to be in RCU context. Task.cred uses RcuCell (reads are always in syscall fast path = RCU context). Task.nsproxy and Task.cgroup use ArcSwap (reads may occur from cgroup migration paths that are not in RCU context).
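The pointer plumbing underneath `swap()` can be shown in a minimal single-threaded model. This is a sketch, not the kernel type: the hazard-pointer protocol is elided (which is exactly the part that makes the real ArcSwap safe under concurrency), so it is only sound with no concurrent readers. It demonstrates the `Arc::into_raw`/`Arc::from_raw` ownership transfer and the Acquire/Release pairing.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Single-threaded model of ArcSwap's pointer container.
struct ArcSwapModel<T> {
    ptr: AtomicPtr<T>,
}

impl<T> ArcSwapModel<T> {
    fn new(val: Arc<T>) -> Self {
        // The container takes over the Arc's strong reference.
        Self { ptr: AtomicPtr::new(Arc::into_raw(val) as *mut T) }
    }

    /// Clone-based load: a refcount bump instead of a hazard pointer.
    fn load_full(&self) -> Arc<T> {
        let p = self.ptr.load(Ordering::Acquire);
        // SAFETY: p came from Arc::into_raw and is still owned by self.
        unsafe {
            Arc::increment_strong_count(p);
            Arc::from_raw(p)
        }
    }

    fn swap(&self, new_val: Arc<T>) -> Arc<T> {
        let new_ptr = Arc::into_raw(new_val) as *mut T;
        let old_ptr = self.ptr.swap(new_ptr, Ordering::AcqRel);
        // SAFETY: old_ptr was stored via Arc::into_raw; no concurrent
        // readers exist in this model, so no hazard scan is needed.
        unsafe { Arc::from_raw(old_ptr) }
    }
}

impl<T> Drop for ArcSwapModel<T> {
    fn drop(&mut self) {
        // SAFETY: reclaim the strong reference the container still owns.
        unsafe { drop(Arc::from_raw(*self.ptr.get_mut())) };
    }
}

fn main() {
    let cell = ArcSwapModel::new(Arc::new("v1"));
    assert_eq!(*cell.load_full(), "v1");
    let old = cell.swap(Arc::new("v2")); // atomic replacement via &self
    assert_eq!(*old, "v1");              // caller receives the old Arc
    assert_eq!(*cell.load_full(), "v2");
}
```

The real `load()` replaces the `increment_strong_count` with a per-CPU hazard store, which is what makes it cheaper than `Arc::clone()` on contended cache lines.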

3.2 CpuLocal: Register-Based Per-CPU Fast Path

PerCpu<T> is the correct abstraction for general per-CPU data, but its indirection (CPU ID lookup + array indexing + borrow-state checking) is too expensive for the kernel's hottest paths — scheduler pick_next_task, slab magazine alloc/free, NAPI poll, RCU quiescent-state reporting. These paths execute millions of times per second and every cycle matters.

Linux solves this with architecture-specific per-CPU registers that allow single- or two-instruction access to critical per-CPU fields. UmkaOS adopts the same approach as a two-tier per-CPU model:

  • Tier 1 — CpuLocal: A fixed-layout struct pointed to by the architecture's dedicated per-CPU register. Access is 1-4 instructions with no function-call overhead. Used for ~10 of the hottest fields.
  • Tier 2 — PerCpu<T>: The existing generic abstraction. Used for everything else (statistics counters, per-CPU caches, driver state).

Per-architecture register assignment:

| Architecture | Register | Instruction | Cycles | Notes |
|---|---|---|---|---|
| x86-64 | GS segment | `mov %gs:OFFSET, %reg` | ~1 | Segment prefix encodes the offset in the instruction. Set via `MSR_GS_BASE` per CPU at boot. |
| AArch64 | TPIDR_EL1 | `mrs reg, tpidr_el1` + `ldr` | ~2-4 | System register, kernel-only (EL1). VHE kernels use TPIDR_EL2. |
| ARMv7 | TPIDRPRW | `mrc p15, 0, reg, c13, c0, 4` + `ldr` | ~3-5 | Privileged thread ID register (PL1 only). Requires ARMv6K+. |
| PPC64 | r13 (PACA) | `ld reg, OFFSET(r13)` | ~1-3 | r13 permanently points to the Per-processor Area (PACA). Matches Linux. |
| PPC32 | SPRG3 | `mfspr reg, SPRG3` + `lwz` | ~3-6 | SPRG3 is designated for OS use. Linux PPC32 does not optimize this; UmkaOS does. |
| RISC-V | tp (x4) | `mv reg, tp` + `ld` | ~2-4 | Matches Linux RISC-V: `tp` holds the per-CPU base in kernel mode; `sscratch` holds the user `tp` (U-mode) or 0 (S-mode). On trap entry, `csrrw tp, sscratch, tp` swaps the two; `sscratch == 0` distinguishes kernel re-entrant traps from user traps. |
| s390x | PREFIX page | `lg reg, OFFSET` (lowcore) | ~2-4 | The s390x PREFIX register remaps the low 8 KiB to a per-CPU "lowcore" area. The per-CPU base is stored at a fixed lowcore offset; a single `lg` from the per-CPU base reaches CpuLocalBlock fields. |
| LoongArch64 | CSR.KS0 | `csrrd reg, KS0` + `ld.d` | ~2-4 | KS0 (Kernel Scratch 0) is a kernel-only CSR. Matches Linux LoongArch: CSR.KS0 holds the per-CPU base pointer. |

Design rationale: Only x86-64 achieves true single-instruction per-CPU access (the segment register encodes the per-CPU offset within the instruction itself). All other architectures require at minimum two instructions: read the per-CPU base register, then load from base+offset. This is still an order of magnitude faster than the PerCpu<T> generic path (~3-5 cycles vs ~20-30 cycles with CAS borrow checking).
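The offsets that these access instructions encode are compile-time constants derived from the `CpuLocalBlock` layout. The following sketch shows how such constants can be computed with `offset_of!`; only the nesting counters are named in this chapter, so `current_task` and the field order are illustrative assumptions.

```rust
/// Hypothetical slice of the CpuLocalBlock layout. `#[repr(C)]` fixes
/// the field order so the offsets are stable and can be baked into the
/// per-arch accessors (e.g. the displacement in `mov %gs:OFFSET, %reg`).
#[repr(C)]
struct CpuLocalBlock {
    current_task: usize, // offset 0 (illustrative field)
    preempt_count: u32,  // offset 8 on 64-bit targets
    irq_count: u32,      // offset 12
    softirq_count: u32,  // offset 16
}

// Compile-time offsets: these constants are what the inline-asm accessors
// would encode as the base-register displacement.
const PREEMPT_COUNT_OFF: usize = std::mem::offset_of!(CpuLocalBlock, preempt_count);
const IRQ_COUNT_OFF: usize = std::mem::offset_of!(CpuLocalBlock, irq_count);

fn main() {
    // preempt_count sits right after the pointer-sized first field,
    // and irq_count 4 bytes after that.
    assert_eq!(PREEMPT_COUNT_OFF, std::mem::size_of::<usize>());
    assert_eq!(IRQ_COUNT_OFF, PREEMPT_COUNT_OFF + 4);
    println!("preempt_count at +{PREEMPT_COUNT_OFF}, irq_count at +{IRQ_COUNT_OFF}");
}
```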

/// Per-size-class free-object magazine for lock-free CPU-local slab allocation.
/// A magazine holds up to `MAGAZINE_SIZE` pre-freed objects of one size class.
/// All slots are pointers to objects of identical size; the size class is
/// implicit from which `MagazinePair` entry this magazine lives in.
///
/// # Allocation fast path
/// 1. Caller checks `loaded.count > 0`.
/// 2. If true, returns `loaded.objects[--count]` (no lock, no atomic).
/// 3. If false, swaps `loaded` ↔ `spare`; if spare was non-empty, retry step 1.
/// 4. If both are empty, refills `loaded` from the global per-size-class slab
///    under a short spinlock (typically 64 objects at a time).
///
/// # Free fast path
/// 1. If `loaded.count < MAGAZINE_SIZE`, store freed pointer at `loaded.objects[count++]`.
/// 2. If loaded is full, swap loaded ↔ spare; if spare was full, drain spare
///    to the global slab under a spinlock, then retry.
///
/// # Memory layout
/// `objects` is a fixed-size inline array, so each magazine is one contiguous allocation.
/// At MAGAZINE_SIZE = 64, `size_of::<SlabMagazine>()` = 8 (count) + 64 × 8 (ptrs)
/// = 520 bytes, fitting in 9 cache lines. Magazines are allocated from the slab
/// allocator itself (size class for 520-byte objects).
pub struct SlabMagazine {
    /// Number of valid object pointers currently loaded (0..=MAGAZINE_SIZE).
    pub count: usize,
    /// Object pointer slots — valid at indices [0..count).
    /// Slots at indices [count..MAGAZINE_SIZE) are uninitialised and must
    /// not be read. Written by the free path; read and cleared by the alloc path.
    pub objects: [*mut u8; MAGAZINE_SIZE],
}

/// Maximum number of objects in a single `SlabMagazine`.
/// 64 objects = 512 bytes of pointer slots (eight cache lines at 8 bytes per pointer).
/// Changing this constant requires recompiling all slab consumers and
/// re-tuning the global slab's bulk-refill batch size (currently 1 magazine).
pub const MAGAZINE_SIZE: usize = 64;
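The alloc/free fast paths described above can be modeled in userspace. This sketch uses `usize` tokens in place of `*mut u8` object pointers, a `Vec` in place of the locked global slab, and a tiny magazine size; names are illustrative.

```rust
const MAGAZINE_SIZE: usize = 4; // tiny for demonstration (kernel uses 64)

/// Userspace model of one CPU's magazine.
struct Magazine {
    count: usize,
    objects: [usize; MAGAZINE_SIZE],
}

/// One CPU's magazine pair plus the global depot (lock elided: this
/// model is single-threaded).
struct MagazinePairModel {
    loaded: Magazine,
    spare: Magazine,
    depot: Vec<usize>,
}

impl MagazinePairModel {
    fn alloc(&mut self) -> usize {
        // Fast path: pop from the loaded magazine.
        if self.loaded.count > 0 {
            self.loaded.count -= 1;
            return self.loaded.objects[self.loaded.count];
        }
        // Swap loaded <-> spare and retry once if the spare had objects.
        std::mem::swap(&mut self.loaded, &mut self.spare);
        if self.loaded.count > 0 {
            self.loaded.count -= 1;
            return self.loaded.objects[self.loaded.count];
        }
        // Both empty: refill loaded from the depot (kernel: under spinlock).
        while self.loaded.count < MAGAZINE_SIZE {
            match self.depot.pop() {
                Some(obj) => {
                    self.loaded.objects[self.loaded.count] = obj;
                    self.loaded.count += 1;
                }
                None => break,
            }
        }
        assert!(self.loaded.count > 0, "depot exhausted");
        self.loaded.count -= 1;
        self.loaded.objects[self.loaded.count]
    }

    fn free(&mut self, obj: usize) {
        if self.loaded.count == MAGAZINE_SIZE {
            // Loaded full: swap in the spare; if it is also full, drain
            // one magazine back to the depot (kernel: under spinlock).
            std::mem::swap(&mut self.loaded, &mut self.spare);
            if self.loaded.count == MAGAZINE_SIZE {
                self.depot.extend_from_slice(&self.loaded.objects);
                self.loaded.count = 0;
            }
        }
        // Fast path: push into the loaded magazine.
        self.loaded.objects[self.loaded.count] = obj;
        self.loaded.count += 1;
    }
}

fn main() {
    let empty = || Magazine { count: 0, objects: [0; MAGAZINE_SIZE] };
    let mut pair = MagazinePairModel {
        loaded: empty(),
        spare: empty(),
        depot: (1..=16).collect(),
    };
    let a = pair.alloc(); // both magazines empty: refill from depot, then pop
    let _b = pair.alloc(); // served directly from the loaded magazine
    pair.free(a);
    assert_eq!(pair.alloc(), a); // LIFO: the just-freed object comes back first
}
```

The LIFO behavior is the point of the design: recently freed objects are likely still cache-hot when reallocated on the same CPU.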

/// Two-magazine pair per CPU per size class.
///
/// The two-magazine design avoids the "thrash" case where a tight alloc/free
/// loop on the same CPU would otherwise bounce between the global slab and the
/// per-CPU cache. With two magazines, the free path can fill the spare magazine
/// before touching the global slab, and the alloc path can drain the loaded
/// magazine fully before swapping in the spare.
///
/// # Type representation
///
/// Both fields use `Option<NonNull<SlabMagazine>>`:
/// - `NonNull<SlabMagazine>` encodes the invariant that the pointer is non-null
///   and properly aligned, providing a safety contract for `unsafe` dereference.
/// - `Option<NonNull<SlabMagazine>>` uses niche optimisation (same size as
///   `*mut SlabMagazine` — 8 bytes on 64-bit, 4 bytes on 32-bit) so there is
///   zero space overhead.
/// - On the **fast path**, both fields are always `Some`. Accessing the magazine
///   is `pair.loaded.unwrap().as_ref()` (or `as_mut()`), which compiles to
///   a single pointer dereference when the optimiser can prove the `Some`
///   invariant, and otherwise to one well-predicted branch plus the
///   dereference (the alloc/free hot path only executes when
///   `magazine_active == true`, and initialisation guarantees `Some`).
/// - The `None` state occurs only in the **GC IPI drain** handler
///   (`drain_all_cpu_magazines()`), where `magazine_active` is already `false`
///   and `.take()` extracts the `NonNull` for depot return. In
///   `slab_free_slow()`, the no-empties path copies objects to a stack buffer
///   and zeroes the magazine in-place, keeping `pair.spare` as `Some`.
/// - `core::mem::take()` on `*mut T` would produce null (`Default` for raw
///   pointers) -- the root cause of SLAB-13. With `Option<NonNull<T>>`,
///   `.take()` produces `None` (an explicit sentinel, not a null pointer),
///   and `None` is handled by `if let Some`. This eliminates the SLAB-13
///   null-deref bug by construction.
///
/// # Invariants
/// - On the fast path (magazine_active == true), both `loaded` and `spare` are
///   `Some` — they always point to valid, slab-allocated `SlabMagazine` instances.
/// - `None` is permitted only during GC IPI drain (where `magazine_active`
///   is already `false`). The `slab_free_slow()` no-empties path keeps both
///   fields `Some` by zeroing the magazine in-place. Code that encounters
///   `None` must handle it explicitly (the `None` state means "magazine
///   extracted for depot return, CPU magazines inactive").
/// - Both point to magazines for the same size class.
/// - Access to `loaded` and `spare` requires preemption disabled on the current
///   CPU (the containing `CpuLocalBlock` is only accessed with preemption off).
///
/// **Live evolution**: `MagazinePair` is a NON-REPLACEABLE data structure (part of
/// the allocator data layer, see [Section 4.3](04-memory.md#slab-allocator) for the full slab allocator
/// design including depot and partial-list interactions). The replaceable `SlabAllocPolicy` trait
/// ([Section 4.3](04-memory.md#slab-allocator))
/// controls how magazines are refilled and drained (batch size, NUMA node
/// selection), but the magazine pop/push hot path is fixed code that never goes
/// through the policy trait.
pub struct MagazinePair {
    /// Currently active magazine: alloc pops from here, free pushes here first.
    /// Always `Some` on the fast path. `None` only during GC IPI drain.
    pub loaded: Option<NonNull<SlabMagazine>>,
    /// Backup magazine: swapped in when `loaded` is empty (alloc) or full (free).
    /// Always `Some` on the fast path. `None` only during GC IPI drain.
    pub spare: Option<NonNull<SlabMagazine>>,
}
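The two properties relied on above — niche optimisation and the explicit `None` sentinel from `.take()` — can be demonstrated standalone (a sketch; the helper names are hypothetical, and `mem::replace` with `null_mut()` models what `mem::take` on a raw pointer does, since `Default` for raw pointers is null):

```rust
use std::ptr::NonNull;

/// Returns (slot_is_none, extracted_is_some) after a `.take()`:
/// the slot holds an explicit None sentinel, never a dereferenceable null.
pub fn take_leaves_sentinel() -> (bool, bool) {
    let mut value = 42u64;
    let mut slot: Option<NonNull<u64>> = NonNull::new(&mut value);
    let extracted = slot.take();
    (slot.is_none(), extracted.is_some())
}

/// The contrasting SLAB-13 hazard: taking a raw pointer leaves null behind,
/// which later code can dereference by mistake.
pub fn take_on_raw_is_null() -> bool {
    let mut value = 0u64;
    let mut raw: *mut u64 = &mut value;
    let _old = std::mem::replace(&mut raw, std::ptr::null_mut());
    raw.is_null()
}
```

The niche optimisation (`size_of::<Option<NonNull<T>>>() == size_of::<*mut T>()`) is a documented layout guarantee, so the migration from raw pointers costs no space.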

/// Number of slab size classes. Each size class has its own per-CPU magazine
/// pointer in CpuLocalBlock. Covers allocations from 8 bytes (class 0) to
/// 16384 bytes (class 25). Includes non-power-of-two classes 96 and 192
/// to reduce internal fragmentation for common 3-pointer structs.
/// Classes 0-3: powers of two (8..64). Class 4: 96. Class 5: 128.
/// Class 6: 192. Classes 7-9: 256, 512, 1024. Classes 10-25: step region.
/// This constant MUST match `SLAB_SIZE_CLASSES` in [Section 4.3](04-memory.md#slab-allocator).
pub const SLAB_SIZE_CLASSES: usize = 26;
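The non-power-of-two classes 96 and 192 absorb request sizes (e.g. 72 or 160 bytes) that would otherwise waste nearly half of a power-of-two slot. A minimal lookup sketch for the documented small classes only (`size_to_small_class` is a hypothetical helper; the step-region classes 10-25 are defined in Section 4.3 and omitted here):

```rust
/// Map an allocation size to a small size-class index (classes 0-9,
/// covering 8..=1024 bytes). Returns None for step-region sizes.
pub fn size_to_small_class(size: usize) -> Option<usize> {
    const SIZES: [usize; 10] = [8, 16, 32, 64, 96, 128, 192, 256, 512, 1024];
    // First class whose slot size fits the request.
    SIZES.iter().position(|&class_size| size <= class_size)
}
```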

/// Fixed-layout per-CPU data block. Accessed via the architecture's
/// dedicated per-CPU register (x86-64 GS, AArch64 TPIDR_EL1, etc.).
/// Contains only the hottest fields — those accessed on every syscall,
/// every interrupt, or every scheduler tick.
///
/// # Invariants
///
/// - One `CpuLocalBlock` is allocated per CPU at boot (boot allocator).
/// - The per-CPU register is initialized to point to this block during
///   `smp_prepare_boot_cpu()` (BSP) and `secondary_cpu_init()` (APs).
/// - The register value NEVER changes after init for a given CPU.
/// - Access requires preemption to be disabled (caller holds a
///   `PreemptGuard`), ensuring the CPU cannot change between register
///   read and field access. No runtime borrow check is needed — the
///   structural invariant (preemption disabled + dedicated register)
///   guarantees single-writer access.
///
/// # Field selection criteria
///
/// A field belongs in `CpuLocalBlock` only if ALL of the following hold:
/// 1. It is accessed on a hot path (syscall, IRQ, scheduler, slab, NAPI).
/// 2. It is per-CPU (not per-task, not global).
/// 3. It is a scalar or fixed-size pointer (no dynamically-sized data).
/// 4. It changes infrequently relative to how often it is read.
/// All other per-CPU data belongs in `PerCpu<T>`.
///
/// # ABI Stability
///
/// Assembly code in per-architecture entry paths (`entry.asm`, `entry.S`)
/// accesses `CpuLocalBlock` fields at hardcoded offsets. Any reordering or
/// removal of existing fields will silently break all architectures.
///
/// **Evolution protocol**:
/// - New fields MUST be appended at the end of the struct (alignment padding
///   comes from `#[repr(C, align(64))]`; there is no explicit `_pad` field).
/// - Never reorder or remove existing fields.
/// - Every assembly-accessed field must have a compile-time offset assertion
///   (see `const_assert!` block below the struct definition).
/// - When adding a field, update the `const_assert!` block AND all per-arch
///   assembly files that reference `CpuLocalBlock` offsets.
// kernel-internal, not KABI
#[repr(C, align(64))]
pub struct CpuLocalBlock {
    /// Pointer to the currently executing task.
    /// Read on every syscall entry, every interrupt, every context switch.
    pub current_task: *mut TaskStruct,

    /// Pointer to this CPU's runqueue.
    /// Read by the scheduler on every tick and every wakeup.
    pub runqueue: *mut RunQueue,

    /// Preemption nesting count.
    /// Incremented/decremented on every preempt_disable/enable pair.
    /// Must be in CpuLocal because preempt_disable itself needs to
    /// access it without going through a preempt-disabled guard (circular).
    ///
    /// Unlike Linux's packed `preempt_count` (which encodes preemption depth,
    /// softirq count, hardirq count, and NMI state in a single u32 with
    /// bit-field packing), UmkaOS uses separate fields for clarity and type
    /// safety. The `in_interrupt()` check uses
    /// `irq_count > 0 || softirq_count > 0`, not bitmask extraction.
    /// The `preemptible()` check uses
    /// `preempt_count == 0 && irq_count == 0 && softirq_count == 0`.
    pub preempt_count: u32,

    /// Hardirq nesting count. Incremented by `irq_enter()`, decremented by
    /// `irq_exit()`. `in_hardirq()` = `irq_count > 0`. Does NOT include
    /// softirq depth — UmkaOS uses separate typed fields for clarity instead
    /// of Linux's packed `preempt_count` bitfield layout.
    pub irq_count: u32,

    /// Softirq (bottom-half) nesting count. Incremented by `local_bh_disable()`,
    /// decremented by `local_bh_enable()`. `in_softirq()` = `softirq_count > 0`.
    /// `in_interrupt()` = `irq_count > 0 || softirq_count > 0`.
    pub softirq_count: u32,

    /// Per-CPU slab magazine pairs (one per size class).
    /// The slab fast path (alloc/free) reads and updates these pairs.
    /// See `MagazinePair` and `SLAB_SIZE_CLASSES` above (= 26, covering 8 B to 16 KB).
    /// Each entry is a `MagazinePair` containing a loaded and a spare
    /// `Option<NonNull<SlabMagazine>>`. On the fast path both are always `Some`.
    pub slab_magazines: [MagazinePair; SLAB_SIZE_CLASSES],

    /// NAPI poll budget remaining for the current poll cycle.
    pub napi_budget: u32,

    /// RCU nesting depth. Quiescent state is eligible for reporting when
    /// this drops to 0 (see Section 3.3.1 for deferred reporting design).
    pub rcu_nesting: u32,

    /// Flag: set to `true` by `RcuReadGuard::drop()` when the outermost
    /// RCU read-side critical section exits. Checked and cleared by
    /// `rcu_check_callbacks()` (called from `scheduler_tick()` and
    /// `context_switch()`), which propagates the quiescent state up the
    /// `RcuNode` tree. This deferred model avoids acquiring the leaf
    /// node's spinlock on every guard drop (~1 cycle flag write vs.
    /// ~20-50 cycle lock acquisition).
    ///
    /// `AtomicBool` (not plain `bool`) for NMI safety: `RcuReadGuard::drop()`
    /// writes this field from task/softirq context, and an NMI could fire
    /// during the write. While no current NMI handler reads this field,
    /// `AtomicBool` with `Relaxed` ordering is defensive against future
    /// diagnostic NMI handlers and costs nothing extra: a Relaxed store
    /// compiles to a plain byte store on both x86 and ARM (identical
    /// codegen to plain `bool`). The NMI write is visible within one tick
    /// boundary.
    pub rcu_passed_quiesce: AtomicBool,

    /// Timestamp of the last scheduler tick (nanoseconds, monotonic).
    pub last_tick_ns: u64,

    /// CPU index (redundant with register read, but avoids arch-specific
    /// decoding of the register value to extract the CPU number).
    pub cpu_id: u32,

    /// Isolation register shadow — per-CPU cache of the current hardware
    /// isolation register value. `switch_domain()` ([Section 11.2](11-drivers.md#isolation-mechanisms-and-performance-modes--pkru-write-elision-mandatory))
    /// compares the target value against this shadow and skips the hardware
    /// write if they match. This is mandatory: every isolation register
    /// write goes through `switch_domain()`, which enforces shadow comparison.
    /// The field is architecture-specific:
    /// - x86-64: PKRU value (u32, upper bits unused)
    /// - AArch64: POR_EL0 value (u64, POE) or ASID (u64, page-table fallback).
    ///   On AArch64 platforms without POE (FEAT_S1POE), the ASID-based isolation
    ///   fallback requires updating `isolation_shadow` during context switch.
    ///   The switch writes the incoming task's ASID and isolation domain ID to
    ///   `CpuLocalBlock.isolation_shadow`, then issues `TLBI ASIDE1IS` to
    ///   invalidate stale TLB entries. `switch_domain()` on non-POE AArch64
    ///   switches ASID via `MSR TTBR0_EL1` with the target domain's page table
    ///   base.
    /// - ARMv7: DACR value (u32)
    /// - PPC64: Radix PID (u32)
    /// - PPC32: not stored here (segment registers use a separate shadow array)
    /// - RISC-V: not applicable (page-table isolation, no register shadow)
    ///
    /// `isolation_shadow` is the arch-neutral name for the cached isolation
    /// domain register value. Each architecture's isolation module reads/writes
    /// this field via `arch::current::isolation::shadow_load()`/`shadow_store()`.
    pub isolation_shadow: u64,

    /// PPC32: Shadow copy of segment registers for isolation domain switching.
    /// Updated on switch_domain(); context switch restores from here.
    #[cfg(target_arch = "powerpc")]
    pub sr_shadow: [u32; 16],

    /// PPC64LE: MMIO barrier batching flag. Set by MMIO accessor macros,
    /// drained on `spin_unlock()`. Avoids redundant `sync` instructions
    /// when multiple MMIO writes occur within a single critical section.
    /// See [Section 3.5](#locking-strategy--mmio-barrier-batching-ppc64le).
    #[cfg(target_arch = "powerpc64")]
    pub io_sync: IoSyncFlag,

    /// Slab magazine validity flag for CPU hotplug. When `false`, slab
    /// allocations on this CPU bypass the per-CPU magazine and fall through
    /// to the depot slow path. Cleared before draining magazines when the CPU
    /// goes offline; set to `true` when the CPU comes back online and fresh
    /// magazines are allocated. Analogous to `pcp_valid` in the physical
    /// allocator ([Section 4.2](04-memory.md#physical-memory-allocator)).
    pub magazine_active: AtomicBool,

    /// Nesting depth of `SimdKernelGuard` on this CPU (§3.10.2).
    /// 0 = no kernel SIMD active. Incremented on guard acquire, decremented
    /// on drop. Used by `SimdKernelGuard::is_active()` to detect re-entry
    /// and by debug assertions to catch illegal nesting.
    /// AtomicU8 allows access via CpuLocal::get() (shared &CpuLocalBlock)
    /// without requiring get_mut(). Relaxed ordering compiles to plain
    /// loads/stores on x86-64 — zero overhead vs plain u8.
    pub simd_kernel_depth: AtomicU8,

    /// Voluntary preemption flag. Set by the scheduler when a reschedule is
    /// needed (e.g., a higher-priority task was woken). Checked on every
    /// `spin_unlock()` and `preempt_enable()` — if set and `preempt_count == 0`,
    /// the scheduler is invoked. Must be in `CpuLocalBlock` because it is
    /// read on every lock release (hot path) and written by IPI / scheduler tick.
    ///
    /// `AtomicBool` (not plain `bool`) because IPIs from remote CPUs write
    /// `true` to this field on the target CPU while the target CPU may be
    /// reading it in `preempt_enable()`. This is a concurrent write + read
    /// on the same memory — UB under the Rust memory model for non-atomic
    /// types. `Relaxed` ordering suffices: the remote CPU's write targets
    /// the destination CPU's CpuLocalBlock (a single-word store), and the
    /// ordering between that write and the target CPU's read is established
    /// by the IPI delivery mechanism itself (interrupt entry serializes).
    pub need_resched: AtomicBool,

    /// True when this CPU is executing the idle loop (between
    /// `cpu_idle_enter()` and `cpu_idle_exit()`). The RCU GP kthread
    /// reads this to detect idle CPUs and report quiescent states
    /// without sending an IPI. Set/cleared by the idle task:
    /// `cpu_idle_enter()` stores true (Release); `cpu_idle_exit()`
    /// stores false (Release). See [Section 7.1](07-scheduling.md#scheduler--idle-task).
    pub is_idle: AtomicBool,

    /// NMI nesting flag. Set to `true` on NMI entry, cleared on NMI exit.
    /// Checked by the locking strategy to prevent certain lock acquisitions
    /// in NMI context (e.g., spinlocks that are not NMI-safe). Also used by
    /// the RCU subsystem to detect NMI-within-RCU-read-side and by the perf
    /// subsystem's PMI handler.
    ///
    /// `AtomicBool` (not plain `bool`) because NMI fires asynchronously on
    /// the same CPU. The NMI entry handler writes `true` while the
    /// interrupted code may be reading `in_nmi` — two concurrent accesses
    /// where one is a write constitutes UB on non-atomic types under the
    /// Rust memory model. `Relaxed` ordering suffices (single-CPU access).
    pub in_nmi: AtomicBool,

    /// Per-CPU performance event context. Points to the `PerfEventContext`
    /// for CPU-pinned events on this core. Set during perf subsystem init;
    /// accessed from NMI handler and sampler kthread.
    pub perf_ctx: *mut PerfEventContext,

    /// Bitmask of pending softirq vectors (one bit per vector, bits 0-9
    /// correspond to HI_SOFTIRQ through RCU_SOFTIRQ). Set by
    /// `raise_softirq()` / `raise_softirq_irqoff()` via
    /// `fetch_or(bit, Relaxed)`; consumed by `do_softirq()` on IRQ exit
    /// via `swap(0, Relaxed)` for atomic snapshot-and-clear.
    ///
    /// **AtomicU32 rationale**: Although `softirq_pending` is accessed only
    /// by the local CPU (no cross-CPU writes), a hardirq can preempt
    /// `do_softirq()` between the snapshot and clear steps. Under the Rust
    /// abstract memory model, concurrent non-atomic access from different
    /// execution contexts on the same CPU (hardirq preempting softirq) is
    /// a data race — undefined behavior. `AtomicU32` with `Relaxed`
    /// ordering eliminates the UB. On x86-64 (TSO), Relaxed ops compile to
    /// plain loads/stores. On weakly-ordered architectures, Relaxed `swap`
    /// is cheaper than the IRQ-disable/enable pair Linux uses (~5-10 cycles
    /// vs ~20-30 cycles for DAIF manipulation on AArch64). AtomicU32 has
    /// the same size and alignment as u32, so offset assertions are
    /// unaffected.
    /// See [Section 3.8](#interrupt-handling--softirq-deferred-interrupt-processing).
    pub softirq_pending: AtomicU32,

    /// Domain ID of the isolation domain currently executing on this CPU.
    /// Set by the ring consumer loop when it starts processing requests
    /// in a Tier 1 domain. Cleared (set to 0) when the consumer yields
    /// or exits back to the core domain.
    ///
    /// Read by the NMI crash handler
    /// ([Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)) to determine which
    /// domain a faulting CPU belongs to. The crash handler compares
    /// `active_domain` against the revoked domain ID to decide whether
    /// this CPU should redirect to crash recovery.
    ///
    /// `AtomicU64` for NMI-safe access: the NMI handler interrupts
    /// execution asynchronously on the same CPU that is writing
    /// `active_domain` during consumer loop entry/exit. Non-atomic
    /// access would be a data race under the Rust memory model.
    /// `Relaxed` ordering suffices (single-CPU access; the NMI
    /// interrupt delivery provides implicit ordering).
    ///
    /// Value 0 = Core domain (Tier 0, no isolation). Values 1-N = Tier 1
    /// isolation domains. The domain ID space is managed by the domain
    /// service ([Section 11.3](11-drivers.md#driver-isolation-tiers)).
    ///
    /// **Relationship to `isolation_shadow`**: `isolation_shadow` holds the
    /// raw hardware register value (PKRU, POR_EL0, DACR) from which the
    /// active domain CAN be derived via architecture-specific bit extraction.
    /// `active_domain` is the pre-extracted, architecture-neutral domain ID
    /// — O(1) lookup without any bit manipulation. Both are maintained by
    /// the domain-switch path; `active_domain` is set AFTER the hardware
    /// register write so the NMI handler never sees a stale value that
    /// lags behind the actual hardware state.
    pub active_domain: AtomicU64,

    /// Pointer to the current domain's panic recovery `JmpBuf`. Used by
    /// `catch_domain_panic()` ([Section 12.8](12-kabi.md#kabi-domain-runtime)) to implement the
    /// recoverable-panic path (Path 1): when a panic occurs within a domain,
    /// `longjmp()` returns to the `setjmp()` site in the consumer loop,
    /// allowing per-request recovery without full domain teardown.
    ///
    /// Set by `catch_domain_panic()` before entering the domain dispatch loop
    /// via `swap()`. Restored to the previous value on loop exit. Value is
    /// `null` when no domain recovery point is active (Core domain execution).
    ///
    /// `AtomicPtr<()>` for NMI safety: the NMI crash handler reads this field
    /// to determine whether a recoverable panic point exists for the current
    /// CPU. The `swap()` with `Ordering::Release` in `catch_domain_panic()`
    /// pairs with the NMI handler's `load(Acquire)`.
    pub domain_panic_jmpbuf: AtomicPtr<()>,

    /// Result code from the last domain panic recovery. Set by the domain's
    /// panic hook (installed at domain creation) before calling `longjmp()`.
    /// Read by `catch_domain_panic()` after `setjmp()` returns nonzero to
    /// determine the panic reason.
    ///
    /// Values: 0 = no panic, 1 = recoverable panic (per-request), 2 = fatal
    /// (domain must be torn down). The panic hook writes this before
    /// `longjmp()` so the consumer loop can decide whether to retry or
    /// escalate to full domain teardown.
    pub domain_panic_result: AtomicU8,

    /// Per-CPU domain validity flag for trampoline TOCTOU mitigation.
    /// Set to `true` by the consumer loop after `switch_domain()` succeeds.
    /// Cleared to `false` by the crash recovery path (Step 2) on ALL CPUs
    /// for the crashed domain.
    ///
    /// The trampoline code in `device-services-and-boot.md` reads this via
    /// `per_cpu::domain_valid.load(Acquire)` immediately after switching into
    /// a domain. If `false`, the domain was invalidated between the generation
    /// check and the domain switch (TOCTOU window). The trampoline switches
    /// back to the saved domain and returns `Error::ProviderDead`.
    ///
    /// `AtomicU8` (not `AtomicBool`) for NMI safety and consistent CpuLocal
    /// access patterns. 0 = invalid, 1 = valid.
    pub domain_valid: AtomicU8,

    // No explicit padding field needed — `#[repr(C, align(64))]` on the struct
    // ensures cache-line alignment and prevents false sharing between adjacent
    // CPUs' blocks. The compiler pads to a 64-byte boundary automatically.
}
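The `softirq_pending` raise/consume protocol documented on the field above can be modeled in userspace (a sketch: a single static stands in for the per-CPU field, and the function names mirror, but are not, the kernel entry points):

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

/// Stand-in for CpuLocalBlock.softirq_pending (one bit per vector).
static SOFTIRQ_PENDING: AtomicU32 = AtomicU32::new(0);

/// Mark a softirq vector pending: one atomic OR, safe even if a hardirq
/// preempts another raise or the consumer mid-operation.
pub fn raise_softirq(vector: u32) {
    SOFTIRQ_PENDING.fetch_or(1 << vector, Relaxed);
}

/// Snapshot-and-clear: `swap(0)` takes the whole mask atomically, so a
/// vector raised by a preempting hardirq lands either in this snapshot
/// or in the next pass — never lost between a separate load and store.
pub fn do_softirq_snapshot() -> u32 {
    SOFTIRQ_PENDING.swap(0, Relaxed)
}
```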

// Compile-time offset assertions for fields accessed from assembly.
// These MUST be updated when fields are added (append-only; see ABI Stability above).
// Offsets differ between 64-bit and 32-bit targets because pointer fields
// (current_task, runqueue) are 8 bytes on 64-bit and 4 bytes on 32-bit.
const_assert!(core::mem::offset_of!(CpuLocalBlock, current_task) == 0);

#[cfg(target_pointer_width = "64")]
mod offset_asserts_64 {
    use super::*;
    const_assert!(core::mem::offset_of!(CpuLocalBlock, runqueue) == 8);
    const_assert!(core::mem::offset_of!(CpuLocalBlock, preempt_count) == 16);
    const_assert!(core::mem::offset_of!(CpuLocalBlock, irq_count) == 20);
}

#[cfg(target_pointer_width = "32")]
mod offset_asserts_32 {
    use super::*;
    const_assert!(core::mem::offset_of!(CpuLocalBlock, runqueue) == 4);
    const_assert!(core::mem::offset_of!(CpuLocalBlock, preempt_count) == 8);
    const_assert!(core::mem::offset_of!(CpuLocalBlock, irq_count) == 12);
}

// slab_magazines: offset is compiler-computed (Rust struct access via CpuLocal
// register), NOT hardcoded in assembly. No const_assert needed. If
// SLAB_SIZE_CLASSES changes, the Rust compiler automatically adjusts all field
// accesses. The assertions above cover only fields referenced by hand-written
// assembly entry stubs (entry.asm / entry.S).
const_assert!(core::mem::align_of::<CpuLocalBlock>() == 64);

// Total struct size assertion (per-arch, defensive against accidental growth).
// Size is dominated by slab_magazines (26 × MagazinePair). Growth directly
// impacts BSS usage: nr_cpus × size_of::<CpuLocalBlock>().
// On 64-bit targets, slab_magazines dominates: 26 × MagazinePair (each pair
// is two Option<NonNull<SlabMagazine>> fields = 2 × 8 bytes via niche
// optimisation, same size as two raw pointers) = 416 bytes. On 32-bit
// targets, pointers shrink to 4 bytes, reducing slab_magazines to
// 26 × 8 = 208 bytes. Other pointer fields
// (current_task, runqueue) also shrink by 4 bytes each. The PPC32-only
// sr_shadow: [u32; 16] adds 64 bytes back — the net result is still well
// within 4096 bytes on 32-bit. The 4096-byte bound is a hard page-size
// cap — exceeding it would require multi-page CpuLocalBlock allocation,
// complicating the register-based access pattern.
const_assert!(core::mem::size_of::<CpuLocalBlock>() <= 4096);
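The offset reasoning behind the 64-bit assertions can be checked on a 64-bit host with a stand-in struct (`LayoutDemo` is hypothetical; `offset_of!` requires Rust 1.77+):

```rust
/// Reproduces the leading (assembly-visible) layout of CpuLocalBlock
/// on a 64-bit target: two 8-byte pointers, then packed u32 counters.
#[repr(C, align(64))]
pub struct LayoutDemo {
    pub current_task: *mut u8, // offset 0
    pub runqueue: *mut u8,     // offset 8
    pub preempt_count: u32,    // offset 16
    pub irq_count: u32,        // offset 20
    pub softirq_count: u32,    // offset 24
}
```

Because the u32 counters pack without padding under `repr(C)`, the offsets match the hardcoded values in the per-arch entry assembly, and `align(64)` rounds the total size up to a full cache line.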

3.2.1.1 JmpBuf Type and setjmp/longjmp Primitives

The domain panic recovery mechanism (catch_domain_panic() in Section 12.8) uses architecture-specific setjmp/longjmp primitives to implement non-local returns from panicking domain code. These are NOT the C library functions — they are kernel-internal primitives with restricted semantics (no signal mask save/restore, no stack unwinding).

/// Architecture-specific jump buffer for setjmp/longjmp.
/// Stores callee-saved registers so that longjmp() can restore them
/// and resume execution at the setjmp() return point.
///
/// Each architecture saves a different set of callee-saved registers:
///
/// | Architecture | Callee-saved registers | JmpBuf size |
/// |---|---|---|
/// | x86-64 | rbx, rbp, r12-r15, rsp, rip | 64 bytes |
/// | AArch64 | x19-x28, x29(fp), x30(lr), sp, fpcr, fpsr | 128 bytes |
/// | ARMv7 | r4-r11, sp, lr, d8-d15 (VFP) | 104 bytes |
/// | RISC-V 64 | s0-s11, sp, ra, fs0-fs11 | 208 bytes |
/// | PPC32 | r14-r31, sp, lr, cr | 80 bytes |
/// | PPC64LE | r14-r31, sp, lr, cr, toc(r2) | 160 bytes |
/// | s390x | r6-r13, r14(lr), r15(sp), f8-f15 | 160 bytes |
/// | LoongArch64 | s0-s8, sp, ra, fp, fs0-fs7 | 176 bytes |
///
/// All sizes are padded to 8-byte alignment for consistent stack layout.
// kernel-internal, not KABI
#[repr(C, align(8))]
pub struct JmpBuf {
    /// Opaque register save area. Size is the maximum across all
    /// architectures (208 bytes for RISC-V 64). Architectures with
    /// smaller JmpBuf use only the leading bytes.
    regs: [u8; 208],
}

impl JmpBuf {
    /// Create a zeroed JmpBuf. Must be initialized by `setjmp()` before
    /// any call to `longjmp()`.
    pub const fn new() -> Self {
        Self { regs: [0u8; 208] }
    }
}
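The sizing claim — one opaque buffer at the maximum per-architecture size, padded only by its own alignment — can be checked on a host build (a sketch; `JmpBufDemo` is a hypothetical stand-in mirroring the declared representation):

```rust
/// Stand-in with JmpBuf's declared representation: an opaque 208-byte
/// register save area at 8-byte alignment.
#[repr(C, align(8))]
pub struct JmpBufDemo {
    regs: [u8; 208],
}
```

Since 208 is already a multiple of 8, `repr(C, align(8))` adds no trailing padding, so `JmpBuf` costs exactly 208 bytes of stack in `catch_domain_panic()` on every architecture.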

/// Save the current execution context (callee-saved registers, stack
/// pointer, return address) into `buf`. Returns 0 on the initial call.
/// When `longjmp(buf, val)` is called, execution resumes at the
/// `setjmp()` return point with return value `val` (nonzero).
///
/// # Safety
///
/// - `buf` must be valid and aligned.
/// - The stack frame that called `setjmp()` must still be active when
///   `longjmp()` is called (no return from the calling function between
///   setjmp and longjmp). In the consumer loop, this is guaranteed because
///   `catch_domain_panic()` allocates `jmp_buf` on its own stack frame
///   and the consumer loop never returns while the domain is active.
/// - Must not be called from interrupt context (NMI, IRQ handlers).
///
/// # Architecture implementation
///
/// Each arch provides `arch::current::cpu::setjmp()` as a naked asm function
/// that saves callee-saved registers into `buf.regs` and returns 0.
pub unsafe fn setjmp(buf: &mut JmpBuf) -> i32;

/// Restore the execution context saved by `setjmp()` and resume
/// execution at the `setjmp()` return point with return value `val`.
/// Does not return.
///
/// # Safety
///
/// - `buf` must have been initialized by a prior `setjmp()` call.
/// - The stack frame from the `setjmp()` call must still be active.
/// - `val` must be nonzero (0 is reserved for the initial `setjmp()` return).
pub unsafe fn longjmp(buf: &JmpBuf, val: i32) -> !;

Per-domain panic hook installation: At domain creation, the domain service installs a panic hook that writes domain_panic_result to CpuLocalBlock and calls longjmp() on the current CPU's domain_panic_jmpbuf. The hook is a function pointer stored in the domain descriptor:

/// Panic hook installed by the domain service at domain creation.
/// Called by the Rust panic handler when a panic originates from code
/// executing within this domain (identified via CpuLocalBlock.active_domain).
///
/// The hook:
/// 1. Writes the panic classification to cpu_local.domain_panic_result
///    (1 = recoverable, 2 = fatal — based on panic payload inspection).
/// 2. Loads the jmpbuf pointer from cpu_local.domain_panic_jmpbuf.
/// 3. If non-null: calls longjmp() to return to catch_domain_panic().
/// 4. If null: falls through to the full domain teardown path (Path 2
///    in the crash recovery sequence).
type DomainPanicHook = fn(info: &core::panic::PanicInfo) -> !;
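A userspace analogue of the recoverable path (Path 1) can be sketched with `std::panic::catch_unwind` standing in for the kernel's setjmp/longjmp pair (`dispatch_one_request` is a hypothetical name for one consumer-loop iteration; the kernel primitives do not unwind, so this models only the control flow, not the mechanism):

```rust
/// One "domain request": a panic inside the closure is caught at the
/// recovery point, the request is dropped, and the caller keeps running —
/// mirroring per-request recovery without domain teardown.
pub fn dispatch_one_request(req: u32) -> Result<u32, &'static str> {
    std::panic::catch_unwind(|| {
        if req == 0 {
            panic!("malformed request"); // domain code panics
        }
        req * 2 // normal completion
    })
    .map_err(|_| "recoverable panic: request dropped, domain kept alive")
}
```

As in the kernel design, the panic classification happens at the hook (here, `catch_unwind`'s `Err` arm); a result the caller inspects decides between retry and escalation.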

AP initialization: On the BSP, the CpuLocalBlock is initialized during early boot before SMP bringup. For APs, initialization occurs in secondary_cpu_init() (each arch boot sequence file documents the specific AP entry flow — see Section 2.2, Section 2.5, Section 2.7, etc.). APs spin on a boot_flag until the BSP signals readiness; all CpuLocalBlock fields are zero-initialized before the dedicated register (GS base, TPIDR_EL1, tp) is set.

Architecture accessor functions (provided by arch::current::cpu):

/// Returns a raw pointer to the current CPU's CpuLocalBlock.
/// On x86-64: reads GS base (zero instructions — the offset is encoded
/// in subsequent gs:-prefixed loads, so this function is a no-op that
/// returns a sentinel used by the inline accessors below).
/// On AArch64: `mrs x0, tpidr_el1` (1 instruction).
/// On RISC-V: `mv x0, tp` (1 instruction — tp holds per-CPU base in kernel mode).
///
/// # Safety
///
/// Caller must have preemption disabled. The returned pointer is valid
/// only on the current CPU. If preemption is re-enabled, the pointer
/// may refer to another CPU's block after migration.
#[inline(always)]
pub unsafe fn cpu_local_block() -> *const CpuLocalBlock;

/// Convenience wrapper: obtain a shared reference to this CPU's `CpuLocalBlock`.
///
/// # Safety
///
/// Caller must have preemption disabled. The reference is valid only on the
/// current CPU. Used by pseudocode throughout the spec as `CpuLocal::get()`.
#[inline(always)]
pub unsafe fn get() -> &'static CpuLocalBlock {
    // SAFETY: cpu_local_block() returns a valid pointer when preemption
    // is disabled; the static lifetime is bounded by kernel lifetime.
    &*cpu_local_block()
}

/// Convenience wrapper: obtain an exclusive reference to this CPU's `CpuLocalBlock`.
///
/// # Safety
///
/// Caller must have preemption disabled. No other reference to this CPU's
/// block may be live. Used by pseudocode as `CpuLocal::get_mut()`.
#[inline(always)]
pub unsafe fn get_mut() -> &'static mut CpuLocalBlock {
    // SAFETY: single-CPU access with preemption disabled guarantees
    // exclusive access. The pointer is valid for the kernel's lifetime.
    &mut *(cpu_local_block() as *mut CpuLocalBlock)
}

/// Read the current task pointer. Single instruction on x86-64.
#[inline(always)]
pub fn current_task() -> *mut TaskStruct {
    // SAFETY: Safe from any kernel context. The CpuLocal register
    // (GS on x86-64, TPIDR_EL1 on AArch64, etc.) is always valid in
    // kernel mode. The pointer-sized load is atomic on all architectures
    // (single MOV/LDR). The TaskStruct is reference-counted and cannot
    // be freed while the task is executing. Preemption does NOT need to
    // be disabled — even if the task is preempted immediately after this
    // read, the returned pointer remains valid.
    unsafe { (*cpu::cpu_local_block()).current_task }
}

/// Read this CPU's runqueue pointer.
///
/// **Note on preemption**: Unlike `current_task()`, the runqueue pointer
/// is meaningful only on the CPU that owns it. If the caller is preempted
/// and migrated to another CPU after reading this pointer, the returned
/// pointer refers to the *original* CPU's runqueue, not the new CPU's.
/// Callers that need a stable runqueue reference must hold a preemption
/// guard or accept reading a potentially-stale-but-valid pointer.
#[inline(always)]
pub fn this_rq() -> *mut RunQueue {
    // SAFETY: Same as current_task() — CpuLocal register is valid in
    // kernel mode; pointer-sized load is atomic on all architectures.
    // The RunQueue is statically allocated and never freed.
    unsafe { (*cpu::cpu_local_block()).runqueue }
}

On x86-64, current_task() compiles to a single mov %gs:0, %rax, matching Linux's current macro exactly.

CpuLocal vs PerCpu — when to use which:

| Criterion | CpuLocal | PerCpu\<T> |
|---|---|---|
| Access cost | ~1-10 cycles (arch-dependent) | ~3-5 cycles (release) / ~20-30 cycles (debug) |
| Borrow checking | None (structural safety) | Debug-only CAS (see Section 3.3) |
| Data types | Fixed scalars/pointers only | Any T |
| Number of fields | ~20 (fixed at compile time) | Unlimited (one PerCpu\<T> per data type) |
| Adding new fields | Requires CpuLocalBlock layout change | Just declare a new PerCpu\<T> |
| Interrupt safety | Inherent (no borrow state to corrupt) | get_mut() disables IRQs |
| Use case | current_task, runqueue, preempt_count, slab magazines | Per-CPU page pools (Section 4.2), stats, driver state |

Slab magazines use CpuLocal (~1-10 cycles) because slab is the hottest allocation path. PCP page lists use PerCpu<T> because page allocation is less frequent and PcpPagePool is too large for CpuLocalBlock (see Section 4.2).

3.2.2 Initialization Sequence

Static allocation: One CpuLocalBlock per logical CPU is allocated in a .cpulocal BSS section at link time: static mut CPU_LOCAL_BLOCKS: [CpuLocalBlock; MAX_CPUS] where MAX_CPUS is a compile-time upper bound (default 4096). At boot, the actual CPU count is discovered from ACPI MADT / device tree; only entries 0..nr_cpus are used. The compile-time array can be enlarged for larger systems without changing the algorithm — it is not a runtime limit.
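A minimal userspace sketch of the static block array described above. The field set, layout, and the `.cpulocal` section name are illustrative assumptions drawn from the fields mentioned in this chapter, not the authoritative kernel definition:

```rust
// Sketch of the statically allocated per-CPU block array. In the kernel
// this would carry `#[link_section = ".cpulocal"]` and live in BSS;
// BSS zero-initialization makes every field read as zero before init.
const MAX_CPUS: usize = 4096; // compile-time upper bound from the text

#[repr(C)]
#[derive(Clone, Copy)]
struct CpuLocalBlock {
    cpu_id: u32,
    preempt_count: u32,
    irq_count: u32,      // separate counters, not a packed bitfield
    softirq_count: u32,
    current_task: usize, // *mut TaskStruct in the kernel; usize here
    runqueue: usize,     // *mut RunQueue in the kernel; usize here
}

static CPU_LOCAL_BLOCKS: [CpuLocalBlock; MAX_CPUS] = [CpuLocalBlock {
    cpu_id: 0,
    preempt_count: 0,
    irq_count: 0,
    softirq_count: 0,
    current_task: 0,
    runqueue: 0,
}; MAX_CPUS];

fn main() {
    // Zero-initialized: every entry is a valid "not yet initialized" block.
    assert_eq!(CPU_LOCAL_BLOCKS[0].preempt_count, 0);
    assert_eq!(CPU_LOCAL_BLOCKS[MAX_CPUS - 1].cpu_id, 0);
    println!("{} blocks x {} bytes", MAX_CPUS, core::mem::size_of::<CpuLocalBlock>());
}
```

Only entries `0..nr_cpus` are ever pointed at by an architecture register; the rest stay zeroed and unused.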

BSP initialization (Phase 0, before heap):

  1. Zero CPU_LOCAL_BLOCKS[0] (BSS guarantees this; explicitly documented for clarity).
  2. Point the architecture register at &CPU_LOCAL_BLOCKS[0]:
     • x86-64: WRMSR MSR_GS_BASE with address of CPU_LOCAL_BLOCKS[0]
     • AArch64: MSR TPIDR_EL1, <addr>
     • ARMv7: MCR p15, 0, <addr>, c13, c0, 4 (TPIDRPRW)
     • RISC-V: mv tp, <addr> then CSRW sscratch, 0 (tp = per-CPU base, sscratch = 0 = kernel mode)
     • PPC64: r13 is the dedicated thread pointer; load &CPU_LOCAL_BLOCKS[0] into r13 before any kernel Rust code executes
     • PPC32: r2 serves as the kernel base pointer; the per-CPU block address is stored at a well-known offset from the SPRG0 scratch register. Note: SPRG0 is used here for the BSP init sequence (boot-time scratch), while the CpuLocal fast-path table above uses SPRG3 (mfspr) for runtime per-CPU access. Both registers are designated for OS use on PPC32.
  3. Set cpu_id = 0, preempt_count = 0, current_task = &idle_task[0].
  4. runqueue, irq_count, and slab_magazines remain zeroed (valid default).

BSP Phase 2 (after slab init):

  1. Initialize slab magazines for CPU 0: call slab_init_cpu_magazines(0).

AP initialization (during SMP bring-up, per AP n):

  1. BSP writes the AP's CpuLocalBlock address into AP_TRAMPOLINE_CPULOCAL[n] (a per-AP word in the trampoline page) with store(Release) ordering.
  2. BSP sends the platform wake signal:
     • x86-64: INIT-SIPI-SIPI sequence to the AP's local APIC ID
     • AArch64/ARMv7: PSCI CPU_ON(cpu_id, entry, context_id) via HVC/SMC
     • RISC-V: SBI HSM_HART_START(hartid, start_addr, opaque)
     • PPC64: RTAS start-cpu or direct OPAL call
  3. AP trampoline: load(Acquire) from AP_TRAMPOLINE_CPULOCAL[n] to get its block address; the AP writes the address into its own arch register (same sequence as BSP step 2 above).
  4. AP writes cpu_id = n, preempt_count = 0, current_task = &idle_task[n].
  5. AP initializes its own per-CPU slab magazines: slab_init_cpu_magazines(n). (The AP owns its CpuLocalBlock — the BSP must never write to a remote AP's block.)
  6. AP signals ready: AP_READY[n].store(true, Release).
  7. BSP waits: while !AP_READY[n].load(Acquire) { core::hint::spin_loop(); }.

Invariant: No kernel code may access CpuLocal::* on a CPU before that CPU's architecture register has been pointed at its CpuLocalBlock (BSP step 2; the AP trampoline's register write). The preempt_count field reads as zero by BSS convention even before that point, but only the owning CPU may write to its own block. Cross-CPU writes to another CPU's block are never permitted; the only cross-CPU interaction is the AP_TRAMPOLINE_CPULOCAL handshake (written by the BSP before wakeup, read by the AP during the trampoline).

For the complete kernel init phase ordering across all subsystems, see the Kernel Init Phase Reference table in Section 2.3.

3.3 PerCpu Borrow Checking: Debug-Only (Elided in Release Builds)

The PerCpu<T> borrow-state CAS (Section 3.1) serves as a runtime bug detector, not a safety mechanism. The actual safety guarantee comes from the structural invariants:

  1. get() requires &PreemptGuard → preemption disabled → CPU pinned.
  2. get_mut() requires &mut PreemptGuard + disables IRQs → exclusive access.
  3. Therefore: if the caller follows the API contract (one guard per critical section), aliased access is structurally impossible.

Note: PCP page pools (Section 4.2) require IRQs disabled (not just preemption disabled) because interrupt handlers may access PCP pools. This is why get_mut() unconditionally calls local_irq_save() — preemption-disable alone is insufficient for data structures accessed from hardirq context.

The CAS detects violations of rule (3) — e.g., creating two PreemptGuards and using both to obtain &mut T. This is a logic error in the caller, not a race condition. In release builds, this class of bug should have been caught during development and testing.
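The borrow-state CAS can be exercised in isolation: 0 means free, u32::MAX means mutably borrowed, and a second acquire while borrowed fails the compare-exchange. This is a self-contained model of the check, not the kernel's PerCpu implementation:

```rust
// Model of the debug borrow-state protocol: compare_exchange(0, MAX)
// to acquire, store(0, Release) to release. A failed CAS is exactly
// the aliasing bug that debug builds turn into a panic.
use std::sync::atomic::{AtomicU32, Ordering};

fn try_borrow_mut(state: &AtomicU32) -> bool {
    state
        .compare_exchange(0, u32::MAX, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}

fn release(state: &AtomicU32) {
    state.store(0, Ordering::Release); // matches PerCpuMutGuard::drop
}

fn main() {
    let borrow_state = AtomicU32::new(0);
    assert!(try_borrow_mut(&borrow_state));  // first borrow: ok
    assert!(!try_borrow_mut(&borrow_state)); // aliased borrow: detected
    release(&borrow_state);
    assert!(try_borrow_mut(&borrow_state));  // free again after drop
}
```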

Design decision: In release builds (cfg(not(debug_assertions))), the borrow-state CAS is replaced with a no-op. The borrow_state array is still allocated (for binary compatibility with debug modules), but get() and get_mut() skip the atomic operations:

impl<T> PerCpu<T> {
    pub fn get_mut<'g>(&self, guard: &'g mut PreemptGuard) -> PerCpuMutGuard<'g, T> {
        let cpu = guard.cpu_id();

        // SAFETY: local_irq_save() MUST be called BEFORE updating borrow_state.
        // See Section 3.1.1 for the full rationale.
        let saved_flags = local_irq_save();

        #[cfg(debug_assertions)]
        {
            // Full CAS borrow-state checking — catches aliasing bugs.
            let state = self.borrow_state(cpu);
            if state.compare_exchange(0, u32::MAX, Ordering::Acquire, Ordering::Relaxed).is_err() {
                local_irq_restore(saved_flags);
                panic!("PerCpu: slot {} already borrowed", cpu);
            }
        }

        // SAFETY: PreemptGuard proves CPU is pinned. local_irq_save()
        // prevents interrupt handler interference. In debug builds, the
        // CAS above additionally verifies no aliased borrows exist.
        // In release builds, we trust the structural invariants.
        unsafe {
            PerCpuMutGuard {
                value: &mut *self.data.add(cpu).as_ref().unwrap().get(),
                saved_flags,
                #[cfg(debug_assertions)]
                borrow_state: self.borrow_state(cpu),
                _guard: PhantomData,
            }
        }
    }
}

impl<'a, T> Drop for PerCpuMutGuard<'a, T> {
    fn drop(&mut self) {
        #[cfg(debug_assertions)]
        {
            self.borrow_state.store(0, Ordering::Release);
        }
        local_irq_restore(self.saved_flags);
    }
}

Release-mode cost: get_mut() = CPU ID lookup (~1-3 cycles) + array index + local_irq_save/local_irq_restore (~5-10 cycles total). Approximately ~6-13 cycles per access, down from ~20-30 with the CAS.

3.3.1 IRQ Save/Restore Elision: get_mut_nosave()

The get_mut() function unconditionally calls local_irq_save() and local_irq_restore() to guarantee exclusive access. On x86-64, this costs ~5-10 cycles (pushfq/cli + popfq). However, many hot paths already have IRQs disabled at the call site:

  • Hardirq handlers: IRQs disabled by hardware on entry.
  • Softirq context: IRQs disabled during do_softirq() execution.
  • Spinlock holders: spin_lock_irqsave() disables IRQs before acquiring.
  • Context switch path: schedule() disables IRQs around runqueue locking.

In these contexts, the IRQ save/restore is redundant — IRQs are already off. UmkaOS provides a proof-token variant that elides the redundant operation. This is a core design decision, not a deferred optimization — the IrqDisabledGuard token is woven into the type system from day one.

The two primitive operations are implemented per-arch in umka_core::arch::current::interrupts:

/// Disable local interrupts and return the previous interrupt state (flags register
/// value on x86, DAIF on AArch64, `sstatus.SIE` on RISC-V, etc.).
/// Returns an opaque `usize` whose only valid use is as the argument to `local_irq_restore`.
/// Ordering: `local_irq_save()` acts as a compiler fence — no reads/writes are
/// reordered across it. Does NOT prevent NMIs (non-maskable interrupts).
pub fn local_irq_save() -> usize { /* arch-specific */ }

/// Restore interrupt state saved by a prior `local_irq_save()`.
/// Must be called on the same CPU that called `local_irq_save()`.
/// Calling this on a different CPU is undefined behavior.
pub fn local_irq_restore(flags: usize) { /* arch-specific */ }

Per-architecture IRQ save/restore:

The IrqDisabledGuard uses architecture-specific instructions to save and restore the interrupt enable flag atomically. The pseudocode for each supported architecture:

Architecture Save (disable IRQs, return old state) Restore
x86-64 PUSHF; POP rax (rax = eflags); CLI PUSH rax; POPF
AArch64 MRS x0, DAIF; MSR DAIFSet, #0xF MSR DAIF, x0
ARMv7 MRS r0, CPSR; CPSID if MSR CPSR_c, r0
RISC-V 64 CSRRCI a0, sstatus, 0x2 (returns old sstatus, clears SIE) CSRW sstatus, a0
PPC32 MFMSR r0; RLWINM r1,r0,0,17,15; MTMSR r1 MTMSR r0
PPC64LE MFMSR r0; LI r1,MSR_RI; MTMSRD r1,1 (clears EE, sets RI) MTMSRD r0,1
s390x STNSM mask,0xFC (save PSW mask, clear I/O + external IRQ bits) SSM mask (restore saved PSW mask)
LoongArch64 CSRRD r0, CRMD (save CRMD); CSRXCHG zero, IE_MASK, CRMD (clear IE bit) CSRWR r0, CRMD (restore saved CRMD)

Notes:

  • x86-64: PUSHF/POP captures the entire EFLAGS including IF (bit 9). CLI clears IF. PUSH/POPF restores all flags including IF.
  • AArch64: DAIF = Debug/SError/IRQ/FIQ mask bits. MSR DAIFSet, #0xF sets all four mask bits (disables all async exceptions). Restore writes the entire DAIF register, re-enabling only the exceptions that were enabled before save.
  • ARMv7: CPSR_c = control byte of CPSR. CPSID if = disable IRQ+FIQ.
  • RISC-V: sstatus.SIE (bit 1) controls supervisor interrupt enable. CSRRCI atomically reads the old value and clears SIE in one instruction.
  • PPC32: MSR.EE (bit 15 counting from the LSB, mask 0x8000) = external interrupt enable. RLWINM with the wrap-around mask clears it. MTMSR restores.
  • PPC64LE: MTMSRD with L=1 only updates bits 48 (EE) and 62 (RI) from the source register; all other MSR bits are unchanged. LI r1,MSR_RI loads a value with EE=0, RI=1; MTMSRD r1,1 clears EE while preserving RI. Note: Linux PPC64 BOOK3S uses software IRQ masking (PACA store) for the common case; the hardware MSR approach shown here is the hard-disable path.
  • s390x: STNSM saves the PSW mask byte and clears the specified bits. Bits 6 (I/O interrupts) and 7 (external interrupts) are cleared by masking with 0xFC. SSM restores the saved mask byte.
  • LoongArch64: CRMD.IE (bit 2) controls interrupt enable. CSRXCHG performs an atomic read-modify-write on the CSR: zero with IE_MASK clears IE. CSRWR restores the entire CRMD register from the saved value.
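The x86-64 row of the table can be modeled with plain integers: "save" returns the old flags and clears IF (bit 9), "restore" writes the saved flags back. This is a userspace sketch of the pairing semantics, with integer bit operations standing in for PUSHF/CLI/POPF; it also shows why nested save/restore pairs compose correctly:

```rust
// Model of x86-64 IRQ save/restore: the saved flags word, not a
// boolean, is what makes nesting work — the inner restore re-installs
// "IRQs disabled", the outer restore re-installs "IRQs enabled".
const IF_BIT: usize = 1 << 9; // EFLAGS.IF

fn local_irq_save_model(eflags: &mut usize) -> usize {
    let saved = *eflags; // PUSHF; POP rax
    *eflags &= !IF_BIT;  // CLI
    saved
}

fn local_irq_restore_model(eflags: &mut usize, saved: usize) {
    *eflags = saved;     // PUSH rax; POPF
}

fn main() {
    let mut eflags = IF_BIT | 0x2; // IRQs enabled (bit 1 is always set)

    let outer = local_irq_save_model(&mut eflags);
    assert_eq!(eflags & IF_BIT, 0); // IRQs now masked

    // Nested pair: restore of the inner save leaves IRQs masked,
    // because the inner save captured the already-masked state.
    let inner = local_irq_save_model(&mut eflags);
    local_irq_restore_model(&mut eflags, inner);
    assert_eq!(eflags & IF_BIT, 0);

    local_irq_restore_model(&mut eflags, outer);
    assert_ne!(eflags & IF_BIT, 0); // outer restore re-enables
}
```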

The platform-independent IrqDisabledGuard::new() calls arch::current::interrupts::local_irq_save(), which expands to the appropriate instruction sequence above. This is a zero-cost abstraction: the compiler selects the single architecture's path at compile time.

IrqDisabledGuard is the RAII wrapper returned by local_irq_save(). Its Drop implementation calls the per-arch restore sequence, ensuring correct pairing:

/// Proof token that IRQs are disabled on the current CPU. Created by
/// `local_irq_save()` or `irq_disabled_guard()` (the latter asserts
/// that IRQs are already disabled and constructs the token without
/// a redundant CLI). The token is `!Send` and `!Sync` — it cannot
/// be transferred to another CPU where IRQ state may differ.
///
/// **Implies preemption disabled.** Disabling hardware IRQs masks the
/// scheduler tick interrupt, so the scheduler cannot preempt the running
/// task while this guard is held. Code holding `IrqDisabledGuard` therefore
/// has the same preemption-safety guarantee as code holding `PreemptGuard`.
/// The converse is NOT true: `PreemptGuard` alone (via `preempt_disable()`)
/// does not disable IRQs — interrupt handlers can still fire and must not
/// call `get_mut_nosave()` unless they hold their own `IrqDisabledGuard`.
pub struct IrqDisabledGuard {
    /// Saved interrupt flags. If this guard was created by
    /// `local_irq_save()`, Drop restores them. If created by
    /// `irq_disabled_guard()` (assertion-only), Drop restores nothing;
    /// a debug assertion verifies IRQs are still disabled.
    saved_flags: Option<usize>,
    _not_send: PhantomData<*const ()>,
}

impl Drop for IrqDisabledGuard {
    fn drop(&mut self) {
        if let Some(flags) = self.saved_flags {
            // Restore the flags saved by `local_irq_save()`; this may
            // re-enable IRQs if they were enabled before the save.
            local_irq_restore(flags);
        } else {
            // saved_flags is None: this is the assertion-only variant created
            // by `irq_disabled_guard()`. The caller is responsible for
            // maintaining the IRQ-disabled state. Debug assertion verifies
            // the invariant was not violated (e.g., by dropping a SpinLockGuard
            // that re-enabled IRQs while a PerCpuMutRefNosave was still live).
            #[cfg(debug_assertions)]
            {
                assert!(
                    !arch::current::interrupts::are_enabled(),
                    "IrqDisabledGuard (None variant) dropped with IRQs enabled — \
                     the caller violated the safety contract by re-enabling IRQs \
                     while this proof token was live"
                );
            }
        }
    }
}

/// Assert that IRQs are already disabled and obtain a proof token.
///
/// # Safety
///
/// The caller must guarantee that hardware interrupts remain disabled
/// for the entire lifetime of the returned guard. Specifically:
/// - IRQs must be disabled at the point of call (verified by debug assert).
/// - No other guard (e.g., `SpinLockGuard`) that might re-enable IRQs may
///   be dropped while a `PerCpuMutRefNosave` derived from this token is
///   still live. The borrow checker ties the `PerCpuMutRefNosave` lifetime
///   to this guard, but cannot track the hardware IRQ state changing via
///   an independent guard's Drop.
///
/// Violating this invariant allows interrupt handlers to race on per-CPU
/// data, producing undefined behavior.
pub unsafe fn irq_disabled_guard() -> IrqDisabledGuard {
    #[cfg(debug_assertions)]
    {
        assert!(
            !arch::current::interrupts::are_enabled(),
            "irq_disabled_guard() called with IRQs enabled"
        );
    }
    IrqDisabledGuard {
        saved_flags: None,
        _not_send: PhantomData,
    }
}

The PerCpu<T> variant that accepts the proof token:

/// Guard returned by `PerCpu::get_mut_nosave()`.
///
/// Holds a mutable reference to the per-CPU value. Unlike `PerCpuMutGuard<T>`,
/// this guard does NOT save/restore IRQ flags: the caller already holds an
/// `IrqDisabledGuard` proving IRQs are disabled, so saving/restoring them
/// again would be redundant (~5-10 cycles saved on x86-64 per access).
///
/// # Safety invariant
/// The caller must ensure `IrqDisabledGuard` remains live for the entire
/// lifetime `'a` of this guard. Dropping the `IrqDisabledGuard` while this
/// guard is alive would re-enable IRQs, allowing interrupt handlers to race
/// on the same per-CPU slot — undefined behaviour.
///
/// The `PhantomData<&'a IrqDisabledGuard>` field enforces this at the type
/// level: the borrow checker will not allow the `IrqDisabledGuard` to be
/// consumed (moved or dropped) while a `PerCpuMutRefNosave` derived from it
/// is still in scope.
///
/// # Drop behaviour
/// `Drop` only clears the debug-mode borrow state (sets the per-slot
/// `borrow_state` back to `0`). It does NOT call `local_irq_restore()` —
/// that is the caller's responsibility via their `IrqDisabledGuard`.
pub struct PerCpuMutRefNosave<'a, T: 'a> {
    value: &'a mut T,
    /// Proof that the caller holds an `IrqDisabledGuard` for lifetime `'a`.
    _irq: PhantomData<&'a IrqDisabledGuard>,
    /// Debug-builds only: reference to the per-slot borrow-state counter so
    /// that `Drop` can reset it to `0` (free), allowing subsequent callers
    /// to detect aliasing. Elided in release builds — the structural
    /// invariants (IRQs disabled + preemption disabled) are sufficient.
    #[cfg(debug_assertions)]
    borrow_state: &'a AtomicU32,
    /// Ties guard lifetime to the `PreemptGuard` (via `get_mut_nosave` signature).
    _guard: PhantomData<&'a mut PreemptGuard>,
}

impl<'a, T> core::ops::Deref for PerCpuMutRefNosave<'a, T> {
    type Target = T;
    fn deref(&self) -> &T { self.value }
}

impl<'a, T> core::ops::DerefMut for PerCpuMutRefNosave<'a, T> {
    fn deref_mut(&mut self) -> &mut T { self.value }
}

/// `PerCpuMutRefNosave` must NOT be sent to another CPU/thread.
/// The contained `&'a mut T` is a per-CPU slot reference; moving it to a
/// different thread would allow that thread to alias it without holding the
/// required `IrqDisabledGuard` or `PreemptGuard`.
impl<T> !Send for PerCpuMutRefNosave<'_, T> {}
impl<T> PerCpu<T> {
    /// Like `get_mut()`, but skips `local_irq_save()`/`local_irq_restore()`.
    /// The caller must prove IRQs are already disabled by providing an
    /// `IrqDisabledGuard`. This saves ~5-10 cycles on x86-64 (no `pushfq`/
    /// `cli`/`popfq`) and ~3-8 cycles on ARM (no `mrs`/`msr` DAIF).
    ///
    /// # When to use
    ///
    /// Use `get_mut_nosave()` instead of `get_mut()` when:
    /// - Inside a hardirq or softirq handler (IRQs disabled by hardware).
    /// - Holding a `SpinLockGuard` from `spin_lock_irqsave()`.
    /// - In the context switch path (scheduler holds IRQ-disabling lock).
    /// - In any code path where an `IrqDisabledGuard` is already available.
    ///
    /// In all other contexts, use `get_mut()` which manages IRQ state itself.
    ///
    /// # Why `&IrqDisabledGuard`, not `&PreemptGuard`
    ///
    /// `IrqDisabledGuard` is a strictly stronger guarantee than `PreemptGuard`:
    /// it subsumes preemption disabling (tick IRQ is masked) while also preventing
    /// interrupt handlers from racing on the same per-CPU slot. `PreemptGuard`
    /// alone would not be sufficient — an IRQ could fire between the CPU-id read
    /// and the slot access, running a handler that mutably accesses the same slot.
    pub fn get_mut_nosave<'g, 'irq>(
        &self,
        guard: &'g mut PreemptGuard,
        _irq: &'irq IrqDisabledGuard,
    ) -> PerCpuMutRefNosave<'g, T>
    where
        'irq: 'g,  // IrqDisabledGuard must outlive the returned reference
    {
        let cpu = guard.cpu_id();

        #[cfg(debug_assertions)]
        {
            let state = self.borrow_state(cpu);
            if state.compare_exchange(0, u32::MAX, Ordering::Acquire, Ordering::Relaxed).is_err() {
                panic!("PerCpu: slot {} already borrowed", cpu);
            }
        }

        // No local_irq_save() — the IrqDisabledGuard proves IRQs are off.
        // SAFETY: PreemptGuard pins to this CPU. IrqDisabledGuard proves
        // IRQs disabled. Together, these guarantee exclusive access.
        unsafe {
            PerCpuMutRefNosave {
                value: &mut *self.data.add(cpu).as_ref().unwrap().get(),
                _irq: PhantomData,
                #[cfg(debug_assertions)]
                borrow_state: self.borrow_state(cpu),
                _guard: PhantomData,
            }
        }
    }
}

// Mutable reference guard that does NOT save/restore IRQ flags on drop.
// Created by `get_mut_nosave()`. Drop only clears the debug borrow state.
impl<'a, T> Drop for PerCpuMutRefNosave<'a, T> {
    fn drop(&mut self) {
        #[cfg(debug_assertions)]
        {
            self.borrow_state.store(0, Ordering::Release);
        }
        // No local_irq_restore() — caller's IrqDisabledGuard manages IRQ state.
    }
}

Cost comparison (x86-64, release builds):

Variant IRQ management Total cycles Use case
get_mut() pushfq/cli + popfq ~6-13 General per-CPU mutation
get_mut_nosave() None (proof token) ~1-3 Hardirq, softirq, spinlock, scheduler

On hot paths (NVMe interrupt handler, scheduler tick, NAPI poll), the IRQ elision saves ~5-10 cycles per PerCpu access. With 1-2 PerCpu accesses per NVMe completion, this yields ~5-20 cycles saved per I/O operation.

3.3.2 NMI Safety

local_irq_save() does NOT prevent NMIs (non-maskable interrupts). An NMI can arrive while a PerCpu<T> variable is mutably borrowed. This creates a soundness hazard: if the NMI handler also accesses the same PerCpu<T>, the mutable borrow invariant is violated.

NMI handler rules:

  1. NMI handlers MUST NOT call PerCpu<T>::get_mut() or get_mut_nosave() on any variable. In debug builds, the CAS check panics (correct — panic is better than data corruption). In release builds, this would be UB.

  2. NMI handlers MAY call PerCpu<T>::get() (shared borrow) on variables that are never mutably borrowed during normal interrupt-enabled execution. In practice, this limits NMI-safe reads to a small set of immutable-after-init or structurally read-only variables.

  3. NMI handlers that need per-CPU mutable state MUST use dedicated NMI-safe buffers allocated outside the PerCpu<T> mechanism — e.g., raw per-CPU arrays indexed by cpu_id() with AtomicU64 fields, or the CpuLocal register-based fast path (Section 3.2) which is inherently NMI-safe (single-instruction register read, no borrow tracking).
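Rule 3's dedicated-buffer pattern can be sketched directly: a raw per-CPU array of atomics, indexed by CPU id, with no borrow state to corrupt. The field names and MAX_CPUS value are illustrative:

```rust
// NMI-safe per-CPU statistics: each slot holds only atomics, so an
// NMI arriving mid-update on the same CPU sees either the old or the
// new value — never torn or "borrowed" state. fetch_add is a single
// atomic RMW; no lock, no guard, no CAS protocol.
use std::sync::atomic::{AtomicU64, Ordering};

const MAX_CPUS: usize = 64;

struct NmiStats {
    nmi_count: AtomicU64,
    unknown_nmi: AtomicU64,
}

static NMI_STATS: [NmiStats; MAX_CPUS] = {
    const ZERO: NmiStats = NmiStats {
        nmi_count: AtomicU64::new(0),
        unknown_nmi: AtomicU64::new(0),
    };
    [ZERO; MAX_CPUS]
};

/// Callable from NMI context: one atomic RMW, no locks, no borrows.
fn nmi_record(cpu: usize) {
    NMI_STATS[cpu].nmi_count.fetch_add(1, Ordering::Relaxed);
}

fn main() {
    nmi_record(3);
    nmi_record(3);
    assert_eq!(NMI_STATS[3].nmi_count.load(Ordering::Relaxed), 2);
    assert_eq!(NMI_STATS[3].unknown_nmi.load(Ordering::Relaxed), 0);
}
```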

CpuLocal field NMI safety classification:

Field NMI-safe Reason
cpu_id Yes Immutable after init, single register read
current_task Yes (read-only) Written only by context_switch(), which cannot race with an NMI on the same CPU
preempt_count Yes (read-only) NMI may read; must not modify (would corrupt preemption state)
in_nmi Yes (dedicated) NMI entry sets this flag; NMI-specific counter
slab_magazines No Mutable under get_mut() during allocation
runqueue No Mutable during scheduler tick
rcu_data No Mutable during quiescent state reporting

Enforcement: In debug builds, PerCpu<T>::get_mut() sets an atomic flag that get() from NMI context checks — if the flag is set, panic with "PerCpu<T> mutably borrowed during NMI". NMI context is detected by reading CpuLocal.in_nmi. This catches violations during testing. In release builds, the cost of this check (~3 cycles per NMI entry) is elided; the invariant is enforced by code review and the /// # NMI Safety documentation convention.

Documentation convention: Any function that may be called from NMI context MUST include a /// # NMI Safety doc section listing which per-CPU state it accesses and why that access is safe. This is analogous to the /// # Safety requirement for unsafe fn.
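A hypothetical example of the convention (the function name, body, and the atomic stand-in for CpuLocal.in_nmi are illustrative, not from the kernel tree):

```rust
// Illustrative use of the `/// # NMI Safety` doc section: the function
// states exactly which per-CPU state it touches and why the access
// pattern is safe from NMI context.
use std::sync::atomic::{AtomicU32, Ordering};

static IN_NMI_MODEL: AtomicU32 = AtomicU32::new(0); // stands in for CpuLocal.in_nmi

/// Record entry into NMI context and return the new nesting depth.
///
/// # NMI Safety
///
/// Accesses only `CpuLocal.in_nmi` (modeled here as an atomic), which
/// is written exclusively on the owning CPU at NMI entry/exit. A single
/// atomic increment cannot observe torn or mutably-borrowed state, so
/// this is safe from NMI context per rule 3 above.
fn nmi_enter() -> u32 {
    IN_NMI_MODEL.fetch_add(1, Ordering::Relaxed) + 1
}

fn main() {
    assert_eq!(nmi_enter(), 1);
    assert_eq!(nmi_enter(), 2); // nested NMI, where the arch allows it
}
```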

3.4 Cumulative Performance Budget

The complete set of UmkaOS overhead-reduction techniques — CpuLocal (Section 3.2), debug-only PerCpu CAS (Section 3.3), IRQ elision (Section 3.3), RCU deferred quiescent state (Section 3.1, RcuReadGuard::drop), PKRU shadow elision (Section 11.2), capability amortization (Section 12.3), and doorbell coalescing (Section 11.7) — are all core design decisions implemented from day one. They are not deferred optimizations. Below is the cumulative overhead analysis for the three hottest I/O paths with all techniques active.

Platform coverage: The 5% overhead budget applies to architectures with hardware-assisted Tier 1 isolation: x86-64 (MPK), AArch64 (POE, Cortex-X4+ / Neoverse V3+), ARMv7 (DACR), PPC64LE (Radix PID), and PPC32 (segment registers). All comfortably meet the budget. On RISC-V 64, s390x, and LoongArch64, no fast hardware isolation mechanism exists — Tier 1 drivers run as Tier 0 (in-kernel, zero isolation overhead) or Tier 2 (Ring 3, full process isolation). The 5% Tier 1 budget is N/A on these platforms. Pre-POE AArch64 (page-table isolation) exceeds 5% and also falls back to Tier 0/Tier 2. See Section 11.2 for per-architecture isolation mechanism details.

NVMe 4KB random read (~10μs = ~25,000 cycles at 2.5 GHz):

Source Cycles Notes
MPK domain switches (×4, shadow-elided) ~47-92 Shadow elides 1-2 of 4 WRPKRU on back-to-back transitions
CpuLocal slab magazine access (×2) ~2-8 alloc + free, arch-dependent
PerCpu (nosave) in IRQ context (×1) ~1-3 IRQ-disabled proof token, no pushfq/popfq
RCU quiescent flag (×1) ~1 CpuLocal bool write, deferred to tick
Capability validation (×1, amortized) ~8-19 ValidatedCap token + REVOKED_FLAG check, 3-4 sub-calls use cached
Doorbell write (×1, amortized over batch-32) ~5 Single MMIO write for entire batch
Total additional over Linux ~64-128
% of 10μs operation ~0.26-0.51%
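The table's totals can be checked mechanically: summing the per-source cycle ranges reproduces the ~64-128 total and the ~0.26-0.51% share of a 25,000-cycle operation. A small verification sketch:

```rust
// Arithmetic check of the NVMe 4KB random read budget table.
fn main() {
    // (low, high) cycle ranges, one tuple per table row
    let rows: [(u32, u32); 6] = [
        (47, 92), // MPK domain switches (x4, shadow-elided)
        (2, 8),   // CpuLocal slab magazine access (x2)
        (1, 3),   // PerCpu (nosave) in IRQ context (x1)
        (1, 1),   // RCU quiescent flag (x1)
        (8, 19),  // capability validation (x1, amortized)
        (5, 5),   // doorbell write (amortized over batch-32)
    ];
    let low: u32 = rows.iter().map(|r| r.0).sum();
    let high: u32 = rows.iter().map(|r| r.1).sum();
    assert_eq!((low, high), (64, 128));

    let op_cycles = 25_000.0; // ~10μs at 2.5 GHz
    let pct_low = low as f64 / op_cycles * 100.0;
    let pct_high = high as f64 / op_cycles * 100.0;
    assert!((pct_low - 0.256).abs() < 0.001);  // ~0.26%
    assert!((pct_high - 0.512).abs() < 0.001); // ~0.51%
}
```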

TCP RX packet (~5μs per-packet, NAPI batch-64):

Source Cycles Notes
MPK domain switches (×4, shadow-elided, amortized/64) ~0.7-1.4/pkt Shadow + NAPI batching
CpuLocal NAPI budget + socket access ~2-6/pkt
RCU conntrack guard (×1) ~0.02/pkt CpuLocal flag write, amortized
Capability (amortized over NAPI batch) ~0.1-0.2/pkt One ValidatedCap per NAPI poll cycle
Total additional per packet ~3-8
% of 5μs operation ~0.02-0.06% per packet, with batching
% without batching (batch=1) ~1.5-2.2% worst case, still improved

Context switch (~2μs = ~5,000 cycles):

Source Cycles Notes
CpuLocal runqueue access ~1-4 pick_next_task
Isolation shadow save/restore ~2-4 Memory writes only, no WRPKRU
RCU quiescent report (batched) ~3-5 Check flag + report, once per switch
Total additional ~6-13 integer workload, no FPU
% of ~2μs switch ~0.1-0.3%

Cumulative worst case (nginx-like: receive + read + send + switch):

Platform: x86-64 MPK:

TCP RX (NAPI-64):    ~0.04%
NVMe read (batched):  ~0.35%
TCP TX (batched):     ~0.25%
Context switches:     ~0.15%
──────────────────────────
Compound:             ~0.8%    (4.2% headroom under 5% budget)

Platform: AArch64 POE (Cortex-X4+, Neoverse V3+):

TCP RX (NAPI-64):    ~0.4%
NVMe read (syscall): ~1.5%
TCP TX (batched):     ~0.5%
Context switches:     ~0.2%
──────────────────────────
Compound:             ~1.5-2.0%  (3.0-3.5% headroom under 5% budget)

NVMe 4KB random read — syscall path with LSM (~10μs):

The budget above assumes io_uring SQPOLL (zero domain crossings for the read itself). A conventional read() syscall adds VFS and LSM overhead:

Source Cycles Notes
All isolation overhead (as above) ~64-128 Same as SQPOLL path
VFS domain crossing (ring dispatch, amortized) ~8-20 Ring submit + consumer dispatch, amortized at N≥12 per domain-switch cycle
LSM hooks (2-3 per read: file_permission, inode_permission) ~30-90 Static dispatch, single LSM: ~10-30 per hook. 3 hooks typical.
UserPtr validation (copy_to_user) ~5-10 Bounds check + potential page fault
Total additional over Linux ~107-248
% of 10μs operation ~0.4-1.0% Still well within 5% budget

TCP RX packet with NetBuf overhead (~5μs per-packet, NAPI batch-64):

Source Cycles Notes
All isolation overhead (as above) ~3-8/pkt Same as base budget
NetBuf slab alloc in receiving domain (per-packet) ~15-25/pkt Magazine pop (hot), slab fallback (warm)
NetBuf ring entry serialization (128-byte write) ~10-15/pkt 2 cache lines, sender side
NetBuf ring entry deserialization + reconstruction ~10-15/pkt 2 cache lines read + field copy, receiver side
NetBuf slab free in sending domain (per-packet) ~10-15/pkt Magazine push (hot), slab return (warm)
Total additional per packet ~48-78
% of 5μs operation ~0.4-0.6% per packet, with NAPI batching

NetBuf lifecycle cost breakdown: The domain crossing copies the 128-byte NetBufRingEntry wire format (2 cache lines), NOT the full 296-byte NetBuf struct. The receiving domain reconstructs a full NetBuf from the ring entry plus the shared DMA data handle. Data pages are shared zero-copy via the DMA buffer pool — only metadata crosses the domain boundary.

Cumulative worst case with LSM (nginx-like: receive + read + send + switch):

Platform: x86-64 MPK (with LSM):

TCP RX (NAPI-64):    ~0.4%   (includes NetBuf lifecycle overhead)
NVMe read (syscall): ~0.9%   (includes VFS + LSM)
TCP TX (batched):    ~0.5%   (includes NetBuf lifecycle + LSM)
Context switches:    ~0.15%
LSM hooks (open/close/sendfile, 6 total): ~0.5%  (static dispatch)
──────────────────────────
Compound:             ~2.5%   (2.5% headroom under 5% budget)

Platform: AArch64 POE (with LSM):

Compound:             ~2.8-3.8%  (1.2-2.2% headroom under 5% budget)

The ~2.5% headroom is tighter than the SQPOLL-only estimate (~4.2%). This is the realistic worst case for a typical server workload with security enabled.

NVMe write path additional overhead: The read path budget above is the simpler case. The write path incurs additional costs beyond the read path: (1) the writeback domain crossing pair (~23-46 cycles on x86-64), (2) the bio dispatch domain crossing pair (~23-46 cycles), and (3) LSM hooks (file_permission + inode_permission, ~20-60 cycles). In total, the write path adds ~66-152 cycles (~0.3-0.7% of 10μs) over the read path. The compound budget remains within the 5% envelope: ~2.8-3.2% worst case for a mixed read/write workload. See Section 1.3 for the itemized breakdown.

3.4.1.1 Write Path Completion Latency (fsync/O_SYNC)

The bio completion callback defers end_page_writeback() to the blk-io workqueue because page cache operations may acquire sleeping locks (required for VFS crash recovery — see Section 11.9). This deferral adds ~1-5μs per write completion on the fsync/O_SYNC path.

Metric Value Notes
Workqueue deferral latency ~1-5μs Per write completion
Impact on fsync() at 100K IOPS ~100-500ms/sec additional Sequential, not pipelined
Impact on async writeback ~0% throughput Workqueue runs concurrently
Comparison with Linux Linux: 0μs (softirq context) Linux lacks VFS crash recovery

This is the price of crash containment: UmkaOS's VFS crash recovery requires sleeping locks on the writeback path, which cannot execute in softirq context. The throughput budget (cycles per operation at pipeline throughput) is unaffected because the workqueue runs asynchronously for non-sync I/O. Only O_SYNC/fsync callers observe the sequential latency penalty.

3.4.1.2 Metadata-Heavy Workload Note

Metadata-heavy workloads (stat, readdir, open/close): Individual metadata syscalls pay ~18ns overhead per call on x86-64 (~46 cycles for Core → VFS+FS → Core round-trip). This is ~3.6-9% per individual stat() (base cost ~200-500ns). For throughput benchmarks, the overhead is amortized across mixed I/O operations. For purely metadata-heavy workloads (e.g., find traversal, package manager resolution), expect ~5-9% syscall-level overhead per metadata operation, dominated by the VFS domain crossing cost. This is the design tradeoff for VFS crash containment: the dentry/inode cache lives in the VFS domain, enabling crash recovery via cache rebuild, at the cost of one domain crossing pair per metadata syscall.

Per-architecture metadata overhead (single stat() call, L1-hot, dentry cache hit):

Architecture VFS round-trip stat() base cost Overhead
x86-64 (MPK) ~46 cycles (~18ns) ~200-500ns ~3.6-9%
AArch64 (POE) ~80-160 cycles (~32-64ns) ~200-500ns ~6.4-32%
ARMv7 (DACR) ~60-80 cycles (~30-40ns) ~200-500ns ~6-20%
PPC64 (Radix PID) ~60-120 cycles (~24-48ns) ~200-500ns ~4.8-24%
PPC32 (segments) ~40-80 cycles (~20-40ns) ~200-500ns ~4-20%
RISC-V (page table) ~400-1000 cycles (~160-400ns) ~200-500ns ~32-200%
AArch64 (page table) ~150-300 cycles (~60-120ns) ~200-500ns ~12-60%

On RISC-V and page-table-fallback AArch64, metadata-heavy workloads exceed the 5% budget per-operation. These architectures should run VFS as Tier 0 (no domain crossing) unless crash containment is specifically required, in which case the overhead is accepted as the cost of isolation.

Amortized metadata overhead: The raw per-call overhead above applies to isolated stat() calls with no prefetch. For the dominant readdir+stat access pattern (e.g., find, ls -la, package managers), the VFS readdir-plus prefetch mechanism (Section 14.1) reduces effective overhead to ~0.3-0.5% on x86-64 by serving stat results from a per-task Core-memory buffer with zero domain crossings (~95% hit rate). For io_uring batched statx workloads, coalescing reduces overhead to ~0.05% per stat (one crossing amortized over 64 operations). These improvements apply to Always and CacheAware filesystem policies; Never-policy filesystems (DLM-based cluster FS) retain the unoptimized cost, which is negligible relative to their ~50-500us distributed lock overhead.
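The amortization arithmetic can be sketched as a one-line model: with hit rate h served from the Core-memory buffer (zero crossings) and misses paying the full VFS round-trip, effective overhead = (1 − h) × crossing / base. The inputs (46-cycle crossing, ~95% hit rate, 500-1250 cycle base stat cost, i.e. 200-500ns at 2.5 GHz) come from the text; the formula itself is an illustrative back-of-envelope, not the kernel's accounting:

```rust
// Back-of-envelope model of readdir-plus prefetch amortization.
fn effective_overhead_pct(hit_rate: f64, crossing_cycles: f64, base_cycles: f64) -> f64 {
    (1.0 - hit_rate) * crossing_cycles / base_cycles * 100.0
}

fn main() {
    let crossing = 46.0; // x86-64 MPK VFS round-trip, cycles

    // Base stat() cost 200-500ns at 2.5 GHz = 500-1250 cycles.
    let best = effective_overhead_pct(0.95, crossing, 1250.0);
    let worst = effective_overhead_pct(0.95, crossing, 500.0);

    // ~0.18-0.46%: the same ballpark as the ~0.3-0.5% figure above.
    assert!(best > 0.15 && best < 0.25);
    assert!(worst > 0.4 && worst < 0.5);
}
```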

3.4.1.3 Non-Batched / Latency-Sensitive Workloads

The budgets above assume batch amortization (NAPI batch-64, doorbell batch-32). These assumptions hold for sustained throughput workloads (web servers, streaming, bulk storage). They do not hold for workloads where each operation is independent and unbatched: database transactions, RPC microservices, interactive key-value stores.

Single-request RPC scenario (gRPC/Thrift service: recv 1 request → 2 DB reads → format → send 1 response, x86-64 MPK):

| Source | Cycles | Notes |
|---|---|---|
| TCP RX: NIC driver domain switch (×2, enter+exit) | ~46 | No NAPI batch — single packet wakes poll, NAPI completes with 1 packet |
| TCP RX: umka-net domain switch (×2) | ~46 | TCP → socket buffer → userspace |
| NetBuf lifecycle (×1: alloc + serialize + deserialize + free) | ~48-78 | Full ring entry copy + slab alloc/free |
| Capability validation (×1) | ~8-19 | Amortized across NIC+TCP sub-calls (includes REVOKED_FLAG check) |
| LSM hooks: socket_recvmsg (×1) | ~10-30 | |
| NVMe read #1: VFS domain crossing (×2) | ~46-92 | syscall read → VFS → NVMe driver |
| NVMe read #1: doorbell (×1, no batching) | ~150 | Single MMIO write, no coalescing |
| NVMe read #1: LSM (file_permission + inode_permission) | ~20-60 | |
| NVMe read #1: capability (×1) | ~8-19 | Includes REVOKED_FLAG check |
| NVMe read #2: same as #1 | ~223-316 | |
| TCP TX: NIC domain switch (×2) | ~46 | |
| TCP TX: umka-net domain switch (×2) | ~46 | |
| TCP TX: NetBuf lifecycle + LSM | ~58-108 | NetBuf lifecycle ~48-78 + LSM ~10-30 |
| Total additional over Linux | ~755-1051 | |
| Typical RPC latency | ~25-50 μs | 62,500-125,000 cycles at 2.5 GHz |
| Overhead (L1-hot) | ~0.8-1.7% | Well within 5% |

3.4.1.4 Cache-Cold Multiplier (L1I Displacement from Domain Working Set Switching)

The L1-hot assumption above is optimistic for mixed workloads. When control transfers between umka-core and a Tier 1 driver, the driver's instruction cache footprint (hot loop, DMA descriptor handling, interrupt acknowledgment) displaces umka-core's L1I entries. On return, umka-core's code must be re-fetched from L2.

This L1I displacement tax is a structural overhead of the isolation model that Linux does not incur — in Linux, all kernel code shares one address space and one working set, so the NVMe driver's hot loop and the VFS dispatch code coexist in L1I without displacement pressure from domain switches. In UmkaOS, each domain switch is a working set boundary: the hardware prefetcher and branch predictor must re-warm after every transition. The cache-cold penalty is ~2-3x on the isolation-related cycles (domain switches, LSM hooks, capability checks), but NOT on device I/O latency (NVMe latency is device-bound, not cache-bound).

The L2-warm row in the table below captures this effect. L1-hot numbers are the optimistic case; L2-warm (~2x multiplier on isolation cycles) is the realistic steady-state for production workloads with mixed request types or multiple active Tier 1 drivers. The headroom calculation (4.2% on x86-64) uses the L2-warm estimate, ensuring the 5% budget holds under realistic cache conditions.

| Scenario | Isolation cycles | Total overhead | When this applies |
|---|---|---|---|
| L1-hot (sustained single-type RPC) | ~755-1,051 | ~0.8-1.7% | Tight loop hitting one driver repeatedly |
| L2-warm (mixed RPC types, ~2x isolation penalty) | ~1,200-1,700 | ~1.2-2.7% | Production steady-state: multiple drivers, varied request types |
| L3/cold (rare path, ~3x) | ~1,800-2,500 | ~1.9-4.0% | First request after idle, cold code path |

Tier 1 domain switch contribution: In all scenarios above, domain switch cycles (WRPKRU on x86-64, POR_EL0 on AArch64 POE, etc.) are the dominant isolation cost component. Per-architecture domain switch costs and the full breakdown are in the per-architecture table below (line "Isolation cost").

Key observations:

  1. Single-request overhead is higher than batched but still within budget. The worst case (L3-cold, 2 DB reads per RPC) is ~3.5% on x86-64. This is under 5% but with less headroom than the nginx throughput scenario.

  2. Doorbell coalescing is the biggest loss. Without batch-32, each NVMe submission costs ~150 cycles (raw MMIO write) instead of ~5 cycles/cmd. This is the single largest contributor to non-batched overhead. Mitigation: io_uring with IORING_SETUP_SQPOLL recovers batching even for RPC workloads by coalescing submissions from the poll thread.

  3. NAPI batch-1 is common for low-rate RPC. When a NIC interrupt fires for a single packet, NAPI polls once and exits. The 4 domain switches (NIC enter/exit + umka-net enter/exit) are NOT amortized. Per-packet isolation overhead: ~92 cycles on a ~5 μs packet = ~1.8%. This matches the "without batching" line in the TCP RX budget above (~1.5-2.2%).

  4. Tail latency vs throughput. The 5% budget targets throughput on macro benchmarks (as scoped in Section 1.3). Individual request tail latency (p99, p99.9) may show higher percentage overhead because of:

     • Cache-cold paths on rare request types
     • NAPI poll finding only 1 packet (no amortization)
     • Coincidence with RCU grace period processing or FMA health check
     • TLB pressure from domain switches on page-table-fallback architectures

For p99 tail latency on latency-sensitive services, we target <8% overhead on fast-isolation architectures (x86-64 MPK, AArch64 POE) and <15% on page-table-fallback architectures. These are realistic bounds that account for cache variability and non-batched paths. Per-operation (individual syscall or interrupt) tail latency budgets are not specified — consistent with industry practice where only workload-level latency targets are meaningful.

Mitigation strategies for latency-sensitive services:

  1. io_uring SQPOLL: Recovers doorbell coalescing even for single requests by batching submissions in the poll thread. Eliminates the ~150 cycle/cmd penalty.
  2. Busy-polling (SO_BUSY_POLL): Avoids the NAPI interrupt path entirely; the application polls the NIC ring directly (still through the Tier 1 domain crossing). Trades CPU for latency determinism.
  3. Tier 0 promotion for latency-critical drivers: On architectures without fast isolation, or for services where even MPK overhead matters, the admin can set a trusted NIC or NVMe driver to Tier 0 via echo 0 > /ukfs/kernel/drivers/<name>/tier. This eliminates domain switch overhead entirely at the cost of crash containment.
  4. Adaptive NAPI coalescing: UmkaOS NAPI uses adaptive interrupt coalescing (Section 16.14). Under low load, packets are delivered with minimal batching (latency-optimized). Under high load, batching increases (throughput-optimized). This is the same tradeoff Linux makes.
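The doorbell arithmetic behind the SQPOLL mitigation is easy to check directly — a sketch; the function name is illustrative:

```rust
/// Amortized doorbell cost: one MMIO write (~150 cycles) is shared across
/// a coalesced submission batch.
fn doorbell_cycles_per_cmd(mmio_cycles: f64, batch: u32) -> f64 {
    mmio_cycles / batch as f64
}

fn main() {
    let unbatched = doorbell_cycles_per_cmd(150.0, 1);  // 150 cycles/cmd
    let batch32 = doorbell_cycles_per_cmd(150.0, 32);   // ~4.7 cycles/cmd
    assert_eq!(unbatched, 150.0);
    assert_eq!(batch32, 4.6875);
    // The difference (~145 cycles/cmd) is the doorbell-coalescing saving
    // cited in the optimization summary.
    println!("saved per cmd: {}", unbatched - batch32);
}
```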

Per-architecture non-batched RPC overhead (single gRPC request, 2 DB reads, L2-warm):

| Architecture | Isolation cost/crossing | ~10 crossings | Total RPC overhead |
|---|---|---|---|
| x86-64 (MPK) | ~23 cycles | ~460 (×2 warm) = ~920 | ~1.2-2.0% |
| AArch64 (POE) | ~40-80 cycles | ~1,200 (×2) = ~2,400 | ~2.5-4.0% |
| AArch64 (page table) | ~150-300 cycles | ~4,500 (×2) = ~9,000 | ~8-15% |
| ARMv7 (DACR + ISB) | ~30-40 cycles | ~700 (×2) = ~1,400 | ~1.8-3.0% |
| PPC64 (Radix PID) | ~30-60 cycles | ~900 (×2) = ~1,800 | ~1.8-3.0% |
| PPC32 (segments + isync) | ~20-40 cycles | ~600 (×2) = ~1,200 | ~1.5-2.5% |
| RISC-V (page table) | ~200-500 cycles | ~7,000 (×2) = ~14,000 | ~12-22% |

On RISC-V and pre-POE AArch64, non-batched RPC overhead exceeds the 5% throughput budget. This is expected and documented: on platforms without Tier 1 hardware isolation, drivers run as Tier 0 (no isolation overhead, no crash containment) or Tier 2 (Ring 3 + IOMMU, higher overhead but full crash containment). The placement depends on licensing requirements, driver default preference, and sysadmin configuration (see Section 1.1).

Per-architecture cumulative overhead (same nginx throughput workload as above):

| Architecture | CpuLocal cost | PerCpu cost | Isolation cost | Shadow savings | Total |
|---|---|---|---|---|---|
| x86-64 | ~1 cycle | ~1-3 (nosave) | ~23 cycles/WRPKRU | ~23-46 elided | ~0.7-1.2% |
| AArch64 (POE) | ~2-4 cycles | ~3-6 (nosave) | ~40-80/MSR | ~40-80 elided | ~1.2-2.0% |
| AArch64 (page table) | ~2-4 cycles | ~3-6 (nosave) | ~150-300/switch | N/A (TLB) | ~3-6% |
| ARMv7 | ~3-5 cycles | ~4-8 (nosave) | ~30-40/MCR+ISB | ~30-40 elided | ~1.8-3.0% |
| RISC-V | ~5-10 cycles | ~6-12 (nosave) | ~200-500/PT | N/A (TLB) | ~4-10% |
| PPC64 | ~1-3 cycles | ~2-5 (nosave) | ~30-60/mtspr | ~30-60 elided | ~0.8-1.5% |
| PPC32 | ~3-6 cycles | ~5-10 (nosave) | ~20-40/mtsr+isync | ~20-40 elided | ~1.5-3.0% |

Boundary crossing cycle cost reference table:

Individual crossing costs for the key operations that appear in budgets above. All values are per-crossing, measured or estimated per architecture:

| Crossing | x86-64 | AArch64 (POE) | ARMv7 | RISC-V | PPC64 | PPC32 |
|---|---|---|---|---|---|---|
| MPK/POE domain switch (WRPKRU / POR_EL0) | ~23 | ~40-80 | ~30-40 | N/A (PT) | ~30-60 | ~20-40 |
| Shadow elision (back-to-back same domain) | saves ~23 | saves ~40-80 | saves ~30-40 | N/A | saves ~30-60 | saves ~20-40 |
| UserPtr validation (copy_to_user bounds) | ~5-8 | ~6-10 | ~8-12 | ~8-15 | ~5-10 | ~8-15 |
| CapHandle validation (cached ValidatedCap + REVOKED_FLAG) | ~8-19 | ~9-21 | ~11-23 | ~11-25 | ~9-20 | ~11-25 |
| LSM hook (static dispatch, single LSM) | ~10-30 | ~12-35 | ~15-40 | ~15-45 | ~12-35 | ~15-45 |
| NetBuf slab alloc (magazine pop, single) | ~15-25 | ~18-30 | ~20-35 | ~20-40 | ~18-30 | ~20-40 |
| NetBuf full lifecycle (alloc + ring ser/deser + free) | ~48-78 | ~56-95 | ~65-115 | ~65-130 | ~56-95 | ~65-130 |
| VFS domain crossing (full round-trip) | ~46-92 | ~80-160 | ~60-80 | ~400-1000 | ~60-120 | ~40-80 |

Notes:

  • RISC-V VFS domain crossing uses page table switching (no MPK equivalent), hence ~400-1000 cycles including TLB flush. This is the primary overhead contributor on RISC-V.
  • All values assume L1-hot data; the worst case for cache-cold paths is ~2-3x.
  • "Static dispatch, single LSM" means UmkaOS's default (one active LSM with a direct function call, no indirect branch). Stacked LSMs (if supported) multiply per-hook cost by the stack depth.
  • RCU read-side cost (~1 cycle flag check) is included in the isolation and capability validation paths. RCU write-side cost (grace period processing, callback invocation) is workload-dependent and not included in per-operation budgets — it runs on a dedicated kthread and amortizes across batched callbacks.

On x86-64, AArch64 with POE, ARMv7, PPC64, and PPC32: comfortably within the 5% budget with substantial headroom. On AArch64 without POE and RISC-V: isolation cost dominates, and the per-CPU/shadow optimizations matter less — the bottleneck is page-table-based domain switching, not per-CPU access.

3.4.1.4.1 Cache-Cold Sensitivity Analysis

The compound overhead tables above assume L1 cache-hot access for all metadata lookups (XArray node traversals, KABI vtable dereferences, CapEntry reads, ValidatedCap token checks). Under realistic workloads, some fraction of these accesses miss in L1 and hit in L2 or L3. This section quantifies the impact to confirm the 5% budget holds under pessimistic assumptions.

Per-component cache-miss penalty model:

Each metadata access has a baseline L1-hot cost. On an L1 miss, the penalty depends on where the line is found. The weighted miss penalty uses the distribution below.

| Component | L1-hot cost | L2 hit (+ns) | L3 hit (+ns) | Per-miss weighted penalty |
|---|---|---|---|---|
| XArray node traversal (1-2 levels for typical radix depth) | 3-5 ns | +5-15 ns | +30-80 ns | +12.5-34.5 ns |
| KABI vtable dereference (single indirect load) | 2-3 ns | +5-15 ns | +30-80 ns | +12.5-34.5 ns |
| CapEntry read (RCU-protected, single cache line) | 3-5 ns | +5-15 ns | +30-80 ns | +12.5-34.5 ns |
| ValidatedCap REVOKED_FLAG check (atomic load) | 1-2 ns | +5-15 ns | +30-80 ns | +12.5-34.5 ns |

Compound overhead under varying L1 miss rates (x86-64 MPK, NVMe 4KB read path):

| L1 miss rate | XArray (ns) | Vtable deref (ns) | CapEntry (ns) | REVOKED_FLAG (ns) | Compound total (ns) | % of 10 μs NVMe | Within 5%? |
|---|---|---|---|---|---|---|---|
| 0% (baseline) | 3-5 | 2-3 | 3-5 | 1-2 | 9-15 | 0.09-0.15% | Yes |
| 10% (typical steady-state) | 4.3-8.5 | 3.3-6.5 | 4.3-8.5 | 2.3-5.5 | 14.2-29 | 0.14-0.29% | Yes |
| 20% (moderate contention) | 5.5-11.9 | 4.5-9.9 | 5.5-11.9 | 3.5-8.9 | 19-42.6 | 0.19-0.43% | Yes |
| 30% (heavy contention / cold restart) | 6.8-15.4 | 5.8-13.4 | 6.8-15.4 | 4.8-12.4 | 24.2-56.6 | 0.24-0.57% | Yes |
| 50% (worst case: first access after long idle) | 9.3-22.3 | 8.3-20.3 | 9.3-22.3 | 7.3-19.3 | 34.2-84.2 | 0.34-0.84% | Yes |

Cross-architecture compound overhead at 20% L1 miss rate (NVMe 4KB read):

| Architecture | Isolation hw cost (ns) | Metadata cache-cold (ns) | Combined (ns) | % of 10 μs NVMe | Within 5%? |
|---|---|---|---|---|---|
| x86-64 (MPK) | 18-37 | 19-42.6 | 37-80 | 0.37-0.80% | Yes |
| AArch64 (POE) | 32-64 | 19-42.6 | 51-107 | 0.51-1.07% | Yes |
| ARMv7 (DACR + ISB) | 24-32 | 19-42.6 | 43-75 | 0.43-0.75% | Yes |
| PPC64 (Radix PID) | 24-48 | 19-42.6 | 43-91 | 0.43-0.91% | Yes |
| PPC32 (segments + isync) | 16-32 | 19-42.6 | 35-75 | 0.35-0.75% | Yes |

Cross-architecture compound at 20% miss rate for 1 μs syscall (stat/open/close):

| Architecture | Isolation hw cost (ns) | Metadata cache-cold (ns) | Combined (ns) | % of 1 μs syscall | Within 5%? |
|---|---|---|---|---|---|
| x86-64 (MPK) | 18-37 | 19-42.6 | 37-80 | 3.7-8.0% | Marginal |
| AArch64 (POE) | 32-64 | 19-42.6 | 51-107 | 5.1-10.7% | No |
| ARMv7 (DACR + ISB) | 24-32 | 19-42.6 | 43-75 | 4.3-7.5% | Marginal |
| PPC64 (Radix PID) | 24-48 | 19-42.6 | 43-91 | 4.3-9.1% | No |
| PPC32 (segments + isync) | 16-32 | 19-42.6 | 35-75 | 3.5-7.5% | Marginal |

For sub-microsecond metadata syscalls under cache pressure, the 5% budget is exceeded on some architectures. This is consistent with the metadata-heavy workload analysis above and is mitigated by the VFS readdir-plus prefetch mechanism (Section 14.1), which achieves ~95% hit rate in Core-memory buffers, eliminating domain crossings (and their associated cache-cold metadata lookups) for the common case.

Assumptions:

  • L2 hit latency: 5-15 ns (varies by microarchitecture: AMD Zen 4 ~5 ns, ARM Neoverse V2 ~8 ns, Intel Sapphire Rapids ~12 ns, IBM POWER10 ~10 ns)
  • L3 hit latency: 30-80 ns (Zen 4 ~30 ns, Neoverse V2 ~40 ns, Sapphire Rapids ~50 ns, POWER10 ~60 ns)
  • Miss distribution: 70% L2 hit / 30% L3 hit (conservative; real workloads show >85% L2 hit rate for kernel metadata due to temporal locality of XArray nodes, vtable pointers, and capability entries)
  • Weighted per-miss penalty: 0.7 × L2_mid + 0.3 × L3_mid = 0.7 × 10 + 0.3 × 55 = 23.5 ns (used for interpolation; tables above use the full range, not just the midpoint)
  • Metadata access count per NVMe read: 4 (XArray lookup, vtable dispatch, CapEntry validation, REVOKED_FLAG check). Additional accesses (LSM hook dispatch, VFS dentry lookup) are accounted separately in the LSM and metadata-heavy workload sections above.
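A minimal sketch of the penalty model, reproducing the table rows above from these assumptions (function names are illustrative):

```rust
/// Weighted per-miss penalty: 70% of L1 misses hit L2, 30% hit L3.
fn weighted_penalty_ns(l2_pen: f64, l3_pen: f64) -> f64 {
    0.7 * l2_pen + 0.3 * l3_pen
}

/// Expected access cost at a given L1 miss rate.
fn expected_cost_ns(l1_hot: f64, miss_rate: f64, l2_pen: f64, l3_pen: f64) -> f64 {
    l1_hot + miss_rate * weighted_penalty_ns(l2_pen, l3_pen)
}

fn main() {
    // Midpoints: L2 +10 ns, L3 +55 ns -> 23.5 ns, as stated above.
    assert_eq!(weighted_penalty_ns(10.0, 55.0), 23.5);
    // Range endpoints give the per-miss penalty column: +12.5 .. +34.5 ns.
    assert_eq!(weighted_penalty_ns(5.0, 30.0), 12.5);
    assert_eq!(weighted_penalty_ns(15.0, 80.0), 34.5);
    // XArray traversal (3-5 ns hot) at 10% miss rate -> table row 4.3-8.5 ns.
    println!("{:.2} {:.2}",
        expected_cost_ns(3.0, 0.10, 5.0, 30.0),
        expected_cost_ns(5.0, 0.10, 15.0, 80.0));
}
```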

Conclusion: Even under a pessimistic 30% L1 miss rate with 30% of misses propagating to L3, the compound metadata overhead on a 10 μs NVMe read reaches only ~0.57% of the operation cost — well within the 5% performance budget. Combined with the hardware isolation overhead (domain switches, shadow elision), the total stays under ~1.5% on all fast-isolation architectures. The per-VMA lock optimization (Section 4.8) further reduces cache pressure by eliminating mmap_lock contention, which is a primary source of L1 cache pollution on multi-core systems.

Note: These numbers represent the UmkaOS isolation metadata overhead on top of both the base syscall cost and the hardware isolation cost (domain switches). They do not double-count: the boundary crossing cycle cost table above covers hardware isolation; this section covers the software metadata lookups that accompany each crossing. A 0.57% metadata overhead plus ~0.5% hardware isolation overhead yields ~1.1% total — imperceptible to applications.

Optimization summary (all implemented from day one, not deferred):

| Technique | Section | Cycles saved per I/O | Cumulative impact |
|---|---|---|---|
| CpuLocal register-based access | Section 3.2 | ~15-25 (vs old PerCpu CAS) | Major: hottest paths |
| Debug-only PerCpu CAS | Section 3.3 | ~15-20 (release elision) | Major: all PerCpu paths |
| IRQ save/restore elision | Section 3.3 | ~5-10 (per get_mut) | Moderate: IRQ/spinlock paths |
| RCU deferred quiescent state | Section 3.1 | ~4-9 (per outermost drop) | Moderate: all RCU paths |
| Isolation shadow elision | 03a, Section 11.2 | ~23-80 (per elided write) | Major: back-to-back switches |
| Capability amortization | Section 12.3 | ~8-36 (per KABI dispatch) | Moderate: all KABI calls |
| Doorbell coalescing | Section 11.7 | ~145/cmd (batch-32) | Major: batched NVMe/virtio |

RCU interaction: KabiDispatchGuard (which scopes ValidatedCap<'dispatch>) holds an RCU read-side lock for the duration of every KABI dispatch — see Section 12.3. This means every KABI call adds one RCU nesting level. The consequence for the concurrency model: capability revocations (rcu_call callbacks) cannot complete while a KABI dispatch is in progress on any CPU. Long KABI dispatches therefore increase RCU grace period latency, which is bounded by the maximum KABI call duration (~100 μs worst case for NVMe completion). Designers adding new KABI interfaces must keep dispatch handlers short; blocking operations (sleeping, waiting on locks) inside a KabiDispatchGuard scope are prohibited.

Revocation traversal cost: Capability revocation uses a two-phase breadth-first protocol (Section 9.1). Phase 1 is lock-free (~1-5 cycles, single fetch_or on active_ops). Phase 2 enqueues CapRevocationWork items to the per-CPU workqueue, processing one delegation tree level per workqueue pass. The worst-case spinlock hold time per node is O(256) iterations (one children list scan), not O(children × depth) as a recursive traversal would require. Total revocation latency for deep trees increases slightly due to workqueue scheduling overhead (~50-200 ns per level, up to 16 levels = ~0.8-3.2 μs added), but worst-case interrupt latency improves dramatically: no single spinlock hold exceeds ~256 iterations regardless of tree depth.
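The level-at-a-time shape of phase 2 can be sketched in miniature — an illustrative model, not the kernel code: each pass pops one work item, scans at most 256 children under its (conceptual) lock, and enqueues them as further work items instead of recursing:

```rust
use std::collections::VecDeque;

/// Toy capability node: `children` are indices into a flat arena.
struct CapNode {
    id: u32,
    children: Vec<usize>,
}

/// Breadth-first revocation: bounded work per pass, no recursion, so no
/// single (conceptual) spinlock hold exceeds one children-list scan.
fn revoke_tree(arena: &[CapNode], root: usize) -> Vec<u32> {
    let mut revoked = Vec::new();
    let mut work: VecDeque<usize> = VecDeque::from([root]);
    while let Some(idx) = work.pop_front() {
        let node = &arena[idx];
        revoked.push(node.id); // phase-2 revocation work for this node
        // One bounded scan (O(256)) of the children list per pass.
        for &child in node.children.iter().take(256) {
            work.push_back(child); // deferred to a later workqueue pass
        }
    }
    revoked
}

fn main() {
    let arena = vec![
        CapNode { id: 10, children: vec![1, 2] }, // root
        CapNode { id: 20, children: vec![3] },
        CapNode { id: 30, children: vec![] },
        CapNode { id: 40, children: vec![] },
    ];
    // Level order: root, its children, then grandchildren.
    assert_eq!(revoke_tree(&arena, 0), vec![10, 20, 30, 40]);
    println!("ok");
}
```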

Tier 2 dispatch exception: Tier 2 drivers communicate via IPC syscalls, not domain ring buffers. The Tier 2 dispatch path does NOT hold a KabiDispatchGuard or RCU read lock during the cross-address-space IPC. Instead, capability validation uses a two-phase approach:

  1. Validate the capability under a short RCU read lock (~10 cycles). Copy the ValidatedCap fields.
  2. Drop the RCU read lock before issuing the blocking IPC send/recv.
  3. On IPC completion, re-validate if the response references new capabilities.

This avoids RCU stalls from Tier 2 latency while maintaining revocation safety: a capability revoked between phases 1 and 3 is caught by re-validation in phase 3.
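A self-contained sketch of this two-phase shape (types and names are illustrative; the real path uses the kernel's RCU read guard rather than a plain atomic flag):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Toy capability entry; `revoked` stands in for REVOKED_FLAG.
struct CapEntry {
    revoked: AtomicBool,
    object_id: u64,
}

/// Copied-out fields, valid after the read-side critical section ends.
#[derive(Clone, Copy)]
struct ValidatedCapSnapshot {
    object_id: u64,
}

fn tier2_dispatch(
    cap: &CapEntry,
    ipc: impl FnOnce(u64) -> u64, // stands in for the blocking IPC send/recv
) -> Result<u64, &'static str> {
    // Phase 1: validate under a (conceptual) short RCU read lock; copy fields.
    if cap.revoked.load(Ordering::Acquire) {
        return Err("revoked before dispatch");
    }
    let snap = ValidatedCapSnapshot { object_id: cap.object_id };
    // (Read lock dropped here — the call below may block for a long time.)

    // Phase 2: blocking cross-address-space IPC, no RCU lock held.
    let reply = ipc(snap.object_id);

    // Phase 3: re-validate; a revocation between phases 1 and 3 is caught here.
    if cap.revoked.load(Ordering::Acquire) {
        return Err("revoked during IPC");
    }
    Ok(reply)
}

fn main() {
    let cap = CapEntry { revoked: AtomicBool::new(false), object_id: 7 };
    assert_eq!(tier2_dispatch(&cap, |id| id * 2), Ok(14));
    cap.revoked.store(true, Ordering::Release);
    assert!(tier2_dispatch(&cap, |id| id * 2).is_err());
    println!("ok");
}
```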

Spectre mitigation interaction: The KABI vtable dispatch is an indirect call, which incurs retpoline overhead (~15-25 cycles) on pre-eIBRS hardware. This cost applies equally to any indirect call in both Linux and UmkaOS; UmkaOS's differential cost is exactly one additional retpoline per domain crossing (the vtable dispatch itself). On eIBRS-capable hardware (Intel Ice Lake+, AMD Zen 3+), the indirect call is predicted at ~2-5 cycles and retpoline is not used. All cycle counts in this section assume identical Spectre mitigations on both Linux and UmkaOS. See Section 2.18 for the complete per-mitigation overhead analysis.

/// RCU in Rust: zero-lock read path, deferred reclamation.
/// Readers hold an RcuReadGuard (analogous to rcu_read_lock).
/// Writers swap the pointer atomically and defer freeing the old
/// value until all readers have exited their critical sections.
///
/// **Clone-and-swap write pattern**: The canonical way to make incremental
/// changes to RCU-protected state (e.g., adding a route to a routing table,
/// registering a service in a registry) is:
///   1. Acquire the write-side lock (`Mutex<()>` co-located with the `RcuCell`).
///   2. Read the current value via `cell.read()`.
///   3. Clone the current value: `let mut new_val = (*current).clone();`
///   4. Modify `new_val` as needed.
///   5. Publish: `cell.update(new_val, &guard)`.
/// The old value is freed after the RCU grace period (all readers that saw
/// the old pointer have exited their critical sections). This is not a
/// performance concern for infrequent writes (service registration, config
/// updates, interface addition) — the cost is acceptable because these
/// operations happen at driver-load time or in response to admin actions.
/// The read path is always lock-free.
pub struct RcuCell<T: Send + Sync> {
    ptr: AtomicPtr<T>,
}

impl<T: Send + Sync> RcuCell<T> {
    /// Create a new RcuCell with an initial value. The value is heap-allocated
    /// via `Box::try_new` (fallible) and the RcuCell takes ownership of the
    /// raw pointer. Returns `Err(KernelError::OutOfMemory)` if allocation
    /// fails. The pointer is always non-null after successful construction.
    ///
    /// All RcuCell allocation is fallible. Callers must handle OutOfMemory.
    pub fn new(value: T) -> Result<Self, KernelError> {
        let ptr = Box::try_new(value).map_err(|_| KernelError::OutOfMemory)?;
        Ok(Self {
            ptr: AtomicPtr::new(Box::into_raw(ptr)),
        })
    }

    /// Read the current value. Requires an active RCU read-side critical
    /// section, proven by the `RcuReadGuard`; the returned reference is
    /// valid only for the guard's lifetime.
    pub fn read<'a>(&'a self, _guard: &'a RcuReadGuard) -> &'a T {
        // SAFETY: the pointer is non-null (set by `new()` / `update()`),
        // and the guard keeps this CPU inside a read-side critical section,
        // so deferred freeing cannot reclaim the pointee while the returned
        // reference is live. Acquire pairs with the Release swap in `update()`.
        unsafe { &*self.ptr.load(Ordering::Acquire) }
    }

    /// Atomically replace the value. The old value is scheduled for deferred
    /// freeing after all current RCU read-side critical sections complete
    /// (grace period). The caller must NOT access the old value after this
    /// call — RCU owns it and will free it asynchronously.
    ///
    /// **Writer serialization**: Takes `&self` (not `&mut self`) plus a
    /// `MutexGuard` proof token that demonstrates the caller holds the
    /// write-side lock. This is the standard RCU pattern: readers are
    /// lock-free via `read()`, writers serialize through an external mutex.
    ///
    /// **Why `&self` instead of `&mut self`**: RCU cells are typically
    /// stored in global/static data structures (routing tables, config
    /// registries, module lists) accessed concurrently by many readers.
    /// Requiring `&mut self` would force the caller to hold an exclusive
    /// reference to the entire `RcuCell`, which is impractical for global
    /// state — it would require wrapping the `RcuCell` in a `Mutex` or
    /// `RwLock` that also blocks readers, defeating the purpose of RCU.
    /// With `&self` + lock proof, the `RcuCell` can live in a shared
    /// context (e.g., `static`, `Arc`, or behind `&`), readers access it
    /// without any lock, and writers prove serialization by passing the
    /// `MutexGuard` token. The `MutexGuard` lifetime ensures the lock is
    /// held for the duration of the `update()` call but does not restrict
    /// access to the `RcuCell` itself.
    ///
    /// Concurrent writers without the lock would race on the swap: writer A
    /// swaps old->new_A, writer B swaps new_A->new_B, then writer B defers
    /// freeing new_A — which was just published and may have active readers.
    /// The `MutexGuard` proof prevents this at compile time.
    ///
    /// All RcuCell allocation is fallible. Returns `Err(KernelError::OutOfMemory)`
    /// if allocation fails; the existing value is left unchanged in that case.
    pub fn update(
        &self,
        new_value: T,
        _writer_lock: &MutexGuard<'_, ()>,
    ) -> Result<(), KernelError> {
        let new_box = Box::try_new(new_value).map_err(|_| KernelError::OutOfMemory)?;
        let old = self.ptr.swap(Box::into_raw(new_box), Ordering::Release);
        // Schedule old value for deferred freeing after grace period.
        // rcu_defer_free takes ownership of the raw pointer and will
        // reconstruct the Box and drop it after the grace period elapses.
        // SAFETY: `old` was created by a previous `Box::into_raw()` call
        // (either in `new()` or a prior `update()`). The writer lock
        // guarantees no concurrent `update()` can swap the same pointer
        // twice. After this call, the caller must not access `old`.
        unsafe { rcu_defer_free(old) };
        Ok(())
    }
}

// Implementation note on MutexGuard proof tokens: The `_writer_lock` parameter
// ensures single-writer semantics at compile time. In the actual implementation,
// the `Mutex<()>` must be embedded in the same struct that contains the `RcuCell`,
// or in a per-instance wrapper, so that each RcuCell has its own dedicated mutex.
// Passing an unrelated mutex guard would compile but violate the invariant. This is
// enforced structurally: the kernel's RCU-protected data structures always pair their
// `RcuCell` and `Mutex` in the same struct (e.g., `struct RcuProtected<T> { cell:
// RcuCell<T>, writer_lock: Mutex<()> }`), and the `update()` call site acquires the
// co-located lock. This pattern is standard in Rust kernel design (similar to how
// Linux's `struct rcu_head` is always embedded in the protected struct).
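A sketch of that co-located pairing and the clone-and-swap write from the `RcuCell` doc comment, using the kernel-internal `Mutex` and `RcuReadGuard` types (the `RcuProtected` wrapper and its `mutate` helper are illustrative, not a fixed KABI surface):

```rust
pub struct RcuProtected<T: Send + Sync + Clone> {
    cell: RcuCell<T>,
    writer_lock: Mutex<()>, // co-located write-side lock for this cell
}

impl<T: Send + Sync + Clone> RcuProtected<T> {
    /// Clone-and-swap write: serialize writers, clone the current value,
    /// apply the change, publish. Readers stay lock-free throughout.
    pub fn mutate(
        &self,
        rcu: &RcuReadGuard,
        f: impl FnOnce(&mut T),
    ) -> Result<(), KernelError> {
        let lock = self.writer_lock.lock();            // step 1: serialize
        let mut new_val = self.cell.read(rcu).clone(); // steps 2-3: clone
        f(&mut new_val);                               // step 4: modify
        self.cell.update(new_val, &lock)               // step 5: publish
    }
}
```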

// Note: `RcuCell<T>` does NOT block in its `Drop` implementation.
// Calling `rcu_synchronize()` (a blocking wait) from `Drop` would be
// unsafe in contexts where blocking is illegal: interrupt handlers, code
// executing under a spinlock, NMI handlers, or any other atomic context.
// Because `RcuCell` values can be dropped from any of these contexts
// (e.g., a global `RcuCell` freed during module unload while holding a lock),
// `Drop` uses `rcu_call()` (deferred callback) instead. The current pointer
// is enqueued for deferred freeing via the RCU callback mechanism; the actual
// `Box::drop` runs in the RCU grace period worker thread, which executes in
// a fully schedulable, non-atomic context. This matches the pattern used by
// `update()`, which also defers old-value freeing via `rcu_defer_free()`.
impl<T: Send + Sync> Drop for RcuCell<T> {
    fn drop(&mut self) {
        // SAFETY: We have &mut self (exclusive access). This guarantees no
        // concurrent writers (update() takes &self + MutexGuard, but &mut self
        // is incompatible with any shared reference). Readers may still hold
        // &T references obtained via read(); rcu_call() defers the actual
        // Box::drop until all pre-existing RCU read-side critical sections
        // complete (the grace period), ensuring no live references to the
        // pointed-to value remain before it is freed. The pointer was created
        // by Box::into_raw() in new() or update() and has not yet been passed
        // to rcu_defer_free() (only old values replaced by update() are
        // deferred there). This Drop path covers only the *current* (final)
        // value still held by the RcuCell at destruction time.
        //
        // Using rcu_call() (non-blocking enqueue) rather than rcu_synchronize()
        // (blocking wait) is mandatory here: Drop can be invoked from atomic
        // contexts (interrupt handlers, spinlock-held paths, etc.) where
        // blocking would deadlock or corrupt kernel state.
        let ptr = self.ptr.load(Ordering::Relaxed);
        if !ptr.is_null() {
            // SAFETY: ptr was created by Box::into_raw() and is non-null.
            // rcu_call takes ownership of the raw pointer and will reconstruct
            // the Box and drop it after the grace period elapses in the RCU
            // callback worker thread.
            unsafe extern "C" fn drop_box<T>(ptr: *mut ()) {
                // SAFETY: ptr was created by Box::into_raw::<T>() and is only
                // passed to this callback once, after the RCU grace period.
                drop(unsafe { Box::from_raw(ptr as *mut T) });
            }
            unsafe { rcu_call(drop_box::<T>, ptr as *mut ()) };
        }
    }
}

RCU Grace Period Detection — Hierarchical Tree-RCU:

The grace period mechanism determines when all pre-existing RCU read-side critical sections have completed, making it safe to execute deferred callbacks (such as freeing old values swapped out by RcuCell::update()).

UmkaOS uses hierarchical tree-RCU — a multi-level tree of RcuNode structures that aggregates per-CPU quiescent state reports bottom-up. This is the same fundamental design as Linux's Tree RCU (introduced in 2.6.29), adapted for UmkaOS's non-preemptible model.

Why a tree, not flat per-CPU polling? A flat array of per-CPU quiescent state flags requires the GP kthread to sequentially poll every online CPU. On a 256-CPU system with 8 NUMA nodes, even with per-node threads, each thread polls 32 CPUs sequentially — 32 remote cache line reads per GP. With a tree (fan-out 64), a single root node covers 64 leaf nodes, each covering 64 CPUs. Quiescent state propagation is O(log_{fanout}(nr_cpus)) per CPU, and the GP kthread only monitors the root node's qsmask. On 4096 CPUs this is 2 levels instead of 4096 polls.
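The depth arithmetic can be checked with a small model (the helper name is illustrative):

```rust
/// Number of tree levels needed so that a fan-out-`fanout` RcuNode tree
/// covers `nr_cpus` CPUs: one leaf level covers `fanout` CPUs, two levels
/// cover `fanout^2`, and so on.
fn rcu_tree_levels(nr_cpus: u64, fanout: u64) -> u32 {
    let mut levels = 1u32;
    let mut covered = fanout;
    while covered < nr_cpus {
        covered *= fanout;
        levels += 1;
    }
    levels
}

fn main() {
    assert_eq!(rcu_tree_levels(64, 64), 1);   // root doubles as the only leaf
    assert_eq!(rcu_tree_levels(256, 64), 2);  // the 256-CPU example above
    assert_eq!(rcu_tree_levels(4096, 64), 2); // 64 * 64 = 4096: two levels
    println!("ok");
}
```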

Data structures:

/// `WaitQueue` is an alias for `WaitQueueHead` — the type used for blocking
/// waiters on grace period completion. Using the alias keeps RCU-specific
/// code readable without importing internal queue type names.
pub type WaitQueue = WaitQueueHead;

// ---------------------------------------------------------------------------
// Hierarchical RCU Node Tree
// ---------------------------------------------------------------------------

/// One node in the hierarchical RCU tree.
///
/// The tree is built at boot time based on the actual CPU topology discovered
/// from ACPI/DT (no compile-time MAX_CPUS constant). Leaf nodes cover contiguous
/// ranges of CPUs; interior nodes aggregate their children's quiescent state.
///
/// **Cache alignment**: Each `RcuNode` is 64-byte aligned to prevent false sharing
/// between nodes on different NUMA domains. The `lock` field and `qsmask` field are
/// co-located within the same cache line because they are always accessed together.
// kernel-internal, not KABI — RcuNode contains SpinLock and raw pointers.
#[repr(C, align(64))]
pub struct RcuNode {
    /// Spinlock protecting `qsmask`, `gp_seq`, and QS-related fields in this node.
    /// Held briefly during quiescent state propagation (one atomic bit-clear + check).
    /// IRQ-saving: QS reporting can occur from scheduler tick (softirq context).
    pub lock: SpinLock<RcuNodeInner>,

    /// Parent node in the tree. `None` for the root node.
    /// Set once at boot during `rcu_build_tree()`, never modified after.
    // SAFETY: Points into the `RcuState::nodes` Box<[RcuNode]> which is
    // allocated at boot and never freed or reallocated. The pointer is
    // valid for the kernel's lifetime. Only dereferenced in
    // `rcu_report_qs_up()` with the parent node's spinlock held.
    pub parent: Option<*const RcuNode>,

    /// Index of this node in the parent's children array. Used to compute the
    /// bit position to clear in the parent's `qsmask`.
    /// Set once at boot, never modified.
    pub parent_idx: u16,

    /// Level in the tree (0 = root, increasing toward leaves).
    /// Set once at boot, never modified.
    pub level: u8,

    /// For leaf nodes: index of the first CPU covered by this node.
    /// For interior nodes: index of the first CPU covered by any descendant leaf.
    /// Set once at boot, never modified.
    pub cpu_lo: u32,

    /// For leaf nodes: index of the last CPU (inclusive) covered by this node.
    /// For interior nodes: index of the last CPU covered by any descendant leaf.
    /// Set once at boot, never modified.
    pub cpu_hi: u32,

    /// NUMA node affinity hint. The GP kthread for this NUMA node monitors
    /// the corresponding subtree. For interior nodes that span multiple NUMA
    /// nodes, this is the NUMA node of the first child.
    pub numa_node: u32,
}

/// Mutable state within an `RcuNode`, protected by `RcuNode::lock`.
pub struct RcuNodeInner {
    /// Bitmask of children (leaf CPUs or child nodes) that have **not yet**
    /// reported a quiescent state for the current grace period.
    ///
    /// - For a **leaf node**: bit N corresponds to the (N + self.cpu_lo)-th CPU.
    ///   When CPU C passes a quiescent state, bit (C - cpu_lo) is cleared.
    /// - For an **interior node**: bit N corresponds to child node index N.
    ///   When child node N's `qsmask` reaches zero, bit N in this node is cleared.
    ///
    /// When `qsmask == 0`, all descendants have reported quiescent states.
    /// For interior nodes, this triggers propagation to the parent.
    /// For the root node, `qsmask == 0` means the grace period is complete.
    ///
    /// `u64` supports up to 64 children per node (matching Linux's RCU_FANOUT).
    /// On systems with more than 64 CPUs per leaf group, increase tree depth.
    pub qsmask: u64,

    /// Grace period sequence number last seen by this node. Used to detect
    /// stale QS reports from a previous grace period (if a CPU reports a QS
    /// for an already-completed GP, the report is silently discarded).
    pub gp_seq: u64,

    /// Number of children that this node covers.
    /// For leaf nodes: number of online CPUs in range [cpu_lo, cpu_hi].
    /// For interior nodes: number of child `RcuNode` entries.
    /// Set at boot, updated on CPU hotplug.
    pub n_children: u16,

    /// Bitmask of children that are online/active. Used during GP initialization
    /// to set `qsmask` to only the bits corresponding to active children.
    /// Updated on CPU hotplug (online/offline).
    pub online_mask: u64,
}
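
// --- Illustrative sketch (hypothetical helper, not the kernel's actual code) ---
// The leaf-level bookkeeping described for `qsmask` above: CPU C maps to bit
// (C - cpu_lo); clearing the last pending bit signals that the whole subtree
// has reported, triggering propagation to the parent (or, at the root,
// completing the grace period).
fn clear_cpu_qs(qsmask: &mut u64, cpu: u32, cpu_lo: u32) -> bool {
    *qsmask &= !(1u64 << (cpu - cpu_lo));
    *qsmask == 0 // true ⇒ propagate upward / GP complete at root
}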

/// Global RCU state — one instance, allocated and initialized at boot.
pub struct RcuState {
    /// Monotonically increasing grace period sequence number.
    ///
    /// **Encoding**: The low 2 bits encode the grace period phase:
    /// - `0b00` (phase 0): idle — no grace period in progress.
    /// - `0b01` (phase 1): GP started — `qsmask` initialized on all nodes,
    ///   waiting for all CPUs to report quiescent states.
    /// - `0b10` (phase 2): GP completing — all QS reported, callbacks being
    ///   advanced, `gp_seq` about to be incremented to the next idle phase.
    /// - `0b11`: reserved (never used; provides detection of corruption).
    ///
    /// Each complete grace period advances `gp_seq` by 4 (one full cycle through
    /// idle → started → completing → idle). The upper 62 bits form the grace
    /// period number proper.
    ///
    /// *Wraparound*: at the theoretical maximum rate of one grace period per 10 μs,
    /// and with each grace period advancing the counter by 4, a `u64` counter wraps
    /// after approximately 1.5 million years. No special wraparound handling is
    /// required for `u64`.
    ///
    /// *Comparison semantics*: any code that compares two `gp_seq` values to decide
    /// whether grace period A completed before grace period B **must** use wrapping
    /// subtraction rather than a direct `<` comparison. Direct comparison is
    /// incorrect near the wraparound boundary (irrelevant for `u64` in practice,
    /// but required by design so that a future space-saving change to `u32` — where
    /// wraparound occurs in roughly three hours at 100 K GP/s — cannot silently introduce
    /// a correctness bug):
    ///
    /// ```rust
    /// /// GP sequence phase constants.
    /// pub const RCU_GP_IDLE: u64 = 0;
    /// pub const RCU_GP_STARTED: u64 = 1;
    /// pub const RCU_GP_COMPLETING: u64 = 2;
    /// pub const RCU_GP_PHASE_MASK: u64 = 0x3;
    ///
    /// /// Returns true if grace period `a` completed strictly before `b`.
    /// #[inline]
    /// fn gp_before(a: u64, b: u64) -> bool {
    ///     (b.wrapping_sub(a) as i64) > 0
    /// }
    ///
    /// /// Extract the phase from a gp_seq value.
    /// #[inline]
    /// fn gp_phase(seq: u64) -> u64 {
    ///     seq & RCU_GP_PHASE_MASK
    /// }
    /// ```
    ///
    /// All callers that compare grace period sequence numbers (e.g.,
    /// `rcu_synchronize` checking whether its target sequence has been reached,
    /// `rcu_gp_kthread` waking waiters) must use `gp_before` or an equivalent
    /// wrapping form. The same pattern is used by Linux's `time_before` /
    /// `time_after` macros for the same reason.
    pub gp_seq: AtomicU64,

    /// Flat array of all `RcuNode` entries in the tree, level by level.
    /// Index 0 = root. Indices [1..1+fan_out) = level-1 nodes. Etc.
    /// Dynamically allocated at boot based on discovered CPU count and fan-out.
    /// Never reallocated after boot (CPU hotplug updates `online_mask` and
    /// `n_children` within existing nodes but does not grow the tree).
    pub nodes: Box<[RcuNode]>,

    /// Number of levels in the tree (1 = root only, for systems with at most `leaf_fan_out` CPUs).
    pub num_levels: u8,

    /// Fan-out of the tree (children per interior node). Default: 64.
    /// Configurable at boot via `rcu.fanout=N` kernel parameter.
    /// Must be in range [2, 64] (constrained by `qsmask: u64`).
    pub fan_out: u8,

    /// Fan-out of leaf nodes (CPUs per leaf). May differ from `fan_out` on
    /// architectures where leaf fan-out is constrained by cache topology.
    /// Default: same as `fan_out`. Configurable via `rcu.leaf_fanout=N`.
    pub leaf_fan_out: u8,

    /// Array of pointers from CPU ID → leaf `RcuNode` that covers that CPU.
    /// Length = `num_possible_cpus()`. Index = CPU logical ID.
    /// Dynamically allocated at boot; never reallocated.
    // SAFETY: Each element points into the `nodes` array (same Box allocation).
    // Valid for the kernel's lifetime. Read-only after boot except during
    // CPU hotplug (which updates under the global CPU hotplug lock).
    pub cpu_to_leaf: Box<[*const RcuNode]>,

    /// Dedicated kernel thread that drives grace period progression.
    /// One per system (pinned to NUMA node 0's first online CPU).
    /// Scheduled at `SCHED_FIFO` priority 10.
    pub gp_kthread: *mut Task,

    /// Wait queue for the GP kthread. The GP kthread sleeps here between
    /// grace periods, waiting for `gp_requested` to become true. Separated
    /// from `gp_completion_wq` to prevent missed-wakeup races: the kthread
    /// uses `scheduler::unblock()` semantics, while `rcu_synchronize()`
    /// callers use `wait_event()` with a condition predicate. Conflating
    /// both on a single wait queue risks the kthread consuming a wakeup
    /// intended for a synchronize caller, or vice versa.
    pub gp_kthread_wq: WaitQueue,

    /// Wait queue for `rcu_synchronize()` callers. Tasks blocked in
    /// `rcu_synchronize()` sleep here until the GP kthread wakes them
    /// after `gp_seq` advances past their target sequence number.
    /// Wakeup: GP kthread broadcasts to all waiters whose
    /// `wait_gp_seq <= complete_seq` at GP completion (step 13).
    pub gp_completion_wq: WaitQueue,

    /// Set to `true` by `rcu_call()` or `rcu_synchronize()` when a new GP
    /// is needed. The GP kthread checks this on wakeup.
    pub gp_requested: AtomicBool,

    /// Force-quiescent-state interval in nanoseconds. Default: 1 ms.
    /// After this interval without a QS report from a CPU, the GP kthread
    /// sends that CPU a reschedule IPI.
    /// Configurable at boot via `rcu.fqs_interval_ns=N`.
    pub fqs_interval_ns: u64,
}

impl RcuState {
    /// Root of the hierarchical `RcuNode` tree. The grace period is complete
    /// when `root().lock().qsmask == 0`.
    ///
    /// Derives the root as `&self.nodes[0]` — no separate raw pointer field needed.
    /// The `nodes` array is allocated at boot and never freed or reallocated, so
    /// the returned reference is valid for the kernel's lifetime. This eliminates
    /// the redundant raw pointer (previously `pub root: *const RcuNode`) that
    /// created a dangling pointer risk if `nodes` were ever dropped.
    #[inline(always)]
    pub fn root(&self) -> &RcuNode {
        &self.nodes[0]
    }
}

/// Tree construction at boot time.
///
/// Called once from `rcu_init()`, after the CPU topology is discovered from
/// ACPI MADT / device tree / platform enumeration. The tree geometry is
/// computed from the actual number of possible CPUs and the configured fan-out.
///
/// **Algorithm** (`rcu_build_tree`):
/// ```
/// 1. nr_cpus = num_possible_cpus()  // discovered from ACPI/DT, not hardcoded
/// 2. fan_out = boot_param("rcu.fanout", default=64), clamped to [2, 64]
/// 3. leaf_fan_out = boot_param("rcu.leaf_fanout", default=fan_out)
/// 4. Compute tree geometry:
///    - num_leaf_nodes = ceil(nr_cpus / leaf_fan_out)
///    - For each level L from leaves toward root:
///        nodes_at_L = ceil(nodes_at_(L+1) / fan_out)
///      until nodes_at_L == 1 (root level).
///    - num_levels = number of levels computed.
///    - total_nodes = sum of nodes at all levels.
/// 5. Allocate nodes: Box<[RcuNode]> with `total_nodes` entries (boot allocator).
/// 6. Initialize each node:
///    - Set level, parent pointer, parent_idx, cpu_lo, cpu_hi.
///    - Leaf nodes: cpu_lo = first CPU index, cpu_hi = last CPU index in range.
///    - Interior nodes: cpu_lo/cpu_hi span the union of children's ranges.
///    - Set numa_node based on ACPI SRAT proximity domain of first covered CPU.
/// 7. Allocate cpu_to_leaf: Box<[*const RcuNode]> with nr_cpus entries.
/// 8. For each CPU C, set cpu_to_leaf[C] = pointer to its covering leaf node.
/// 9. Initialize all qsmask = 0, online_mask = 0 (CPUs are offline at this point).
/// 10. Start gp_kthread on NUMA node 0.
/// ```
///
/// **Example**: 256-CPU system, fan_out=64:
/// - Leaf level: ceil(256/64) = 4 leaf nodes, each covering 64 CPUs.
/// - Root level: 1 root node with 4 children.
/// - Total: 5 nodes, 2 levels. Tree depth = 2.
///
/// **Example**: 4096-CPU system, fan_out=64:
/// - Leaf level: ceil(4096/64) = 64 leaf nodes.
/// - Level 1: ceil(64/64) = 1 interior node.
/// - Root = that 1 node. Total: 65 nodes, 2 levels.
///
/// **Example**: 16384-CPU system, fan_out=64:
/// - Leaf: ceil(16384/64) = 256 leaf nodes.
/// - Level 1: ceil(256/64) = 4 interior nodes.
/// - Root: 1 node. Total: 261 nodes, 3 levels.
pub fn rcu_build_tree(nr_cpus: u32, fan_out: u8, leaf_fan_out: u8) -> RcuState;
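
// --- Illustrative sketch (hypothetical standalone helper, not the kernel API) ---
// The geometry computation from step 4 of `rcu_build_tree`, written out so the
// worked examples above can be checked. Returns (total_nodes, num_levels).
// (Uses u32 parameters for simplicity; the kernel fn takes u8 fan-outs.)
fn tree_geometry(nr_cpus: u32, fan_out: u32, leaf_fan_out: u32) -> (u32, u32) {
    // num_leaf_nodes = ceil(nr_cpus / leaf_fan_out)
    let mut nodes_at_level = (nr_cpus + leaf_fan_out - 1) / leaf_fan_out;
    let mut total = nodes_at_level;
    let mut levels = 1;
    // Collapse levels toward the root until a single node remains.
    while nodes_at_level > 1 {
        nodes_at_level = (nodes_at_level + fan_out - 1) / fan_out;
        total += nodes_at_level;
        levels += 1;
    }
    (total, levels)
    // 256 CPUs, fan_out 64   → (5, 2)
    // 4096 CPUs, fan_out 64  → (65, 2)
    // 16384 CPUs, fan_out 64 → (261, 3)
}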

// NOTE: RCU callback rings are per-CPU, not global. Each CPU maintains its
// own 4-segment callback pipeline in RcuPerCpu (below). This eliminates global
// serialization on the callback-enqueue path — rcu_call() only touches the
// calling CPU's next segment under a short preemption-disabled section,
// with no cross-CPU lock contention.

/// Per-segment capacity. Each RCU callback segment holds up to 4096 entries.
/// 4 segments × 4096 = 16384 outstanding callbacks per CPU. At peak callback
/// rates of ~100K/sec, this provides ~160ms of headroom per grace period.
pub const RCU_RING_CAPACITY: usize = 4096;

/// Fixed-capacity ring of RCU callbacks. Per-CPU — each CPU has four
/// segment rings (done/wait/next_ready/next), so `rcu_call()` never contends
/// with other CPUs on the enqueue path.
///
/// Capacity = 4096 entries per CPU (a typical system drains well under 256 per GP).
/// Pre-allocated at boot during per-CPU initialization (no runtime allocation).
///
/// Design: UmkaOS uses a typed, pre-allocated ring rather than Linux's intrusive
/// linked list (`rcu_head` embedded in objects). This eliminates the need for
/// `container_of` pointer arithmetic and gives a predictable allocation profile.
/// Callers pass a closure-style `(fn(*mut ()), *mut ())` pair; the receiving side
/// calls `func(data)` after the grace period. Objects being freed pre-register
/// their cleanup function at `rcu_call()` time rather than embedding a list node.
///
/// Overflow policy: if a CPU's ring is full when `rcu_call()` is invoked (4096
/// callbacks pending), the behavior depends on the calling context:
///
/// - Task context (preempt_count == 0, IRQs enabled): `rcu_call()` calls
///   `rcu_synchronize()` to block until the current grace period completes, then
///   invokes `func(data)` directly. A warning is logged to flag the condition.
///
/// - Atomic context (IRQs disabled or preempt_count > 0): blocking is forbidden.
///   `rcu_call()` writes the callback into `RcuPerCpu::overflow_buf`, a
///   `RCU_OVERFLOW_BUF_CAPACITY`-slot pre-allocated emergency buffer. The buffer
///   is drained at the next timer tick by `rcu_tick_drain_overflow()` once ring
///   space is available. If both ring and overflow buffer are full, the callback
///   is dropped and an error is logged at `log::error!` severity — a catastrophic
///   condition indicating a persistent RCU stall.
///
/// A warning is logged whenever the main ring is full, as it indicates either a
/// grace period stall or an unusually high callback production rate that may need
/// tuning (e.g., raising the grace period thread priority or reducing batch sizes).
pub struct RcuCallbackRing {
    /// Inline callback storage, allocated at boot (no runtime allocation).
    // SAFETY: Backing memory is allocated from the boot allocator (static kernel
    // lifetime). Never freed. Use a raw pointer instead of `Box` to avoid UB on drop.
    entries: *mut [RcuCallback; RCU_RING_CAPACITY],
    head: usize,
    tail: usize,
}

/// A single RCU callback entry.
pub struct RcuCallback {
    /// Cleanup function. Called with `data` after the grace period.
    func: unsafe fn(*mut ()),
    /// Opaque pointer to the object being cleaned up (e.g., raw pointer to Box contents).
    data: *mut (),
}

/// Per-CPU RCU state (stored in the per-CPU data region, zero-allocation access).
///
/// Each CPU has its own independent callback segments, quiescent state counter,
/// and nesting tracker. This per-CPU design eliminates global serialization on
/// the callback-enqueue path — `rcu_call()` only touches the local CPU's state.
pub struct RcuPerCpu {
    /// Pointer to the leaf `RcuNode` covering this CPU.
    /// Set once at boot during tree construction (`rcu_build_tree` step 8).
    /// Never modified after boot.
    // SAFETY: Points into `RcuState::nodes` (`Box<[RcuNode]>`), allocated at
    // boot and never freed or reallocated. Valid for the kernel's lifetime.
    // Dereferenced in `rcu_check_callbacks()` and `rcu_report_qs_leaf()`
    // with the leaf node's spinlock held.
    pub leaf_node: *const RcuNode,

/// Bit position of this CPU within the leaf node's `qsmask`.
/// Equal to `(cpu_id - leaf_node.cpu_lo)`. Set once at boot.
pub leaf_bit: u8,

/// Grace period sequence number last acknowledged by this CPU.
/// Compared against `RcuState::gp_seq` to determine whether a new
/// GP has started since this CPU last reported a quiescent state.
/// When `gp_seq_local != rcu_state.gp_seq`, this CPU owes a QS report.
pub gp_seq_local: u64,

/// True when this CPU still needs to report a quiescent state for the
/// current grace period. Set to `true` (Release) by the GP kthread when
/// a new GP starts (step 5a). Read (Acquire) and cleared (Relaxed) by
/// the local CPU's `rcu_check_callbacks()` (steps 2 and 4).
///
/// **AtomicBool rationale**: The GP kthread (running on a different CPU)
/// writes this field during GP initialization (step 5a), while the local
/// CPU reads it during `rcu_check_callbacks()`. A plain `bool` would be
/// a data race (undefined behavior) on weakly-ordered architectures.
/// AtomicBool with Release/Acquire ordering ensures the GP kthread's
/// write is visible to the local CPU before it checks the flag.
pub qs_pending: AtomicBool,

/// Nesting depth of active RcuReadGuards on this CPU.
pub nesting: u32,

/// 4-segment callback list partitioned by grace period generation.
///
/// The four segments form a pipeline that advances as grace periods complete:
/// - `done`: Callbacks whose grace period has completed. Ready for immediate
///   execution. Drained in softirq context by `rcu_process_callbacks()`.
/// - `wait`: Callbacks waiting for the current grace period to complete.
///   When the GP completes, `wait` → `done`.
/// - `next_ready`: Callbacks registered during the current GP. They cannot
///   be satisfied by the current GP (they need at least one more full GP).
///   When the GP completes, `next_ready` → `wait`.
/// - `next`: Callbacks being actively registered by `rcu_call()` right now.
///   When the GP completes, `next` → `next_ready`.
///
/// This 4-segment design ensures that callbacks registered during a GP
/// are never freed too early — they must wait for the *next* GP after the
/// one in which they were registered.
///
/// Each segment is a pre-allocated `RcuCallbackRing` (4096 slots per segment,
/// allocated at boot). The 4-segment advancement is O(1) — just pointer swaps.
/// **Lock ordering exemption**: `cb_segments` spinlocks are exempt from the
/// global lock level table ([Section 3.4](#cumulative-performance-budget--lock-ordering)).
/// Rationale: they are per-CPU (never acquired cross-CPU), held with IRQs
/// disabled (SpinLock does this automatically), and serialize only single-CPU
/// access to the callback pipeline. Since `rcu_call()` can be called from
/// `Drop` implementations under arbitrary lock contexts, these locks cannot
/// be assigned a single global level — they must be below ALL other locks.
/// The per-CPU + IRQ-disabled invariant ensures no deadlock: no other CPU
/// can hold this lock, and no interrupt on this CPU can preempt and re-acquire.
pub cb_segments: [SpinLock<RcuCallbackRing>; 4],

/// Emergency overflow buffer: used when the `next` ring is full AND the caller
/// is in atomic context (IRQs disabled or preempt_count > 0). Blocking
/// (`rcu_synchronize`) is forbidden in atomic context, so callbacks are staged
/// here instead. Drained at the next timer tick via `rcu_tick_drain_overflow()`.
/// Pre-allocated at boot; never heap-allocates on the enqueue path.
///
/// `u8` is sufficient for the length field: `RCU_OVERFLOW_BUF_CAPACITY` (64) < 256.
    pub overflow_buf: [MaybeUninit<RcuCallback>; RCU_OVERFLOW_BUF_CAPACITY],
    pub overflow_len: u8,
}

/// Callback segment indices for `RcuPerCpu::cb_segments`.
pub const RCU_DONE: usize = 0;
pub const RCU_WAIT: usize = 1;
pub const RCU_NEXT_READY: usize = 2;
pub const RCU_NEXT: usize = 3;
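
// --- Illustrative sketch (hypothetical, not the kernel's actual code) ---
// The O(1) advancement at GP completion described for `cb_segments`: ring
// handles rotate one stage toward `done` (wait→done, next_ready→wait,
// next→next_ready), and the just-drained `done` ring is recycled as the
// new, empty `next`. Plain integers stand in for the ring handles here.
fn advance_segments(segs: &mut [u32; 4]) {
    let recycled = segs[0];   // done: already drained by rcu_process_callbacks()
    segs[0] = segs[1];        // wait       → done
    segs[1] = segs[2];        // next_ready → wait
    segs[2] = segs[3];        // next       → next_ready
    segs[3] = recycled;       // recycled (empty) ring becomes `next`
}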

**Memory ordering requirements:**

- **On `RcuReadGuard` acquisition (`rcu_read_lock()`)**: No barrier needed. The
  read-side section is a pure software contract. Hardware TSO (x86) and
  load-acquire semantics (ARM, RISC-V) ensure that loads within the critical
  section see all stores that completed before `rcu_read_lock()` was called.
- **On `RcuReadGuard` drop (`rcu_read_unlock()`)**: Relaxed store to
  `rcu_passed_quiesce` (CpuLocal, `AtomicBool`). Relaxed ordering suffices because
  the flag is consumed only by the local CPU's `rcu_check_callbacks()` at the next
  tick — no cross-CPU visibility is needed for the flag itself. The cross-CPU
  ordering guarantee comes from the leaf node lock acquisition in
  `rcu_report_qs_leaf()`, not from this store: the lock acquisition (implicit
  Acquire barrier) ensures that all RCU-protected loads within the critical
  section are visible to other CPUs before the `qsmask` bit-clear is observed by
  the GP kthread.
- **On quiescent state propagation (`rcu_report_qs_leaf` → parent)**: Acquire load
  on the leaf node's lock acquisition (implicit in `SpinLock::lock()`). The lock
  acquisition ensures the reporting CPU's RCU-protected stores are visible before
  the `qsmask` bit-clear is observed by the GP kthread or any parent node.
- **On grace period start (GP kthread initializes the tree)**: Release store on
  `gp_seq` (via `AtomicU64::store(_, Release)`). All per-node `qsmask`
  initializations must be visible before `gp_seq` transitions to the "started"
  phase. The GP kthread acquires each node's lock to set `qsmask`, providing the
  necessary ordering per node.
- **On grace period completion (root `qsmask` reaches 0)**: Acquire load on the
  root node's `qsmask` (under lock), then a full fence (`smp_mb`) before executing
  callbacks. This ensures callbacks see all memory stores made by RCU-protected
  writers before the grace period started. The fence is issued by the GP kthread
  after confirming `root.qsmask == 0` and before draining the `done` segments.
- **On `gp_seq` reads by `rcu_synchronize()` callers**: Acquire load on
  `rcu_state.gp_seq`. Ensures the caller sees all memory stores that were ordered
  before the GP kthread's Release store to `gp_seq` at GP completion.
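As a standalone illustration of the last two points, the Release/Acquire pairing on `gp_seq` can be sketched with user-space `std` atomics (the function and field names here are hypothetical, chosen only for the demo — the kernel uses its own primitives):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// The GP-kthread side publishes completion with a Release store; a
// rcu_synchronize()-style reader observes it with an Acquire load, which
// also makes every store the publisher performed beforehand visible.
fn gp_publish_demo() -> u64 {
    let gp_seq = Arc::new(AtomicU64::new(0));
    let payload = Arc::new(AtomicU64::new(0));
    let (g, p) = (Arc::clone(&gp_seq), Arc::clone(&payload));
    let gp_kthread = thread::spawn(move || {
        p.store(42, Ordering::Relaxed); // work completed before GP end
        g.store(4, Ordering::Release);  // one full GP advances gp_seq by 4
    });
    gp_kthread.join().unwrap();
    // Acquire load pairs with the Release store above.
    assert!(gp_seq.load(Ordering::Acquire) >= 4);
    payload.load(Ordering::Relaxed) // guaranteed to observe 42
}
```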

**Quiescent state identification:**

- Quiescent points occur at: (1) context switch (the outgoing task releases all
  `RcuReadGuard`s before being descheduled — preempt_count == 0 implies no active
  RCU read-side critical section), (2) idle entry (`cpu_idle_enter()` — a CPU
  entering the idle loop has no active critical sections), (3) return to userspace
  (user code never holds kernel RCU references), and (4) explicit
  `rcu_quiescent_state()` calls in long-running kernel loops that do not hold RCU
  references.
- KABI boundary crossing constitutes an RCU quiescent state: every KABI vtable call
  entry and return is treated as a quiescent point for the calling CPU. This ensures
  that Tier 1 drivers that return from KABI calls within bounded time (enforced by the
  per-call timeout watchdog, [Section 11.4](11-drivers.md#device-registry-and-bus-management--timeouts)) cannot block RCU grace period completion
  indefinitely. Drivers that perform long-polling loops must call `rcu_quiescent_state()`
  at each poll iteration, or use the KABI polling helper `kabi_poll_wait()` which includes
  an implicit quiescent state.
- Grace period detection is batched: multiple `rcu_defer_free()` / `rcu_call()` invocations
  are coalesced into the same grace period to amortize the per-CPU reporting overhead.
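
The quiescent points above feed a per-tick check. A minimal sketch under assumed field names (`PerCpuQs` and `tick_check` are illustrative; the real `rcu_check_callbacks()` also advances callback segments and takes the leaf node lock):

```rust
/// Simplified stand-in for the relevant RcuPerCpu fields.
struct PerCpuQs {
    gp_seq_local: u64,
    qs_pending: bool,
    passed_quiesce: bool,
}

/// Returns true when this tick should report a QS up the RcuNode tree.
fn tick_check(pcpu: &mut PerCpuQs, global_gp_seq: u64) -> bool {
    if pcpu.gp_seq_local != global_gp_seq {
        // A new grace period started since our last report: we owe a QS,
        // and any quiescent point recorded before the GP began is stale.
        pcpu.gp_seq_local = global_gp_seq;
        pcpu.qs_pending = true;
        pcpu.passed_quiesce = false;
        return false;
    }
    if pcpu.qs_pending && pcpu.passed_quiesce {
        pcpu.qs_pending = false; // report at most once per GP
        return true;             // caller clears this CPU's leaf qsmask bit
    }
    false
}
```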

```rust
/// RCU read-side guard. Obtained via `rcu_read_lock()`, released via Drop.
///
/// This is a zero-cost marker type — it does not perform any atomic operations
/// or memory barriers on acquisition. The RCU read-side critical section is
/// purely a contract with the grace period detection mechanism: as long as any
/// CPU holds an `RcuReadGuard`, the current grace period cannot complete.
///
/// The guard is `!Send` because RCU read-side sections are per-CPU — the quiescent
/// state tracking (Section 3.1.1, "RCU Grace Period Detection") is CPU-local.
/// Sending an `RcuReadGuard` to another thread would allow the grace period
/// detection to miss an active reader.
///
/// # Example
/// ```rust
/// let guard = rcu_read_lock();
/// // Within this scope, any RCU-protected data can be safely read.
/// // The grace period will not complete until this guard is dropped.
/// let value = rcu_cell.read(&guard);
/// // guard dropped here; this CPU may now pass through a quiescent point
/// ```
pub struct RcuReadGuard {
    /// CPU ID on which this guard was acquired. Used for debug assertions
    /// and the KRL timeout mitigation (Section 9.2.8). Not used for
    /// grace period tracking — that is handled by per-CPU quiescent state counters.
    _cpu_id: u32,
    /// Nesting depth snapshot at acquisition time. Used only in debug builds
    /// (`debug_assert!` in `Drop`) to verify that the per-CPU nesting counter
    /// has not been corrupted between lock and unlock. In release builds, the
    /// field is retained for layout stability but is not read.
    ///
    /// The actual nesting tracking is in `CpuLocal.rcu_nesting` — the Drop
    /// impl reads that counter, not this field. This field exists solely as
    /// a cross-check.
    _nesting: u32,
    /// Marker to prevent Send/Sync auto-traits.
    _not_send: PhantomData<*const ()>,
}

impl Drop for RcuReadGuard {
    fn drop(&mut self) {
        // Decrement the per-CPU nesting counter via CpuLocal (Section 3.1.2).
        // Only the outermost guard (nesting reaches 0) needs further action.
        // Nested RCU read sections work correctly — an inner guard's drop
        // does NOT affect quiescent state while the outer section is active.
        let nesting = cpu_local::rcu_nesting_dec();
        if nesting > 0 {
            return; // Still inside an outer RCU read-side critical section.
        }

        // Outermost guard dropped — set the per-CPU "passed quiescent point"
        // flag. This is a CpuLocal boolean write (~1 cycle), NOT an immediate
        // report to the grace period machinery. The actual quiescent state
        // report is deferred to the next scheduler tick or context switch,
        // which are the natural quiescent checkpoints (see below).
        //
        // On architectures with weak memory ordering, a Release store is
        // used to prevent RCU-protected accesses from being reordered past
        // the guard's drop point.
        cpu_local::set_rcu_passed_quiesce(true);

        // === Design rationale: deferred quiescent state reporting ===
        //
        // **Why NOT report immediately on every outermost drop**:
        // The previous design called `rcu_note_quiescent_state()` here — a
        // function call + per-CPU atomic store (~5-10 cycles). On NVMe paths
        // with frequent short RCU sections (conntrack lookup, routing table
        // lookup), this adds ~5-10 cycles per I/O. Across millions of IOPS,
        // the overhead is measurable.
        //
        // **How deferred reporting works**:
        // 1. `RcuReadGuard::drop()` sets `cpu_local.rcu_passed_quiesce = true`
        //    (~1 cycle CpuLocal write, no function call, no atomic).
        // 2. The scheduler tick handler (`scheduler_tick()`, running at HZ=1000)
        //    checks the flag and, if set, calls `rcu_check_callbacks()`
        //    which propagates the QS up the RcuNode tree. This batches all
        //    quiescent state reports from the last 1ms into a single tree walk.
        // 3. `context_switch()` also checks and reports — every voluntary or
        //    involuntary context switch is a quiescent point regardless.
        // 4. `cpu_idle_enter()` reports unconditionally — idle is always quiescent.
        //
        // **Grace period stall prevention for CPU-bound kernel threads**:
        // A CPU-bound kernel thread that never context-switches would, in a
        // naive deferred model, block RCU grace periods indefinitely. UmkaOS
        // prevents this through the scheduler tick: even CPU-bound threads
        // receive timer ticks at HZ=1000 (1ms intervals). The tick handler
        // checks `rcu_passed_quiesce` and reports. This guarantees that no
        // CPU goes more than ~1ms without reporting a quiescent state, which
        // is well within acceptable grace period latency (~10-100ms typical).
        //
        // For `nohz_full` CPUs (tickless, used for RT isolation per Section 8.2.5),
        // the tick is suppressed. These CPUs report quiescent state via:
        // (a) context switches (if they occur), or
        // (b) the `rcu_nocbs` mechanism — RCU callbacks are offloaded to a
        //     designated non-isolated CPU, and the isolated CPU's quiescent
        //     state is inferred from its lack of RCU activity (matching
        //     Linux's `rcu_nocbs` behavior exactly).
        //
        // **Comparison with Linux**: In non-preemptible Linux kernels,
        // `rcu_read_unlock()` generates zero code — quiescent states are
        // inferred entirely from context switches, idle, and usermode return.
        // UmkaOS's approach now matches Linux's model: the outermost drop sets
        // a lightweight per-CPU flag, and the actual report is deferred to
        // tick/switch checkpoints. The per-flag-write cost (~1 cycle) is
        // effectively zero compared to the prior ~5-10 cycle function call,
        // while maintaining the stall-freedom guarantee via the tick handler.

        // Re-enable preemption (matching the preempt_count_inc in rcu_read_lock).
        // If preempt_count reaches 0 and need_resched is set, invoke the scheduler.
        cpu_local::preempt_count_dec_and_test_resched();
    }
}

impl !Send for RcuReadGuard {}
impl !Sync for RcuReadGuard {}

/// Acquire an RCU read-side critical section guard.
///
/// The returned guard prevents the current RCU grace period from completing
/// until it is dropped. On drop, the outermost guard re-enables preemption
/// and sets a per-CPU `rcu_passed_quiesce` flag (CpuLocal write, ~1 cycle).
/// The actual quiescent state report to the grace period machinery is deferred
/// to the next scheduler tick or context switch — see `RcuReadGuard::drop()`
/// for the full design rationale.
///
/// **Cost**: near-zero-cost (~2 instructions: increment `preempt_count` via
/// CpuLocal register, no memory barriers, no cache-line bouncing). Preemption
/// is disabled for the duration of the RCU read-side critical section
/// (non-preemptible RCU model). This means RCU readers must not sleep or
/// block — any context switch is a quiescent state by definition.
///
/// # Safety invariants
/// - Must be paired with a `Drop` (RAII pattern — cannot be leaked).
/// - Must not be held across a blocking operation (sleep, mutex acquisition)
///   unless the holder is prepared for an extended grace period latency.
/// - The KRL timeout mitigation (Section 9.2.8) enforces a maximum critical
///   section duration for KRL access to prevent DoS.
pub fn rcu_read_lock() -> RcuReadGuard {
    // Non-preemptible RCU: disable preemption by incrementing preempt_count.
    // This is ~2 instructions via CpuLocal register (read-modify-write on
    // per-CPU preempt_count), no memory barriers, no cache-line bouncing.
    // A context switch while preempt_count > 0 is prevented, so any context
    // switch is a quiescent state — the core of non-preemptible RCU.
    cpu_local::preempt_count_inc();
    let nesting = cpu_local::rcu_nesting_inc();
    RcuReadGuard {
        _cpu_id: cpu_id(),
        _nesting: nesting,
        _not_send: PhantomData,
    }
}

/// A slice wrapper that can only be dereferenced within an RCU read-side
/// critical section. Used for RCU-protected arrays (e.g., KRL revoked_keys).
///
/// This is a zero-cost newtype around a raw pointer. The `Deref` implementation
/// is gated on an `RcuReadGuard` reference, ensuring that the pointed-to data
/// cannot be accessed after its backing memory is freed by the RCU callback.
///
/// # Type parameter
/// - `T`: The element type of the slice. Typically `[u8; 32]` for hash arrays.
///
/// # Safety contract
/// - The pointer must have been obtained from memory that will remain valid
///   until at least the next RCU grace period.
/// - The `RcuReadGuard` passed to `deref()` must have been acquired after the
///   last RCU update that could have freed this memory.
/// - Callers must not hold the `Deref` result across an RCU grace period
///   boundary (e.g., must not call `rcu_synchronize()` while holding a
///   reference derived from this slice).
///
/// # Example
/// ```rust
/// pub struct KeyRevocationList {
///     pub revoked_keys: RcuSlice<[u8; 32]>,
///     pub revoked_count: u32,
/// }
///
/// fn is_revoked(krl: &RcuCell<KeyRevocationList>, fingerprint: &[u8; 32]) -> bool {
///     let guard = rcu_read_lock();
///     let krl_ref = krl.read(&guard);
///     // Deref RcuSlice within the guard's lifetime
///     let keys: &[[u8; 32]] = krl_ref.revoked_keys.deref(&guard);
///     keys[..krl_ref.revoked_count as usize]
///         .binary_search(fingerprint)
///         .is_ok()
/// }
/// ```
// kernel-internal, not KABI — RcuSlice contains raw pointer with platform-dependent size.
#[repr(C)]
pub struct RcuSlice<T> {
    ptr: *const T,
    len: usize,
}

impl<T> RcuSlice<T> {
    /// Create a new RcuSlice from a raw pointer and length.
    ///
    /// # Safety
    /// The caller must ensure that:
    /// 1. The pointer is valid and properly aligned for type `T`.
    /// 2. `ptr` points to `len` contiguous initialized elements of type `T`.
    /// 3. The memory pointed to will remain valid until at least the next RCU
    ///    grace period after the last access via this slice.
    pub unsafe fn from_raw(ptr: *const T, len: usize) -> Self {
        Self { ptr, len }
    }
}

impl<T> RcuSlice<T> {
    /// Dereference the slice within an RCU read-side critical section.
    ///
    /// The returned reference is valid only for the lifetime of the guard.
    /// Accessing the reference after the guard is dropped is undefined behavior.
    ///
    /// # Arguments
    /// - `_guard`: A reference to an `RcuReadGuard`, proving that the caller
    ///   is within an RCU read-side critical section. The guard's lifetime
    ///   bounds the returned reference.
    ///
    /// # Returns
    /// A shared reference to the underlying slice of `T` elements.
    /// For `RcuSlice<[u8; 32]>`, this returns `&[[u8; 32]]`.
    pub fn deref<'a>(&self, _guard: &'a RcuReadGuard) -> &'a [T] {
        // SAFETY: The caller has provided an RcuReadGuard, proving they are
        // within an RCU read-side critical section. The memory pointed to by
        // self.ptr was allocated by an RCU-protected update and will remain
        // valid until at least the next grace period. Since the guard prevents
        // grace period completion, the memory is valid for the guard's lifetime.
        // The len field was set at construction time and is invariant.
        unsafe { core::slice::from_raw_parts(self.ptr, self.len) }
    }
}

// RcuSlice does NOT implement Send or Sync directly. It can only be accessed
// through an RcuReadGuard, which is itself !Send. The containing struct
// (e.g., KeyRevocationList) provides the necessary Send/Sync impls when
// accessed through RcuCell, which enforces the RCU lifetime contract.

/// Queue a callback to be invoked after the current RCU grace period.
/// Safe to call from any context (including interrupt context).
/// Does NOT block.
///
/// # Implementation
/// Adds `RcuCallback { func, data }` to the calling CPU's per-CPU
/// `next` callback segment (`RcuPerCpu::cb_segments[RCU_NEXT]`). Each CPU
/// has its own independent 4096-slot ring per segment, so there is no
/// global serialization bottleneck on the enqueue path.
///
/// **Overflow policy**: If the calling CPU's `next` segment ring is full, the
/// behavior depends on the calling context:
///
/// - **Task context** (preempt_count == 0, IRQs enabled): `rcu_call()` calls
///   `rcu_synchronize()` to block until the grace period completes, then invokes
///   `func(data)` directly. The callback is guaranteed to execute before return.
///   A warning is logged to flag the overflow condition.
///
/// - **Atomic context** (IRQs disabled or preempt_count > 0): `rcu_call()` writes
///   the callback into `RcuPerCpu::overflow_buf`. The buffer is drained at the next
///   timer tick by `rcu_tick_drain_overflow()`. If the overflow buffer is also full,
///   the callback is dropped and `Err(RcuCallError::RingFull)` is returned — this
///   is a catastrophic condition (persistent RCU stall) and is logged at error level.
///
/// # Ordering guarantee
/// After `rcu_call(func, data)` returns `Ok(())`, `func(data)` will be called at
/// some future point after a complete grace period has elapsed (i.e., after every
/// CPU has passed through a quiescent state). In the task-context overflow fallback
/// case, `func(data)` has already been called by the time `rcu_call()` returns.
pub fn rcu_call(func: unsafe fn(*mut ()), data: *mut ()) -> Result<(), RcuCallError>;

rcu_call(func, data) algorithm:

rcu_call(func, data):
  1. Save caller's preemption state: was_atomic = CpuLocal::preempt_count() > 0 || irqs_disabled().
  2. Disable preemption (ensures we stay on this CPU for the duration).
  3. Get this CPU's RcuPerCpu state.
     Acquire lock: guard = cb_segments[RCU_NEXT].lock().
     // The SpinLock serializes concurrent callers on the same CPU
     // (e.g., an IRQ handler calling rcu_call while a task-context
     // rcu_call is in progress). Without this lock, two callers
     // could race on ring.push() and corrupt the ring's head pointer.
  4. If guard.is_full():
     a. Release lock: drop(guard).
     b. Log warning: "RCU callback ring full on CPU N".
     c. If was_atomic:
        // Atomic context: cannot block. Use pre-allocated overflow buffer.
        // Safe from IRQ re-entrancy: overflow_buf is per-CPU and accessed
        // with preemption disabled. If an IRQ fires here, it will see
        // irqs_disabled()=true in its own rcu_call() path and use the
        // same buffer — but only AFTER this store completes (single-CPU
        // sequential execution). No concurrent access is possible.
        if overflow_len < RCU_OVERFLOW_BUF_CAPACITY as u8:
          overflow_buf[overflow_len as usize].write(RcuCallback { func, data })
          overflow_len += 1
          // Will be drained at next tick by rcu_tick_drain_overflow().
          Re-enable preemption.
          return Ok(()).
        else:
          // Both ring and overflow buffer full: catastrophic RCU stall.
          log::error!("rcu_call: overflow buffer full in atomic context — callback dropped")
          Re-enable preemption.
          return Err(RcuCallError::RingFull).
     d. Else:
        // Task context: safe to block.
        Re-enable preemption.
        log::warn!("rcu_call: ring full, synchronizing (task context)")
        rcu_synchronize();  // Block until current grace period completes.
        unsafe { func(data) };  // Execute directly — GP elapsed, readers done.
        return Ok(()).
  5. guard.push(RcuCallback { func, data }).
  6. let len = guard.len();
     Release lock: drop(guard).
  7. If len >= RCU_BATCH_DRAIN_THRESHOLD (default: 256):
     Set rcu_state.gp_requested = true.
     Wake rcu_state.gp_kthread via scheduler::unblock().
  8. Re-enable preemption.
  9. return Ok(()).
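The three-tier overflow policy above (ring → task-context synchronize → atomic-context overflow buffer → drop) can be modeled as a small state machine. This is a toy user-space sketch with made-up capacities (4 and 2 instead of 4096 and 64) and a hypothetical `ToyRcuQueue` type — it illustrates the decision tree, not the kernel's data layout or locking:

```rust
use std::collections::VecDeque;

const RING_CAP: usize = 4;
const OVERFLOW_CAP: usize = 2;

#[derive(Debug, PartialEq)]
enum Enqueue {
    Ring,       // normal path: pushed to the `next` segment
    Overflow,   // atomic-context fallback buffer
    SyncAndRun, // task-context fallback: rcu_synchronize() + direct call
    Dropped,    // catastrophic: both ring and overflow full in atomic ctx
}

struct ToyRcuQueue {
    ring: VecDeque<u32>,
    overflow: VecDeque<u32>,
}

impl ToyRcuQueue {
    fn new() -> Self {
        Self { ring: VecDeque::new(), overflow: VecDeque::new() }
    }
    fn rcu_call(&mut self, cb: u32, was_atomic: bool) -> Enqueue {
        if self.ring.len() < RING_CAP {
            self.ring.push_back(cb);
            return Enqueue::Ring;
        }
        if !was_atomic {
            // Task context: would block in rcu_synchronize(), then run cb.
            return Enqueue::SyncAndRun;
        }
        if self.overflow.len() < OVERFLOW_CAP {
            self.overflow.push_back(cb);
            return Enqueue::Overflow;
        }
        Enqueue::Dropped
    }
    /// Model of rcu_tick_drain_overflow(): move entries back into the ring
    /// (LIFO order, matching the overflow_len-decrement discipline above).
    fn tick_drain(&mut self) {
        while !self.overflow.is_empty() && self.ring.len() < RING_CAP {
            let cb = self.overflow.pop_back().unwrap();
            self.ring.push_back(cb);
        }
    }
}

fn main() {
    let mut q = ToyRcuQueue::new();
    for i in 0..RING_CAP as u32 {
        assert_eq!(q.rcu_call(i, true), Enqueue::Ring);
    }
    assert_eq!(q.rcu_call(4, false), Enqueue::SyncAndRun); // task ctx: sync
    assert_eq!(q.rcu_call(5, true), Enqueue::Overflow);    // atomic ctx
    assert_eq!(q.rcu_call(6, true), Enqueue::Overflow);
    assert_eq!(q.rcu_call(7, true), Enqueue::Dropped);     // both full
    q.ring.clear(); // pretend a GP drained the ring
    q.tick_drain();
    assert_eq!(q.ring.len(), 2);
    println!("overflow policy ok");
}
```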

rcu_tick_drain_overflow() algorithm (called at each timer tick for CPUs with overflow_len > 0):

rcu_tick_drain_overflow():
  1. Disable preemption.
  2. Get this CPU's RcuPerCpu state.
     Acquire lock: guard = cb_segments[RCU_NEXT].lock().
     // Same SpinLock that rcu_call() takes — a concurrent rcu_call()
     // from IRQ context must not race with this drain on the ring's
     // head pointer.
  3. While overflow_len > 0 && !guard.is_full():
     a. overflow_len -= 1.
     b. cb = overflow_buf[overflow_len].assume_init_read().
     c. guard.push(cb).
  4. let len = guard.len().
     Release lock: drop(guard).
  5. If overflow_len == 0 && len >= RCU_BATCH_DRAIN_THRESHOLD:
     Set rcu_state.gp_requested = true.
     Wake rcu_state.gp_kthread via scheduler::unblock().
  6. Re-enable preemption.

/// Wait for an RCU grace period to complete (blocking).
///
/// This function blocks the calling thread until all pre-existing RCU
/// read-side critical sections have completed. Use this when you need
/// to free memory that may be referenced by RCU readers.
///
/// **MUST NOT be called from atomic context** (interrupt handler, spinlock-held,
/// preempt-disabled, or NMI). In atomic contexts, use `rcu_call()` instead.
pub fn rcu_synchronize();

rcu_seq_snap() — Compute GP target sequence number:

/// Symbolic constants for RCU grace period sequence encoding.
///
/// The low 2 bits of `gp_seq` encode the grace period phase:
///   0b00 = idle, 0b01 = started, 0b10 = completing, 0b11 = reserved.
/// Each complete GP advances `gp_seq` by 4.
///
/// `RCU_SEQ_STATE_MASK` matches Linux `kernel/rcu/tree.c` naming.
/// `RCU_GP_PHASE_MASK` is the alias used in the `gp_phase()` helper
/// (defined alongside `gp_before()` in the RcuState doc comment).
/// Both are the same value.
pub const RCU_SEQ_STATE_MASK: u64 = 0x3;
pub const RCU_GP_PHASE_MASK: u64 = RCU_SEQ_STATE_MASK;

/// Compute a snapshot value that is guaranteed to be past the end of any
/// grace period that is in progress at the time of the call, AND past the
/// end of at least one full grace period that has not yet started.
///
/// This is the correct target for `rcu_synchronize()`: the caller must wait
/// until `gp_seq` advances past the returned snapshot value.
///
/// The formula adds `2 * RCU_SEQ_STATE_MASK + 1` (= 7) to ensure that even
/// if `seq` is mid-GP (phase 1 or 2), the target rounds UP past the current
/// GP and requires one additional full GP. Masking off the low bits produces
/// a phase-0 (idle) target that can only be reached after both the current
/// in-progress GP and the next GP complete.
///
/// This matches Linux `kernel/rcu/tree.c` `rcu_seq_snap()` exactly.
///
/// Examples:
///   seq=0  (idle):    snap = (0 + 7) & !3 = 4   → wait for GP 1 to complete
///   seq=1  (started): snap = (1 + 7) & !3 = 8   → wait for GP 1 AND GP 2
///   seq=2  (completing): snap = (2 + 7) & !3 = 8 → wait for GP 2
///   seq=4  (idle):    snap = (4 + 7) & !3 = 8   → wait for GP 2
///   seq=5  (started): snap = (5 + 7) & !3 = 12  → wait for GP 2 AND GP 3
#[inline]
pub fn rcu_seq_snap(seq: u64) -> u64 {
    (seq.wrapping_add(2 * RCU_SEQ_STATE_MASK + 1)) & !RCU_SEQ_STATE_MASK
}
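The worked examples in the doc comment can be checked mechanically. This standalone snippet reproduces the constant and function verbatim and asserts each example, plus one wrap-around case (the `wrapping_add` discipline means the snapshot remains correct across u64 overflow):

```rust
pub const RCU_SEQ_STATE_MASK: u64 = 0x3;

#[inline]
pub fn rcu_seq_snap(seq: u64) -> u64 {
    (seq.wrapping_add(2 * RCU_SEQ_STATE_MASK + 1)) & !RCU_SEQ_STATE_MASK
}

fn main() {
    assert_eq!(rcu_seq_snap(0), 4);  // idle → wait for GP 1
    assert_eq!(rcu_seq_snap(1), 8);  // mid-GP → wait for GP 1 AND GP 2
    assert_eq!(rcu_seq_snap(2), 8);  // completing → wait for GP 2
    assert_eq!(rcu_seq_snap(4), 8);  // idle → wait for GP 2
    assert_eq!(rcu_seq_snap(5), 12); // mid-GP → wait for GP 2 AND GP 3
    // Wrap-around: near u64::MAX the snapshot wraps cleanly to phase 0.
    assert_eq!(rcu_seq_snap(u64::MAX - 2), 4);
    println!("rcu_seq_snap examples hold");
}
```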

rcu_synchronize() algorithm:

rcu_synchronize():
  1. Snapshot seq = rcu_state.gp_seq.load(Acquire).
  2. Compute target = rcu_seq_snap(seq).
     // rcu_seq_snap() rounds UP past the current GP to ensure that all
     // pre-existing RCU read-side critical sections have completed.
     // When called mid-GP (seq is phase 1 or 2), the snapshot targets
     // one GP beyond the current one — the current GP may have started
     // before our snapshot, so readers from before our call may not have
     // exited until the next GP completes.
  3. Set rcu_state.gp_requested.store(true, Relaxed).
     Wake rcu_state.gp_kthread via scheduler::unblock().
     // ALWAYS request a GP, regardless of current GP phase. When called
     // mid-GP, rcu_seq_snap() targets the end of the NEXT GP. If we only
     // set gp_requested when idle, and no callbacks are pending (nobody
     // called rcu_call()), the current GP completes without requesting
     // another one. Our target requires TWO GPs but only one runs —
     // deadlock. The GP kthread checks gp_requested at GP completion
     // (step 14) and starts another if set. This matches Linux's
     // rcu_gp_init() which always checks for pending work.
  4. Add current task to rcu_state.gp_completion_wq with wait_gp_seq = target.
  5. schedule() — task sleeps until woken by gp_kthread.
  6. On wakeup: verify gp_before(target, rcu_state.gp_seq.load(Acquire) + 1).
     // gp_before(a, b) returns true when (b - a) as i64 > 0, i.e., b > a
     // (wrapping-aware signed comparison). Adding 1 to gp_seq converts the
     // strict-before check into an at-or-after check:
     //   gp_before(target, gp_seq + 1) ≡ gp_seq + 1 > target ≡ gp_seq >= target.
     // Without the + 1, we would check gp_seq > target, missing the case
     // where gp_seq == target (GP completed exactly to our target).
     If not satisfied (spurious wakeup), re-sleep (go to step 5).
  7. Return (grace period completed).
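The wakeup check in step 6 hinges on `gp_before()` being a wrapping-aware signed comparison. A minimal sketch of that helper, matching the definition quoted in the step-6 comment (`(b - a) as i64 > 0`) — standalone, not the kernel's source:

```rust
#[inline]
fn gp_before(a: u64, b: u64) -> bool {
    b.wrapping_sub(a) as i64 > 0
}

fn main() {
    let target = 8u64;
    // A strict gp_seq > target check would miss gp_seq == target:
    assert!(!gp_before(target, 8));     // 8 > 8 is false
    // The + 1 in step 6 converts it into an at-or-after check:
    assert!(gp_before(target, 8 + 1));  // gp_seq == target → satisfied
    assert!(gp_before(target, 12 + 1)); // gp_seq past target → satisfied
    assert!(!gp_before(target, 4 + 1)); // gp_seq behind target → re-sleep
    // Wrapping-aware: a sequence just past u64 wrap still compares "after".
    assert!(gp_before(u64::MAX - 3, 4));
    println!("gp_before checks hold");
}
```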

3.4.2 Hierarchical Quiescent State Reporting

When a CPU passes through a quiescent state (context switch, idle entry, userspace return, explicit rcu_quiescent_state() call), the report propagates bottom-up through the RcuNode tree. This is the core mechanism that makes Tree RCU scale: each CPU only touches its leaf node's lock, and propagation only climbs the tree when a node's last child reports.

rcu_qs() — record quiescent state on current CPU (called from context switch, idle entry, userspace return, KABI boundary):

rcu_qs():
  1. Set CpuLocal::rcu_passed_quiesce = true.
  // The actual tree propagation is deferred to rcu_check_callbacks(),
  // called from the scheduler tick or context_switch(). This avoids
  // acquiring the leaf node's spinlock on every quiescent point —
  // QS reports are instead batched at the 1 ms tick granularity.

rcu_check_callbacks() — propagate QS up the tree (called from scheduler tick handler at HZ=1000, and from context_switch()):

rcu_check_callbacks():
  1. If CpuLocal::rcu_nesting != 0: return.                     // Inside RCU read-side
     // critical section — reporting QS now would allow callbacks
     // (including Box::drop) to execute while readers hold references.
     // The QS will be reported after the outermost RcuReadGuard drops.
  2. If !CpuLocal::rcu_passed_quiesce: return.                  // No QS to report.
  3. If !rcu_percpu.qs_pending.load(Acquire): return.           // No GP needs our report.
  4. CpuLocal::rcu_passed_quiesce = false.
  5. rcu_percpu.qs_pending.store(false, Relaxed).
     // Relaxed is sufficient here because this is a local-CPU-only write.
     // The GP kthread will not read this field until the next GP start,
     // at which point it writes `true` (Release), providing the ordering.
  6. rcu_percpu.gp_seq_local = rcu_state.gp_seq.load(Relaxed).
  7. Call rcu_report_qs_leaf(rcu_percpu.leaf_node, rcu_percpu.leaf_bit).

rcu_report_qs_leaf() — clear bit in leaf node and propagate (called from rcu_check_callbacks() with preemption disabled):

rcu_report_qs_leaf(leaf: &RcuNode, bit: u8):
  1. Acquire leaf.lock.
  2. If leaf.inner.gp_seq != rcu_state.gp_seq.load(Relaxed):
     // Stale report from a previous GP — discard silently.
     Release leaf.lock.
     return.
  3. Clear bit `bit` in leaf.inner.qsmask:
     leaf.inner.qsmask &= !(1u64 << bit).
  4. mask = leaf.inner.qsmask.
  5. Release leaf.lock.
  6. If mask != 0: return.  // Other CPUs in this leaf haven't reported yet.
  7. // This leaf is fully quiescent — propagate to parent.
     rcu_report_qs_up(leaf).

rcu_report_qs_up() — propagate zero-qsmask up toward root (called from rcu_report_qs_leaf() when a leaf's qsmask reaches 0):

rcu_report_qs_up(child: &RcuNode):
  node = child
  loop:
    parent = node.parent
    If parent is None:
      // We just cleared the root's qsmask to 0 — GP is complete.
      // Wake the GP kthread (it sleeps on rcu_state.gp_kthread_wq).
      scheduler::unblock(rcu_state.gp_kthread)
      return.
    idx = node.parent_idx
    Acquire parent.lock.
    If parent.inner.gp_seq != rcu_state.gp_seq.load(Relaxed):
      // Stale — GP already advanced. Discard.
      Release parent.lock.
      return.
    Clear bit `idx` in parent.inner.qsmask:
      parent.inner.qsmask &= !(1u64 << idx).
    mask = parent.inner.qsmask.
    Release parent.lock.
    If mask != 0: return.  // Other children haven't reported yet.
    node = parent
    // Continue propagating upward.
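The clear-and-climb loop above reduces to bitmask arithmetic once the locking is stripped away. The following toy model — plain `u64` masks and a flat `Vec<Node>` standing in for the lock-protected `RcuNode` tree, with the staleness checks omitted — shows the propagation on a two-leaf, four-CPU tree:

```rust
struct Node {
    qsmask: u64,
    parent: Option<(usize, u8)>, // (parent index, this node's bit in parent)
}

/// Clear `bit` in `nodes[idx]`; when a node's mask reaches zero, propagate
/// to its parent. Returns true when the root's mask reaches zero (GP done).
fn report_qs(nodes: &mut [Node], mut idx: usize, mut bit: u8) -> bool {
    loop {
        nodes[idx].qsmask &= !(1u64 << bit);
        if nodes[idx].qsmask != 0 {
            return false; // siblings still pending — stop climbing
        }
        match nodes[idx].parent {
            None => return true, // root cleared: grace period complete
            Some((p, b)) => { idx = p; bit = b; }
        }
    }
}

fn main() {
    // nodes[0] = root (2 leaf children), nodes[1..=2] = leaves (2 CPUs each).
    let mut nodes = vec![
        Node { qsmask: 0b11, parent: None },
        Node { qsmask: 0b11, parent: Some((0, 0)) },
        Node { qsmask: 0b11, parent: Some((0, 1)) },
    ];
    assert!(!report_qs(&mut nodes, 1, 0)); // CPU 0 reports; leaf 0 not empty
    assert!(!report_qs(&mut nodes, 1, 1)); // leaf 0 empties → root bit 0 clears
    assert!(!report_qs(&mut nodes, 2, 0)); // CPU 2 reports; leaf 1 not empty
    assert!(report_qs(&mut nodes, 2, 1));  // leaf 1 and root empty → GP done
    println!("grace period complete");
}
```

Note how the climb stops at the first non-empty node: in the common case a CPU touches only its leaf, which is exactly the contention property the analysis below quantifies.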

Lock contention analysis: In the common case (many CPUs reporting QS for the same GP), each CPU acquires only its leaf node's lock (contention limited to leaf_fan_out CPUs per lock, default 64). Propagation to the parent occurs only when the last CPU in a leaf group reports — so parent locks see at most one acquisition per leaf node per GP. On a 4096-CPU system (64 leaves, 1 interior node = root), the root lock sees at most 64 acquisitions per GP. This is O(num_nodes) total lock acquisitions per GP, not O(num_cpus).

3.4.3 Grace Period State Machine (rcu_gp_kthread)

A single dedicated kernel thread (rcu_gp_kthread, SCHED_FIFO priority 10, pinned to NUMA node 0's first online CPU) drives grace period progression. Unlike the previous flat model with per-NUMA-node threads, the hierarchical tree makes a single GP kthread sufficient — the tree structure distributes the QS collection work to the CPUs themselves (bottom-up propagation), so the GP kthread only needs to initialize the tree and wait for the root to clear.

Constants:

/// Starting wait time for force-quiescent-state (FQS) scan interval.
pub const RCU_FQS_INITIAL_MS: u64 = 1;
/// Maximum FQS scan interval cap.
pub const RCU_FQS_MAX_MS: u64 = 100;
/// FQS backoff multiplier between scans.
pub const RCU_FQS_BACKOFF_MULTIPLIER: u64 = 2;
/// Absolute maximum grace period wait before RCU stall warning.
/// After this duration, a warning is emitted (advisory, not fatal).
pub const RCU_STALL_WARN_MS: u64 = 10_000;
/// Default force-quiescent-state IPI interval in nanoseconds.
/// After this interval without a QS from a CPU, send it a reschedule IPI.
/// Configurable at boot via `rcu.fqs_interval_ns=N`.
pub const RCU_FQS_IPI_NS: u64 = 1_000_000; // 1 ms

/// Capacity of the per-CPU overflow buffer used by `rcu_call()` in atomic context
/// when the main `RcuCallbackRing` is full. Pre-allocated in `RcuPerCpu`; no
/// heap allocation occurs on the enqueue path. Sized to absorb bursts from short
/// interrupt storms; `rcu_tick_drain_overflow()` drains the buffer at each timer tick.
///
/// **Sizing rationale**: The overflow buffer is a last resort — only used when ALL
/// 4096 entries in the main ring are full AND the caller is in atomic context. In
/// practice, atomic-context `rcu_call()` bursts are bounded by the work done per
/// interrupt/softirq invocation: a single NAPI poll cycle processes at most ~64
/// packets, a single timer tick processes a bounded number of deferred items. The
/// 64-entry buffer covers the typical burst from any single softirq handler.
///
/// **Monitoring**: A per-CPU overflow counter is exposed via
/// `/sys/kernel/rcu/per_cpu/N/overflow_count` to detect systems approaching the
/// limit. If real-world profiling shows exhaustion, increase to 256.
pub const RCU_OVERFLOW_BUF_CAPACITY: usize = 64;

Algorithm:

rcu_gp_kthread (single kernel thread, SCHED_FIFO priority 10):

Loop:
  ╔══════════════════════════════════════════════════════════════════════╗
  ║ Phase 0: IDLE — wait for work                                      ║
  ╚══════════════════════════════════════════════════════════════════════╝
  1. Sleep on rcu_state.gp_kthread_wq until rcu_state.gp_requested == true
     (or woken by rcu_call / rcu_synchronize / rcu_report_qs_up on root).

  ╔══════════════════════════════════════════════════════════════════════╗
  ║ Phase 1: GP START — initialize tree                                ║
  ╚══════════════════════════════════════════════════════════════════════╝
  2. rcu_state.gp_requested.store(false, Relaxed).
  3. new_seq = rcu_state.gp_seq.load(Relaxed) + 1.
     // Advance from idle (phase 0) to started (phase 1).
     assert!(gp_phase(new_seq) == RCU_GP_STARTED).
  4. Initialize the tree — for each RcuNode (top-down, root first):
     a. Acquire node.lock.
     b. node.inner.qsmask = node.inner.online_mask.
        // Set bits for all online children/CPUs. Offline children are
        // already "quiescent" — they have no active readers.
     c. node.inner.gp_seq = new_seq.
     d. Release node.lock.
  5. rcu_state.gp_seq.store(new_seq, Release).
     // The Release store ensures all tree initialization (qsmask writes,
     // qs_pending stores below) are ordered after gp_seq. This is critical
     // for RCU-F06: rcu_report_qs_leaf() loads gp_seq with Relaxed and
     // compares against the leaf node's gp_seq. If gp_seq were stored
     // AFTER qs_pending, a CPU on a weakly-ordered architecture (ARM/RISC-V/PPC)
     // could observe qs_pending=true, report QS, but see a stale gp_seq in the
     // leaf node staleness check — discarding a valid QS report.
     // By storing gp_seq FIRST (Release), the subsequent qs_pending stores
     // (also Release) are ordered after gp_seq, and the qs_pending Acquire
     // load in rcu_check_callbacks() transitively orders the gp_seq visibility.
  6. For each online CPU C:
     a. rcu_percpu[C].qs_pending.store(true, Release).
     // Note: CpuLocal::rcu_passed_quiesce is NOT cleared here. The GP
     // kthread cannot write to another CPU's register-based CpuLocal field.
     // Stale rcu_passed_quiesce from a previous GP is harmless because
     // rcu_check_callbacks() gates on rcu_nesting == 0 (step 1) and
     // qs_pending == true (step 3). A stale flag causes the CPU to
     // report QS "early" for the new GP, but this is safe: if rcu_nesting
     // is 0, the CPU is genuinely outside all RCU read-side critical
     // sections at the point of the report.
     // Idle CPUs: check CpuLocal::is_idle[C]. If idle, immediately report
     // QS for CPU C by calling rcu_report_qs_leaf(cpu_to_leaf[C], C - leaf.cpu_lo).
     // Idle CPUs have no active RCU readers — idle entry is a quiescent state.

  ╔══════════════════════════════════════════════════════════════════════╗
  ║ Phase 2: WAIT — wait for root.qsmask == 0                         ║
  ╚══════════════════════════════════════════════════════════════════════╝
  7. fqs_wait = RCU_FQS_INITIAL_MS.
     total_wait = 0.
  8. Loop (force-quiescent-state scan loop):
     a. Sleep for fqs_wait ms (with 10% jitter to spread wakeups:
        jitter = rand_bounded(fqs_wait / 10 + 1), sleep fqs_wait + jitter).
     b. total_wait += fqs_wait + jitter.
     c. Check root: acquire rcu_state.root().lock.
        If root.inner.qsmask == 0: release lock, goto step 11 (GP complete).
        holdout_mask = root.inner.qsmask.
        Release root.lock.
     d. FQS scan — for each set bit B in holdout_mask:
        Walk the subtree rooted at root.children[B] to find holdout CPUs.
        For each holdout leaf node L:
          For each set bit in L.inner.qsmask:
            cpu = L.cpu_lo + bit_position.
            If CpuLocal::is_idle[cpu]:
              // Idle CPU — report QS on its behalf.
              rcu_report_qs_leaf(L, bit_position).
            Else if total_wait >= RCU_FQS_IPI_NS / 1_000_000:
              // CPU is running and has not reported QS — send reschedule IPI.
              arch::current::interrupts::send_reschedule_ipi(cpu).
              // The IPI forces a context switch, which is a quiescent state.
              // The target CPU's next scheduler tick will call
              // rcu_check_callbacks() and propagate the QS up the tree.
     e. If total_wait >= RCU_STALL_WARN_MS:
        emit_rcu_stall_warning(total_wait, holdout_mask).
        // Advisory, not fatal. Continue waiting.
     f. fqs_wait = min(fqs_wait * RCU_FQS_BACKOFF_MULTIPLIER, RCU_FQS_MAX_MS).
     g. Check root again (quick path — avoids sleeping if QS arrived during scan):
        acquire root.lock, check qsmask, release.
        If root.inner.qsmask == 0: goto step 11.
     h. Continue loop (go to step 8a).

  ╔══════════════════════════════════════════════════════════════════════╗
  ║ Phase 3: GP COMPLETE — advance callbacks and wake waiters          ║
  ╚══════════════════════════════════════════════════════════════════════╝
  11. smp_mb().
      // Full memory barrier ensures all RCU-protected stores from all CPUs
      // (which were ordered before their QS reports) are visible before
      // callbacks execute.
  12. Store complete_seq to gp_seq (Release). Broadcast RCU_SOFTIRQ to
      ALL online CPUs for per-CPU softirq advancement:
      complete_seq = (rcu_state.gp_seq.load(Relaxed) & !RCU_SEQ_STATE_MASK)
                     .wrapping_add(4).
      // Relaxed ordering is sufficient here because the preceding smp_mb() (step 11)
      // already provides the necessary ordering fence. This load only reads the
      // current gp_seq to compute the next completed sequence number; it does not
      // need to synchronize with any concurrent writer (only this kthread writes gp_seq).
      // Uses symbolic constant RCU_SEQ_STATE_MASK (= 0x3) instead of a magic number.
      // The current gp_seq is at phase 1 (started). Masking off the phase bits
      // and adding 4 advances to phase 0 (idle) of the next GP number.
      // This is equivalent to the previous `(gp_seq | RCU_GP_PHASE_MASK) + 1`
      // formulation but uses the canonical mask-and-add form for consistency
      // with `rcu_seq_snap()`.
      // wrapping_add(4) matches the wrapping_add discipline in rcu_seq_snap().
      rcu_state.gp_seq.store(complete_seq, Release).
      raise_softirq_on_all_cpus(SoftirqVec::Rcu).
      // Broadcast to ALL online CPUs, not a filtered subset. This is O(1)
      // (one IPI bitmap write to the APIC/GIC/etc.) and avoids an O(nr_cpus)
      // scan of per-CPU cb_segments[RCU_WAIT] to determine which CPUs have
      // pending callbacks. Each CPU's RCU_SOFTIRQ handler checks its own
      // gp_seq_local < gp_seq condition locally (step 3 below) and returns
      // immediately if there is no work — the cost of a no-op softirq is
      // negligible compared to the cache-line bouncing of scanning remote
      // per-CPU data.
      //
      // Note: a CPU's softirq may fire early from a stale pending flag
      // (e.g., raised during a previous GP). This is harmless: the handler
      // reads the *current* gp_seq (Acquire), so early firing either
      // advances segments correctly (if gp_seq is already updated) or
      // finds gp_seq_local == gp_seq and returns with no action.
      // No additional synchronization is needed between the gp_seq store
      // and the broadcast.
      //
      // This matches Linux's approach in rcu_gp_cleanup() which calls
      // swake_up_one_online() per leaf node and relies on each CPU's
      // rcu_core() to check locally.
  13. Wake all tasks on rcu_state.gp_completion_wq whose wait_gp_seq <= complete_seq.
  14. If rcu_state.gp_requested.load(Relaxed):
      // More callbacks arrived during this GP — start another immediately.
      Goto step 2.
  15. Goto step 1 (sleep, wait for next request).

Callback execution (rcu_process_callbacks):

rcu_process_callbacks() — softirq handler (RCU_SOFTIRQ), per-CPU:
  // Phase 1: Advance local callback segments if GP(s) have completed.
  // This replaces the previous design where the GP kthread acquired
  // remote locks on all CPUs (O(nr_cpus) remote lock acquisition).
  // Now each CPU advances its own segments in softirq context — zero
  // remote lock acquisition, proven scalable at 256+ CPUs. Matches
  // Linux's per-CPU callback advancement in rcu_core().
  //
  // Multi-stage advancement: If multiple GPs completed between softirq
  // invocations (e.g., the CPU was in a long IRQ-disabled section while
  // two GPs completed), we advance segments by the number of completed GPs.
  // Each completed GP advances the pipeline by one stage:
  //   1 GP:  WAIT→DONE, NEXT_READY→WAIT, NEXT→NEXT_READY
  //   2 GPs: WAIT→DONE (then drain), NEXT_READY→DONE, NEXT→WAIT
  // Without multi-stage advancement, callbacks in WAIT would stall until
  // the next GP completes — unnecessary delay (RCU-F14).
  1. Disable preemption.
  2. local_gp_seq = rcu_state.gp_seq.load(Acquire).
  3. gps_completed = (local_gp_seq - rcu_percpu.gp_seq_local) / 4.
     // Each GP advances gp_seq by 4 (one full phase cycle).
     // Division by 4 gives the number of complete GPs since last check.
  4. If gps_completed >= 1:
     // At least one GP completed. Advance segments.
     // First advancement: WAIT→DONE, NEXT_READY→WAIT, NEXT→NEXT_READY.
     a. Acquire rcu_percpu.cb_segments[RCU_DONE].lock.
     b. Swap DONE ↔ WAIT (ring buffer pointer swap, O(1)).
        // "Swap" means exchanging the (head, tail, data_ptr) triple of two
        // RcuCallbackRing instances — 3 pointer-sized writes per swap.
        // No individual callbacks are moved. The ring buffer backing memory
        // is identity-swapped: what was the WAIT ring becomes the DONE ring.
        // This is O(1) regardless of the number of queued callbacks.
     c. Release lock.
     d. Acquire rcu_percpu.cb_segments[RCU_WAIT].lock.
     e. Swap WAIT ← NEXT_READY (ring buffer pointer swap, O(1)).
     f. Release lock.
     g. Acquire rcu_percpu.cb_segments[RCU_NEXT_READY].lock.
     h. Swap NEXT_READY ← NEXT (ring buffer pointer swap, O(1)).
     i. Release lock.
  5. If gps_completed >= 2:
     // Second GP also completed — advance again. The previous NEXT_READY
     // (now WAIT) callbacks have also satisfied their GP requirement.
     a. Drain DONE into local batch (to be executed in Phase 2).
     b. Acquire rcu_percpu.cb_segments[RCU_DONE].lock.
     c. Swap DONE ↔ WAIT (WAIT callbacks also done now).
     d. Release lock.
     e. Acquire rcu_percpu.cb_segments[RCU_WAIT].lock.
     f. Swap WAIT ← NEXT_READY.
     g. Release lock.
     // For gps_completed >= 3: further advancement is a no-op because
     // NEXT_READY and NEXT are both empty after two rounds (no callbacks
     // could have been registered between GPs that both completed while
     // this CPU was not running softirqs). Two advancement rounds is the
     // maximum useful depth.
  6. rcu_percpu.gp_seq_local = local_gp_seq.
     // After advancement:
     //   done = callbacks ready for execution (from 1 or 2 completed GPs)
     //   wait = callbacks that need the *next* GP
     //   next_ready = callbacks registered during the most recent GP
     //   next = empty (ready for new rcu_call() registrations)
  // Phase 2: Drain and execute done callbacks.
  7. Acquire rcu_percpu.cb_segments[RCU_DONE].lock.
  8. Drain all entries from the done ring, appending to local batch.
  9. Release lock.
  10. Re-enable preemption.
  11. For each callback in the local batch:
      unsafe { (cb.func)(cb.data) };  // Typically Box::drop or dealloc.
  12. If batch_size > RCU_OFFLOAD_THRESHOLD (default: 64):
      Log advisory: high callback rate on this CPU.
      // Candidate for offloading to an rcuo kthread (rcu_nocbs mode).
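The multi-stage segment advancement (steps 4-5) is easiest to see as three pointer swaps per completed GP, with a drain of DONE between stages. A toy model — `Vec<&str>` standing in for the ring buffers, locks omitted; the real swaps exchange `(head, tail, data_ptr)` triples in O(1):

```rust
use std::mem;

struct Segments {
    done: Vec<&'static str>,
    wait: Vec<&'static str>,
    next_ready: Vec<&'static str>,
    next: Vec<&'static str>,
}

impl Segments {
    /// One completed GP: WAIT→DONE, NEXT_READY→WAIT, NEXT→NEXT_READY.
    fn advance_one(&mut self) {
        mem::swap(&mut self.done, &mut self.wait);
        mem::swap(&mut self.wait, &mut self.next_ready);
        mem::swap(&mut self.next_ready, &mut self.next);
    }
}

fn main() {
    let mut s = Segments {
        done: vec![],
        wait: vec!["a"],       // queued before the GP that just ended
        next_ready: vec!["b"], // queued during that GP
        next: vec!["c"],       // just queued
    };
    // Two GPs completed while this CPU was away: advance twice,
    // draining DONE into the local execution batch between stages.
    let mut batch: Vec<&str> = Vec::new();
    s.advance_one();
    batch.append(&mut s.done); // "a" executable after first stage
    s.advance_one();
    batch.append(&mut s.done); // "b" executable after second stage
    assert_eq!(batch, vec!["a", "b"]);
    assert_eq!(s.wait, vec!["c"]); // "c" still needs the next GP
    assert!(s.next_ready.is_empty() && s.next.is_empty());
    println!("pipeline advanced");
}
```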

3.4.4 Force-Quiescent-State (FQS) Scan

The FQS mechanism handles CPUs that are slow to report quiescent states. It runs as part of the GP kthread's wait loop (step 8d above) and uses exponential backoff to balance latency against overhead.

FQS for idle CPUs: The GP kthread reports QS on behalf of idle CPUs directly. An idle CPU is always in a quiescent state — cpu_idle_enter() guarantees no active RCU read-side critical sections. The GP kthread reads CpuLocal::is_idle[C] (an AtomicBool per CPU, set/cleared by cpu_idle_enter() / cpu_idle_exit()) and calls rcu_report_qs_leaf() for idle CPUs without sending an IPI. This avoids waking idle CPUs unnecessarily (power savings on partially loaded systems).

FQS for running CPUs: After fqs_interval_ns (default 1 ms) without a QS report, the GP kthread sends a lightweight reschedule IPI to the holdout CPU (arch::current::interrupts::send_reschedule_ipi(cpu_id)). The IPI has no dedicated handler — it forces a context switch on the target CPU by setting PREEMPT_NEED_RESCHED in the target's preempt_count. The next preempt_enable() or return-from-interrupt path on that CPU calls schedule(), which is a quiescent state. The QS propagates up the tree via the normal rcu_check_callbacks() path.

FQS tree walk: The GP kthread does not scan all CPUs sequentially. It reads the root's qsmask to identify which top-level subtrees still have holdouts, then descends only into those subtrees. On a 4096-CPU / 64-leaf system where 4000 CPUs have already reported, the FQS scan touches only the 1-2 leaf nodes covering the remaining holdout CPUs — not all 64 leaves.

FQS scan #   Wait before scan   Cumulative wait   Action
1            1 ms               1 ms              Scan holdout leaves; report idle CPUs
2            2 ms               3 ms              Same + IPI to running holdouts (if > fqs_interval_ns)
3            4 ms               7 ms              Same
4            8 ms               15 ms             Same
5            16 ms              31 ms             Same
6            32 ms              63 ms             Same
7            64 ms              127 ms            Same
8+           100 ms (capped)    227+ ms           Same; stall warning at 10 s
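The schedule above falls directly out of the FQS constants (jitter omitted). A standalone check of the backoff arithmetic:

```rust
const RCU_FQS_INITIAL_MS: u64 = 1;
const RCU_FQS_MAX_MS: u64 = 100;
const RCU_FQS_BACKOFF_MULTIPLIER: u64 = 2;

fn main() {
    let mut wait = RCU_FQS_INITIAL_MS;
    let mut total = 0u64;
    let mut schedule = Vec::new();
    for _scan in 1..=8 {
        total += wait;
        schedule.push((wait, total));
        wait = (wait * RCU_FQS_BACKOFF_MULTIPLIER).min(RCU_FQS_MAX_MS);
    }
    // Matches the table: scans 1-7 wait 1,2,4,...,64 ms; scan 8 is capped.
    assert_eq!(schedule[0], (1, 1));
    assert_eq!(schedule[6], (64, 127));
    assert_eq!(schedule[7], (100, 227));
    println!("cumulative wait after 8 scans: {} ms", total);
}
```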

3.4.5 Expedited Grace Periods

synchronize_rcu_expedited() bypasses the normal tree-based wait and forces an immediate quiescent state on all online CPUs via IPI. This is expensive (O(nr_cpus) IPIs) but provides bounded GP latency.

/// Force an immediate grace period by IPI-ing all online CPUs.
///
/// **Cost**: One IPI per online CPU + one context switch per CPU.
/// On a 256-CPU system: ~256 IPIs, ~50 μs total latency.
///
/// **Use sparingly**: This is appropriate for emergency operations
/// (module unload, CPU hotplug, OOM kill cleanup) where waiting for
/// the normal GP (10-100 ms) is unacceptable. Normal `rcu_synchronize()`
/// is preferred for all other cases.
///
/// **MUST NOT be called from atomic context.**
pub fn synchronize_rcu_expedited();

Algorithm:

synchronize_rcu_expedited():
  1. If only one CPU is online: return immediately (single CPU = always quiescent).
  2. Snapshot seq = rcu_state.gp_seq.load(Acquire).
  3. Initialize the tree (same as GP start, steps 2-6 above).
  4. For each online CPU C (excluding self):
     Send IPI_RCU_EXP to CPU C.
     // IPI handler on target CPU:
     //   a. If CpuLocal::rcu_nesting > 0: set CpuLocal::rcu_passed_quiesce = true.
     //      (CPU is in an RCU read-side section — the handler sets the flag;
     //      the section's outermost RcuReadGuard::drop will call rcu_qs(),
     //      and the next tick's rcu_check_callbacks() propagates the QS.)
     //   b. If CpuLocal::rcu_nesting == 0: call rcu_check_callbacks() directly.
     //      (CPU is not in an RCU section — report QS immediately.)
  5. Call rcu_check_callbacks() on self (report own QS).
  6. Wait for root.qsmask == 0 (busy-wait with short pause loop, no backoff).
     // Expedited GPs are rare; busy-wait is acceptable.
     // Timeout: if root.qsmask != 0 after 1 second, emit stall warning
     // and send a second round of IPIs.
  7. smp_mb().
  8. Advance callback segments (same as GP complete, steps 12-15).
  9. Return.

3.4.6 RCU Interaction with Live Kernel Evolution

During Phase B of live evolution (Section 13.18), all CPUs are halted via IPI for the atomic vtable pointer swap (~1-10 μs). No RCU grace period can complete during this window because no CPU can pass through a quiescent state while halted. RCU callbacks queued before or during Phase B are processed after all CPUs resume normal execution. The Phase B window is short enough (bounded by the stop-the-world timeout, default 100 μs) that RCU grace period stall detection (RCU_STALL_WARN_MS, 10 seconds) is never triggered. The GP kthread resumes its normal polling loop after Phase B completes and detects quiescent states from the resumed CPUs within one scheduler tick (~1 ms).

3.4.7 CPU Hotplug and the RCU Tree

When a CPU comes online or goes offline, the tree's online_mask fields must be updated to prevent the GP kthread from waiting for a CPU that will never report.

CPU online (rcu_cpu_online(cpu)):

rcu_cpu_online(cpu):
  1. leaf = rcu_state.cpu_to_leaf[cpu].
  2. bit = cpu - leaf.cpu_lo.
  3. Acquire leaf.lock.
  4. leaf.inner.online_mask |= (1u64 << bit).
  5. leaf.inner.n_children += 1.
  6. Release leaf.lock.
  7. // If a GP is in progress, this CPU's bit will be set in qsmask
     // at the next GP start. For the current GP, the newly online CPU
     // is not required to report (it wasn't online when the GP started).
  8. Initialize rcu_percpu[cpu]: gp_seq_local = rcu_state.gp_seq,
     qs_pending.store(false, Relaxed).

CPU offline (rcu_cpu_offline(cpu)):

rcu_cpu_offline(cpu):
  1. leaf = rcu_state.cpu_to_leaf[cpu].
  2. bit = cpu - leaf.cpu_lo.
  3. Acquire leaf.lock.
  4. leaf.inner.online_mask &= !(1u64 << bit).
  5. leaf.inner.n_children -= 1.
  6. If leaf.inner.qsmask & (1u64 << bit) != 0:
     // This CPU owed a QS for the current GP. Clear its bit
     // (offline = implicit quiescent state — no active readers).
     leaf.inner.qsmask &= !(1u64 << bit).
     mask = leaf.inner.qsmask.
     Release leaf.lock.
     If mask == 0:
       rcu_report_qs_up(leaf).  // Propagate to parent.
  7. Else:
     Release leaf.lock.
  8. Drain rcu_percpu[cpu].cb_segments — migrate pending callbacks to the
     current CPU's segments (the offlined CPU will never process them).
     // Migration is done under the local CPU's preemption-disabled section.
     // Callbacks are appended to the current CPU's `next` segment.
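The leaf-node mutation at the heart of steps 4-6 can be expressed as a small pure function. This is a sketch under the assumption that the masks are manipulated as plain `u64` words under the leaf lock; `leaf_offline` is an illustrative name, not a spec API.

```rust
/// Leaf-node state change for rcu_cpu_offline (steps 4-6): clear the
/// CPU's online bit, and if it still owed a quiescent state for the
/// current grace period, clear that too. Returns `true` when the leaf's
/// qsmask dropped to zero and the report must propagate to the parent
/// (the rcu_report_qs_up() call in step 6).
fn leaf_offline(online_mask: &mut u64, qsmask: &mut u64, bit: u32) -> bool {
    let m = 1u64 << bit;
    *online_mask &= !m; // step 4: CPU no longer participates in GPs
    if *qsmask & m != 0 {
        *qsmask &= !m; // offline = implicit quiescent state
        return *qsmask == 0; // propagate only when the leaf is done
    }
    false
}
```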

3.4.8 NUMA-Aware Tree Construction

The tree is constructed with NUMA affinity: leaf nodes covering CPUs on the same NUMA domain are grouped under the same interior node when possible. This ensures that the most frequent lock acquisitions (leaf-level QS reporting) hit NUMA-local locks.

rcu_build_tree NUMA heuristic:
  1. Sort CPUs by NUMA proximity domain (from ACPI SRAT or device tree).
  2. Assign leaf nodes in NUMA-domain order:
     - CPUs 0-63 on NUMA node 0 → leaf 0.
     - CPUs 64-127 on NUMA node 0 → leaf 1.
     - CPUs 128-191 on NUMA node 1 → leaf 2.
     - Etc.
  3. Group leaf nodes by NUMA domain under interior nodes:
     - If NUMA node 0 has 2 leaves and NUMA node 1 has 2 leaves,
       the root has 2 interior children (one per NUMA domain), each
       with 2 leaf children.
     - This groups NUMA-local leaves together, minimizing cross-node
       lock traffic during QS propagation.
  4. If the topology doesn't divide evenly, remaining CPUs fill partial
     leaf nodes. The last leaf node may have fewer than leaf_fan_out CPUs.
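The grouping in steps 1-4 amounts to sorting by proximity domain and chunking without crossing node boundaries. A hedged sketch (the `assign_leaves` name and the `(cpu_id, numa_node)` input shape are illustrative; the real builder consumes SRAT/DT topology directly):

```rust
/// Assign CPUs to RCU leaf nodes with NUMA affinity: sort by proximity
/// domain (step 1), then chunk each domain's CPUs into leaves of at most
/// `leaf_fan_out` CPUs (steps 2-3). A domain's last leaf may be partial
/// (step 4). Returns one (numa_node, cpus) entry per leaf, in leaf order.
fn assign_leaves(mut cpus: Vec<(u32, u32)>, leaf_fan_out: usize) -> Vec<(u32, Vec<u32>)> {
    // Input: (cpu_id, numa_node) pairs, e.g. from ACPI SRAT or DT numa-node-id.
    cpus.sort_by_key(|&(cpu, node)| (node, cpu));
    let mut leaves: Vec<(u32, Vec<u32>)> = Vec::new();
    for (cpu, node) in cpus {
        // Extend the current leaf only if it is the same node and not full;
        // otherwise start a new leaf (never mix nodes within a leaf).
        let extend = matches!(leaves.last(),
            Some((n, v)) if *n == node && v.len() < leaf_fan_out);
        if extend {
            leaves.last_mut().unwrap().1.push(cpu);
        } else {
            leaves.push((node, vec![cpu]));
        }
    }
    leaves
}
```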

Per-architecture notes:

| Architecture | NUMA discovery | Notes |
|--------------|----------------|-------|
| x86-64 | ACPI SRAT + MADT | Standard path. Intel and AMD systems with 2-8 NUMA nodes. |
| AArch64 | ACPI SRAT or DT numa-node-id | Server-class ARM (Ampere Altra, Graviton) uses ACPI. Embedded uses DT. |
| ARMv7 | Single NUMA node (UMA) | All CPUs in one leaf group. Tree depth = 1. |
| RISC-V 64 | DT numa-node-id | NUMA support varies by platform. Single-node typical today. |
| PPC32 | Single NUMA node (UMA) | All CPUs in one leaf group. |
| PPC64LE | ACPI SRAT (PowerVM) or DT (KVM) | POWER9/10 NUMA with up to 16 nodes. |
| s390x | Single NUMA node (UMA) | z/VM LPARs present as single-node. |
| LoongArch64 | ACPI SRAT | Loongson 3C5000 has 4 NUMA nodes. |

rcu_read_lock() / rcu_read_unlock() algorithms (non-preemptible RCU):

UmkaOS uses non-preemptible RCU: rcu_read_lock() disables preemption via preempt_count, so any context switch is a quiescent state. This is simpler and lower overhead than preemptible RCU (no deferred quiescent state tracking needed on the preempt path), and appropriate for a kernel targeting <5% overhead. RCU readers cannot sleep or be preempted. Grace periods are detected by quiescent state tracking (context switch, idle, user return) with bottom-up tree propagation — no synchronization between readers and the GP kthread.

rcu_read_lock():
  1. Increment CpuLocal.preempt_count (disables preemption).
  2. Increment CpuLocal.rcu_nesting.
  Cost: ~2 instructions (two CpuLocal register writes, no memory barriers,
  no cache-line bouncing). Preemption remains disabled while rcu_nesting > 0.
  (A descheduled task has by definition passed through a quiescent state,
  since context switches only occur when preempt_count == 0.)

rcu_read_unlock():
  1. Decrement CpuLocal.rcu_nesting.
  2. If rcu_nesting == 0:
       If CpuLocal.rcu_passed_quiesce == false:
         Set CpuLocal.rcu_passed_quiesce = true.
         // Do NOT acquire the leaf node's lock here.
         // The actual tree propagation is deferred to the next scheduler
         // tick or context switch via rcu_check_callbacks(). This avoids
         // a spinlock acquisition on every outermost RCU drop (critical
         // on high-IOPS paths such as NVMe/conntrack/routing).
  3. Decrement CpuLocal.preempt_count (re-enables preemption if count == 0).
  4. If preempt_count == 0 and resched_pending: call schedule().
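The counter discipline above can be modeled on a single simulated CPU. This is a sketch only: the field names mirror the CpuLocal fields, but the real kernel accesses them through a dedicated per-CPU register, and the deferred tree propagation (rcu_check_callbacks) is outside this model.

```rust
/// Minimal single-CPU model of the rcu_read_lock()/rcu_read_unlock()
/// counter discipline.
#[derive(Default)]
struct CpuLocalModel {
    preempt_count: u32,
    rcu_nesting: u32,
    rcu_passed_quiesce: bool,
    resched_pending: bool,
}

impl CpuLocalModel {
    fn rcu_read_lock(&mut self) {
        self.preempt_count += 1; // step 1: disable preemption
        self.rcu_nesting += 1;   // step 2: track reader nesting
    }

    fn rcu_read_unlock(&mut self) {
        self.rcu_nesting -= 1;
        if self.rcu_nesting == 0 && !self.rcu_passed_quiesce {
            // Outermost unlock marks the deferred QS candidate; tree
            // propagation waits for the next tick or context switch, so
            // the hot unlock path takes no leaf spinlock.
            self.rcu_passed_quiesce = true;
        }
        self.preempt_count -= 1;
        if self.preempt_count == 0 && self.resched_pending {
            // schedule() would run here in the kernel (step 4).
        }
    }
}
```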

UmkaOS Tree-RCU design choices vs. Linux:

- Hierarchical tree with runtime-discovered geometry: Tree depth and fan-out are computed at boot from actual CPU count and NUMA topology. No CONFIG_RCU_FANOUT compile-time constant — the tree adapts to hardware.
- 4-segment callback pipeline: done/wait/next_ready/next segments advance as pointer swaps (O(1)) at GP completion. Compared to Linux's segmented callback list (which uses intrusive linked-list splicing), UmkaOS uses pre-allocated ring buffers — no container_of pointer arithmetic, no per-callback heap allocation.
- Single GP kthread: The hierarchical tree makes per-NUMA-node GP threads unnecessary — CPUs propagate QS reports bottom-up through the tree, and the GP kthread only monitors the root. This reduces kthread count from O(NUMA_nodes) to 1.
- FQS tree walk: The force-quiescent-state scan descends only into subtrees with holdout CPUs (guided by the root's qsmask), not all CPUs sequentially.
- rcu_call overflow: Context-aware. Task context falls back to rcu_synchronize() + direct execution (never drops callbacks); atomic context uses a pre-allocated 64-slot overflow_buf drained at the next timer tick via rcu_tick_drain_overflow(). Both paths log warnings; only a full overflow buffer in atomic context (catastrophic RCU stall) drops a callback and logs at error level.
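The O(1) segment advance can be illustrated with a small model. This sketch uses `Vec` moves in place of the spec's pre-allocated rings (so the O(1) pointer-swap property holds, but allocation behavior does not match the kernel); the type and method names are illustrative.

```rust
/// Model of the 4-segment callback pipeline: at GP completion the
/// segments rotate by pointer swap, never by copying or re-linking
/// individual callbacks.
struct CbSegments<F> {
    done: Vec<F>,       // grace period elapsed: ready to invoke
    wait: Vec<F>,       // waiting on the current GP
    next_ready: Vec<F>, // will wait on the next GP
    next: Vec<F>,       // newly queued by rcu_call()
}

impl<F: FnMut()> CbSegments<F> {
    /// Called at GP completion: everything that was waiting has now seen
    /// a full grace period; each younger segment moves up one stage.
    fn advance(&mut self) {
        self.done.append(&mut self.wait); // wait is empty afterwards
        std::mem::swap(&mut self.wait, &mut self.next_ready);
        std::mem::swap(&mut self.next_ready, &mut self.next);
    }

    /// Invoke and drain all callbacks whose grace period has elapsed.
    fn invoke_done(&mut self) {
        for mut cb in self.done.drain(..) {
            cb();
        }
    }
}
```

A callback queued into `next` needs three advances (three observed GP boundaries) before it reaches `done`, which is what guarantees a full grace period elapsed after its enqueue.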

/// Defer a callback to run after an RCU grace period (non-blocking).
///
/// This is the preferred way to free memory from atomic contexts. The callback
/// will be invoked in the RCU worker thread context after all pre-existing
/// readers have completed.
///
/// # Safety
/// - The callback must not access any data that was freed before the callback runs.
/// - The callback runs in a kernel thread context (not interrupt context), so
///   it may block, but should complete quickly to avoid delaying other callbacks.
///
/// # Example
/// ```rust
/// // Defer Box::drop after grace period
/// let ptr = Box::into_raw(old_value);
/// unsafe { rcu_defer_free(ptr) }; // callback will reconstruct Box and drop it
/// ```
pub unsafe fn rcu_defer_free<T>(ptr: *mut T);
/// Defer Box::drop after an RCU grace period (non-blocking).
///
/// This is a convenience wrapper around `rcu_defer_free()` for the common case
/// of freeing a Box<T> after an RCU grace period.
///
/// # Safety
/// Same as `rcu_defer_free()` — ptr must have been obtained from `Box::into_raw()`.
pub unsafe fn rcu_call_box_drop<T>(ptr: *mut T) {
    // Reconstruct the Box and let it drop after the grace period.
    rcu_defer_free(ptr);
}

RCU Slice Lifetime Safety Example (KRL):

The KeyRevocationList in Section 9.3 demonstrates correct RcuSlice usage:

// Boot-time KRL allocation (bump-allocated, 'static lifetime):
let boot_krl = KeyRevocationList {
    revoked_keys: unsafe { RcuSlice::from_raw(boot_keys_ptr) },  // lives forever
    revoked_count: boot_count,
    // ... other fields
};

// Runtime KRL allocation (slab-allocated, RCU-managed):
let rt_keys = Box::try_new_in_slice(count, slab_allocator)?;
let rt_krl = Box::try_new(KeyRevocationList {
    revoked_keys: unsafe { RcuSlice::from_raw(Box::as_slice_ptr(rt_keys)) },
    revoked_count: count,
    // ... other fields
})?;
// rt_krl is published via RcuCell::update(). Old KRL (if any) is freed
// by the RCU callback after the grace period, including its revoked_keys array.
/// Compile-time lock ordering: deadlock prevention via the type system.
/// A Lock<T, 30> can only be acquired while holding a Lock<_, N> where N < 30.
/// Lock levels are spaced by 10 (0, 10, 20, ...) to allow future insertions
/// between adjacent levels without full renumbering.
/// Attempting to acquire locks out of order is a compile-time error.
///
/// **Compiler feature note**: The `Assert<{ HELD < LEVEL }>: IsTrue` trait-bound
/// pattern shown below requires `#![feature(generic_const_exprs)]` (nightly,
/// tracking issue #76560, `incomplete` status as of 2025 — not on a
/// stabilization path). The **stable alternative** is `const { assert!(HELD < LEVEL) }`
/// (inline const, stable since Rust 1.79). Implementation SHOULD use the stable
/// form; the spec uses the trait-bound form for clarity of intent. Both produce
/// identical compile-time failures on ordering violations with zero runtime cost.
pub struct Lock<T, const LEVEL: u32> {
    inner: SpinLock<T>,  // or MutexLock<T> for sleeping locks
}

impl<T, const LEVEL: u32> Lock<T, LEVEL> {
    pub fn lock<const HELD: u32>(&self, _proof: &LockGuard<HELD>) -> LockGuard<LEVEL>
    where
        Assert<{ HELD < LEVEL }>: IsTrue,
        // Stable alternative (Rust 1.79+):
        //   const { assert!(HELD < LEVEL, "lock ordering violation") }
    {
        LockGuard::new(self.inner.lock())
    }

    /// Acquiring the first lock in a chain requires no proof.
    /// Any level can be the starting lock — the constraint is that no other
    /// lock is currently held. The returned `LockGuard<LEVEL>` then constrains
    /// all subsequent lock acquisitions to levels strictly greater than LEVEL.
    ///
    /// **Enforcement**: The type system alone cannot prevent a caller from
    /// invoking `lock_first()` while already holding a `LockGuard` from a
    /// different scope. Therefore, `lock_first()` performs a **runtime check**
    /// in all build configurations: it reads the per-CPU `max_held_level`
    /// variable and panics if any lock is currently held. This is a cheap
    /// check (one per-CPU variable read + branch, ~2-3 cycles) and is always
    /// enabled — not only in debug mode. The `lock()` method's compile-time
    /// ordering guarantee (via `HELD < LEVEL`) remains the primary enforcement;
    /// the runtime check in `lock_first()` closes the only loophole.
    ///
    /// **Cross-session ABBA prevention**: Lock ordering is enforced by a global
    /// total order defined at compile time via the `LEVEL` const type parameter.
    /// Two sessions acquiring lock A (level 2) then lock B (level 5) always
    /// acquire in the same order because the type system enforces `HELD < LEVEL`
    /// at every step.
    pub fn lock_first(&self) -> LockGuard<LEVEL> {
        assert_no_locks_held(); // per-CPU max_held_level == NONE
        LockGuard::new(self.inner.lock())
    }
}

/// **Lock ordering: ZERO exceptions.**
///
/// The `Lock<T, LEVEL>` compile-time ordering system enforces a TOTAL order
/// on all lock acquisitions: a thread holding a lock at level N may only
/// acquire locks at levels > N. There are NO escape hatches, no
/// `lock_read_unchecked()`, no compile-time call-site caps, no runtime
/// fallback validators.
///
/// **History**: The spec previously required a `lock_read_unchecked()` method
/// for the page fault path, which acquired `INVALIDATE_LOCK(90, read)` under
/// `VMA_LOCK(105, read)` -- a descending-level violation. This exception was
/// eliminated by replacing `INVALIDATE_LOCK` (an rwsem) with `InvalidateSeq`
/// (a lockless seqcount). The fault path now performs two atomic loads instead
/// of acquiring a lock:
///
///   **Fault path lock chain (strictly ascending):**
///   `VMA_LOCK(105, read)` -> `PAGE_LOCK(180)` -> `PTL(185)`
///
/// No lock ordering violation. No exception needed.
///
/// This means `Lock<T, LEVEL>` is UNIVERSAL: every lock acquisition in the
/// entire kernel goes through the compile-time ordering check. Debug builds
/// additionally validate at runtime via the per-CPU `held_locks` stack
/// ([Section 3.5](#locking-strategy--runtime-lock-ordering-validation)).

Lock categories: Subsystems define named lock categories to group related lock levels and prevent cross-subsystem lock violations:

/// Named lock category for subsystem-level lock grouping.
/// Each category maps to a range of lock levels. The runtime lockdep checker
/// validates that locks from different categories are never held simultaneously
/// unless explicitly permitted in the cross-category ordering table.
#[repr(u32)]
pub enum LockCategory {
    /// Core kernel locks (scheduler run queues, memory allocator).
    Core       = 0,
    /// Filesystem and block layer locks.
    Fs         = 1,
    /// Network stack locks.
    Net        = 2,
    /// Windows Emulation Architecture (NT object manager, WEA syscalls).
    WEA        = 3,
    /// Driver subsystem locks (device registry, KABI VTable).
    Driver     = 4,
}

3.4.8.1 Lock Ordering

Lock level assignment table (authoritative; used by all Lock<T, LEVEL> instantiations):

Level 0 is reserved for locks that must be acquirable while the lock-ordering checker is active and before any subsystem lock is held. Locks at level 0 are never acquired while holding another level-0 lock; they are independent entry points into the lock graph. FUTEX_BUCKET is the primary example: futex_wake() acquires a bucket lock and then calls scheduler::enqueue(), which acquires RQ_LOCK (level 50). The bucket lock must therefore be below TASK_LOCK (level 20) to keep the acquisition order valid.

PI futex and RT_MUTEX: Priority-inheritance (PI) futex operations acquire FUTEX_BUCKET (level 0) to locate the waiter, then walk the priority inheritance chain via RT_MUTEX (level 10) to propagate priority boosting. The RT_MUTEX lock protects the PI waiter tree and the priority inheritance chain; it must sit between FUTEX_BUCKET and TASK_LOCK because PI chain walking reads (but does not modify) the task struct, and may call rt_mutex_adjust_prio() which acquires TASK_LOCK. The ordering is therefore: FUTEX_BUCKET(0) -> RT_MUTEX(10) -> TASK_LOCK(20).

Levels 30-40 cover per-task sub-locks (SIGHAND_LOCK < FDTABLE_LOCK), taken after the task lock but before the scheduler's run queue lock. Levels 70-110 cover capability and memory management locks (CAP_TABLE_LOCK < VM_LOCK < ADDR_SPACE_LOCK), reflecting the invariant that capability validation precedes address space mutation, and VMA-level decisions precede page table modifications. Filesystem locks (FS_SB_LOCK < INODE_LOCK < DENTRY_LOCK) follow the same outer-to-inner principle: superblock state is acquired before per-inode operations, which in turn precede dentry cache manipulation.

Lock levels are spaced by 10 (0, 10, 20, ..., 270) with intermediate values (e.g., 105, 125) used for locks that must nest between adjacent primary levels. This provides 9 insertion points between any two adjacent levels for future additions.

| Level | Lock Name | Subsystem | Category |
|-------|-----------|-----------|----------|
| 0 | FUTEX_BUCKET | FutexBucket spinlock | Core |
| 0 | IRQ_DESC_LOCK | Interrupt descriptor | Core |

Level-0 non-co-holding proof: FUTEX_BUCKET and IRQ_DESC_LOCK both use level 0. The compile-time ordering system rejects Lock<T, 0> -> Lock<U, 0> acquisition (0 < 0 is false). These two locks are never co-held in any code path: FUTEX_BUCKET is taken by futex syscalls (process context only); IRQ_DESC_LOCK is taken by interrupt setup/teardown and irq_desc access. No futex code path touches IRQ descriptors, and no IRQ setup code path touches futex hash buckets.

| Level | Lock Name | Subsystem | Category |
|-------|-----------|-----------|----------|
| 10 | RT_MUTEX | SpinLock — per-RtMutex waiter tree lock — protects the priority inheritance waiter tree and PI chain. Acquired by futex_lock_pi() after FUTEX_BUCKET to walk and adjust the PI chain. Released before acquiring TASK_LOCK for priority adjustment. | Core |
| 20 | TASK_LOCK | Per-task struct | Core |
| 30 | SIGHAND_LOCK | SpinLock<SignalHandlers>::lock — serializes sigaction() and signal delivery across threads sharing a signal handler table. Held briefly; never taken with IRQs disabled. | Core |
| 40 | SIGLOCK | SpinLock — per-task signal queue lock — serializes signal delivery, pending mask updates, and group stop state. Acquired under SIGHAND_LOCK(30) during signal delivery; chains to RQ_LOCK(50) via try_to_wake_up() when waking the target task. IRQ-safe. | Core |
| 40 | FDTABLE_LOCK | SpinLock<FdTable>::inner — serializes fd alloc/close/dup within a shared file descriptor table. Held briefly for O(1) fd operations. Mutually exclusive with SIGLOCK — both at level 40 means the compiler rejects holding both simultaneously (40 < 40 is false). Never co-held in any Linux or UmkaOS code path. | Core |
| 45 | PI_LOCK | Priority inheritance chain — protects HeldMutexes list for PI chain walking. Below RQ_LOCK(50) so that try_to_wake_up() can acquire PI_LOCK then RQ_LOCK without inversion. | Core |
| 50 | RQ_LOCK | Scheduler run queue | Core |
| 70 | CAP_TABLE_LOCK | Capability table write lock — serializes capability slot insert/remove/revoke across a task's capability space. Revocation invariant: during cap_revoke(), the per-CapEntry children spinlock is held for at most O(256) iterations per workqueue item. No recursive spinlock acquisition — each delegation tree level is processed by a separate workqueue item (Section 9.1). | Core |
| 80 | I_RWSEM | RwLock — per-inode read-write semaphore — serializes file read/write/truncate/fallocate. This is the most contended filesystem lock. Ordering: I_RWSEM < VM_LOCK in the truncate path (truncate takes I_RWSEM(write) then VM_LOCK(write) to unmap pages). The page fault path does NOT acquire I_RWSEM — truncation-fault coordination is provided by InvalidateSeq (Section 4.8), a lockless seqcount. The fault path's lock chain is strictly ascending: VMA_LOCK(105, read) -> PAGE_LOCK(180) -> PTL(185). No lock ordering exception is needed. Truncate holds I_RWSEM(write) and increments InvalidateSeq before mutating the page cache; faults detect the mutation via two atomic seq loads and retry. | Fs |
| 100 | VM_LOCK | VMA tree write lock (mmap_lock equivalent) — protects the per-process virtual memory area tree during mmap/munmap/fault handling | Core |
| 105 | VMA_LOCK | RwLock<()> — per-VMA lock (Vma::vm_lock). mmap_lock nests outside vm_lock. Page fault fast path acquires only vm_lock.read(). VMA modification acquires mmap_lock.write() then vm_lock.write(). A thread holding vm_lock.read() must never acquire mmap_lock in any mode. See Section 4.8. | Core |
| 110 | ADDR_SPACE_LOCK | Address space page table lock — serializes page table modifications (map/unmap/protect) within a single address space; acquired under VM_LOCK | Core |
| 120 | BUDDY_LOCK | Per-NUMA buddy allocator | Core |
| 125 | SLAB_DEPOT_LOCK | SpinLock — per-SlabCache magazine depot lock — serializes depot full/empty magazine exchange on the magazine-miss slow path. Ordering: BUDDY_LOCK(120) < SLAB_DEPOT_LOCK(125) < SLAB_LOCK(130). The free slow path may acquire depot lock first, then SLAB_LOCK (via drain_magazine_to_slabs()). See Section 4.3 for the complete slab lock ordering chain. | Core |
| 130 | SLAB_LOCK | SpinLock — per-NUMA slab partial list (node_partial) — serializes slab insertion/removal from the per-node partial list. Acquired after SLAB_DEPOT_LOCK on the free slow path (drain path). The alloc slow path drops this lock before calling slab_grow() to avoid self-deadlock (slab_grow() re-acquires it internally). | Core |
| 140 | WORKQUEUE_LOCK | Per-workqueue drain / flush serialization | Core |
| 150 | FS_SB_LOCK | Per-superblock lock — protects superblock-level state (mount flags, fs-wide counters, journal commit); acquired before per-inode locks during mount/unmount and fs-wide operations | Fs |
| 160 | INODE_LOCK | Per-inode metadata SpinLock (VFS) — protects inode attributes (size, timestamps, link count). Distinct from I_RWSEM which serializes I/O operations. INODE_LOCK is a SpinLock for quick metadata updates; I_RWSEM is an RwLock for I/O serialization. | Fs |
| 170 | WRITEBACK_LOCK | SpinLock — per-inode writeback state lock — serializes writeback initiation and I/O completion callbacks. Acquired under FS_SB_LOCK during sync, after INODE_LOCK during per-inode writeback flush. Never held across allocation (no GFP_NOFS concern). | Fs |
| 175 | DSM_FETCH_COMPLETION | WaitQueue — DSM remote page fetch completion. Faulting thread blocks here while waiting for RDMA data arrival. Acquired after VMA_LOCK(105, read), before PAGE_LOCK(180). DSM fault path lock chain: VMA_LOCK(105, read) -> DSM_FETCH_COMPLETION(175) -> PAGE_LOCK(180) -> PTL(185). TASK_KILLABLE wait (only SIGKILL interrupts). See Section 6.12. | DSM |
| 180 | PAGE_LOCK | Per-page lock (page cache) — acquired after WRITEBACK_LOCK during writeback submission, and after INODE_LOCK during per-inode dirty page iteration. The sync path nesting is: FS_SB_LOCK(150) -> INODE_LOCK(160) -> WRITEBACK_LOCK(170) -> PAGE_LOCK(180). In the page fault path: VMA_LOCK(105, read) -> PAGE_LOCK(180) -> PTL(185) — strictly ascending, zero exceptions. In the DSM fault path: VMA_LOCK(105, read) -> DSM_FETCH_COMPLETION(175) -> PAGE_LOCK(180) -> PTL(185). | Fs |
| 185 | PTL | SpinLock — per-page-table-page lock (Page Table Lock) — serializes concurrent PTE modifications within a single page table page. Acquired by the fault handler (file fault, COW fault) after PAGE_LOCK and before writing the PTE. Also acquired by unmap_mapping_range() during truncation PTE zapping, and by try_to_unmap() during page reclaim. Leaf-level lock: no further locks acquired under PTL. The fault path nesting is: VMA_LOCK(105, read) -> PAGE_LOCK(180) -> PTL(185). | Core |
| 190 | DENTRY_LOCK | Dentry cache per-entry lock — protects dentry reference counts, parent/child linkage, and name hash chain membership | Fs |
| 200 | MOUNT_LOCK | Mount table | Fs |
| 210 | HIERARCHY_LOCK | RwLock<()> — cgroup hierarchy lock — serializes cgroup creation and deletion (write-lock) across the cgroup tree. Task migration holds a read-lock (concurrent migrations allowed). Acquired before per-cgroup subsystem locks. Never held across filesystem operations (no GFP_NOFS concern — cgroup ops do not allocate from filesystem-backed paths). Cgroup task migration protocol: The two-phase release is MANDATORY — failure to release HIERARCHY_LOCK before acquiring RQ_LOCK(50) is a deadlock (level 210 > 50, compile-time rejected by the Lock<T, LEVEL> mechanism). Protocol: (1) record migration intent and update cgroup membership under HIERARCHY_LOCK (read), (2) release HIERARCHY_LOCK, (3) acquire RQ_LOCK to perform the actual runqueue dequeue/enqueue and GroupEntity transfer. The task's cgroup_migration_state: AtomicU8 (CgroupMigrationState enum: None/Migrating/Complete) prevents concurrent migrations of the same task between phases. See Section 17.2 for the full migration protocol specification. | Container |
| 220 | EVM_LOCK | RwLock — per-superblock EVM (Extended Verification Module) lock — serializes xattr integrity verification and HMAC recomputation. Read-held during file open (IMA check); write-held during xattr update. Must be after INODE_LOCK (xattr ops hold inode lock). | Security |
| 230 | SOCK_LOCK | Per-socket | Net |
| 240 | CONNTRACK_BUCKET | Conntrack bucket | Net |
| 250 | DEV_REG_LOCK | Device registry | Driver |
| 260 | VTABLE_LOCK | KABI vtable swap | Driver |
| 270 | EFI_RUNTIME_LOCK | UEFI runtime call serialization — leaf, IRQs disabled. Cold-path only (variable reads, time set, reboot). Never nested. | Boot |

Sleeping Mutex Ordering (separate from SpinLock levels — sleeping Mutexes use Mutex<T>, not Lock<T, LEVEL>, and are acquired only in process context):

| Order | Mutex Name | Subsystem | Constraint |
|-------|------------|-----------|------------|
| 1 (outermost) | EVOLUTION_MUTEX | Live evolution | Must be outermost — no lock may be held when acquiring. Held for the entire Phase A/A'/B/C evolution sequence. |
| 2 | KABI_REGISTRY_MUTEX | KABI registry | Acquired AFTER EVOLUTION_MUTEX for write-side updates to the global KABI service registry during component swap. |
| 3 | OOM_LOCK | OOM killer | Global OOM serialization Mutex (static OOM_LOCK: Mutex<()>). Ensures only one OOM kill sequence runs at a time — both global and per-cgroup OOM paths acquire this single global lock. Acquired after mmap_lock (read or write) — the allocation path may hold mmap_lock when triggering OOM. The OOM Mutex does not conflict with EVOLUTION_MUTEX — they operate in non-overlapping code paths. See Section 4.5. |

Sleeping Mutexes and SpinLocks do not directly nest (a sleeping Mutex must not be acquired with a SpinLock held, because Mutex::lock() may sleep). The Lock<T, LEVEL> compile-time system only covers SpinLocks. The sleeping Mutex ordering is documented here for implementer reference.

SpinLock<T> vs Lock<T, LEVEL> policy: Use Lock<T, LEVEL> (ordered) for ALL locks that participate in cross-subsystem locking paths — this is the default. Bare SpinLock<T> (unordered) is permitted ONLY for:

1. Per-CPU locks held exclusively with IRQs disabled (e.g., RCU cb_segments), where cross-CPU acquisition is structurally impossible.
2. Subsystem-internal leaf locks (see below) that are never held across subsystem boundaries and never nest with any ordered lock.

Every bare SpinLock<T> must have a comment justifying why it does not use Lock<T, LEVEL>. If in doubt, use Lock<T, LEVEL> — the compile-time check is free at runtime.

This table will be extended as subsystems are implemented. The ordering invariant is: a thread holding a lock at level N may only acquire locks at levels > N. Cross-category acquisitions (e.g., Core lock then Fs lock) follow the same rule — the numeric level is the sole ordering criterion.
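The ascending-level invariant can be demonstrated in stable Rust using the inline-const form noted earlier. This is a deliberately simplified model, not the spec's type: `Lvl` and `Held` are hypothetical names, and `std::sync::Mutex` stands in for the kernel's SpinLock. A descending acquisition fails to compile ("lock ordering violation") at monomorphization time.

```rust
use std::sync::{Mutex, MutexGuard};

/// Simplified stand-in for the spec's Lock<T, LEVEL>, using the stable
/// `const { assert!(...) }` form (Rust 1.79+) instead of
/// `generic_const_exprs`.
struct Lvl<T, const LEVEL: u32> {
    inner: Mutex<T>,
}

/// Witness that a lock at `LEVEL` is currently held.
struct Held<'a, T, const LEVEL: u32>(MutexGuard<'a, T>);

impl<T, const LEVEL: u32> Lvl<T, LEVEL> {
    /// First lock in a chain: no ordering proof required.
    fn lock_first(&self) -> Held<'_, T, LEVEL> {
        Held(self.inner.lock().unwrap())
    }

    /// Acquire while holding a strictly lower-level lock. The inline
    /// const evaluates per instantiation, so HELD >= LEVEL is rejected
    /// at compile time with zero runtime cost.
    fn lock<U, const HELD: u32>(&self, _proof: &Held<'_, U, HELD>) -> Held<'_, T, LEVEL> {
        const { assert!(HELD < LEVEL, "lock ordering violation") }
        Held(self.inner.lock().unwrap())
    }
}
```

For example, the FUTEX_BUCKET(0) -> RT_MUTEX(10) -> TASK_LOCK(20) chain from Section 3.4.8.1 type-checks, while swapping any two acquisitions does not compile.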

Subsystem-internal locks not listed above (e.g., per-NIC TX queue lock, per-pipe buffer lock, per-timer wheel bucket lock) are considered subsystem-internal: they are only acquired within a single subsystem and never held across subsystem boundaries. These do not need global level assignments — their ordering is enforced within the subsystem's own code. Only locks that participate in cross-subsystem acquisition chains require a level in this table.

Exempt locks: RcuPerCpu::cb_segments spinlocks are explicitly exempt from this table. They are per-CPU, held with IRQs disabled, and callable from rcu_call() under arbitrary lock contexts (including Drop implementations). See the cb_segments doc comment for the full rationale.

Known cross-subsystem locks not yet assigned levels (to be added as subsystems are implemented): TcpCb.lock (networking), run_lock / memslots_update_lock / io_bus_lock (KVM), Process::lock (process), wb.list_lock (writeback), xa_lock (XArray), MapleTree::write_lock (VMA), IPC message queue locks, audit ring lock, f_pos_lock (file position), capability space locks. These locks may participate in cross-subsystem chains and will be assigned levels as their nesting relationships are fully specified during implementation.

Known limitation — total ordering only: The const-generic approach enforces a total order on lock levels (level 0 < level 1 < level 2 < ...). This prevents deadlocks caused by circular lock chains, but it cannot express a partial order where two locks at the same conceptual level are safe to acquire together (because they protect independent subsystems). In practice, this means some subsystems that Linux allows to lock "in parallel" must be assigned adjacent but distinct levels in UmkaOS, potentially over-constraining the lock graph. If this becomes a scalability issue, the fallback is to introduce a lock_independent() API that takes two locks at the same level with a static proof that their domains are disjoint (e.g., per-CPU locks on different CPUs). For now, the total-order approach covers all known subsystem interactions, and runtime lockdep (debug-mode only) validates that no partial-order case is missed.

3.5 Locking Strategy

UmkaOS eliminates all "big kernel lock" patterns that plague Linux scalability:

| Linux Problem | UmkaOS Solution |
|---------------|-----------------|
| RTNL global mutex | Per-table RCU + per-route fine-grained locks |
| dcache_lock contention | Per-directory RCU + per-inode locks |
| zone->lock on NUMA | Per-CPU page lists + per-NUMA-node pools |
| tasklist_lock | RCU-protected process table + per-PID locks |
| files_lock (file table) | Per-fdtable RCU + per-fd locks |
| inode_hash_lock | Per-bucket RCU-protected hash chains |

3.5.1 Locking Primitive Types

UmkaOS defines four concrete locking types used throughout all subsystems. These are the actual implementations underlying Lock<T, LEVEL> (Section 3.4) and all per-subsystem locks in the lock level table above.

3.5.1.1 RawSpinLock

A bare spinlock that disables preemption but does not save or restore IRQ state. Modeled after Linux raw_spinlock_t. Suitable for: scheduler internals (runqueue locks), interrupt-handler paths where IRQs are already disabled, and short critical sections where the caller manages IRQ state explicitly. No RAII guard — the lock is manually acquired and released, because a guard cannot enforce the caller's IRQ context.

/// Bare spinlock. Disables preemption on acquire; does NOT save/restore IRQ state.
///
/// # Safety
/// The caller is responsible for IRQ state. This type is correct only when:
/// - Called from an interrupt handler (IRQs already disabled), OR
/// - The caller has explicitly disabled IRQs (holds an `IrqDisabledGuard`), OR
/// - The critical section provably contains no IRQ-sensitive operations.
///
/// For general use, prefer `SpinLock<T>` which manages IRQ state automatically.
///
/// # Architecture-specific spinlock algorithm (all provide starvation-free acquisition)
///
/// - **x86-64**: Queued spinlock (MCS-like). Eliminates cache-line bounce on high-contention
///   paths; all CPUs spin on CPU-local memory. Uncontended path identical to test-and-set cost.
///
/// - **AArch64**: Queued spinlock (LDAXR/STLXR LL/SC pairs). AArch64 LL/SC is efficient for
///   the uncontended case and scales better than ticket under high contention.
///
/// - **ARMv7**: Ticket lock (LDREX/STREX on 32-bit word). ARMv7's reservation model does not
///   efficiently support MCS queue-node allocation; ticket provides fairness with lower overhead.
///
/// - **RISC-V**: Queued spinlock (LR/SC reservation pairs). Uses the MCS-derived qspinlock
///   algorithm for starvation-free acquisition on multi-hart systems. RISC-V LR/SC reservation
///   mechanism naturally supports the per-CPU queue-node spinning pattern. Matches Linux
///   `CONFIG_RISCV_COMBO_SPINLOCKS` default (runtime qspinlock/ticket selection); UmkaOS uses
///   qspinlock unconditionally for fairness.
///
/// - **PPC32**: Ticket lock (lwarx/stwcx. on 32-bit word). PPC32's reservation model is
///   32-bit word aligned; ticket fits naturally and provides FIFO fairness. **Intentional
///   UmkaOS improvement**: Linux PPC32 uses a simple test-and-set spinlock (`simple_spinlock.h`)
///   which is unfair under contention. UmkaOS uses a ticket lock for starvation prevention.
///
/// - **PPC64LE**: Queued spinlock (lwarx/stwcx with 64-bit). POWER cores benefit from
///   MCS-style local spinning on NUMA-distant workloads under high contention.
///
/// # Hold Time Budget
///
/// RawSpinLock critical sections **must complete within 10 microseconds**.
/// This constraint feeds the 50us worst-case interrupt latency guarantee
/// ([Section 3.4](#cumulative-performance-budget)):
///
/// | Component | Budget |
/// |-----------|--------|
/// | Spinlock hold (worst case) | 10 us |
/// | Interrupt dispatch (vector lookup + context save) | 5 us |
/// | IRQ handler first-level (acknowledge + enqueue) | 10 us |
/// | Margin (cache misses, cross-NUMA, contention) | 25 us |
/// | **Total worst-case interrupt latency** | **50 us** |
///
/// Subsystems requiring longer critical sections must use `Mutex<T>` (which
/// allows preemption and sleeping). Debug builds assert hold time via
/// `lock_stat` timestamp comparison (warn if > 10us, panic if > 50us in
/// `CONFIG_LOCKDEP` mode).
pub struct RawSpinLock {
    /// Lock state word. Encoding is algorithm-specific per architecture (see above).
    /// All architectures guarantee: 0 = unlocked; non-zero = locked (or queued).
    state: AtomicU32,
}

impl RawSpinLock {
    /// Acquire the lock (spin-wait). Disables preemption. Does NOT touch IRQs.
    ///
    /// # Safety
    /// See struct-level safety note. The caller must ensure IRQ state is correct.
    pub unsafe fn lock(&self);

    /// Release the lock. Re-enables preemption.
    ///
    /// # Safety
    /// Must be called by the same CPU that called `lock()`.
    pub unsafe fn unlock(&self);

    /// Try to acquire the lock once (non-blocking). Returns `true` if acquired.
    ///
    /// # Safety
    /// See `lock()`.
    pub unsafe fn try_lock(&self) -> bool;
}
3.5.1.1.2 SpinLock<T>

RAII spinlock that saves and restores IRQ state on acquire/release. This is the correct default for any data shared between normal kernel context and interrupt handlers. Equivalent to Linux spinlock_t (acquired with spin_lock_irqsave). The protected data T is only accessible through the returned guard, ensuring the lock is always held when the data is accessed.

/// IRQ-saving spinlock. Disables preemption AND saves/restores IRQ state.
///
/// This is the correct default for data shared with interrupt handlers.
/// Acquiring this lock is safe from any context (normal kernel, softirq, hardirq).
///
/// The protected value `T` is accessible only through `SpinLockGuard<'_, T>`,
/// which re-enables IRQs and preemption on drop.
pub struct SpinLock<T> {
    inner: RawSpinLock,
    data: UnsafeCell<T>,
}

/// Guard returned by `SpinLock::lock()`. Releases lock and restores IRQs on drop.
pub struct SpinLockGuard<'a, T> {
    lock: &'a SpinLock<T>,
    /// Saved IRQ flags (RFLAGS on x86, DAIF on AArch64, SSTATUS.SIE on RISC-V, etc.).
    saved_flags: ArchIrqFlags,
}

impl<T> SpinLock<T> {
    /// Acquire the lock: save IRQ state, disable IRQs and preemption, spin-wait.
    /// Returns a guard that provides exclusive access to `T`.
    pub fn lock(&self) -> SpinLockGuard<'_, T>;

    /// Acquire without saving IRQ state. Use this when the caller already holds
    /// an `IrqDisabledGuard` (Section 3.3.1), avoiding a redundant save/restore
    /// pair (~1-3 cycles saved on the fast path).
    ///
    /// # Safety
    /// The caller must hold an `IrqDisabledGuard` for the entire duration of the
    /// returned guard's lifetime. Dropping the `IrqDisabledGuard` while the
    /// `SpinLockGuard` is still alive would re-enable IRQs while the lock is held.
    pub unsafe fn lock_nosave<'a>(
        &'a self,
        _irq: &'a IrqDisabledGuard,
    ) -> SpinLockGuard<'a, T>;
}

impl<T> Deref for SpinLockGuard<'_, T> {
    type Target = T;
}

impl<T> DerefMut for SpinLockGuard<'_, T> {}

impl<T> Drop for SpinLockGuard<'_, T> {
    fn drop(&mut self) {
        // Release the RawSpinLock, then restore IRQ state from saved_flags.
        // Order matters: release spin first, then re-enable IRQs. If IRQs
        // were re-enabled while the lock was still held, an interrupt handler
        // on the same CPU could try to acquire the same lock, causing a
        // single-CPU deadlock. Releasing the lock first prevents this.
    }
}

lock_nosave and IrqDisabledGuard interaction: SpinLock::lock() internally performs irq_save() + raw_spin_lock(). If the caller already holds an IrqDisabledGuard (from PerCpu::get_mut_nosave(), see Section 3.3.1), use lock_nosave() to skip the redundant save/restore. This mirrors the get_mut_nosave() pattern and saves ~1-3 cycles on the fast path.
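The guard-through-which-all-access-flows pattern can be illustrated with a minimal user-space sketch. This is a toy model (std atomics, a plain spin loop, no IRQ flags, no preemption counting, no per-arch lock algorithm); `ToySpinLock`/`ToyGuard` are illustrative names, not the kernel API:

```rust
use std::cell::UnsafeCell;
use std::ops::{Deref, DerefMut};
use std::sync::atomic::{AtomicBool, Ordering};

// Toy spinlock: the data is reachable only through the guard,
// so the lock is provably held whenever `T` is accessed.
pub struct ToySpinLock<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

// SAFETY: access to `data` is serialized by `locked`.
unsafe impl<T: Send> Sync for ToySpinLock<T> {}

pub struct ToyGuard<'a, T> {
    lock: &'a ToySpinLock<T>,
}

impl<T> ToySpinLock<T> {
    pub const fn new(value: T) -> Self {
        Self { locked: AtomicBool::new(false), data: UnsafeCell::new(value) }
    }

    pub fn lock(&self) -> ToyGuard<'_, T> {
        // Acquire pairs with the Release store in Drop.
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        ToyGuard { lock: self }
    }
}

impl<T> Deref for ToyGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T { unsafe { &*self.lock.data.get() } }
}

impl<T> DerefMut for ToyGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T { unsafe { &mut *self.lock.data.get() } }
}

impl<T> Drop for ToyGuard<'_, T> {
    fn drop(&mut self) {
        // Release the lock. In the kernel version, IRQ state is restored
        // only AFTER this release store (see SpinLockGuard::drop ordering).
        self.lock.locked.store(false, Ordering::Release);
    }
}

fn main() {
    static COUNTER: ToySpinLock<u64> = ToySpinLock::new(0);
    let handles: Vec<_> = (0..4)
        .map(|_| std::thread::spawn(|| for _ in 0..10_000 { *COUNTER.lock() += 1 }))
        .collect();
    for h in handles { h.join().unwrap(); }
    assert_eq!(*COUNTER.lock(), 40_000);
    println!("final count = {}", *COUNTER.lock());
}
```

The point of the sketch is the type discipline, not the spin loop: `T` has no public path except `ToyGuard`, so "forgot to take the lock" is a compile error rather than a data race.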

3.5.1.1.3 MMIO Barrier Batching (io_sync flag — PPC64LE)

On PPC64LE, every MMIO write requires a preceding sync instruction (~100+ cycles on POWER9) to order the store with respect to cacheable memory. When a driver performs multiple MMIO writes within a single critical section (common for NIC ring doorbell sequences), each MMIO write would pay the full sync cost independently.

UmkaOS batches these barriers using a per-CPU io_sync flag in the CpuLocal block:

/// Per-CPU MMIO barrier deferred-sync flag.
/// PPC64LE only; zero-sized on all other architectures.
pub struct IoSyncFlag {
    /// Set to `true` by `mmio_write_*()`. Checked and cleared by
    /// `SpinLock::unlock()`. When set, unlock issues `sync` before
    /// the store-release that releases the lock.
    #[cfg(target_arch = "powerpc64")]
    pending: Cell<bool>,
}

Protocol:

  1. mmio_write_32(addr, val) on PPC64LE: perform the store (stw val, 0(addr)) with no eager sync, then set CpuLocal::io_sync.pending = true. The barrier is deferred to step 2 or 3.
  2. SpinLock::unlock() on PPC64LE: if io_sync.pending is true, emit sync before the lock release store and clear the flag. This ensures all MMIO writes within the critical section are ordered before the lock release.
  3. mmiowb() (explicit MMIO write barrier): emit sync and clear io_sync.pending. Used at the end of a critical section where MMIO writes must be visible to the device before the lock is released to another CPU.

Cost saving: If a critical section performs N MMIO writes, the naive approach issues N sync instructions. With io_sync, only one sync is issued (in the unlock path), saving (N-1) × ~100 cycles. This is the same optimization Linux implements via paca->io_sync on PowerPC.
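The saving can be modeled in user-space Rust with a counter standing in for `sync` emissions. This is a toy accounting model only (no real barriers or MMIO); `Mmio`, `emit_sync`, and the method names are illustrative:

```rust
use std::cell::Cell;

/// Toy model of the io_sync protocol: count how many "sync" barriers
/// each strategy emits for N MMIO writes in one critical section.
struct Mmio {
    syncs_emitted: Cell<u32>,
    pending: Cell<bool>, // models CpuLocal::io_sync.pending
}

impl Mmio {
    fn new() -> Self {
        Self { syncs_emitted: Cell::new(0), pending: Cell::new(false) }
    }

    fn emit_sync(&self) {
        self.syncs_emitted.set(self.syncs_emitted.get() + 1);
    }

    /// Naive: every MMIO write pays a full barrier.
    fn write_naive(&self) {
        self.emit_sync();
        /* stw val, 0(addr) */
    }

    /// Batched: the store is performed, the barrier is deferred.
    fn write_batched(&self) {
        /* stw val, 0(addr) */
        self.pending.set(true);
    }

    /// Unlock path: flush the deferred barrier once, if any writes happened.
    fn unlock(&self) {
        if self.pending.replace(false) {
            self.emit_sync();
        }
        /* store-release of the lock word */
    }
}

fn main() {
    let n: u32 = 8;

    let naive = Mmio::new();
    for _ in 0..n { naive.write_naive(); }
    naive.unlock();
    assert_eq!(naive.syncs_emitted.get(), n); // N syncs

    let batched = Mmio::new();
    for _ in 0..n { batched.write_batched(); }
    batched.unlock();
    assert_eq!(batched.syncs_emitted.get(), 1); // one sync, at unlock

    println!("naive: {} syncs, batched: 1 sync", n);
}
```

At ~100 cycles per `sync` on POWER9, the 8-write section above drops from ~800 to ~100 cycles of barrier cost.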

Other architectures: On x86-64, MMIO writes are naturally ordered by TSO. On AArch64, dsb st provides the MMIO write barrier at lower cost (~20-40 cycles). The io_sync flag is a zero-sized type on non-PPC architectures and the unlock check compiles away.

3.5.1.1.4 IntrusiveList<T> — Intrusive Doubly-Linked List

An intrusive doubly-linked list used throughout UmkaOS for lock-free and low-overhead queues. Elements embed a link field (IntrusiveLink<T>) rather than allocating separate list nodes. This avoids heap allocation and improves cache locality.

UmkaOS vs Linux list_head: Linux embeds raw prev/next pointers in each element with no ownership or membership tracking. An element can silently be on multiple lists simultaneously, or removed from the wrong list — a common source of kernel bugs. UmkaOS's design uses a sealed HasIntrusiveLink<T> trait with type-system-guided single-link-per-type enforcement: each element type embeds at most one IntrusiveLink<T>, so it can participate in at most one list at a time via that link. Runtime double-insertion detection is provided by a debug-only in_list: bool flag in IntrusiveLink (checked on push, panics on double-insert in debug builds). Pinning (Pin<&mut T>) prevents moving an element while it is linked, which would corrupt the list.

/// An intrusive doubly-linked list. Elements must embed an `IntrusiveLink<T>`
/// field to participate in the list.
///
/// **UmkaOS vs Linux `list_head`**: Linux embeds raw `prev`/`next` pointers in each
/// element with no ownership tracking. This allows elements to be on multiple lists
/// simultaneously (a common source of bugs: removing from the wrong list). UmkaOS's
/// design enforces single-link-per-type membership via the owned `IntrusiveLink`.
/// A debug-only `in_list` flag in `IntrusiveLink` provides runtime double-insertion
/// detection (panics in debug builds).
///
/// **Pinning**: Elements must be `Pin`ned before they can be inserted. Moving an element
/// while it is linked would corrupt the list. The `Pin<&mut T>` API at insertion ensures
/// the element address is stable for the element's lifetime in the list.
///
/// **Performance**: Equivalent to `list_head` — O(1) insert/remove, O(n) iteration.
/// No heap allocation; all pointers live in the embedded `IntrusiveLink` fields.
pub struct IntrusiveList<T: HasIntrusiveLink<T>> {
    /// Sentinel node. `head.next` points to the first element; `head.prev` points to
    /// the last element. An empty list has `head.next == head.prev == &head`.
    head: IntrusiveLink<T>,
    /// Number of elements currently in the list. O(1) cached count.
    len: usize,
}

/// The link field that must be embedded in each element type `T` that participates
/// in an `IntrusiveList<T>`. Each element type embeds exactly ONE `IntrusiveLink<T>`:
/// the sealed `HasIntrusiveLink<T>` trait exposes a single `link()` accessor, so an
/// element can be a member of at most one list at a time. A type that must live on
/// two lists simultaneously should use separate wrapper types, one link per wrapper.
pub struct IntrusiveLink<T> {
    /// Next element in the list (or &sentinel if this is the tail).
    next: *mut IntrusiveLink<T>,
    /// Previous element in the list (or &sentinel if this is the head).
    prev: *mut IntrusiveLink<T>,
    /// Zero-size marker; `T` is the owning type that contains this link.
    _marker: PhantomData<T>,
    /// Debug-only double-insertion guard. Set to `true` when inserted into
    /// a list, cleared on removal. In debug builds, `insert_*()` methods
    /// assert `!in_list` before linking. In release builds, this field is
    /// compiled out (`#[cfg(debug_assertions)]`) — zero overhead.
    ///
    /// **Release-build consequence**: Without this guard, double-insertion
    /// in release builds creates two lists sharing the same link node.
    /// Removing from either list corrupts the other (next/prev pointers
    /// updated for the wrong list). This is the intentional trade-off:
    /// debug builds catch the programming error; release builds omit
    /// the per-node overhead (1 byte + alignment padding per link).
    /// The debug assertion is the primary defense; correct callers never
    /// trigger the condition in production.
    #[cfg(debug_assertions)]
    in_list: bool,
}

/// Marker trait: `T` has an `IntrusiveLink<T>` field accessible via `link()`.
/// # Safety
/// The returned `IntrusiveLink` must be embedded in `self` at a stable address
/// (i.e., `self` must be pinned). Moving `self` while linked is undefined behavior.
pub unsafe trait HasIntrusiveLink<T> {
    /// Return a pointer to the embedded link field.
    fn link(this: *mut T) -> *mut IntrusiveLink<T>;

    /// Recover the owning `T*` from a `*mut IntrusiveLink<T>` (via `offsetof`).
    /// # Safety: `link` must point to the link field of a valid `T`.
    unsafe fn from_link(link: *mut IntrusiveLink<T>) -> *mut T;
}

impl<T: HasIntrusiveLink<T>> IntrusiveList<T> {
    /// Create an empty list. The sentinel is self-referential.
    pub fn new() -> Self { ... }

    /// Number of elements in the list.
    pub fn len(&self) -> usize { self.len }

    /// True if the list has no elements.
    pub fn is_empty(&self) -> bool { self.len == 0 }

    /// Insert `elem` at the back of the list. `elem` must be pinned.
    /// # Safety: `elem` must not already be in any IntrusiveList.
    pub unsafe fn push_back(&mut self, elem: *mut T) { ... }

    /// Insert `elem` at the front of the list.
    /// # Safety: same as `push_back`.
    pub unsafe fn push_front(&mut self, elem: *mut T) { ... }

    /// Remove and return the front element, or `None` if empty.
    pub fn pop_front(&mut self) -> Option<*mut T> { ... }

    /// Remove and return the back element, or `None` if empty.
    pub fn pop_back(&mut self) -> Option<*mut T> { ... }

    /// Remove `elem` from the list (it must currently be in this list).
    /// # Safety: `elem` must be in this list.
    pub unsafe fn remove(&mut self, elem: *mut T) { ... }

    /// Iterate over elements in order (front to back).
    pub fn iter(&self) -> IntrusiveListIter<'_, T> { ... }

    /// Move all elements from `other` to the back of `self`. O(1).
    pub fn splice_back(&mut self, other: &mut Self) { ... }
}

/// Iterator over an `IntrusiveList`. Elements are yielded as raw pointers.
pub struct IntrusiveListIter<'a, T: HasIntrusiveLink<T>> {
    current: *mut IntrusiveLink<T>,
    sentinel: *const IntrusiveLink<T>,
    _lifetime: PhantomData<&'a T>,
}

/// Type alias: a bare link node used when the containing type is opaque or the
/// list is parameterized externally. Used by `WaitQueueEntry` and `MutexWaiter`.
pub type IntrusiveListNode = IntrusiveLink<()>;

/// Type alias: the sentinel head of an intrusive list where the element type is
/// managed externally (e.g., `WaitQueueHead` which holds a list of `WaitQueueEntry`).
pub type IntrusiveListHead = IntrusiveList<()>;
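The `link()`/`from_link()` pair is classic container_of pointer arithmetic: add the field offset to go element-to-link, subtract it to go link-to-element. A minimal user-space sketch, with illustrative names (`Waiter`, `Link`) standing in for a concrete `HasIntrusiveLink` implementor:

```rust
use std::ptr;

#[repr(C)]
struct Link {
    next: *mut Link,
    prev: *mut Link,
}

#[repr(C)]
struct Waiter {
    task_id: u64,
    link: Link, // the embedded IntrusiveLink-style field
}

impl Waiter {
    /// `link()` direction: element pointer -> embedded link field.
    fn link_of(this: *mut Waiter) -> *mut Link {
        unsafe { ptr::addr_of_mut!((*this).link) }
    }

    /// `from_link()` direction: link field -> owning element, by
    /// subtracting the link field's byte offset (container_of).
    unsafe fn from_link(link: *mut Link) -> *mut Waiter {
        let offset = std::mem::offset_of!(Waiter, link);
        (link as usize - offset) as *mut Waiter
    }
}

fn main() {
    let mut w = Waiter {
        task_id: 42,
        link: Link { next: ptr::null_mut(), prev: ptr::null_mut() },
    };
    let l = Waiter::link_of(&mut w);
    // Round trip: element -> link -> element recovers the original address.
    let recovered = unsafe { Waiter::from_link(l) };
    assert_eq!(recovered as *const Waiter, &w as *const Waiter);
    assert_eq!(unsafe { (*recovered).task_id }, 42);
    println!("recovered task_id = {}", unsafe { (*recovered).task_id });
}
```

This is why `from_link()` is `unsafe`: the arithmetic is only valid if the pointer really is the `link` field of a live `Waiter`, which nothing in the type system can check.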
3.5.1.1.5 Mutex<T>

A sleeping lock. The caller blocks (is descheduled) if the lock is contended. Must not be acquired from interrupt context or while holding a SpinLock or RawSpinLock: sleeping while a spinlock is held deschedules the lock holder, leaving other CPUs spinning on a lock that cannot be released until the holder runs again. Equivalent to Linux struct mutex.

/// Sleeping mutual exclusion lock. Blocks (deschedules) on contention.
///
/// # Contexts
/// - Safe to acquire from normal kernel task context.
/// - MUST NOT be acquired from interrupt context (hardirq, softirq, NMI).
/// - MUST NOT be acquired while holding a `SpinLock` or `RawSpinLock`.
///   Doing so would put the spinlock holder to sleep — the deadlock detector
///   ([Section 3.4](#cumulative-performance-budget), lock level ordering) treats Mutex as higher-level than
///   SpinLock in the lock hierarchy.
pub struct Mutex<T> {
    /// Lock state: 0 = unlocked, 1 = locked (no waiters), 2 = locked (waiters present).
    state: AtomicU32,
    /// List of tasks blocked waiting for this mutex. Protected by `waiters_lock`.
    /// Uses `IntrusiveListHead` (= `IntrusiveList<()>`) to match the `<()>`
    /// parameterization of `MutexWaiter.node: IntrusiveListNode` (= `IntrusiveLink<()>`).
    /// The containing type (`MutexWaiter`) is recovered via `container_of`-style
    /// `MutexWaiter::from_link()` — same pattern as `WaitQueueHead`.
    waiters: IntrusiveListHead,
    /// Spinlock protecting the `waiters` list. Held only briefly during enqueue/dequeue.
    waiters_lock: RawSpinLock,
    data: UnsafeCell<T>,
}

/// Task node embedded in `Mutex::waiters` when a task is blocked.
pub struct MutexWaiter {
    task: *mut Task,
    node: IntrusiveListNode,
}

/// Guard returned by `Mutex::lock()`. Releases lock on drop.
pub struct MutexGuard<'a, T> {
    mutex: &'a Mutex<T>,
}

impl<T> Mutex<T> {
    /// Acquire the lock. Blocks if contended. Returns when the lock is held exclusively.
    pub fn lock(&self) -> MutexGuard<'_, T>;

    /// Non-blocking acquire. Returns `Some(guard)` if acquired, `None` if contended.
    pub fn try_lock(&self) -> Option<MutexGuard<'_, T>>;
}

impl<T> Deref for MutexGuard<'_, T> {
    type Target = T;
}

impl<T> DerefMut for MutexGuard<'_, T> {}

impl<T> Drop for MutexGuard<'_, T> {
    fn drop(&mut self) {
        // Fast path: CAS state 1 → 0 (no waiters).
        if self.mutex.state
            .compare_exchange(1, 0, Ordering::Release, Ordering::Relaxed)
            .is_ok()
        {
            return;
        }
        // Slow path: state == 2 (waiters present).
        // 1. Acquire waiters_lock (lock level 25, below RQ_LOCK(50)).
        // SAFETY: IRQ state managed by the caller; waiters_lock is held briefly.
        unsafe { self.mutex.waiters_lock.lock() };
        // 2. Dequeue the first waiter from the list.
        //    `pop_front()` returns `Option<*mut ()>` (the IntrusiveListNode pointer).
        //    Recover the containing `MutexWaiter` via `from_link()` (container_of).
        let waiter_opt = self.mutex.waiters.pop_front();
        // 3. Update state while still holding waiters_lock:
        //    - If list is now empty: state = 0 (no waiters, unlocked).
        //    - If list still has waiters: leave state at 2 (waiters present,
        //      but lock is now logically available for the woken task).
        if self.mutex.waiters.is_empty() {
            self.mutex.state.store(0, Ordering::Release);
        }
        // State remains 2 if waiters exist — the woken task will CAS 2→1
        // (or 2→2 if more waiters arrive concurrently) on wakeup.
        // 4. Release waiters_lock BEFORE calling scheduler::unblock().
        //    This avoids holding waiters_lock across scheduler code (which
        //    acquires RQ_LOCK at level 50 — higher than waiters_lock at 25).
        unsafe { self.mutex.waiters_lock.unlock() };
        // 5. Wake the dequeued waiter OUTSIDE waiters_lock.
        //    The woken task's lock() path handles the race where the lock
        //    has been re-acquired by another task between steps 4 and 5
        //    (CAS loop on state in the lock path).
        if let Some(node) = waiter_opt {
            // SAFETY: `node` was dequeued from a list of `MutexWaiter` entries.
            // `from_link()` recovers the containing `MutexWaiter` from the
            // embedded `IntrusiveListNode` via offset arithmetic (container_of).
            // The MutexWaiter is stack-allocated by the waiting task and remains
            // valid until the task returns from `Mutex::lock()`.
            let waiter = unsafe { MutexWaiter::from_link(node) };
            // SAFETY: waiter was dequeued while waiters_lock was held;
            // the task pointer is valid until the task is freed.
            unsafe { scheduler::unblock((*waiter).task) };
        }
    }
}
3.5.1.1.6 RwLock<T>

A sleeping reader-writer lock. Multiple concurrent readers OR one exclusive writer. New readers are blocked when a writer is waiting (writer preference — prevents writer starvation). Must not be acquired from interrupt context.

/// Sleeping reader-writer lock. Many readers OR one writer. Writer-preferring.
///
/// # Contexts
/// Same restrictions as `Mutex<T>`: task context only, not under a SpinLock.
///
/// # Writer preference
/// When a writer is waiting, new `read_lock()` calls block. This prevents writer
/// starvation at the cost of reduced read throughput under write pressure.
pub struct RwLock<T> {
    /// Packed state word:
    ///   bit 31 = WRITER_HELD (a writer currently holds the lock)
    ///   bit 30 = WRITER_WAITING (at least one writer is enqueued in `waiters`)
    ///   bits [29:0] = active reader count
    ///
    /// Readers check `state & (WRITER_HELD | WRITER_WAITING)` atomically before
    /// incrementing the reader count. If either bit is set, the reader takes the
    /// slow path (enqueues on the waiters list and sleeps). This provides writer
    /// preference without requiring readers to acquire `waiters_lock` on the fast
    /// path — the WRITER_WAITING bit is set/cleared by writers under `waiters_lock`
    /// and is visible to readers via the atomic state word.
    ///
    /// Reader count limit: 2^30 - 1 (1,073,741,823). Safe because reader count
    /// is bounded by the number of schedulable tasks (max ~64K per NUMA node in
    /// practice). Exceeding this wraps into the writer-waiting bit, causing
    /// incorrect lock behavior. `mem::forget` on `RwLockReadGuard` leaks the
    /// reader count — this is a programming error, not an operational risk.
    /// Debug builds: assert!(reader_count < (1 << 30)) in read() acquisition.
    ///
    /// **Longevity analysis**: 2^30 overflow requires ~1.07 billion leaked read guards.
    /// At one leaked guard per second (sustained `mem::forget` bug), overflow occurs
/// after ~34 years. This is a programming error scenario, not an operational risk —
    /// the 50-year uptime target applies to correct programs. The debug assertion
    /// catches the leak during development. A saturating check in release builds
    /// (one compare-and-branch in `read()`) is a viable future hardening option but
    /// is not included in the initial design due to cost on a sleeping-lock fast path.
    /// Exempt from u64 widening: this is a bounded refcount (not a monotonic identifier).
    state: AtomicU32,
    /// Queued readers and writers. Each entry carries a `WaiterKind` tag.
    waiters: IntrusiveList<RwWaiter>,
    /// Protects the `waiters` list.
    waiters_lock: RawSpinLock,
    data: UnsafeCell<T>,
}

/// RwLock state word bit layout constants.
const RWLOCK_WRITER_HELD:    u32 = 1 << 31;
const RWLOCK_WRITER_WAITING: u32 = 1 << 30;
const RWLOCK_READER_MASK:    u32 = (1 << 30) - 1;

pub enum WaiterKind { Reader, Writer }

pub struct RwWaiter {
    kind: WaiterKind,
    task: *mut Task,
    node: IntrusiveListNode,
}

/// Guard for shared read access. Multiple `RwLockReadGuard`s can coexist.
pub struct RwLockReadGuard<'a, T> {
    lock: &'a RwLock<T>,
}

/// Guard for exclusive write access. Uniquely held.
pub struct RwLockWriteGuard<'a, T> {
    lock: &'a RwLock<T>,
}

impl<T> RwLock<T> {
    /// Acquire shared read access. Blocks if a writer holds or is waiting for the lock.
    pub fn read(&self) -> RwLockReadGuard<'_, T>;

    /// Acquire exclusive write access. Blocks until all readers and the current
    /// writer (if any) release the lock.
    pub fn write(&self) -> RwLockWriteGuard<'_, T>;

    /// Non-blocking read acquire.
    pub fn try_read(&self) -> Option<RwLockReadGuard<'_, T>>;

    /// Non-blocking write acquire.
    pub fn try_write(&self) -> Option<RwLockWriteGuard<'_, T>>;
}
3.5.1.1.7 Condvar

A condition variable for Mutex-guarded condition waiting. Wraps a WaitQueueHead and provides atomic mutex-release-and-sleep semantics that prevent lost-wakeup races. Equivalent to Linux's wait_event pattern but paired with Mutex<T> rather than a spinlock-protected wait queue.

/// Condition variable paired with `Mutex<T>`. Allows a thread to atomically
/// release a mutex and sleep until a condition becomes true.
///
/// Built on `WaitQueueHead` ([Section 3.6](#lock-free-data-structures--waitqueuehead--blocking-wait-queue)):
/// the condition variable IS a WaitQueueHead with Mutex-aware wait/notify
/// methods. The underlying WaitQueueHead's spinlock-protected waiter list
/// provides the atomicity guarantee between mutex release and sleep.
///
/// # Contexts
/// - `wait()` and `wait_interruptible()` MUST NOT be called from interrupt
///   context (they sleep). Same restrictions as `Mutex<T>`.
/// - `notify_one()` and `notify_all()` may be called from any context
///   (including interrupt context), same as `WaitQueueHead::wake_up()`.
pub struct Condvar {
    wq: WaitQueueHead,
}

impl Condvar {
    /// Create a new condition variable.
    pub const fn new() -> Self {
        Self { wq: WaitQueueHead::new() }
    }

    /// Release the mutex, sleep until notified, then re-acquire the mutex.
    ///
    /// Protocol:
    /// 1. Enqueue current task on `self.wq` as an exclusive waiter.
    /// 2. Drop `guard` (releases the mutex — Mutex::state transitions).
    /// 3. `schedule()` — task sleeps.
    /// 4. On wakeup: re-acquire the mutex via `mutex.lock()`.
    /// 5. Return the new `MutexGuard`.
    ///
    /// The enqueue (step 1) happens while the mutex is still held, and the
    /// WaitQueueHead's internal spinlock is held during the enqueue+state-change
    /// sequence. This prevents the lost-wakeup race: if `notify_one()` is called
    /// between steps 1 and 2, the waiter is already enqueued and will be woken.
    pub fn wait<'a, T>(&self, guard: MutexGuard<'a, T>) -> MutexGuard<'a, T> {
        let mutex = guard.mutex;
        // Protocol:
        // 1. Lock self.wq.lock (WQ internal spinlock).
        // 2. Enqueue current task as exclusive waiter on self.wq.
        // 3. Set task state to TASK_UNINTERRUPTIBLE.
        // 4. Unlock self.wq.lock.
        // 5. Drop guard → releases Mutex (MutexGuard::drop transitions state).
        //    The WQ enqueue (step 2) happens BEFORE mutex release (step 5),
        //    so a concurrent notify_one() between steps 4 and 5 will find
        //    our waiter on the queue and set us TASK_RUNNING before we sleep.
        // 6. schedule() → suspend until notify_one/notify_all wakes us.
        // 7. On wakeup: re-acquire mutex via mutex.lock().
        //
        // NOTE: Unlike wait_event(), Condvar::wait() does NOT take a condition
        // closure. The condition check is the CALLER's responsibility after
        // re-acquiring the mutex. Standard usage pattern:
        //   let mut guard = mutex.lock();
        //   while !condition(&*guard) {
        //       guard = condvar.wait(guard);
        //   }
        //   // condition is true and mutex is held
        //
        // Implementation:
        self.wq.prepare_to_wait_exclusive(); // steps 1-4
        core::mem::drop(guard);               // step 5: release mutex
        schedule();                           // step 6: sleep
        finish_wait(&self.wq);                // remove from WQ if not already
        mutex.lock()                          // step 7: re-acquire mutex
    }

    /// Like `wait()`, but returns `Err(KernelError::Interrupted)` if the
    /// sleeping task receives a signal. The mutex is always re-acquired
    /// before returning (even on signal interruption).
    pub fn wait_interruptible<'a, T>(
        &self,
        guard: MutexGuard<'a, T>,
    ) -> Result<MutexGuard<'a, T>, KernelError> {
        let mutex = guard.mutex;
        // Same protocol as wait() but with TASK_INTERRUPTIBLE.
        self.wq.prepare_to_wait_exclusive_interruptible();
        core::mem::drop(guard);
        schedule();
        let interrupted = signal_pending();
        finish_wait(&self.wq);
        let new_guard = mutex.lock();
        if interrupted {
            Err(KernelError::Interrupted)
        } else {
            Ok(new_guard)
        }
    }

    /// Wake one waiter (the highest-priority exclusive waiter).
    pub fn notify_one(&self) {
        self.wq.wake_up_one();
    }

    /// Wake all waiters.
    pub fn notify_all(&self) {
        self.wq.wake_up_all();
    }
}
3.5.1.1.8 Lock Hierarchy Summary

The four types form a strict containment hierarchy. A thread may hold a higher-level lock while acquiring a lower-level one, but not the reverse:

| Type | Sleeps? | IRQ-safe? | Can hold while acquiring... |
|------|---------|-----------|------------------------------|
| RawSpinLock | No | Caller-managed | (nothing lower) |
| SpinLock<T> | No | Yes (saves/restores) | RawSpinLock |
| Mutex<T> | Yes | No (task context only) | RawSpinLock, SpinLock<T> |
| RwLock<T> | Yes | No (task context only) | RawSpinLock, SpinLock<T> |

Critical constraint: holding a Mutex or RwLock guard and then acquiring a SpinLock is permitted (Mutex is at a higher conceptual level and does not spin). The reverse — acquiring a SpinLock and then calling Mutex::lock() — would put the spinlock holder to sleep, violating the spinlock contract. The const-generic lock level system (Section 3.4) enforces this at compile time by assigning SpinLock-backed locks to lower numeric levels than Mutex-backed locks.

Lock hierarchy migration protocol: When a subsystem needs to change which lock protects a data structure (e.g., replacing a SpinLock with a Mutex, or splitting a coarse lock into per-object fine-grained locks), the following protocol applies:

  1. Both old and new locks co-exist during the transition. The new lock is introduced alongside the old one; all readers acquire both (old, then new).
  2. Writers are migrated one call site at a time. Each call site is updated to acquire the new lock instead of the old; the old lock acquisition is removed.
  3. Once all call sites use the new lock, the old lock is removed in a separate commit. This ensures bisectability.
  4. During the co-existence period, the lock ordering rule is: the old lock must always be acquired before the new lock (never reversed), to prevent deadlocks.
  5. The const-generic lock level for the new lock must be assigned a level that does not conflict with existing hierarchies. The lock_level_check! macro validates this at compile time.

3.5.2 Preemption and Interrupt Context Model

UmkaOS uses separate per-CPU fields for preemption depth and interrupt context, stored in CpuLocalBlock (see Section 3.2). This differs from Linux's packed preempt_count u32 (which encodes preemption depth, softirq count, hardirq count, and NMI state in a single word with bit-field packing). UmkaOS uses separate fields for clarity and type safety:

/// In CpuLocalBlock (see cpulocal section for full struct):
///
/// preempt_count: u32  — preemption-disable nesting depth only.
///                        Incremented by preempt_disable() / SpinLock::lock().
///                        Decremented by preempt_enable() / SpinLock::unlock().
///
/// irq_count: u32      — hardirq nesting count.
///                        Incremented by irq_enter().
///                        Decremented by irq_exit().
///
/// softirq_count: u32  — softirq (bottom-half) nesting count.
///                        Incremented by local_bh_disable().
///                        Decremented by local_bh_enable().
///
/// need_resched: AtomicBool — set by IPI/scheduler tick when a reschedule is pending.
///                        Checked on preempt_enable() and return-from-interrupt.
///                        AtomicBool because IPIs write from remote CPUs (see §3.1.2).

The preemptibility check is preempt_count == 0 && irq_count == 0 && softirq_count == 0 (three loads, all from the same cache line in CpuLocalBlock). The in_interrupt() check is irq_count > 0 || softirq_count > 0. The in_softirq() check is softirq_count > 0.

Why not Linux's packed format? Linux packs everything into one u32 so that the return-from-interrupt fast path can test preempt_count == 0 as a single branch. UmkaOS trades that single-compare trick for separate typed fields that are easier to reason about, debug, and extend. The three-field check is still a single cache-line access and adds at most two extra compare instructions — negligible on modern out-of-order CPUs.

BPF compatibility: BPF programs and tracing tools that read Linux's packed preempt_count are handled by the BPF helper layer, which synthesizes the expected packed format from the separate fields when accessed via bpf_get_preempt_count().

NMI tracking: NMI state is tracked via a separate in_nmi: AtomicBool field in CpuLocalBlock (not bit-packed). It is written on NMI entry/exit and read with relaxed ordering (see in_nmi() below). At most one NMI can be active per CPU (non-nestable on all supported architectures).

/// Query helpers — read per-CPU fields from CpuLocalBlock.

/// True if executing in any interrupt context (hardirq or softirq).
/// Used by sleeping-lock debug assertions (sleeping in interrupt = bug).
/// Matches Linux's `in_interrupt()` which checks both HARDIRQ_MASK and
/// SOFTIRQ_MASK bits in the packed preempt_count.
#[inline(always)]
pub fn in_interrupt() -> bool {
    let cl = cpu_local();
    cl.irq_count > 0 || cl.softirq_count > 0
}

/// True if preemption is currently enabled (preempt depth == 0 and
/// not in any interrupt context). Does NOT check need_resched.
/// All three fields must be zero: preempt nesting, hardirq depth, and
/// softirq depth. UmkaOS checks these as separate fields; Linux checks
/// `preempt_count() == 0` where the packed word includes all three.
#[inline(always)]
pub fn preemptible() -> bool {
    let cl = cpu_local();
    cl.preempt_count == 0 && cl.irq_count == 0 && cl.softirq_count == 0
}

/// True if executing in NMI context.
#[inline(always)]
pub fn in_nmi() -> bool {
    cpu_local().in_nmi.load(Relaxed)
}

Interaction with locking primitives:

| Primitive | Effect on CpuLocalBlock fields |
|-----------|--------------------------------|
| RawSpinLock::lock() | preempt_count += 1 |
| RawSpinLock::unlock() | preempt_count -= 1, check need_resched |
| SpinLock::lock() | saves IRQ flags, disables IRQs, preempt_count += 1 |
| SpinLock::unlock() | preempt_count -= 1, restores IRQ flags, checks need_resched |
| irq_enter() | irq_count += 1 |
| irq_exit() | irq_count -= 1, then processes pending softirqs if irq_count == 0 && softirq_count == 0 |
| local_bh_disable() | softirq_count += 1 |
| local_bh_enable() | softirq_count -= 1, then runs pending softirqs if softirq_count == 0 && irq_count == 0 |
| Mutex::lock() / RwLock::read() | no change (these sleep; preemption must be enabled) |

The preempt_enable() decrement is the primary voluntary preemption point: after decrementing, if preempt_count == 0 && irq_count == 0 && softirq_count == 0 and need_resched is set, schedule() is called immediately. This is how preemptive multitasking works in UmkaOS: every spin_unlock() and preempt_enable() is an implicit preemption check.
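A toy single-CPU model of these counters ties the table together (a plain struct stands in for real per-CPU storage; the method names mirror the table, everything else is illustrative):

```rust
/// Toy model of the three CpuLocalBlock context counters.
#[derive(Default)]
struct Ctx {
    preempt_count: u32,
    irq_count: u32,
    softirq_count: u32,
    need_resched: bool,
    schedules: u32, // counts schedule() calls for the demo
}

impl Ctx {
    fn preemptible(&self) -> bool {
        self.preempt_count == 0 && self.irq_count == 0 && self.softirq_count == 0
    }
    fn in_interrupt(&self) -> bool {
        self.irq_count > 0 || self.softirq_count > 0
    }
    fn preempt_disable(&mut self) { self.preempt_count += 1; }
    fn preempt_enable(&mut self) {
        self.preempt_count -= 1;
        // The voluntary preemption point: reschedule only when fully preemptible.
        if self.preemptible() && self.need_resched {
            self.need_resched = false;
            self.schedules += 1; // stands in for schedule()
        }
    }
    fn irq_enter(&mut self) { self.irq_count += 1; }
    fn irq_exit(&mut self) { self.irq_count -= 1; }
}

fn main() {
    let mut c = Ctx::default();
    assert!(c.preemptible() && !c.in_interrupt());

    c.preempt_disable();          // e.g. RawSpinLock::lock()
    c.irq_enter();                // interrupt arrives inside the critical section
    c.need_resched = true;        // scheduler tick requests a reschedule
    assert!(c.in_interrupt() && !c.preemptible());
    c.irq_exit();

    assert_eq!(c.schedules, 0);   // still inside the preempt-disabled region
    c.preempt_enable();           // RawSpinLock::unlock(): preemption point fires
    assert_eq!(c.schedules, 1);
    assert!(c.preemptible());
    println!("preemption point fired exactly once");
}
```

The sequence shows why the reschedule is deferred: the interrupt sets need_resched while preempt_count > 0, and the actual schedule() runs only at the matching preempt_enable().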

3.5.3 Lock Contention Tracking

Lock contention is the primary scalability bottleneck on systems with many cores. UmkaOS provides lock contention instrumentation for debugging and performance analysis, integrated with the tracepoint subsystem (Section 20.2) and FMA (Section 20.1).

Tracepoints — two static tracepoints on every Lock<T, LEVEL>:

/// Emitted when a thread begins waiting for a contended lock.
/// `lock_addr`: address of the lock. `level`: compile-time lock level.
/// `caller`: return address of the `lock()` call site.
tracepoint!(lock_contention_begin, lock_addr: usize, level: u32, caller: usize);

/// Emitted when the thread acquires the lock (contention resolved).
/// `wait_ns`: nanoseconds spent waiting. Zero means the lock was
/// acquired on the first attempt (no contention — tracepoint still
/// fires for consistent tracing, but can be filtered by `wait_ns > 0`).
tracepoint!(lock_contention_end, lock_addr: usize, level: u32, wait_ns: u64);

These tracepoints are compiled in unconditionally but have zero cost when no tracer is attached (static branch, same as all UmkaOS tracepoints). When a tracer is attached, overhead is <1% of the contended lock path (Section 3.4). In production deployments where tracepoint overhead is unacceptable, lock contention tracing can be disabled at runtime via the tracepoint enable/disable mechanism.

Per-lock contention counters (debug builds only, #[cfg(debug_assertions)]):

/// Per-lock debug statistics. Embedded in Lock<T, LEVEL> when debug
/// assertions are enabled. Not present in release builds (zero overhead).
pub struct LockContentionStats {
    /// Number of times this lock was acquired with contention (waited > 0 ns).
    pub contention_count: AtomicU64,
    /// Cumulative wait time in nanoseconds.
    pub total_wait_ns: AtomicU64,
    /// Maximum single wait time in nanoseconds.
    pub max_wait_ns: AtomicU64,
}

Exposed via /sys/kernel/debug/locks/<lock_name>/contention_count, total_wait_ns, max_wait_ns. Reset on read.
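
Updating these counters on a contended acquisition is three relaxed atomic operations. The `record_wait` helper below is an illustrative sketch (the kernel's actual method name is not specified in this chapter):

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

pub struct LockContentionStats {
    pub contention_count: AtomicU64,
    pub total_wait_ns: AtomicU64,
    pub max_wait_ns: AtomicU64,
}

impl LockContentionStats {
    pub const fn new() -> Self {
        Self {
            contention_count: AtomicU64::new(0),
            total_wait_ns: AtomicU64::new(0),
            max_wait_ns: AtomicU64::new(0),
        }
    }

    /// Record one acquisition that waited `wait_ns` nanoseconds.
    /// Uncontended acquisitions (wait_ns == 0) are not counted.
    pub fn record_wait(&self, wait_ns: u64) {
        if wait_ns == 0 {
            return;
        }
        self.contention_count.fetch_add(1, Relaxed);
        self.total_wait_ns.fetch_add(wait_ns, Relaxed);
        // fetch_max keeps the largest single wait in one atomic RMW,
        // with no compare-exchange loop.
        self.max_wait_ns.fetch_max(wait_ns, Relaxed);
    }
}
```

Relaxed ordering suffices throughout: the counters are statistics, not synchronization.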

Runtime lock ordering validation (debug builds only):

In debug builds, a per-CPU held_locks: ArrayVec<HeldLockEntry, 16> stack tracks currently held lock addresses and levels:

/// Debug-only per-CPU lock tracking entry.
/// Stored in a per-CPU ArrayVec (max depth 16 — deepest observed nesting
/// in Linux is ~12; 16 provides headroom). IRQ context pushes onto the
/// same stack: since IRQ handlers acquire locks at levels higher than
/// any non-IRQ lock they nest inside, the ordering check remains valid.
#[cfg(debug_assertions)]
struct HeldLockEntry {
    /// Address of the lock instance (for identification in diagnostics).
    addr: usize,
    /// Lock level (from the const generic LEVEL parameter).
    level: u32,
}

On each lock acquisition, the runtime checker verifies that the new lock's level is strictly greater than that of every currently held lock. A violation triggers BUG() with a diagnostic message showing the lock ordering chain. This is analogous to Linux's lockdep but leverages the compile-time level system — the runtime check catches cases the static checker cannot (e.g., locks acquired through dynamic dispatch).
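
The per-CPU held-lock check can be sketched as follows. This is a simplified userspace model: a plain `Vec` stands in for the per-CPU `ArrayVec<HeldLockEntry, 16>`, and the checker returns `false` where the kernel would call BUG():

```rust
/// Debug-only held-lock tracking, simplified for illustration.
struct HeldLockEntry {
    addr: usize,
    level: u32,
}

struct HeldLocks {
    stack: Vec<HeldLockEntry>,
}

impl HeldLocks {
    fn new() -> Self {
        Self { stack: Vec::new() }
    }

    /// Returns false (standing in for BUG()) if the new lock's level is
    /// not strictly greater than every currently held lock's level.
    fn push_acquire(&mut self, addr: usize, level: u32) -> bool {
        if self.stack.iter().any(|h| h.level >= level) {
            return false; // ordering violation
        }
        self.stack.push(HeldLockEntry { addr, level });
        true
    }

    /// Remove the most recent entry for `addr` on release. Out-of-order
    /// release is permitted (lock ordering constrains acquisition only).
    fn pop_release(&mut self, addr: usize) {
        if let Some(pos) = self.stack.iter().rposition(|h| h.addr == addr) {
            self.stack.remove(pos);
        }
    }
}
```

Equal levels are rejected along with descending ones: two locks at the same level could otherwise be acquired in opposite orders on two CPUs, which is exactly the ABBA deadlock the invariant exists to exclude.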

The lock ordering system has ZERO escape hatches: no lock_read_unchecked(), no compile-time call-site caps, no read-vs-write mode tracking. Every lock acquisition — read or write — must satisfy the strictly-ascending level invariant. This was made possible by replacing INVALIDATE_LOCK (an rwsem that forced a descending-level exception in the page fault path) with InvalidateSeq (a lockless seqcount that requires no lock acquisition at all). See Section 3.4.

Release builds rely solely on the compile-time guarantee.

3.6 Lock-Free Data Structures

Where possible, lock-free data structures replace locked ones:

  • MPSC ring buffers: For all cross-domain communication (io_uring-style)
  • RCU-protected radix trees: For page cache lookups
  • Per-CPU freelists: For slab allocation (no cross-CPU contention)
  • Atomic reference counts: For capability tokens and shared objects
  • Sequence locks (seqlock): For rarely-written, frequently-read data (e.g., system time, mount table snapshot). See SeqLock<T> specification below.
  • IDR (Integer ID Radix tree): For integer-keyed namespaces (PIDs, IPC IDs, file descriptors). See Section 3.6.4 below.

3.6.1 SeqLock<T> — Sequence Lock

A sequence lock optimized for rarely-written, frequently-read data. Readers never block and never modify shared state (no atomic RMW). Writers are mutually excluded by a spinlock. Readers detect concurrent writes via a sequence counter and retry.

Used by: timekeeping (Section 7.8), vDSO data page (Section 2.22), mount table snapshots, DSM coherence metadata (Section 6.5), IMA policy (Section 9.5).

/// Sequence lock for read-optimized, rarely-written shared data.
///
/// # Design
/// - `seq` is an even number when no write is in progress; odd during a write.
/// - Readers snapshot `seq`, copy the data, re-read `seq`. If the two `seq`
///   values differ or either is odd, the read was torn and must be retried.
/// - Writers acquire `lock`, increment `seq` to odd, write data, increment
///   `seq` to even, release `lock`.
///
/// # Constraints
/// - `T: Copy` required: readers copy the entire `T` into a local variable.
///   Non-Copy types would require partial reads that could observe torn state.
/// - Readers MUST NOT hold references into the SeqLock data — only copies.
/// - Writers must not panic while holding the write lock (seq would remain odd,
///   causing all readers to spin forever).
///
/// # Memory Ordering (per architecture)
///
/// | Operation | x86-64 (TSO) | AArch64 / ARMv7 | RISC-V | PPC32/PPC64LE | s390x (near-TSO) | LoongArch64 |
/// |-----------|-------------|-----------------|--------|---------------|------------------|-------------|
/// | Reader: seq load | Plain MOV (compiler barrier) | `LDAR` (load-acquire) | `fence r,r` + load | `lwsync` + load | Plain load (compiler barrier) | `dbar 0x14` (load-load) + load |
/// | Reader: data read | Plain MOV | Plain LDR (after acquire on seq) | Plain load (after fence) | Plain load (after lwsync) | Plain load | Plain load (after dbar) |
/// | Reader: seq re-read | `fence(Acquire)` (no-op on TSO) + plain MOV | `DMB ISHLD` + plain LDR | `fence r,r` + plain load | `lwsync` + plain load | Plain load (no-op — near-TSO) | `dbar 0x14` + plain load |
/// | Writer: seq increment (odd) | Plain MOV + compiler barrier | Plain STR + `DMB ISHST` | store + `fence w,w` | store + `lwsync` | Plain store (compiler barrier) | store + `dbar 0x12` (store-store) |
/// | Writer: data write | Plain MOV | Plain STR | Plain store | Plain store | Plain store | Plain store |
/// | Writer: seq increment (even) | Plain MOV + compiler barrier | `STLR` (store-release) | `fence w,w` + store | `lwsync` + store | Plain store (compiler barrier) | `dbar 0x12` + store |
///
/// On x86-64 and s390x (near-TSO), the hardware guarantees loads are not reordered past
/// loads and stores are not reordered past stores. Compiler barriers
/// (`core::hint::black_box` or `asm!("" ::: "memory")`) suffice; no hardware fence
/// needed. s390x provides sequential-consistency for most operations; only store-load
/// reordering is possible, which does not affect seqlock correctness (the reader's
/// load-load path is naturally ordered). Full serialization when needed uses
/// `BCR 15,0`.
///
/// On LoongArch64, the memory model is weakly ordered (similar to ARM). The `dbar`
/// instruction with hint values controls barrier granularity: `0x14` = load-load
/// barrier, `0x12` = store-store barrier, `0x00` = full barrier.
///
/// **Note**: The dbar hint values `0x14` and `0x12` are Loongson
/// implementation-specific (3A5000/3C5000 series). The LoongArch ISA
/// architecture manual (Volume 1, §2.2.10.1) defines `dbar 0` as a full
/// barrier but leaves partial-barrier hint values as implementation-defined.
/// A future LoongArch CPU from a different vendor may treat unrecognized
/// hint values as `dbar 0` (full barrier). The Rust compiler (via LLVM's
/// LoongArch backend) emits correct barrier instructions for
/// `Ordering::Acquire`/`Release` regardless of these specific hint values,
/// so generated code is always correct. The hint values in the table above
/// document current Loongson hardware behavior for performance analysis,
/// not correctness requirements.
///
/// On the remaining weakly-ordered architectures (ARM, RISC-V, PPC), explicit
/// acquire/release barriers are required to prevent the CPU from reordering data
/// reads/writes past the sequence counter.
pub struct SeqLock<T: Copy> {
    /// Sequence counter. Even = idle, odd = write in progress.
    seq: AtomicU32,
    /// Writer mutual exclusion.
    lock: RawSpinLock,
    /// Protected data.
    data: UnsafeCell<T>,
}

// SAFETY: SeqLock is Sync when T is Send (writers need exclusive access;
// readers only copy T values).
unsafe impl<T: Copy + Send> Sync for SeqLock<T> {}

impl<T: Copy> SeqLock<T> {
    /// Create a new SeqLock with the given initial value.
    pub const fn new(val: T) -> Self {
        Self {
            seq: AtomicU32::new(0),
            lock: RawSpinLock::new(),
            data: UnsafeCell::new(val),
        }
    }

    /// Simple reader: returns a copy of the protected data.
    /// Automatically retries on torn reads. Suitable for single-value
    /// reads where the retry loop is simple.
    ///
    /// # Panics (debug only)
    /// Logs a warning if the retry count exceeds 1000 (indicates a
    /// writer is holding the lock for too long or a livelock).
    pub fn read(&self) -> T {
        loop {
            let s1 = self.read_begin();
            // SAFETY: we will check for torn read via read_retry.
            let val = unsafe { *self.data.get() };
            if !self.read_retry(s1) {
                return val;
            }
            core::hint::spin_loop();
        }
    }

    /// Begin an optimistic read. Returns the current sequence number.
    /// The caller reads the protected data, then calls `read_retry(seq)`
    /// to check for concurrent writes. Use this for multi-field reads
    /// where the caller needs to read multiple related fields consistently.
    ///
    /// ```rust
    /// loop {
    ///     let seq = seqlock.read_begin();
    ///     let (a, b, c) = unsafe { read_fields(&*seqlock.data_ptr()) };
    ///     if !seqlock.read_retry(seq) {
    ///         break (a, b, c);
    ///     }
    /// }
    /// ```
    #[inline(always)]
    pub fn read_begin(&self) -> u32 {
        loop {
            let s = self.seq.load(Ordering::Acquire);
            if s & 1 == 0 {
                return s;
            }
            // Odd = write in progress; spin until even.
            core::hint::spin_loop();
        }
    }

    /// Check if a concurrent write occurred since `read_begin()` returned
    /// `start_seq`. Returns `true` if the read must be retried (torn).
    ///
    /// **Memory ordering**: The `fence(Acquire)` before the `Relaxed` load
    /// matches Linux's `smp_rmb()` semantics exactly. The fence ensures all
    /// preceding data reads (between `read_begin()` and this call) are
    /// ordered before the sequence counter re-read. Without this fence, a
    /// weakly-ordered CPU could reorder the seq re-read before data reads,
    /// observing a clean seq while the data was actually torn.
    ///
    /// On x86-64 and s390x (TSO): `fence(Acquire)` is a no-op (hardware
    /// provides load-load ordering natively). Zero cost.
    /// On AArch64/ARMv7: emits `dmb ishld` (load-load + load-store barrier).
    /// On RISC-V: emits `fence r,r` (load-load barrier).
    /// On PPC32/PPC64LE: emits `lwsync` (load-load + load-store barrier).
    ///
    /// Why not `load(Acquire)`? An Acquire load on the seq counter provides
    /// ordering for loads *after* the load, not *before* it. We need the
    /// reverse: ensure all data loads *before* this point complete before the
    /// seq re-read. `fence(Acquire)` provides this "loads before the fence
    /// complete before loads after the fence" guarantee.
    ///
    /// Why not `SeqCst`? Wastes ~20 cycles on x86-64 (`MFENCE`) for ordering
    /// that is not needed. The seqlock read path only requires load-load
    /// ordering, not a full store-load fence.
    ///
    /// Why not `compiler_fence`? Unsound on 6/8 architectures. A compiler
    /// fence prevents the compiler from reordering, but the CPU can still
    /// reorder loads across it on ARM, RISC-V, PPC, and LoongArch.
    #[inline(always)]
    pub fn read_retry(&self, start_seq: u32) -> bool {
        core::sync::atomic::fence(Ordering::Acquire);
        let s = self.seq.load(Ordering::Relaxed);
        s != start_seq
    }

    /// Acquire write access. Mutually exclusive with other writers.
    /// Preemption is disabled for the duration (RawSpinLock).
    ///
    /// # Safety
    /// Caller must ensure IRQ state is appropriate (see RawSpinLock safety).
    pub unsafe fn write_lock(&self) {
        self.lock.lock();
        // Increment seq to odd (write in progress).
        // Single-writer (RawSpinLock ensures exclusion), so plain load+store
        // suffices — saves ~8-18 cycles vs atomic RMW (LOCK XADD on x86).
        let s = self.seq.load(Ordering::Relaxed);
        self.seq.store(s + 1, Ordering::Relaxed);
        // smp_wmb equivalent: the odd seq value must become visible before
        // any subsequent data write. A Release *store* orders the wrong
        // direction here (it orders accesses *before* the store, not the
        // data writes that follow it), so a Release fence after the store
        // is required to keep the seq increment from being reordered past
        // the data writes.
        core::sync::atomic::fence(Ordering::Release);
    }

    /// Release write access. Increments seq to even (write complete).
    ///
    /// # Safety
    /// Must be called by the same CPU that called `write_lock()`.
    pub unsafe fn write_unlock(&self) {
        // Release ordering ensures all data writes are visible before
        // the seq increment that signals "write complete" to readers.
        // Single-writer: plain load+store, not atomic RMW.
        let s = self.seq.load(Ordering::Relaxed);
        self.seq.store(s + 1, Ordering::Release);
        self.lock.unlock();
    }

    /// Write a new value. Convenience wrapper around write_lock/write_unlock.
    ///
    /// # Safety
    /// See `write_lock()`.
    pub unsafe fn write(&self, val: T) {
        self.write_lock();
        *self.data.get() = val;
        self.write_unlock();
    }

    /// Raw pointer to the data (for multi-field reads in read_begin/read_retry
    /// patterns). The pointer is valid for the lifetime of the SeqLock.
    /// MUST only be dereferenced between read_begin() and read_retry() calls.
    pub fn data_ptr(&self) -> *const T {
        self.data.get()
    }
}

Longevity analysis for seq: AtomicU32: The sequence counter increments by 2 per write (once to odd, once to even). At 1 billion writes per second (extreme), wrap occurs in ~4.3 seconds — but wrap is safe. The read_retry check uses equality (s != start_seq), which correctly detects writes that wrap the counter. A reader preempted between read_begin() and read_retry() for exactly 2^31 write cycles would observe a false-clean read; at typical write rates (timekeeping at 1 kHz), that requires ~25 days of continuous preemption — impractical, and the consequence would be functionally harmless (the reader sees the last-written value, which is the most recent available regardless of wrap). SeqLock readers are fully preemptible (no PreemptGuard requirement); the safety argument rests on the impracticality of 2^31 writes during a single preemption window, not on preemption being disabled. The u32 seq is retained for Linux vDSO ABI compatibility (the vDSO exposes a u32 seq to userspace). Kernel-internal SeqLock paths with unbounded preemption latency (e.g., SIGSTOP) are acceptable because a stale-but-consistent read is no worse than the stale time value the reader would get anyway.
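
The reader protocol and the wrap-safety argument can both be exercised in a single-threaded userspace model. `MiniSeq` below is illustrative — it models only the sequence counter, not the protected data or the writer spinlock:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Illustrative userspace model of just the sequence-counter protocol.
struct MiniSeq {
    seq: AtomicU32,
}

impl MiniSeq {
    /// Spin until the counter is even, then return it (reader entry).
    fn read_begin(&self) -> u32 {
        loop {
            let s = self.seq.load(Ordering::Acquire);
            if s & 1 == 0 {
                return s;
            }
            std::hint::spin_loop();
        }
    }

    /// smp_rmb-style fence, then equality check against the snapshot.
    fn read_retry(&self, start: u32) -> bool {
        std::sync::atomic::fence(Ordering::Acquire);
        self.seq.load(Ordering::Relaxed) != start
    }

    /// One complete write: seq goes odd, then even, wrapping at u32::MAX.
    fn write_cycle(&self) {
        let s = self.seq.load(Ordering::Relaxed);
        self.seq.store(s.wrapping_add(1), Ordering::Release); // odd
        self.seq.store(s.wrapping_add(2), Ordering::Release); // even
    }
}
```

Because `read_retry` compares for equality rather than ordering, a write that wraps the counter across `u32::MAX` is detected exactly like any other write.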

3.6.2 Compare-and-Swap Semantics Differ by Architecture

All UmkaOS lock-free algorithms that update shared state use compare-and-swap (CAS) operations, exposed in Rust as AtomicT::compare_exchange and AtomicT::compare_exchange_weak. The semantics of these operations differ materially across ISAs, and algorithms must be written to be correct on the weakest variant.

x86 CMPXCHG — guaranteed single-attempt success: On x86, CMPXCHG is an atomic read-modify-write instruction that succeeds if and only if the current memory value matches the expected value. There are no spurious failures: if no other CPU has modified the location since the load, the CMPXCHG will succeed on the first attempt, always. This makes CAS on x86 appear to be a simple conditional store, and algorithms that assume single-try success work correctly there.

ARM LDXR/STXR — load-exclusive/store-exclusive with spurious failure: AArch64 and ARMv7 implement CAS using load-exclusive (LDXR) and store-exclusive (STXR) instruction pairs. The store-exclusive can fail even when no competing CPU has modified the target address. The CPU's exclusive monitor — a hardware reservation mechanism — can be cleared by an interrupt, a context switch, a hypervisor VM exit, a cache line eviction, or other microarchitectural events that have no relation to conflicting memory accesses. When the exclusive monitor is cleared, STXR returns a failure status and the CAS must be retried. This spurious failure is architecturally specified and guaranteed to occur occasionally in practice.

RISC-V LR/SC — load-reserved/store-conditional with spurious failure: RISC-V uses load-reserved (LR.W/LR.D) and store-conditional (SC.W/SC.D) pairs with identical semantics. The RISC-V specification explicitly permits SC to spuriously fail even in the absence of competing accesses. The reservation may be invalidated by any exception, interrupt, or other event during the LR/SC sequence.

PowerPC ldarx/stdcx. — load-and-reserve/store-conditional: PowerPC uses lwarx/stwcx. (32-bit) and ldarx/stdcx. (64-bit) with the same spurious-failure semantics. An interrupt, context switch, or cache invalidation between the load-reserve and store-conditional will cause the store-conditional to indicate failure, requiring retry.

Design implication for UmkaOS lock-free algorithms: Every lock-free algorithm in UmkaOS that modifies shared state through a CAS loop MUST be written as an unconditional retry loop that tolerates spurious failure. On x86, the retry branch is never taken in practice; on ARM, RISC-V, and PowerPC, it will be taken occasionally and the algorithm must converge correctly regardless of how many spurious failures occur before a genuine success.

A CAS loop that assumes at most one retry after a genuine conflict is incorrect on ARM/RISC-V/PPC. A CAS loop that checks for spurious failure explicitly but does not re-load the current value before retrying is also incorrect — the retry must reload the current memory value and re-evaluate the expected value before each subsequent attempt.

3.6.3 LL/SC Restrictions on Non-Cacheable (MMIO) Memory

On architectures that implement atomics via LL/SC pairs, atomic operations cannot be used on MMIO registers or other non-cacheable memory regions. This is a hard architectural constraint, not a performance guideline.

| Architecture | Restriction | Consequence |
|--------------|-------------|-------------|
| PowerPC | `lwarx`/`stwcx.` require Memory Coherence Required (M=1) storage. Cache-inhibited or write-through storage triggers a data storage or alignment exception. | Atomic operations on device registers must use lock-based access or device-specific compare-and-swap commands (e.g., PCIe AtomicOps). The generic AtomicT API must never be used on MMIO addresses. |
| ARM (AArch64/ARMv7) | `LDXR`/`STXR` on Device memory type is implementation-defined. Some cores (Cortex-A57, erratum 832075) deadlock on exclusive + device load. | Exclusive load/store sequences targeting Device-nGnRnE or Device-nGnRE memory may silently fail or hang. MMIO registers must be accessed via plain loads/stores with appropriate barriers (`ldr`/`str` + `dsb`), never via exclusive pairs. |
| RISC-V | LR/SC on I/O memory (PMA = non-cacheable) is implementation-defined. The spec permits but does not require support. | Assume LR/SC on MMIO fails. Use lock-based MMIO access or PCI AtomicOps where the device supports them. |
| x86-64 | `LOCK CMPXCHG` on uncacheable (UC) memory works correctly but is not guaranteed to be atomic with respect to device-side state. | x86 atomics on MMIO are architecturally permitted but should still be avoided — the device's view of atomicity differs from the CPU's. |

Design rule: The AtomicT Rust types (AtomicU32, AtomicU64, etc.) must never be used to access memory-mapped I/O registers. All MMIO access goes through the dedicated mmio_read_* / mmio_write_* functions (which emit plain loads/stores with architecture-appropriate barriers), or through the volatile read_volatile / write_volatile wrappers. Compile-time enforcement: MMIO regions are typed as MmioRegion<T>, not as raw pointers to atomic types.
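
A minimal sketch of the typed-wrapper idea — this `MmioRegion` is a simplified model, not the kernel's actual type; in the kernel the inner pointer comes from an ioremap-style device mapping, while here it points at ordinary memory so the example can run in userspace:

```rust
use core::ptr;

/// Simplified model of a typed MMIO register region: access goes only
/// through volatile loads/stores, never through AtomicT types, so the
/// compiler cannot elide, merge, or lower accesses to LL/SC sequences.
pub struct MmioRegion<T> {
    base: *mut T,
}

impl<T: Copy> MmioRegion<T> {
    /// # Safety
    /// `base` must point to a valid mapping of appropriate memory type
    /// for the lifetime of the region.
    pub unsafe fn new(base: *mut T) -> Self {
        Self { base }
    }

    pub fn read(&self) -> T {
        // SAFETY: validity guaranteed by the constructor contract.
        unsafe { ptr::read_volatile(self.base) }
    }

    pub fn write(&self, val: T) {
        // SAFETY: validity guaranteed by the constructor contract.
        unsafe { ptr::write_volatile(self.base, val) }
    }
}
```

Because the type exposes no way to obtain an `&AtomicU32` to the register, misuse of the generic atomics API on MMIO is ruled out at compile time rather than by convention.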

Rust compare_exchange vs compare_exchange_weak:

  • compare_exchange, on platforms that use LR/SC or LDXR/STXR internally, wraps the operation in a loop that retries on spurious failure, presenting the caller with x86-equivalent "fail only on genuine conflict" semantics. This loop is invisible to the caller and adds retries inside the atomic operation itself.
  • compare_exchange_weak exposes the underlying hardware semantics directly, permitting spurious failure to propagate to the caller. The caller's retry loop must handle it.

In UmkaOS lock-free algorithms that already contain an explicit retry loop (the standard pattern for CAS-based updates), compare_exchange_weak is preferred. Using compare_exchange inside an outer retry loop results in double-looping on ARM/RISC-V: the inner hidden loop retries spurious failures, and the outer loop retries genuine conflicts. The compare_exchange_weak variant eliminates the inner loop, letting the outer loop handle both cases, reducing instruction count and branch pressure on the architectures that implement CAS with LL/SC-style instructions.
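
The preferred shape is a single outer loop around compare_exchange_weak, re-evaluating the expected value from the `Err` result on every failure. The saturating-add routine below is a generic sketch of the pattern, not a specific kernel function:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Saturating add via a weak-CAS retry loop. The loop tolerates both
/// genuine conflicts and spurious LL/SC failures identically: on Err,
/// `current` is refreshed from the value the hardware actually observed
/// (carried in the Err payload, so no extra load is needed), and the
/// new value is recomputed before the next attempt.
fn saturating_add(counter: &AtomicU64, delta: u64) -> u64 {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        let next = current.saturating_add(delta);
        match counter.compare_exchange_weak(
            current,
            next,
            Ordering::AcqRel,  // success: publish the update
            Ordering::Relaxed, // failure: we only need the fresh value
        ) {
            Ok(_) => return next,
            Err(observed) => current = observed,
        }
    }
}
```

On x86 the `Err` arm is taken only on genuine conflict; on ARM, RISC-V, and PowerPC it is also taken on spurious reservation loss, and the loop converges regardless of how many such failures precede success.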

3.6.4 Idr<T> — Integer ID Allocator

Idr<T> is a radix-tree-based integer ID allocator, equivalent to Linux's struct idr (reimplemented on XArray since Linux 4.20). It maps small non-negative integer keys to values with O(1) average-case lookup and integrated next-ID allocation.

/// Radix-tree-based integer ID allocator.
///
/// Internally uses a 64-ary radix tree (6 bits per level). For the common
/// case of <64K entries, the tree is 3 levels deep (6+6+6 = 18 bits).
/// Full u32 range: 6 levels (6×6 = 36 > 32 bits). Full u64 range: 11
/// levels (6×11 = 66 > 64 bits). Sparse ID spaces incur only the cost of
/// allocated nodes — unpopulated subtrees are null pointers with no memory.
/// Each node is cache-line-aligned (64 bytes metadata + 512 bytes pointers).
///
/// **Concurrency model** (matches Linux's IDR/XArray):
/// - **Reads**: RCU-protected. `idr_find()` requires only `rcu_read_lock()`
///   and performs a lock-free radix tree walk. Multiple readers on different
///   CPUs never contend.
/// - **Writes**: Serialized by an internal `SpinLock`. `idr_alloc()` and
///   `idr_remove()` acquire the lock, update the tree, then publish changes
///   via RCU (store-release on the node pointer). Freed nodes are reclaimed
///   after an RCU grace period.
/// - **No external locking required** for basic operations. Callers that
///   need atomic read-modify-write sequences (e.g., "find and update if
///   exists") must hold their own lock.
pub struct Idr<T> {
    /// Root of the radix tree. RCU-published: readers load with Acquire,
    /// writers (under `lock`) publish with store-release.
    root: AtomicPtr<IdrNode<T>>,
    /// Next ID hint (speeds up sequential allocation).
    next_id: AtomicU32,
    /// Write-side lock (protects tree mutations).
    lock: SpinLock<()>,
}

Operations:

  • idr_alloc(value) -> Result<u32, IdrError>: Allocate the next free ID, store value.
  • idr_alloc_range(min, max, value) -> Result<u32, IdrError>: Allocate within a range.
  • idr_find(id) -> Option<&T>: Lock-free lookup (requires rcu_read_lock).
  • idr_remove(id) -> Option<T>: Remove entry, RCU-deferred node reclamation.
  • idr_for_each(f: FnMut(u32, &T)): Iterate all entries (requires rcu_read_lock).

Usage: PID namespaces (Section 17.1), SysV IPC ID allocation (Section 17.1), file descriptor tables (Section 8.1), and any kernel subsystem mapping small integers to objects.
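
The allocation semantics can be modeled in userspace with a flat table. `ToyIdr` is a toy stand-in (Vec-backed, no RCU, no radix tree) that mirrors only the API shape and the next-ID hint behavior:

```rust
/// Toy model of Idr<T> allocation: lowest free ID at or above the hint,
/// wrapping to 0 when the tail is full. The real Idr uses a 64-ary radix
/// tree with RCU-protected lookups; this model only mirrors the API.
struct ToyIdr<T> {
    slots: Vec<Option<T>>,
    next_id: u32,
}

impl<T> ToyIdr<T> {
    fn new() -> Self {
        Self { slots: Vec::new(), next_id: 0 }
    }

    fn alloc(&mut self, value: T) -> u32 {
        // Scan from the hint, wrap to 0, grow if everything is occupied.
        let start = self.next_id as usize;
        let id = (start..self.slots.len())
            .chain(0..start)
            .find(|&i| self.slots[i].is_none())
            .unwrap_or_else(|| {
                self.slots.push(None);
                self.slots.len() - 1
            });
        self.slots[id] = Some(value);
        self.next_id = id as u32 + 1; // hint speeds up sequential alloc
        id as u32
    }

    fn find(&self, id: u32) -> Option<&T> {
        self.slots.get(id as usize).and_then(|s| s.as_ref())
    }

    fn remove(&mut self, id: u32) -> Option<T> {
        self.slots.get_mut(id as usize).and_then(|s| s.take())
    }
}
```

Sequential allocation is O(1) thanks to the hint; a freed low ID is reused once the scan wraps past the occupied tail.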

3.6.5 WaitQueueHead — Blocking Wait Queue

A WaitQueueHead is a spinlock-protected intrusive list of waiters. Kernel code that needs to block until a condition is true (e.g., pipe readable, socket writable, child exited) inserts itself into a WaitQueueHead and calls schedule(). When the condition changes, the producer calls wake_up() to unblock one or all waiters.
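
The core discipline — register as a waiter under the lock and check the condition before sleeping — maps onto std primitives as follows. This is a userspace sketch, not the kernel API: the Mutex plays the role of WaitQueueHead.lock, the Condvar plays the role of the waiter list plus schedule(), and the re-check loop mirrors the wait_event retry loop (a spurious wakeup forces a condition re-check, never a missed wakeup):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Userspace analogue of wait_event/wake_up on a boolean condition.
struct Waitable {
    cond: Mutex<bool>,
    queue: Condvar,
}

impl Waitable {
    fn new() -> Self {
        Self { cond: Mutex::new(false), queue: Condvar::new() }
    }

    /// wait_event analogue: block until the condition flag is true.
    /// Checking the flag under the lock before sleeping closes the
    /// lost-wakeup window — a waker must take the same lock to flip it,
    /// so the flip cannot slip between our check and our sleep.
    fn wait_event(&self) {
        let mut ready = self.cond.lock().unwrap();
        while !*ready {
            ready = self.queue.wait(ready).unwrap();
        }
    }

    /// wake_up_all analogue: set the condition, then broadcast.
    fn wake_up_all(&self) {
        *self.cond.lock().unwrap() = true;
        self.queue.notify_all();
    }
}
```

Even if `wake_up_all()` runs before the waiter ever sleeps, the flag is already set when `wait_event()` takes the lock, so it returns immediately — the same property the kernel algorithm achieves by setting the task state before the final condition check.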

/// Flags controlling waiter behavior in a `WaitQueueHead`.
bitflags! {
    pub struct WaitFlags: u32 {
        /// Exclusive waiter: only ONE exclusive waiter is woken per wake_up() call.
        /// Non-exclusive waiters are always woken. This prevents thundering herd
        /// on resources where only one waiter can make progress (e.g., accept()).
        const EXCLUSIVE     = 1 << 0;
        /// Interruptible by signals. If unset, uses TASK_UNINTERRUPTIBLE.
        const INTERRUPTIBLE = 1 << 1;
        /// Bookmark entry used for safe list iteration during wake_up_all().
        /// Not a real waiter — never invokes wakeup function.
        const BOOKMARK      = 1 << 2;
    }
}

/// A single waiter registered on a WaitQueueHead.
/// Embedded in the waiter's stack frame or task struct.
pub struct WaitQueueEntry {
    /// Wakeup function called when the waiter is unblocked.
    /// For normal task sleep: sets task state to TASK_RUNNING and
    /// calls `try_to_wake_up()` (enqueues on runqueue).
    /// Custom wakeup functions are used by epoll (ep_poll_callback),
    /// io_uring (io_poll_wake), and autoremove waiters.
    pub wakeup: fn(*mut WaitQueueEntry) -> bool,
    /// Link in the wait queue list (intrusive).
    pub link: IntrusiveListNode,
    /// Pointer to the waiting task.
    /// **Design tradeoff**: The full Arc cost per wait/wake cycle is ~10-20 cycles:
    /// `Arc::clone` on insertion costs ~5-10 cycles (one `LOCK XADD` for refcount
    /// increment, NOT a heap allocation), and the paired `Arc::drop` on removal
    /// costs another ~5-10 cycles (one `LOCK XADD` decrement + deallocation branch).
    /// This prevents use-after-free when a task is killed (SIGKILL) while on a
    /// wait queue — a historically common bug class in Linux where raw
    /// `task_struct*` with manual ordering guarantees has caused subtle UAF issues.
    /// UmkaOS pays ~10-20 cycles per wait/wake cycle to eliminate this entire bug
    /// class. The wait path already takes the WaitQueueHead spinlock (~10-30
    /// cycles), so the Arc pair adds ~33-67% to the spinlock cost, or ~25-40%
    /// to the total wait/wake cost (~30-50 cycles overall). The design decision
    /// is sound: the total cost remains small, and the UAF prevention is worth it.
    pub task: Arc<Task>,
    /// Waiter flags: EXCLUSIVE, INTERRUPTIBLE, BOOKMARK.
    pub flags: WaitFlags,
    /// Private data for the wakeup function (e.g., poll key for epoll).
    pub private: usize,
}

/// Head of a wait queue. Holds a spinlock and the list of waiters.
/// Embedding one in a struct means "tasks can block waiting for this object".
/// Used by: PipeBuffer (read_wait/write_wait), sockets, epoll fds, futexes,
/// page fault waits, inode event waits.
pub struct WaitQueueHead {
    /// Protects the waiter list and coordinates with the waker.
    pub lock: SpinLock<()>,
    /// List of `WaitQueueEntry` nodes. Non-exclusive waiters are at the front
    /// (FIFO order); exclusive waiters are appended at the tail. This ordering
    /// ensures `wake_up()` wakes ALL non-exclusive waiters plus exactly ONE
    /// exclusive waiter, preventing thundering herd while still broadcasting
    /// to poll/epoll listeners.
    ///
    /// **Priority ordering within exclusive waiters**: exclusive entries are
    /// sorted by task scheduling class (DL > RT > CFS) so that higher-priority
    /// tasks are woken first. This prevents priority starvation among waiters
    /// (a low-priority exclusive waiter cannot indefinitely precede a
    /// high-priority one in the wake order). Note: this is *not* priority
    /// inversion prevention — true PI requires a PI-mutex mechanism
    /// ([Section 8.4](08-process.md#real-time-guarantees--priority-inheritance-protocol)) that boosts
    /// the lock holder's priority, not just the wake order of waiters.
    pub head: IntrusiveListHead,
}

impl WaitQueueHead {
    /// Block until `condition()` returns true. Interruptible by signals.
    /// Returns `Ok(())` when condition is true, `Err(EINTR)` if interrupted.
    ///
    /// **Algorithm** (standard wait-loop; avoids lost-wakeup race):
    /// ```
    /// 1. Allocate WaitQueueEntry on caller's stack with wakeup=default_wakeup.
    /// 2. Lock self.lock.
    /// 3. Insert entry into self.head:
    ///       - Non-exclusive (flags & EXCLUSIVE == 0): prepend to head (front).
    ///       - Exclusive: append to tail, sorted by scheduling class priority
    ///         (DL > RT > CFS) among other exclusive entries.
    /// 4. Set current task state to TASK_INTERRUPTIBLE.
    /// 5. Unlock self.lock.
    /// 6. Check condition(). If true:
    ///       Set task state to TASK_RUNNING.
    ///       Remove entry from queue (lock → splice out → unlock).
    ///       Return Ok(()).
    /// 7. Call schedule(). Suspend until wake_up_one/wake_up_all is called.
    /// 8. On resume: set task state to TASK_RUNNING.
    ///       Check for pending signal: if signal_pending(), go to step 9.
    ///       Check condition(). If true: remove entry, return Ok(()).
    ///       Otherwise: set state to TASK_INTERRUPTIBLE, go to step 7 (retry loop).
    /// 9. Signal path: remove entry from queue, return Err(EINTR).
    /// ```
    /// Step 4 before step 6 is essential: a waker calling `wake_up_one()` between
    /// steps 5 and 6 sets the task TASK_RUNNING before `schedule()` is called, so
    /// `schedule()` will return immediately — the wakeup is never lost.
    pub fn wait_event<F: Fn() -> bool>(&self, condition: F) -> Result<(), Errno>;

    /// Block until `condition()` returns true. NOT interruptible by signals.
    /// Returns only when the condition is true (no EINTR).
    /// Same algorithm as `wait_event` but uses TASK_UNINTERRUPTIBLE at step 4
    /// and skips the signal check at step 8.
    pub fn wait_event_uninterruptible<F: Fn() -> bool>(&self, condition: F);

    /// Block until `condition()` returns true OR `timeout_ns` nanoseconds
    /// elapse, whichever comes first. Returns `true` if the condition was
    /// met before the timeout, `false` if the timeout expired.
    ///
    /// **Algorithm** (matches `wait_event` with timer-based wakeup):
    /// ```
    /// 1. Check condition() — if true, return true immediately.
    /// 2. Allocate WaitQueueEntry on stack (exclusive = false).
    /// 3. Add entry to self.head (under self.lock).
    /// 4. Set current task state to TASK_UNINTERRUPTIBLE.
    /// 5. Arm a high-resolution timer (hrtimer) with `timeout_ns` duration.
    ///    The timer callback sets the task state to TASK_RUNNING and calls
    ///    sched_enqueue(current) — identical to a wake_up() from the
    ///    condition producer.
    /// 6. Re-check condition() — if true, cancel timer, remove entry, return true.
    /// 7. Call schedule() to yield the CPU.
    /// 8. On wakeup: check if timer expired (timer callback sets a local flag).
    ///    If expired: remove entry from queue, return false (timeout).
    /// 9. Re-check condition() — if true, cancel timer, remove entry, return true.
    ///    Otherwise: go to step 4 (retry loop — spurious wakeup).
    /// ```
    ///
    /// The timer ensures bounded wait even if the condition producer never
    /// calls `wake_up()`. The timer is cancelled on the success path to
    /// avoid a spurious wakeup after return.
    ///
    /// MUST NOT be called from atomic context (sleeps).
    pub fn wait_event_timeout<F: Fn() -> bool>(
        &self,
        timeout_ns: u64,
        condition: F,
    ) -> bool;

    /// Wake waiters: all non-exclusive waiters PLUS exactly one exclusive waiter.
    ///
    /// This is the standard Linux `wake_up()` semantic. It prevents thundering
    /// herd: only one exclusive waiter is woken (the highest-priority one at the
    /// tail), while all non-exclusive waiters (epoll, poll) are always notified.
    ///
    /// **Algorithm**:
    /// ```
    /// 1. Lock self.lock.
    /// 2. If self.head is empty: unlock, return.
    /// 3. Walk the list from head:
    ///       For each non-exclusive entry: call entry.wakeup(&entry).
    ///       For the first exclusive entry encountered: call entry.wakeup(&entry),
    ///         then STOP (do not wake further exclusive waiters).
    /// 4. Remove woken entries from the list.
    /// 5. Unlock self.lock.
    /// ```
    /// Wakeup callbacks run under the lock to prevent the woken task from
    /// re-inserting before the walk completes. The lock hold time is bounded
    /// by the number of non-exclusive waiters (typically ≤ epoll fd count).
    pub fn wake_up(&self);

    /// Wake up exactly one waiter (the first in FIFO order, ignoring flags).
    /// Used when the caller knows exactly one waiter should proceed.
    ///
    /// **Algorithm**:
    /// ```
    /// 1. Lock self.lock.
    /// 2. If self.head is empty: unlock, return.
    /// 3. Pop the first WaitQueueEntry from self.head.
    /// 4. Unlock self.lock.
    /// 5. Call entry.wakeup(&entry):
    ///       default_wakeup: atomically sets entry.task.state = TASK_RUNNING,
    ///       then calls sched_enqueue(entry.task) to place task on its CPU's runqueue.
    /// ```
    /// Unlocking before calling `sched_enqueue` avoids holding the wait queue
    /// spinlock during scheduler operations (which may acquire runqueue locks).
    pub fn wake_up_one(&self);

    /// Wake up all waiters simultaneously (broadcast).
    ///
    /// **Algorithm**:
    /// ```
    /// 1. Lock self.lock.
    /// 2. Drain self.head into a local list (swap head pointer to empty list).
    /// 3. Unlock self.lock.
    /// 4. For each entry in local list: call entry.wakeup(&entry).
    /// ```
    /// Draining to a local list (step 2) means wakeups run outside the lock,
    /// avoiding contention between woken tasks and new waiters inserting.
    pub fn wake_up_all(&self);
}

Usage: PipeBuffer (§17.3.2), socket wait queues, epoll event delivery, page fault wait-on-writeback, waitpid() child-exit notification.
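The check-sleep-recheck discipline described above can be illustrated with a userspace analogue. This is a sketch only: it uses std's `Mutex` and `Condvar` in place of `WaitQueueHead` and the scheduler, so the `WaitAnalogue` type and `demo` helper are hypothetical illustrations, not the kernel API. The essential property carries over: a wakeup returns to the caller only once the condition is actually true, so spurious wakeups loop back to the check (mirroring the retry step of `wait_event_timeout`).

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Userspace analogue of WaitQueueHead::wait_event / wake_up, built on
/// std's Condvar. The condition is evaluated under the lock, and every
/// wakeup (including a spurious one) re-checks it before returning.
pub struct WaitAnalogue {
    state: Mutex<bool>, // stand-in for an arbitrary `condition()`
    cv: Condvar,
}

impl WaitAnalogue {
    pub fn new() -> Self {
        Self { state: Mutex::new(false), cv: Condvar::new() }
    }

    /// Block until the flag is set (cf. wait_event_uninterruptible).
    pub fn wait(&self) {
        let mut ready = self.state.lock().unwrap();
        while !*ready {
            // Re-check on every wakeup: spurious wakeups loop back here.
            ready = self.cv.wait(ready).unwrap();
        }
    }

    /// Set the flag and wake all waiters (cf. wake_up_all).
    pub fn signal(&self) {
        *self.state.lock().unwrap() = true;
        self.cv.notify_all();
    }
}

/// Demonstration: a producer thread signals; the consumer blocks until then.
pub fn demo() -> bool {
    let wq = Arc::new(WaitAnalogue::new());
    let producer = {
        let wq = Arc::clone(&wq);
        thread::spawn(move || wq.signal())
    };
    wq.wait(); // returns only after signal()
    producer.join().unwrap();
    true
}
```

The pre-signal path also falls out of the same loop: if `signal()` runs first, the initial condition check succeeds and `wait()` never sleeps.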

3.6.6 Completion — One-Shot (or Multi-Shot) Signaling Primitive

A counter-based synchronization primitive for "producer does work, consumer waits for completion" patterns. Built on WaitQueueHead + AtomicU32. Equivalent to Linux's struct completion (include/linux/completion.h).

Used by: device probe completion, firmware load wait, module init synchronization, workqueue flush, AHCI command completion (Section 15.4), DMA fence completion, driver unload synchronization.

/// Counter-based completion primitive.
///
/// `done` tracks the number of completions:
/// - 0 = not yet completed; `wait()` will block.
/// - 1..u32::MAX-1 = completed N times; `wait()` decrements and returns.
/// - u32::MAX = "complete all" sentinel; `wait()` returns without decrement.
///
/// Supports three patterns:
/// 1. **One-shot**: `complete()` once, `wait()` once. Standard case.
/// 2. **Pre-completion**: `complete()` before `wait()`. The waiter returns
///    immediately (done > 0).
/// 3. **Barrier**: `complete_all()` sets done to u32::MAX, waking ALL
///    current and future waiters until `reinit()` is called.
pub struct Completion {
    /// Completion counter. 0 = pending, >0 = done.
    done: AtomicU32,
    /// Wait queue for blocked waiters.
    wq: WaitQueueHead,
}

impl Completion {
    /// Create a new, uncompleted Completion.
    pub const fn new() -> Self {
        Self {
            done: AtomicU32::new(0),
            wq: WaitQueueHead::new(),
        }
    }

    /// Block until the completion is signaled (done > 0).
    ///
    /// If `done` is already > 0 (pre-completion or multi-completion),
    /// atomically decrements `done` and returns immediately (no sleep).
    /// If `done` == u32::MAX (complete_all), returns without decrement.
    ///
    /// **Atomicity**: Uses a CAS loop to atomically check-and-decrement `done`.
    /// The previous load+fetch_sub pattern had a TOCTOU race: two concurrent
    /// waiters could both observe `done > 0`, both call `fetch_sub(1)`, and
    /// underflow `done` below the intended value. The CAS loop ensures that
    /// exactly one waiter consumes each completion increment.
    ///
    /// MUST NOT be called from atomic context.
    pub fn wait(&self) {
        self.wq.wait_event(|| {
            loop {
                let d = self.done.load(Ordering::Acquire);
                if d == u32::MAX {
                    return true; // complete_all sentinel: always ready
                }
                if d == 0 {
                    return false; // not complete — sleep
                }
                // Try to atomically consume one completion: d → d-1.
                if self.done.compare_exchange_weak(
                    d, d - 1,
                    Ordering::AcqRel, Ordering::Acquire,
                ).is_ok() {
                    return true;
                }
                // CAS failed (concurrent complete() or another waiter) — retry.
            }
        });
    }

    /// Block until the completion is signaled or `timeout_ns` nanoseconds
    /// elapse. Returns `true` if completed, `false` on timeout.
    ///
    /// Uses the same CAS loop as `wait()` for atomicity. See `wait()` for
    /// the TOCTOU race rationale.
    ///
    /// MUST NOT be called from atomic context.
    pub fn wait_timeout(&self, timeout_ns: u64) -> bool {
        self.wq.wait_event_timeout(timeout_ns, || {
            loop {
                let d = self.done.load(Ordering::Acquire);
                if d == u32::MAX {
                    return true; // complete_all sentinel
                }
                if d == 0 {
                    return false; // not complete — sleep or timeout
                }
                if self.done.compare_exchange_weak(
                    d, d - 1,
                    Ordering::AcqRel, Ordering::Acquire,
                ).is_ok() {
                    return true;
                }
            }
        })
    }

    /// Non-blocking check. Returns `true` if the completion has been
    /// signaled (done > 0). Does NOT consume the completion.
    pub fn try_wait(&self) -> bool {
        self.done.load(Ordering::Acquire) > 0
    }

    /// Signal one waiter. Increments `done` and wakes the
    /// highest-priority waiter on the wait queue.
    ///
    /// If `done == u32::MAX` (the `complete_all` sentinel), this is a no-op:
    /// `complete_all()` has already been called, and incrementing would wrap
    /// `done` to 0, silently destroying the sentinel and causing all future
    /// `wait()` calls to block forever. The guard is unconditional (not
    /// debug-only) because sentinel destruction is a latent correctness bug
    /// that would manifest silently in production, not just in tests.
    ///
    /// **Atomicity**: Uses a CAS loop to atomically check-and-increment `done`,
    /// matching the CAS loop in `wait()`. The previous load-then-fetch_add
    /// pattern had a TOCTOU race: between the `load(Relaxed)` guard check and
    /// the `fetch_add(1, Release)`, a concurrent `complete_all()` could store
    /// `u32::MAX`. The `fetch_add` would then wrap `u32::MAX` to 0, destroying
    /// the sentinel. The CAS loop prevents this by re-checking the value before
    /// committing the increment.
    ///
    /// **Cost**: ~2-5 cycles more than the previous non-atomic path in the
    /// uncontended case (one CAS instead of one load + one fetch_add). Negligible
    /// for a completion signaling path.
    ///
    /// May be called from any context (including hardirq).
    pub fn complete(&self) {
        loop {
            let prev = self.done.load(Ordering::Relaxed);
            if prev == u32::MAX {
                // complete_all() sentinel already set — wake is harmless.
                break;
            }
            debug_assert!(prev < u32::MAX - 1, "Completion counter overflow");
            if self.done.compare_exchange_weak(
                prev, prev + 1, Ordering::Release, Ordering::Relaxed,
            ).is_ok() {
                break;
            }
            // CAS failed (concurrent complete() or complete_all()) — retry.
            // spin_loop() maps to YIELD on ARM, PAUSE on x86, nop on RISC-V.
            // This mitigates LL/SC contention on weakly-ordered architectures
            // where spurious CAS failures thrash the cache line reservation.
            // The loop is bounded by N concurrent callers (at most N*K
            // iterations where K is the LL/SC retry factor, typically 2-4).
            core::hint::spin_loop();
        }
        self.wq.wake_up_one();
    }

    /// Signal ALL current and future waiters. Sets `done` to u32::MAX
    /// (sentinel) and wakes all waiters on the wait queue.
    ///
    /// After `complete_all()`, any call to `wait()` returns immediately
    /// until `reinit()` is called. Used for barrier patterns where all
    /// waiters should proceed (e.g., module init gate).
    ///
    /// May be called from any context (including hardirq).
    pub fn complete_all(&self) {
        self.done.store(u32::MAX, Ordering::Release);
        self.wq.wake_up_all();
    }

    /// Reset the completion for reuse. Sets `done` to 0.
    ///
    /// MUST only be called when no waiters are blocked (typically after
    /// `complete_all()` + all waiters have returned from `wait()`).
    /// Calling `reinit()` with blocked waiters is a programming error.
    pub fn reinit(&self) {
        debug_assert!(
            self.wq.is_empty(),
            "Completion::reinit() called with blocked waiters"
        );
        // Release ordering is defensive: by contract, no concurrent readers
        // exist when reinit() is called. Relaxed would suffice for correctness.
        // Release is used for consistency with complete()/complete_all() and
        // costs nothing on TSO architectures (x86-64, s390x) and ~1-2 cycles
        // on weakly-ordered architectures (ARM, RISC-V, PPC).
        self.done.store(0, Ordering::Release);
    }
}

Longevity analysis for done: AtomicU32: Exempt from u64 requirement as a bounded refcount (not a monotonic identifier). The counter is bounded by complete_all() (sets to u32::MAX sentinel) or reinit() (resets to 0). Normal complete() increments by 1; wait() decrements by 1. Wrap past u32::MAX-1 is detected by a debug assertion. In practice, done rarely exceeds single digits (one-shot pattern). The only risk is a bug calling complete() without matching wait() — the debug assertion catches this.
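The CAS discipline at the heart of `wait()` and `complete()` can be exercised in isolation on a bare `AtomicU32`. The sketch below extracts the `done` state machine as two hypothetical free functions (no wait queue involved) so the two properties argued above can be checked directly: each completion increment is consumed by exactly one waiter, and `complete()` never wraps the `u32::MAX` sentinel.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Attempt to consume one completion. Mirrors the closure inside `wait()`:
/// returns true if a completion was available (or the complete_all sentinel
/// is set), false if the counter is zero.
pub fn try_consume(done: &AtomicU32) -> bool {
    loop {
        let d = done.load(Ordering::Acquire);
        if d == u32::MAX {
            return true; // complete_all sentinel: always ready, no decrement
        }
        if d == 0 {
            return false; // nothing to consume
        }
        // Atomically consume one completion: d → d-1.
        if done
            .compare_exchange_weak(d, d - 1, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return true; // consumed exactly one increment
        }
        // CAS failed (concurrent signal or another consumer): retry.
    }
}

/// Signal one completion. Mirrors `complete()`: refuses to increment past
/// the sentinel, so a racing complete_all() is never wrapped to 0.
pub fn signal(done: &AtomicU32) {
    loop {
        let prev = done.load(Ordering::Relaxed);
        if prev == u32::MAX {
            return; // sentinel already set: no-op
        }
        if done
            .compare_exchange_weak(prev, prev + 1, Ordering::Release, Ordering::Relaxed)
            .is_ok()
        {
            return;
        }
        std::hint::spin_loop();
    }
}
```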

3.6.7 SpscRing<T, N> — Lock-Free Single-Producer Single-Consumer Ring Buffer

A fixed-capacity, NMI-safe, lock-free ring buffer for unidirectional data flow between exactly one producer and one consumer. This is a core concurrency primitive used across multiple subsystems wherever a single writer must communicate with a single reader without any locking.

/// Lock-free single-producer single-consumer ring buffer.
///
/// # Safety Guarantees
/// - **NMI-safe**: The producer may run in NMI context (no locks, no allocation,
///   no blocking). The consumer runs in task or softirq context.
/// - **Wait-free producer**: `try_push()` completes in O(1) with no CAS loops.
///   It either succeeds or returns `Err(Full)` — never spins.
/// - **Lock-free consumer**: `try_pop()` completes in O(1).
/// - **Memory ordering**: Producer uses `Release` store on `head`; consumer uses
///   `Acquire` load on `head`. Symmetric for `tail`. This is sufficient for all
///   architectures including ARMv7 and PPC32 (no fences needed beyond the
///   atomic ordering).
///
/// # Capacity
/// `N` must be a power of two (enforced at compile time). Effective capacity
/// is `N - 1` to distinguish full from empty.
///
/// # Type constraint
/// `T: Copy` — elements are memcpy'd into/out of the ring. No destructors
/// run inside the ring (NMI safety requires no heap interaction).
pub struct SpscRing<T: Copy, const N: usize> {
    /// Ring buffer storage. Cache-line padded to avoid false sharing
    /// between producer (writes to `buf[head]`) and consumer (reads from
    /// `buf[tail]`).
    buf: [MaybeUninit<T>; N],
    /// Next write position (producer-owned). Indexed modulo N.
    /// Cache-line aligned to avoid false sharing with `tail`.
    /// **Longevity**: AtomicU32 wrapping is correct by design. The distance
    /// `head.wrapping_sub(tail)` is always in [0, N) because the producer
    /// cannot advance head beyond tail + N (full check). Power-of-two
    /// bitmask indexing (`idx & (N - 1)`) is wrap-safe. No operational
    /// risk from counter wrapping — indices are relative, not absolute.
    head: CacheAligned<AtomicU32>,
    /// Next read position (consumer-owned). Indexed modulo N.
    tail: CacheAligned<AtomicU32>,
}

impl<T: Copy, const N: usize> SpscRing<T, N> {
    /// Compile-time bounds: N must fit in u32 (head/tail are AtomicU32) and
    /// must be a power of two for efficient modulo via bitmask.
    ///
    /// Placed as an associated const inside the `impl` block so the assertions
    /// have access to the generic `N` parameter. A bare `const _: () = { ... }`
    /// at module level does not have `N` in scope. Note that an associated
    /// const in a generic impl is evaluated only when referenced, so a
    /// constructor must mention it (e.g. `let () = Self::_ASSERT;` in `new()`)
    /// for the checks to fire at monomorphization time.
    const _ASSERT: () = {
        assert!(N <= u32::MAX as usize, "SpscRing N exceeds u32::MAX");
        assert!(N.is_power_of_two(), "SpscRing N must be a power of two");
    };
    /// Push an element. Returns `Err(Full)` if the ring is full.
    /// NMI-safe: no allocation, no lock, no blocking.
    pub fn try_push(&self, val: T) -> Result<(), RingError>;

    /// Pop an element. Returns `Err(Empty)` if the ring is empty.
    pub fn try_pop(&self) -> Result<T, RingError>;

    /// Number of elements currently in the ring.
    pub fn len(&self) -> u32;

    /// Whether the ring is empty.
    pub fn is_empty(&self) -> bool;
}

pub enum RingError {
    Full,
    Empty,
}
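`try_push` and `try_pop` are declared above but not shown. The following is a hedged sketch consistent with the stated Release/Acquire discipline, under these assumptions: `CacheAligned` padding, the compile-time asserts, and `len`/`is_empty` are elided, and `UnsafeCell` stands in for the kernel's storage wrapper (a `&self` producer needs interior mutability).

```rust
use core::cell::UnsafeCell;
use core::mem::MaybeUninit;
use core::sync::atomic::{AtomicU32, Ordering};

pub enum RingError { Full, Empty }

/// Sketch of the SPSC ring. One producer, one consumer; indices are
/// free-running u32s, masked with N-1 on access (N is a power of two).
pub struct SpscRing<T: Copy, const N: usize> {
    buf: UnsafeCell<[MaybeUninit<T>; N]>,
    head: AtomicU32, // producer-owned write index
    tail: AtomicU32, // consumer-owned read index
}

// SAFETY: exactly one producer and one consumer, synchronized by the
// Release/Acquire pairs on head and tail below.
unsafe impl<T: Copy + Send, const N: usize> Sync for SpscRing<T, N> {}

impl<T: Copy, const N: usize> SpscRing<T, N> {
    pub fn new() -> Self {
        Self {
            buf: UnsafeCell::new([MaybeUninit::uninit(); N]),
            head: AtomicU32::new(0),
            tail: AtomicU32::new(0),
        }
    }

    /// Wait-free push. One slot is kept empty to distinguish full from empty.
    pub fn try_push(&self, val: T) -> Result<(), RingError> {
        let head = self.head.load(Ordering::Relaxed); // producer-owned
        let tail = self.tail.load(Ordering::Acquire); // pairs with pop's Release
        if head.wrapping_sub(tail) as usize == N - 1 {
            return Err(RingError::Full);
        }
        // SAFETY: the slot is invisible to the consumer until the Release
        // store on `head` below publishes the write.
        unsafe {
            (*self.buf.get())[head as usize & (N - 1)] = MaybeUninit::new(val);
        }
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Ok(())
    }

    /// Lock-free pop.
    pub fn try_pop(&self) -> Result<T, RingError> {
        let tail = self.tail.load(Ordering::Relaxed); // consumer-owned
        let head = self.head.load(Ordering::Acquire); // pairs with push's Release
        if head == tail {
            return Err(RingError::Empty);
        }
        // SAFETY: the Acquire load of `head` makes the producer's write visible.
        let val = unsafe { (*self.buf.get())[tail as usize & (N - 1)].assume_init() };
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        Ok(val)
    }
}
```

Note that neither path contains a CAS: each side writes only the index it owns, which is what makes the producer wait-free rather than merely lock-free.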

Subsystem usage:

Subsystem Type Parameter Capacity Producer Context Consumer Context
perf PMU RawSample per-event (configurable) NMI/PMI handler perf reader kthread
seccomp notify SeccompNotif 256 syscall entry supervisor process
EDAC u64 (timestamp) UE_BURST_WINDOW MCE handler (NMI) EDAC poller kthread
IMA ImaMeasurement 4096 file open path IMA worker kthread
audit AuditRecord 8192 syscall exit audit daemon kthread
input InputEvent 64 IRQ handler evdev reader
userfaultfd UffdMsg 2048 page fault handler uffd monitor process
MCE log MceLogEntry 32 NMI handler mcelog reader
pstore PstoreRecord 64 panic/NMI pstore flush kthread

All uses share the same SpscRing<T, N> implementation — no per-subsystem reimplementation. The ring is instantiated inline (no heap allocation for the ring itself; the containing struct allocates it as a field).

3.7 Scalability Analysis: Hot-Path Metadata on 256+ Cores

Moving from a monolithic kernel to a hybrid one can introduce new contention bottlenecks in the "core" layer — specifically in metadata structures that every I/O operation must touch. This section analyzes UmkaOS's three highest-contention metadata subsystems on many-core machines (256-512 cores, 4-8 NUMA nodes).

1. Capability Table (Section 9.1)

Every syscall and driver invocation performs a capability check. On a 512-core machine running 10,000 concurrent processes, the capability table sees millions of lookups per second.

Design for contention:

  • Per-process capability tables: each process has its own capability table (a small array indexed by capability handle, typically <256 entries). Lookups are process-local with no cross-process contention. Two processes on different CPUs never touch the same capability table.
  • Capability creation/delegation (write path): uses a per-process lock. Contends only when the same process creates capabilities from multiple threads simultaneously (rare — capability creation is a cold path).
  • Capability revocation: RCU-deferred. Revoking a capability marks it as invalid (atomic store), then defers memory reclamation to an RCU grace period. No lock on the read path.

Contention profile: none on read path (per-process, indexed by local handle). Comparable to Linux's file descriptor table — per-process, never a global bottleneck.
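The read/revoke split can be sketched as follows. The entry layout and names are hypothetical, and the RCU grace-period reclamation is elided: revocation here is only the atomic invalidation store, which is the part the read path observes.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical per-process capability entry. The rights mask is
/// immutable after creation; `valid` is the revocation flag flipped
/// atomically by revoke().
pub struct CapEntry {
    pub rights: u64,
    pub valid: AtomicBool,
}

/// Per-process capability table: a small array indexed by handle.
/// Lookups are process-local, so there is no cross-process contention.
pub struct CapTable {
    entries: Vec<Option<CapEntry>>,
}

impl CapTable {
    pub fn new(slots: usize) -> Self {
        Self { entries: (0..slots).map(|_| None).collect() }
    }

    pub fn install(&mut self, handle: usize, rights: u64) {
        self.entries[handle] = Some(CapEntry {
            rights,
            valid: AtomicBool::new(true),
        });
    }

    /// Lock-free read path: index by local handle, check the valid flag,
    /// check the rights. In the kernel this runs inside an RCU read-side
    /// critical section so a concurrent revoker cannot free the entry.
    pub fn check(&self, handle: usize, needed: u64) -> bool {
        match self.entries.get(handle).and_then(|e| e.as_ref()) {
            Some(e) => e.valid.load(Ordering::Acquire) && (e.rights & needed) == needed,
            None => false,
        }
    }

    /// Revocation: a single atomic store. Memory reclamation is deferred
    /// to an RCU grace period (elided in this sketch).
    pub fn revoke(&self, handle: usize) {
        if let Some(Some(e)) = self.entries.get(handle) {
            e.valid.store(false, Ordering::Release);
        }
    }
}
```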

2. Unified Object Namespace / umkafs (Section 20.5)

The object namespace provides /sys/kernel/umka/ introspection. On a busy system, monitoring tools (umka-top, prometheus-exporter) may read thousands of objects per second.

Design for contention:

  • Registry reads: the device registry (Section 11.4) is RCU-protected. Reads are lock-free. Enumeration snapshots use a seqlock to detect concurrent modifications.
  • Attribute reads (sysfs-style): most attributes are backed by atomic counters or per-CPU counters that are aggregated on read. Example: a NIC's rx_packets counter is per-CPU; reading it sums all per-CPU values. No lock on the write path (per-CPU increment), brief aggregation on the read path.
  • Object creation/destruction (hot-plug, process exit): per-subsystem lock, not global. Creating a network device takes the network subsystem lock; creating a block device takes the block subsystem lock. No global namespace lock.
  • Pathological case: a monitoring tool reading every object attribute on every CPU at 1Hz on a 512-core, 10,000-device system. The aggregation overhead is proportional to num_cpus × num_objects — at 512 × 10,000, this is ~5M atomic reads per scan. At ~5ns each, one scan takes ~25ms. This is acceptable for 1Hz monitoring; for higher-frequency monitoring, tools should read only the objects they need. Note: this 25ms scan runs in process context (a monitoring thread), not in interrupt context. It does not affect the 50μs interrupt latency guarantee from Section 8.4, which governs ISR entry latency — a completely orthogonal concern.

Contention profile: lock-free reads (RCU + atomics), per-subsystem writes. Bottleneck only if monitoring tools read aggressively (configurable rate limiting).
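The per-CPU counter pattern behind attribute reads can be sketched as follows. The type and names are hypothetical, and the cache-line padding of the slots is elided; the point is the asymmetry between the write path (a relaxed increment of a CPU-private slot) and the read path (a full aggregation scan).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-CPU statistics counter (e.g. a NIC's rx_packets).
/// Each CPU increments only its own slot, so the hot write path never
/// bounces a shared cache line; the cold read path pays the aggregation.
pub struct PerCpuCounter {
    slots: Vec<AtomicU64>, // one slot per CPU (cache-line padding elided)
}

impl PerCpuCounter {
    pub fn new(num_cpus: usize) -> Self {
        Self {
            slots: (0..num_cpus).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Hot path: relaxed increment of the caller's own slot. No stronger
    /// ordering is needed because no other CPU writes this slot.
    pub fn inc(&self, cpu: usize) {
        self.slots[cpu].fetch_add(1, Ordering::Relaxed);
    }

    /// Cold path: sum all per-CPU slots. The result is a snapshot that is
    /// consistent enough for monitoring (slots may advance mid-scan).
    pub fn read(&self) -> u64 {
        self.slots.iter().map(|s| s.load(Ordering::Relaxed)).sum()
    }
}
```

This is the aggregation whose cost the pathological case above quantifies: `read()` is O(num_cpus) per counter, while `inc()` stays O(1) and contention-free.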

3. Page Cache and Memory Management (Section 4.1)

The page cache is the single highest-contention structure in any kernel. Every file read, every mmap fault, every page writeback touches it.

Design for contention:

  • Page cache lookups: RCU-protected radix tree, per-inode (same as Linux's xarray). Two threads faulting pages from different files never contend. Two threads faulting different pages from the same file contend only at the xarray level (fine-grained, per-node locking within the radix tree).
  • LRU lists: per-NUMA-node LRU lists with per-CPU pagevec batching (same design as Linux). Pages are added to a per-CPU buffer; the buffer is drained to the LRU list in batches of 15 pages. This amortizes the per-NUMA LRU lock to 1 acquisition per 15 page faults.
  • Buddy allocator: per-NUMA-node lock, with per-CPU free page pools absorbing >95% of allocations (Section 4.3). On a 512-core, 8-NUMA-node system, each buddy allocator sees ~1/8 of the allocation traffic, and per-CPU pools absorb most of it.
  • Per-VMA locks (see Section 4.8) eliminate mmap_lock as a scalability bottleneck for the page fault path. On 256 cores, page faults acquire only the per-VMA vm_lock.read(), which is per-VMA (not per-process). VMA structural modifications (mmap/munmap) still acquire mmap_lock.write(), but these are infrequent relative to page faults (typically <0.1% of memory operations). The remaining contention point is the page table lock (pte_lock or pmd_lock), which is per-page-table-page and thus naturally distributed.
  • Known scaling limit: extreme mmap/munmap churn (thousands of VMAs per second per process) contends on the per-process mmap_lock.write(). UmkaOS uses the same maple tree (lockless reads, locked writes) as Linux 6.1+. For workloads that create/destroy VMAs at extreme rates (JVMs, some databases), this is a known bottleneck in all current kernels. Per-VMA locks do not help here — the contention is on mmap_lock.write() for structural modifications, not on the fault path. UmkaOS does not claim to solve it — the contention is fundamental to the VMA data structure, not to the hybrid architecture.

Contention profile: per-inode, per-NUMA, per-CPU on all hot paths. No new bottlenecks introduced by the hybrid architecture — UmkaOS's page cache follows the same design as Linux (which already runs on 512+ core machines).
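The pagevec batching described above can be sketched as follows. The types and names are hypothetical, a bare pfn stands in for a page, and an instrumentation counter is added so the amortization (one LRU lock acquisition per 15 page additions) is observable.

```rust
use std::sync::Mutex;

const PAGEVEC_SIZE: usize = 15;

/// Hypothetical per-NUMA-node LRU, protected by one lock per node.
pub struct NodeLru {
    pub pages: Mutex<Vec<u64>>, // pfns, most recently added last
    pub lock_acquisitions: Mutex<u64>, // instrumentation for the sketch
}

/// Per-CPU pagevec: accumulates pages locally (no lock), then drains to
/// the node LRU in batches of 15, amortizing the LRU lock.
pub struct Pagevec {
    buf: Vec<u64>,
}

impl Pagevec {
    pub fn new() -> Self {
        Self { buf: Vec::with_capacity(PAGEVEC_SIZE) }
    }

    /// Add a page to the local buffer; drain when the batch is full.
    pub fn add(&mut self, pfn: u64, lru: &NodeLru) {
        self.buf.push(pfn);
        if self.buf.len() == PAGEVEC_SIZE {
            self.drain(lru);
        }
    }

    /// Take the LRU lock once and move the whole batch across.
    pub fn drain(&mut self, lru: &NodeLru) {
        if self.buf.is_empty() {
            return;
        }
        *lru.lock_acquisitions.lock().unwrap() += 1;
        lru.pages.lock().unwrap().append(&mut self.buf);
    }
}
```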

Summary — the hybrid architecture's overhead (isolation domain switches) is orthogonal to metadata contention. The "core" layer's metadata structures use the same per-CPU, per-NUMA, RCU-based techniques that make Linux scale to 256+ cores. No new global locks are introduced. The capability system is per-process (no shared structure), the object namespace is RCU-read / per-subsystem-write, and the page cache follows Linux's proven xarray + pagevec design.

3.8 Interrupt Handling

  • All device interrupts are routed to threaded interrupt handlers (same as Linux IRQF_THREADED).
  • Top-half handlers are kept minimal: acknowledge interrupt, wake thread.
  • This ensures all interrupt processing is preemptible and schedulable.
  • MSI/MSI-X is preferred for all PCI devices (per-queue interrupts, no sharing).
  • High-rate interrupt paths (100GbE at line-rate, NVMe at millions of IOPS): the threaded-handler model introduces scheduling latency (~1-5μs per interrupt) that could limit throughput if each packet triggered a separate interrupt. UmkaOS addresses this the same way Linux does — NAPI-style polling: the first interrupt wakes the thread, the thread then polls the device ring in a busy loop until the ring is drained, then re-enables interrupts. For 100GbE, this means the thread is woken once per batch of 64-256 packets, not once per packet. Combined with MSI-X per-queue affinity (one interrupt thread per CPU, one queue per CPU), the scheduling overhead is amortized to <100ns per packet, which is within the performance budget (Section 1.3).
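The NAPI-style discipline can be modeled single-threaded. This sketch uses a hypothetical `MockNic` that counts handler wakeups against packets processed: the first packet fires an interrupt and masks further ones; the handler then drains the ring in budget-sized batches and re-enables interrupts only when the ring is empty.

```rust
use std::collections::VecDeque;

/// Hypothetical device model for the simulation.
pub struct MockNic {
    pub ring: VecDeque<u64>, // pending packets
    pub irq_enabled: bool,
    pub wakeups: u64, // handler-thread wakeups (≈ interrupts taken)
}

impl MockNic {
    pub fn new() -> Self {
        Self { ring: VecDeque::new(), irq_enabled: true, wakeups: 0 }
    }

    /// Device side: a packet arrives. An interrupt fires only if enabled.
    /// Returns true when the handler thread must be woken.
    pub fn packet_arrives(&mut self, pkt: u64) -> bool {
        self.ring.push_back(pkt);
        if self.irq_enabled {
            self.irq_enabled = false; // top-half: ack + mask, wake thread
            self.wakeups += 1;
            return true;
        }
        false // thread already polling: no further interrupt
    }

    /// Handler thread: poll the ring in budget-sized batches until it is
    /// drained, then re-enable interrupts (the NAPI discipline above).
    pub fn poll_until_drained(&mut self, budget: usize) -> u64 {
        let mut processed = 0;
        loop {
            for _ in 0..budget {
                match self.ring.pop_front() {
                    Some(_) => processed += 1,
                    None => break,
                }
            }
            if self.ring.is_empty() {
                self.irq_enabled = true; // drained: unmask and sleep
                return processed;
            }
        }
    }
}
```

Under a sustained burst the model takes one wakeup for the whole batch, which is the amortization that keeps per-packet scheduling overhead below the stated budget.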

3.8.1 s390x Interrupt Model

The s390x architecture uses a PSW-swap interrupt model that is fundamentally different from all other supported architectures. There is no external interrupt controller (no APIC, GIC, PLIC, or equivalent). Interrupt routing is an architectural feature of the CPU itself, mediated through the lowcore (a per-CPU memory page at fixed physical addresses).

3.8.1.1 Interrupt Classes

s390x defines six interrupt classes, each with a dedicated pair of PSW save/load slots in the lowcore:

Class Old PSW Offset New PSW Offset Triggers
Restart 0x120 0x1A0 SIGP restart order from another CPU
External 0x130 0x1B0 Signals, timers, clock comparator, SIGP external call/emergency signal
SVC 0x140 0x1C0 System call instruction (svc)
Program 0x150 0x1D0 Faults, traps, illegal instructions, addressing exceptions
Machine Check 0x160 0x1E0 Hardware errors, channel failures
I/O 0x170 0x1F0 Channel I/O completion, subchannel status pending

Each PSW is 16 bytes (128-bit). Old PSWs span 0x120-0x17F; New PSWs span 0x1A0-0x1FF.

When an interrupt fires, the hardware atomically saves the current PSW to the "Old PSW" slot and loads the "New PSW" from the adjacent slot. The new PSW contains the entry point of the interrupt handler for that class. No software interaction is needed to claim the interrupt — the hardware dispatch is implicit in the PSW swap.

3.8.1.2 Stack Setup on Interrupt Entry

The PSW swap only sets the instruction pointer — it does not switch the stack pointer. Each interrupt handler's prologue must establish a valid kernel stack before any function calls. The stack setup sequence:

  1. Read the prefix register: The lowcore page is per-CPU, relocated via the CPU prefix register (SIGP SET_PREFIX). The handler reads the saved stack pointer from LC_ASYNC_STACK (offset 0x0350 in the lowcore) for async interrupts (External, I/O, Machine Check) or uses the current kernel stack for synchronous exceptions (SVC, Program). The async stack is a per-CPU 16 KB stack separate from the process kernel stack, preventing async interrupts from overflowing a shallow kernel stack.
  2. Save registers to lowcore: STMG r8,r15,0x0200 (__LC_SAVE_AREA, lowcore offset 0x0200, save_area: [u64; 8], 64 bytes) for callee-saved registers and the stack pointer needed for handler setup. The full 16-register save (STMG r0,r15 = 128 bytes) would overwrite critical lowcore fields at 0x0240+ (including stack_canary and other per-CPU data). After establishing the kernel stack (step 3), the handler saves the remaining registers (r0-r7) to the stack frame. This two-phase save matches Linux arch/s390/kernel/entry.S STMG %r8,%r15 (saving 8 registers to lowcore, then the full set to the stack).
  3. Load kernel stack pointer: LG r15, __LC_ASYNC_STACK (or stay on current stack for SVC/Program if already in kernel mode). For user→kernel transitions: load the per-CPU kernel stack from LC_KERNEL_STACK (offset 0x0348).
  4. Build a standard stack frame: 160 bytes (s390x ABI minimum). The STPT instruction saves the CPU timer for accounting.

s390x lowcore offset reference (verified against arch/s390/include/asm/lowcore.h in torvalds/linux master):

Symbol Offset Type Description
__LC_SAVE_AREA 0x0200 [u64; 8] Register save area for interrupt entry (r8-r15, 64 bytes; full set saved to stack)
LC_KERNEL_STACK 0x0348 u64 Per-CPU kernel stack pointer
LC_ASYNC_STACK 0x0350 u64 Per-CPU async (external/I/O) interrupt stack
LC_MCK_STACK 0x0368 u64 Per-CPU machine check interrupt stack

NMI (Machine Check) stack: Machine check handlers use a dedicated per-CPU stack (LC_MCK_STACK, offset 0x0368, 8 KB) because a machine check can interrupt any context, including other interrupt handlers. This three-stack model (kernel, async, machine check) prevents stack overflow even under nested interrupt scenarios.

NMI/MCE stack budget per architecture:

Architecture NMI stack size NMI source Budget constraint
x86-64 8 KB (IST entry 2) NMI pin, APIC NMI LVT, INT 2 Max call depth ~40 frames; no allocation, no sleeping, no page faults
AArch64 8 KB (dedicated SP_EL1 region) SError (async abort), FIQ (secure NMI) SError handler must be self-contained; FIQ reserved for secure firmware
ARMv7 4 KB (FIQ mode stack) FIQ (used for NMI-like signaling on vexpress) Minimal handler; logs event and returns
RISC-V 64 Shared with kernel stack No true NMI; scause MSB distinguishes Software convention: NMI-like IPIs limited to 2 KB stack usage
PPC32 4 KB (critical interrupt stack) Machine check, critical input SPR save area at fixed offsets; 1 KB for handler logic
PPC64LE 8 KB (machine check stack in PACA) Machine check (MCE), system reset PACA save area; handler must not touch SLB or HPT
s390x 8 KB (LC_MCK_STACK) Machine check Lowcore save area; 6 KB usable after register save
LoongArch64 8 KB (dedicated per-CPU) Machine error exception (Ecode=62) Handler reads CSR.MERR*, logs, returns via ERTN

3.8.1.3 I/O Interrupt Routing

I/O interrupts are generated by channel subsystem subchannels and float to any CPU that has the appropriate Interrupt Sub-Class (ISC) enabled. ISC bits are controlled via Control Register 6 (CR6): each of the 8 ISC classes (0-7) can be independently masked per-CPU. UmkaOS manages I/O interrupt affinity by masking ISC bits in CR6 on each CPU — this is the s390x equivalent of interrupt affinity routing on other platforms.

The UmkaOS IRQ domain for s390x maps subchannel interrupts to generic software IRQ numbers via ISC-to-vector translation: when an I/O interrupt is received, the handler reads the subchannel identification word (SCHID) from the lowcore I/O interruption code area, translates the (SCHID, ISC) pair to a software IRQ number, and dispatches through the standard IrqDescriptor path.

3.8.1.4 External Interrupts

External interrupts include:

  • Clock comparator: fires when the TOD clock reaches the programmed comparator value.
  • CPU timer: fires when the per-CPU timer decrements to zero.
  • SIGP external call: sent by another CPU via SIGP EXTERNAL_CALL.
  • SIGP emergency signal: high-priority inter-CPU signal via SIGP EMERGENCY_SIGNAL.
  • Service signal: from the service element (SE) or hypervisor.

External interrupt subclass codes are in the lowcore external interruption code field. UmkaOS routes these to dedicated handlers: timer interrupts to the timekeeping subsystem, SIGP signals to the IPI handler.

3.8.1.5 Inter-Processor Interrupts (IPI)

s390x uses the SIGP (Signal Processor) instruction for all inter-CPU communication:

SIGP Order Purpose
EXTERNAL_CALL General-purpose IPI (schedule, TLB flush, function call)
EMERGENCY_SIGNAL High-priority IPI (stop, NMI-equivalent)
RESTART Boot/restart a stopped CPU
SET_PREFIX Set the lowcore prefix (per-CPU page base address)
STOP Halt a CPU
SENSE Query CPU status

UmkaOS maps the generic arch::current::interrupts::send_ipi() interface to SIGP EXTERNAL_CALL for normal IPIs and SIGP EMERGENCY_SIGNAL for NMI-class events.

3.8.1.6 IrqChip Adaptation

The s390x IrqChip implementation differs from controller-based architectures:

  • ack(): no-op (the PSW swap implicitly acknowledges the interrupt).
  • mask(): disables the relevant ISC bit in CR6 for the current CPU.
  • unmask(): enables the ISC bit in CR6.
  • set_affinity(): adjusts ISC masks across the target CPU set.
  • eoi(): no-op (no end-of-interrupt concept in the PSW-swap model).

3.8.2 LoongArch64 Interrupt Model

LoongArch64 uses a two-level interrupt controller architecture: the EIOINTC (Extended I/O Interrupt Controller) for general device interrupts and the LIOINTC (Legacy I/O Interrupt Controller) for UART and other legacy devices.

3.8.2.1 EIOINTC — Extended I/O Interrupt Controller

The EIOINTC provides 256 interrupt vectors with per-CPU routing capability. Configuration is performed through IOCSR (I/O Control and Status Register) space, accessed via the IOCSRRD and IOCSRWR instructions.

Key EIOINTC capabilities:

  • 256 vectors (0-255), each independently routable to any CPU.
  • Per-vector CPU affinity: routing registers specify the target CPU for each vector. UmkaOS programs these at device probe time based on IRQ affinity policy.
  • Per-vector priority: each interrupt source has a configurable priority level.
  • Enable/disable per-vector: individual interrupt lines can be masked independently.

EIOINTC initialization sequence:

  1. Discover the EIOINTC base via ACPI MADT or device tree.
  2. Disable all 256 vectors (write zero to the enable registers).
  3. Set default routing: all vectors to the BSP (CPU 0).
  4. Configure priority levels.
  5. Enable desired interrupt lines as devices are probed.

3.8.2.2 LIOINTC — Legacy I/O Interrupt Controller

The LIOINTC handles legacy device interrupts (UART, RTC, etc.) that do not route through the EIOINTC. It provides a smaller set of interrupt lines (typically 32) with fixed or limited routing. The LIOINTC cascades into the CPU's interrupt input, and UmkaOS creates a secondary IrqDomain for LIOINTC that parents to the EIOINTC root domain.

3.8.2.3 Interrupt Enable and Masking

Global interrupt enable is controlled by the CSR.CRMD.IE bit (Control and Status Register — Current Mode):

  • IE = 1: interrupts enabled.
  • IE = 0: interrupts disabled (masked globally).

UmkaOS sets CSR.CRMD.IE = 0 on interrupt entry (automatic by hardware) and restores it on exception return. Per-line masking is handled through the EIOINTC vector enable registers.

3.8.2.4 Inter-Processor Interrupts (IPI)

LoongArch64 IPIs use the IOCSR mailbox mechanism combined with a dedicated EIOINTC IPI vector:

  1. The sender writes the IPI message (action bitmask) to the target CPU's IOCSR mailbox register via IOCSRWR.
  2. The sender triggers the IPI by asserting the designated IPI vector in the EIOINTC (or directly via the IOCSR IPI send register).
  3. The target CPU receives the interrupt on the IPI vector, reads its IOCSR mailbox to determine the requested action(s), clears the mailbox, and dispatches accordingly.

UmkaOS maps arch::current::interrupts::send_ipi() to this IOCSR mailbox + EIOINTC mechanism. The IPI action bitmask encodes: reschedule, TLB flush, function call, and stop.
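The action-bitmask encoding and the receiver's dispatch step can be sketched in isolation. The constant names and bit positions below are illustrative, not UmkaOS's actual IPI ABI:

```rust
/// Hypothetical model of the IPI action bitmask described above. The
/// target CPU reads its mailbox, decodes the set bits, and dispatches
/// one action per bit (step 3 of the receive sequence).
pub mod ipi {
    pub const RESCHEDULE: u32 = 1 << 0; // target re-runs the scheduler
    pub const TLB_FLUSH: u32  = 1 << 1; // target flushes its TLB
    pub const CALL_FUNC: u32  = 1 << 2; // target runs a queued function
    pub const STOP: u32       = 1 << 3; // target parks itself (shutdown/panic)

    /// Decode a mailbox value into the requested actions, lowest bit first.
    pub fn decode(mailbox: u32) -> Vec<u32> {
        [RESCHEDULE, TLB_FLUSH, CALL_FUNC, STOP]
            .into_iter()
            .filter(|&action| mailbox & action != 0)
            .collect()
    }
}
```

Because multiple senders can OR actions into the same mailbox before the target services the interrupt, the decode loop must handle any combination of bits in one pass.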

3.8.2.5 IrqChip Adaptation

The LoongArch64 EIOINTC IrqChip implementation:

  • ack(): clears the pending bit for the vector in the EIOINTC status register.
  • mask(): clears the enable bit for the vector via IOCSR write.
  • unmask(): sets the enable bit for the vector via IOCSR write.
  • set_affinity(): reprograms the per-vector routing register to target the new CPU set.
  • eoi(): clears the pending bit and unmasks (standard ack-then-unmask sequence).
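The mask/unmask bit manipulation can be exercised against a mock register bank. The register layout below (eight 32-bit enable registers covering vectors 0-255) is a plausible simplification, not the real IOCSR map:

```rust
/// User-space mock of the EIOINTC per-vector enable bits manipulated by
/// mask()/unmask() above. Real code would perform IOCSR writes instead of
/// touching an in-memory array.
struct MockEiointc {
    enable: [u32; 8], // 256 enable bits: vector N lives in word N/32, bit N%32
}

impl MockEiointc {
    fn new() -> Self {
        Self { enable: [0; 8] } // init step 2: all vectors disabled
    }
    fn unmask(&mut self, vec: u8) {
        self.enable[(vec / 32) as usize] |= 1u32 << (vec % 32);
    }
    fn mask(&mut self, vec: u8) {
        self.enable[(vec / 32) as usize] &= !(1u32 << (vec % 32));
    }
    fn is_enabled(&self, vec: u8) -> bool {
        self.enable[(vec / 32) as usize] & (1u32 << (vec % 32)) != 0
    }
}
```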

3.8.3 Softirq: Deferred Interrupt Processing

Softirqs are the bottom-half mechanism for work that cannot be done in hardirq context but must run with low latency before returning to process context. Every network packet, every timer tick, every block I/O completion, and every RCU callback batch involves softirq processing.

UmkaOS design decisions vs Linux:

  • 10 softirq vectors matching Linux for /proc/softirqs ABI compatibility.
  • No tasklets: tasklets are deprecated in Linux (replaced by threaded handlers and workqueues). UmkaOS skips tasklets entirely.
  • Non-preemptible by default: optimized for throughput (UmkaOS's primary server target). Matches Linux PREEMPT_NONE behavior.
  • Evolvable preemption hook: runtime-switchable threaded softirq mode for latency-sensitive workloads (per cgroup/workload class), enabling PREEMPT_RT-style behavior without kernel rebuild.

3.8.3.1 Softirq Vector Table

/// Softirq vector indices. Matches Linux `include/linux/interrupt.h` for
/// `/proc/softirqs` output compatibility.
#[repr(u32)]
pub enum SoftirqVec {
    HiPriority   = 0,  // HI_SOFTIRQ: high-priority tasklet replacement (timer-critical)
    Timer        = 1,  // TIMER_SOFTIRQ: timer wheel expiry processing
    NetTx        = 2,  // NET_TX_SOFTIRQ: network transmit completion
    NetRx        = 3,  // NET_RX_SOFTIRQ: network receive (NAPI poll)
    Block        = 4,  // BLOCK_SOFTIRQ: block I/O completion
    IrqPoll      = 5,  // IRQ_POLL_SOFTIRQ: IRQ polling (blk-iopoll)
    Tasklet      = 6,  // TASKLET_SOFTIRQ: no-op handler in UmkaOS (tasklets are
                       // deprecated; this slot is present solely for /proc/softirqs
                       // positional ABI compatibility with Linux, which exposes
                       // per-vector counters by position, not by name).
    Sched        = 7,  // SCHED_SOFTIRQ: scheduler load balancing
    HrTimer      = 8,  // HRTIMER_SOFTIRQ: high-resolution timer expiry
    Rcu          = 9,  // RCU_SOFTIRQ: RCU callback processing
}

/// Total number of softirq vectors.
pub const NR_SOFTIRQS: usize = 10;

/// Per-softirq handler function type. Called with preemption disabled, IRQs
/// enabled (within the handler — softirq execution re-enables IRQs after
/// the initial hardirq context exit). The handler must not sleep.
pub type SoftirqHandler = fn();

/// Per-CPU softirq pending bitmask. Bit N is set when softirq vector N
/// has been raised and not yet processed. `AtomicU32` stored in
/// `CpuLocalBlock` ([Section 3.2](#cpulocal-register-based-per-cpu-fast-path)).
///
/// Atomicity is required because a hardirq can preempt `do_softirq()`
/// between snapshot and clear. `raise_softirq()` uses `fetch_or(bit, Relaxed)`;
/// `do_softirq()` uses `swap(0, Relaxed)` for atomic snapshot-and-clear.
/// See the `CpuLocalBlock::softirq_pending` doc comment for the full rationale.

3.8.3.2 Raising a Softirq

/// Mark a softirq vector as pending on the current CPU.
/// May be called from any context (hardirq, softirq, process).
/// The softirq will be processed at the next `irq_exit()` or
/// `local_bh_enable()` call on this CPU.
///
/// # Implementation
/// Atomically sets bit `vec` in the per-CPU `softirq_pending` bitmask
/// using `fetch_or(bit, Relaxed)`. No IPI is sent — softirqs are
/// CPU-local by design. `Relaxed` ordering suffices because the only
/// consumer is `do_softirq()` on the same CPU, and the pending check
/// at `irq_exit()` is ordered by the interrupt return sequence.
pub fn raise_softirq(vec: SoftirqVec) {
    // Use get() (shared &CpuLocalBlock), NOT get_mut(). The fetch_or on
    // AtomicU32 requires only &self. Using get_mut() would create &mut
    // CpuLocalBlock, violating Rust aliasing rules: NMI/IPI handlers may
    // concurrently access other atomic fields (in_nmi, need_resched) through
    // &CpuLocalBlock. See CpuLocal::get_mut() safety contract.
    let _guard = local_irq_save();
    let block = CpuLocal::get();
    block.softirq_pending.fetch_or(1 << (vec as u32), Ordering::Relaxed);
}

/// Raise a softirq from hardirq context. Same as `raise_softirq()` but
/// inlined for use in IRQ handlers (avoids function-call overhead).
#[inline(always)]
pub fn raise_softirq_irqoff(vec: SoftirqVec) {
    // IRQs already disabled in hardirq context; no save/restore needed.
    // Use get() — see raise_softirq comment on why get_mut() is unsound here.
    CpuLocal::get().softirq_pending.fetch_or(1 << (vec as u32), Ordering::Relaxed);
}

3.8.3.3 Softirq Processing Algorithm

do_softirq() — called from irq_exit() and local_bh_enable():

  Precondition: softirq_pending is non-zero AND NOT in hardirq context
                (irq_count == 0) AND NOT in softirq context (softirq_count == 0).

  0. Guard check (defense-in-depth): if irq_count > 0 || softirq_count > 0,
     return immediately. Both callers (irq_exit, local_bh_enable) already
     enforce this precondition, but the explicit check protects against future
     callers that might omit it. Cost: one compare on a cold path (softirq
     entry is infrequent relative to the work done inside).

  1. Increment softirq_count (mark "in softirq context") to prevent re-entry.
  2. Enable IRQs (softirq handlers run with IRQs enabled to reduce
     interrupt latency — a key difference from hardirq context).
  3. iterations = 0.
  4. Loop:
     a. pending = softirq_pending.swap(0, Relaxed).
        // Atomic snapshot-and-clear. The swap atomically reads the current
        // value and replaces it with 0 in a single operation, preventing
        // the TOCTOU race where a hardirq sets a bit between the read and
        // the clear. Relaxed ordering suffices: the only producer is the
        // local CPU's hardirq handler, and the interrupt return sequence
        // provides the necessary ordering between the hardirq write and
        // the softirq read.
     b. For each set bit N in pending (lowest to highest):
        - Call softirq_handlers[N]().
     c. iterations += 1.
     d. If softirq_pending.load(Relaxed) != 0 AND iterations < MAX_SOFTIRQ_RESTART (10):
        - Handlers may have raised new softirqs; loop back to (a).
     e. If softirq_pending.load(Relaxed) != 0 AND iterations >= MAX_SOFTIRQ_RESTART:
        - Residual softirqs remain. Wake ksoftirqd (step 5).
        - Break out of loop.
  5. Disable IRQs.
  6. Decrement softirq_count (leave softirq context).
  7. If residual softirqs remain: wake per-CPU ksoftirqd kthread.

MAX_SOFTIRQ_RESTART = 10 (matches Linux).
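The core of the loop (steps 4a-4e) can be modeled in user space. This sketch keeps only the snapshot-and-clear, dispatch, and iteration-budget logic; context tracking and IRQ enable/disable are omitted, and the names are illustrative rather than the kernel's actual API:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const MAX_SOFTIRQ_RESTART: usize = 10;

fn noop_handler(_: &AtomicU32) {}

/// A handler that re-raises its own vector, modeling a softirq storm.
fn reraise_handler(pending: &AtomicU32) {
    pending.fetch_or(1, Ordering::Relaxed);
}

/// Returns (handler invocations, residual work remains). A true second
/// component corresponds to step 7: the per-CPU ksoftirqd would be woken.
fn drain_softirqs(pending: &AtomicU32, handlers: [fn(&AtomicU32); 10]) -> (usize, bool) {
    let mut calls = 0;
    for _ in 0..MAX_SOFTIRQ_RESTART {
        // Step 4a: atomic snapshot-and-clear closes the TOCTOU window
        // against a concurrent raise_softirq() from hardirq context.
        let snapshot = pending.swap(0, Ordering::Relaxed);
        if snapshot == 0 {
            return (calls, false); // fully drained
        }
        // Step 4b: dispatch set bits, lowest to highest.
        for vec in 0..10 {
            if snapshot & (1u32 << vec) != 0 {
                handlers[vec](pending);
                calls += 1;
            }
        }
    }
    // Step 4e: iteration budget exhausted; residual bits go to ksoftirqd.
    (calls, pending.load(Ordering::Relaxed) != 0)
}
```

With quiet handlers the loop terminates in one extra pass; with a self-re-raising handler it hits the 10-iteration cap and reports residual work, which is exactly the condition that hands off to ksoftirqd.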

3.8.3.4 ksoftirqd Fallback

Each CPU has a dedicated ksoftirqd/N kthread (SCHED_OTHER, nice 0). It processes softirqs that could not be fully drained in the hardirq-exit path (due to the 10-iteration limit). Without ksoftirqd, a sustained softirq storm (e.g., 100GbE line-rate packet flood) would prevent the CPU from ever reaching process context.

ksoftirqd/N thread:
  Loop:
    1. If softirq_pending.load(Relaxed) == 0: schedule() (sleep until woken by do_softirq).
    2. local_bh_disable().
    3. While softirq_pending.load(Relaxed) != 0:
       a. do_softirq() — same algorithm as above.
       b. If need_resched is set: cond_resched() (yield to higher-priority tasks).
    4. local_bh_enable().

3.8.3.5 Context Semantics

| Property | Hardirq | Softirq | Process |
|---|---|---|---|
| Preemption | Disabled | Disabled (default) | Enabled |
| IRQs | Disabled | Enabled (within handler) | Enabled |
| May sleep | No | No | Yes |
| May allocate | No (except emergency) | GFP_ATOMIC only | GFP_KERNEL |
| Nested interrupts | Higher-priority only | Hardirqs only | All |
| SpinLock behavior | Already has IRQs off | local_bh_disable (via softirq_count) | Saves/restores IRQs |

3.8.3.6 Interaction with Locking Primitives

  • SpinLock: SpinLock::lock() calls local_irq_save() and increments preempt_count (NOT irq_count -- see Section 3.5 for the authoritative table). Releasing calls local_irq_restore() and decrements preempt_count. Softirq execution is indirectly prevented while a SpinLock is held because hardware IRQs are masked -- softirq processing occurs at irq_exit() which cannot execute while IRQs are disabled. This is an indirect effect of IRQ masking, not an irq_count mechanism.
  • local_bh_disable() / local_bh_enable(): Increments/decrements the dedicated CpuLocalBlock::softirq_count field. When softirq_count > 0, softirqs are not processed at irq_exit(). local_bh_enable() checks softirq_pending.load(Relaxed) and calls do_softirq() if any softirqs are pending and softirq_count reaches 0. UmkaOS uses separate typed fields (irq_count for hardirq, softirq_count for BH) instead of Linux's packed bitfield layout — see Section 3.2.
  • RCU: RCU_SOFTIRQ (vector 9) processes completed grace period callbacks. Softirq context counts as a quiescent state for RCU purposes (softirqs run between RCU read-side critical sections).
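The local_bh_disable()/local_bh_enable() nesting rule can be reduced to a small single-CPU model: only the outermost enable, with work pending, runs the deferred processing. Names and the `drained` bookkeeping field are illustrative, not the kernel's actual API:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Single-CPU model of BH nesting. `softirq_count` plays the role of the
/// dedicated CpuLocalBlock field; `drained` records which vectors the
/// modeled do_softirq() processed.
struct BhState {
    softirq_count: u32,
    softirq_pending: AtomicU32,
    drained: u32,
}

impl BhState {
    fn new() -> Self {
        Self { softirq_count: 0, softirq_pending: AtomicU32::new(0), drained: 0 }
    }
    fn local_bh_disable(&mut self) {
        self.softirq_count += 1; // enter BH-disabled section (nestable)
    }
    fn raise(&self, vec: u32) {
        self.softirq_pending.fetch_or(1 << vec, Ordering::Relaxed);
    }
    fn local_bh_enable(&mut self) {
        assert!(self.softirq_count > 0, "unbalanced local_bh_enable");
        self.softirq_count -= 1;
        // Only the outermost enable, when softirq_count reaches 0, drains
        // pending softirqs (the modeled do_softirq() call).
        if self.softirq_count == 0 {
            self.drained |= self.softirq_pending.swap(0, Ordering::Relaxed);
        }
    }
}
```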

3.8.3.7 Evolvable Preemption Policy Hook

The default non-preemptible softirq model is optimal for throughput-oriented server workloads. For latency-sensitive workloads (real-time audio, trading), softirqs can be converted to threaded execution via an Evolvable policy hook:

/// Evolvable softirq scheduling policy.
///
/// **Nucleus** (data): `SoftirqVec` enum, `softirq_pending: AtomicU32` bitmask, handler table.
/// **Evolvable** (policy): this trait controls whether softirqs run inline
/// (non-preemptible, default) or as dedicated kthreads (preemptible).
///
/// Runtime-switchable per cgroup via the `cpu.softirq_mode` cgroup knob:
/// - `inline` (default): softirqs run in `do_softirq()` with preemption disabled.
/// - `threaded`: each softirq vector is serviced by a dedicated SCHED_FIFO kthread,
///   allowing preemption between softirq handlers.
pub trait SoftirqPolicy: Send + Sync {
    /// Returns true if the given softirq vector should run as a dedicated
    /// kthread rather than inline in do_softirq(). Checked per-invocation.
    fn is_threaded(&self, vec: SoftirqVec) -> bool;
}

The SoftirqPolicy hook is consulted at the start of do_softirq(). When is_threaded() returns true for a vector, the corresponding kthread is woken instead of running the handler inline. This is the UmkaOS equivalent of Linux's PREEMPT_RT forced-threading, but runtime-switchable rather than compile-time.
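A concrete policy implementing the trait above might thread only the latency-sensitive receive path. The policy structs below are hypothetical examples, and `SoftirqVec` is reproduced in abridged form so the sketch stands alone:

```rust
/// Abridged copy of the softirq vector enum from Section 3.8.3.1.
#[derive(Clone, Copy, PartialEq)]
pub enum SoftirqVec {
    Timer = 1,
    NetRx = 3,
    Rcu = 9,
}

pub trait SoftirqPolicy: Send + Sync {
    fn is_threaded(&self, vec: SoftirqVec) -> bool;
}

/// Default throughput-oriented policy: every vector runs inline.
pub struct InlinePolicy;
impl SoftirqPolicy for InlinePolicy {
    fn is_threaded(&self, _vec: SoftirqVec) -> bool {
        false
    }
}

/// Hypothetical latency policy: NET_RX is serviced by its dedicated
/// kthread; everything else stays inline in do_softirq().
pub struct ThreadedNetRx;
impl SoftirqPolicy for ThreadedNetRx {
    fn is_threaded(&self, vec: SoftirqVec) -> bool {
        vec == SoftirqVec::NetRx
    }
}
```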

3.9 Memory Model Differences Across Architectures

Critical Implementation Warning: UmkaOS relies extensively on lock-free concurrency, asynchronous ring buffers, and RCU to meet its strict performance budget. All lock-free algorithms must be correct across every target architecture — not just on the architecture where they were developed and tested.

3.9.1 x86 TSO Conceals Ordering Bugs

x86_64 implements Total Store Ordering (TSO). Under TSO, loads are not reordered with other loads, stores are not reordered with other stores, and stores are observed in program order by all CPUs. This is a significantly stronger ordering guarantee than software actually requires for most algorithms.

The consequence for development is that lock-free code with missing memory barriers, or code that uses Ordering::Relaxed where Ordering::Release is required, will almost always execute correctly on x86 and pass all tests there. The hardware silently supplies the ordering that the programmer omitted. Bugs of this class are structurally invisible on x86 — no amount of stress-testing on x86 hardware will expose them, because the CPU never exercises the reorderings that would trigger the race.

x86 TSO pays a hidden hardware cost. The strong ordering guarantee is not free: Intel and AMD CPUs implement store buffers and memory ordering machinery that impose a hardware tax on every store, whether or not the software needs ordered visibility. Code using Ordering::Release on x86 compiles to a plain store (the hardware already provides release semantics), but the CPU still pays the store-ordering overhead internally. Software running on x86 is paying for ordering it often does not need.

The practical conclusion: x86 is an unreliable test platform for lock-free code. A lock-free algorithm that passes all tests on x86 has not been validated — it has been tested on the platform most likely to hide its bugs. Correctness on x86 is necessary but not sufficient.

3.9.2 ARM, RISC-V, and PowerPC: Explicit Memory Ordering Surfaces True Requirements

AArch64, ARMv7, RISC-V, and PowerPC implement relaxed memory models. The CPU is permitted to reorder independent memory reads and writes for performance — a store to address A followed by a load from address B can complete in either order if A and B are in different cache lines and there is no explicit ordering constraint between them. Stores from one CPU become visible to other CPUs at different times, and different CPUs may observe stores in different orders unless barriers enforce a consistent sequence.

This is not a deficiency in these architectures — it is the architecturally correct exposure of what ordering actually costs. These architectures give software precise control: pay for ordering when the algorithm requires it, pay nothing when it does not. A missing memory barrier on AArch64 or RISC-V produces visible, reproducible failures: sequence locks read torn data, ring buffer consumers observe the tail pointer advance before payload data is visible, and RCU readers dereference pointers to uninitialized memory. The bug that x86 conceals, ARM and RISC-V surface.

Develop and test lock-free algorithms on ARM or RISC-V. Only a platform with relaxed memory ordering can expose ordering bugs. x86 is unreliable as a correctness gate for lock-free primitives.

Implementation Mandates: To ensure lock-free algorithms are correct across all target architectures, the following rules apply:

  1. Explicit Rust Atomics: All shared memory synchronization must use Rust's core::sync::atomic types with mathematically correct memory orderings (Acquire, Release, AcqRel, SeqCst). Never rely on implicit hardware ordering; write code that is correct on the weakest memory model UmkaOS targets.
  2. Release-Acquire Semantics: The standard pattern for lock-free publishing in UmkaOS (e.g., advancing a ring buffer head, or updating an RCU pointer) MUST pair an Ordering::Release store on the producer with an Ordering::Acquire load on the consumer.
  3. No Relaxed in Control Flow: Ordering::Relaxed may only be used for pure statistical counters (e.g., rx_packets). It must never be used to synchronize visibility of other data.
  4. Mandatory Multi-Arch CI: Lock-free primitives (MPSC queues, RCU, seqlocks) must be subjected to heavy stress-testing natively on AArch64 and RISC-V hardware (or QEMU/emulators with memory-reordering fuzzing enabled). x86_64 test passes are considered insufficient to prove the correctness of lock-free code.
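Mandate (2) in its smallest form is a two-thread publish/consume handshake. This sketch uses std atomics and threads (the kernel equivalent uses core atomics); the function name is illustrative:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// The producer writes the payload, then publishes with a Release store.
/// The consumer's Acquire load of the flag establishes happens-before, so
/// once `ready` is observed true the payload write must be visible. On
/// x86 TSO this ordering comes for free; on AArch64/RISC-V the
/// Release/Acquire pair is what emits the barrier instructions.
fn publish_consume() -> u64 {
    let payload = Arc::new(AtomicU64::new(0));
    let ready = Arc::new(AtomicBool::new(false));
    let (p, r) = (payload.clone(), ready.clone());

    let producer = thread::spawn(move || {
        p.store(42, Ordering::Relaxed);   // 1. write payload
        r.store(true, Ordering::Release); // 2. publish (Release)
    });

    // 3. consume: spin until the Acquire load observes the flag.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let seen = payload.load(Ordering::Relaxed);
    producer.join().unwrap();
    seen
}
```

Downgrading the Release or the Acquire to Relaxed here would still pass on x86 while racing on weakly-ordered targets, which is precisely why mandate (4) requires multi-arch stress testing.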

3.9.3 Ordering Instruction Cost: Ring Buffer and RCU Performance by Architecture

The implementation mandate to use Ordering::Release/Ordering::Acquire on ring buffer head/tail pointer updates and RCU pointer publications has measurably different instruction- level costs across architectures. Understanding this is essential for interpreting benchmark results and for choosing the minimum correct ordering at each synchronization point.

Ordering::Release store — instruction cost by architecture:

| Architecture | Compiled instruction | Approximate cost |
|---|---|---|
| x86-64 | Plain MOV (no additional instruction) | ~1 cycle (TSO provides release for free) |
| AArch64 | STLR (store-release) or STR + DMB ISHST | ~5-20 cycles (barrier flushes store buffer) |
| ARMv7 | STR + DMB ISHST | ~10-30 cycles |
| RISC-V | FENCE RW,W before store (or amoswap with .rl) | ~10-30 cycles |
| PPC64 | lwsync before store | ~10-20 cycles |
| PPC32 (e500) | sync before store | ~20-40 cycles (e500v1/v2 cores do NOT support lwsync — it causes an Illegal Instruction trap; sync/msync must be used instead) |
| s390x | Plain ST/STG (no additional instruction) | ~0 cycles (strongly ordered — at least TSO, sequentially consistent for single-copy atomics; release is free) |
| LoongArch64 | DBAR 0x12 (store-release ordering) + ST.D | ~15-25 cycles |

Ordering::Acquire load — instruction cost by architecture:

| Architecture | Compiled instruction | Approximate cost |
|---|---|---|
| x86-64 | Plain MOV (no additional instruction) | ~1 cycle (TSO prevents load reordering) |
| AArch64 | LDAR (load-acquire) | ~5-20 cycles |
| ARMv7 | LDR + DMB ISH | ~10-30 cycles |
| RISC-V | FENCE R,RW after load (or lr with .aq) | ~10-30 cycles |
| PPC64 | lwsync after load or isync on branch path | ~10-20 cycles |
| PPC32 (e500) | sync after load or isync on branch path | ~20-40 cycles (e500 lacks lwsync; see Release note above) |
| s390x | Plain L/LG (no additional instruction) | ~0 cycles (strongly ordered — at least TSO, sequentially consistent for single-copy atomics; acquire is free) |
| LoongArch64 | LD.D + DBAR 0x14 (load-acquire ordering) | ~15-25 cycles |

Implications for UmkaOS ring buffer throughput:

Ring buffer head/tail pointer updates require a Release store on the producer side and an Acquire load on the consumer side — this is the minimum correct ordering and cannot be reduced without introducing a race. On x86, these compile to ordinary MOV instructions; the TSO hardware silently provides the ordering. On ARM and RISC-V, each ring buffer publish or consume event pays a real instruction-level barrier cost.

The consequence is that ring buffer throughput on AArch64 and RISC-V will be measurably lower than on x86 for identical Rust source code — not because of a bug or a missing optimization, but because x86 is paying its ordering cost in hardware (through store-buffer logic and pipeline constraints that are invisible to software), while ARM and RISC-V pay it explicitly in the instruction stream. Both are paying; only the accounting differs.

Ordering::Relaxed usage in UmkaOS ring buffers:

Within the ring buffer implementation, Ordering::Relaxed is used selectively where only atomic visibility (not ordering relative to other accesses) is required:

  • Statistical counters (rx_packets, tx_bytes, drop counters): Relaxed on both store and load. These are sampled for reporting only; no other memory access is conditioned on their value.
  • Per-CPU freelist size counters: Relaxed. The size is advisory; the actual allocation uses a separate acquire-load on the freelist head pointer.
  • Dead-reckoning checks (e.g., "is the ring approximately empty?"): Relaxed. If the approximate check is stale by one entry, the caller falls back to a serializing path.

Ordering::Relaxed is never used for the head/tail pointers that control whether a slot is safe to read or write. Misuse of Relaxed on these pointers produces silent data corruption on ARM and RISC-V; it works by accident on x86. The rule from mandate (3) is absolute: Relaxed may not appear in any code path that controls visibility of payload data.

3.9.4 Endianness

PPC32 and s390x are big-endian; all other supported architectures are little-endian. Wire-format structs use Le types (Section 6.1). Kernel-internal structs use native endianness. Conversion happens at RDMA/wire boundaries only.
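A minimal sketch of a wire-format little-endian field, in the spirit of the Le types referenced above (the real definitions live in Section 6.1; this particular layout is an assumption): the in-memory representation is always little-endian bytes, so the struct can be copied to the wire unchanged on any host, and conversion happens only at the boundary.

```rust
/// Hypothetical Le32 sketch: stores little-endian bytes regardless of host
/// endianness. On big-endian PPC32/s390x, new()/get() perform the byte
/// swap; on little-endian hosts they compile to plain moves.
#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct Le32([u8; 4]);

impl Le32 {
    pub fn new(native: u32) -> Self {
        Le32(native.to_le_bytes()) // convert once, at the wire boundary
    }
    pub fn get(self) -> u32 {
        u32::from_le_bytes(self.0) // back to native endianness
    }
    pub fn bytes(self) -> [u8; 4] {
        self.0 // raw wire representation
    }
}
```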

3.10 Algorithm Dispatch and In-Kernel SIMD

Two closely related mechanisms let UmkaOS select the best available algorithm implementation per platform and use hardware SIMD safely from kernel code.

3.10.1 AlgoDispatch: Zero-Cost Runtime Dispatch

AlgoDispatch<F> holds a single function pointer chosen once at boot from a priority-ordered list of candidates. After boot, dispatching is a plain indirect call — no atomic, no lock, no branch on a feature flag.

Design rationale. The naïve alternative is checking a feature flag at every call site:

// Bad: branch in hot path, missed inlining, repeated flag read
if this_cpu_has_crypto(CryptoCaps::SHA2_256) { sha256_sha_ni(data, out) }
else { sha256_generic(data, out) }

AlgoDispatch moves the branch to boot time and stores the result as an immutable function pointer. The call site becomes:

SHA256.get()(data, out)   // single indirect call, predictable branch target

Implementation:

/// A boot-initialised, immutable-after-init function pointer.
///
/// `F` is a function-pointer type, e.g. `fn(&[u8], &mut [u8; 32])`.
///
/// # Thread Safety
/// `AlgoDispatch<F>` is `Sync` when `F: Copy + Send`. After `init()`, the
/// contained function pointer is never modified, so concurrent reads from
/// multiple CPUs are safe without synchronisation.
///
/// # Heterogeneous CPU Support
/// Candidate selection uses `all_cpus_have_*()` (§2.1.2.18.3), ensuring the
/// chosen implementation is valid on every CPU in the system. A kthread using
/// the dispatched function can migrate freely without correctness issues.
/// On RISC-V systems where harts have differing ISA extensions, the scheduler
/// already constrains task placement via `isa_required` affinity (§7.1.5.9);
/// `AlgoDispatch` selects based on the universal intersection, which is the
/// correct choice for kthreads that may run on any hart.
pub struct AlgoDispatch<F: Copy + Send + 'static> {
    func: UnsafeCell<MaybeUninit<F>>,
    init_done: AtomicBool,
}

/// Safety: after init() sets init_done, func is immutable; read-only access
/// from multiple threads is safe.
unsafe impl<F: Copy + Send + 'static> Sync for AlgoDispatch<F> {}

impl<F: Copy + Send + 'static> AlgoDispatch<F> {
    /// Construct an uninitialised dispatch slot (for use in `static` context).
    pub const fn uninit() -> Self {
        Self {
            func: UnsafeCell::new(MaybeUninit::uninit()),
            init_done: AtomicBool::new(false),
        }
    }

    /// Boot-time initialiser. Must be called exactly once, during boot phase 9
    /// (after `cpu_features_freeze()`, before any concurrent kernel code runs).
    /// Panics if called twice or after concurrent access begins.
    ///
    /// Selects the first candidate in `candidates` whose requirements are
    /// satisfied by the universal CPU feature intersection. The last entry MUST
    /// have empty requirements (the generic fallback); the function panics if
    /// no candidate matches.
    pub fn init(&self, candidates: &[AlgoCandidate<F>]) {
        assert!(
            !self.init_done.load(Ordering::Relaxed),
            "AlgoDispatch::init called twice"
        );
        for candidate in candidates {
            if all_cpus_have_crypto(candidate.crypto_required)
                && all_cpus_have_atomics(candidate.atomics_required)
                && min_simd_width_bytes() >= candidate.min_simd_bytes
            {
                // SAFETY: init_done is false, so no concurrent reads yet.
                unsafe { (*self.func.get()).write(candidate.func) };
                self.init_done.store(true, Ordering::Release);
                log::info!(
                    "algo_dispatch: selected '{}' for '{}'",
                    candidate.name,
                    candidate.algo_name,
                );
                return;
            }
        }
        panic!("AlgoDispatch: no candidate matched (missing generic fallback?)");
    }

    /// Call the dispatched implementation. Panics in debug if `init()` was
    /// not called; in release, calling before init is undefined behaviour
    /// (the AlgoDispatch initialisation sequence in boot phase 9 prevents this).
    #[inline(always)]
    pub fn get(&self) -> F {
        // Acquire pairs with the Release store in init(), ensuring the func
        // write is visible on weakly-ordered architectures (ARM, RISC-V).
        // This Acquire load is NOT a debug check — it is a correctness
        // requirement for memory ordering. On x86-64 (TSO), Acquire on
        // AtomicBool compiles to a plain MOV (zero extra cost). On AArch64
        // it compiles to LDAR (~1 extra cycle vs plain load). The compiler
        // cannot hoist or elide this load because AtomicBool::load is opaque.
        //
        // **Design decision**: The ~1 cycle Acquire load per call is accepted.
        // AlgoDispatch is used for crypto algorithm selection and SIMD dispatch —
        // warm-path operations (per-packet or per-block-I/O, not per-instruction).
        // At the highest anticipated dispatch frequency (~1M calls/sec for crypto),
        // the LDAR adds ~1ms/sec — negligible. The Acquire is retained because it
        // provides the happens-before with init() on weakly-ordered architectures
        // and prevents the compiler from caching the init_done check across calls.
        //
        // **Future optimization (Phase 3+)**: A static-key / patched-NOP approach
        // could eliminate the Acquire load entirely by patching the init_done check
        // to a NOP at boot time. This requires self-modifying code with I-cache
        // coherency across all 8 architectures — deferred in favor of the simpler,
        // correct, and quantifiably cheap Acquire approach.
        let ready = self.init_done.load(Ordering::Acquire);
        debug_assert!(ready, "AlgoDispatch used before init()");
        // SAFETY: init_done is true (Acquire) → func is fully written and immutable.
        unsafe { (*self.func.get()).assume_init() }
    }
}

/// One candidate in an `AlgoDispatch` selection list.
pub struct AlgoCandidate<F: Copy> {
    /// Algorithm name for boot log (e.g. "sha256").
    pub algo_name: &'static str,
    /// Implementation variant name (e.g. "sha_ni", "sha2_ce", "zknh", "generic").
    pub name: &'static str,
    /// All listed cryptographic capabilities must be in the universal intersection.
    pub crypto_required: CryptoCaps,
    /// All listed atomic capabilities must be universal.
    pub atomics_required: AtomicCaps,
    /// Minimum SIMD register width required in bytes. 0 = scalar, no SIMD needed.
    pub min_simd_bytes: u16,
    /// The function pointer for this implementation.
    pub func: F,
}

Declaration macro for ergonomic global registration:

/// Declare a module-level `AlgoDispatch` and its candidate list.
/// Candidates are listed highest-priority first; the last MUST have no
/// requirements (generic fallback).
///
/// The macro expands to a `static` `AlgoDispatch` and a `fn init_<name>()`
/// that is registered with the boot-phase-9 init table via `#[algo_init]`.
///
/// Example:
/// ```rust
/// algo_dispatch! {
///     pub static SHA256: AlgoDispatch<fn(&[u8], &mut [u8; 32])> = {
///         algo: "sha256",
///         candidates: [
///             // x86-64 SHA-NI / RISC-V Zknh / AArch64 SHA2-CE
///             { "sha_ni_or_ce", crypto: SHA2_256, simd: 0,  sha256_hwaccel },
///             // AVX2 4-way parallel schedule (x86-64 without SHA-NI)
///             { "avx2_4way",   crypto: {},        simd: 32, sha256_avx2_4way },
///             // NEON 4-way parallel schedule (AArch64 without SHA2-CE)
///             { "neon_4way",   crypto: {},        simd: 16, sha256_neon_4way },
///             // Portable scalar — always eligible, always last
///             { "generic",     crypto: {},        simd: 0,  sha256_generic },
///         ],
///     };
/// }
/// ```
macro_rules! algo_dispatch { ... }

All algo_dispatch! statics in the kernel are initialised by a single call to algo_dispatch_init_all() at boot phase 9. This function iterates the linker section __algo_dispatch_inits (populated by #[algo_init] attribute macros on generated init functions) and calls each init function in source-order within each crate, crates in link order. No heap allocation occurs. Total init time is O(candidates × algo_count), bounded in practice to < 1 ms.
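The selection rule at the heart of init() — take the first candidate whose requirements are a subset of the universal CPU-feature intersection — can be exercised standalone. The feature bits, candidate names, and checksum function below are illustrative stand-ins for the real CryptoCaps/AtomicCaps machinery:

```rust
/// Simplified model of one AlgoDispatch candidate: a required feature
/// bitmask plus the implementation function.
#[derive(Clone, Copy)]
struct Candidate {
    name: &'static str,
    required: u32, // bitmask of required CPU features (illustrative)
    func: fn(&[u8]) -> u32,
}

/// Pick the first candidate all of whose required features are present in
/// the universal intersection; panic if no generic fallback matched,
/// mirroring AlgoDispatch::init().
fn select<'a>(candidates: &'a [Candidate], universal: u32) -> &'a Candidate {
    candidates
        .iter()
        .find(|c| c.required & !universal == 0) // all requirements satisfied
        .expect("no candidate matched (missing generic fallback?)")
}

/// Placeholder "generic" implementation used by every candidate here.
fn checksum_generic(data: &[u8]) -> u32 {
    data.iter().map(|&b| b as u32).sum()
}

const SHA_NI: u32 = 1 << 0; // hypothetical feature bits
const AVX2: u32 = 1 << 1;
```

Because candidates are ordered highest-priority first and the fallback has empty requirements, the same list yields the accelerated variant on capable hardware and the scalar path everywhere else.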

3.10.2 SimdKernelGuard: Safe In-Kernel SIMD Use

Task FPU context (§2.1.2.17) is managed lazily: user tasks pay no cost unless they use floating-point. That mechanism handles user FPU context — the task's architectural floating-point state that must survive context switches.

A separate problem is deliberate kernel SIMD use: a kthread or kernel function that intentionally issues SIMD instructions for bulk operations (AES encryption, hash computation, SIMD memcpy, compression). Three invariants must hold:

  1. Non-preemptible while SIMD is active. SIMD register state is per-CPU and not saved on preemption unless the task-FPU mechanism is engaged. Preemption mid-SIMD would corrupt the registers if the task is migrated to another CPU. Remedy: disable preemption for the SIMD region's duration.

  2. Forbidden in interrupt context. An interrupt handler issuing SIMD instructions would corrupt the interrupted task's (or kthread's) SIMD state, which may be live but not yet saved (lazy save deferred). Kernel SIMD is unconditionally forbidden from IRQ handlers, NMI handlers, and softirq handlers.

  3. SIMD unit must be enabled for kernel mode. Most architectures disable the FPU/SIMD unit in kernel mode by default (the mechanism that causes the #NM / EL0 FP-trap that the lazy-FPU handler catches). Kernel SIMD requires explicitly re-enabling it for the duration of the operation.

SimdKernelGuard enforces all three invariants via RAII:

/// RAII guard for deliberate kernel-initiated SIMD/FPU use.
///
/// Acquire this guard before issuing any SIMD instruction in kernel code
/// (crypto, compression, SIMD memcpy, etc.). The guard is not required for
/// scalar floating-point in kthreads that legitimately own FPU context (e.g.
/// `PdControllerState` in the IntentOptimizer kthread, §7.3.5).
///
/// # Invariants Enforced
/// - Cannot be acquired from interrupt context (asserted at acquisition).
/// - Preemption is disabled for the guard's lifetime.
/// - The architecture's SIMD/FPU unit is enabled for kernel mode on acquisition
///   and disabled on drop.
/// - If the current task had live (unsaved) user FPU state, it is saved to the
///   task's FPU save area before kernel SIMD is enabled, so that it can be
///   restored correctly on the next return-to-user or context switch.
///
/// # Nestable (reference-counted)
/// Acquiring a `SimdKernelGuard` while one is already held on this CPU is safe:
/// the inner guard increments `simd_kernel_depth` but skips SIMD enable (already
/// active). On drop, the inner guard decrements depth but skips SIMD disable.
/// Only the outermost guard (depth transitions 0→1 on acquire and 1→0 on drop)
/// actually enables/disables the SIMD unit. Debug builds assert nesting depth
/// < 16 to catch unbounded recursion.
///
/// # Architecture-Specific Enable/Disable
///
/// | Arch    | Enable in kernel mode              | Disable                          |
/// |---------|------------------------------------|----------------------------------|
/// | x86-64  | `clts` (clear CR0.TS)              | `mov cr0, cr0 \| CR0_TS`         |
/// | AArch64 | `CPACR_EL1.FPEN ← 0b11`           | `CPACR_EL1.FPEN ← 0b00`         |
/// |         | `ZCR_EL1.LEN ← max` (if SVE)      | (context switch restores ZCR)    |
/// | ARMv7   | `FPEXC.EN ← 1` (MCR p10,7,FPEXC) | `FPEXC.EN ← 0`                  |
/// | RISC-V  | `sstatus.FS ← 01` (Initial)        | `sstatus.FS ← 00` (Off)         |
/// |         | `sstatus.VS ← 01` if RVV present   | `sstatus.VS ← 00`               |
/// | PPC32   | `MSR.VEC ← 1` (mtmsr)             | `MSR.VEC ← 0`                   |
/// | PPC64LE | `MSR.VEC ← 1, MSR.VSX ← 1`        | `MSR.VEC ← 0, MSR.VSX ← 0`     |
/// | s390x   | `STCTG`/`LCTLG` CR0: set AFP bit   | `LCTLG` CR0: clear AFP bit       |
/// |         | (Additional Floating-Point). Enables| Disables vector register access. |
/// |         | VX (vector extension) register      | `arch_state` saves previous CR0. |
/// |         | access for SIMD instructions.       |                                  |
/// | LoongArch64 | `CSR.EUEN.FPE ← 1` (FP enable) | `CSR.EUEN.FPE ← 0`             |
/// |         | `CSR.EUEN.SXE ← 1` (128-bit LSX)   | `CSR.EUEN.SXE ← 0`             |
/// |         | `CSR.EUEN.ASXE ← 1` (256-bit LASX) | `CSR.EUEN.ASXE ← 0`            |
///
/// The `arch_state` field stores whatever per-arch context is needed at drop
/// time (e.g., the previous CR0 value on x86-64, the previous CPACR_EL1 value
/// on AArch64). It is zero-sized on architectures where a single bit-set/clear
/// suffices and the previous value is known (e.g., CR0.TS is always 1 before
/// the guard; always restored to 1 on drop).
///
/// **Per-architecture `SimdKernelState` sizes:**
///
/// | Architecture | Size (bytes) | Saved state |
/// |---|---|---|
/// | x86-64 | 0 (ZST) | CR0.TS is always 1 before; restored unconditionally. |
/// | AArch64 | 8 | Previous CPACR_EL1 (u64): restore FPEN+ZEN fields on drop. |
/// | ARMv7 | 4 | Previous FPEXC (u32): restore EN bit on drop. |
/// | RISC-V 64 | 8 | Previous sstatus (u64): restore FS and VS fields on drop. |
/// | PPC32 | 4 | Previous MSR (u32): restore VEC bit on drop. |
/// | PPC64LE | 4 | Previous MSR low word (u32): restore VEC+VSX bits on drop. |
/// | s390x | 8 | Previous CR0 (u64): restore AFP bit on drop via `LCTLG`. |
/// | LoongArch64 | 4 | Previous CSR.EUEN (u32): restore FPE+SXE+ASXE bits on drop. |
pub struct SimdKernelGuard {
    _preempt: PreemptGuard,
    arch_state: arch::current::cpu::SimdKernelState,
    // Prevent Send: the guard must be dropped on the CPU that acquired it.
    _no_send: PhantomData<*mut ()>,
}

impl SimdKernelGuard {
    /// Acquire the guard. Panics if called from interrupt context
    /// (`CpuLocal::irq_count > 0` or `CpuLocal::softirq_count > 0`).
    #[must_use]
    #[inline]
    pub fn new() -> Self {
        // Panic in all builds, not just debug: returning a noop guard
        // would allow the caller to execute SIMD instructions without the
        // SIMD unit enabled, causing a #UD fault or silent data
        // corruption. There is no safe fallback here; the caller expects
        // SIMD to be available after acquiring the guard.
        assert!(
            !arch::current::interrupts::in_interrupt(),
            "SimdKernelGuard::new() called from interrupt context"
        );
        let preempt = preempt_disable();
        let block = CpuLocal::get();
        let depth = block.simd_kernel_depth.load(Ordering::Relaxed);
        debug_assert!(
            depth < 16,
            "SimdKernelGuard nesting depth {} exceeds limit — likely unbounded recursion",
            depth,
        );
        if depth > 0 {
            // Already active on this CPU — return a noop guard that only
            // increments depth. SIMD unit is already enabled by the outer guard.
            block.simd_kernel_depth.fetch_add(1, Ordering::Relaxed);
            return Self {
                _preempt: preempt,
                arch_state: arch::current::cpu::SimdKernelState::NOOP,
                _no_send: PhantomData,
            };
        }
        // Outermost acquisition: save task FPU state if it is live (lazy-FPU
        // mechanism may not have saved it yet). After this, the save area is
        // up-to-date for context switch.
        arch::current::cpu::save_task_fpu_if_live();
        let arch_state = arch::current::cpu::simd_kernel_enable();
        // Track nesting depth in CpuLocal for is_active() check.
        //
        // NMI window: between save_task_fpu_if_live() and the depth increment
        // below, an NMI handler would see depth == 0. This is safe because NMI
        // handlers are prohibited from using SIMD (checked by the in_interrupt()
        // guard above). The enable-before-increment order avoids needing to undo
        // depth on enable failure.
        CpuLocal::get().simd_kernel_depth.fetch_add(1, Ordering::Relaxed);
        Self { _preempt: preempt, arch_state, _no_send: PhantomData }
    }

    /// Create a no-op guard that skips SIMD enable/disable. Used when:
    /// - Called from interrupt context (FPU state already saved by the
    ///   interrupt entry trampoline on architectures that require it).
    /// - The caller knows SIMD is already active (`is_active() == true`)
    ///   but needs a guard value for API uniformity.
    ///
    /// The returned guard holds a `PreemptGuard` (preemption disabled)
    /// but does NOT touch FPU/SIMD registers or increment `simd_kernel_depth`.
    /// Drop is a no-op beyond re-enabling preemption.
    pub fn noop() -> Self {
        Self {
            _preempt: preempt_disable(),
            arch_state: arch::current::cpu::SimdKernelState::NOOP,
            _no_send: PhantomData,
        }
    }

    /// True if a `SimdKernelGuard` is currently held on the CURRENT CPU.
    /// Use for nesting avoidance in composable functions:
    /// ```rust
    /// let _guard = if !SimdKernelGuard::is_active() { Some(SimdKernelGuard::new()) } else { None };
    /// ```
    /// The check itself briefly disables preemption so the `CpuLocal` read
    /// observes the correct CPU. The caller, however, must keep preemption
    /// disabled (e.g., hold an existing `PreemptGuard`) across the check
    /// and any subsequent SIMD use: a migration after `is_active()`
    /// returns would carry a stale answer from the previous CPU, causing
    /// a #UD fault on the new CPU.
    #[inline(always)]
    pub fn is_active() -> bool {
        let _preempt = preempt_disable();
        CpuLocal::get().simd_kernel_depth.load(Ordering::Relaxed) > 0
    }
}

impl Drop for SimdKernelGuard {
    fn drop(&mut self) {
        // Use CpuLocal::get() (shared reference) + AtomicU8 operations to avoid
        // the &mut CpuLocalBlock aliasing violation (see SF-101/SF-102). The
        // fetch_sub returns the PREVIOUS value; depth reaches 0 when prev == 1.
        let block = CpuLocal::get();
        let prev = block.simd_kernel_depth.fetch_sub(1, Ordering::Relaxed);
        if prev == 1 {
            // Outermost guard: disable the SIMD unit.
            arch::current::cpu::simd_kernel_disable(&self.arch_state);
        }
        // Inner guards: arch_state is NOOP, simd_kernel_disable is a no-op anyway.
        // _preempt dropped here: preemption re-enabled after SIMD disabled.
    }
}

A simd_kernel_depth: AtomicU8 field is added to CpuLocalBlock (§3.1.2). It tracks the nesting depth for is_active() and for the nestable guard protocol: only the outermost guard (depth 0→1) enables SIMD; only the last drop (depth 1→0) disables it. The field is zero when no SimdKernelGuard is held. AtomicU8 (with Relaxed ordering) is used instead of plain u8 to allow access via CpuLocal::get() (shared &CpuLocalBlock) without requiring get_mut() — which would unsoundly create &mut CpuLocalBlock while NMI handlers may concurrently access other atomic fields. On x86-64, Relaxed atomic ops compile to plain loads/stores with zero overhead.

3.10.3 Combined Usage Pattern

A kernel subsystem using hardware-accelerated algorithms combines both mechanisms:

// ── Module level (initialised at boot phase 9) ──────────────────────────────

type AesGcmFn = fn(key: &AesGcmKey, nonce: &[u8; 12],
                   aad: &[u8], plaintext: &[u8], ct_out: &mut [u8]);

algo_dispatch! {
    pub(crate) static AES_GCM: AlgoDispatch<AesGcmFn> = {
        algo: "aes-gcm",
        candidates: [
            // VAES + VPCLMULQDQ: 8-block parallelism (x86-64 AVX-512+)
            { "vaes_vpclmul", crypto: VAES | CLMUL, simd: 64, aes_gcm_vaes_vpclmul },
            // AES-NI + PCLMULQDQ (x86-64 baseline accelerated, AArch64 PMULL)
            { "aesni_clmul",  crypto: AES_BLOCK | CLMUL, simd: 16, aes_gcm_aesni_clmul },
            // RISC-V Zkne + Zkg scalar
            { "zkne_zkg",     crypto: AES_BLOCK | CLMUL, simd: 0, aes_gcm_zkne_zkg },
            // PPC64LE vcipher + vpmsumd
            { "vcipher_ppc",  crypto: AES_BLOCK | CLMUL, simd: 16, aes_gcm_vcipher_ppc },
            // Portable constant-time scalar (always eligible, always last)
            { "generic",      crypto: {},                simd: 0, aes_gcm_generic },
        ],
    };
}

// ── Call site ────────────────────────────────────────────────────────────────

pub fn encrypt(key: &AesGcmKey, nonce: &[u8; 12],
               aad: &[u8], plaintext: &[u8], ct_out: &mut [u8]) {
    // Acquire the SIMD guard only when the selected implementation uses SIMD.
    // `min_simd_bytes > 0` in the selected candidate means SIMD is needed.
    // If a SimdKernelGuard is already held by a caller, reuse it (no re-entry).
    let _guard = if AES_GCM_USES_SIMD && !SimdKernelGuard::is_active() {
        Some(SimdKernelGuard::new())
    } else {
        None
    };
    AES_GCM.get()(key, nonce, aad, plaintext, ct_out);
}

The boolean AES_GCM_USES_SIMD is a static bool initialised alongside the AlgoDispatch during boot phase 9; it captures whether the selected candidate has min_simd_bytes > 0. This avoids the guard acquisition overhead on platforms (or algorithm variants) that selected a purely scalar implementation.
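A minimal host-side sketch of the phase-9 selection step, assuming a bitmask feature model and first-eligible-wins candidate order (the names and fields here are illustrative, not the real AlgoDispatch API):

```rust
// Illustrative feature bits (not the real crypto capability flags).
const AES_BLOCK: u32 = 1 << 0;
const CLMUL: u32 = 1 << 1;

struct Candidate {
    name: &'static str,
    required_crypto: u32,
    min_simd_bytes: usize,
}

/// First candidate whose required features are all present wins; the
/// generic entry (no requirements) is always eligible and listed last.
fn select<'a>(cands: &'a [Candidate], cpu_crypto: u32) -> &'a Candidate {
    cands
        .iter()
        .find(|c| (c.required_crypto & !cpu_crypto) == 0)
        .expect("generic candidate is always eligible")
}

fn main() {
    let cands = [
        Candidate { name: "aesni_clmul", required_crypto: AES_BLOCK | CLMUL, min_simd_bytes: 16 },
        Candidate { name: "generic", required_crypto: 0, min_simd_bytes: 0 },
    ];

    // CPU with AES + CLMUL: accelerated variant selected; guard needed.
    let sel = select(&cands, AES_BLOCK | CLMUL);
    assert_eq!(sel.name, "aesni_clmul");
    let aes_gcm_uses_simd = sel.min_simd_bytes > 0; // value captured into the static
    assert!(aes_gcm_uses_simd);

    // CPU with neither: scalar fallback; guard acquisition skipped.
    let sel = select(&cands, 0);
    assert_eq!(sel.name, "generic");
    assert_eq!(sel.min_simd_bytes, 0);
    println!("ok");
}
```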

3.10.4 Feature-Dependent Subsystem Catalog

Every kernel subsystem that benefits from CPU-specific hardware acceleration uses AlgoDispatch for variant selection. The table below is the authoritative catalog. New entries must be added here when a subsystem gains hardware acceleration.

For the full kernel image structure showing how these modules are packaged and loaded, see Section 2.21.

3.10.4.1 Cryptographic Algorithms

| Algorithm | AlgoDispatch Static | Priority-Ordered Variants |
|---|---|---|
| AES-GCM | AES_GCM | VAES+VPCLMUL (x86 AVX-512) → AES-NI+CLMUL (x86) → CE+PMULL (AArch64) → Zkne+Zkg (RISC-V) → vcipher+vpmsumd (PPC64) → generic |
| AES-XTS | AES_XTS | AES-NI (x86) → CE (AArch64) → Zkne (RISC-V) → generic |
| ChaCha20-Poly1305 | CHACHA20_POLY1305 | AVX-512 (x86) → AVX2 (x86) → NEON (AArch64) → generic |
| SHA-256 | SHA256 | SHA-NI (x86) → AVX2 4-way (x86) → SHA2-CE (AArch64) → NEON 4-way (AArch64) → Zknh (RISC-V) → generic |
| SHA-512 | SHA512 | AVX2 (x86) → SHA512-CE (AArch64) → generic |
| SHA-3 / SHAKE | SHA3 | AVX2 Keccak-4x (x86) → SHA3-CE (AArch64, FEAT_SHA3) → generic |
| SM3 | SM3 | AVX2+AES-NI (x86) → SM3-CE (AArch64, FEAT_SM3) → Zksh (RISC-V) → generic |
| SM4 | SM4 | AES-NI affine (x86) → SM4-CE (AArch64, FEAT_SM4) → Zksed (RISC-V) → generic |
| ML-KEM-768 | ML_KEM_768 | AVX2 NTT (x86) → NEON NTT (AArch64) → generic |
| ML-DSA-65 | ML_DSA_65 | AVX2 (x86) → NEON (AArch64) → generic |
| GHASH | GHASH | PCLMULQDQ (x86) → PMULL (AArch64) → Zkg (RISC-V) → generic |
| Poly1305 | POLY1305 | AVX2 (x86) → NEON (AArch64) → generic |

3.10.4.2 Checksum and Hash

| Algorithm | AlgoDispatch Static | Priority-Ordered Variants |
|---|---|---|
| CRC32C | CRC32C | SSE4.2 crc32 instr (x86) → CRC32 instr (AArch64) → Zbc (RISC-V) → generic |
| xxHash64 | XXHASH64 | AVX2 (x86) → NEON (AArch64) → generic |
| Adler32 | ADLER32 | SSSE3 (x86) → NEON (AArch64) → generic |

3.10.4.3 Compression

| Algorithm | AlgoDispatch Static | Priority-Ordered Variants |
|---|---|---|
| zstd compress | ZSTD_COMPRESS | AVX2 match finder (x86) → NEON (AArch64) → generic |
| zstd decompress | ZSTD_DECOMPRESS | BMI2 bit extraction (x86) → generic |
| LZ4 | LZ4 | AVX2 sequence match (x86) → NEON (AArch64) → generic |
| zlib/deflate | ZLIB_DEFLATE | SSE4.2+PCLMULQDQ (x86) → CRC32+PMULL (AArch64) → generic |

3.10.4.4 Memory Operations

| Operation | AlgoDispatch Static | Priority-Ordered Variants |
|---|---|---|
| memcpy (kernel) | KERNEL_MEMCPY | ERMS+FSRM rep movsb (x86) → AVX2 (x86, no FSRM) → NEON ldp/stp (AArch64) → generic |
| memset / page zero | KERNEL_MEMSET | ERMS rep stosb (x86) → AVX2 (x86) → DC ZVA (AArch64) → NEON stp xzr (AArch64) → generic |
| memcmp | KERNEL_MEMCMP | SSE4.2 PCMPISTRI (x86) → NEON (AArch64) → generic |

3.10.4.5 RAID Parity

| Operation | AlgoDispatch Static | Priority-Ordered Variants |
|---|---|---|
| XOR (RAID5) | RAID_XOR | AVX-512 (x86) → AVX2 (x86) → SVE (AArch64) → NEON (AArch64) → RVV (RISC-V) → generic |
| P+Q (RAID6) | RAID_PQ | AVX-512 (x86) → AVX2 (x86) → NEON (AArch64) → generic |

3.10.4.6 Networking

| Operation | AlgoDispatch Static | Priority-Ordered Variants |
|---|---|---|
| TCP/UDP/IP checksum | NET_CSUM | AVX2 (x86) → SSE2 (x86) → NEON (AArch64) → generic |
| RSS Toeplitz hash | NET_TOEPLITZ | PCLMULQDQ (x86) → PMULL (AArch64) → generic |

3.10.4.7 Code Alternatives (Instruction-Level Dispatch)

AlgoDispatch selects which function to call. A complementary mechanism — code_alternative! — selects which instruction to use within a function. This covers CPU errata workarounds, new instruction adoption, and microarchitectural tuning at the instruction level:

| Category | Examples | Mechanism |
|---|---|---|
| Spectre/Meltdown mitigations | Retpoline → eIBRS direct branch; KPTI enable/disable; VERW on context switch | code_alternative! + ErrataCaps |
| New instructions replacing old | SERIALIZE replacing CPUID; WRMSRNS replacing WRMSR; LKGS for syscall entry | code_alternative! + arch_raw |
| Errata workarounds | LFENCE serialization; CLEARBHB (AArch64); DSB before TLBI (Cortex-A76) | code_alternative! + ErrataCaps |
| Page zeroing | DC ZVA vs STP xzr (AArch64, depends on ZVA block size) | code_alternative! |
| Power management | MWAIT hint value selection; HWP enable/disable; C-state depth | MicroarchHints + platform PM driver |

See Section 2.16 for the full code_alternative! specification, ErrataCaps bitflags, and MicroarchHints struct.

3.10.4.8 Not Dispatched via AlgoDispatch or code_alternative!

The following subsystems have CPU-feature-dependent behavior but use neither AlgoDispatch nor code_alternative!, because their dispatch is structurally different:

| Subsystem | Mechanism | Reason |
|---|---|---|
| Isolation domain switch | arch::current::isolation (compile-time per target triple) | Affects entire driver model topology, not a single callsite. Runtime fallback (POE → page table on AArch64) is per-driver-init, not per-call. |
| BPF JIT | Per-arch JIT backend (x86-64, AArch64, RISC-V, etc.) | Code generation, not algorithm selection. JIT emits arch-native instructions; no "variant" to dispatch. |
| Context switch | arch::current::context::context_switch() | Per-arch assembly. Only one implementation per target triple. |
| TLB flush | arch::current::mm::flush_tlb_*() | Hardware instruction, no variants. |
| Interrupt entry/exit | Per-arch asm (IDT/GIC/PLIC/OpenPIC vectors) | Hardware trap path, no dispatch. |
| Page table depth | MicroarchHints.page_table_levels → configured once at boot | 4-level vs 5-level is an MMU configuration, not an instruction alternative. |

3.10.4.9 Three-Level Dispatch Summary

Compile time          Boot time (phase 9)         Runtime
    │                       │                        │
    ▼                       ▼                        ▼
arch::current::      code_alternative!         AlgoDispatch
(per target triple)  (instruction patching)     (function pointer)
    │                       │                        │
    │  Selects arch module  │  Patches instructions  │  Selects algorithm
    │  (x86 vs ARM vs RV)  │  within arch code for  │  implementation
    │                       │  specific CPU model    │  (SHA-NI vs generic)
    │                       │                        │
    └──── compile-time ─────┴──── zero-overhead ─────┴── one indirect call ──
                                 after patching            (branch-predicted)

All three mechanisms together ensure the kernel binary adapts to the exact hardware at boot — no per-host recompilation, no runtime branches in hot paths. This is how UmkaOS avoids the "compile for your CPU" approach (Gentoo/Slackware) while still extracting maximum performance.

3.10.4.10 Module Packaging Rule

Default (Model A): All AlgoDispatch candidates are compiled inline into the module that declares the algo_dispatch! static. Dead variants remain in the image (~few KB each). This is the default for all algorithms in the catalog above.

Exception (Model B): If a single variant implementation exceeds 64 KB of code (e.g., a complex AVX-512 implementation with hand-tuned assembly), it MAY be split into a separate feature-variant module loaded by AlgoDispatch at boot phase 9. The decision is per-algorithm, documented in the variant's algo_dispatch! declaration with a // MODEL_B: <module-name> comment.

Currently, no algorithm in the catalog requires Model B — all are under the 64 KB threshold. This section exists to define the mechanism for future use.


3.11 Workqueue / Deferred Work

Kernel operations that cannot complete in interrupt context — because IRQ handlers must be atomic and non-sleeping — or that must not block the calling thread need deferred execution. UmkaOS provides a structured workqueue mechanism for this.

UmkaOS improvement over Linux: Linux uses an anonymous kworker pool model where all kernel-wide deferred work competes in a shared thread pool, causing priority inversion, poor debuggability (ps shows meaningless kworker/0:1), and no backpressure (the queue grows unboundedly under load). UmkaOS instead requires each subsystem to create a named thread pool with explicit priority, bounded depth, and CPU affinity:

  • ps shows umkad-net-rx-0, umkad-blk-io-3 — always attributable to a subsystem
  • Bounded queues return WouldBlock instead of silently accumulating unbounded work
  • Priority isolation: network Rx runs SCHED_FIFO; background scan runs SCHED_IDLE
  • No priority inversion: high-priority subsystems are never delayed by background tasks
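The backpressure behavior is analogous to a bounded channel's `try_send`: a full queue reports failure immediately instead of accumulating work. A userspace sketch using std (not the kernel API):

```rust
use std::sync::mpsc::sync_channel;

fn main() {
    // Bounded queue of depth 2, standing in for a WorkQueue's ring.
    let (tx, _rx) = sync_channel::<u32>(2);
    assert!(tx.try_send(1).is_ok());
    assert!(tx.try_send(2).is_ok());
    // Queue full: the submit fails immediately (the WouldBlock analogue)
    // rather than growing without bound. The caller must now retry later,
    // drop the work, or apply its own overflow strategy.
    assert!(tx.try_send(3).is_err());
    println!("ok");
}
```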

3.11.1 Core Types

/// Opaque handle to a submitted work item. Used for cancellation.
pub struct WorkHandle(u64);

/// A unit of deferred work. `f` runs in a workqueue thread context:
/// preemptible, may sleep, may allocate with GFP_KERNEL.
///
/// # Constraint
/// Work items MUST NOT be submitted from IRQ context. For IRQ-safe
/// deferred work (e.g., memory reclamation), use `rcu_call` ([Section 3.4](#cumulative-performance-budget)).
pub struct WorkItem {
    pub f:           fn(*mut ()),
    pub data:        *mut (),
    /// Deadline hint in nanoseconds from boot. `NO_DEADLINE` (`u64::MAX`) = no
    /// deadline (schedule at discretion). If set and expired, the item is still
    /// executed (not dropped); the deadline serves as an urgency signal
    /// to the scheduler.
    ///
    /// Uses `NO_DEADLINE` sentinel instead of `Option<u64>` to save 8 bytes:
    /// `Option<u64>` is 16 bytes (no niche optimization for u64), while
    /// `u64` is 8 bytes. `u64::MAX` nanoseconds is ~584 years from boot —
    /// safely beyond any practical deadline. This keeps `WorkItem` at 24
    /// bytes (three pointer-sized fields) instead of 32.
    pub deadline_ns: u64,
}
/// Sentinel value for "no deadline". `u64::MAX` nanoseconds = ~584 years from boot.
pub const NO_DEADLINE: u64 = u64::MAX;

impl WorkItem {
    /// Create a new work item.
    ///
    /// `deadline_ns == u64::MAX` (`NO_DEADLINE`) indicates no deadline — the work
    /// item is scheduled at the workqueue's discretion. All u64 values are valid;
    /// `NO_DEADLINE` is not a "special" invalid value that needs validation — it is
    /// simply the sentinel for "no deadline." There is no way to distinguish
    /// "accidental u64::MAX" from "intentional NO_DEADLINE" — they are the same
    /// value by definition.
    pub fn new(f: fn(*mut ()), data: *mut (), deadline_ns: u64) -> Self {
        Self { f, data, deadline_ns }
    }
}
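The size arithmetic behind the sentinel choice can be checked on any 64-bit host with throwaway structs mirroring the two candidate layouts:

```rust
use std::mem::size_of;

// Field-for-field mirror of WorkItem on a 64-bit target.
#[allow(dead_code)]
struct MirrorWorkItem {
    f: fn(*mut ()),
    data: *mut (),
    deadline_ns: u64,
}

// The rejected alternative: Option<u64> has no niche, so the
// discriminant costs a full 8 bytes of padding-aligned storage.
#[allow(dead_code)]
struct MirrorWithOption {
    f: fn(*mut ()),
    data: *mut (),
    deadline_ns: Option<u64>,
}

fn main() {
    assert_eq!(size_of::<u64>(), 8);
    assert_eq!(size_of::<Option<u64>>(), 16);
    assert_eq!(size_of::<MirrorWorkItem>(), 24);  // three pointer-sized fields
    assert_eq!(size_of::<MirrorWithOption>(), 32); // the avoided layout
    println!("ok");
}
```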

// SAFETY: WorkItem is Send; the caller must ensure `data` is valid
// for the lifetime of the work item.
unsafe impl Send for WorkItem {}

/// Delayed work: queued for execution after `delay_ns` nanoseconds
/// from the time of submission. Implemented via the UmkaOS timer framework
/// (Section 7.5): a one-shot timer enqueues the WorkItem on expiry.
pub struct DelayedWork {
    pub item:     WorkItem,
    pub delay_ns: u64,
}

/// Named thread pool for deferred work.
pub struct WorkQueue {
    name:     &'static str,
    queue:    Arc<BoundedMpmcRing<WorkItem>>,
    /// Opaque handle to a kernel thread (defined in [Section 8.7](08-process.md#resource-limits-and-accounting)).
    threads:  ArrayVec<KthreadHandle, WORKQUEUE_MAX_THREADS>,
    /// Scheduling policy for worker threads (defined in [Section 7.1](07-scheduling.md#scheduler)).
    sched:    SchedPolicy,
    cpu_mask: CpuMask,
}

/// Maximum threads in a single WorkQueue.
pub const WORKQUEUE_MAX_THREADS:   usize = 64;
/// Default maximum pending items per WorkQueue.
/// Callers receive `WouldBlock` when this is reached (backpressure, not panic).
pub const WORKQUEUE_DEFAULT_DEPTH: usize = 4096;

/// Serialized (single-thread) variant: guarantees strict FIFO execution order.
/// Use for device state machines and sequenced protocol stacks.
pub struct OrderedWorkQueue(WorkQueue); // internally max_threads = 1

/// Lock-free MPMC bounded ring buffer for work items.
///
/// Per-slot wrapper for the MPMC ring. Each slot carries a Lamport-style
/// sequence number that coordinates producers and consumers without a
/// global lock. The `seq` field is initialized to the slot's index at
/// creation time; producers advance it to `index + 1` after writing,
/// consumers advance it to `index + capacity` after reading.
///
/// The sequence number protocol ensures:
/// - A producer only writes to a slot where `seq == enq_head` (slot is empty).
/// - A consumer only reads from a slot where `seq == deq_tail + 1` (slot is full).
/// - No two threads ever access the same slot's `data` concurrently.
// kernel-internal, not KABI — generic type with T-dependent size.
#[repr(C, align(64))]  // cache-line aligned: prevents false sharing between adjacent slots
pub struct Slot<T> {
    /// Lamport sequence number for this slot. Controls slot ownership:
    /// - `seq == slot_index`: slot is empty, available for producers.
    /// - `seq == slot_index + 1`: slot is full, available for consumers.
    /// - Other values: slot is being written/read by another thread.
    pub seq:  AtomicU64,
    /// The actual data stored in this slot. Only accessed after the
    /// sequence number handshake confirms exclusive ownership.
    pub data: UnsafeCell<MaybeUninit<T>>,
}

/// Lock-free multi-producer multi-consumer (MPMC) bounded ring buffer.
/// Uses Lamport-style per-slot sequence numbers with CAS — see
/// [Section 3.11](#workqueue-deferred-work--boundedmpmcring-memory-ordering-specification)
/// for the full algorithm and memory ordering specification.
///
/// - Producer CAS on `enq_head` to claim a slot, then checks `slot.seq`
/// - Consumer CAS on `deq_tail` to claim a slot, then checks `slot.seq`
/// No heap allocation after creation; all storage is in the `Box<[Slot<T>]>`.
pub struct BoundedMpmcRing<T> {
    ring:     Box<[Slot<T>]>,
    capacity: usize,
    enq_head: AtomicU64,
    deq_tail: AtomicU64,
}

// SAFETY: BoundedMpmcRing is designed for concurrent multi-producer multi-consumer
// access; per-slot sequence numbers ensure no slot is accessed by two threads at once.
unsafe impl<T: Send> Sync for BoundedMpmcRing<T> {}

3.11.2 API

impl WorkQueue {
    /// Create a named work queue.
    ///
    /// - `name`: threads appear as `umkad-{name}-N` in process listings
    /// - `max_threads`: number of concurrent worker threads (1 ≤ N ≤ WORKQUEUE_MAX_THREADS)
    /// - `queue_depth`: maximum pending items (1 ≤ N ≤ 65535)
    /// - `sched`: scheduling policy for all worker threads
    /// - `cpu_mask`: CPU affinity mask; `CpuMask::all()` for no restriction
    pub fn new(
        name:        &'static str,
        max_threads: usize,
        queue_depth: usize,
        sched:       SchedPolicy,
        cpu_mask:    CpuMask,
    ) -> Result<Arc<Self>, KernelError>;

    /// Submit work for asynchronous execution. Returns immediately.
    ///
    /// Returns `WouldBlock` if the queue is at capacity. The caller must
    /// handle backpressure — retry after a delay, drop the work, or use
    /// a per-subsystem overflow strategy. Silent queuing of unbounded work
    /// is not permitted.
    pub fn queue_work(&self, item: WorkItem) -> Result<WorkHandle, KernelError>;

    // NOTE: WorkItem is pre-allocated from a per-CPU slab cache (Section 4.2
    // slab allocator) sized at WQ_MAX_ACTIVE per workqueue per CPU.
    // queue_work does not allocate: it takes a `WorkItem` by value (move
    // semantics) that was previously obtained from the per-CPU slab. If the
    // slab is exhausted, queue_work returns ENOMEM (backpressure). No heap
    // allocation occurs under any lock.

    /// Submit delayed work. The item is enqueued after `work.delay_ns` nanoseconds.
    pub fn queue_delayed_work(&self, work: DelayedWork)
        -> Result<WorkHandle, KernelError>;

    /// Cancel a pending (not yet started) work item.
    /// Returns `Ok(true)` if cancelled before execution.
    /// Returns `Ok(false)` if the item was already running or completed.
    /// Does NOT wait for a currently-running item to finish.
    pub fn cancel_work(&self, handle: WorkHandle) -> Result<bool, KernelError>;

    /// Cancel a pending or running work item and block until it completes.
    ///
    /// - If the item has not yet started: dequeued, never executed. Returns `Ok(true)`.
    /// - If the item is currently running: blocks until execution finishes.
    ///   Returns `Ok(false)` (item ran to completion, not cancelled).
    /// - If the item has already completed: returns `Ok(false)` immediately.
    ///
    /// MUST NOT be called from atomic context (sleeps while waiting for
    /// the running item to complete). This is the safe pattern for driver
    /// unload — ensures no work item references freed data after return:
    ///
    /// ```rust
    /// fn driver_unload(wq: &WorkQueue, handle: WorkHandle) {
    ///     wq.cancel_work_sync(handle).expect("cancel failed");
    ///     // Safe: work item has completed or was cancelled.
    ///     // Data referenced by the work item can now be freed.
    /// }
    /// ```
    pub fn cancel_work_sync(&self, handle: WorkHandle) -> Result<bool, KernelError>;

    /// Same as `cancel_work_sync` for delayed work items. If the timer has
    /// not yet fired, the timer is cancelled and the work item is dequeued.
    /// If the timer has fired and the work item is running, blocks until
    /// the running item completes.
    pub fn cancel_delayed_work_sync(&self, handle: WorkHandle) -> Result<bool, KernelError>;

    /// Wait for all currently-queued items to complete. Items submitted
    /// concurrently with or after `flush()` is called are not waited for.
    /// May sleep; must not be called from atomic context.
    pub fn flush(&self);

    /// Cancel all pending items and wait for any currently-running items
    /// to complete. After `drain()` returns, no items from before the call
    /// are in-flight.
    pub fn drain(&self);
}

impl OrderedWorkQueue {
    pub fn new(name: &'static str, queue_depth: usize, sched: SchedPolicy,
               cpu_mask: CpuMask) -> Result<Arc<Self>, KernelError>;
    // Delegates to WorkQueue with max_threads = 1.
    pub fn queue_work(&self, item: WorkItem) -> Result<WorkHandle, KernelError>;
    pub fn flush(&self);
    pub fn drain(&self);
}

3.11.3 Standard Named Queues

The following named queues are created at boot by their respective subsystems. All subsystems that need deferred work must use one of these or create their own named queue — anonymous work submission is not permitted.

| Queue Name | Threads | Depth | Policy | Subsystem | Used by |
|---|---|---|---|---|---|
| net-rx | 1/NIC | 4096 | SCHED_FIFO | Network Rx path | NIC drivers (GRO aggregation, TCP/UDP demux, netfilter deferred verdicts) |
| blk-io | 4 | 8192 | SCHED_FIFO | Block I/O completion | Block drivers (I/O completion callbacks, request queue drain, SCSI/NVMe status processing) |
| rcu-reclaim | 1/NUMA | 16384 | SCHED_OTHER | RCU callback processing | RCU subsystem (deferred free of RCU-protected objects, slab page release, routing table entry reclaim) |
| pm-async | 2 | 256 | SCHED_OTHER | Device suspend/resume (§7.2.8) | Power management (async device suspend/resume, runtime PM state transitions, wakeup source bookkeeping) |
| fw-loader | 2 | 64 | SCHED_OTHER | Firmware loading (§11.5.15) | Driver framework (firmware blob fetch from filesystem, microcode upload, FPGA bitstream loading) |
| dma-fence | 2 | 1024 | SCHED_FIFO | DMA fence callbacks | GPU/DMA subsystem (fence signal callbacks, buffer release, inter-engine sync completion) |
| crypto | 4 | 4096 | SCHED_OTHER | Async crypto operations | Crypto API (async AES-GCM/ChaCha completions, dm-crypt block encryption, TLS record processing) |
| fsync | 8 | 16384 | SCHED_OTHER | Filesystem writeback | VFS/filesystems (dirty page writeback, journal commit, inode sync, periodic flush timer callbacks) |
| events | 4 | 4096 | SCHED_OTHER | General subsystem events | General deferred work (sysfs notifications, kobject cleanup, uevent dispatch, deferred probe retry) |
| events-long | 2 | 1024 | SCHED_IDLE | Background maintenance tasks | Long-running background work (memory compaction, slab cache shrink, periodic health checks, debug info collection) |
| hotplug | 2 | 128 | SCHED_OTHER | Device hotplug event processing (Section 11.4) | Bus subsystem (PCI/USB/platform device add/remove, driver bind/unbind, resource rebalancing) |
| mod-loader | 4 | 256 | SCHED_OTHER | Driver module loading with priority ordering (Section 11.4) | Module subsystem (driver module load/init, dependency resolution, symbol relocation, signature verification) |

Security-critical invariant: Security-critical operations (IMA measurement, capability validation, LSM hooks) are synchronous in the caller's context. They are never submitted to shared workqueues and are never subject to workqueue backpressure. This prevents a saturated workqueue from delaying or dropping security checks.

3.11.3.1 Tier 1 Crash Recovery

When a Tier 1 driver crashes (Section 11.9), the crash recovery sequence MUST drain or cancel all pending work items from workqueues registered by the crashed driver before releasing the driver's address space. The protocol:

  1. Mark driver as crashed: The crash handler sets the driver's state to CRASHED in the device registry (Section 11.4).
  2. Cancel pending items: For each workqueue owned by (or shared with) the crashed driver, drain_or_cancel_driver_work(driver_id) iterates the queue's BoundedMpmcRing and removes any WorkItem whose owner_driver_id matches the crashed driver. Removed items are dropped without execution.
  3. Wait for in-flight items: If a work item from the crashed driver is currently executing in a worker thread, the worker detects the crash via the driver state flag and aborts execution at the next safe point (cancellation point). The crash handler waits for in-flight items to complete or abort before proceeding.
  4. Release address space: Only after all work items are drained/cancelled does the crash handler unmap the driver's code and data pages. This prevents use-after-unmap faults in worker threads.

For system-global workqueues (events, events-long, etc.), step 2 filters by owner_driver_id. The queue itself is not destroyed — only the crashed driver's items are removed. For driver-private workqueues (created by the driver at probe time), the entire queue is drained and destroyed.
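Step 2 amounts to filtering the pending queue by owner. A simplified host-side model (the `owner_driver_id` bookkeeping and the `Vec` queue are illustrative stand-ins; the real implementation iterates the BoundedMpmcRing):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct DriverId(u32);

struct PendingItem {
    owner_driver_id: DriverId,
    // work payload elided
}

/// Remove (without executing) every pending item owned by the crashed
/// driver; returns how many were dropped.
fn drain_driver_work(queue: &mut Vec<PendingItem>, crashed: DriverId) -> usize {
    let before = queue.len();
    queue.retain(|item| item.owner_driver_id != crashed);
    before - queue.len()
}

fn main() {
    let mut q = vec![
        PendingItem { owner_driver_id: DriverId(7) },
        PendingItem { owner_driver_id: DriverId(3) },
        PendingItem { owner_driver_id: DriverId(7) },
    ];
    // Driver 7 crashed: its two items are dropped; driver 3's survives.
    assert_eq!(drain_driver_work(&mut q, DriverId(7)), 2);
    assert_eq!(q.len(), 1);
    assert_eq!(q[0].owner_driver_id, DriverId(3));
    println!("ok");
}
```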

Thread names appear in /proc/N/comm and ps output as umkad-{name}-{index}. The 1/NIC convention means one thread per network interface card is created by the network subsystem at NIC probe time.

Workqueue creation phase assignments (keyed to boot phases in Section 2.3):

| Phase | Queues created | Rationale |
|---|---|---|
| Phase 2.7+ (workqueue init, early services; see canonical boot table in Section 2.3) | rcu-reclaim, events, events-long | Created by workqueue_init_early() during Phase 2.7. Needed by core subsystems before device enumeration. Dependency: the workqueue subsystem (Phase 2.7) must be initialized first; rcu_init() (Phase 2.8) then calls create_workqueue("rcu-reclaim") during its init sequence. |
| Phase 4.5 (block/storage init) | blk-io, fsync, crypto | Created during block_init() (Phase 4.5 in the canonical boot table, Section 2.3). Block I/O completion and filesystem writeback available. |
| Phase 4.4a (bus enumeration) | pm-async, fw-loader, dma-fence, hotplug, mod-loader | Device probe begins; firmware loading and PM needed. |
| Phase 5.3+ (NIC probe) | net-rx (1 per NIC) | Created dynamically as each NIC driver probes. |

3.11.4 BoundedMpmcRing Memory Ordering Specification

The BoundedMpmcRing algorithm (producer CAS on enq_head, consumer CAS on deq_tail, Lamport-style sequence numbers) is correct only with precisely specified memory orderings. Without explicit orderings, the implementation contains data races on all architectures with weak memory models (AArch64, RISC-V, PPC); TSO (x86-64) merely hides these bugs in testing.

Slot structure (one entry per ring position):

#[repr(C, align(64))]  // cache-line aligned: prevents false sharing between adjacent slots
struct Slot<T> {
    /// Sequence number. Initially equals the slot index. Advances by CAPACITY after
    /// each producer/consumer cycle. A producer sees `seq == head` when slot is free;
    /// a consumer sees `seq == tail + 1` when slot is filled.
    /// AtomicU64 (not AtomicUsize) — on 32-bit architectures, a 32-bit sequence
    /// counter wraps after ~1M full ring cycles at 4096 capacity, which can occur
    /// within hours at high callback rates. u64 provides 50-year safety.
    seq:  AtomicU64,
    data: UnsafeCell<MaybeUninit<T>>,
}
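The arithmetic behind the AtomicU64 choice is easy to check: each full ring cycle advances a slot's sequence number by CAPACITY, so a 32-bit counter wraps after 2^32 / CAPACITY cycles (helper name is illustrative).

```rust
/// Number of full ring cycles before a hypothetical 32-bit sequence
/// counter wraps, given the ring capacity (a power of two).
fn cycles_until_wrap_u32(capacity: u64) -> u64 {
    (1u64 << 32) / capacity
}

fn main() {
    // At 4096 capacity: ~1.05M cycles — reachable within hours at high
    // callback rates, which is why the spec mandates AtomicU64.
    assert_eq!(cycles_until_wrap_u32(4096), 1_048_576);
    // Smaller rings wrap later, but still well within machine lifetime.
    assert_eq!(cycles_until_wrap_u32(256), 16_777_216);
}
```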

Enqueue (producer side):

fn push(&self, item: T) -> Result<(), T> {
    loop {
        // (1) Try to claim a slot index. Relaxed: ordering is on `seq` below.
        let head = self.enq_head.load(Relaxed);
        let slot = &self.slots[head & (CAPACITY - 1)];

        // (2) Check if slot is available (seq == head means slot is free).
        //     Acquire pairs with (D) Release in the consumer.
        let seq = slot.seq.load(Acquire);
        if seq == head {
            // Slot is free — try to claim it via CAS.
            match self.enq_head.compare_exchange_weak(
                head, head.wrapping_add(1), Relaxed, Relaxed,
            ) {
                Ok(_) => {
                    // (3) Write data. The Release in (4) makes this visible.
                    unsafe { slot.data.get().write(MaybeUninit::new(item)); }
                    // (4) Publish: seq = head + 1. Release pairs with (B) Acquire.
                    slot.seq.store(head.wrapping_add(1), Release);
                    return Ok(());
                }
                Err(_) => continue,  // Another producer won; retry.
            }
        } else if (seq.wrapping_sub(head) as i64) > 0 {
            // Signed comparison via wrapping_sub + cast to i64: correct as long
            // as the ring capacity is much smaller than 2^63 (capacity is typically
            // 256-4096, so this invariant holds by many orders of magnitude).
            // Pattern: Vyukov bounded MPMC queue / Linux kfifo.
            // seq > head (in signed terms): another producer already claimed this
            // slot (our head was stale). Reload head and retry.
            continue;
        } else {
            // seq < head (signed): ring is full — the consumer hasn't freed this slot.
            return Err(item);
        }
    }
}

Dequeue (consumer side):

fn pop(&self) -> Option<T> {
    loop {
        // (A) Load current dequeue position. Do NOT fetch_add unconditionally —
        //     that would claim a slot even if the ring is empty, causing infinite
        //     spin waiting for a producer that may never come.
        let tail = self.deq_tail.load(Relaxed);
        let slot = &self.slots[tail & (CAPACITY - 1)];

        // (A') Check if the slot is filled: seq == tail + 1 means a producer
        //      has written data here and advanced the sequence number.
        let seq = slot.seq.load(Acquire);  // (B) Acquire — pairs with (4) Release
        if seq == tail.wrapping_add(1) {
            // Slot is filled. Try to claim it via CAS on deq_tail.
            if self.deq_tail.compare_exchange_weak(
                tail, tail.wrapping_add(1), Relaxed, Relaxed
            ).is_err() {
                // Another consumer claimed it. Retry.
                continue;
            }
            // (C) Read data. Safe: the Acquire in (A') establishes happens-before
            //     with the producer's write in (3).
            let item = unsafe { slot.data.get().read().assume_init() };

            // (D) Recycle slot: seq = tail + CAPACITY. Release ordering ensures
            //     (C) read completes before the next producer overwrites the slot.
            slot.seq.store(tail.wrapping_add(CAPACITY), Release);
            return Some(item);
        } else if (seq.wrapping_sub(tail) as i64) <= 0 {
            // Signed comparison via wrapping_sub + cast to i64: correct as long
            // as ring capacity is much smaller than 2^63. Pattern: Vyukov MPMC / Linux kfifo.
            // seq <= tail (signed): slot not yet filled by producer. Ring is
            // empty (or this consumer is ahead of the producer for this slot).
            return None;
        }
        // seq > tail + 1 (signed): slot already consumed (stale tail). Reload and retry.
    }
}
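The two listings above can be combined into a self-contained user-space sketch using std atomics. This is a single-threaded smoke test of the algorithm only, under simplifying assumptions: the kernel version uses slab-backed, cache-line-aligned slots, while here a Vec of slots and a fixed power-of-two CAPACITY stand in. The ordering annotations (1)-(4) and (A)-(D) match the pseudocode.

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicU64, Ordering::{Acquire, Relaxed, Release}};

// CAPACITY must be a power of two so `index & (CAPACITY - 1)` wraps correctly.
const CAPACITY: u64 = 8;

struct Slot<T> {
    seq: AtomicU64,
    data: UnsafeCell<MaybeUninit<T>>,
}

pub struct BoundedMpmcRing<T> {
    slots: Vec<Slot<T>>,
    enq_head: AtomicU64,
    deq_tail: AtomicU64,
}

// SAFETY: slot data access is mediated by the seq protocol; the
// Acquire/Release pairings below exclude data races.
unsafe impl<T: Send> Sync for BoundedMpmcRing<T> {}

impl<T> BoundedMpmcRing<T> {
    pub fn new() -> Self {
        let slots = (0..CAPACITY)
            .map(|i| Slot {
                seq: AtomicU64::new(i), // initially seq == slot index
                data: UnsafeCell::new(MaybeUninit::uninit()),
            })
            .collect();
        Self { slots, enq_head: AtomicU64::new(0), deq_tail: AtomicU64::new(0) }
    }

    pub fn push(&self, item: T) -> Result<(), T> {
        loop {
            let head = self.enq_head.load(Relaxed);                        // (1)
            let slot = &self.slots[(head & (CAPACITY - 1)) as usize];
            let seq = slot.seq.load(Acquire);                              // (2) pairs with (D)
            if seq == head {
                if self.enq_head
                    .compare_exchange_weak(head, head.wrapping_add(1), Relaxed, Relaxed)
                    .is_ok()
                {
                    unsafe { slot.data.get().write(MaybeUninit::new(item)); } // (3)
                    slot.seq.store(head.wrapping_add(1), Release);         // (4) pairs with (B)
                    return Ok(());
                }
                // Another producer won the CAS; retry.
            } else if (seq.wrapping_sub(head) as i64) > 0 {
                continue; // stale head: another producer claimed this slot
            } else {
                return Err(item); // seq < head (signed): ring is full
            }
        }
    }

    pub fn pop(&self) -> Option<T> {
        loop {
            let tail = self.deq_tail.load(Relaxed);                        // (A)
            let slot = &self.slots[(tail & (CAPACITY - 1)) as usize];
            let seq = slot.seq.load(Acquire);                              // (B) pairs with (4)
            if seq == tail.wrapping_add(1) {
                if self.deq_tail
                    .compare_exchange_weak(tail, tail.wrapping_add(1), Relaxed, Relaxed)
                    .is_err()
                {
                    continue; // another consumer claimed it
                }
                let item = unsafe { slot.data.get().read().assume_init() }; // (C)
                slot.seq.store(tail.wrapping_add(CAPACITY), Release);      // (D) pairs with (2)
                return Some(item);
            } else if (seq.wrapping_sub(tail) as i64) <= 0 {
                return None; // slot not yet filled: ring is empty
            }
            // seq > tail + 1 (signed): stale tail, retry
        }
    }
}

fn main() {
    let ring = BoundedMpmcRing::new();
    for i in 0..CAPACITY {
        assert!(ring.push(i).is_ok());
    }
    assert!(ring.push(99).is_err()); // full: all CAPACITY slots occupied
    for i in 0..CAPACITY {
        assert_eq!(ring.pop(), Some(i)); // FIFO order preserved
    }
    assert_eq!(ring.pop(), None); // empty again
}
```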

Ordering summary:

Operation Ordering Pairs with Purpose
enq_head.load + enq_head.compare_exchange_weak Relaxed Claim slot index only. Relaxed suffices: the CAS provides mutual exclusion for slot claiming, while the per-slot seq field (Acquire/Release) provides all data visibility guarantees.
slot.seq.load in producer spin Acquire (D) Release See recycled slot before overwriting
slot.seq.store(head+1) Release (B) Acquire Make data write visible before publishing
deq_tail.load Relaxed Read current dequeue position
slot.seq.load in consumer check Acquire (4) Release See data write before reading slot
deq_tail.compare_exchange_weak Relaxed, Relaxed Claim slot atomically (MPMC-safe)
slot.seq.store(tail+CAPACITY) Release (2) Acquire Recycle slot after read completes

Per-architecture compile result:
- x86-64 (TSO): load(Acquire) = plain load; store(Release) = plain store. TSO makes these no-ops in hardware, but the orderings must still be written correctly — they constrain the compiler's reordering optimizer regardless of hardware fences.
- AArch64 (weak): load(Acquire) compiles to ldar; store(Release) to stlr. These instructions are required for correctness. Without them, the CPU can reorder loads/stores past the sequence number checks.
- RISC-V (RVWMO): load(Acquire) → lw + fence r,rw; store(Release) → fence rw,w + sw.
- PPC64 (Power model): load(Acquire) → ld + cmpw + bc + isync; store(Release) → lwsync + std.

Generic rate limiter: TokenBucket (defined in Section 23.1) is the standard rate-limiting primitive. Subsystems requiring rate limiting (FMA audit, ML policy, TTY signal injection) should use this type rather than reimplementing token bucket logic.
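The canonical TokenBucket lives in Section 23.1 (not reproduced here); the following is a minimal user-space sketch of the classic algorithm it implements, with field and method names that are assumptions of this sketch, not the Section 23.1 API. Time is injected as a nanosecond argument so the clock source stays pluggable.

```rust
/// Minimal token-bucket sketch (illustrative; see Section 23.1 for the
/// canonical kernel type). Tokens refill continuously at `rate_per_sec`
/// up to `burst`; `try_acquire` spends one token or reports rate-limited.
struct TokenBucket {
    tokens: f64,
    burst: f64,
    rate_per_sec: f64,
    last_ns: u64,
}

impl TokenBucket {
    fn new(rate_per_sec: f64, burst: f64, now_ns: u64) -> Self {
        // Start full so an initial burst is allowed.
        Self { tokens: burst, burst, rate_per_sec, last_ns: now_ns }
    }

    fn try_acquire(&mut self, now_ns: u64) -> bool {
        let elapsed_s = now_ns.saturating_sub(self.last_ns) as f64 / 1e9;
        self.last_ns = now_ns;
        self.tokens = (self.tokens + elapsed_s * self.rate_per_sec).min(self.burst);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false // rate-limited: caller drops or defers the event
        }
    }
}

fn main() {
    let mut tb = TokenBucket::new(10.0, 2.0, 0); // 10 tokens/s, burst of 2
    assert!(tb.try_acquire(0));            // burst token 1
    assert!(tb.try_acquire(0));            // burst token 2
    assert!(!tb.try_acquire(0));           // bucket empty: limited
    assert!(tb.try_acquire(100_000_000));  // +0.1 s refills one token
}
```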

3.11.5 Cgroup Integration

Workqueue threads are subject to cgroup CPU bandwidth accounting (Section 7.6). When a workqueue is created on behalf of a specific cgroup (e.g., a cgroup-scoped memory reclaim kthread), the worker threads are placed into that cgroup's cpu controller hierarchy. This ensures that deferred work charged to a container consumes the container's CPU budget, not the root cgroup's.

Accounting rules:
- Work items submitted by a task inherit that task's cgroup context. The worker thread temporarily associates with the submitter's cgroup for the duration of the work item execution (cgroup_work_enter(submitter_css) / cgroup_work_exit()). CPU time consumed is charged to the submitter's cpu.stat.
- System-global workqueues (e.g., umkad-rcu-gp, umkad-mm-compact) are placed in the root cgroup and are not subject to per-container bandwidth limits.
- Per-cgroup workqueue thread counts are visible via /sys/fs/cgroup/<path>/cpu.workqueue_threads.


3.12 IRQ Chip and irqdomain Hierarchy

Hardware interrupt controllers form a cascade hierarchy: a root controller (APIC on x86-64, GIC on AArch64/ARMv7, PLIC on RISC-V, OpenPIC/XIVE on PPC, EIOINTC on LoongArch64) or architectural interrupt mechanism (PSW-swap on s390x) may cascade into secondary controllers (GPIO expanders, PCIe MSI controllers, I2C interrupt expanders). The irqdomain abstraction maps hardware interrupt numbers to UmkaOS internal software IRQ numbers and routes them through the correct chip-level handlers.

UmkaOS improvement over Linux: Linux uses generic irq_domain_ops function tables with void-pointer driver data, providing no type safety. UmkaOS uses typed Rust traits (IrqChip, IrqDomain) with concrete implementations per controller type. The cascade hierarchy is explicit at compile time, eliminating class-of-bug where a wrong ops table is installed.

3.12.1 Core Types

/// Hardware interrupt number — chip-local, not globally unique.
pub type HwIrq = u32;

/// UmkaOS global software IRQ number — globally unique, allocated sequentially.
///
/// **Longevity analysis (50-year counter policy)**: SwIrq is allocated
/// monotonically via `IrqTable::next_free: AtomicU32` during device probe.
/// IRQ numbers are recycled via `IrqDomain::free()` on hot-unplug; the
/// allocator scans for free slots starting from `next_free`. At 1000
/// hot-plug cycles/second (extreme server scenario), monotonic exhaustion
/// of the u32 space would occur in ~49.7 days. In practice:
/// - Hot-plug rates are typically <10/second (years to exhaustion).
/// - IRQ numbers ARE recycled (free returns slots to the pool).
/// - With recycling, the counter wraps safely (wrapping arithmetic
///   finds free slots by scanning from the wrapped value).
/// - Total IRQ count per system is typically <10K (well within u32).
/// If a future system requires >4B lifetime IRQ allocations without
/// recycling, widen to u64 and update `IrqTable::next_free` accordingly.
pub type SwIrq = u32;

/// Software IRQs 0-31 are reserved for architecture-specific use (NMI, MCE, etc.).
/// Dynamically allocated software IRQs start at this value.
pub const IRQ_FIRST_DYNAMIC: SwIrq = 32;

/// Sentinel value: no IRQ assigned.
pub const IRQ_INVALID: SwIrq = u32::MAX;

/// IRQ trigger type, set per-IRQ at probe time from device tree or ACPI _CRS.
/// Matches Linux `include/linux/irq.h` `IRQ_TYPE_*` values for DT compatibility.
#[repr(u8)]
pub enum IrqTrigger {
    /// Platform default / driver doesn't care (Linux `IRQ_TYPE_NONE`).
    /// The interrupt controller uses its hardware default or device-tree
    /// specified trigger type. Many DT-probed drivers specify `IRQ_TYPE_NONE`
    /// and rely on the platform to configure the correct trigger.
    PlatformDefault = 0,
    RisingEdge  = 1,
    FallingEdge = 2,
    BothEdges   = 3,
    LevelHigh   = 4,
    LevelLow    = 5,
}

/// Return value from an interrupt handler.
#[repr(u8)]
pub enum IrqReturn {
    /// IRQ was serviced; EOI will be sent.
    Handled    = 0,
    /// IRQ was not for this handler (spurious); do not send EOI.
    NotHandled = 1,
    /// Primary handler requests the threaded handler be woken (irq/N kthread).
    WakeThread = 2,
}

/// Chip-specific per-IRQ data stored inline in IrqDescriptor to avoid
/// pointer chasing on the hot interrupt path.
pub struct IrqChipData {
    pub hw_irq: HwIrq,
    /// MMIO base address of the interrupt controller (architecture-specific).
    pub regs:   *mut u8,
    /// Bit mask for this IRQ within its controller register (e.g., 1 << pin_index).
    pub mask:   u32,
}

/// Per-IRQ state descriptor. One instance per allocated software IRQ number.
pub struct IrqDescriptor {
    pub sw_irq:       SwIrq,
    pub hw_irq:       HwIrq,
    pub trigger:      IrqTrigger,
    pub chip:         &'static dyn IrqChip,
    pub chip_data:    IrqChipData,
    /// IrqDomains are registered at boot/module-load and never freed during
    /// normal operation, so a static reference suffices and avoids atomic
    /// refcount overhead on every interrupt.
    pub domain:       &'static dyn IrqDomain,
    /// Primary handler (always in atomic/IRQ context, must not sleep).
    /// Returns `WakeThread` to schedule the threaded handler (if registered).
    /// Returns `Handled` if the interrupt was fully serviced in hardirq context.
    ///
    /// SAFETY: `handler` and `handler_data` are registered together by `request_irq()`.
    /// Type erasure via `*mut ()` is safe because the IRQ subsystem never calls a
    /// handler with data from a different registration. The handler pointer is
    /// immutable after registration (updates require `free_irq()` + `request_irq()`).
    pub handler:      Option<fn(SwIrq, *mut ()) -> IrqReturn>,
    /// Threaded handler (runs in kthread context, may sleep and allocate).
    /// Scheduled when the primary handler returns `WakeThread`. The IRQ line
    /// remains masked at the chip until the threaded handler completes.
    /// `None` for pure hardirq-only handlers. Same safety invariant as `handler`.
    pub threaded_handler: Option<fn(SwIrq, *mut ()) -> IrqReturn>,
    pub handler_data: *mut (),
    /// Debug-only type tag for handler_data. Stores the TypeId of the concrete
    /// type passed to `request_irq()`, enabling runtime detection of mismatched
    /// data types during IRQ re-registration (e.g., driver reload with a
    /// different handler data type). Zero-cost in release builds.
    #[cfg(debug_assertions)]
    pub handler_data_type_id: Option<core::any::TypeId>,
    /// Action flags controlling handler behavior.
    pub flags:        IrqActionFlags,
    /// Serializes enable/disable/set_type operations on this descriptor.
    action_lock:      SpinLock<()>,
    /// Nesting depth counter: 0 = enabled, N > 0 = disabled N times.
    depth:            AtomicU32,
}

bitflags::bitflags! {
    /// Flags for IRQ handler registration (mirrors Linux IRQF_* flags).
    pub struct IrqActionFlags: u32 {
        /// Handler runs in a dedicated kthread (threaded IRQ).
        const THREADED    = 1 << 0;
        /// IRQ may be shared between multiple devices.
        const SHARED      = 1 << 1;
        /// Request a one-shot handler: IRQ stays masked until threaded handler completes.
        const ONESHOT     = 1 << 2;
        /// Do not disable this IRQ during suspend.
        const NO_SUSPEND  = 1 << 3;
    }
}

/// Interrupt chip operations. All methods are called with preemption disabled.
/// The chip is responsible for managing the physical hardware registers.
pub trait IrqChip: Send + Sync {
    /// Acknowledge (clear) a pending edge-triggered interrupt at the chip.
    /// Called immediately after the interrupt is claimed.
    fn ack(&self, data: &IrqChipData);

    /// Mask (disable) this IRQ line at the chip. Prevents further interrupt delivery.
    fn mask(&self, data: &IrqChipData);

    /// Unmask (enable) this IRQ line at the chip.
    fn unmask(&self, data: &IrqChipData);

    /// Configure trigger type. Returns `Err(KernelError::NotSupported)` if the
    /// hardware does not support the requested trigger type for this line.
    fn set_type(&self, data: &IrqChipData, trigger: IrqTrigger)
        -> Result<(), KernelError>;

    /// Set SMP affinity: route this IRQ to the specified CPUs.
    /// Returns `Ok(())` without effect if the chip does not support affinity
    /// (e.g., legacy 8259A PIC with fixed routing).
    fn set_affinity(&self, data: &IrqChipData, mask: &CpuMask)
        -> Result<(), KernelError>;

    /// Send End-Of-Interrupt to allow the next interrupt from this line.
    /// Default implementation: calls `unmask`, which is correct for edge-triggered
    /// controllers. Level-triggered controllers must override this to send a
    /// chip-specific EOI command before unmasking.
    fn eoi(&self, data: &IrqChipData) {
        self.unmask(data);
    }
}

/// Hardware IRQ specification as encoded in a device tree `interrupts` property
/// or ACPI `_CRS` extended interrupt descriptor. The interpretation is
/// domain-specific (each IrqDomain implementation defines the encoding).
///
/// For GIC (AArch64/ARMv7): cells[0] = IRQ type (SPI=0, PPI=1), cells[1] = INTID,
///   cells[2] = trigger flags.
/// For APIC (x86-64): cells[0] = vector number.
/// For PLIC (RISC-V): cells[0] = source number, cells[1] = trigger flags.
/// For s390x: cells[0] = ISC (0-7), cells[1] = subchannel ID (SCHID).
/// For EIOINTC (LoongArch64): cells[0] = vector number (0-255), cells[1] = trigger flags.
///
/// The `cells` array is a uniform encoding regardless of whether the source
/// is DTB (`interrupts` property) or ACPI (`_CRS` extended interrupt descriptor).
/// The IrqDomain implementation for each controller knows its source format
/// and interprets the cells accordingly — no discriminant field is needed
/// because a system uses either DTB or ACPI, never both simultaneously for
/// the same controller.
pub struct IrqSpec {
    pub cells: [u32; 3],
}

/// An irqdomain: maps hardware IRQ numbers to software IRQs for one controller.
/// Domains form a tree; `parent()` returns the upstream domain for cascades.
pub trait IrqDomain: Send + Sync {
    /// Translate a hardware IRQ specification (from DT `interrupts` or ACPI _CRS)
    /// into a hardware IRQ number for this domain.
    fn translate(&self, spec: &IrqSpec) -> Result<HwIrq, KernelError>;

    /// Allocate software IRQ(s) and create descriptors for a range of hardware IRQs.
    /// Called during device probe. Returns the first allocated SwIrq.
    /// `count` is typically 1 for regular IRQs, N for MSI-X vectors.
    fn alloc(
        &self,
        hw_irq:    HwIrq,
        count:     u32,
        chip_data: IrqChipData,
    ) -> Result<SwIrq, KernelError>;

    /// Free a previously allocated IRQ range, releasing software IRQ numbers.
    fn free(&self, sw_irq: SwIrq, count: u32);

    /// Activate: perform final hardware-side setup (e.g., program MSI address/data
    /// registers in PCIe config space). Called after alloc, before unmasking.
    fn activate(&self, desc: &mut IrqDescriptor) -> Result<(), KernelError>;

    /// Deactivate: reverse of activate (e.g., mask MSI in PCIe config space).
    /// Called before free.
    fn deactivate(&self, desc: &IrqDescriptor);

    /// Parent domain in the cascade hierarchy. `None` for root domains.
    /// Returns `&'static` because IrqDomains are registered at boot/module-load
    /// and never freed during normal operation — consistent with the `IrqChip`
    /// pattern (`&'static dyn IrqChip`). Avoids atomic refcount overhead (Arc
    /// clone + drop) on the IRQ dispatch hot path during hierarchical cascading.
    fn parent(&self) -> Option<&'static dyn IrqDomain>;
}

/// Global mapping from software IRQ number to IrqDescriptor. O(1) lookup by SwIrq.
///
/// The table is sized at boot based on the total IRQ count reported by all root domains.
/// No runtime resizing occurs after initialization.
pub struct IrqTable {
    /// Indexed by SwIrq. `None` = not allocated.
    ///
    /// **Collection policy exemption**: Uses a flat `Box<[Option<&'static ...>]>` array
    /// instead of XArray because the key space is dense (0..nr_irqs), sized once
    /// at boot, and never resized. A flat array gives true O(1) indexed access
    /// without XArray's radix-tree indirection overhead on the IRQ fast path.
    ///
    /// **`&'static` rationale**: IrqDescriptors are allocated at probe time from a
    /// dedicated slab and never freed during normal operation (same lifetime as
    /// IrqChip and IrqDomain). The IRQ receipt flow (step 4) indexes into this
    /// array on every interrupt — using `&'static` avoids atomic refcount overhead
    /// (Arc clone + drop: two atomic RMW ops per interrupt). Registration and
    /// deregistration paths (`request_irq()` / `free_irq()`) use the IrqTable's
    /// `lock` for serialization; no Arc is needed for concurrent access safety
    /// because the flow handler only borrows the descriptor for the interrupt
    /// duration (preemption disabled, single-CPU access).
    descriptors: Box<[Option<&'static IrqDescriptor>]>,
    next_free:   AtomicU32,
    /// Serializes `alloc` and `free` operations only. Does NOT protect lookups.
    lock:        SpinLock<()>,
}

pub static IRQ_TABLE: OnceCell<IrqTable> = OnceCell::new();
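The longevity numbers in the SwIrq doc comment can be verified with a few lines of user-space arithmetic (helper name is illustrative): a u32 allocator without recycling exhausts in 2^32 / rate seconds.

```rust
/// Days until a monotonic (no-recycling) u32 allocator exhausts its
/// space at a given allocation rate.
fn days_to_exhaust_u32(allocs_per_sec: u64) -> f64 {
    (u32::MAX as f64 + 1.0) / allocs_per_sec as f64 / 86_400.0
}

fn main() {
    // Extreme server scenario from the doc comment: 1000 hot-plug
    // cycles/second exhausts in ~49.7 days without recycling.
    let extreme = days_to_exhaust_u32(1000);
    assert!((extreme - 49.7).abs() < 0.1);
    // Typical rate (<10/second): more than a decade to exhaustion,
    // and recycling via IrqDomain::free() makes wrap a non-issue.
    assert!(days_to_exhaust_u32(10) > 4_900.0);
}
```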

3.12.2 Root Domain Implementations Per Architecture

Each architecture instantiates a root IrqDomain at boot. Secondary domains (MSI, GPIO) are created by their subsystems and parent to the root domain.

Architecture Root Controller HW IRQ Range Notes
x86-64 Local APIC + IOAPIC vectors 0-255 Legacy PIC (8259A) disabled at boot
x86-64 PCI MSI dynamically allocated MsiIrqDomain, parents to APIC
AArch64 GIC-v3 SPIs 32-1019 PPIs (16-31) per-CPU; SGIs (0-15) for IPI
ARMv7 GIC-v2 SPIs 32-1019 Same numbering as GIC-v3
RISC-V PLIC or APLIC sources 1-1023 PLIC: legacy; APLIC: modern (wired + MSI modes). See APLIC note below
PPC32 OpenPIC IRQs 0-511 External + internal sources
PPC64LE XICS / XIVE IRQs 0-65535 XICS: pseries default + POWER8 bare-metal (via hcalls or OPAL); XIVE: POWER9+ bare-metal or pseries with CAS negotiation. Event queue per CPU on XIVE; ICP/ICS model on XICS
s390x PSW-swap (architectural) ISC 0-7 × subchannels No external controller; lowcore PSW pairs per interrupt class; I/O floats via ISC masks in CR6
LoongArch64 EIOINTC vectors 0-255 Per-CPU routing via IOCSR registers; LIOINTC cascades for legacy devices

All platforms also support:
- GpioIrqDomain: GPIO pins as IRQ sources; parents to the platform root domain. Created by the GPIO controller driver at probe time.
- MsiIrqDomain: PCI MSI and MSI-X vectors; parents to the platform root domain. Created by the PCIe port driver at enumeration.

3.12.3 IRQ Receipt Flow

The full path from hardware exception to handler completion:

1. CPU receives hardware interrupt → architecture-specific entry stub
   (x86-64: IDT vector handler; AArch64: VBAR_EL1 IRQ vector;
    ARMv7: IRQ vector table; RISC-V: stvec trap handler;
    PPC32: external interrupt vector; PPC64LE: XICS/XIVE interrupt;
    s390x: PSW swap loads new PSW from lowcore; LoongArch64: EIOINTC vector)

2. arch::current::interrupts::handle_irq() is called from the entry stub
   with preemption disabled and interrupts masked.

3. Claim interrupt from controller:
   - x86-64:      vector is encoded in the IDT entry; no separate claim needed
   - AArch64:     read GIC_IAR1 to claim and get INTID
   - RISC-V:      read PLIC claim/complete register to claim
   - PPC32:       read OpenPIC IACK register
   - PPC64LE:     XIVE: pushes INTID to per-CPU event queue;
                   XICS: read XIRR via H_XIRR hcall (pseries) or ICP MMIO (powernv)
   - s390x:       implicit (PSW swap is the claim); read SCHID from lowcore I/O
                   interruption code area, translate (SCHID, ISC) → SwIrq
   - LoongArch64: read EIOINTC pending status register to get vector number

4. Look up IrqDescriptor: IRQ_TABLE[sw_irq].

5. desc.chip.ack(&desc.chip_data)
   Edge-triggered: clears pending interrupt at the controller.
   Level-triggered: no-op here; level is cleared by the device itself.

6. Call desc.handler(sw_irq, desc.handler_data):
   → Handled:    proceed to EOI (step 7)
   → NotHandled: spurious interrupt; log KERN_DEBUG, skip EOI for this handler
   → WakeThread: wake irq/{sw_irq} kthread (SCHED_FIFO, priority 50), proceed to EOI

7. desc.chip.eoi(&desc.chip_data)
   Sends End-Of-Interrupt; allows next interrupt from this line.

8. Return from exception → scheduler preemption check (if preempt_count == 0).

Threaded IRQ flow (when handler returns WakeThread):

The irq/{sw_irq} kthread (created at IRQ registration time) blocks on a WaitQueueHead. On WakeThread, the scheduler wakes the kthread, which:
1. Runs the threaded handler function in process context (may sleep, may allocate)
2. Calls desc.chip.unmask(&desc.chip_data) after the threaded handler returns
3. Returns to blocking on the WaitQueueHead

The primary handler is stored in IrqDescriptor.handler; the threaded handler in IrqDescriptor.threaded_handler. The IrqActionFlags::THREADED flag indicates that a threaded handler is registered. When both are present, the primary handler is called in hardirq context and returns WakeThread; the kthread then invokes the threaded handler, which runs the actual device service routine in process context.
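Steps 5-7 of the receipt flow can be sketched with a mock chip that records its register-level calls, making the ack → handler → EOI ordering and the three IrqReturn routes testable. The mock and its method names are assumptions of this sketch; only IrqReturn mirrors the type defined above.

```rust
use std::cell::RefCell;

/// Mock interrupt chip: records calls so the dispatch ordering is visible.
#[derive(Default)]
struct MockChip {
    log: RefCell<Vec<&'static str>>,
}

impl MockChip {
    fn ack(&self) { self.log.borrow_mut().push("ack"); }
    fn eoi(&self) { self.log.borrow_mut().push("eoi"); }
    fn wake_thread(&self) { self.log.borrow_mut().push("wake irq kthread"); }
}

enum IrqReturn { Handled, NotHandled, WakeThread }

/// Sketch of steps 5-7: ack, call primary handler, route on its return.
fn handle_irq(chip: &MockChip, primary: impl Fn() -> IrqReturn) {
    chip.ack(); // step 5: edge-triggered — clear pending at the controller
    match primary() {
        // step 6 → step 7: fully serviced in hardirq context, send EOI
        IrqReturn::Handled => chip.eoi(),
        // spurious: log KERN_DEBUG, skip EOI for this handler
        IrqReturn::NotHandled => {}
        // wake irq/{sw_irq} kthread, then EOI; line stays masked until
        // the threaded handler completes (ONESHOT semantics)
        IrqReturn::WakeThread => { chip.wake_thread(); chip.eoi(); }
    }
}

fn main() {
    let chip = MockChip::default();
    handle_irq(&chip, || IrqReturn::Handled);
    assert_eq!(*chip.log.borrow(), vec!["ack", "eoi"]);

    let chip2 = MockChip::default();
    handle_irq(&chip2, || IrqReturn::WakeThread);
    assert_eq!(*chip2.log.borrow(), vec!["ack", "wake irq kthread", "eoi"]);
}
```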

RISC-V APLIC MSI mode: level-sensitive re-assertion (errata RiscvErrata::APLIC_LEVEL_MSI):

RISC-V APLIC (Advanced Platform-Level Interrupt Controller) operates in two modes:
- Direct mode: wired interrupt delivery, similar to PLIC. Level-sensitive sources remain pending while asserted — no special handling needed.
- MSI mode: interrupts are delivered as MSI writes to an IMSIC (Incoming MSI Controller). This is the preferred mode for scalable multi-hart systems.

In MSI mode, level-sensitive interrupts have a fundamental semantic mismatch: an MSI is an edge event (a single write), but a level-sensitive source stays asserted until the device deasserts. If the ISR services the device (clearing the level) but the device re-asserts before the ISR completes the EOI sequence, the re-assertion is lost — the APLIC already delivered the MSI and will not send another until the level drops and rises again. This causes permanent interrupt loss for the affected source.

Mandatory workaround: After the ISR clears the device's interrupt condition, the EOI handler must re-read the APLIC source's ip[n] (interrupt pending) bit. If the source has re-asserted between device clear and EOI, the APLIC IrqChip::eoi() implementation must re-trigger the MSI manually by writing the APLIC setipnum register:

EOI sequence for APLIC MSI mode (level-sensitive sources):
  1. ISR services device → device deasserts interrupt line
  2. Read APLIC sourcecfg[n].sm to confirm level-sensitive
  3. Read APLIC ip[n] bit → if set, source has re-asserted
  4. If re-asserted: write n to APLIC setipnum → triggers new MSI to IMSIC
  5. Complete EOI

This re-assertion check is mandatory for ALL level-sensitive sources when APLIC operates in MSI mode. Without it, any level-sensitive device (GPIO edge-to-level converters, I2C interrupt expanders, legacy PCI INTx) can permanently lose interrupts. The errata flag RiscvErrata::APLIC_LEVEL_MSI gates this workaround (set on all APLIC implementations that support MSI mode, since this is a specification-level issue, not a silicon bug).
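The re-assertion check can be exercised against a simulated controller. The register names (ip, setipnum) follow the text above; the SimAplic type, its fields, and the MSI counter are assumptions of this sketch, just detailed enough to show the lost-edge recovery.

```rust
use std::cell::Cell;

/// Simulated APLIC in MSI mode: one pending bit per source, plus a count
/// of MSI writes delivered to the IMSIC (illustrative model only).
struct SimAplic {
    ip: Cell<u64>,       // interrupt-pending bits, one per source
    msi_sent: Cell<u32>, // MSIs delivered to the IMSIC so far
}

impl SimAplic {
    fn new() -> Self {
        Self { ip: Cell::new(0), msi_sent: Cell::new(0) }
    }
    /// Device asserts its level → source becomes pending.
    fn assert_level(&self, n: u32) {
        self.ip.set(self.ip.get() | (1 << n));
    }
    /// Writing setipnum pends the source and, in MSI mode, fires an MSI.
    fn write_setipnum(&self, n: u32) {
        self.assert_level(n);
        self.msi_sent.set(self.msi_sent.get() + 1);
    }
    /// EOI for a level-sensitive source in MSI mode (workaround steps 3-4):
    /// if the source re-asserted after the ISR cleared the device, the
    /// MSI for that edge is already lost — re-trigger it manually.
    fn eoi_level_msi(&self, n: u32) {
        if self.ip.get() & (1 << n) != 0 {
            self.write_setipnum(n); // step 4: manual re-trigger to IMSIC
        }
    }
}

fn main() {
    // Device re-asserts between the ISR's device-clear and the EOI:
    let aplic = SimAplic::new();
    aplic.assert_level(3);
    aplic.eoi_level_msi(3);
    assert_eq!(aplic.msi_sent.get(), 1); // lost edge recovered via setipnum

    // No re-assertion: EOI completes without sending anything.
    let quiet = SimAplic::new();
    quiet.eoi_level_msi(3);
    assert_eq!(quiet.msi_sent.get(), 0);
}
```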


3.13 Collection Usage Policy

Kernel code must choose collection types based on access path criticality:

Path Class Examples Allowed Collections Heap Allocation
Hot (per-syscall, per-packet, scheduler tick, IRQ) runqueue, routing lookup, page fault handler ArrayVec, static arrays, slab objects, per-CPU pools, XArray/radix tree (integer-keyed) Forbidden (XArray uses pre-allocated nodes from slab)
Warm (per-operation, bounded frequency) driver init, mount, cgroup create, device probe XArray (integer-keyed), BTreeMap (non-integer ordered keys), ArrayVec, bounded Vec, Idr Bounded (max N known at design time)
Cold (boot, config load, debug, admin) ACPI parsing, module load, sysfs population HashMap, Vec, BTreeMap, String Acceptable
RCU read-heavy routing table, capability cache, module registry, page cache RcuHashMap, RcuIdr, RcuList, XArray (RCU-protected) Writers allocate; readers lock-free

Radix tree / XArray vs BTreeMap selection:

Criterion XArray / Radix Tree BTreeMap
Key type Integer only (u64, page index, PID) Any Ord type
Lookup complexity O(1) — bounded depth (at most 11 levels for a full 64-bit key at 6 bits/level) O(log N)
RCU-compatible reads Yes — slot-level RCU, lock-free reads No — requires external RCU wrapper
Cache behavior Excellent — 64-way fanout, dense subtrees collapse Good — B-tree nodes are cache-line-friendly
Sparse keys Efficient — empty subtrees not allocated Efficient — only present keys stored
Ordered iteration Yes (by integer key) Yes (by Ord key)
Hot-path suitability Yes — all paths Non-integer keys or range queries only

Rule: For all integer-keyed mappings — hot, warm, or cold — use XArray or Idr (which is built on XArray). There is no performance reason to use BTreeMap or HashMap with integer keys on any path: XArray is O(1) with better cache behavior, native RCU read support, and ordered iteration. BTreeMap is reserved for:
- Non-integer keys (String, [u8; N], enum types, composite structs)
- Composite keys where ordered iteration by the composite is needed (e.g., (deadline, task_id) for deadline trees)
- Range queries that require BTreeMap::range() (e.g., IOVA containment lookup, I/O elevator seek-distance merge)

HashMap is reserved for non-integer keys on cold paths or RCU-protected writer paths. For integer keys, HashMap is never acceptable — use XArray.

Rules:
1. Hot-path structs must have O(1) or O(log N) access with bounded N.
2. No heap allocation under spinlock or with IRQs disabled.
3. Integer key → XArray. Always. No exceptions for "warm" or "cold" paths.
4. HashMap is only acceptable in cold paths with non-integer keys, or RCU writer paths.
5. Vec is acceptable when maximum size is known and documented.
6. BTreeMap is for non-integer ordered keys or integer keys requiring range queries.

Documented exemptions:
- BTreeMap<u64, IommuMapping> in IOVA management — requires range(..=addr) for containment lookup. The hardware IOMMU page table handles hot-path DMA translation.
- BTreeMap<Lba, IoRequest> in the I/O elevator — requires ordered iteration and range merge queries for seek-distance minimization on rotational media.
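The IOVA exemption illustrates the one thing XArray cannot express: range(..=addr).next_back() finds the mapping whose base is the greatest key not above addr, then a bounds check confirms containment. The struct fields below are illustrative, keyed by base address as in the exemption.

```rust
use std::collections::BTreeMap;

/// Illustrative IOVA mapping: keyed by `base` in the tree.
struct IommuMapping {
    base: u64,
    len: u64,
}

/// Containment lookup: which mapping (if any) covers `addr`?
fn lookup(map: &BTreeMap<u64, IommuMapping>, addr: u64) -> Option<&IommuMapping> {
    map.range(..=addr)
        .next_back()                        // greatest base <= addr
        .map(|(_, m)| m)
        .filter(|m| addr < m.base + m.len)  // addr within [base, base+len)
}

fn main() {
    let mut m = BTreeMap::new();
    m.insert(0x1000, IommuMapping { base: 0x1000, len: 0x1000 });
    m.insert(0x8000, IommuMapping { base: 0x8000, len: 0x2000 });

    assert!(lookup(&m, 0x1800).is_some()); // inside the first mapping
    assert!(lookup(&m, 0x3000).is_none()); // gap between mappings
    assert!(lookup(&m, 0x9fff).is_some()); // last byte of second mapping
    assert!(lookup(&m, 0xa000).is_none()); // one past the end
}
```

A hash map cannot answer "greatest key not above addr" without a full scan, which is exactly why this case is an exemption rather than a rule violation.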


3.14 Error Handling and Fault Containment

3.14.1 Kernel Error Model

Linux problem: Kernel functions return negative integers (-ENOMEM, -EINVAL) for errors, sometimes stuffed into pointers via ERR_PTR(). The type system enforces nothing — callers can silently ignore errors, confuse pointers with error pointers, or propagate the wrong errno. Unchecked kmalloc returns and missing error propagation are endemic.

UmkaOS design: All kernel-internal functions return Result<T, KernelError>. The ? operator propagates errors up the call stack. The Rust type system makes it impossible to silently ignore an error — no integer error codes, no sentinel values, no ERR_PTR.

/// Canonical kernel error type. All umka-core and driver-facing
/// kernel functions return `Result<T, KernelError>`.
#[non_exhaustive]
#[repr(u32)]
pub enum KernelError {
    OutOfMemory       = 1,   // Physical or virtual memory exhausted
    InvalidCapability = 2,   // Capability handle missing or revoked
    PermissionDenied  = 3,   // Capability lacks required permission bits
    InvalidArgument   = 4,   // Syscall argument out of range
    DeviceError       = 5,   // Device error or device in error state
    Timeout           = 6,   // Operation timed out
    WouldBlock        = 7,   // Non-blocking I/O: operation would block (passive wait —
                             // the caller need not do anything special before retrying;
                             // the resource will become available on its own, e.g.,
                             // O_NONBLOCK socket with no data, pipe with full buffer).
    NotFound          = 8,   // Requested object does not exist
    AlreadyExists     = 9,   // Object or resource already exists
    Interrupted       = 10,  // Operation interrupted by signal
    IoError           = 11,  // Generic I/O error (disk, network, DMA)
    ResourceBusy      = 12,  // Resource in use (EBUSY) — distinct from WouldBlock
    NoSpace           = 13,  // No space left on device (ENOSPC) — distinct from OutOfMemory
    NotSupported      = 14,  // Operation not supported (ENOSYS/EOPNOTSUPP)
    CrossDevice       = 15,  // Cross-device link (EXDEV) — needed by VFS rename
    TryAgain          = 16,  // Transient resource pressure: action needed before retry
                             // (e.g., memory pressure — trigger reclaim then retry;
                             // congestion — back off then retry). Unlike WouldBlock,
                             // the caller must DO something before the retry will
                             // succeed. Both map to EAGAIN for Linux ABI compat.
    // Extensible: new variants added at end, values are stable.
    // #[non_exhaustive] ensures forward compatibility: match sites use
    // `_ =>` with `#[allow(non_exhaustive_omitted_patterns, reason = "forward compat")]`.
    // KABI error translation happens at the syscall layer — drivers receive
    // KernelError but never need exhaustive matching of kernel-internal variants.
}

POSIX errno mapping: The syscall entry point (Section 19.1) converts KernelError to POSIX errno values at the syscall boundary — the only place integer error codes exist:

| KernelError | POSIX errno | Value |
|---|---|---|
| OutOfMemory | ENOMEM | 12 |
| InvalidCapability | EBADF | 9 |
| PermissionDenied | EPERM / EACCES | 1 / 13 |
| InvalidArgument | EINVAL | 22 |
| DeviceError | EIO | 5 |
| Timeout | ETIMEDOUT | 110 |
| WouldBlock | EAGAIN | 11 |
| NotFound | ENOENT | 2 |
| AlreadyExists | EEXIST | 17 |
| Interrupted | EINTR | 4 |
| IoError | EIO | 5 |
| ResourceBusy | EBUSY | 16 |
| NoSpace | ENOSPC | 28 |
| NotSupported | ENOSYS / EOPNOTSUPP | 38 / 95 |
| CrossDevice | EXDEV | 18 |
| TryAgain | EAGAIN | 11 |

Some variants map to different errnos depending on context (PermissionDenied becomes EPERM for capability operations, EACCES for filesystem operations). The translation is handled by the syscall dispatch layer, not by the originating subsystem.
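A minimal sketch of this boundary translation, using illustrative names (`errno_for`, `ErrCtx`) and a subset of variants. The context parameter captures the PermissionDenied split; the negative return value follows the negative-errno register convention (Section 3.14.4):

```rust
/// Context in which the error was raised; picks between errnos that
/// share a KernelError variant. (Illustrative name, not the real type.)
#[derive(Clone, Copy)]
pub enum ErrCtx { Capability, Filesystem, Other }

/// Subset of KernelError variants for this sketch.
#[derive(Clone, Copy)]
pub enum KernelError { OutOfMemory, InvalidCapability, PermissionDenied, WouldBlock, TryAgain }

/// Translate KernelError to a negative POSIX errno at the syscall
/// boundary, the only place integer error codes exist.
pub fn errno_for(e: KernelError, ctx: ErrCtx) -> i64 {
    let errno = match e {
        KernelError::OutOfMemory => 12,       // ENOMEM
        KernelError::InvalidCapability => 9,  // EBADF
        KernelError::PermissionDenied => match ctx {
            ErrCtx::Capability => 1,          // EPERM: capability operations
            _ => 13,                          // EACCES: filesystem operations
        },
        // Both variants collapse to EAGAIN for Linux ABI compatibility.
        KernelError::WouldBlock | KernelError::TryAgain => 11,
    };
    -errno
}
```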

WouldBlock vs TryAgain — advisory semantic distinction:

Both map to EAGAIN (11) for Linux ABI compatibility (EWOULDBLOCK == EAGAIN on all Linux platforms). The kernel-internal distinction is advisory and enables subsystem-specific retry logic:

| Variant | Semantic | Caller action | Example |
|---|---|---|---|
| WouldBlock | Passive wait — resource will become available on its own | Poll/epoll/retry without special action | O_NONBLOCK socket with empty receive buffer |
| TryAgain | Action needed — caller must do something before retry succeeds | Back off, trigger reclaim, release lock, etc. | Memory allocation under pressure (trigger reclaim first) |

Mandatory rule: every function that returns TryAgain must document in its doc comment what action the caller should take before retrying. A bare return Err(TryAgain) without guidance is a spec/code review violation — the distinction is useless if the caller does not know what to do differently.
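The mandatory-documentation rule can be illustrated with a toy allocator. The names (`PoolAlloc`, `reclaim`) are hypothetical, but the doc comment shows the required shape: it names the exact action the caller must take before retrying.

```rust
#[derive(Debug, PartialEq)]
pub enum KernelError { WouldBlock, TryAgain }

/// Toy page pool; stands in for any subsystem that returns TryAgain.
pub struct PoolAlloc { free_pages: usize }

impl PoolAlloc {
    /// Allocate one page from the pool.
    ///
    /// # Errors
    /// Returns `TryAgain` when the pool is exhausted. Before retrying,
    /// the caller MUST call `reclaim()`; a bare retry loop cannot
    /// succeed on its own. (This guidance is the mandatory rule.)
    pub fn alloc(&mut self) -> Result<(), KernelError> {
        if self.free_pages == 0 {
            return Err(KernelError::TryAgain);
        }
        self.free_pages -= 1;
        Ok(())
    }

    /// The documented pre-retry action: reclaim replenishes the pool.
    pub fn reclaim(&mut self) {
        self.free_pages += 4; // toy value; a real reclaimer frees what it can
    }
}
```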

3.14.2 Fault Containment Boundaries

UmkaOS has four fault containment domains. A fault in one domain does not propagate to domains above it in the hierarchy:

| Domain | Failure scope | Recovery |
|---|---|---|
| umka-core (Tier 0) | Kernel panic — entire system | Reboot (same as Linux) |
| Tier 1 driver (domain-isolated) | Single driver crash | Automatic restart, device FLR (Section 11.9) |
| Tier 2 driver (process) | Single driver process crash | Automatic restart (Section 11.9) |
| Userspace process | Single process terminated | Application-level recovery |

Linux problem: Linux has exactly one fault domain for the entire kernel. A null dereference in an obscure USB driver is indistinguishable from a bug in the scheduler — both trigger the same kernel panic. The only containment boundary is kernel vs. userspace.

UmkaOS design: The isolation model (Section 11.2) gives each Tier 1 driver its own isolation domain. When a CPU exception fires (page fault, general protection fault, divide-by-zero), umka-core's exception handler inspects the faulting context's isolation domain ID (architecture-specific: PKRU on x86, page table base on ARM/RISC-V — see arch::current::isolation::current_domain_id()):

  • Domain 0 (umka-core): The fault is in the trusted kernel. This is a genuine kernel panic — proceed to the panic handler (Section 3.14).
  • Domain 1-N (Tier 1 driver): The fault is in an isolated driver. The exception handler identifies the driver from the faulting domain ID, marks it as crashed, and invokes the crash recovery sequence (Section 11.9). The rest of the kernel continues running.

Tier 2 driver faults are even simpler: the driver runs in a separate address space, so a fault (SIGSEGV, SIGBUS, SIGFPE) terminates the driver process. The driver supervisor detects the exit and restarts it.
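The domain-based routing decision reduces to a match on the faulting domain ID. A minimal sketch, with `route_fault` and `FaultRoute` as illustrative names:

```rust
/// Where the exception handler sends a fault, by isolation domain.
#[derive(Debug, PartialEq)]
pub enum FaultRoute {
    /// Domain 0 is umka-core: a genuine kernel panic (Section 3.14).
    KernelPanic,
    /// Domains 1..=N are Tier 1 drivers: mark crashed, run recovery
    /// (Section 11.9) while the rest of the kernel keeps running.
    DriverRecovery { domain: u16 },
}

/// Route a CPU exception by the faulting context's isolation domain ID
/// (obtained architecture-specifically, e.g. from PKRU on x86).
pub fn route_fault(domain_id: u16) -> FaultRoute {
    match domain_id {
        0 => FaultRoute::KernelPanic,
        d => FaultRoute::DriverRecovery { domain: d },
    }
}
```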

3.14.3 Panic Handling

A kernel panic means a bug in umka-core itself — the small trusted computing base. This is the only code whose failure is fatal.

Panic sequence:

1. DISABLE INTERRUPTS — local CLI on the faulting CPU.
   NMI IPI broadcast to all other CPUs. Each CPU receiving the NMI executes
   the NMI panic handler, which:
     (a) Saves the current register context to a pre-allocated per-CPU crash buffer
         (allocated at boot, never freed, immune to OOM — one 4KB page per CPU).
     (b) Disables local interrupts (preventing further preemption or nested exceptions).
     (c) Spins on an atomic flag waiting for the panic coordinator (the faulting CPU).
   This ensures all CPUs are in a known-safe state before the coordinator reads
   system data structures. The NMI handler is NMI-safe — it uses no locks, no
   allocation, and no printk. It writes only to the pre-allocated crash buffer.
   Architecture-specific NMI delivery: On x86-64, this uses the APIC NMI delivery
   mode. On AArch64 (GICv3.3+), the GIC NMI mechanism (GICD_INMIR) is used; on
   older GIC implementations, a highest-priority FIQ is used as a pseudo-NMI (same
   technique as Linux's CONFIG_ARM64_PSEUDO_NMI). On RISC-V, sbi_send_ipi() with a
   dedicated panic IPI vector is used. **Limitation**: standard RISC-V supervisor
   interrupts are maskable — sstatus.SIE=0 blocks all supervisor interrupts
   regardless of AIA priority. If the target CPU has interrupts disabled, the IPI
   will not be delivered. Mitigation: the panic coordinator uses a 100 ms timeout
   per CPU; CPUs that do not respond are marked "unavailable" in the crash dump.
   On systems implementing the Smrnmi extension (Resumable Non-Maskable Interrupts),
   UmkaOS uses RNMI delivery instead, which is truly non-maskable. On PPC64LE, the OPAL
   opal_signal_system_reset() call triggers a system reset interrupt on target CPUs.
2. CAPTURE STATE — faulting CPU registers, stack backtrace (.eh_frame),
   per-CPU crash buffers (from step 1), key data structures (process list,
   cap table, driver registry, last 64KB klog)
3. SERIAL FLUSH — panic message + backtrace to serial (Tier 0, polled, always works)
4. CRASH DUMP — if configured, write ELF core dump to reserved memory region;
   if NVMe panic-write path registered, polled-mode write to disk (Section 11.7)
5. HALT — default halt (umka.panic=halt), or reboot (umka.panic=reboot)
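Step 1's coordinator-side wait can be sketched with an atomic parked-CPU counter and the per-CPU timeout. `PARKED_CPUS` and `wait_for_parked` are illustrative names; a real implementation would also record which CPUs failed to respond, not just how many:

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::{Duration, Instant};

/// Incremented by each CPU's NMI panic handler once its register
/// context is saved to the per-CPU crash buffer.
static PARKED_CPUS: AtomicUsize = AtomicUsize::new(0);

/// Coordinator (faulting CPU) waits for all other CPUs to park.
/// Returns the number of CPUs that never responded; those are marked
/// "unavailable" in the crash dump rather than blocking the panic.
pub fn wait_for_parked(other_cpus: usize, per_cpu_timeout: Duration) -> usize {
    let deadline = Instant::now() + per_cpu_timeout * other_cpus as u32;
    loop {
        let parked = PARKED_CPUS.load(Ordering::Acquire);
        if parked >= other_cpus {
            return 0; // all CPUs in a known-safe state
        }
        if Instant::now() >= deadline {
            return other_cpus - parked; // unavailable CPUs
        }
        spin_loop();
    }
}
```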

Driver panic vs. kernel panic: A panic!() inside Tier 1 driver code does NOT panic the kernel. The kernel is compiled with panic = "abort", so there is no stack unwinding. Instead, panic!() calls abort(), which executes an illegal instruction (ud2 on x86-64, udf on AArch64/ARMv7, unimp on RISC-V, trap on PPC). This triggers a CPU exception (invalid opcode / undefined instruction) within the driver's isolation domain. The exception handler identifies the faulting domain as a non-core driver domain and routes the fault to driver crash recovery (Section 11.9) — not to the kernel panic path.

OOM policy: When physical memory is exhausted, UmkaOS applies pressure in stages:

  1. Reclaim page cache: Clean pages are evicted immediately (no I/O cost). Dirty pages are written back and then evicted.
  2. Compress to CompressPool: Inactive anonymous pages are compressed and moved to the in-kernel compression tier (Section 4.12), reducing physical memory usage 2-3x.
  3. Swap to disk: If a swap device is configured, compressed pages that haven't been accessed spill to disk.
  4. OOM killer: If all of the above fail to free enough memory, the OOM killer selects a process to terminate. Heuristic: largest RSS, not marked OOM_SCORE_ADJ=-1000, not system-critical, not recently started. The selected process receives SIGKILL.

umka-core itself is never OOM-killed. Core kernel allocations draw from a reserved memory pool (configured at boot, default 64MB) that is excluded from the general-purpose allocator. If the reserved pool is exhausted — a symptom of a kernel memory leak — this is a kernel panic, not an OOM kill.
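The staged policy above can be sketched as a loop over the four stages, stopping at the first stage that frees enough pages. `apply_pressure` and `OomStage` are illustrative names; the per-stage freeing work is injected as a closure so the ordering logic stands alone:

```rust
/// The four OOM pressure stages, in escalation order.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum OomStage { ReclaimPageCache, Compress, SwapToDisk, OomKill }

/// Walk the stages until `try_stage` has freed `needed` pages in total.
/// Returns the stage that satisfied the request, or None if even the
/// OOM killer could not free enough. (Reserved-pool exhaustion in
/// umka-core is a separate panic path, not handled here.)
pub fn apply_pressure(
    needed: usize,
    mut try_stage: impl FnMut(OomStage) -> usize,
) -> Option<OomStage> {
    let mut freed = 0;
    for stage in [
        OomStage::ReclaimPageCache, // clean pages first: no I/O cost
        OomStage::Compress,         // CompressPool tier (Section 4.12)
        OomStage::SwapToDisk,       // only if a swap device is configured
        OomStage::OomKill,          // last resort: SIGKILL a victim
    ] {
        freed += try_stage(stage);
        if freed >= needed {
            return Some(stage);
        }
    }
    None
}
```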

3.14.4 Error Reporting to Userspace

Syscall error returns: Standard Linux ABI — negative errno in the return register (rax on x86-64, x0 on AArch64, a0 on RISC-V). Applications, glibc, and musl all work unmodified.

Extended error information: For complex failures where a single errno is insufficient (e.g., "which capability was invalid?" or "which device returned an error?"), UmkaOS provides a per-thread extended error buffer:

/// Per-thread extended error context, populated on syscall failure.
#[repr(C)]
pub struct ExtendedError {
    pub errno: i32,        // POSIX errno (same as syscall return)        offset 0
    pub subsystem: u32,    // Kernel subsystem that generated the error   offset 4
    pub detail_code: u32,  // Subsystem-specific detail code              offset 8
    pub _pad: [u8; 4],     // Explicit padding for u64 alignment of object_id (offset 12)
    pub object_id: u64,    // Related capability/device/inode ID (0 = N/A) offset 16
}
// Layout: 4 + 4 + 4 + 4(pad) + 8 = 24 bytes. Padding made explicit per CLAUDE.md rule 11.
const_assert!(size_of::<ExtendedError>() == 24);

Queried via prctl(PR_GET_EXTENDED_ERROR, &buf) — entirely optional. Applications that don't use it see standard errno behavior with zero overhead (the buffer is only written on error). The subsystem and detail_code fields are stable, allowing diagnostic tools to produce messages like "capability 0x3f revoked by generation advance" instead of "EBADF".

Subsystem ID registry (stable values, never renumbered):

| subsystem | Name | Example detail_code values |
|---|---|---|
| 0 | GENERIC | Generic errno, no subsystem-specific detail |
| 1 | CAPABILITY | 1=revoked, 2=generation_mismatch, 3=delegation_depth_exceeded, 4=type_mismatch |
| 2 | MEMORY | 1=oom_killed, 2=cgroup_limit, 3=mlock_limit, 4=huge_page_unavailable |
| 3 | VFS | 1=dentry_negative, 2=mount_readonly, 3=quota_exceeded, 4=xattr_limit |
| 4 | SCHEDULER | 1=affinity_conflict, 2=rt_bandwidth_exhausted, 3=cbs_throttled |
| 5 | NETWORK | 1=route_unreachable, 2=socket_buffer_full, 3=congestion_drop |
| 6 | STORAGE | 1=device_error, 2=dm_path_failed, 3=journal_aborted |
| 7 | DRIVER | 1=domain_crashed, 2=timeout, 3=probe_failed, 4=tier_demotion |
| 8 | SECURITY | 1=lsm_denied, 2=ima_appraisal_failed, 3=evm_mismatch |
| 9 | IPC | 1=ring_full, 2=peer_disconnected, 3=message_too_large |
| 10 | KABI | 1=version_mismatch, 2=service_unavailable, 3=signature_invalid |
| 11 | DISTRIBUTED | 1=node_unreachable, 2=dlm_deadlock, 3=dsm_coherence_timeout |
| 12 | CRYPTO | 1=key_expired, 2=algorithm_unavailable, 3=rng_reseed_needed |
| 13-255 | Reserved | For future subsystems. Values >= 256 are available for out-of-tree use. |
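A diagnostic tool consuming ExtendedError might decode (subsystem, detail_code) pairs from this registry into messages. A small sketch covering a subset of the table; the message strings are hypothetical wording, only the numeric pairs come from the registry:

```rust
/// Decode a (subsystem, detail_code) pair into a human-readable
/// message, as a userspace diagnostic tool might. Subset only.
pub fn describe(subsystem: u32, detail: u32) -> &'static str {
    match (subsystem, detail) {
        (0, _) => "generic errno, no subsystem-specific detail",
        (1, 1) => "capability revoked",
        (1, 2) => "capability generation mismatch",
        (2, 1) => "process was OOM-killed",
        (7, 1) => "driver isolation domain crashed",
        (7, 4) => "driver demoted to Tier 2",
        // Unknown pairs still decode safely: the registry is stable
        // but extensible, so tools must tolerate new values.
        _ => "unknown subsystem/detail",
    }
}
```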

Kernel log messages: Errors are logged to the kernel ring buffer (dmesg) with structured fields. The stable tracepoint ABI (Section 20.2) exposes these as machine-parseable events for external monitoring tools.

3.14.5 Error Escalation Paths

Errors escalate through a five-level hierarchy. Each level is attempted before moving to the next:

  retry → log → degrade → isolate → panic
    1       2       3         4        5
  1. Retry: Transient hardware errors (bus timeout, CRC mismatch, link retrain) are retried with exponential backoff. Maximum retries are configured per error class (default: 3 retries, 1ms / 10ms / 100ms backoff). If the retry succeeds, no further escalation occurs — the event is logged at DEBUG level for trending.

  2. Log: Persistent errors that survive retries are logged to the kernel ring buffer and recorded as stable tracepoint events (Section 20.2). The Fault Management engine (Section 20.1) ingests these events for threshold-based diagnosis. No state change yet — the subsystem continues operating.

  3. Degrade: Repeated errors from the same subsystem trigger graceful degradation. Examples: storage path failover to a redundant controller, NIC fallback from hardware offload to software path, memory controller marking a DIMM rank as degraded (Section 20.1 RetirePages action). The subsystem continues at reduced capacity. Degradation is reported via uevent to userspace.

  4. Isolate: A misbehaving driver is crashed and restarted via the recovery sequence (Section 11.9). If the same driver crashes repeatedly — 3 times within 60 seconds — it is demoted to Tier 2 (full process isolation). If it continues crashing at Tier 2 (5 crashes within 300 seconds), the driver is disabled entirely and its device is marked offline in the device registry (Section 11.4).

  5. Panic: Reserved for corrupted umka-core state where continued operation risks data loss or silent corruption. Examples: invalid page table entries in kernel mappings, corrupted capability table metadata, scheduler invariant violations. Any state that cannot be recovered by isolating a single driver triggers a kernel panic (Section 3.14).
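The level-1 retry policy (default 3 retries at 1 ms / 10 ms / 100 ms backoff) can be sketched as follows. `retry_with_backoff` is an illustrative name; the sleep function is injected so the policy itself is testable:

```rust
use std::time::Duration;

/// Retry a transient operation per the level-1 default policy:
/// one initial attempt plus 3 retries at 1 ms / 10 ms / 100 ms.
/// Returns the first success, or the last error after all retries.
pub fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let backoffs = [
        Duration::from_millis(1),
        Duration::from_millis(10),
        Duration::from_millis(100),
    ];
    let mut last = op();
    for b in backoffs {
        if last.is_ok() {
            return last; // success: no further escalation, log at DEBUG
        }
        sleep(b);
        last = op();
    }
    last // still failing: escalate to level 2 (Log)
}
```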

See also: Section 11.9 (Crash Recovery) for the full driver restart sequence. Section 20.1 (Fault Management) for proactive, telemetry-driven error handling before faults occur. Section 20.2 (Stable Tracepoints) for the machine-parseable event format used at escalation levels 2-4.

FMA integration: Escalation levels 3 (Degrade) and 4 (Isolate) automatically emit a FaultEvent to the FMA subsystem (Section 20.1). Level 2 (Log) emits FaultEvent only if the fma_warn_events_enabled sysctl is set (default: false, to avoid flooding). Level 1 (Retry) and transient errors that resolve before any retry never emit FaultEvent; they are logged via tracepoints only.