Chapter 3: Concurrency Model
Locking strategy, lock-free structures, PerCpu, RCU, atomic operations, memory ordering, interrupt handling
3.1 Concurrency Model
3.1.1 Rust Ownership for Lock-Free Paths
Rust's ownership system provides compile-time guarantees that replace many of Linux's runtime-only checking tools (lockdep, KASAN, KCSAN).
/// Guard returned by `preempt_disable()`. Held reference prevents the
/// current task from being preempted and migrated to another CPU.
/// Dropped by `preempt_enable()` (implicit on scope exit).
/// Cannot be sent across threads (`!Send`).
pub struct PreemptGuard {
_not_send: PhantomData<*const ()>,
}
/// Disable preemption on the current CPU. Returns a guard; preemption
/// is re-enabled when the guard is dropped. Nesting is tracked via
/// `CpuLocal.preempt_count` — the counter increments on acquire and
/// decrements on drop. Interrupts are **not** disabled by this call;
/// use `local_irq_save()` for interrupt masking.
///
/// # Panics (debug only)
/// Panics if called while already in an IRQ context (preempt_count high
/// bit set), as nested preempt-disable is not safe from interrupt context.
pub fn preempt_disable() -> PreemptGuard {
// SAFETY: single-instruction increment of preempt_count via CpuLocal.
unsafe { arch::current::cpu::preempt_count_inc() }
PreemptGuard { _not_send: PhantomData }
}
impl PreemptGuard {
/// CPU on which this guard was obtained. Stable for the guard's lifetime,
/// because preemption (and therefore migration) is disabled while it is held.
pub fn cpu_id(&self) -> usize {
// Assumed arch helper, mirroring `preempt_count_inc` above.
arch::current::cpu::id()
}
}
impl Drop for PreemptGuard {
fn drop(&mut self) {
// SAFETY: matching decrement; guard lifetime ensures no underflow.
unsafe { arch::current::cpu::preempt_count_dec() }
}
}
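The nesting behavior — counter up on acquire, down on drop, zero means preemptible — can be modeled in hosted Rust. This is a minimal sketch that substitutes a thread-local `Cell` for the per-CPU counter; all names here are illustrative stand-ins, not the kernel API:

```rust
use std::cell::Cell;

thread_local! {
    // Hosted stand-in for the per-CPU preempt_count (hypothetical name).
    static PREEMPT_COUNT: Cell<u32> = Cell::new(0);
}

struct PreemptGuard;

fn preempt_disable() -> PreemptGuard {
    // Acquire: increment the nesting counter.
    PREEMPT_COUNT.with(|c| c.set(c.get() + 1));
    PreemptGuard
}

impl Drop for PreemptGuard {
    fn drop(&mut self) {
        // Matching decrement on scope exit; zero means preemption re-enabled.
        PREEMPT_COUNT.with(|c| c.set(c.get() - 1));
    }
}

fn preempt_count() -> u32 {
    PREEMPT_COUNT.with(|c| c.get())
}

fn main() {
    let outer = preempt_disable();
    assert_eq!(preempt_count(), 1);
    {
        let _inner = preempt_disable(); // nesting increments the counter
        assert_eq!(preempt_count(), 2);
    } // _inner dropped here: counter back to 1
    assert_eq!(preempt_count(), 1);
    drop(outer);
    assert_eq!(preempt_count(), 0); // preemption fully re-enabled
}
```

The guard pattern means forgetting to re-enable preemption is a compile-time impossibility: the decrement is tied to scope exit, not to a manual call.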
/// Per-CPU data: compile-time prevention of cross-CPU access.
/// Access requires a proof token that can only be obtained with
/// preemption disabled on the current CPU.
///
/// The backing storage is dynamically sized at boot based on the actual
/// CPU count discovered from ACPI MADT / device tree / firmware tables.
/// No compile-time MAX_CPUS constant — the kernel adapts to the hardware.
/// Allocation uses the boot allocator (Section 4.1.1) during early init,
/// before the general-purpose allocator is available.
pub struct PerCpu<T> {
/// Pointer to dynamically allocated array of per-CPU slots.
/// Length = num_possible_cpus(), discovered at boot.
data: *mut UnsafeCell<T>,
/// Number of CPU slots (set once at boot, never changes).
count: usize,
/// Per-slot borrow state tracking for runtime aliasing detection.
/// Length = count, parallel to data array. Each slot tracks:
/// - 0 = free (no active borrows)
/// - 1..=MAX-1 = that many active read borrows
/// - u32::MAX = mutably borrowed
/// This array enables detection of aliased borrows across multiple
/// PreemptGuard instances, which the type system cannot prevent alone.
borrow_state: *mut AtomicU32,
}
impl<T> PerCpu<T> {
/// Returns a reference to the borrow_state atomic for the given CPU slot.
/// # Panics
/// Panics if cpu >= count (bounds check).
fn borrow_state(&self, cpu: usize) -> &'static AtomicU32 {
assert!(cpu < self.count, "PerCpu: cpu {} out of range (count {})", cpu, self.count);
// SAFETY: borrow_state was allocated with `count` elements at boot,
// and cpu < count is verified above. The pointer remains valid for
// the kernel's lifetime (static allocation). The returned reference
// is 'static because the borrow_state array outlives all PerCpu usages.
unsafe { &*self.borrow_state.add(cpu) }
}
/// # Safety contract
///
/// `PerCpu<T>` is a global singleton per data type, accessed via a static
/// reference. Soundness depends on preventing aliased mutable references
/// to the same CPU's slot. Three distinct hazards must be addressed:
///
/// 1. **Cross-thread aliasing**: Two threads on different CPUs access
/// *different* slots, so no aliasing occurs. A thread cannot migrate
/// mid-access because preemption is disabled by the guard.
///
/// 2. **Interrupt aliasing**: Interrupt handlers (including NMI) may fire
/// while preemption is disabled. If an interrupt handler accesses the
/// same per-CPU variable mutably, this creates aliased `&mut T`
/// references — undefined behavior in Rust. Therefore:
/// - **Read-only access via `get()`** is safe with only preemption
/// disabled, because multiple `&T` references are permitted.
/// `get()` returns a `PerCpuRefGuard` that only disables preemption.
/// - **Mutable access via `get_mut()`** requires BOTH preemption
/// disabled AND local interrupts disabled. `get_mut()` returns a
/// `PerCpuMutGuard` that disables local interrupts (via
/// `local_irq_save`/`local_irq_restore`, defined in `arch::current::interrupts`)
/// for the duration of the borrow, in addition to disabling
/// preemption. This prevents any maskable interrupt handler from
/// observing or mutating the slot while the caller holds `&mut T`.
///
/// 3. **NMI constraint**: NMIs (non-maskable interrupts) cannot be
/// disabled by `local_irq_save`. An NMI that fires during a
/// `get_mut()` borrow will see `borrow_state == u32::MAX`. If the
/// NMI handler calls `get()` on the **same** PerCpu variable, the
/// borrow_state check detects the conflict and panics. This is the
/// intended safety mechanism — panicking is preferable to silent
/// data corruption.
///
/// **NMI handler rules**:
/// - NMI handlers MUST NOT call `get_mut()` on any PerCpu variable.
/// - NMI handlers SHOULD avoid calling `get()` on PerCpu variables
/// that are commonly accessed via `get_mut()` elsewhere, because
/// the conflict causes a panic (which in NMI context halts the
/// CPU).
/// - NMI handlers that need per-CPU state MUST use dedicated
/// NMI-specific PerCpu variables that are only read (never
/// mutated) from non-NMI contexts, or use raw atomic fields
/// outside the PerCpu abstraction (e.g., the pre-allocated
/// per-CPU crash buffer used by the panic NMI handler).
/// - UmkaOS's NMI handlers (panic coordinator, watchdog, perf sampling)
/// follow this rule: they write only to dedicated pre-allocated
/// per-CPU buffers and never access the general PerCpu<T> API.
///
/// The `PreemptGuard` is `!Send` and deliberately not `Clone`, pinning it
/// to the issuing CPU and proving that the caller is on that CPU. The borrow checker
/// prevents calling `get()` and `get_mut()` on the same guard
/// simultaneously, preventing aliased `&T` + `&mut T` within the same
/// non-interrupt context.
///
/// **Aliasing safety**: The `&mut PreemptGuard` borrow prevents aliasing
/// through the *same* guard, but nothing in the type system prevents a
/// caller from creating two separate `PreemptGuard`s and using them to
/// obtain aliased references (e.g., `&T` from one guard and `&mut T` from
/// another, or two `&mut T`). To close this hole, `PerCpu<T>` maintains a
/// per-slot `borrow_state: AtomicU32` with the following encoding:
/// - `0` = slot is free (no active borrows)
/// - `1..=MAX-1` = slot has that many active read borrows (`get()` in use)
/// - `u32::MAX` = slot is mutably borrowed (`get_mut()` in use)
///
/// `get()` atomically increments the reader count (fails if currently
/// `u32::MAX`, i.e., a writer is active). `get_mut()` atomically
/// transitions from `0` to `u32::MAX` (fails if any readers OR another
/// writer are active). Both `PerCpuRefGuard` and `PerCpuMutGuard` restore
/// the borrow_state on drop. In debug builds, violations panic
/// unconditionally, catching aliasing bugs during development and testing.
/// In release builds, the borrow-state CAS is elided for performance
/// (Section 3.1.3) — the structural invariants (preemption disabled, IRQs
/// disabled for `get_mut()`) are sufficient to prevent aliased access.
/// For the hottest per-CPU fields, `CpuLocal` (Section 3.1.2) bypasses
/// `PerCpu<T>` entirely, using architecture-specific registers for
/// ~1-10 cycle access.
///
/// **Comparison with Linux**: Linux's `this_cpu_read/write` on x86-64 compiles
/// to a single `gs:`-prefixed instruction (no atomic, no preempt_disable needed
/// for the single-instruction case). UmkaOS addresses this gap with a two-tier
/// per-CPU model (Section 3.1.2-c):
///
/// - **CpuLocal** (Section 3.1.2): Register-based access (~1-10 cycles, arch-dependent)
/// for the hottest ~10 fields (current_task, runqueue, slab magazines, etc.).
/// Matches Linux's per-CPU register pattern on all six architectures.
/// - **PerCpu\<T\>** (this struct): Generic abstraction for all other per-CPU data.
/// Borrow-state CAS is debug-only in release builds (Section 3.1.3), reducing cost
/// from ~20-30 cycles to ~3-8 cycles. The CAS catches aliasing bugs during
/// development; release builds trust the structural invariants (preemption
/// disabled + IRQs disabled = exclusive access).
///
/// See `get()` and `get_mut()` documentation below.
/// Read-only access to the current CPU's slot. Takes `&PreemptGuard`,
/// which only requires preemption to be disabled. Multiple shared `&T`
/// references may coexist soundly, so interrupt handlers may also hold
/// `&T` to the same slot without causing UB (given the `T: Sync` bound).
/// Returns a `PerCpuRefGuard<T>` that derefs to `&T`.
///
/// The lifetime `'g` ties the returned guard to the `PreemptGuard`, not
/// to `&self`. This prevents dropping the `PreemptGuard` (re-enabling
/// preemption and allowing migration) while still holding a reference
/// to this CPU's slot — which would be use-after-migrate unsoundness.
///
/// The `T: Sync` bound is required because interrupt handlers may
/// concurrently hold `&T` to the same slot. Without `Sync`, types like
/// `Cell<u32>` would allow data races between the caller and interrupt
/// handlers sharing the same per-CPU slot.
///
/// **Borrow-state protocol**: `get()` atomically increments the per-slot
/// `borrow_state` counter, provided the current value is not `u32::MAX`
/// (which indicates an active mutable borrow). If the slot is currently
/// mutably borrowed, `get()` panics. This prevents the `&T` + `&mut T`
/// aliasing UB that would arise if a caller obtained an immutable borrow
/// via one `PreemptGuard` while another context held a mutable borrow via
/// a second `PreemptGuard`. The `PerCpuRefGuard` decrements the counter
/// on drop, returning the slot to a lower reader count or back to 0.
pub fn get<'g>(&self, guard: &'g PreemptGuard) -> PerCpuRefGuard<'g, T>
where
T: Sync,
{
let cpu = guard.cpu_id();
assert!(cpu < self.count, "PerCpu: cpu_id {} out of range (count {})", cpu, self.count);
// Atomically increment the reader count. Fail if a writer holds the
// slot (borrow_state == u32::MAX). Use a compare-exchange loop so we
// can distinguish "writer active" from a successful increment.
// Active in debug builds only; elided in release (Section 3.1.3).
#[cfg(debug_assertions)]
{
let state = self.borrow_state(cpu);
let prev = state.fetch_update(Ordering::Acquire, Ordering::Relaxed, |v| {
// Reject an active writer, and also refuse an increment that would
// forge the writer sentinel (v + 1 == u32::MAX).
if v >= u32::MAX - 1 { None } else { Some(v + 1) }
});
if prev.is_err() {
panic!("PerCpu: slot {} already mutably borrowed — cannot take shared borrow", cpu);
}
}
// SAFETY: PreemptGuard guarantees we are on cpu_id() and cannot
// migrate. In debug builds, the borrow_state increment above ensures
// no mutable borrow is active. In release builds, the structural
// invariant (preemption disabled = CPU pinned) is sufficient for
// read-only access. T: Sync guarantees shared references are sound
// across concurrent contexts (caller + interrupt handlers).
unsafe {
PerCpuRefGuard {
value: &*(*self.data.add(cpu)).get(),
#[cfg(debug_assertions)]
borrow_state: self.borrow_state(cpu),
_guard: PhantomData,
}
}
}
/// Mutable access to the current CPU's slot. Takes `&mut PreemptGuard`
/// to prevent aliasing with any concurrent `get()` or `get_mut()` on
/// the same guard in the caller's context. Additionally, the returned
/// `PerCpuMutGuard` disables local interrupts (via `local_irq_save`)
/// to prevent interrupt handlers from accessing this slot while `&mut T`
/// is live. Interrupts are restored when the guard is dropped.
///
/// The lifetime `'g` ties the returned guard to the `PreemptGuard`,
/// preventing use-after-migrate (same rationale as `get()`).
///
/// This two-layer protection is necessary because preemption-disable
/// alone does NOT prevent interrupt handlers from firing and potentially
/// accessing the same per-CPU variable.
///
/// **Aliasing safety**: The `&mut PreemptGuard` borrow prevents aliasing
/// through the *same* guard, but callers could theoretically create a
/// second `PreemptGuard` and call `get()` or `get_mut()` again on the
/// same slot, producing `&T` + `&mut T` or two `&mut T` — both are UB.
/// To close this hole, `get_mut()` atomically transitions the per-slot
/// `borrow_state` from exactly `0` (free) to `u32::MAX` (writer) using
/// a compare-exchange. If the slot has any active readers (count > 0) or
/// another writer (count == u32::MAX), the CAS fails and `get_mut()`
/// panics.
///
/// **Debug vs release**: In debug builds (`cfg(debug_assertions)`), the
/// CAS is always active — catching aliasing bugs during development. In
/// release builds, the CAS is elided (Section 3.1.3): the structural invariants
/// (preemption disabled + IRQs disabled) are sufficient to guarantee
/// exclusive access. This reduces `get_mut()` cost from ~20-30 cycles to
/// ~3-8 cycles. The `borrow_state` array is still allocated for binary
/// compatibility with debug-built modules.
///
/// `PerCpu<T>` is designed to be used with a
/// single `PreemptGuard` per critical section — creating multiple guards
/// and using them to access the same `PerCpu<T>` is a logic error detected
/// at runtime.
pub fn get_mut<'g>(&self, guard: &'g mut PreemptGuard) -> PerCpuMutGuard<'g, T> {
let cpu = guard.cpu_id();
assert!(cpu < self.count, "PerCpu: cpu_id {} out of range (count {})", cpu, self.count);
// SAFETY: local_irq_save() MUST be called BEFORE updating borrow_state.
// If we update borrow_state first and an interrupt fires before IRQs
// are disabled, an interrupt handler calling get() on the same PerCpu
// variable would see borrow_state == u32::MAX and panic. Disabling IRQs
// first ensures no interrupt handler can observe or race with the
// borrow_state transition.
let saved_flags = local_irq_save();
// Runtime borrow-state check: active in debug builds
// (`cfg(debug_assertions)`); elided in release builds for performance
// (see Section 3.1.3). Atomically transitions borrow_state from 0
// (free) to u32::MAX (writer). Fails if any reader (count > 0) or
// another writer (u32::MAX) is active. Fires both for &mut T + &mut T
// (two writers) and for &T + &mut T (reader + writer) aliasing.
// This check is safe from interrupt-handler races because IRQs are
// already disabled above.
#[cfg(debug_assertions)]
{
let state = self.borrow_state(cpu);
if state.compare_exchange(0, u32::MAX, Ordering::Acquire, Ordering::Relaxed).is_err() {
local_irq_restore(saved_flags);
// Note: the kernel is compiled with `panic = "abort"` (no unwinding), so
// borrow_state cannot leak: a panic immediately halts the core. For Tier 1
// drivers sharing the address space, driver panics are caught by the driver
// fault handler (Section 10.8), which resets the driver's state including
// any held per-CPU borrows and any Core kernel locks acquired through KABI
// calls (see Section 10.8 ISOLATE step: KABI lock registry).
panic!("PerCpu: slot {} already borrowed (reader or writer active)", cpu);
}
}
// SAFETY: PreemptGuard prevents migration. local_irq_save() prevents
// interrupt handlers from accessing this slot. In debug builds, the CAS
// above additionally verifies no aliased borrows exist. In release builds,
// we trust the structural invariants. Together, these guarantee exclusive
// access to this CPU's slot.
unsafe {
PerCpuMutGuard {
value: &mut *(*self.data.add(cpu)).get(),
saved_flags,
#[cfg(debug_assertions)]
borrow_state: self.borrow_state(cpu),
_guard: PhantomData,
}
}
}
}
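The reader/writer protocol that `get()` and `get_mut()` implement on `borrow_state` can be exercised in hosted Rust. The following is a simplified sketch (the reader path here checks only for the writer sentinel; function names like `try_read` are illustrative, not kernel API):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// borrow_state encoding: 0 = free, 1..=u32::MAX-1 = reader count,
// u32::MAX = writer sentinel.
fn try_read(state: &AtomicU32) -> bool {
    // Increment the reader count unless a writer holds the slot.
    state
        .fetch_update(Ordering::Acquire, Ordering::Relaxed, |v| {
            if v == u32::MAX { None } else { Some(v + 1) }
        })
        .is_ok()
}

fn end_read(state: &AtomicU32) {
    state.fetch_sub(1, Ordering::Release);
}

fn try_write(state: &AtomicU32) -> bool {
    // Transition 0 -> writer sentinel; fails if any reader or writer is active.
    state
        .compare_exchange(0, u32::MAX, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}

fn end_write(state: &AtomicU32) {
    state.store(0, Ordering::Release);
}

fn main() {
    let state = AtomicU32::new(0);
    assert!(try_read(&state));   // first reader
    assert!(try_read(&state));   // second reader: shared borrows stack
    assert!(!try_write(&state)); // writer blocked while readers active
    end_read(&state);
    end_read(&state);
    assert!(try_write(&state));  // slot free: writer takes the sentinel
    assert!(!try_read(&state));  // reader rejected while writer active
    end_write(&state);
    assert_eq!(state.load(Ordering::Relaxed), 0);
}
```

In the kernel, the failing paths panic instead of returning `false`; the sketch returns a `bool` so the state machine can be stepped through directly.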
/// Guard for read-only per-CPU access. Only requires preemption disabled.
/// Implements `Deref<Target = T>` for ergonomic read access.
///
/// The lifetime `'a` is tied to the `PreemptGuard`, not to the `PerCpu<T>`
/// container. This ensures the guard cannot outlive the preemption-disabled
/// critical section, preventing use-after-migrate.
///
/// On creation, the per-slot `borrow_state` reader count was incremented by
/// `get()`. On drop, `PerCpuRefGuard` decrements the reader count, returning
/// the slot to its prior borrow state. This is necessary so that `get_mut()`
/// can detect when readers are no longer active and safely transition to
/// writer mode.
pub struct PerCpuRefGuard<'a, T> {
value: &'a T,
/// Reference to the per-slot borrow_state counter so Drop can decrement
/// the reader count. Only present in debug builds (`#[cfg(debug_assertions)]`);
/// in release builds, the borrow-state CAS is elided (Section 3.1.3).
#[cfg(debug_assertions)]
borrow_state: &'a AtomicU32,
/// Ties this guard's lifetime to the `PreemptGuard`, not to `PerCpu<T>`.
_guard: PhantomData<&'a PreemptGuard>,
}
/// `PerCpuRefGuard` must NOT be sent to another CPU/thread. It holds a
/// reference to a per-CPU slot that is only valid on the CPU where the
/// `PreemptGuard` was obtained. Sending it to another thread (which runs
/// on a different CPU) would allow reading another CPU's slot without any
/// synchronization, violating the per-CPU data invariant.
impl<T> !Send for PerCpuRefGuard<'_, T> {}
impl<'a, T> core::ops::Deref for PerCpuRefGuard<'a, T> {
type Target = T;
fn deref(&self) -> &T { self.value }
}
impl<'a, T> Drop for PerCpuRefGuard<'a, T> {
fn drop(&mut self) {
// Decrement the reader count with underflow detection.
// Only present in debug builds; release builds elide borrow tracking
// (Section 3.1.3).
#[cfg(debug_assertions)]
{
// CORRECTNESS: fetch_sub on 0 wraps to u32::MAX (the writer sentinel),
// which would corrupt the borrow state. We detect this case after the
// fact and panic rather than silently corrupting state.
//
// Normal case: borrow_state is 1..=u32::MAX-1 (one or more readers).
// Underflow case: borrow_state is 0 (no readers — this is a bug).
// Writer-corruption case: borrow_state is u32::MAX (impossible if get()
// correctly checked for writer before incrementing).
//
// We panic on error because it indicates a logic error in the PerCpu
// borrow tracking, and continuing would corrupt state. The recovery
// store attempts to restore a consistent state for debugging, but
// the kernel will halt anyway (panic = "abort" for kernel code).
//
// **Why no IRQ protection?** An interrupt handler that fires between
// fetch_sub and the panic check, and calls get() on the same PerCpu,
// will see borrow_state == u32::MAX (writer sentinel) and correctly
// fail its get() call. The handler does NOT proceed with corrupted
// state — it safely errors out. Adding IRQ save/restore to every guard
// drop (hot path) to protect a code path that already panics is
// unnecessary overhead.
let old = self.borrow_state.fetch_sub(1, Ordering::Release);
if old == 0 {
// Underflow: no reader was registered. This is a kernel bug.
// The fetch_sub wrapped to u32::MAX. Restore to 0 and panic.
self.borrow_state.store(0, Ordering::Release);
panic!("PerCpuRefGuard::drop: borrow_state underflow — double-drop or missing get()");
}
if old == u32::MAX {
// Was in writer mode — this should be impossible since get() fails
// when a writer is active. If we reach here, state is corrupted.
// The fetch_sub wrapped to u32::MAX-1. Restore sentinel and panic.
self.borrow_state.store(u32::MAX, Ordering::Release);
panic!("PerCpuRefGuard::drop: borrow_state was u32::MAX (writer sentinel) — corrupted state");
}
// Normal case: old was 1..=u32::MAX-1, now decremented successfully.
// No further action needed.
}
}
}
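The wrap-and-detect sequence in the drop path above can be exercised in hosted Rust. A sketch, with an illustrative function name standing in for the guard's `Drop` body:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Models the Drop path when no reader was registered (a double-drop bug):
// fetch_sub(1) on 0 wraps to u32::MAX, accidentally forging the writer
// sentinel. The guard detects old == 0, restores 0, then panics (in the
// kernel); here we return true instead so the path is observable.
fn buggy_drop_detected(state: &AtomicU32) -> bool {
    let old = state.fetch_sub(1, Ordering::Release);
    if old == 0 {
        state.store(0, Ordering::Release); // restore a consistent state
        return true; // kernel would panic! here
    }
    false
}

fn main() {
    let state = AtomicU32::new(0); // no readers — decrement is a bug
    assert!(buggy_drop_detected(&state));
    assert_eq!(state.load(Ordering::Relaxed), 0);

    let state = AtomicU32::new(2); // two readers active: normal decrement
    assert!(!buggy_drop_detected(&state));
    assert_eq!(state.load(Ordering::Relaxed), 1);
}
```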
/// Guard for mutable per-CPU access. Disables local interrupts on creation,
/// restores them on drop. This prevents interrupt handlers from creating
/// aliased references to the same per-CPU slot.
///
/// The lifetime `'a` is tied to the `PreemptGuard` (via `PhantomData`),
/// not to the `PerCpu<T>` container. Same rationale as `PerCpuRefGuard`.
///
/// On creation, the per-slot `borrow_state` was set to `u32::MAX` (writer
/// sentinel) by `get_mut()`. On drop, `PerCpuMutGuard` resets `borrow_state`
/// to `0` (free) before restoring local IRQs, allowing subsequent `get()` or
/// `get_mut()` calls on this slot.
pub struct PerCpuMutGuard<'a, T> {
value: &'a mut T,
saved_flags: usize,
/// Reference to the per-slot borrow_state counter so Drop can reset it
/// to 0 (free), enabling detection of concurrent borrows in subsequent
/// callers. Only present in debug builds (`#[cfg(debug_assertions)]`);
/// in release builds, the borrow-state CAS is elided entirely
/// (Section 3.1.3) and the structural invariants (preemption disabled +
/// IRQs disabled) are sufficient to guarantee exclusive access.
#[cfg(debug_assertions)]
borrow_state: &'a AtomicU32,
/// Ties this guard's lifetime to the `PreemptGuard`, not to `PerCpu<T>`.
_guard: PhantomData<&'a mut PreemptGuard>,
}
/// `PerCpuMutGuard` must NOT be sent to another CPU/thread. The `saved_flags`
/// field contains the interrupt state of the originating CPU (saved via
/// `local_irq_save`). Restoring these flags on a different CPU would corrupt
/// that CPU's interrupt state — potentially enabling interrupts that should
/// be disabled or vice versa, leading to missed interrupts or unsafe re-entry.
impl<T> !Send for PerCpuMutGuard<'_, T> {}
impl<'a, T> core::ops::Deref for PerCpuMutGuard<'a, T> {
type Target = T;
fn deref(&self) -> &T { self.value }
}
impl<'a, T> core::ops::DerefMut for PerCpuMutGuard<'a, T> {
fn deref_mut(&mut self) -> &mut T { self.value }
}
impl<'a, T> Drop for PerCpuMutGuard<'a, T> {
fn drop(&mut self) {
// Reset borrow_state from u32::MAX (writer) back to 0 (free) BEFORE
// restoring IRQs. This ensures that any interrupt handler firing
// immediately after IRQ restoration sees the slot as free and can
// call get() or get_mut() without a false-positive panic.
// Only present in debug builds; release builds elide borrow tracking.
#[cfg(debug_assertions)]
{
self.borrow_state.store(0, Ordering::Release);
}
local_irq_restore(self.saved_flags);
}
}
/// Per-CPU locks: type-safe sharded locking without array-indexing aliasing hazards.
///
/// A common pattern in scalable kernels is to give each CPU its own lock protecting
/// a shard of data, avoiding contention on a single global lock. Naively implementing
/// this as `[SpinLock<T>; N]` creates a type-safety hole: Rust's aliasing rules permit
/// holding `&mut T` from `locks[i]` and `&mut T` from `locks[j]` simultaneously if
/// `i != j`, but nothing in the type system prevents accessing `locks[other_cpu]`
/// when the caller intended to access only `locks[this_cpu]`. Furthermore, if two
/// CPUs simultaneously hold their respective locks, the `&mut T` references are
/// derived from the same array allocation, which can violate LLVM's noalias assumptions
/// in ways that are difficult to reason about.
///
/// UmkaOS provides `PerCpuLock<T>` to enforce the per-CPU locking invariant at the
/// type level:
///
/// 1. **Separate allocations per CPU**: Each CPU's lock+data is a separate slab
/// allocation (one `PerCpuLockSlot<T>` per CPU), NOT an array element. This ensures
/// that `&mut T` references from different CPUs are derived from disjoint
/// allocations, satisfying Rust's aliasing rules without relying on the
/// "different array indices" reasoning that LLVM may not honor.
///
/// 2. **Access restricted to current CPU**: `PerCpuLock::lock()` takes a
/// `&PreemptGuard` (the same proof token used by `PerCpu<T>`) and only returns
/// a guard for the current CPU's lock. There is no API to access another CPU's
/// lock — the type system makes it impossible.
///
/// 3. **Cache-line aligned**: Each `PerCpuLockSlot<T>` is `#[repr(align(64))]` to
/// prevent false sharing. The alignment is part of the type, not a runtime hint.
///
/// 4. **Safe composition with `PerCpu<T>`**: For cases where per-CPU data needs
/// both a lock and lock-free access, use `PerCpu<UnsafeCell<T>>` with explicit
/// `PerCpuMutGuard` for writes, or wrap the data in `PerCpu<SpinLock<T>>` where
/// the lock itself is per-CPU. The key invariant is that `PerCpuLock<T>` never
/// exposes `&mut T` from one CPU while another CPU's lock is held — each CPU's
/// lock guards only that CPU's data shard.
///
/// **Use case**: Per-CPU statistics counters that need occasional atomic updates
/// across all CPUs (e.g., network Rx packet counts). Each CPU locks its own slot
/// for local updates; aggregating across all CPUs requires iterating without holding
/// any locks (the counters are `AtomicU64`, so reads are consistent).
pub struct PerCpuLock<T> {
/// Array of pointers to independently-allocated lock slots.
/// Each pointer points to a separately-allocated `PerCpuLockSlot<T>`.
/// Length = num_possible_cpus(), discovered at boot.
/// The indirection ensures each slot is a separate allocation for aliasing purposes.
///
/// **Allocation timing**: `PerCpuLock<T>` is initialized during phase 2 of boot
/// (after the slab allocator is available, Section 4.1.3). Each slot is allocated
/// via the slab allocator (not `Box`/global allocator). The pointer array itself
/// is allocated from the boot bump allocator during early init.
slots: *mut *mut PerCpuLockSlot<T>,
/// Number of CPU slots (set once at boot, never changes).
count: usize,
}
/// A single CPU's lock slot, cache-line aligned to prevent false sharing.
/// This is allocated independently (via boot allocator) for each CPU at boot.
///
/// `#[repr(align(64))]` guarantees:
/// - The struct's starting address is 64-byte aligned.
/// - The compiler pads the struct to a multiple of 64 bytes automatically.
/// - `size_of::<PerCpuLockSlot<T>>()` >= 64 regardless of `SpinLock<T>` size.
///
/// No explicit `_pad` field is needed — the compiler handles padding.
/// This works correctly for any `SpinLock<T>` size: small types get padded
/// up to 64 bytes, large types round up to the next 64-byte boundary.
#[repr(align(64))]
struct PerCpuLockSlot<T> {
/// The spinlock protecting this CPU's data shard.
lock: SpinLock<T>,
}
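The layout guarantees listed in the comment are checkable in hosted Rust. A sketch with a stand-in payload type (`Slot` here is illustrative, not the kernel's `PerCpuLockSlot`):

```rust
use std::mem::{align_of, size_of};

// #[repr(align(64))] raises both the alignment and the size to a multiple
// of 64 bytes, with no manual padding field required.
#[repr(align(64))]
struct Slot<T> {
    lock: T, // stand-in for SpinLock<T>
}

fn main() {
    // Starting address is 64-byte aligned.
    assert_eq!(align_of::<Slot<u8>>(), 64);
    // Small payloads are padded up to one cache line.
    assert_eq!(size_of::<Slot<u8>>(), 64);
    // Large payloads round up to the next 64-byte boundary.
    assert_eq!(size_of::<Slot<[u8; 65]>>(), 128);
}
```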
impl<T> PerCpuLock<T> {
/// Lock the current CPU's data shard.
///
/// # Safety contract
///
/// - Takes `&PreemptGuard` to prove the caller is pinned to a specific CPU.
/// - Returns `PerCpuLockGuard<'_, T>` that derefs to `&T` and `&mut T`.
/// - The guard holds the spinlock for this CPU's slot.
/// - No API exists to lock another CPU's shard — the only access path is
/// through the current CPU, enforced by the `PreemptGuard` proof token.
///
/// **Aliasing safety**: Because each slot is a separate heap allocation,
/// holding `&mut T` on CPU 0 and `&mut T` on CPU 1 simultaneously is sound —
/// the references are derived from disjoint allocations, not from different
/// indices of the same array. This satisfies Rust's aliasing rules and
/// LLVM's noalias semantics without subtle reasoning about array indexing.
///
/// **Interrupt safety**: `SpinLock::lock()` disables preemption for the
/// critical section. If the caller needs to also disable interrupts
/// (to prevent interrupt handlers from deadlocking on the same lock),
/// they must wrap the call in `local_irq_save()`/`local_irq_restore()`.
/// For per-CPU locks, this is rarely needed because interrupt handlers
/// typically access different data or use lock-free patterns.
pub fn lock<'g>(&self, guard: &'g PreemptGuard) -> PerCpuLockGuard<'g, T> {
let cpu = guard.cpu_id();
assert!(cpu < self.count, "PerCpuLock: cpu_id {} out of range", cpu);
// SAFETY: slots was allocated with count elements at boot. cpu is in bounds.
let slot_ptr = unsafe { *self.slots.add(cpu) };
// SAFETY: slot_ptr was obtained from slab allocation at boot and is never
// freed during kernel operation. It points to a valid PerCpuLockSlot<T>.
let slot = unsafe { &*slot_ptr };
PerCpuLockGuard {
inner: slot.lock.lock(),
_cpu_pin: PhantomData,
}
}
/// Try to lock the current CPU's data shard without blocking.
///
/// Returns `Some(PerCpuLockGuard)` if the lock was acquired, `None` if
/// the lock is currently held (e.g., by an interrupt handler on this CPU).
///
/// Same safety contract as `lock()`, but non-blocking.
pub fn try_lock<'g>(&self, guard: &'g PreemptGuard) -> Option<PerCpuLockGuard<'g, T>> {
let cpu = guard.cpu_id();
assert!(cpu < self.count, "PerCpuLock: cpu_id {} out of range", cpu);
let slot_ptr = unsafe { *self.slots.add(cpu) };
let slot = unsafe { &*slot_ptr };
slot.lock.try_lock().map(|inner| PerCpuLockGuard {
inner,
_cpu_pin: PhantomData,
})
}
/// Access all slots for cross-CPU aggregation (read-only, no locks held).
///
/// This is the ONLY way to access another CPU's slot, and it only provides
/// read-only access to the lock structure itself — NOT to the protected data.
/// The typical use case is iterating over all CPUs' atomic counters.
///
/// # Safety
///
/// The caller must ensure no CPU is currently mutating its data shard through
/// a `PerCpuLockGuard`. For atomic counter aggregation, this is safe because
/// the counters are read atomically without holding the lock. For non-atomic
/// data, the caller must use external synchronization (e.g., pause all CPUs
/// via IPI) before calling this method.
///
/// Returns an iterator over `&SpinLock<T>` for each CPU. The caller can
/// read the lock state or use `try_lock()` on each, but cannot obtain `&mut T`
/// through this path without holding the lock.
pub unsafe fn iter_slots(&self) -> impl Iterator<Item = &SpinLock<T>> + '_ {
(0..self.count).map(move |i| {
// SAFETY: slots has `count` elements and each slot pointer is valid for
// the kernel's lifetime. Explicit unsafe blocks are used inside the
// closure, which is not itself an unsafe context.
let slot_ptr = unsafe { *self.slots.add(i) };
unsafe { &(*slot_ptr).lock }
})
}
}
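The stated use case — per-CPU counters locked locally but aggregated without locks — can be sketched in hosted Rust, with `Box` standing in for the slab-allocated slots and an `AtomicU64` counter per slot (names illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// One slot per CPU; each is a separate allocation, as in PerCpuLockSlot.
struct Slot {
    rx_packets: AtomicU64,
}

fn main() {
    let slots: Vec<Box<Slot>> = (0..4)
        .map(|_| Box::new(Slot { rx_packets: AtomicU64::new(0) }))
        .collect();

    // "CPU 2" and "CPU 0" bump their own counters (locally, lock not needed
    // for an atomic add; the lock would guard non-atomic shard state).
    slots[2].rx_packets.fetch_add(5, Ordering::Relaxed);
    slots[0].rx_packets.fetch_add(3, Ordering::Relaxed);

    // Aggregation iterates all slots without holding any lock: each load is
    // individually consistent because the counters are atomic.
    let total: u64 = slots
        .iter()
        .map(|s| s.rx_packets.load(Ordering::Relaxed))
        .sum();
    assert_eq!(total, 8);
}
```

The aggregate is a consistent-per-counter snapshot, not a globally atomic one — adequate for statistics, which is exactly the use case the text describes.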
/// Guard for a per-CPU lock. Implements `Deref`/`DerefMut` to access the protected data.
///
/// The guard holds a `SpinLockGuard` derived from the current CPU's lock slot.
/// The `PhantomData<&'a PreemptGuard>` ties the guard's lifetime to the CPU pin,
/// preventing use-after-migrate if the caller drops the `PreemptGuard` while
/// still holding the lock guard.
pub struct PerCpuLockGuard<'a, T> {
inner: SpinLockGuard<'a, T>,
_cpu_pin: PhantomData<&'a PreemptGuard>,
}
impl<'a, T> core::ops::Deref for PerCpuLockGuard<'a, T> {
type Target = T;
fn deref(&self) -> &T { &*self.inner }
}
impl<'a, T> core::ops::DerefMut for PerCpuLockGuard<'a, T> {
fn deref_mut(&mut self) -> &mut T { &mut *self.inner }
}
/// `PerCpuLockGuard` must NOT be sent to another CPU/thread.
/// The guard holds a `SpinLockGuard` from a specific CPU's lock slot.
/// Sending it to another thread would allow that thread to access a lock
/// that may be concurrently acquired by the original CPU (e.g., in an
/// interrupt handler), causing deadlock or data corruption.
impl<T> !Send for PerCpuLockGuard<'_, T> {}
Why separate allocations matter for type soundness:
The naive [SpinLock<T>; N] approach has a subtle aliasing problem. When CPU 0
holds locks[0].lock() and CPU 1 holds locks[1].lock(), both CPUs have
derived their &mut T references from the same array allocation. While Rust's
reference rules permit disjoint array element access, LLVM's noalias attribute
and the optimizer's alias analysis may not distinguish between "different indices
of the same array" and "same allocation." In practice, this is unlikely to cause
miscompilation with current LLVM, but the UmkaOS architecture takes a conservative
approach: each CPU's lock+data is a separate heap allocation, guaranteeing that
the &mut T references are truly disjoint at the allocation level.
This design also simplifies reasoning about memory reclamation: if a per-CPU lock slot needs to be freed (e.g., during CPU hot-unplug), the individual slab allocation can be returned without affecting other CPUs' slots.
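A hosted sketch of the point: each `Box` below is a distinct allocation, so references into different slots are disjoint at the allocation level, and one slot can be dropped independently of the others, as in CPU hot-unplug (the `Vec<Option<Box<_>>>` shape is illustrative only):

```rust
fn main() {
    // Per-slot allocations, one per "CPU".
    let mut slots: Vec<Option<Box<u64>>> = (0..4).map(|i| Some(Box::new(i))).collect();

    // Pointers into different slots come from disjoint allocations — no
    // "different indices of the same array" reasoning required.
    let p0: *const u64 = &**slots[0].as_ref().unwrap();
    let p1: *const u64 = &**slots[1].as_ref().unwrap();
    assert_ne!(p0, p1);

    // Free one slot independently (hot-unplug analogue); others untouched.
    slots[3] = None;
    assert!(slots[3].is_none());
    assert_eq!(**slots[0].as_ref().unwrap(), 0);
}
```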
3.1.2 CpuLocal: Register-Based Per-CPU Fast Path
PerCpu<T> is the correct abstraction for general per-CPU data, but its
indirection (CPU ID lookup + array indexing + borrow-state checking) is
too expensive for the kernel's hottest paths — scheduler pick_next_task,
slab magazine alloc/free, NAPI poll, RCU quiescent-state reporting. These
paths execute millions of times per second and every cycle matters.
Linux solves this with architecture-specific per-CPU registers that allow single- or two-instruction access to critical per-CPU fields. UmkaOS adopts the same approach as a two-tier per-CPU model:
- Tier 1 — `CpuLocal`: A fixed-layout struct pointed to by the architecture's dedicated per-CPU register. Access is 1-4 instructions with no function-call overhead. Used for ~10 of the hottest fields.
- Tier 2 — `PerCpu<T>`: The existing generic abstraction. Used for everything else (statistics counters, per-CPU caches, driver state).
Per-architecture register assignment:
| Architecture | Register | Instruction | Cycles | Notes |
|---|---|---|---|---|
| x86-64 | GS segment | `mov %gs:OFFSET, %reg` | ~1 | Segment prefix encodes offset in instruction. Set via MSR_GS_BASE per CPU at boot. |
| AArch64 | TPIDR_EL1 | `mrs reg, tpidr_el1` + `ldr` | ~2-4 | System register, kernel-only (EL1). VHE kernels use TPIDR_EL2. |
| ARMv7 | TPIDRPRW | `mrc p15, 0, reg, c13, c0, 4` + `ldr` | ~3-5 | Privileged thread ID register (PL1 only). Requires ARMv6K+. |
| PPC64 | r13 (PACA) | `ld reg, OFFSET(r13)` | ~1-3 | r13 permanently points to the Per-processor Area (PACA). Matches Linux. |
| PPC32 | SPRG3 | `mfspr reg, SPRG3` + `lwz` | ~3-6 | SPRG3 is designated for OS use. Linux PPC32 does not optimize this; UmkaOS does. |
| RISC-V | sscratch CSR | `csrr reg, sscratch` + `ld` | ~5-10 | Repurposed: stores the per-CPU base with LSB=1 tag to distinguish it from the task pointer on trap entry. This follows the pattern established in Linux's RISC-V entry.S: on trap entry Linux swaps `tp` into `sscratch` so `tp` can carry the per-CPU pointer in process context; UmkaOS uses the same swap idiom, so the sscratch use is fully compatible with trap handling. Fallback: a `__per_cpu_offset[cpu]` array if tagging proves impractical. |
Design rationale: Only x86-64 achieves true single-instruction per-CPU
access (the segment register encodes the per-CPU offset within the
instruction itself). All other architectures require at minimum two
instructions: read the per-CPU base register, then load from base+offset.
This is still an order of magnitude faster than the PerCpu<T> generic
path (~3-5 cycles vs ~20-30 cycles with CAS borrow checking).
/// Per-size-class free-object magazine for lock-free CPU-local slab allocation.
/// A magazine holds up to `MAGAZINE_SIZE` pre-freed objects of one size class.
/// All slots are pointers to objects of identical size; the size class is
/// implicit from which `MagazinePair` entry this magazine lives in.
///
/// # Allocation fast path
/// 1. Caller checks `loaded.count > 0`.
/// 2. If true, returns `loaded.objects[--count]` (no lock, no atomic).
/// 3. If false, swaps `loaded` ↔ `spare`; if spare was non-empty, retry step 1.
/// 4. If both are empty, refills `loaded` from the global per-size-class slab
/// under a short spinlock (typically 64 objects at a time).
///
/// # Free fast path
/// 1. If `loaded.count < MAGAZINE_SIZE`, store freed pointer at `loaded.objects[count++]`.
/// 2. If loaded is full, swap loaded ↔ spare; if spare was full, drain spare
/// to the global slab under a spinlock, then retry.
///
/// # Memory layout
/// `objects` is a fixed-size array so the magazine occupies a single
/// contiguous allocation with a predictable footprint.
/// At MAGAZINE_SIZE = 64, `size_of::<SlabMagazine>()` = 8 (count) + 64 × 8 (ptrs)
/// = 520 bytes, fitting in 9 cache lines. Magazines are allocated from the slab
/// allocator itself (size class for 520-byte objects).
pub struct SlabMagazine {
/// Number of valid object pointers currently loaded (0..=MAGAZINE_SIZE).
pub count: usize,
/// Object pointer slots — valid at indices [0..count).
/// Slots at indices [count..MAGAZINE_SIZE) are uninitialised and must
/// not be read. Written by the free path; read and cleared by the alloc path.
pub objects: [*mut u8; MAGAZINE_SIZE],
}
/// Maximum number of objects in a single `SlabMagazine`.
/// 64 objects = 512 bytes of pointer slots — eight 64-byte cache lines.
/// Changing this constant requires recompiling all slab consumers and
/// re-tuning the global slab's bulk-refill batch size (currently 1 magazine).
pub const MAGAZINE_SIZE: usize = 64;
/// Two-magazine pair per CPU per size class.
///
/// The two-magazine design avoids the "thrash" case where a tight alloc/free
/// loop on the same CPU would otherwise bounce between the global slab and the
/// per-CPU cache. With two magazines, the free path can fill the spare magazine
/// before touching the global slab, and the alloc path can drain the loaded
/// magazine fully before swapping in the spare.
///
/// # Invariants
/// - `loaded` is always a valid, non-null pointer to a `SlabMagazine`.
/// - `spare` is always a valid, non-null pointer to a `SlabMagazine`.
/// - Both point to magazines for the same size class.
/// - Access to `loaded` and `spare` requires preemption disabled on the current
/// CPU (the containing `CpuLocalBlock` is only accessed with preemption off).
pub struct MagazinePair {
/// Currently active magazine: alloc pops from here, free pushes here first.
pub loaded: *mut SlabMagazine,
/// Backup magazine: swapped in when `loaded` is empty (alloc) or full (free).
pub spare: *mut SlabMagazine,
}
/// Number of slab size classes. Each size class has its own per-CPU magazine
/// pointer in CpuLocalBlock. Covers allocations from 8 bytes (class 0) to
/// 256 KB (class 15) in powers of 2. This constant MUST match the slab
/// allocator's size class table (Section 4.1).
pub const SLAB_SIZE_CLASSES: usize = 16;
/// Fixed-layout per-CPU data block. Accessed via the architecture's
/// dedicated per-CPU register (x86-64 GS, AArch64 TPIDR_EL1, etc.).
/// Contains only the hottest fields — those accessed on every syscall,
/// every interrupt, or every scheduler tick.
///
/// # Invariants
///
/// - One `CpuLocalBlock` is allocated per CPU at boot (boot allocator).
/// - The per-CPU register is initialized to point to this block during
/// `smp_prepare_boot_cpu()` (BSP) and `secondary_cpu_init()` (APs).
/// - The register value NEVER changes after init for a given CPU.
/// - Access requires preemption to be disabled (caller holds a
/// `PreemptGuard`), ensuring the CPU cannot change between register
/// read and field access. No runtime borrow check is needed — the
/// structural invariant (preemption disabled + dedicated register)
/// guarantees single-writer access.
///
/// # Field selection criteria
///
/// A field belongs in `CpuLocalBlock` only if ALL of the following hold:
/// 1. It is accessed on a hot path (syscall, IRQ, scheduler, slab, NAPI).
/// 2. It is per-CPU (not per-task, not global).
/// 3. It is a scalar or fixed-size pointer (no dynamically-sized data).
/// 4. It changes infrequently relative to how often it is read.
/// All other per-CPU data belongs in `PerCpu<T>`.
#[repr(C, align(64))]
pub struct CpuLocalBlock {
/// Pointer to the currently executing task.
/// Read on every syscall entry, every interrupt, every context switch.
pub current_task: *mut TaskStruct,
/// Pointer to this CPU's runqueue.
/// Read by the scheduler on every tick and every wakeup.
pub runqueue: *mut RunQueue,
/// Preemption nesting count.
/// Incremented/decremented on every preempt_disable/enable pair.
/// Must be in CpuLocal because preempt_disable itself needs to
/// access it without going through a preempt-disabled guard (circular).
pub preempt_count: u32,
/// Interrupt nesting count (hardirq + softirq depth).
pub irq_count: u32,
/// Per-CPU slab magazine pairs (one per size class).
/// The slab fast path (alloc/free) reads and updates these pairs.
/// See `MagazinePair` and `SLAB_SIZE_CLASSES` above (= 16, covering 8 B to 256 KB).
/// Each entry is a `MagazinePair` containing a loaded and a spare `SlabMagazine`.
pub slab_magazines: [MagazinePair; SLAB_SIZE_CLASSES],
/// NAPI poll budget remaining for the current poll cycle.
pub napi_budget: u32,
/// RCU nesting depth. Quiescent state is eligible for reporting when
/// this drops to 0 (see Section 3.1.3.1 for deferred reporting design).
pub rcu_nesting: u32,
/// Flag: set to `true` by `RcuReadGuard::drop()` when the outermost
/// RCU read-side critical section exits. Checked and cleared by
/// `scheduler_tick()` and `context_switch()`, which report the
/// quiescent state to the grace period machinery. This deferred
/// model avoids a per-guard-drop function call (~5-10 cycles) by
/// batching reports into the tick/switch path (~1 cycle flag write).
pub rcu_passed_quiescent: bool,
/// Timestamp of the last scheduler tick (nanoseconds, monotonic).
pub last_tick_ns: u64,
/// CPU index (redundant with register read, but avoids arch-specific
/// decoding of the register value to extract the CPU number).
pub cpu_id: u32,
/// Isolation register shadow — per-CPU cache of the current hardware
/// isolation register value. `switch_domain()` ([Section 10.2.5.1](10-drivers.md#10251-pkru-write-elision-mandatory))
/// compares the target value against this shadow and skips the hardware
/// write if they match. This is mandatory: every isolation register
/// write goes through `switch_domain()`, which enforces shadow comparison.
/// The field is architecture-specific:
/// - x86-64: PKRU value (u32, upper bits unused)
/// - AArch64: POR_EL0 value (u64, POE) or ASID (u64, page-table fallback)
/// - ARMv7: DACR value (u32)
/// - PPC64: Radix PID (u32)
/// - PPC32: not stored here (segment registers use a separate shadow array)
/// - RISC-V: not applicable (page-table isolation, no register shadow)
pub isolation_shadow: u64,
// No explicit padding field needed — `#[repr(C, align(64))]` on the struct
// ensures cache-line alignment and prevents false sharing between adjacent
// CPUs' blocks. The compiler pads to a 64-byte boundary automatically.
}
Architecture accessor functions (provided by arch::current::cpu):
/// Returns a raw pointer to the current CPU's CpuLocalBlock.
/// On x86-64: reads GS base (zero instructions — the offset is encoded
/// in subsequent gs:-prefixed loads, so this function is a no-op that
/// returns a sentinel used by the inline accessors below).
/// On AArch64: `mrs x0, tpidr_el1` (1 instruction).
/// On RISC-V: `csrr x0, sscratch; andi x0, x0, ~1` (2 instructions).
///
/// # Safety
///
/// Caller must have preemption disabled. The returned pointer is valid
/// only on the current CPU. If preemption is re-enabled, the pointer
/// may refer to another CPU's block after migration.
#[inline(always)]
pub unsafe fn cpu_local_block() -> *const CpuLocalBlock;
/// Read the current task pointer. Single instruction on x86-64.
#[inline(always)]
pub fn current_task() -> *mut TaskStruct {
// SAFETY: preemption is structurally disabled in all kernel code
// paths that call current_task() (we are in kernel mode with a
// valid stack, which implies preempt_count > 0 or IRQs off).
unsafe { (*cpu::cpu_local_block()).current_task }
}
/// Read this CPU's runqueue pointer.
#[inline(always)]
pub fn this_rq() -> *mut RunQueue {
unsafe { (*cpu::cpu_local_block()).runqueue }
}
On x86-64, current_task() compiles to a single mov %gs:0, %rax,
matching Linux's current macro exactly.
CpuLocal vs PerCpu — when to use which:
| Criterion | CpuLocal | PerCpu\<T> |
|---|---|---|
| Access cost | ~1-10 cycles (arch-dependent) | ~3-5 cycles (release) / ~20-30 cycles (debug) |
| Borrow checking | None (structural safety) | Debug-only CAS (see Section 3.1.3) |
| Data types | Fixed scalars/pointers only | Any T |
| Number of fields | ~10 (fixed at compile time) | Unlimited (one PerCpu<T> per data type) |
| Adding new fields | Requires CpuLocalBlock layout change | Just declare a new PerCpu\<T> |
| Interrupt safety | Inherent (no borrow state to corrupt) | get_mut() disables IRQs |
| Use case | current_task, runqueue, preempt_count, slab magazines | Per-CPU caches, stats, driver state |
3.1.3 PerCpu Borrow Checking: Debug-Only in Release Builds
The PerCpu<T> borrow-state CAS (Section 3.1.1) serves as a runtime bug
detector, not a safety mechanism. The actual safety guarantee comes from
the structural invariants:
1. `get()` requires `&PreemptGuard` → preemption disabled → CPU pinned.
2. `get_mut()` requires `&mut PreemptGuard` + disables IRQs → exclusive access.
3. Therefore: if the caller follows the API contract (one guard per critical section), aliased access is structurally impossible.
The CAS detects violations of rule (3) — e.g., creating two PreemptGuards
and using both to obtain &mut T. This is a logic error in the caller, not a
race condition. In release builds, this class of bug should have been caught
during development and testing.
Design decision: In release builds (cfg(not(debug_assertions))), the
borrow-state CAS is replaced with a no-op. The borrow_state array is still
allocated (for binary compatibility with debug modules), but get() and
get_mut() skip the atomic operations:
impl<T> PerCpu<T> {
pub fn get_mut<'g>(&self, guard: &'g mut PreemptGuard) -> PerCpuMutGuard<'g, T> {
let cpu = guard.cpu_id();
// SAFETY: local_irq_save() MUST be called BEFORE updating borrow_state.
// See Section 3.1.1 for the full rationale.
let saved_flags = local_irq_save();
#[cfg(debug_assertions)]
{
// Full CAS borrow-state checking — catches aliasing bugs.
let state = self.borrow_state(cpu);
if state.compare_exchange(0, u32::MAX, Ordering::Acquire, Ordering::Relaxed).is_err() {
local_irq_restore(saved_flags);
panic!("PerCpu: slot {} already borrowed", cpu);
}
}
// SAFETY: PreemptGuard proves CPU is pinned. local_irq_save()
// prevents interrupt handler interference. In debug builds, the
// CAS above additionally verifies no aliased borrows exist.
// In release builds, we trust the structural invariants.
unsafe {
PerCpuMutGuard {
value: &mut *self.data.add(cpu).as_ref().unwrap().get(),
saved_flags,
#[cfg(debug_assertions)]
borrow_state: self.borrow_state(cpu),
_guard: PhantomData,
}
}
}
}
impl<'a, T> Drop for PerCpuMutGuard<'a, T> {
fn drop(&mut self) {
#[cfg(debug_assertions)]
{
self.borrow_state.store(0, Ordering::Release);
}
local_irq_restore(self.saved_flags);
}
}
Release-mode cost: get_mut() = CPU ID lookup (~1-3 cycles) + array
index + local_irq_save/local_irq_restore (~5-10 cycles total).
Roughly 3-8 cycles per access, down from ~20-30 with the CAS.
3.1.3.1 IRQ Save/Restore Elision: get_mut_nosave()
The get_mut() function unconditionally calls local_irq_save() and
local_irq_restore() to guarantee exclusive access. On x86-64, this costs
~5-10 cycles (pushfq/cli + popfq). However, many hot paths already
have IRQs disabled at the call site:
- Hardirq handlers: IRQs disabled by hardware on entry.
- Softirq context: IRQs disabled during `do_softirq()` execution.
- Spinlock holders: `spin_lock_irqsave()` disables IRQs before acquiring.
- Context switch path: `schedule()` disables IRQs around runqueue locking.
In these contexts, the IRQ save/restore is redundant — IRQs are already off.
UmkaOS provides a proof-token variant that elides the redundant operation.
This is a core design decision, not a deferred optimization — the
IrqDisabledGuard token is woven into the type system from day one.
The two primitive operations are implemented per-arch in umka_core::arch::current::interrupts:
/// Disable local interrupts and return the previous interrupt state (flags register
/// value on x86, DAIF on AArch64, `sstatus.SIE` on RISC-V, etc.).
/// Returns an opaque `usize` whose only valid use is as the argument to `local_irq_restore`.
/// Ordering: `local_irq_save()` acts as a compiler fence — no reads/writes are
/// reordered across it. Does NOT prevent NMIs (non-maskable interrupts).
pub fn local_irq_save() -> usize { /* arch-specific */ }
/// Restore interrupt state saved by a prior `local_irq_save()`.
/// Must be called on the same CPU that called `local_irq_save()`.
/// Calling this with flags saved on a different CPU is undefined behavior.
pub unsafe fn local_irq_restore(flags: usize) { /* arch-specific */ }
Per-architecture IRQ save/restore:
The IrqDisabledGuard uses architecture-specific instructions to save and
restore the interrupt enable flag atomically. The pseudocode for each
supported architecture:
| Architecture | Save (disable IRQs, return old state) | Restore |
|---|---|---|
| x86-64 | `PUSHF; POP rax` (rax = eflags); `CLI` | `PUSH rax; POPF` |
| AArch64 | `MRS x0, DAIF; MSR DAIFSet, #0xF` | `MSR DAIF, x0` |
| ARMv7 | `MRS r0, CPSR; CPSID if` | `MSR CPSR_c, r0` |
| RISC-V 64 | `CSRRCI a0, sstatus, 0x2` (returns old sstatus, clears SIE) | `CSRW sstatus, a0` |
| PPC32 | `MFMSR r0; RLWINM r1,r0,0,17,15; MTMSR r1` | `MTMSR r0` |
| PPC64LE | `MFMSR r0; LI r1,0; RLDICL r1,r0,49,15; MTMSRD r1,1` | `MTMSRD r0,1` |
Notes:
- x86-64: PUSHF/POP captures the entire EFLAGS including IF (bit 9).
CLI clears IF. PUSH/POPF restores all flags including IF.
- AArch64: DAIF = Debug/SError/IRQ/FIQ mask bits. MSR DAIFSet, #0xF
sets all four mask bits (disables all async exceptions).
Restore writes the entire DAIF register, re-enabling only the
exceptions that were enabled before save.
- ARMv7: CPSR_c = control byte of CPSR. CPSID if = disable IRQ+FIQ.
- RISC-V: sstatus.SIE (bit 1) controls supervisor interrupt enable.
CSRRCI atomically reads old value and clears SIE in one instruction.
- PPC32: MSR.EE (bit 15) = external interrupt enable.
RLWINM with mask clears bit 15. MTMSR restores.
- PPC64LE: MTMSRD with L=1 only changes bits 48 (EE) and 62 (RI)
atomically, avoiding the need to disable and re-enable translation.
The platform-independent `IrqDisabledGuard::new()` calls `arch::current::interrupts::local_irq_save()`, which expands to the appropriate instruction sequence above. This is a zero-cost abstraction: the compiler selects the single-architecture path at compile time. `IrqDisabledGuard` is the RAII wrapper around the saved flags; its Drop implementation calls the per-arch restore sequence, ensuring correct pairing:
/// Proof token that IRQs are disabled on the current CPU. Created by
/// `local_irq_save()` or `irq_disabled_guard()` (the latter asserts
/// that IRQs are already disabled and constructs the token without
/// a redundant CLI). The token is `!Send` and `!Sync` — it cannot
/// be transferred to another CPU where IRQ state may differ.
///
/// **Implies preemption disabled.** Disabling hardware IRQs masks the
/// scheduler tick interrupt, so the scheduler cannot preempt the running
/// task while this guard is held. Code holding `IrqDisabledGuard` therefore
/// has the same preemption-safety guarantee as code holding `PreemptGuard`.
/// The converse is NOT true: `PreemptGuard` alone (via `preempt_disable()`)
/// does not disable IRQs — interrupt handlers can still fire and must not
/// call `get_mut_nosave()` unless they hold their own `IrqDisabledGuard`.
pub struct IrqDisabledGuard {
/// Saved interrupt flags. If this guard was created by
/// `local_irq_save()`, Drop restores them. If created by
/// `irq_disabled_guard()` (assertion-only), Drop is a no-op.
saved_flags: Option<usize>,
_not_send: PhantomData<*const ()>,
}
impl Drop for IrqDisabledGuard {
fn drop(&mut self) {
if let Some(flags) = self.saved_flags {
// SAFETY: restoring flags that we saved earlier.
unsafe { local_irq_restore(flags) };
}
// If saved_flags is None, IRQs were already disabled at creation
// time and the caller is responsible for their eventual restoration.
}
}
/// Assert that IRQs are already disabled and obtain a proof token.
/// Panics (debug) if IRQs are actually enabled. In release, this is
/// a zero-cost construction — the flag check compiles to nothing.
pub fn irq_disabled_guard() -> IrqDisabledGuard {
#[cfg(debug_assertions)]
{
assert!(
!arch::current::interrupts::are_enabled(),
"irq_disabled_guard() called with IRQs enabled"
);
}
IrqDisabledGuard {
saved_flags: None,
_not_send: PhantomData,
}
}
The PerCpu<T> variant that accepts the proof token:
/// Guard returned by `PerCpu::get_mut_nosave()`.
///
/// Holds a mutable reference to the per-CPU value. Unlike `PerCpuMutGuard<T>`,
/// this guard does NOT save/restore IRQ flags: the caller already holds an
/// `IrqDisabledGuard` proving IRQs are disabled, so saving/restoring them
/// again would be redundant (~5-10 cycles saved on x86-64 per access).
///
/// # Safety invariant
/// The caller must ensure `IrqDisabledGuard` remains live for the entire
/// lifetime `'a` of this guard. Dropping the `IrqDisabledGuard` while this
/// guard is alive would re-enable IRQs, allowing interrupt handlers to race
/// on the same per-CPU slot — undefined behaviour.
///
/// The `PhantomData<&'a IrqDisabledGuard>` field enforces this at the type
/// level: the borrow checker will not allow the `IrqDisabledGuard` to be
/// consumed (moved or dropped) while a `PerCpuMutRefNosave` derived from it
/// is still in scope.
///
/// # Drop behaviour
/// `Drop` only clears the debug-mode borrow state (sets the per-slot
/// `borrow_state` back to `0`). It does NOT call `local_irq_restore()` —
/// that is the caller's responsibility via their `IrqDisabledGuard`.
pub struct PerCpuMutRefNosave<'a, T: 'a> {
value: &'a mut T,
/// Proof that the caller holds an `IrqDisabledGuard` for lifetime `'a`.
_irq: PhantomData<&'a IrqDisabledGuard>,
/// Debug-builds only: reference to the per-slot borrow-state counter so
/// that `Drop` can reset it to `0` (free), allowing subsequent callers
/// to detect aliasing. Elided in release builds — the structural
/// invariants (IRQs disabled + preemption disabled) are sufficient.
#[cfg(debug_assertions)]
borrow_state: &'a AtomicU32,
/// Ties guard lifetime to the `PreemptGuard` (via `get_mut_nosave` signature).
_guard: PhantomData<&'a mut PreemptGuard>,
}
impl<'a, T> core::ops::Deref for PerCpuMutRefNosave<'a, T> {
type Target = T;
fn deref(&self) -> &T { self.value }
}
impl<'a, T> core::ops::DerefMut for PerCpuMutRefNosave<'a, T> {
fn deref_mut(&mut self) -> &mut T { self.value }
}
/// `PerCpuMutRefNosave` must NOT be sent to another CPU/thread.
/// The contained `&'a mut T` is a per-CPU slot reference; moving it to a
/// different thread would allow that thread to alias it without holding the
/// required `IrqDisabledGuard` or `PreemptGuard`.
impl<T> !Send for PerCpuMutRefNosave<'_, T> {}
impl<T> PerCpu<T> {
/// Like `get_mut()`, but skips `local_irq_save()`/`local_irq_restore()`.
/// The caller must prove IRQs are already disabled by providing an
/// `IrqDisabledGuard`. This saves ~5-10 cycles on x86-64 (no `pushfq`/
/// `cli`/`popfq`) and ~3-8 cycles on ARM (no `mrs`/`msr` DAIF).
///
/// # When to use
///
/// Use `get_mut_nosave()` instead of `get_mut()` when:
/// - Inside a hardirq or softirq handler (IRQs disabled by hardware).
/// - Holding a `SpinLockGuard` from `spin_lock_irqsave()`.
/// - In the context switch path (scheduler holds IRQ-disabling lock).
/// - In any code path where an `IrqDisabledGuard` is already available.
///
/// In all other contexts, use `get_mut()` which manages IRQ state itself.
///
/// # Why `&IrqDisabledGuard`, not `&PreemptGuard`
///
/// `IrqDisabledGuard` is a strictly stronger guarantee than `PreemptGuard`:
/// it subsumes preemption disabling (tick IRQ is masked) while also preventing
/// interrupt handlers from racing on the same per-CPU slot. `PreemptGuard`
/// alone would not be sufficient — an IRQ could fire between the CPU-id read
/// and the slot access, running a handler that mutably accesses the same slot.
pub fn get_mut_nosave<'g>(
&self,
guard: &'g mut PreemptGuard,
_irq: &'g IrqDisabledGuard,
) -> PerCpuMutRefNosave<'g, T> {
let cpu = guard.cpu_id();
#[cfg(debug_assertions)]
{
let state = self.borrow_state(cpu);
if state.compare_exchange(0, u32::MAX, Ordering::Acquire, Ordering::Relaxed).is_err() {
panic!("PerCpu: slot {} already borrowed", cpu);
}
}
// No local_irq_save() — the IrqDisabledGuard proves IRQs are off.
// SAFETY: PreemptGuard pins to this CPU. IrqDisabledGuard proves
// IRQs disabled. Together, these guarantee exclusive access.
unsafe {
PerCpuMutRefNosave {
value: &mut *self.data.add(cpu).as_ref().unwrap().get(),
_irq: PhantomData,
#[cfg(debug_assertions)]
borrow_state: self.borrow_state(cpu),
_guard: PhantomData,
}
}
}
}
/// Mutable reference guard that does NOT save/restore IRQ flags on drop.
/// Created by `get_mut_nosave()`. Drop only clears the debug borrow state.
impl<'a, T> Drop for PerCpuMutRefNosave<'a, T> {
fn drop(&mut self) {
#[cfg(debug_assertions)]
{
self.borrow_state.store(0, Ordering::Release);
}
// No local_irq_restore() — caller's IrqDisabledGuard manages IRQ state.
}
}
Cost comparison (x86-64, release builds):
| Variant | IRQ management | Total cycles | Use case |
|---|---|---|---|
| `get_mut()` | `pushfq`/`cli` + `popfq` | ~3-8 | General per-CPU mutation |
| `get_mut_nosave()` | None (proof token) | ~1-3 | Hardirq, softirq, spinlock, scheduler |
On hot paths (NVMe interrupt handler, scheduler tick, NAPI poll), the IRQ
elision saves ~5-10 cycles per PerCpu access. With 1-2 PerCpu accesses
per NVMe completion, this yields ~5-20 cycles saved per I/O operation.
3.1.4 Cumulative Performance Budget
The UmkaOS overhead-reduction techniques — CpuLocal (Section 3.1.2),
debug-only PerCpu CAS (Section 3.1.3), IRQ elision (Section 3.1.3.1), RCU deferred
quiescent state (Section 3.1.1, RcuReadGuard::drop), PKRU shadow elision
(Section 10.2.5.1), capability amortization (Section 8.1.1.1), and doorbell
coalescing (Section 10.6) — are all core design decisions implemented
from day one, not deferred optimizations. Below is the cumulative
overhead analysis for the three hottest I/O paths with all techniques active:
NVMe 4KB random read (~10μs = ~25,000 cycles at 2.5 GHz):
| Source | Cycles | Notes |
|---|---|---|
| MPK domain switches (×4, shadow-elided) | ~47-92 | Shadow elides 1-2 of 4 WRPKRU on back-to-back transitions |
| CpuLocal slab magazine access (×2) | ~2-8 | alloc + free, arch-dependent |
| PerCpu (nosave) in IRQ context (×1) | ~1-3 | IRQ-disabled proof token, no pushfq/popfq |
| RCU quiescent flag (×1) | ~1 | CpuLocal bool write, deferred to tick |
| Capability validation (×1, amortized) | ~7-14 | ValidatedCap token, 3-4 sub-calls use cached |
| Doorbell write (×1, amortized over batch-32) | ~5 | Single MMIO write for entire batch |
| Total additional over Linux | ~63-123 | |
| % of 10μs operation | ~0.25-0.5% | |
TCP RX packet (~5μs per-packet, NAPI batch-64):
| Source | Cycles | Notes |
|---|---|---|
| MPK domain switches (×4, shadow-elided, amortized/64) | ~0.7-1.4/pkt | Shadow + NAPI batching |
| CpuLocal NAPI budget + socket access | ~2-6/pkt | |
| RCU conntrack guard (×1) | ~0.02/pkt | CpuLocal flag write, amortized |
| Capability (amortized over NAPI batch) | ~0.1-0.2/pkt | One ValidatedCap per NAPI poll cycle |
| Total additional per packet | ~3-8 | |
| % of 5μs operation | ~0.02-0.06% | per packet, with batching |
| % without batching (batch=1) | ~1.5-2.2% | worst case, still improved |
Context switch (~2μs = ~5,000 cycles):
| Source | Cycles | Notes |
|---|---|---|
| CpuLocal runqueue access | ~1-4 | pick_next_task |
| Isolation shadow save/restore | ~2-4 | Memory writes only, no WRPKRU |
| RCU quiescent report (batched) | ~3-5 | Check flag + report, once per switch |
| Total additional | ~6-13 | integer workload, no FPU |
| % of ~2μs switch | ~0.1-0.3% | |
Cumulative worst case (nginx-like: receive + read + send + switch):
TCP RX (NAPI-64): ~0.04%
NVMe read (batched): ~0.35%
TCP TX (batched): ~0.25%
Context switches: ~0.15%
──────────────────────────
Compound: ~0.8% (4.2% headroom under 5% budget)
The ~4.2% headroom absorbs the unknown unknowns that inevitably emerge during implementation — cache misses, branch mispredictions, compiler-generated code quality variations, and second-order effects from subsystem interactions.
Per-architecture variation (cumulative overhead, same nginx workload):
| Architecture | CpuLocal cost | PerCpu cost | Isolation cost | Shadow savings | Total |
|---|---|---|---|---|---|
| x86-64 | ~1 cycle | ~1-3 (nosave) | ~23 cycles/WRPKRU | ~23-46 elided | ~0.7-1.2% |
| AArch64 (POE) | ~2-4 cycles | ~3-6 (nosave) | ~40-80/MSR | ~40-80 elided | ~1.2-2.0% |
| AArch64 (page table) | ~2-4 cycles | ~3-6 (nosave) | ~150-300/switch | N/A (TLB) | ~3-6% |
| ARMv7 | ~3-5 cycles | ~4-8 (nosave) | ~10-20/MCR | ~10-20 elided | ~1.2-2.0% |
| RISC-V | ~5-10 cycles | ~6-12 (nosave) | ~200-500/PT | N/A (TLB) | ~4-10% |
| PPC64 | ~1-3 cycles | ~2-5 (nosave) | ~30-60/mtspr | ~30-60 elided | ~0.8-1.5% |
| PPC32 | ~3-6 cycles | ~5-10 (nosave) | ~10-30/mtsr | ~10-30 elided | ~1.2-2.5% |
On x86-64, AArch64 with POE, ARMv7, PPC64, and PPC32: comfortably within the 5% budget with substantial headroom. On AArch64 without POE and RISC-V: isolation cost dominates, and the per-CPU/shadow optimizations matter less — the bottleneck is page-table-based domain switching, not per-CPU access.
Optimization summary (all implemented from day one, not deferred):
| Technique | Section | Cycles saved per I/O | Cumulative impact |
|---|---|---|---|
| CpuLocal register-based access | Section 3.1.2 | ~15-25 (vs old PerCpu CAS) | Major: hottest paths |
| Debug-only PerCpu CAS | Section 3.1.3 | ~15-20 (release elision) | Major: all PerCpu paths |
| IRQ save/restore elision | Section 3.1.3.1 | ~5-10 (per get_mut) | Moderate: IRQ/spinlock paths |
| RCU deferred quiescent state | Section 3.1.1 | ~4-9 (per outermost drop) | Moderate: all RCU paths |
| Isolation shadow elision | 03a, Section 10.2 | ~23-80 (per elided write) | Major: back-to-back switches |
| Capability amortization | Section 8.1.1.1 | ~8-36 (per KABI dispatch) | Moderate: all KABI calls |
| Doorbell coalescing | Section 10.6 | ~145/cmd (batch-32) | Major: batched NVMe/virtio |
RCU interaction:
`KabiDispatchGuard` (which scopes `ValidatedCap<'dispatch>`) holds an RCU read-side lock for the duration of every KABI dispatch — see Section 8.1.1.1. This means every KABI call adds one RCU nesting level. The consequence for the concurrency model: capability revocations (`rcu_call` callbacks) cannot complete while a KABI dispatch is in progress on any CPU. Long KABI dispatches therefore increase RCU grace period latency, which is bounded by the maximum KABI call duration (~100 μs worst case for NVMe completion). Designers adding new KABI interfaces must keep dispatch handlers short; blocking operations (sleeping, waiting on locks) inside a `KabiDispatchGuard` scope are prohibited.
/// RCU in Rust: zero-lock read path, deferred reclamation.
/// Readers hold an RcuReadGuard (analogous to rcu_read_lock).
/// Writers swap the pointer atomically and defer freeing the old
/// value until all readers have exited their critical sections.
///
/// **Clone-and-swap write pattern**: The canonical way to make incremental
/// changes to RCU-protected state (e.g., adding a route to a routing table,
/// registering a service in a registry) is:
/// 1. Acquire the write-side lock (`Mutex<()>` co-located with the `RcuCell`).
/// 2. Read the current value via `cell.read()`.
/// 3. Clone the current value: `let mut new_val = (*current).clone();`
/// 4. Modify `new_val` as needed.
/// 5. Publish: `cell.update(new_val, &guard)`.
/// The old value is freed after the RCU grace period (all readers that saw
/// the old pointer have exited their critical sections). This is not a
/// performance concern for infrequent writes (service registration, config
/// updates, interface addition) — the cost is acceptable because these
/// operations happen at driver-load time or in response to admin actions.
/// The read path is always lock-free.
pub struct RcuCell<T: Send + Sync> {
ptr: AtomicPtr<T>,
}
impl<T: Send + Sync> RcuCell<T> {
/// Create a new RcuCell with an initial value. The value is heap-allocated
/// via `Box::try_new` (fallible) and the RcuCell takes ownership of the
/// raw pointer. Returns `Err(KernelError::OutOfMemory)` if allocation
/// fails. The pointer is always non-null after successful construction.
///
/// All RcuCell allocation is fallible. Callers must handle OutOfMemory.
pub fn new(value: T) -> Result<Self, KernelError> {
let ptr = Box::try_new(value).map_err(|_| KernelError::OutOfMemory)?;
Ok(Self {
ptr: AtomicPtr::new(Box::into_raw(ptr)),
})
}
pub fn read<'a>(&'a self, _guard: &'a RcuReadGuard) -> &'a T {
unsafe { &*self.ptr.load(Ordering::Acquire) }
}
/// Atomically replace the value. The old value is scheduled for deferred
/// freeing after all current RCU read-side critical sections complete
/// (grace period). The caller must NOT access the old value after this
/// call — RCU owns it and will free it asynchronously.
///
/// **Writer serialization**: Takes `&self` (not `&mut self`) plus a
/// `MutexGuard` proof token that demonstrates the caller holds the
/// write-side lock. This is the standard RCU pattern: readers are
/// lock-free via `read()`, writers serialize through an external mutex.
///
/// **Why `&self` instead of `&mut self`**: RCU cells are typically
/// stored in global/static data structures (routing tables, config
/// registries, module lists) accessed concurrently by many readers.
/// Requiring `&mut self` would force the caller to hold an exclusive
/// reference to the entire `RcuCell`, which is impractical for global
/// state — it would require wrapping the `RcuCell` in a `Mutex` or
/// `RwLock` that also blocks readers, defeating the purpose of RCU.
/// With `&self` + lock proof, the `RcuCell` can live in a shared
/// context (e.g., `static`, `Arc`, or behind `&`), readers access it
/// without any lock, and writers prove serialization by passing the
/// `MutexGuard` token. The `MutexGuard` lifetime ensures the lock is
/// held for the duration of the `update()` call but does not restrict
/// access to the `RcuCell` itself.
///
/// Concurrent writers without the lock would race on the swap: writer A
/// swaps old->new_A, writer B swaps new_A->new_B, then writer B defers
/// freeing new_A — which was just published and may have active readers.
/// The `MutexGuard` proof prevents this at compile time.
///
/// All RcuCell allocation is fallible. Returns `Err(KernelError::OutOfMemory)`
/// if allocation fails; the existing value is left unchanged in that case.
pub fn update(
&self,
new_value: T,
_writer_lock: &MutexGuard<'_, ()>,
) -> Result<(), KernelError> {
let new_box = Box::try_new(new_value).map_err(|_| KernelError::OutOfMemory)?;
let old = self.ptr.swap(Box::into_raw(new_box), Ordering::Release);
// Schedule old value for deferred freeing after grace period.
// rcu_defer_free takes ownership of the raw pointer and will
// reconstruct the Box and drop it after the grace period elapses.
// SAFETY: `old` was created by a previous `Box::into_raw()` call
// (either in `new()` or a prior `update()`). The writer lock
// guarantees no concurrent `update()` can swap the same pointer
// twice. After this call, the caller must not access `old`.
unsafe { rcu_defer_free(old) };
Ok(())
}
}
// Implementation note on MutexGuard proof tokens: The `_writer_lock` parameter
// ensures single-writer semantics at compile time. In the actual implementation,
// the `Mutex<()>` must be embedded in the same struct that contains the `RcuCell`,
// or in a per-instance wrapper, so that each RcuCell has its own dedicated mutex.
// Passing an unrelated mutex guard would compile but violate the invariant. This is
// enforced structurally: the kernel's RCU-protected data structures always pair their
// `RcuCell` and `Mutex` in the same struct (e.g., `struct RcuProtected<T> { cell:
// RcuCell<T>, writer_lock: Mutex<()> }`), and the `update()` call site acquires the
// co-located lock. This pattern is standard in Rust kernel design (similar to how
// Linux's `struct rcu_head` is always embedded in the protected struct).
// Note: `RcuCell<T>` does NOT block in its `Drop` implementation.
// Calling `rcu_synchronize()` (a blocking wait) from `Drop` would be
// unsafe in contexts where blocking is illegal: interrupt handlers, code
// executing under a spinlock, NMI handlers, or any other atomic context.
// Because `RcuCell` values can be dropped from any of these contexts
// (e.g., a global `RcuCell` freed during module unload while holding a lock),
// `Drop` uses `rcu_call()` (deferred callback) instead. The current pointer
// is enqueued for deferred freeing via the RCU callback mechanism; the actual
// `Box::drop` runs in the RCU grace period worker thread, which executes in
// a fully schedulable, non-atomic context. This matches the pattern used by
// `update()`, which also defers old-value freeing via `rcu_defer_free()`.
impl<T: Send + Sync> Drop for RcuCell<T> {
fn drop(&mut self) {
// SAFETY: We have &mut self (exclusive access). This guarantees no
// concurrent writers (update() takes &self + MutexGuard, but &mut self
// is incompatible with any shared reference). Readers may still hold
// &T references obtained via read(); rcu_call() defers the actual
// Box::drop until all pre-existing RCU read-side critical sections
// complete (the grace period), ensuring no live references to the
// pointed-to value remain before it is freed. The pointer was created
// by Box::into_raw() in new() or update() and has not yet been passed
// to rcu_defer_free() (only old values replaced by update() are
// deferred there). This Drop path covers only the *current* (final)
// value still held by the RcuCell at destruction time.
//
// Using rcu_call() (non-blocking enqueue) rather than rcu_synchronize()
// (blocking wait) is mandatory here: Drop can be invoked from atomic
// contexts (interrupt handlers, spinlock-held paths, etc.) where
// blocking would deadlock or corrupt kernel state.
let ptr = self.ptr.load(Ordering::Relaxed);
if !ptr.is_null() {
// SAFETY: ptr was created by Box::into_raw() and is non-null.
// rcu_call takes ownership of the raw pointer and will reconstruct
// the Box and drop it after the grace period elapses in the RCU
// callback worker thread.
unsafe fn drop_box<T>(ptr: *mut ()) {
// SAFETY: ptr was created by Box::into_raw::<T>() and is only
// passed to this callback once, after the RCU grace period.
drop(unsafe { Box::from_raw(ptr as *mut T) });
}
// rcu_call() is a safe fn returning Result. A RingFull error here
// would leak the value — the only safe fallback, since Drop cannot
// block or propagate errors.
let _ = rcu_call(drop_box::<T>, ptr as *mut ());
}
}
}
RCU Grace Period Detection:
The grace period mechanism determines when all pre-existing RCU read-side critical
sections have completed, making it safe to execute deferred callbacks (such as freeing
old values swapped out by RcuCell::update()).
UmkaOS uses per-CPU quiescent state tracking.
Data structures:
/// `WaitQueue` is an alias for `WaitQueueHead` — the type used for blocking
/// waiters on grace period completion. Using the alias keeps RCU-specific
/// code readable without importing internal queue type names.
pub type WaitQueue = WaitQueueHead;
/// Global RCU state — one instance, protected by rcu_lock.
pub struct RcuGlobal {
/// Monotonically increasing grace period number.
/// Even = quiescent period stable; odd = grace period in progress.
///
/// **Grace Period Counter Semantics**
///
/// `gp_seq` is incremented by 1 at grace-period start (even → odd) and again
/// by 1 at grace-period end (odd → even), so each complete grace period advances
/// it by 2. It is never decremented.
///
/// *Wraparound*: each complete grace period advances the counter by 2, so at the
/// theoretical maximum rate of one grace period per 10 μs the `u64` counter
/// wraps after roughly 2.9 million years. No special wraparound handling is
/// required for `u64`.
///
/// *Comparison semantics*: any code that compares two `gp_seq` values to decide
/// whether grace period A completed before grace period B **must** use wrapping
/// subtraction rather than a direct `<` comparison. Direct comparison is
/// incorrect near the wraparound boundary (irrelevant for `u64` in practice,
/// but required by design so that a future space-saving change to `u32` — where
/// wraparound occurs in roughly six hours at 100 K GP/s (2^32 increments at
/// two per grace period) — cannot silently introduce a correctness bug):
///
/// ```rust
/// /// Returns true if grace period `a` completed strictly before `b`.
/// #[inline]
/// fn gp_before(a: u64, b: u64) -> bool {
/// (b.wrapping_sub(a) as i64) > 0
/// }
/// ```
///
/// All callers that compare grace period sequence numbers (e.g.,
/// `rcu_synchronize` checking whether its target sequence has been reached,
/// `rcu_gp_thread` waking waiters) must use `gp_before` or an equivalent
/// wrapping form. The same pattern is used by Linux's `time_before` /
/// `time_after` macros for the same reason.
pub gp_seq: AtomicU64,
/// Per-CPU quiescent state flags. Initialized to `false` (all-pending) at grace
/// period start. Each CPU sets its slot to `true` when it passes a quiescent
/// state (preemption re-enabled after preempt-off section, context switch,
/// or idle entry). The grace period completes when all slots equal `true`.
/// Slots are reset to `false` at the start of each new grace period.
/// Dynamically allocated at boot with num_possible_cpus() entries — no compile-time
/// MAX_CPUS constant (matches the PerCpu<T> dynamic-sizing guarantee).
///
/// `AtomicBool` uses 1 byte per CPU (vs 8 for `AtomicU64`) and maps to native
/// single-instruction atomics on all supported architectures including ARMv7
/// (LDREXB) and PPC32 (byte-sized lwarx). `AtomicU64` on 32-bit architectures
/// requires locking sequences; `AtomicBool` avoids this overhead entirely.
pub gp_start_qs: Box<[AtomicBool]>,
/// Grace period waitqueue — tasks blocked in `rcu_synchronize()` sleep here
/// until the grace period thread wakes them after gp_seq advances past their
/// target sequence number.
pub gp_waitqueue: WaitQueue,
}
// NOTE: RCU callback rings are **per-CPU**, not global. Each CPU maintains its
// own `pending_callbacks` and `next_callbacks` rings in `RcuPerCpu` (below).
// This eliminates global serialization on the callback-enqueue path — `rcu_call()`
// only touches the calling CPU's ring under a short preemption-disabled section,
// with no cross-CPU lock contention.
/// Fixed-capacity ring of RCU callbacks. **Per-CPU** — each CPU has its own
/// `pending_callbacks` and `next_callbacks` rings (stored in `RcuPerCpu`), so
/// `rcu_call()` never contends with other CPUs on the enqueue path.
///
/// Capacity = 4096 entries per CPU (typical system drains well under 256 per GP).
/// Pre-allocated at boot during per-CPU initialization (no runtime allocation).
///
/// Design: UmkaOS uses a typed, pre-allocated ring rather than Linux's intrusive
/// linked-list (`rcu_head` embedded in objects). This eliminates the need for
/// `container_of` pointer arithmetic and gives a predictable allocation profile.
/// Callers pass a closure-style `(fn(*mut ()), *mut ())` pair; the receiving side
/// calls `func(data)` after the grace period. Objects being freed pre-register their
/// cleanup function at `rcu_call()` time rather than embedding a list node.
///
/// **Overflow policy**: If a CPU's ring is full when `rcu_call()` is invoked (4096
/// callbacks pending), the behavior depends on the calling context:
///
/// - **Task context** (preempt_count == 0, IRQs enabled): `rcu_call()` calls
/// `rcu_synchronize()` to block until the current grace period completes, then
/// invokes `func(data)` directly. A warning is logged to flag the condition.
///
/// - **Atomic context** (IRQs disabled or preempt_count > 0): Blocking is forbidden.
/// `rcu_call()` writes the callback into `RcuPerCpu::overflow_buf`, a
/// `RCU_OVERFLOW_BUF_CAPACITY`-slot pre-allocated emergency buffer. The buffer is
/// drained at the next timer tick by `rcu_tick_drain_overflow()` once ring space
/// is available. If both ring and overflow buffer are full, the callback is dropped
/// and an error is logged — this is a catastrophic condition indicating a persistent
/// RCU stall, logged at `log::error!` severity.
///
/// A warning is logged whenever the main ring is full, as it indicates either a grace
/// period stall or an unusually high callback production rate that may need tuning
/// (e.g., increasing the grace period thread priority or reducing batch sizes).
pub struct RcuCallbackRing {
/// Inline callback storage, allocated at boot (no runtime allocation).
// SAFETY: Backing memory is allocated from the boot allocator (static kernel
// lifetime). Never freed. Use raw pointer instead of Box to avoid UB on drop.
entries: *mut [RcuCallback],
head: usize,
tail: usize,
}
/// A single RCU callback entry.
pub struct RcuCallback {
/// Cleanup function. Called with `data` after the grace period.
func: unsafe fn(*mut ()),
/// Opaque pointer to the object being cleaned up (e.g., raw pointer to Box contents).
data: *mut (),
}
/// Per-CPU RCU state (stored in the per-CPU data region, zero-allocation access).
///
/// Each CPU has its own independent callback rings, quiescent state counter, and
/// nesting tracker. This per-CPU design eliminates global serialization on the
/// callback-enqueue path — `rcu_call()` only touches the local CPU's state.
pub struct RcuPerCpu {
/// Quiescent state counter — incremented at every quiescent point
/// (context switch, idle entry, explicit rcu_quiescent_state() call).
/// Monotonically increasing; wraps after 2^64 (~never).
pub qs_count: u64,
/// Set by RcuReadGuard::drop() when outermost guard is released.
/// Cleared by the scheduler tick after reporting to RcuGlobal.
/// (Same field as CpuLocal::rcu_passed_quiescent.)
pub passed_quiescent: bool,
/// Nesting depth of active RcuReadGuards on this CPU.
pub nesting: u32,
/// Emergency overflow buffer: used when the main ring is full AND the caller is in
/// atomic context (IRQs disabled or preempt_count > 0). Blocking (`rcu_synchronize`)
/// is forbidden in atomic context, so callbacks are staged here instead.
/// Drained at the next timer tick or scheduler tick via `rcu_tick_drain_overflow()`.
/// Pre-allocated at boot; never heap-allocates on the enqueue path.
///
pub overflow_buf: [MaybeUninit<RcuCallback>; RCU_OVERFLOW_BUF_CAPACITY],
/// Number of valid entries in `overflow_buf`. `u8` is sufficient:
/// `RCU_OVERFLOW_BUF_CAPACITY` (64) < 256.
pub overflow_len: u8,
/// Per-CPU pending callbacks (registered via `rcu_call()`).
/// Moved to `next_callbacks` at grace period start; executed at grace period end.
/// Each ring is a pre-allocated 4096-slot `RcuCallbackRing` — no heap allocation
/// on the enqueue path. Protected by a per-CPU spinlock (held only with
/// preemption disabled, so no cross-CPU contention).
pub pending_callbacks: SpinLock<RcuCallbackRing>,
/// Callbacks promoted from `pending_callbacks` at the start of the current
/// grace period. The grace period thread collects these from all CPUs after
/// the grace period completes and executes them in batch.
pub next_callbacks: SpinLock<RcuCallbackRing>,
}
Memory ordering requirements:
On RcuReadGuard acquisition (rcu_read_lock()):
No barrier needed. The read-side section is a pure software contract.
Hardware TSO (x86) and load-acquire semantics (ARM, RISC-V) ensure that
loads within the critical section see all stores that completed before
rcu_read_lock() was called.
On RcuReadGuard drop (rcu_read_unlock()):
Release store to rcu_passed_quiescent (CpuLocal).
RISC-V: fence rw,w (release semantics for the store).
AArch64: stlr (store-release).
x86: plain store (TSO provides release semantics for free).
This Release store ensures that all RCU-protected loads within the critical
section are visible to other CPUs before the quiescent state is advertised.
On grace period start (rcu_synchronize / rcu_call):
Acquire fence before reading per-CPU qs_count snapshots.
This ensures the snapshot reads see all prior RcuReadGuard drops.
On grace period completion (callbacks executed):
Full fence (smp_mb) before executing callbacks.
This ensures callbacks see all memory stores made by RCU-protected writers
before the grace period started.
- Each CPU maintains RcuPerCpu::qs_count, incremented each time the CPU passes
  through a quiescent point — a state where the CPU is guaranteed not to hold
  any RCU read-side references.
- Quiescent points occur at: (1) context switch (the outgoing task releases all
  RcuReadGuards before being descheduled), (2) idle entry (a CPU entering the
  idle loop has no active critical sections), and (3) explicit
  rcu_quiescent_state() calls in long-running kernel loops that do not hold RCU
  references.
- A grace period begins when rcu_synchronize() or rcu_defer_free() is called.
  The RCU subsystem snapshots all per-CPU quiescent state counters at that moment.
- The grace period ends when every CPU has passed through at least one quiescent
  point since the snapshot (i.e., every CPU's counter has advanced). At that
  point, no CPU can still hold a reference to data that was visible before the
  grace period started.
- Once the grace period ends, all deferred callbacks registered before or during
  that grace period are executed (freeing old pointers, running destructors, etc.).
- Grace period detection is batched: multiple rcu_defer_free() calls are
  coalesced into the same grace period to amortize the per-CPU polling overhead.
- KABI boundary crossing constitutes an RCU quiescent state: every KABI vtable
  call entry and return is treated as a quiescent point for the calling CPU.
  This ensures that Tier 1 drivers that return from KABI calls within bounded
  time (enforced by the per-call timeout watchdog, Section 10.5.5.3) cannot
  block RCU grace period completion indefinitely. Drivers that perform
  long-polling loops must call rcu_quiescent_state() at each poll iteration, or
  use the KABI polling helper kabi_poll_wait(), which includes an implicit
  quiescent state.
/// RCU read-side guard. Obtained via `rcu_read_lock()`, released via Drop.
///
/// This is a zero-cost marker type — it does not perform any atomic operations
/// or memory barriers on acquisition. The RCU read-side critical section is
/// purely a contract with the grace period detection mechanism: as long as any
/// CPU holds an `RcuReadGuard`, the current grace period cannot complete.
///
/// The guard is `!Send` because RCU read-side sections are per-CPU — the quiescent
/// state tracking (Section 3.1.1, "RCU Grace Period Detection") is CPU-local.
/// Sending an `RcuReadGuard` to another thread would allow the grace period
/// detection to miss an active reader.
///
/// # Example
/// ```rust
/// let guard = rcu_read_lock();
/// // Within this scope, any RCU-protected data can be safely read.
/// // The grace period will not complete until this guard is dropped.
/// let value = rcu_cell.read(&guard);
/// // guard dropped here; this CPU may now pass through a quiescent point
/// ```
pub struct RcuReadGuard {
/// CPU ID on which this guard was acquired. Used for debug assertions
/// and the KRL timeout mitigation (Section 8.2.8). Not used for
/// grace period tracking — that is handled by per-CPU quiescent state counters.
_cpu_id: u32,
/// Nesting depth. RCU read-side critical sections may be nested:
/// inner guards increment this counter, and only the outermost guard's
/// drop reports a quiescent state. This prevents a nested drop from
/// prematurely signaling quiescence while the outer critical section
/// still holds RCU-protected references.
_nesting: u32,
/// Marker to prevent Send/Sync auto-traits.
_not_send: PhantomData<*const ()>,
}
impl Drop for RcuReadGuard {
fn drop(&mut self) {
// Decrement the per-CPU nesting counter via CpuLocal (Section 3.1.2).
// Only the outermost guard (nesting reaches 0) needs further action.
// Nested RCU read sections work correctly — an inner guard's drop
// does NOT affect quiescent state while the outer section is active.
let nesting = cpu_local::rcu_nesting_dec();
if nesting > 0 {
return; // Still inside an outer RCU read-side critical section.
}
// Outermost guard dropped — set the per-CPU "passed quiescent point"
// flag. This is a CpuLocal boolean write (~1 cycle), NOT an immediate
// report to the grace period machinery. The actual quiescent state
// report is deferred to the next scheduler tick or context switch,
// which are the natural quiescent checkpoints (see below).
//
// On architectures with weak memory ordering, a Release store is
// used to prevent RCU-protected accesses from being reordered past
// the guard's drop point.
cpu_local::set_rcu_passed_quiescent(true);
// === Design rationale: deferred quiescent state reporting ===
//
// **Why NOT report immediately on every outermost drop**:
// The previous design called `rcu_note_quiescent_state()` here — a
// function call + per-CPU atomic store (~5-10 cycles). On NVMe paths
// with frequent short RCU sections (conntrack lookup, routing table
// lookup), this adds ~5-10 cycles per I/O. Across millions of IOPS,
// the overhead is measurable.
//
// **How deferred reporting works**:
// 1. `RcuReadGuard::drop()` sets `cpu_local.rcu_passed_quiescent = true`
// (~1 cycle CpuLocal write, no function call, no atomic).
// 2. The scheduler tick handler (`scheduler_tick()`, running at HZ=1000)
// checks the flag and, if set, calls `rcu_report_quiescent_state()`
// to notify the grace period machinery. This batches all quiescent
// state reports from the last 1ms into a single function call.
// 3. `context_switch()` also checks and reports — every voluntary or
// involuntary context switch is a quiescent point regardless.
// 4. `cpu_idle_enter()` reports unconditionally — idle is always quiescent.
//
// **Grace period stall prevention for CPU-bound kernel threads**:
// A CPU-bound kernel thread that never context-switches would, in a
// naive deferred model, block RCU grace periods indefinitely. UmkaOS
// prevents this through the scheduler tick: even CPU-bound threads
// receive timer ticks at HZ=1000 (1ms intervals). The tick handler
// checks `rcu_passed_quiescent` and reports. This guarantees that no
// CPU goes more than ~1ms without reporting a quiescent state, which
// is well within acceptable grace period latency (~10-100ms typical).
//
// For `nohz_full` CPUs (tickless, used for RT isolation per Section 7.2.5),
// the tick is suppressed. These CPUs report quiescent state via:
// (a) context switches (if they occur), or
// (b) the `rcu_nocbs` mechanism — RCU callbacks are offloaded to a
// designated non-isolated CPU, and the isolated CPU's quiescent
// state is inferred from its lack of RCU activity (matching
// Linux's `rcu_nocbs` behavior exactly).
//
// **Comparison with Linux**: In non-preemptible Linux kernels,
// `rcu_read_unlock()` generates zero code — quiescent states are
// inferred entirely from context switches, idle, and usermode return.
// UmkaOS's approach now matches Linux's model: the outermost drop sets
// a lightweight per-CPU flag, and the actual report is deferred to
// tick/switch checkpoints. The per-flag-write cost (~1 cycle) is
// effectively zero compared to the prior ~5-10 cycle function call,
// while maintaining the stall-freedom guarantee via the tick handler.
}
}
impl !Send for RcuReadGuard {}
impl !Sync for RcuReadGuard {}
/// Acquire an RCU read-side critical section guard.
///
/// The returned guard prevents the current RCU grace period from completing
/// until it is dropped. On drop, the outermost guard sets a per-CPU
/// `rcu_passed_quiescent` flag (CpuLocal write, ~1 cycle). The actual
/// quiescent state report to the grace period machinery is deferred to
/// the next scheduler tick or context switch — see `RcuReadGuard::drop()`
/// for the full design rationale. Entry is a zero-cost operation — no
/// atomic operations or memory barriers are performed on `rcu_read_lock()`.
///
/// # Safety invariants
/// - Must be paired with a `Drop` (RAII pattern — cannot be leaked).
/// - Must not be held across a blocking operation (sleep, mutex acquisition)
/// unless the holder is prepared for an extended grace period latency.
/// - The KRL timeout mitigation (Section 8.2.8) enforces a maximum critical
/// section duration for KRL access to prevent DoS.
pub fn rcu_read_lock() -> RcuReadGuard {
// Increment via CpuLocal (Section 3.1.2) — a single memory write to the
// per-CPU register-addressed block, no function call overhead.
let nesting = cpu_local::rcu_nesting_inc();
RcuReadGuard {
_cpu_id: cpu_id(),
_nesting: nesting,
_not_send: PhantomData,
}
}
/// A slice wrapper that can only be dereferenced within an RCU read-side
/// critical section. Used for RCU-protected arrays (e.g., KRL revoked_keys).
///
/// This is a zero-cost newtype around a raw pointer. The `Deref` implementation
/// is gated on an `RcuReadGuard` reference, ensuring that the pointed-to data
/// cannot be accessed after its backing memory is freed by the RCU callback.
///
/// # Type parameter
/// - `T`: The element type of the slice. Typically `[u8; 32]` for hash arrays.
///
/// # Safety contract
/// - The pointer must have been obtained from memory that will remain valid
/// until at least the next RCU grace period.
/// - The `RcuReadGuard` passed to `deref()` must have been acquired after the
/// last RCU update that could have freed this memory.
/// - Callers must not hold the `Deref` result across an RCU grace period
/// boundary (e.g., must not call `rcu_synchronize()` while holding a
/// reference derived from this slice).
///
/// # Example
/// ```rust
/// pub struct KeyRevocationList {
/// pub revoked_keys: RcuSlice<[u8; 32]>,
/// pub revoked_count: u32,
/// }
///
/// fn is_revoked(krl: &RcuCell<KeyRevocationList>, fingerprint: &[u8; 32]) -> bool {
/// let guard = rcu_read_lock();
/// let krl_ref = krl.read(&guard);
/// // Deref RcuSlice within the guard's lifetime
/// let keys: &[[u8; 32]] = krl_ref.revoked_keys.deref(&guard);
/// keys[..krl_ref.revoked_count as usize]
/// .binary_search(fingerprint)
/// .is_ok()
/// }
/// ```
#[repr(C)]
pub struct RcuSlice<T> {
ptr: *const T,
len: usize,
}
impl<T> RcuSlice<T> {
/// Create a new RcuSlice from a raw pointer and length.
///
/// # Safety
/// The caller must ensure that:
/// 1. The pointer is valid and properly aligned for type `T`.
/// 2. `ptr` points to `len` contiguous initialized elements of type `T`.
/// 3. The memory pointed to will remain valid until at least the next RCU
/// grace period after the last access via this slice.
pub unsafe fn from_raw(ptr: *const T, len: usize) -> Self {
Self { ptr, len }
}
}
impl<T> RcuSlice<T> {
/// Dereference the slice within an RCU read-side critical section.
///
/// The returned reference is valid only for the lifetime of the guard.
/// Accessing the reference after the guard is dropped is undefined behavior.
///
/// # Arguments
/// - `_guard`: A reference to an `RcuReadGuard`, proving that the caller
/// is within an RCU read-side critical section. The guard's lifetime
/// bounds the returned reference.
///
/// # Returns
/// A shared reference to the underlying slice of `T` elements.
/// For `RcuSlice<[u8; 32]>`, this returns `&[[u8; 32]]`.
pub fn deref<'a>(&self, _guard: &'a RcuReadGuard) -> &'a [T] {
// SAFETY: The caller has provided an RcuReadGuard, proving they are
// within an RCU read-side critical section. The memory pointed to by
// self.ptr was allocated by an RCU-protected update and will remain
// valid until at least the next grace period. Since the guard prevents
// grace period completion, the memory is valid for the guard's lifetime.
// The len field was set at construction time and is invariant.
unsafe { core::slice::from_raw_parts(self.ptr, self.len) }
}
}
// RcuSlice does NOT implement Send or Sync directly. It can only be accessed
// through an RcuReadGuard, which is itself !Send. The containing struct
// (e.g., KeyRevocationList) provides the necessary Send/Sync impls when
// accessed through RcuCell, which enforces the RCU lifetime contract.
/// Queue a callback to be invoked after the current RCU grace period.
/// Safe to call from any context (including interrupt context).
/// Does NOT block.
///
/// # Implementation
/// Adds `RcuCallback { func, data }` to the calling CPU's per-CPU
/// pending callback ring (`RcuPerCpu::pending_callbacks`). Each CPU
/// has its own independent 4096-slot ring, so there is no global
/// serialization bottleneck on the enqueue path.
///
/// **Overflow policy**: If the calling CPU's ring is full, the behavior depends
/// on the calling context:
///
/// - **Task context** (preempt_count == 0, IRQs enabled): `rcu_call()` calls
/// `rcu_synchronize()` to block until the grace period completes, then invokes
/// `func(data)` directly. The callback is guaranteed to execute before return.
/// A warning is logged to flag the overflow condition.
///
/// - **Atomic context** (IRQs disabled or preempt_count > 0): `rcu_call()` writes
/// the callback into `RcuPerCpu::overflow_buf`. The buffer is drained at the next
/// timer tick by `rcu_tick_drain_overflow()`. If the overflow buffer is also full,
/// the callback is dropped and `Err(RcuCallError::RingFull)` is returned — this
/// is a catastrophic condition (persistent RCU stall) and is logged at error level.
///
/// # Ordering guarantee
/// After `rcu_call(func, data)` returns `Ok(())`, `func(data)` will be called at
/// some future point after a complete grace period has elapsed (i.e., after every
/// CPU has passed through a quiescent state). In the task-context overflow fallback
/// case, `func(data)` has already been called by the time `rcu_call()` returns.
pub fn rcu_call(func: unsafe fn(*mut ()), data: *mut ()) -> Result<(), RcuCallError>;
rcu_call(func, data) algorithm:
rcu_call(func, data):
1. Disable preemption (ensures we stay on this CPU for the duration).
2. Get this CPU's RcuPerCpu state (pending_callbacks ring + overflow_buf).
3. If ring.is_full():
a. Log warning: "RCU callback ring full on CPU N".
b. If CpuLocal::preempt_count() > 0 || irqs_disabled():
// Atomic context: cannot block. Use pre-allocated overflow buffer.
if overflow_len < RCU_OVERFLOW_BUF_CAPACITY as u8:
overflow_buf[overflow_len as usize].write(RcuCallback { func, data })
overflow_len += 1
// Will be drained at next tick by rcu_tick_drain_overflow().
Re-enable preemption.
return Ok(()).
else:
// Both ring and overflow buffer full: catastrophic RCU stall.
log::error!("rcu_call: overflow buffer full in atomic context — callback dropped")
Re-enable preemption.
return Err(RcuCallError::RingFull).
else:
// Task context: safe to block.
Re-enable preemption.
log::warn!("rcu_call: ring full, synchronizing (task context)")
rcu_synchronize(); // Block until current grace period completes.
unsafe { func(data) }; // Execute directly — GP elapsed, readers done.
return Ok(()).
4. ring.push(RcuCallback { func, data }).
5. If ring.len() >= RCU_BATCH_DRAIN_THRESHOLD (default: 256):
Set rcu_global.gp_requested = true.
Wake rcu_gp_thread (via per-CPU kick flag, checked at next scheduler tick).
6. Re-enable preemption.
7. return Ok(()).
rcu_tick_drain_overflow() algorithm (called at each timer tick for CPUs with overflow_len > 0):
rcu_tick_drain_overflow():
1. Disable preemption.
2. Get this CPU's RcuPerCpu state.
3. While overflow_len > 0 && !pending_callbacks.is_full():
a. overflow_len -= 1.
b. cb = overflow_buf[overflow_len].assume_init_read().
c. pending_callbacks.push(cb).
4. If overflow_len == 0 && pending_callbacks.len() >= RCU_BATCH_DRAIN_THRESHOLD:
Set rcu_global.gp_requested = true.
Wake rcu_gp_thread.
5. Re-enable preemption.
/// Wait for an RCU grace period to complete (blocking).
///
/// This function blocks the calling thread until all pre-existing RCU
/// read-side critical sections have completed. Use this when you need
/// to free memory that may be referenced by RCU readers.
///
/// **MUST NOT be called from atomic context** (interrupt handler, spinlock-held,
/// preempt-disabled, or NMI). In atomic contexts, use `rcu_call()` instead.
///
/// # Design
/// UmkaOS uses a counter-based grace period (gp_seq). Even values = no grace
/// period in progress; odd values = grace period in progress.
pub fn rcu_synchronize();
rcu_synchronize() algorithm:
rcu_synchronize():
1. Snapshot seq = rcu_global.gp_seq.load(Relaxed).
2. If seq is even (no grace period in progress):
a. Atomically CAS gp_seq from seq to seq+1 (even → odd).
b. If CAS succeeds: this thread started the grace period.
Wake rcu_gp_thread.
c. If CAS fails: another thread started a grace period concurrently.
Re-read seq (it is now odd).
3. Add current task to rcu_global.gp_waitqueue with wait_gp_seq = seq+1
(we need the grace period that follows the current seq to complete,
i.e., gp_seq reaches seq+2).
4. schedule() — task sleeps until woken by rcu_gp_thread.
5. Return (grace period completed, all pending callbacks have been invoked).
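Steps 1-2 of the algorithm (the even/odd handoff on `gp_seq`) can be sketched as a standalone model. This is plain-`std` Rust with illustrative names and orderings, not the kernel API; the waitqueue and scheduling steps are omitted:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// If `gp_seq` is even (no grace period in progress), CAS it to odd to start
/// one. Returns the odd sequence value of the in-progress grace period,
/// whether this caller started it or lost the race to a concurrent starter.
fn observe_or_start_gp(gp_seq: &AtomicU64) -> u64 {
    let seq = gp_seq.load(Ordering::Relaxed);
    if seq % 2 == 0 {
        match gp_seq.compare_exchange(seq, seq + 1, Ordering::AcqRel, Ordering::Relaxed) {
            Ok(_) => seq + 1,        // this caller started the GP (even -> odd)
            Err(current) => current, // a concurrent starter won; re-read value
        }
    } else {
        seq // GP already in progress
    }
}
```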
Grace period thread (rcu_gp_thread):
rcu_gp_thread (kernel thread, one per NUMA node; node-0 thread acts as global coordinator):
Loop:
1. Sleep until rcu_global.gp_requested is set (or woken by rcu_call/rcu_synchronize).
2. [Node-0 only] Increment gp_seq to odd value (grace period starts).
[Other nodes] Receive IPI_RCU_GP_START from node-0; begin local CPU polling.
3. For each online CPU N local to this node:
a. Read rcu_global.gp_start_qs[N] (AtomicBool, set at grace period start to false).
b. Wait until CPU N reports a quiescent state:
- Quiescent states are reported by setting gp_start_qs[N] = true at:
- schedule() (any context switch)
- Entry to idle loop
- Return from system call to user space
- Exit from RCU read section (rcu_read_unlock sets rcu_passed_quiescent = true
when nesting reaches 0; the scheduler tick or context switch then writes
gp_start_qs[N] = true at the next scheduling checkpoint — deferred, not immediate)
c. Wait for CPU N quiescent state using exponential backoff
(see Section 3.1.4.1 for the full algorithm).
If a CPU is idle, it may not context-switch — the idle loop itself
reports quiescent state.
4. All local CPUs have passed quiescent state.
[Other nodes] Set rcu_global.node_qs_complete[this_node] = true.
Send IPI_RCU_NODE_DONE to node-0.
[Node-0] Poll node_qs_complete[] until all nodes report complete (see NUMA
coordination protocol in Section 3.1.4.2).
5. [Node-0 only] All nodes complete. Collect all pending callbacks from all
per-CPU RcuCallbackRings across all nodes.
6. Execute callbacks (outside any lock, in normal task context, with
preemption enabled). Each callback is func(data) — typically
a deallocation.
7. [Node-0 only] Increment gp_seq to even value (grace period complete).
8. [Node-0 only] Wake all tasks sleeping in rcu_synchronize() with wait_gp_seq <= gp_seq.
9. Clear gp_requested. Loop back to step 1.
3.1.4.1 RCU Grace Period Thread: Exponential Backoff Algorithm
The grace period thread waits between quiescent-state checks using exponential backoff with bounded jitter to avoid thundering-herd on high-CPU-count systems.
Constants:
/// Starting wait time after a failed quiescent state check (all CPUs must pass QS).
pub const RCU_GP_BACKOFF_START_MS: u64 = 1;
/// Maximum per-iteration wait time; the exponential backoff is capped here.
pub const RCU_GP_BACKOFF_MAX_MS: u64 = 100;
/// Backoff multiplier applied after each failed quiescent state poll.
pub const RCU_GP_BACKOFF_MULTIPLIER: u64 = 2;
/// Absolute maximum grace period wait before warning (stall detection threshold).
pub const RCU_GRACE_MAX_WAIT_MS: u64 = 10_000;
/// Capacity of the per-CPU overflow buffer used by `rcu_call()` in atomic context
/// when the main `RcuCallbackRing` is full. Pre-allocated in `RcuPerCpu`; no
/// heap allocation occurs on the enqueue path. Sized to absorb bursts from short
/// interrupt storms; `rcu_tick_drain_overflow()` drains the buffer at each timer tick.
pub const RCU_OVERFLOW_BUF_CAPACITY: usize = 64;
Algorithm (executed by the per-CPU RCU grace period thread):
fn wait_for_grace_period(rcu: &RcuState):
wait_ms = RCU_GP_BACKOFF_START_MS
total_ms = 0
loop:
if all_cpus_passed_quiescent_state(rcu):
complete_grace_period(rcu)
return
if total_ms >= RCU_GRACE_MAX_WAIT_MS:
emit_rcu_stall_warning(total_ms)
// Continue waiting — stall warning is advisory, not fatal.
// Jitter: add up to 10% of current wait to spread wakeups.
jitter_ms = rand_bounded(wait_ms / 10 + 1) // 0..wait_ms/10 inclusive
sleep_ms(wait_ms + jitter_ms)
total_ms += wait_ms + jitter_ms
wait_ms = min(wait_ms * RCU_GP_BACKOFF_MULTIPLIER, RCU_GP_BACKOFF_MAX_MS)
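The backoff progression (jitter omitted) can be checked with a small standalone model. The constants mirror the `RCU_GP_BACKOFF_*` values above; the function itself is illustrative, not part of the kernel API:

```rust
/// Model of the wait-time progression in `wait_for_grace_period()`:
/// doubles after each failed quiescent-state poll, capped at the maximum.
fn backoff_schedule(polls: usize) -> Vec<u64> {
    const START_MS: u64 = 1; // RCU_GP_BACKOFF_START_MS
    const MAX_MS: u64 = 100; // RCU_GP_BACKOFF_MAX_MS
    const MULT: u64 = 2;     // RCU_GP_BACKOFF_MULTIPLIER
    let mut wait = START_MS;
    let mut schedule = Vec::with_capacity(polls);
    for _ in 0..polls {
        schedule.push(wait);
        wait = (wait * MULT).min(MAX_MS);
    }
    schedule
}
```

The cap is reached after seven failed polls (1 + 2 + ... + 64 ms elapsed), so a slow grace period settles into ~100 ms polling rather than unbounded growth.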
Rationale: The exponential backoff prevents busy-polling on systems where CPUs are slow to pass quiescent states (e.g., under load). The 100ms cap ensures responsiveness. The 10% jitter prevents all grace period threads from waking simultaneously after a system-wide event.
Grace Period Thread Wake Protocol:
The grace period thread (rcu_gp_kthread, one per NUMA node, scheduled at
SCHED_FIFO priority 10) sleeps on the rcu_global.gp_waitqueue between grace
periods. The wake protocol has five components:
- **`rcu_call()` path — non-IPI wakeup**: When `rcu_call()` step 5 fires (ring length reaches `RCU_BATCH_DRAIN_THRESHOLD = 256`), it sets `rcu_global.gp_requested = true` and calls `scheduler::unblock(gp_thread)`. This is an O(1) scheduler operation that enqueues the GP thread on the current CPU's runqueue. It is NOT an IPI — no cross-CPU interrupt is generated. The GP thread will be scheduled on the next tick or preemption point of the current CPU (or immediately if the current CPU is idle).
- **Quiescent state signaling — polling with deferred writes**: The GP thread polls `rcu_global.gp_start_qs[N]` for each online CPU N (step 3b above). Per-CPU quiescent state is written to this array by the scheduler tick handler and `context_switch()` when they observe `CpuLocal::rcu_passed_quiescent == true`. The GP thread does not use IPIs to collect quiescent states during normal operation — it relies on the existing scheduling infrastructure (tick at HZ=1000, every context switch) to write the flags.
- **Accelerated GP via IPI — stall prevention**: If after 10 ms the GP thread has not received a quiescent state report from CPU N (detected by comparing `gp_start_qs[N]` against the 10 ms timestamp), the GP thread sends CPU N a lightweight reschedule IPI (`arch::current::interrupts::send_reschedule_ipi(n)`). This IPI has no handler beyond forcing a context switch on the target CPU, which constitutes a quiescent state and causes the tick handler to write `gp_start_qs[N] = true`. This bounds GP latency to `max(10 ms, scheduling latency)` even for CPU-bound kernel threads that would otherwise delay the grace period.
- **Idle CPUs — implicit quiescent state**: The GP thread checks `CpuLocal::is_idle[N]` before polling `gp_start_qs[N]`. Idle CPUs are always in quiescent state — `cpu_idle_enter()` writes `gp_start_qs[N] = true` unconditionally. The GP thread excludes idle CPUs from the wait loop, so they do not delay GP completion.
- **GP completion — callback drain and wakeup**: When all online non-idle CPUs have reported quiescent state, the GP thread collects callbacks from all per-CPU `RcuCallbackRing` buffers (both `pending_callbacks` and `next_callbacks`), executes them in batch, increments `gp_seq` to the next even value, and wakes all tasks sleeping in `rcu_synchronize()` whose `wait_gp_seq` has been satisfied.
Why not one global GP thread? A single global GP thread would create a polling
bottleneck on systems with 256+ CPUs (reading gp_start_qs[N] for every CPU
sequentially). UmkaOS uses one rcu_gp_kthread per NUMA node, each responsible for
its local CPUs' quiescent state collection. Cross-node coordination (for
rcu_synchronize() callers on any node) is handled by the node-0 GP thread acting
as the global coordinator, with the other nodes' threads acting as collectors that
report aggregate quiescent state to node 0.
3.1.4.2 NUMA RCU Grace Period Coordination Protocol
The RcuGlobal struct carries one additional field to support cross-node coordination:
/// Per-NUMA-node grace period completion flags.
/// Cache-line padded to prevent false sharing between NUMA nodes.
/// `node_qs_complete[N]` = `true` means node N has reported all its local
/// CPUs as quiescent for the current grace period.
/// Dynamically allocated at boot with `num_numa_nodes()` entries.
#[repr(align(64))]
pub node_qs_complete: Box<[AtomicBool]>, // one entry per NUMA node
The full inter-node coordination protocol proceeds as follows:
NUMA RCU grace period coordination protocol:
1. GP start (node-0 GP kthread):
- Set all node_qs_complete[N] = false for N in 0..num_nodes.
- Set all per-CPU gp_start_qs[C] = false for C in 0..num_cpus.
- Increment gp_seq to odd (GP in progress).
- Send IPI_RCU_GP_START to the GP kthread on every other NUMA node.
- Begin polling its own local CPUs' gp_start_qs[] entries.
2. Per-node collection (each non-zero-node GP kthread, upon IPI_RCU_GP_START):
- Poll gp_start_qs[C] (AtomicBool) for each local CPU C using exponential backoff
(Section 3.1.4.1). Send reschedule IPI to CPU C after 10 ms stall (step 3 of
the wake protocol above).
- When ALL local CPUs have gp_start_qs[C] = true:
a. Set node_qs_complete[this_node] = true (Release store).
b. Send IPI_RCU_NODE_DONE to node-0 (lightweight, no data payload).
- Sleep until next IPI_RCU_GP_START.
3. Global completion (node-0 GP kthread):
- In parallel with step 2: also polls its own local CPUs (step 3 of the main
algorithm), writing node_qs_complete[0] = true when they all report quiescent.
- When IPI_RCU_NODE_DONE fires: the IPI handler sets a per-CPU atomic wake flag
on node-0's boot CPU. Node-0 kthread checks this flag on its next scheduler tick.
The IPI handler does NOT busy-wait or block.
- Poll node_qs_complete[] for all nodes using the same exponential backoff.
When all entries are true:
a. Collect callbacks from all per-CPU RcuCallbackRings (both rings).
b. Execute callbacks in batch (preemption enabled, no locks held).
c. Increment gp_seq to even (GP complete).
d. Wake gp_waitqueue (all rcu_synchronize() waiters on this GP).
e. Clear gp_requested.
Rationale for IPI vs. shared-memory polling:
IPI eliminates node-0 spinning on remote NUMA memory. Each remote node sends
one IPI_RCU_NODE_DONE notification rather than node-0 polling N × C_per_node
remote cache lines. On a 128-CPU / 8-node NUMA system this saves approximately
1000 unnecessary cross-node cache misses per grace period, at the cost of
8 IPI deliveries — a favorable trade above ~4 NUMA nodes.
rcu_read_lock() / rcu_read_unlock() algorithms:
rcu_read_lock():
Increment CpuLocal.rcu_nesting (preemption disabled during increment).
Preemption remains disabled while rcu_nesting > 0.
(This ensures the CPU cannot be descheduled in the middle of an RCU
read-side critical section — a descheduled task has already left the
read-side section from RCU's perspective.)
rcu_read_unlock():
Decrement CpuLocal.rcu_nesting.
If rcu_nesting == 0:
If CpuLocal.rcu_passed_quiescent == false:
Set CpuLocal.rcu_passed_quiescent = true.
// Do NOT write rcu_global.gp_start_qs[this_cpu] here.
// The actual write to gp_start_qs is deferred to the next scheduler
// tick or context switch, which are the natural quiescent checkpoints.
// This avoids a ~5-10 cycle atomic store on every outermost RCU drop
// (critical on high-IOPS paths such as NVMe/conntrack/routing).
Re-enable preemption if rcu_nesting == 0.
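The nesting and deferred-quiescent-state logic above can be modeled with a free-standing struct. The real fields live in `CpuLocal` and the increments happen with preemption disabled; this is an illustration only:

```rust
/// Model of the per-CPU RCU read-side state transitions.
struct RcuCpuState {
    rcu_nesting: u32,
    rcu_passed_quiescent: bool,
}

fn rcu_read_lock(cpu: &mut RcuCpuState) {
    // Real code disables preemption around this increment.
    cpu.rcu_nesting += 1;
}

fn rcu_read_unlock(cpu: &mut RcuCpuState) {
    cpu.rcu_nesting -= 1;
    if cpu.rcu_nesting == 0 {
        // Deferred QS: only the local flag is set here. The write to
        // gp_start_qs[this_cpu] happens later, at the next scheduler tick
        // or context switch.
        cpu.rcu_passed_quiescent = true;
    }
}
```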
UmkaOS improvements over Linux RCU:
- Quiescent state detection uses a per-CPU AtomicBool flag (gp_start_qs) checked
at existing scheduling points (no additional polling overhead). AtomicBool uses
1 byte per CPU vs 8 for AtomicU64, and maps to native single-instruction atomics
on all supported architectures including 32-bit ARMv7 and PPC32.
- The grace period thread uses exponential backoff when waiting for quiescent
states, avoiding busy-waiting on busy systems.
- Callbacks use a pre-allocated typed ring (RcuCallbackRing) rather than
Linux's intrusive linked list with container_of pointer arithmetic.
- rcu_call overflow (ring full) is context-aware: task context falls back to
rcu_synchronize() + direct execution (never drops callbacks); atomic context
uses a pre-allocated 64-slot overflow_buf drained at the next timer tick via
rcu_tick_drain_overflow(). Both paths log warnings; only a full overflow buffer
in atomic context (catastrophic RCU stall) drops a callback and logs at error level.
/// Defer a callback to run after an RCU grace period (non-blocking).
///
/// This is the preferred way to free memory from atomic contexts. The callback
/// will be invoked in the RCU worker thread context after all pre-existing
/// readers have completed.
///
/// # Safety
/// - The callback must not access any data that was freed before the callback runs.
/// - The callback runs in a kernel thread context (not interrupt context), so
/// it may block, but should complete quickly to avoid delaying other callbacks.
///
/// # Example
/// ```rust
/// // Defer Box::drop after grace period
/// let ptr = Box::into_raw(old_value);
/// unsafe { rcu_defer_free(ptr) }; // callback will reconstruct Box and drop it
/// ```
pub unsafe fn rcu_defer_free<T>(ptr: *mut T);
/// Defer Box::drop after an RCU grace period (non-blocking).
///
/// This is a convenience wrapper around `rcu_defer_free()` for the common case
/// of freeing a Box<T> after an RCU grace period.
///
/// # Safety
/// Same as `rcu_defer_free()` — ptr must have been obtained from `Box::into_raw()`.
pub unsafe fn rcu_call_box_drop<T>(ptr: *mut T) {
    // `rcu_defer_free`'s registered callback reconstructs the `Box` from
    // `ptr` and drops it once the grace period has elapsed.
    rcu_defer_free(ptr);
}
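What the deferred callback ultimately does — reconstruct the `Box` from the raw pointer and drop it — can be demonstrated in isolation. This is a model, not the kernel path: `Tracked` and `deferred_box_drop` are test scaffolding, and the grace period itself is elided:

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Test type whose drop flips a shared flag, so the deferred free is observable.
struct Tracked(Rc<Cell<bool>>);

impl Drop for Tracked {
    fn drop(&mut self) {
        self.0.set(true);
    }
}

/// Model of the grace-period callback behind `rcu_call_box_drop`.
///
/// # Safety
/// `ptr` must come from `Box::into_raw` and must not be used afterwards.
unsafe fn deferred_box_drop<T>(ptr: *mut T) {
    drop(Box::from_raw(ptr));
}
```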
RCU Slice Lifetime Safety Example (KRL):
The KeyRevocationList in Section 8.2.7 demonstrates correct RcuSlice usage:
// Boot-time KRL allocation (bump-allocated, 'static lifetime):
let boot_krl = KeyRevocationList {
revoked_keys: unsafe { RcuSlice::from_raw(boot_keys_ptr) }, // lives forever
revoked_count: boot_count,
// ... other fields
};
// Runtime KRL allocation (slab-allocated, RCU-managed):
let rt_keys = Box::try_new_in_slice(count, slab_allocator)?;
let rt_krl = Box::try_new(KeyRevocationList {
revoked_keys: unsafe { RcuSlice::from_raw(Box::as_slice_ptr(rt_keys)) },
revoked_count: count,
// ... other fields
})?;
// rt_krl is published via RcuCell::update(). Old KRL (if any) is freed
// by the RCU callback after the grace period, including its revoked_keys array.
/// Compile-time lock ordering: deadlock prevention via the type system.
/// A Lock<T, 3> can only be acquired while holding a Lock<_, N> where N < 3.
/// Attempting to acquire locks out of order is a compile-time error.
pub struct Lock<T, const LEVEL: u32> {
inner: SpinLock<T>, // or MutexLock<T> for sleeping locks
}
impl<T, const LEVEL: u32> Lock<T, LEVEL> {
pub fn lock<const HELD: u32>(&self, _proof: &LockGuard<HELD>) -> LockGuard<LEVEL>
where
Assert<{ HELD < LEVEL }>: IsTrue,
{
LockGuard::new(self.inner.lock())
}
/// Acquiring the first lock in a chain requires no proof.
/// Any level can be the starting lock — the constraint is that no other
/// lock is currently held. The returned `LockGuard<LEVEL>` then constrains
/// all subsequent lock acquisitions to levels strictly greater than LEVEL.
///
/// **Enforcement**: The type system alone cannot prevent a caller from
/// invoking `lock_first()` while already holding a `LockGuard` from a
/// different scope. Therefore, `lock_first()` performs a **runtime check**
/// in all build configurations: it reads the per-CPU `max_held_level`
/// variable and panics if any lock is currently held. This is a cheap
/// check (one per-CPU variable read + branch, ~2-3 cycles) and is always
/// enabled — not only in debug mode. The `lock()` method's compile-time
/// ordering guarantee (via `HELD < LEVEL`) remains the primary enforcement;
/// the runtime check in `lock_first()` closes the only loophole.
///
/// **Cross-session ABBA prevention**: Lock ordering is enforced by a global
/// total order defined at compile time via the `LEVEL` const type parameter.
/// Two sessions acquiring lock A (level 2) then lock B (level 5) always
/// acquire in the same order because the type system enforces `HELD < LEVEL`
/// at every step.
pub fn lock_first(&self) -> LockGuard<LEVEL> {
assert_no_locks_held(); // per-CPU max_held_level == NONE
LockGuard::new(self.inner.lock())
}
}
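The combined discipline — `lock_first()` requires no locks held, and `lock()` requires `HELD < LEVEL` — can be modeled at runtime. In the kernel the nested check is a compile-time const bound and the first-lock check is a per-CPU variable; this model returns `Err` where the kernel would fail to compile (nested case) or panic (`lock_first` case). Names are illustrative:

```rust
/// Runtime model of the per-CPU lock-level discipline behind Lock<T, LEVEL>.
struct HeldLevels {
    stack: Vec<u32>, // levels currently held, innermost last
}

impl HeldLevels {
    fn lock_first(&mut self, level: u32) -> Result<(), &'static str> {
        if !self.stack.is_empty() {
            return Err("lock_first() while a lock is held");
        }
        self.stack.push(level);
        Ok(())
    }

    fn lock_nested(&mut self, level: u32) -> Result<(), &'static str> {
        match self.stack.last() {
            Some(&held) if held < level => {
                self.stack.push(level);
                Ok(())
            }
            Some(_) => Err("ordering violation: level must exceed innermost held level"),
            None => Err("no lock held; use lock_first()"),
        }
    }
}
```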
Lock categories: Subsystems define named lock categories to group related lock levels and prevent cross-subsystem lock violations:
/// Named lock category for subsystem-level lock grouping.
/// Each category maps to a range of lock levels. The runtime lockdep checker
/// validates that locks from different categories are never held simultaneously
/// unless explicitly permitted in the cross-category ordering table.
#[repr(u32)]
pub enum LockCategory {
/// Core kernel locks (scheduler run queues, memory allocator).
Core = 0,
/// Filesystem and block layer locks.
Fs = 1,
/// Network stack locks.
Net = 2,
/// Windows Emulation Architecture (NT object manager, WEA syscalls).
WEA = 3,
/// Driver subsystem locks (device registry, KABI VTable).
Driver = 4,
}
Lock level assignment table (authoritative; used by all Lock<T, LEVEL> instantiations):
Level 0 is reserved for locks that must be acquirable while the lock-ordering
checker is active and before any subsystem lock is held. Locks at level 0 are
never acquired while holding another level-0 lock; they are independent entry
points into the lock graph. FUTEX_BUCKET is the primary example: futex_wake()
acquires a bucket lock and then calls scheduler::enqueue(), which acquires
RQ_LOCK (level 2). The bucket lock must therefore be below TASK_LOCK (level 1)
to keep the acquisition order valid. Levels 4-6 cover capability and memory management
locks (CAP_TABLE_LOCK < VM_LOCK < ADDR_SPACE_LOCK), reflecting the invariant that
capability validation precedes address space mutation, and VMA-level decisions precede
page table modifications. Filesystem locks (FS_SB_LOCK < INODE_LOCK < DENTRY_LOCK)
follow the same outer-to-inner principle: superblock state is acquired before per-inode
operations, which in turn precede dentry cache manipulation.
| Level | Lock Name | Subsystem | Category |
|---|---|---|---|
| 0 | `FUTEX_BUCKET` | `FutexBucket` spinlock | Core |
| 0 | `IRQ_DESC_LOCK` | Interrupt descriptor | Core |
| 1 | `TASK_LOCK` | Per-task struct | Core |
| 2 | `RQ_LOCK` | Scheduler run queue | Core |
| 3 | `PI_LOCK` | Priority inheritance chain | Core |
| 4 | `CAP_TABLE_LOCK` | Capability table write lock — serializes capability slot insert/remove/revoke across a task's capability space | Core |
| 5 | `VM_LOCK` | VMA tree write lock (mmap_lock equivalent) — protects the per-process virtual memory area tree during mmap/munmap/fault handling | Core |
| 6 | `ADDR_SPACE_LOCK` | Address space page table lock — serializes page table modifications (map/unmap/protect) within a single address space; acquired under `VM_LOCK` | Core |
| 7 | `BUDDY_LOCK` | Per-NUMA buddy allocator | Core |
| 8 | `SLAB_LOCK` | Per-NUMA slab partial list | Core |
| 10 | `PAGE_LOCK` | Per-page lock (page cache) | Fs |
| 11 | `FS_SB_LOCK` | Per-superblock lock — protects superblock-level state (mount flags, fs-wide counters, journal commit); acquired before per-inode locks during mount/unmount and fs-wide operations | Fs |
| 12 | `INODE_LOCK` | Per-inode data lock (VFS) — protects inode metadata and serializes file operations (read/write/truncate) on a single inode | Fs |
| 13 | `DENTRY_LOCK` | Dentry cache per-entry lock — protects dentry reference counts, parent/child linkage, and name hash chain membership | Fs |
| 14 | `MOUNT_LOCK` | Mount table | Fs |
| 15 | `SOCK_LOCK` | Per-socket | Net |
| 16 | `CONNTRACK_BUCKET` | Conntrack bucket | Net |
| 17 | `DEV_REG_LOCK` | Device registry | Driver |
| 18 | `VTABLE_LOCK` | KABI vtable swap | Driver |
This table will be extended as subsystems are implemented. The ordering invariant is: a thread holding a lock at level N may only acquire locks at levels > N. Cross-category acquisitions (e.g., Core lock then Fs lock) follow the same rule — the numeric level is the sole ordering criterion.
Known limitation — total ordering only: The const-generic approach enforces a
total order on lock levels (level 0 < level 1 < level 2 < ...). This prevents
deadlocks caused by circular lock chains, but it cannot express a partial order
where two locks at the same conceptual level are safe to acquire together (because
they protect independent subsystems). In practice, this means some subsystems that
Linux allows to lock "in parallel" must be assigned adjacent but distinct levels in
UmkaOS, potentially over-constraining the lock graph. If this becomes a scalability
issue, the fallback is to introduce a lock_independent() API that takes two locks
at the same level with a static proof that their domains are disjoint (e.g., per-CPU
locks on different CPUs). For now, the total-order approach covers all known subsystem
interactions, and runtime lockdep (debug-mode only) validates that no partial-order
case is missed.
3.1.5 Locking Strategy
UmkaOS eliminates all "big kernel lock" patterns that plague Linux scalability:
| Linux Problem | UmkaOS Solution |
|---|---|
| RTNL global mutex | Per-table RCU + per-route fine-grained locks |
| `dcache_lock` contention | Per-directory RCU + per-inode locks |
| `zone->lock` on NUMA | Per-CPU page lists + per-NUMA-node pools |
| `tasklist_lock` | RCU-protected process table + per-PID locks |
| `files_lock` (file table) | Per-fdtable RCU + per-fd locks |
| `inode_hash_lock` | Per-bucket RCU-protected hash chains |
3.1.5.1 Locking Primitive Types
UmkaOS defines four concrete locking types used throughout all subsystems. These are the
actual implementations underlying Lock<T, LEVEL> (Section 3.1.4) and all per-subsystem
locks in the lock level table above.
RawSpinLock
A bare spinlock that disables preemption but does not save or restore IRQ state.
Modeled after Linux raw_spinlock_t. Suitable for: scheduler internals (runqueue locks),
interrupt-handler paths where IRQs are already disabled, and short critical sections
where the caller manages IRQ state explicitly. No RAII guard — the lock is manually
acquired and released, because a guard cannot enforce the caller's IRQ context.
/// Bare spinlock. Disables preemption on acquire; does NOT save/restore IRQ state.
///
/// # Safety
/// The caller is responsible for IRQ state. This type is correct only when:
/// - Called from an interrupt handler (IRQs already disabled), OR
/// - The caller has explicitly disabled IRQs (holds an `IrqDisabledGuard`), OR
/// - The critical section provably contains no IRQ-sensitive operations.
///
/// For general use, prefer `SpinLock<T>` which manages IRQ state automatically.
///
/// # Architecture-specific spinlock algorithm (all provide starvation-free acquisition)
///
/// - **x86-64**: Queued spinlock (MCS-like). Eliminates cache-line bounce on high-contention
/// paths; all CPUs spin on CPU-local memory. Uncontended path identical to test-and-set cost.
///
/// - **AArch64**: Queued spinlock (LDAXR/STLXR LL/SC pairs). AArch64 LL/SC is efficient for
/// the uncontended case and scales better than ticket under high contention.
///
/// - **ARMv7**: Ticket lock (LDREX/STREX on 32-bit word). ARMv7's reservation model does not
/// efficiently support MCS queue-node allocation; ticket provides fairness with lower overhead.
///
/// - **RISC-V**: MCS lock (LR/SC reservation pairs). RISC-V LR/SC supports reservations
/// natively; MCS avoids the fairness degradation of pure TAS on multi-hart systems.
///
/// - **PPC32**: Ticket lock (lwarx/stwcx on 32-bit word). PPC32 reservation model is
/// 32-bit word aligned; ticket fits naturally and provides FIFO fairness.
///
/// - **PPC64LE**: Queued spinlock (lwarx/stwcx with 64-bit). POWER cores benefit from
/// MCS-style local spinning on NUMA-distant workloads under high contention.
pub struct RawSpinLock {
/// Lock state word. Encoding is algorithm-specific per architecture (see above).
/// All architectures guarantee: 0 = unlocked; non-zero = locked (or queued).
state: AtomicU32,
}
impl RawSpinLock {
/// Acquire the lock (spin-wait). Disables preemption. Does NOT touch IRQs.
///
/// # Safety
/// See struct-level safety note. The caller must ensure IRQ state is correct.
pub unsafe fn lock(&self);
/// Release the lock. Re-enables preemption.
///
/// # Safety
/// Must be called by the same CPU that called `lock()`.
pub unsafe fn unlock(&self);
/// Try to acquire the lock once (non-blocking). Returns `true` if acquired.
///
/// # Safety
/// See `lock()`.
pub unsafe fn try_lock(&self) -> bool;
}
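The ticket-lock variant used on ARMv7 and PPC32 can be sketched with two counters. In the real implementation both counters are packed into the single `state: AtomicU32` word and the `fetch_add` is an LDREX/STREX (or lwarx/stwcx) loop; two separate atomics are used here purely for clarity, and the spin loop is left to the caller:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Model of a ticket lock: `next` hands out tickets, `owner` grants the
/// lock in strict FIFO order, giving starvation-free acquisition.
struct TicketLock {
    next: AtomicU32,
    owner: AtomicU32,
}

impl TicketLock {
    fn new() -> Self {
        TicketLock { next: AtomicU32::new(0), owner: AtomicU32::new(0) }
    }
    /// Take a ticket; the caller then spins until `is_my_turn(ticket)`.
    fn take_ticket(&self) -> u32 {
        self.next.fetch_add(1, Ordering::Relaxed)
    }
    fn is_my_turn(&self, ticket: u32) -> bool {
        self.owner.load(Ordering::Acquire) == ticket
    }
    /// Pass the lock to the next ticket holder.
    fn unlock(&self) {
        self.owner.fetch_add(1, Ordering::Release);
    }
}
```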
SpinLock<T>
RAII spinlock that saves and restores IRQ state on acquire/release. This is the
correct default for any data shared between normal kernel context and interrupt handlers.
Equivalent to Linux spinlock_t (acquired with spin_lock_irqsave). The protected
data T is only accessible through the returned guard, ensuring the lock is always
held when the data is accessed.
/// IRQ-saving spinlock. Disables preemption AND saves/restores IRQ state.
///
/// This is the correct default for data shared with interrupt handlers.
/// Acquiring this lock is safe from any context (normal kernel, softirq, hardirq).
///
/// The protected value `T` is accessible only through `SpinLockGuard<'_, T>`,
/// which re-enables IRQs and preemption on drop.
pub struct SpinLock<T> {
inner: RawSpinLock,
data: UnsafeCell<T>,
}
/// Guard returned by `SpinLock::lock()`. Releases lock and restores IRQs on drop.
pub struct SpinLockGuard<'a, T> {
lock: &'a SpinLock<T>,
/// Saved IRQ flags (RFLAGS on x86, DAIF on AArch64, SSTATUS.SIE on RISC-V, etc.).
saved_flags: ArchIrqFlags,
}
impl<T> SpinLock<T> {
/// Acquire the lock: save IRQ state, disable IRQs and preemption, spin-wait.
/// Returns a guard that provides exclusive access to `T`.
pub fn lock(&self) -> SpinLockGuard<'_, T>;
/// Acquire without saving IRQ state. Use this when the caller already holds
/// an `IrqDisabledGuard` (Section 3.1.3.1), avoiding a redundant save/restore
/// pair (~1-3 cycles saved on the fast path).
///
/// # Safety
/// The caller must hold an `IrqDisabledGuard` for the entire duration of the
/// returned guard's lifetime. Dropping the `IrqDisabledGuard` while the
/// `SpinLockGuard` is still alive would re-enable IRQs while the lock is held.
pub unsafe fn lock_nosave<'a>(
&'a self,
_irq: &'a IrqDisabledGuard,
) -> SpinLockGuard<'a, T>;
}
impl<T> Deref for SpinLockGuard<'_, T> {
type Target = T;
}
impl<T> DerefMut for SpinLockGuard<'_, T> {}
impl<T> Drop for SpinLockGuard<'_, T> {
fn drop(&mut self) {
// Release the RawSpinLock, then restore IRQ state from saved_flags.
// Order matters: release spin first, then re-enable IRQs, so that
// an interrupt firing after unlock sees consistent lock state.
}
}
lock_nosave and IrqDisabledGuard interaction: SpinLock::lock() internally
performs irq_save() + raw_spin_lock(). If the caller already holds an
IrqDisabledGuard (from PerCpu::get_mut_nosave(), see Section 3.1.3.1), use
lock_nosave() to skip the redundant save/restore. This mirrors the
get_mut_nosave() pattern and saves ~1-3 cycles on the fast path.
IntrusiveList<T> — Intrusive Doubly-Linked List
An intrusive doubly-linked list used throughout UmkaOS for lock-free and low-overhead
queues. Elements embed a link field (IntrusiveLink<T>) rather than allocating
separate list nodes. This avoids heap allocation and improves cache locality.
UmkaOS vs Linux list_head: Linux embeds raw prev/next pointers in each
element with no ownership or membership tracking. An element can silently be on
multiple lists simultaneously, or removed from the wrong list — a common source of
kernel bugs. UmkaOS's design uses a sealed HasIntrusiveLink<T> trait with compile-time
single-list membership enforcement. Pinning (Pin<&mut T>) prevents moving an element
while it is linked, which would corrupt the list.
/// An intrusive doubly-linked list. Elements must embed an `IntrusiveLink<T>`
/// field to participate in the list.
///
/// **UmkaOS vs Linux `list_head`**: Linux embeds raw `prev`/`next` pointers in each
/// element with no ownership tracking. This allows elements to be on multiple lists
/// simultaneously (a common source of bugs: removing from the wrong list). UmkaOS's
/// design enforces single-list membership at compile time via the owned `IntrusiveLink`.
///
/// **Pinning**: Elements must be `Pin`ned before they can be inserted. Moving an element
/// while it is linked would corrupt the list. The `Pin<&mut T>` API at insertion ensures
/// the element address is stable for the element's lifetime in the list.
///
/// **Performance**: Equivalent to `list_head` — O(1) insert/remove, O(n) iteration.
/// No heap allocation; all pointers live in the embedded `IntrusiveLink` fields.
pub struct IntrusiveList<T: HasIntrusiveLink<T>> {
/// Sentinel node. `head.next` points to the first element; `head.prev` points to
/// the last element. An empty list has `head.next == head.prev == &head`.
head: IntrusiveLink<T>,
/// Number of elements currently in the list. O(1) cached count.
len: usize,
}
/// The link field that must be embedded in each element type `T` that participates
/// in an `IntrusiveList<T>`. Each element embeds exactly ONE `IntrusiveLink<T>`:
/// the sealed `HasIntrusiveLink<T>` trait exposes a single link, so an element can
/// be a member of at most one `IntrusiveList<T>` at a time. A type that genuinely
/// needs to sit on two lists must use a separate wrapper type per list.
pub struct IntrusiveLink<T> {
/// Next element in the list (or &sentinel if this is the tail).
next: *mut IntrusiveLink<T>,
/// Previous element in the list (or &sentinel if this is the head).
prev: *mut IntrusiveLink<T>,
/// Zero-size marker; `T` is the owning type that contains this link.
_marker: PhantomData<T>,
}
/// Marker trait: `T` has an `IntrusiveLink<T>` field accessible via `link()`.
/// # Safety
/// The returned `IntrusiveLink` must be embedded in `self` at a stable address
/// (i.e., `self` must be pinned). Moving `self` while linked is undefined behavior.
pub unsafe trait HasIntrusiveLink<T> {
/// Return a pointer to the embedded link field.
fn link(this: *mut T) -> *mut IntrusiveLink<T>;
/// Recover the owning `T*` from a `*mut IntrusiveLink<T>` (via `offsetof`).
/// # Safety: `link` must point to the link field of a valid `T`.
unsafe fn from_link(link: *mut IntrusiveLink<T>) -> *mut T;
}
impl<T: HasIntrusiveLink<T>> IntrusiveList<T> {
/// Create an empty list. The sentinel is self-referential.
pub fn new() -> Self { ... }
/// Number of elements in the list.
pub fn len(&self) -> usize { self.len }
/// True if the list has no elements.
pub fn is_empty(&self) -> bool { self.len == 0 }
/// Insert `elem` at the back of the list. `elem` must be pinned.
/// # Safety: `elem` must not already be in any IntrusiveList.
pub unsafe fn push_back(&mut self, elem: *mut T) { ... }
/// Insert `elem` at the front of the list.
/// # Safety: same as `push_back`.
pub unsafe fn push_front(&mut self, elem: *mut T) { ... }
/// Remove and return the front element, or `None` if empty.
pub fn pop_front(&mut self) -> Option<*mut T> { ... }
/// Remove and return the back element, or `None` if empty.
pub fn pop_back(&mut self) -> Option<*mut T> { ... }
/// Remove `elem` from the list (it must currently be in this list).
/// # Safety: `elem` must be in this list.
pub unsafe fn remove(&mut self, elem: *mut T) { ... }
/// Iterate over elements in order (front to back).
pub fn iter(&self) -> IntrusiveListIter<'_, T> { ... }
/// Move all elements from `other` to the back of `self`. O(1).
pub fn splice_back(&mut self, other: &mut Self) { ... }
}
/// Iterator over an `IntrusiveList`. Elements are yielded as raw pointers.
pub struct IntrusiveListIter<'a, T: HasIntrusiveLink<T>> {
current: *mut IntrusiveLink<T>,
sentinel: *const IntrusiveLink<T>,
_lifetime: PhantomData<&'a T>,
}
/// Type alias: a bare link node used when the containing type is opaque or the
/// list is parameterized externally. Used by `WaitQueueEntry` and `MutexWaiter`.
pub type IntrusiveListNode = IntrusiveLink<()>;
/// Type alias: the sentinel head of an intrusive list where the element type is
/// managed externally (e.g., `WaitQueueHead` which holds a list of `WaitQueueEntry`).
pub type IntrusiveListHead = IntrusiveList<()>;
Mutex<T>
A sleeping lock. The caller blocks (is descheduled) if the lock is contended.
Must not be acquired from interrupt context or while holding a SpinLock or
RawSpinLock (sleeping under a spinlock causes a deadlock — the spinner cannot
be scheduled out, but the sleeping thread cannot progress). Equivalent to Linux
struct mutex.
/// Sleeping mutual exclusion lock. Blocks (deschedules) on contention.
///
/// # Contexts
/// - Safe to acquire from normal kernel task context.
/// - MUST NOT be acquired from interrupt context (hardirq, softirq, NMI).
/// - MUST NOT be acquired while holding a `SpinLock` or `RawSpinLock`.
/// Doing so would put the spinlock holder to sleep — the deadlock detector
/// (Section 3.1.4, lock level ordering) treats Mutex as higher-level than
/// SpinLock in the lock hierarchy.
pub struct Mutex<T> {
/// Lock state: 0 = unlocked, 1 = locked (no waiters), 2 = locked (waiters present).
state: AtomicU32,
/// List of tasks blocked waiting for this mutex. Protected by `waiters_lock`.
waiters: IntrusiveList<MutexWaiter>,
/// Spinlock protecting the `waiters` list. Held only briefly during enqueue/dequeue.
waiters_lock: RawSpinLock,
data: UnsafeCell<T>,
}
/// Task node embedded in `Mutex::waiters` when a task is blocked.
pub struct MutexWaiter {
task: *mut Task,
node: IntrusiveListNode,
}
/// Guard returned by `Mutex::lock()`. Releases lock on drop.
pub struct MutexGuard<'a, T> {
mutex: &'a Mutex<T>,
}
impl<T> Mutex<T> {
/// Acquire the lock. Blocks if contended. Returns when the lock is held exclusively.
pub fn lock(&self) -> MutexGuard<'_, T>;
/// Non-blocking acquire. Returns `Some(guard)` if acquired, `None` if contended.
pub fn try_lock(&self) -> Option<MutexGuard<'_, T>>;
}
impl<T> Deref for MutexGuard<'_, T> {
type Target = T;
}
impl<T> DerefMut for MutexGuard<'_, T> {}
impl<T> Drop for MutexGuard<'_, T> {
fn drop(&mut self) {
// If state == 1 (no waiters): CAS to 0, done.
// If state == 2 (waiters present): acquire waiters_lock, dequeue
// the first waiter, call scheduler::unblock(waiter.task), release
// waiters_lock, set state = 1 (or 0 if list now empty).
}
}
RwLock<T>
A sleeping reader-writer lock. Multiple concurrent readers OR one exclusive writer. New readers are blocked when a writer is waiting (writer preference — prevents writer starvation). Must not be acquired from interrupt context.
/// Sleeping reader-writer lock. Many readers OR one writer. Writer-preferring.
///
/// # Contexts
/// Same restrictions as `Mutex<T>`: task context only, not under a SpinLock.
///
/// # Writer preference
/// When a writer is waiting, new `read_lock()` calls block. This prevents writer
/// starvation at the cost of reduced read throughput under write pressure.
pub struct RwLock<T> {
/// Packed state: bits [30:0] = active reader count, bit 31 = writer-held flag.
/// Writer-waiting flag is tracked separately in the waiters list.
state: AtomicU32,
/// Queued readers and writers. Each entry carries a `WaiterKind` tag.
waiters: IntrusiveList<RwWaiter>,
/// Protects the `waiters` list.
waiters_lock: RawSpinLock,
data: UnsafeCell<T>,
}
pub enum WaiterKind { Reader, Writer }
pub struct RwWaiter {
kind: WaiterKind,
task: *mut Task,
node: IntrusiveListNode,
}
/// Guard for shared read access. Multiple `RwLockReadGuard`s can coexist.
pub struct RwLockReadGuard<'a, T> {
lock: &'a RwLock<T>,
}
/// Guard for exclusive write access. Uniquely held.
pub struct RwLockWriteGuard<'a, T> {
lock: &'a RwLock<T>,
}
impl<T> RwLock<T> {
/// Acquire shared read access. Blocks if a writer holds or is waiting for the lock.
pub fn read(&self) -> RwLockReadGuard<'_, T>;
/// Acquire exclusive write access. Blocks until all readers and the current
/// writer (if any) release the lock.
pub fn write(&self) -> RwLockWriteGuard<'_, T>;
/// Non-blocking read acquire.
pub fn try_read(&self) -> Option<RwLockReadGuard<'_, T>>;
/// Non-blocking write acquire.
pub fn try_write(&self) -> Option<RwLockWriteGuard<'_, T>>;
}
Lock Hierarchy Summary
The four types form a strict containment hierarchy. A thread may hold a higher-level lock while acquiring a lower-level one, but not the reverse:
| Type | Sleeps? | IRQ-safe? | Can hold while acquiring... |
|---|---|---|---|
| RawSpinLock | No | Caller-managed | (nothing lower) |
| SpinLock<T> | No | Yes (saves/restores) | RawSpinLock |
| Mutex<T> | Yes | No (task context only) | RawSpinLock, SpinLock<T> |
| RwLock<T> | Yes | No (task context only) | RawSpinLock, SpinLock<T> |
Critical constraint: holding a Mutex or RwLock guard and then acquiring a
SpinLock is permitted (Mutex is at a higher conceptual level and does not spin).
The reverse — acquiring a SpinLock and then calling Mutex::lock() — would put
the spinlock holder to sleep, violating the spinlock contract. The const-generic
lock level system (Section 3.1.4) enforces this at compile time by assigning
SpinLock-backed locks to lower numeric levels than Mutex-backed locks.
3.1.6 Lock-Free Data Structures
Where possible, lock-free data structures replace locked ones:
- MPSC ring buffers: For all cross-domain communication (io_uring-style)
- RCU-protected radix trees: For page cache lookups
- Per-CPU freelists: For slab allocation (no cross-CPU contention)
- Atomic reference counts: For capability tokens and shared objects
- Sequence locks (seqlock): For rarely-written, frequently-read data (e.g., system time, mount table snapshot)
- IDR (Integer ID Radix tree): For integer-keyed namespaces (PIDs, IPC IDs, file descriptors). See Section 3.1.6.1 below.
3.1.6.0 Compare-and-Swap Semantics Differ by Architecture
All UmkaOS lock-free algorithms that update shared state use compare-and-swap (CAS) operations,
exposed in Rust as AtomicT::compare_exchange and AtomicT::compare_exchange_weak. The
semantics of these operations differ materially across ISAs, and algorithms must be written
to be correct on the weakest variant.
x86 CMPXCHG — guaranteed single-attempt success:
On x86, CMPXCHG (with the LOCK prefix) is an atomic read-modify-write instruction that succeeds if and only if
the current memory value matches the expected value. There are no spurious failures: if no
other CPU has modified the location since the load, the CMPXCHG will succeed on the first
attempt, always. This makes CAS on x86 appear to be a simple conditional store, and
algorithms that assume single-try success work correctly there.
ARM LDXR/STXR — load-exclusive/store-exclusive with spurious failure:
AArch64 and ARMv7 implement CAS using load-exclusive (LDXR) and store-exclusive (STXR)
instruction pairs. The store-exclusive can fail even when no competing CPU has modified the
target address. The CPU's exclusive monitor — a hardware reservation mechanism — can be
cleared by an interrupt, a context switch, a hypervisor VM exit, a cache line eviction, or
other microarchitectural events that have no relation to conflicting memory accesses. When
the exclusive monitor is cleared, STXR returns a failure status and the CAS must be
retried. This spurious failure is architecturally specified and guaranteed to occur
occasionally in practice.
RISC-V LR/SC — load-reserved/store-conditional with spurious failure:
RISC-V uses load-reserved (LR.W/LR.D) and store-conditional (SC.W/SC.D) pairs with
identical semantics. The RISC-V specification explicitly permits SC to spuriously fail even
in the absence of competing accesses. The reservation may be invalidated by any exception,
interrupt, or other event during the LR/SC sequence.
PowerPC ldarx/stdcx. — load-and-reserve/store-conditional:
PowerPC uses lwarx/stwcx. (32-bit) and ldarx/stdcx. (64-bit) with the same
spurious-failure semantics. An interrupt, context switch, or cache invalidation between the
load-reserve and store-conditional will cause the store-conditional to indicate failure,
requiring retry.
Design implication for UmkaOS lock-free algorithms: Every lock-free algorithm in UmkaOS that modifies shared state through a CAS loop MUST be written as an unconditional retry loop that tolerates spurious failure. On x86, the retry branch is never taken in practice; on ARM, RISC-V, and PowerPC, it will be taken occasionally and the algorithm must converge correctly regardless of how many spurious failures occur before a genuine success.
A CAS loop that assumes at most one retry after a genuine conflict is incorrect on ARM/RISC-V/PPC. A CAS loop that checks for spurious failure explicitly but does not re-load the current value before retrying is also incorrect — the retry must reload the current memory value and re-evaluate the expected value before each subsequent attempt.
Rust compare_exchange vs compare_exchange_weak:
- compare_exchange on platforms that use LR/SC or LDXR/STXR internally wraps
the operation in a loop that retries on spurious failure, presenting the caller with
x86-equivalent "fail only on genuine conflict" semantics. This loop is invisible to the
caller and adds retries inside the atomic operation itself.
- compare_exchange_weak exposes the underlying hardware semantics directly, permitting
spurious failure to propagate to the caller. The caller's retry loop must handle it.
In UmkaOS lock-free algorithms that already contain an explicit retry loop (the standard
pattern for CAS-based updates), compare_exchange_weak is preferred. Using
compare_exchange inside an outer retry loop results in double-looping on ARM/RISC-V:
the inner hidden loop retries spurious failures, and the outer loop retries genuine
conflicts. The compare_exchange_weak variant eliminates the inner loop, letting the
outer loop handle both cases, reducing instruction count and branch pressure on the
architectures that use LL/SC-style instruction pairs (LDXR/STXR, LR/SC).
3.1.6.1 Idr<T> — Integer ID Allocator
Idr<T> is a radix-tree-based integer ID allocator, equivalent to Linux's struct idr
(reimplemented on XArray since Linux 4.20). It maps small non-negative integer keys to
values with O(1) average-case lookup and integrated next-ID allocation.
/// Radix-tree-based integer ID allocator.
///
/// Internally uses a 64-ary radix tree (6 bits per level). For the common
/// case of <64K entries, the tree is 3 levels deep (6+6+6 = 18 bits).
/// Each node is cache-line-aligned (64 bytes metadata + 512 bytes pointers).
///
/// **Concurrency model** (matches Linux's IDR/XArray):
/// - **Reads**: RCU-protected. `idr_find()` requires only `rcu_read_lock()`
/// and performs a lock-free radix tree walk. Multiple readers on different
/// CPUs never contend.
/// - **Writes**: Serialized by an internal `SpinLock`. `idr_alloc()` and
/// `idr_remove()` acquire the lock, update the tree, then publish changes
/// via RCU (store-release on the node pointer). Freed nodes are reclaimed
/// after an RCU grace period.
/// - **No external locking required** for basic operations. Callers that
/// need atomic read-modify-write sequences (e.g., "find and update if
/// exists") must hold their own lock.
pub struct Idr<T> {
/// Root of the radix tree.
root: *mut IdrNode<T>,
/// Next ID hint (speeds up sequential allocation).
next_id: AtomicU32,
/// Write-side lock (protects tree mutations).
lock: SpinLock<()>,
}
Operations:
- idr_alloc(value) -> Result<u32, IdrError>: Allocate the next free ID, store value.
- idr_alloc_range(min, max, value) -> Result<u32, IdrError>: Allocate within a range.
- idr_find(id) -> Option<&T>: Lock-free lookup (requires rcu_read_lock).
- idr_remove(id) -> Option<T>: Remove entry, RCU-deferred node reclamation.
- idr_for_each(f: FnMut(u32, &T)): Iterate all entries (requires rcu_read_lock).
Usage: PID namespaces (Section 16.1.2), SysV IPC ID allocation (Section 16.1.2), file descriptor tables (Section 7.1.1), and any kernel subsystem mapping small integers to objects.
3.1.6.2 WaitQueueHead — Blocking Wait Queue
A WaitQueueHead is a spinlock-protected intrusive list of waiters. Kernel code
that needs to block until a condition is true (e.g., pipe readable, socket writable,
child exited) inserts itself into a WaitQueueHead and calls schedule(). When the
condition changes, the producer calls wake_up() to unblock one or all waiters.
/// A single waiter registered on a WaitQueueHead.
/// Embedded in the waiter's stack frame or task struct.
pub struct WaitQueueEntry {
/// Wakeup function called when the waiter is unblocked.
/// For normal task sleep: sets task state to TASK_RUNNING and
/// calls `try_to_wake_up()` (enqueues on runqueue).
pub wakeup: fn(*mut WaitQueueEntry) -> bool,
/// Link in the wait queue list (intrusive).
pub link: IntrusiveListNode,
/// Pointer to the waiting task.
pub task: Arc<Task>,
/// Private data for the wakeup function (e.g., poll key for epoll).
pub private: usize,
}
/// Head of a wait queue. Holds a spinlock and the list of waiters.
/// Embedding one in a struct means "tasks can block waiting for this object".
/// Used by: PipeBuffer (read_wait/write_wait), sockets, epoll fds, futexes,
/// page fault waits, inode event waits.
pub struct WaitQueueHead {
/// Protects the waiter list and coordinates with the waker.
pub lock: SpinLock<()>,
/// List of `WaitQueueEntry` nodes, ordered by insertion time (FIFO wakeup).
pub head: IntrusiveListHead,
}
impl WaitQueueHead {
/// Block until `condition()` returns true. Interruptible by signals.
/// Returns `Ok(())` when condition is true, `Err(EINTR)` if interrupted.
///
/// **Algorithm** (standard wait-loop; avoids lost-wakeup race):
/// ```
/// 1. Allocate WaitQueueEntry on caller's stack with wakeup=default_wakeup.
/// 2. Lock self.lock.
/// 3. Append entry to self.head (FIFO order).
/// 4. Set current task state to TASK_INTERRUPTIBLE.
/// 5. Unlock self.lock.
/// 6. Check condition(). If true:
/// Set task state to TASK_RUNNING.
/// Remove entry from queue (lock → splice out → unlock).
/// Return Ok(()).
/// 7. Call schedule(). Suspend until wake_up_one/wake_up_all is called.
/// 8. On resume: set task state to TASK_RUNNING.
/// Check for pending signal: if signal_pending(), go to step 9.
/// Check condition(). If true: remove entry, return Ok(()).
/// Otherwise: set state to TASK_INTERRUPTIBLE, go to step 7 (retry loop).
/// 9. Signal path: remove entry from queue, return Err(EINTR).
/// ```
/// Step 4 before step 6 is essential: if a waker calls `wake_up_one()` anywhere
/// between steps 5 and 7 (including after the condition check in step 6 has
/// already failed), it sets the task TASK_RUNNING before `schedule()` is called,
/// so `schedule()` returns immediately and the wakeup is never lost.
pub fn wait_event<F: Fn() -> bool>(&self, condition: F) -> Result<(), Errno>;
/// Block until `condition()` returns true. NOT interruptible by signals.
/// Returns only when the condition is true (no EINTR).
/// Same algorithm as `wait_event` but uses TASK_UNINTERRUPTIBLE at step 4
/// and skips the signal check at step 8.
pub fn wait_event_uninterruptible<F: Fn() -> bool>(&self, condition: F);
/// Wake up exactly one waiter (the first in FIFO order).
///
/// **Algorithm**:
/// ```
/// 1. Lock self.lock.
/// 2. If self.head is empty: unlock, return.
/// 3. Pop the first WaitQueueEntry from self.head.
/// 4. Unlock self.lock.
/// 5. Call entry.wakeup(&entry):
/// default_wakeup: atomically sets entry.task.state = TASK_RUNNING,
/// then calls sched_enqueue(entry.task) to place task on its CPU's runqueue.
/// ```
/// Unlocking before calling `sched_enqueue` avoids holding the wait queue
/// spinlock during scheduler operations (which may acquire runqueue locks).
pub fn wake_up_one(&self);
/// Wake up all waiters simultaneously (broadcast).
///
/// **Algorithm**:
/// ```
/// 1. Lock self.lock.
/// 2. Drain self.head into a local list (swap head pointer to empty list).
/// 3. Unlock self.lock.
/// 4. For each entry in local list: call entry.wakeup(&entry).
/// ```
/// Draining to a local list (step 2) means wakeups run outside the lock,
/// avoiding contention between woken tasks and new waiters inserting.
pub fn wake_up_all(&self);
}
Usage: PipeBuffer (§16.3.2), socket wait queues, epoll event delivery,
page fault wait-on-writeback, waitpid() child-exit notification.
3.1.7 Scalability Analysis: Hot-Path Metadata on 256+ Cores
Moving from a monolithic kernel to a hybrid one can introduce new contention bottlenecks in the "core" layer — specifically in metadata structures that every I/O operation must touch. This section analyzes UmkaOS's three highest-contention metadata subsystems on many-core machines (256-512 cores, 4-8 NUMA nodes).
1. Capability Table (Section 8.1)
Every syscall and driver invocation performs a capability check. On a 512-core machine running 10,000 concurrent processes, the capability table sees millions of lookups per second.
Design for contention:
- Per-process capability tables: each process has its own capability table (a
small array indexed by capability handle, typically <256 entries). Lookups are
process-local with no cross-process contention. Two processes on different CPUs
never touch the same capability table.
- Capability creation/delegation (write path): uses a per-process lock. Only
contends when the same process creates capabilities from multiple threads
simultaneously (rare — capability creation is a cold path).
- Capability revocation: RCU-deferred. Revoking a capability marks it as invalid
(atomic store), then defers memory reclamation to an RCU grace period. No lock
on the read path.
Contention profile: none on read path (per-process, indexed by local handle). Comparable to Linux's file descriptor table — per-process, never a global bottleneck.
2. Unified Object Namespace / umkafs (Section 19.4)
The object namespace provides /sys/kernel/umka/ introspection. On a busy system,
monitoring tools (umka-top, prometheus-exporter) may read thousands of objects
per second.
Design for contention:
- Registry reads: the device registry (Section 10.5) is RCU-protected. Reads are
lock-free. Enumeration snapshots use a seqlock to detect concurrent modifications.
- Attribute reads (sysfs-style): most attributes are backed by atomic counters
or per-CPU counters that are aggregated on read. Example: a NIC's rx_packets
counter is per-CPU; reading it sums all per-CPU values. No lock on the write path
(per-CPU increment), brief aggregation on the read path.
- Object creation/destruction (hot-plug, process exit): per-subsystem lock, not
global. Creating a network device takes the network subsystem lock; creating a block
device takes the block subsystem lock. No global namespace lock.
- Pathological case: a monitoring tool reading every object attribute on every CPU
at 1Hz on a 512-core, 10,000-device system. The aggregation overhead is proportional
to num_cpus × num_objects — at 512 × 10,000, this is ~5M atomic reads per scan.
At ~5ns each, one scan takes ~25ms. This is acceptable for 1Hz monitoring; for
higher-frequency monitoring, tools should read only the objects they need.
Note: This 25ms scan runs in process context (a monitoring thread), not in
interrupt context. It does not affect the 50μs interrupt latency guarantee
from Section 7.2.2, which governs ISR entry latency — a completely orthogonal concern.
Contention profile: lock-free reads (RCU + atomics), per-subsystem writes. Bottleneck only if monitoring tools read aggressively (configurable rate limiting).
3. Page Cache and Memory Management (Section 4.1)
The page cache is the single highest-contention structure in any kernel. Every file read, every mmap fault, every page writeback touches it.
Design for contention:
- Page cache lookups: RCU-protected radix tree, per-inode (same as Linux's xarray).
Two threads faulting pages from different files never contend. Two threads faulting
different pages from the same file contend only at the xarray level (fine-grained,
per-node locking within the radix tree).
- LRU lists: per-NUMA-node LRU lists with per-CPU pagevec batching (same design as
Linux). Pages are added to a per-CPU buffer; the buffer is drained to the LRU list
in batches of 15 pages. This amortizes the per-NUMA LRU lock to 1 acquisition per
15 page faults.
- Buddy allocator: per-NUMA-node lock, with per-CPU page caches absorbing >95%
of allocations (Section 4.1.2). On a 512-core, 8-NUMA-node system, each buddy
allocator sees ~1/8 of the allocation traffic, and per-CPU caches absorb most of it.
- Known scaling limit: extreme mmap/munmap churn (thousands of VMAs per second
per process) contends on the per-process VMA lock. UmkaOS uses the same maple tree
(lockless reads, locked writes) as Linux 6.1+. For workloads that create/destroy
VMAs at extreme rates (JVMs, some databases), this is a known bottleneck in all
current kernels. UmkaOS does not claim to solve it — the contention is fundamental to
the VMA data structure, not to the hybrid architecture.
Contention profile: per-inode, per-NUMA, per-CPU on all hot paths. No new bottlenecks introduced by the hybrid architecture — UmkaOS's page cache follows the same design as Linux (which already runs on 512+ core machines).
Summary — the hybrid architecture's overhead (isolation domain switches) is orthogonal to metadata contention. The "core" layer's metadata structures use the same per-CPU, per- NUMA, RCU-based techniques that make Linux scale to 256+ cores. No new global locks are introduced. The capability system is per-process (no shared structure), the object namespace is RCU-read / per-subsystem-write, and the page cache follows Linux's proven xarray + pagevec design.
3.1.8 Interrupt Handling
- All device interrupts are routed to threaded interrupt handlers (as with Linux's
request_threaded_irq() forced-threading model).
- Top-half handlers are kept minimal: acknowledge interrupt, wake thread.
- This ensures all interrupt processing is preemptible and schedulable.
- MSI/MSI-X is preferred for all PCI devices (per-queue interrupts, no sharing).
- High-rate interrupt paths (100GbE at line-rate, NVMe at millions of IOPS): the threaded-handler model introduces scheduling latency (~1-5μs per interrupt) that could limit throughput if each packet triggered a separate interrupt. UmkaOS addresses this the same way Linux does — NAPI-style polling: the first interrupt wakes the thread, the thread then polls the device ring in a busy loop until the ring is drained, then re-enables interrupts. For 100GbE, this means the thread is woken once per batch of 64-256 packets, not once per packet. Combined with MSI-X per-queue affinity (one interrupt thread per CPU, one queue per CPU), the scheduling overhead is amortized to <100ns per packet, which is within the performance budget (Section 1.2).
3.1.9 Memory Model Differences Across Architectures
Critical Implementation Warning: UmkaOS relies extensively on lock-free concurrency, asynchronous ring buffers, and RCU to meet its strict performance budget. All lock-free algorithms must be correct across every target architecture — not just on the architecture where they were developed and tested.
3.1.9.1 x86 TSO Conceals Ordering Bugs
x86_64 implements Total Store Ordering (TSO). Under TSO, loads are not reordered with other loads, stores are not reordered with other stores, and stores are observed in program order by all CPUs. This is a significantly stronger ordering guarantee than software actually requires for most algorithms.
The consequence for development is that lock-free code with missing memory barriers, or code that uses Ordering::Relaxed where Ordering::Release is required, will almost always execute correctly on x86 and pass all tests there. The hardware silently supplies the ordering that the programmer omitted. Bugs of this class are structurally invisible on x86 — no amount of stress-testing on x86 hardware will expose them, because the CPU never exercises the reorderings that would trigger the race.
x86 TSO pays a hidden hardware cost. The strong ordering guarantee is not free: Intel and AMD CPUs implement store buffers and memory ordering machinery that impose a hardware tax on every store, whether or not the software needs ordered visibility. Code using Ordering::Release on x86 compiles to a plain store (the hardware already provides release semantics), but the CPU still pays the store-ordering overhead internally. Software running on x86 is paying for ordering it often does not need.
The practical conclusion: x86 is an unreliable test platform for lock-free code. A lock-free algorithm that passes all tests on x86 has not been validated — it has been tested on the platform most likely to hide its bugs. Correctness on x86 is necessary but not sufficient.
3.1.9.2 ARM, RISC-V, and PowerPC: Explicit Memory Ordering Surfaces True Requirements
AArch64, ARMv7, RISC-V, and PowerPC implement relaxed memory models. The CPU is permitted to reorder independent memory reads and writes for performance — a store to address A followed by a load from address B can complete in either order if A and B are in different cache lines and there is no explicit ordering constraint between them. Stores from one CPU become visible to other CPUs at different times, and different CPUs may observe stores in different orders unless barriers enforce a consistent sequence.
This is not a deficiency in these architectures — it is the architecturally correct exposure of what ordering actually costs. These architectures give software precise control: pay for ordering when the algorithm requires it, pay nothing when it does not. A missing memory barrier on AArch64 or RISC-V produces visible, reproducible failures: sequence locks read torn data, ring buffer consumers observe the tail pointer advance before payload data is visible, and RCU readers dereference pointers to uninitialized memory. The bug that x86 conceals, ARM and RISC-V surface.
Develop and test lock-free algorithms on ARM or RISC-V. Only a platform with relaxed memory ordering can expose ordering bugs. x86 is unreliable as a correctness gate for lock-free primitives.
Implementation Mandates: To ensure lock-free algorithms are correct across all target architectures, the following rules apply:
- Explicit Rust Atomics: All shared memory synchronization must use Rust's
core::sync::atomic types with mathematically correct memory orderings (Acquire,
Release, AcqRel, SeqCst). Never rely on implicit hardware ordering; write code
that is correct on the weakest memory model UmkaOS targets.
- Release-Acquire Semantics: The standard pattern for lock-free publishing in
UmkaOS (e.g., advancing a ring buffer head, or updating an RCU pointer) MUST pair
an Ordering::Release store on the producer with an Ordering::Acquire load on the
consumer.
- No Relaxed in Control Flow: Ordering::Relaxed may only be used for pure
statistical counters (e.g., rx_packets). It must never be used to synchronize
visibility of other data.
- Mandatory Multi-Arch CI: Lock-free primitives (MPSC queues, RCU, seqlocks) must
be subjected to heavy stress-testing natively on AArch64 and RISC-V hardware (or
QEMU/emulators with memory-reordering fuzzing enabled). x86_64 test passes are
considered insufficient to prove the correctness of lock-free code.
3.1.9.3 Ordering Instruction Cost: Ring Buffer and RCU Performance by Architecture
The implementation mandate to use Ordering::Release/Ordering::Acquire on ring buffer
head/tail pointer updates and RCU pointer publications has measurably different instruction-
level costs across architectures. Understanding this is essential for interpreting benchmark
results and for choosing the minimum correct ordering at each synchronization point.
Ordering::Release store — instruction cost by architecture:
| Architecture | Compiled instruction | Approximate cost |
|---|---|---|
| x86-64 | Plain MOV (no additional instruction) | ~1 cycle (TSO provides release for free) |
| AArch64 | STLR (store-release) or DMB ISH + STR | ~5-20 cycles (barrier flushes store buffer) |
| ARMv7 | DMB ISH + STR | ~10-30 cycles |
| RISC-V | FENCE RW,W before store (or amoswap with .rl) | ~10-30 cycles |
| PowerPC | lwsync before store | ~10-20 cycles |
Ordering::Acquire load — instruction cost by architecture:
| Architecture | Compiled instruction | Approximate cost |
|---|---|---|
| x86-64 | Plain MOV (no additional instruction) | ~1 cycle (TSO prevents load reordering) |
| AArch64 | LDAR (load-acquire) | ~5-20 cycles |
| ARMv7 | LDR + DMB ISH | ~10-30 cycles |
| RISC-V | FENCE R,RW after load (or lr with .aq) | ~10-30 cycles |
| PowerPC | lwsync after load or isync on branch path | ~10-20 cycles |
Implications for UmkaOS ring buffer throughput:
Ring buffer head/tail pointer updates require a Release store on the producer side and an Acquire load on the consumer side — this is the minimum correct ordering and cannot be reduced without introducing a race. On x86, these compile to ordinary MOV instructions; the TSO hardware silently provides the ordering. On ARM and RISC-V, each ring buffer publish or consume event pays a real instruction-level barrier cost.
The consequence is that ring buffer throughput on AArch64 and RISC-V will be measurably lower than on x86 for identical Rust source code — not because of a bug or a missing optimization, but because x86 is paying its ordering cost in hardware (through store-buffer logic and pipeline constraints that are invisible to software), while ARM and RISC-V pay it explicitly in the instruction stream. Both are paying; only the accounting differs.
Ordering::Relaxed usage in UmkaOS ring buffers:
Within the ring buffer implementation, Ordering::Relaxed is used selectively where only
atomic visibility (not ordering relative to other accesses) is required:
- Statistical counters (rx_packets, tx_bytes, drop counters): Relaxed on both
store and load. These are sampled for reporting only; no other memory access is
conditioned on their value.
- Per-CPU freelist size counters: Relaxed. The size is advisory; the actual
allocation uses a separate acquire-load on the freelist head pointer.
- Dead-reckoning checks (e.g., "is the ring approximately empty?"): Relaxed. If
the approximate check is stale by one entry, the caller falls back to a
serializing path.
Ordering::Relaxed is never used for the head/tail pointers that control whether a slot
is safe to read or write. Misuse of Relaxed on these pointers produces silent data
corruption on ARM and RISC-V; it works by accident on x86. The rule from mandate (3) is
absolute: Relaxed may not appear in any code path that controls visibility of payload
data.
3.2 Error Handling and Fault Containment
3.2.1 Kernel Error Model
Linux problem: Kernel functions return negative integers (-ENOMEM, -EINVAL) for
errors, sometimes stuffed into pointers via ERR_PTR(). The type system enforces nothing —
callers can silently ignore errors, confuse pointers with error pointers, or propagate the
wrong errno. Unchecked kmalloc returns and missing error propagation are endemic.
UmkaOS design: All kernel-internal functions return Result<T, KernelError>. The ?
operator propagates errors up the call stack. The Rust type system makes it impossible to
silently ignore an error — no integer error codes, no sentinel values, no ERR_PTR.
/// Canonical kernel error type. All umka-core and driver-facing
/// kernel functions return `Result<T, KernelError>`.
#[repr(u32)]
pub enum KernelError {
OutOfMemory = 1, // Physical or virtual memory exhausted
InvalidCapability = 2, // Capability handle missing or revoked
PermissionDenied = 3, // Capability lacks required permission bits
InvalidArgument = 4, // Syscall argument out of range
DeviceError = 5, // Device error or device in error state
Timeout = 6, // Operation timed out
WouldBlock = 7, // Non-blocking operation would block
NotFound = 8, // Requested object does not exist
AlreadyExists = 9, // Object or resource already exists
Interrupted = 10, // Operation interrupted by signal
IoError = 11, // Generic I/O error (disk, network, DMA)
// Extensible: new variants added at end, values are stable.
}
POSIX errno mapping: The syscall entry point (Section 18.1) converts
KernelError to POSIX errno values at the syscall boundary — the only place integer
error codes exist:
| KernelError | POSIX errno | Value |
|---|---|---|
| OutOfMemory | ENOMEM | 12 |
| InvalidCapability | EBADF | 9 |
| PermissionDenied | EPERM / EACCES | 1 / 13 |
| InvalidArgument | EINVAL | 22 |
| DeviceError | EIO | 5 |
| Timeout | ETIMEDOUT | 110 |
| WouldBlock | EAGAIN | 11 |
| NotFound | ENOENT | 2 |
| AlreadyExists | EEXIST | 17 |
| Interrupted | EINTR | 4 |
| IoError | EIO | 5 |
Some variants map to different errnos depending on context (PermissionDenied becomes
EPERM for capability operations, EACCES for filesystem operations). The translation
is handled by the syscall dispatch layer, not by the originating subsystem.
3.2.2 Fault Containment Boundaries
UmkaOS has four fault containment domains. A fault in one domain does not propagate to domains above it in the hierarchy:
| Domain | Failure scope | Recovery |
|---|---|---|
| umka-core (Tier 0) | Kernel panic — entire system | Reboot (same as Linux) |
| Tier 1 driver (domain-isolated) | Single driver crash | Automatic restart, device FLR (Section 10.8) |
| Tier 2 driver (process) | Single driver process crash | Automatic restart (Section 10.8) |
| Userspace process | Single process terminated | Application-level recovery |
Linux problem: Linux has exactly one fault domain for the entire kernel. A null dereference in an obscure USB driver is indistinguishable from a bug in the scheduler — both trigger the same kernel panic. The only containment boundary is kernel vs. userspace.
UmkaOS design: The isolation model (Section 10.2) gives each Tier 1 driver its own
isolation domain. When a CPU exception fires (page fault, general protection fault,
divide-by-zero), umka-core's exception handler inspects the faulting context's isolation
domain ID (architecture-specific: PKRU on x86, page table base on ARM/RISC-V — see
arch::current::isolation::current_domain_id()):
- Domain 0 (umka-core): The fault is in the trusted kernel. This is a genuine kernel panic — proceed to the panic handler (Section 3.2.3).
- Domain 1-N (Tier 1 driver): The fault is in an isolated driver. The exception handler identifies the driver from the faulting domain ID, marks it as crashed, and invokes the crash recovery sequence (Section 10.8). The rest of the kernel continues running.
Tier 2 driver faults are even simpler: the driver runs in a separate address space, so a fault (SIGSEGV, SIGBUS, SIGFPE) terminates the driver process. The driver supervisor detects the exit and restarts it.
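The dispatch on the faulting domain ID reduces to a small decision function. This is a sketch under assumed names (`FaultAction`, `route_fault`); the real handler lives in umka-core and reads the domain ID from PKRU / the page table base as described above:

```rust
/// Action the exception handler takes, keyed on the faulting
/// isolation domain ID (names invented for this sketch).
#[derive(Debug, PartialEq)]
pub enum FaultAction {
    KernelPanic,         // domain 0: fault in umka-core itself
    DriverRecovery(u32), // domain 1..N: crash-recover that driver
}

pub fn route_fault(domain_id: u32) -> FaultAction {
    if domain_id == 0 {
        FaultAction::KernelPanic
    } else {
        FaultAction::DriverRecovery(domain_id)
    }
}
```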
3.2.3 Panic Handling
A kernel panic means a bug in umka-core itself — the small trusted computing base. This is the only code whose failure is fatal.
Panic sequence:
1. DISABLE INTERRUPTS — local CLI on the faulting CPU.
NMI IPI broadcast to all other CPUs. Each CPU receiving the NMI executes
the NMI panic handler, which:
(a) Saves the current register context to a pre-allocated per-CPU crash buffer
(allocated at boot, never freed, immune to OOM — one 4KB page per CPU).
(b) Disables local interrupts (preventing further preemption or nested exceptions).
(c) Spins on an atomic flag waiting for the panic coordinator (the faulting CPU).
This ensures all CPUs are in a known-safe state before the coordinator reads
system data structures. The NMI handler is NMI-safe — it uses no locks, no
allocation, and no printk. It writes only to the pre-allocated crash buffer.
Architecture-specific NMI delivery: On x86-64, this uses the APIC NMI delivery
mode. On AArch64 (GICv3.3+), the GIC NMI mechanism (GICD_INMIR) is used; on
older GIC implementations, a highest-priority FIQ is used as a pseudo-NMI (same
technique as Linux's CONFIG_ARM64_PSEUDO_NMI). On RISC-V, sbi_send_ipi() with a
dedicated panic IPI vector is used. **Limitation**: standard RISC-V supervisor
interrupts are maskable — sstatus.SIE=0 blocks all supervisor interrupts
regardless of AIA priority. If the target CPU has interrupts disabled, the IPI
will not be delivered. Mitigation: the panic coordinator uses a 100 ms timeout
per CPU; CPUs that do not respond are marked "unavailable" in the crash dump.
On systems implementing the Smrnmi extension (Resumable Non-Maskable Interrupts),
UmkaOS uses RNMI delivery instead, which is truly non-maskable. On PPC64LE, the OPAL
opal_signal_system_reset() call triggers a system reset interrupt on target CPUs.
2. CAPTURE STATE — faulting CPU registers, stack backtrace (.eh_frame),
per-CPU crash buffers (from step 1), key data structures (process list,
cap table, driver registry, last 64KB klog)
3. SERIAL FLUSH — panic message + backtrace to serial (Tier 0, polled, always works)
4. CRASH DUMP — if configured, write ELF core dump to reserved memory region;
if NVMe panic-write path registered, polled-mode write to disk (Section 10.8)
5. HALT — default halt (umka.panic=halt), or reboot (umka.panic=reboot)
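The rendezvous in step 1 (secondary CPUs check in from the NMI handler; the coordinator spins with a per-CPU timeout) can be simulated in userspace with atomics. Everything here is illustrative -- threads stand in for CPUs, and `wait_for_cpus` is an invented name:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::{Duration, Instant};

/// Coordinator side of the panic rendezvous: spin until `expected`
/// CPUs have checked in or the timeout expires. Returns how many CPUs
/// failed to respond (they would be marked "unavailable" in the dump).
fn wait_for_cpus(checked_in: &AtomicU32, expected: u32, timeout: Duration) -> u32 {
    let deadline = Instant::now() + timeout;
    loop {
        let n = checked_in.load(Ordering::Acquire);
        if n >= expected || Instant::now() >= deadline {
            return expected - n.min(expected);
        }
        std::hint::spin_loop();
    }
}
```

The kernel version differs only in that each "thread" is an NMI handler writing to its per-CPU crash buffer before incrementing the counter, and the timeout (100 ms per the text) covers CPUs that never receive the IPI.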
Driver panic vs. kernel panic: A panic!() inside Tier 1 driver code does NOT
panic the kernel. The kernel is compiled with panic = "abort", so there is no stack
unwinding. Instead, panic!() calls abort(), which executes an illegal instruction
(ud2 on x86-64, udf on AArch64/ARMv7, unimp on RISC-V, trap on PPC). This
triggers a CPU exception (invalid opcode / undefined instruction) within the driver's
isolation domain. The exception handler identifies the faulting domain as a non-core
driver domain and routes the fault to driver crash recovery (Section 10.8) — not to the
kernel panic path.
OOM policy: When physical memory is exhausted, UmkaOS applies pressure in stages:
- Reclaim page cache: Clean pages are evicted immediately (no I/O cost). Dirty pages are written back and then evicted.
- Compress to zpool: Inactive anonymous pages are compressed and moved to the in-kernel compression tier (Section 4.2), reducing physical memory usage 2-3x.
- Swap to disk: If a swap device is configured, compressed pages that haven't been accessed spill to disk.
- OOM killer: If all of the above fail to free enough memory, the OOM killer selects a process to terminate. Heuristic: largest RSS, not marked OOM_SCORE_ADJ=-1000, not system-critical, not recently started. The selected process receives SIGKILL.
umka-core itself is never OOM-killed. Core kernel allocations draw from a reserved memory pool (configured at boot, default 64MB) that is excluded from the general-purpose allocator. If the reserved pool is exhausted — a symptom of a kernel memory leak — this is a kernel panic, not an OOM kill.
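The victim-selection heuristic can be sketched as a filter-then-max over candidates. The `Candidate` struct and the 1-second "recently started" cutoff are assumptions for this sketch; the text does not specify the threshold:

```rust
/// Candidate process as seen by the OOM killer (sketch types).
pub struct Candidate {
    pub pid: u64,
    pub rss_pages: u64,
    pub oom_score_adj: i32, // -1000 means "never kill"
    pub system_critical: bool,
    pub age_ms: u64,        // time since process start
}

/// Pick the victim per the heuristic above: largest RSS among eligible
/// processes. "Recently started" is taken as < 1000 ms here, an
/// invented threshold for illustration.
pub fn pick_victim(cands: &[Candidate]) -> Option<u64> {
    cands.iter()
        .filter(|c| c.oom_score_adj != -1000 && !c.system_critical && c.age_ms >= 1000)
        .max_by_key(|c| c.rss_pages)
        .map(|c| c.pid)
}
```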
3.2.4 Error Reporting to Userspace
Syscall error returns: Standard Linux ABI — negative errno in the return register
(rax on x86-64, x0 on AArch64, a0 on RISC-V). Applications, glibc, and musl all
work unmodified.
Extended error information: For complex failures where a single errno is insufficient (e.g., "which capability was invalid?" or "which device returned an error?"), UmkaOS provides a per-thread extended error buffer:
/// Per-thread extended error context, populated on syscall failure.
#[repr(C)]
pub struct ExtendedError {
pub errno: i32, // POSIX errno (same as syscall return)
pub subsystem: u32, // Kernel subsystem that generated the error
pub detail_code: u32, // Subsystem-specific detail code
pub object_id: u64, // Related capability/device/inode ID (0 = N/A)
}
Queried via prctl(PR_GET_EXTENDED_ERROR, &buf) — entirely optional. Applications that
don't use it see standard errno behavior with zero overhead (the buffer is only written on
error). The subsystem and detail_code fields are stable, allowing diagnostic tools to
produce messages like "capability 0x3f revoked by generation advance" instead of "EBADF".
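A diagnostic tool's decoding step might look like the following sketch. The struct mirrors the ABI above, but the specific subsystem/detail code values and the `describe` function are invented for illustration; real codes would come from the kernel's stable ABI headers:

```rust
/// Mirror of the per-thread extended error struct from the text.
#[repr(C)]
pub struct ExtendedError {
    pub errno: i32,       // POSIX errno (same as syscall return)
    pub subsystem: u32,   // Kernel subsystem that generated the error
    pub detail_code: u32, // Subsystem-specific detail code
    pub object_id: u64,   // Related capability/device/inode ID (0 = N/A)
}

/// Decode to a human-readable message. The (1, 2) pairing with
/// "capability revoked" is a hypothetical example mapping.
pub fn describe(e: &ExtendedError) -> String {
    match (e.subsystem, e.detail_code) {
        (1, 2) => format!("capability {:#x} revoked by generation advance", e.object_id),
        _ => format!("errno {} (subsystem {}, detail {})", e.errno, e.subsystem, e.detail_code),
    }
}
```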
Kernel log messages: Errors are logged to the kernel ring buffer (dmesg) with
structured fields. The stable tracepoint ABI (Section 19.2) exposes these as
machine-parseable events for external monitoring tools.
3.2.5 Error Escalation Paths
Errors escalate through a five-level hierarchy. Each level is attempted before moving to the next:
retry → log → degrade → isolate → panic
1 2 3 4 5
1. Retry: Transient hardware errors (bus timeout, CRC mismatch, link retrain) are retried with exponential backoff. Maximum retries are configured per error class (default: 3 retries, 1 ms / 10 ms / 100 ms backoff). If a retry succeeds, no further escalation occurs — the event is logged at DEBUG level for trending.
2. Log: Persistent errors that survive retries are logged to the kernel ring buffer and recorded as stable tracepoint events (Section 19.2). The Fault Management engine (Section 19.1) ingests these events for threshold-based diagnosis. No state change yet — the subsystem continues operating.
3. Degrade: Repeated errors from the same subsystem trigger graceful degradation. Examples: storage path failover to a redundant controller, NIC fallback from hardware offload to the software path, a memory controller marking a DIMM rank as degraded (Section 19.1 RetirePages action). The subsystem continues at reduced capacity. Degradation is reported via uevent to userspace.
4. Isolate: A misbehaving driver is crashed and restarted via the recovery sequence (Section 10.8). If the same driver crashes repeatedly — 3 times within 60 seconds — it is demoted to Tier 2 (full process isolation). If it continues crashing at Tier 2 (5 cumulative crashes within one hour), the driver is disabled entirely and its device is marked offline in the device registry (Section 10.5).
5. Panic: Reserved for corrupted umka-core state where continued operation risks data loss or silent corruption. Examples: invalid page table entries in kernel mappings, corrupted capability table metadata, scheduler invariant violations. Any state that cannot be recovered by isolating a single driver triggers a kernel panic (Section 3.2.3).
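The isolate-level crash-rate policy (3 crashes in 60 seconds demotes; 5 cumulative crashes in one hour disables) can be sketched as a sliding-window counter. This is a simplified single tracker under assumed names (`CrashTracker`, `Escalation`); the real supervisor would also track which tier the driver currently runs in:

```rust
/// Outcome of recording one driver crash (sketch).
#[derive(Debug, PartialEq)]
pub enum Escalation { Restart, DemoteToTier2, Disable }

pub struct CrashTracker { crashes_ms: Vec<u64> }

impl CrashTracker {
    pub fn new() -> Self { Self { crashes_ms: Vec::new() } }

    /// Record a crash at `now_ms` and return the action to take:
    /// 5 crashes within an hour -> disable; 3 within 60 s -> demote;
    /// otherwise just restart.
    pub fn record(&mut self, now_ms: u64) -> Escalation {
        self.crashes_ms.push(now_ms);
        let last_hour = self.crashes_ms.iter().filter(|&&t| now_ms - t <= 3_600_000).count();
        let last_min = self.crashes_ms.iter().filter(|&&t| now_ms - t <= 60_000).count();
        if last_hour >= 5 {
            Escalation::Disable
        } else if last_min >= 3 {
            Escalation::DemoteToTier2
        } else {
            Escalation::Restart
        }
    }
}
```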
See also: Section 10.8 (Crash Recovery) for the full driver restart sequence. Section 19.1 (Fault Management) for proactive, telemetry-driven error handling before faults occur. Section 19.2 (Stable Tracepoints) for the machine-parseable event format used at escalation levels 2-4.