Chapter 6: Scheduling and Power Management

EEVDF, RT, deadline scheduling, per-CPU runqueues, EAS, power budgeting, CPU bandwidth, timekeeping


6.1 Scheduler

6.1.1 Multi-Policy Design

The scheduler supports three scheduling policies simultaneously:

Policy Algorithm Use case Priority range
Normal EEVDF General-purpose workloads (Section 6.1.2.1) Nice -20 to 19
Real-Time FIFO / RR Latency-sensitive applications RT 1-99
Deadline EDF (CBS) Guaranteed CPU time (audio, etc.) Runtime/Period

6.1.2 Architecture

                    Global Load Balancer
                    (runs every ~4ms)
                          |
            +-------------+-------------+
            |             |             |
        CPU 0          CPU 1         CPU N
    +----------+   +----------+   +----------+
    | RT Queue |   | RT Queue |   | RT Queue |   <- Highest priority
    +----------+   +----------+   +----------+
    | DL Queue |   | DL Queue |   | DL Queue |   <- Deadline tasks
    +----------+   +----------+   +----------+
    |EEVDF Tree|   |EEVDF Tree|   |EEVDF Tree|   <- Normal tasks (red-black tree, EEVDF)
    +----------+   +----------+   +----------+

6.1.2.1 EEVDF Algorithm Specification

The EEVDF (Earliest Eligible Virtual Deadline First) scheduler is the primary scheduling algorithm for normal (non-RT, non-deadline) tasks. This section specifies the complete algorithm as implemented, matching Linux 6.6+ EEVDF semantics.

Virtual runtime and weights. Each task accumulates virtual runtime proportional to its CPU consumption, inversely scaled by its weight:

vruntime += delta_exec_ns * (NICE_0_WEIGHT / task_weight)

where NICE_0_WEIGHT = 1024 (the weight of a nice-0 task) and delta_exec_ns is the wall-clock nanoseconds the task ran since the last accounting update. Higher-weight tasks accumulate vruntime more slowly (they are entitled to more CPU), and lower-weight tasks accumulate vruntime more quickly.
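The accounting rule can be checked with a short sketch (the function name is illustrative; weights are taken from the sched_prio_to_weight table in this section, and the multiply is done before the divide to keep integer precision):

```rust
/// Weight of a nice-0 task, as defined in this section.
const NICE_0_WEIGHT: u64 = 1024;

/// One accounting step: vruntime += delta_exec_ns * (NICE_0_WEIGHT / task_weight),
/// computed in integer arithmetic (multiply first to avoid truncating the ratio).
fn update_vruntime(vruntime: u64, delta_exec_ns: u64, task_weight: u64) -> u64 {
    vruntime + delta_exec_ns * NICE_0_WEIGHT / task_weight
}
```

A nice-0 task (weight 1024) accumulates vruntime at exactly the wall-clock rate; heavier tasks accumulate more slowly and lighter tasks faster, as the text describes.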

The sched_prio_to_weight table maps nice values to weights (identical to Linux CFS/EEVDF). The 40 entries (nice -20 to +19):

Nice Weight Nice Weight Nice Weight Nice Weight
-20 88761 -10 9548 0 1024 10 110
-19 71755 -9 7620 1 820 11 87
-18 56483 -8 6100 2 655 12 70
-17 46273 -7 4904 3 526 13 56
-16 36291 -6 3906 4 423 14 45
-15 29154 -5 3121 5 335 15 36
-14 23254 -4 2501 6 272 16 29
-13 18705 -3 1991 7 215 17 23
-12 14949 -2 1586 8 172 18 18
-11 11916 -1 1277 9 137 19 15

Each step of +1 nice reduces CPU share by approximately 10% (weight ratio ~1.25 between adjacent nice levels). The inverse weight table (sched_prio_to_wmult) is precomputed for fixed-point division: wmult[i] = 2^32 / weight[i].
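The wmult fixed-point trick can be illustrated in a few lines (a sketch, not the kernel code; the u128 widening is an assumption made here to avoid intermediate overflow):

```rust
const NICE_0_WEIGHT: u64 = 1024;

/// Precomputed inverse weight: wmult[i] = 2^32 / weight[i].
fn wmult(weight: u64) -> u64 {
    (1u64 << 32) / weight
}

/// Division-free scaling: delta * NICE_0_WEIGHT / weight becomes a multiply
/// by the precomputed inverse followed by a 32-bit right shift.
fn scaled_delta(delta_exec_ns: u64, weight: u64) -> u64 {
    ((delta_exec_ns as u128 * NICE_0_WEIGHT as u128 * wmult(weight) as u128) >> 32) as u64
}
```

For power-of-two weights the result is exact; for table weights it agrees with true division to within one unit, which is why the precomputed table is safe to use in the hot path.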

Eligibility. A task is eligible to run when it has not received more than its fair share of CPU time. Eligibility is determined by comparing the task's virtual runtime against the run queue's minimum virtual runtime (min_vruntime):

eligible_vtime = vruntime - (lag * NICE_0_WEIGHT) / task_weight
eligible = (eligible_vtime <= min_vruntime)

A task with positive lag (received less CPU than its fair share) has eligible_vtime < vruntime, making it eligible sooner. A task with negative lag (received more than its fair share) has eligible_vtime > vruntime, delaying its eligibility until min_vruntime advances past it. This ensures fairness: tasks that have been underserved are prioritized, while tasks that have been overserved must wait.

Virtual deadline. Each task's virtual deadline determines its scheduling priority among eligible tasks:

vdeadline = eligible_vtime + (slice_ns / task_weight) * NICE_0_WEIGHT

The default slice is 750 µs (slice_ns = 750_000), configurable via sched_base_slice_ns (matching the Linux 6.6+ EEVDF default base slice). Lower-weight tasks get longer virtual deadlines (lower scheduling urgency); higher-weight tasks get shorter virtual deadlines (higher urgency). The task with the minimum vdeadline among all eligible tasks is selected by pick_next_task.
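The deadline formula reduces to a one-liner (illustrative sketch; the multiply is done before the divide, which is algebraically the same as the formula above but keeps integer precision):

```rust
const NICE_0_WEIGHT: u64 = 1024;

/// vdeadline = eligible_vtime + (slice_ns / task_weight) * NICE_0_WEIGHT,
/// rewritten as slice_ns * NICE_0_WEIGHT / task_weight for integer math.
fn vdeadline(eligible_vtime: u64, slice_ns: u64, task_weight: u64) -> u64 {
    eligible_vtime + slice_ns * NICE_0_WEIGHT / task_weight
}
```

A nice-0 task's deadline lands exactly one slice (750 µs) after its eligible time; a heavier (more negative nice) task's deadline lands much sooner, giving it higher urgency.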

Red-black tree organization. Each per-CPU run queue maintains two red-black trees, both keyed by vdeadline:

  • eligible_tree: Contains tasks where eligible_vtime <= min_vruntime. These tasks are ready to be scheduled. pick_next_task always selects the leftmost (minimum vdeadline) node from this tree.
  • ineligible_tree: Contains tasks where eligible_vtime > min_vruntime. These tasks have received more than their fair share and must wait.

When the current task runs and min_vruntime advances (it tracks the minimum vruntime among all runnable tasks), some tasks in ineligible_tree may become eligible. On each scheduler tick and on pick_next_task, the scheduler checks the leftmost node of ineligible_tree: if its eligible_vtime <= min_vruntime, it is migrated to eligible_tree. The check repeats until the leftmost remaining node is still ineligible, at which point the scan stops.
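The migration pass above can be sketched with std's BTreeMap standing in for the kernel's red-black trees and u32 task IDs standing in for task handles (both are illustrative stand-ins, not the kernel types):

```rust
use std::collections::BTreeMap;

/// Move every newly eligible task from the ineligible tree to the eligible tree.
fn migrate_newly_eligible(
    ineligible: &mut BTreeMap<u64, u32>, // keyed by eligible_vtime (stand-in)
    eligible: &mut BTreeMap<u64, u32>,   // keyed here by the same value for brevity
    min_vruntime: u64,
) {
    // Test only the leftmost node each iteration; once it is still ineligible,
    // every node to its right is too, so the loop stops.
    while let Some((&ev, _)) = ineligible.first_key_value() {
        if ev > min_vruntime {
            break;
        }
        let (ev, task) = ineligible.pop_first().unwrap();
        eligible.insert(ev, task); // the real tree re-keys by vdeadline
    }
}
```

The real implementation re-inserts migrated tasks keyed by their vdeadline; the sketch keeps the same key only to stay short.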

Eligibility filter in pick_next_task. Before selecting the task with the earliest virtual deadline, the EEVDF pick_next_task must apply an eligibility filter. A task is eligible if and only if its lag is non-negative:

lag(task) = task.weight × (rq.avg_vruntime - task.vruntime) / NICE_0_WEIGHT
eligible   = lag(task) ≥ 0

Only tasks in eligible_tree (those satisfying eligible_vtime ≤ min_vruntime) are considered for virtual-deadline selection. Tasks with negative lag (they have consumed more than their fair share) reside in ineligible_tree and are excluded from selection until min_vruntime advances past their eligible_vtime. If no eligible task exists (a transient condition: lag accounting may momentarily make all tasks ineligible due to numerical precision), pick_eevdf() returns curr — the currently-running task — rather than selecting from a tree node. Retaining the current task avoids a wasteful context switch and a scheduling decision on transiently stale state; the condition self-corrects on the next tick as min_vruntime advances.

Lag tracking. Lag measures how far a task's actual CPU service deviates from its ideal fair share:

lag = (avg_vruntime - vruntime) * task_weight / NICE_0_WEIGHT

where avg_vruntime is the weighted average virtual runtime of all runnable tasks on the run queue. Positive lag means the task is owed CPU time (has been underserved); negative lag means the task has received more than its share.

Lag is updated on every dequeue (when a task blocks, yields, or is preempted). To prevent starvation or unbounded credit accumulation, lag is clamped to ±slice_ns:

lag = clamp(lag, -slice_ns, slice_ns)

On enqueue (when a task wakes up), the saved lag from the previous dequeue is used to compute eligible_vtime, ensuring that a task that was owed CPU time before sleeping is prioritized when it wakes.
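The dequeue-time bookkeeping combines the lag definition with the clamp (a minimal sketch; the function name is illustrative):

```rust
const NICE_0_WEIGHT: i64 = 1024;
const SLICE_NS: i64 = 750_000;

/// Lag computed at dequeue time from the run queue's weighted average
/// vruntime, then clamped to ±slice_ns as described above.
fn dequeue_lag(avg_vruntime: i64, vruntime: i64, task_weight: i64) -> i64 {
    let lag = (avg_vruntime - vruntime) * task_weight / NICE_0_WEIGHT;
    lag.clamp(-SLICE_NS, SLICE_NS)
}
```

A task whose vruntime trails the average leaves with positive (owed) lag; a large deviation in either direction is cut to one slice, bounding both starvation debt and accumulated credit.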

Deferred dequeue (over-served sleep path). When a task sleeps while carrying negative lag — meaning it has consumed more CPU than its fair share — a naive immediate removal from the run queue would let it re-enter with a head-start on avg_vruntime computation. UmkaOS instead uses a deferred dequeue mechanism, matching the Linux 6.12 sched_delayed design:

  • On sleep, if lag < 0, the task is NOT immediately removed from its red-black tree position. The field sched_delayed is set to true and on_rq remains in its current tree state (Eligible or Ineligible).
  • While deferred, pick_eevdf() skips the task: a deferred task is never selected for execution.
  • While deferred, the task still contributes its weight to total_weight and to the two avg_vruntime accumulators (avg_vruntime_v, avg_vruntime_w). This ensures that over-served sleeping tasks continue to push avg_vruntime upward, naturally decaying their negative lag without requiring any explicit timer.
  • The lag clamp applied on sleep is widened to ±2 × slice_ns:

lag = clamp(lag, -2 * slice_ns, +2 * slice_ns)

The rationale for the two clamp bounds: the +2 × slice_ns bound prevents an under-served task from accumulating unbounded forward credit, while the -2 × slice_ns bound ensures an over-served task cannot remain deferred for longer than two scheduling slices before it is forcibly evicted from the tree at pick time.

Deferred removal is triggered by two events:

  1. Lag reaches zero at pick time: pick_eevdf() scans the tree and, for each deferred candidate, checks whether lag ≥ 0 has been reached (i.e., avg_vruntime has advanced past the task's vruntime). When true, the task is removed from the tree (transitioned to OnRqState::Off) without any wake-up.
  2. Task wakes up before lag decays: on the wake-up path, if sched_delayed == true, the task is first removed from its deferred tree position, then re-enqueued normally with its virtual start time recomputed from the current lag.

On wake-up from a deferred state, the virtual start time is restored as:

v_i = V_bar - (lag * NICE_0_WEIGHT) / weight (where V_bar = avg_vruntime)

This places the task at exactly its fair-share position in virtual time. If lag < 0 (still over-served at wake time), v_i > V_bar, so the task enters ineligible_tree and waits until min_vruntime catches up. If lag ≥ 0 (fully decayed by the time it wakes), v_i ≤ V_bar, so the task enters eligible_tree immediately.
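The wake-up restoration is one line of arithmetic (sketch; the scaling factor follows this section's weight-scaled lag definition):

```rust
const NICE_0_WEIGHT: i64 = 1024;

/// Restore the virtual start time on wake-up from a deferred sleep:
/// v_i = V_bar minus the lag converted back into virtual-time units.
fn wake_vruntime(avg_vruntime: i64, lag: i64, task_weight: i64) -> i64 {
    avg_vruntime - lag * NICE_0_WEIGHT / task_weight
}
```

With negative (still over-served) lag the task lands after V_bar and enters ineligible_tree; with zero lag it lands exactly at its fair-share position.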

The sched_delayed field is added to EevdfTask and OnRqState gains a Deferred variant to unambiguously represent the deferred state:

/// Whether and how a task appears on the run queue — extended for deferred dequeue.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum OnRqState {
    /// Task is sleeping and not physically present in any run queue tree.
    Off,
    /// Task is runnable and eligible for selection (eligible_vtime ≤ min_vruntime).
    /// Present in `eligible_tree`.
    Eligible,
    /// Task is runnable but ineligible (eligible_vtime > min_vruntime).
    /// Present in `ineligible_tree`, waiting for `min_vruntime` to advance.
    Ineligible,
    /// Task is sleeping but still physically present in `eligible_tree` or
    /// `ineligible_tree` because it had negative lag at sleep time
    /// (`sched_delayed == true`). Skipped by `pick_eevdf()`; still contributes
    /// weight to `avg_vruntime_v` and `avg_vruntime_w`. Transitions to `Off`
    /// when lag decays to zero or the task wakes and is re-enqueued.
    Deferred,
}

Bandwidth throttling interaction. When a cgroup's CPU bandwidth is exhausted (CBS budget depleted per Section 6.3), all tasks in the cgroup are dequeued from their respective per-CPU run queues (removed from both eligible_tree and ineligible_tree). Each task's vruntime and lag are preserved in the task struct.

On budget replenishment (when the CBS period timer fires and the cgroup receives a new quota), all previously throttled tasks are re-enqueued with their saved vruntime and lag intact. This ensures that bandwidth throttling does not cause fairness distortion — a task that was owed CPU time before throttling remains owed after throttling ends.
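The preserve-across-throttle behavior can be sketched with stand-in types (SavedState and the u32 task IDs are illustrative; the real kernel keeps these fields in the task struct rather than a side vector):

```rust
use std::collections::BTreeMap;

/// Per-task EEVDF state preserved across a throttle window (stand-in).
#[derive(Clone, Copy, PartialEq, Debug)]
struct SavedState { vruntime: u64, lag: i64 }

/// Dequeue every task of a throttled group, keeping vruntime and lag intact.
fn throttle(queue: &mut BTreeMap<u32, SavedState>) -> Vec<(u32, SavedState)> {
    let saved: Vec<_> = queue.iter().map(|(&id, &st)| (id, st)).collect();
    queue.clear();
    saved
}

/// Re-enqueue the saved tasks unchanged on budget replenishment.
fn unthrottle(queue: &mut BTreeMap<u32, SavedState>, saved: Vec<(u32, SavedState)>) {
    for (id, st) in saved {
        queue.insert(id, st);
    }
}
```

Because vruntime and lag round-trip unchanged, a task owed CPU time before the throttle is still owed exactly that much afterwards.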

Latency-nice. Tasks with latency_nice < 0 (latency-sensitive, e.g., interactive or audio) have their eligible_vtime shifted earlier, making them eligible sooner:

eligible_vtime -= (latency_weight - LATENCY_NICE_0_WEIGHT) * slice_ns / task_weight

where latency_weight is derived from a latency-nice priority table (analogous to sched_prio_to_weight but for the latency-nice range -20 to +19). A task with latency_nice = -20 becomes eligible significantly sooner than one with latency_nice = 0, reducing its scheduling latency. A task with latency_nice > 0 is shifted later (deprioritized for latency). This is the Linux 6.6+ latency-nice feature integrated into EEVDF.

Latency-nice does NOT affect CPU bandwidth — only scheduling latency. A latency-nice -20 task at nice 0 receives the same total CPU share as a latency-nice 0 task at nice 0; it simply gets scheduled sooner when it wakes up.
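The shift can be sketched as a signed quantity (assumptions: the latency-weight table mirrors sched_prio_to_weight with LATENCY_NICE_0_WEIGHT = 1024, and the sign convention makes a larger latency weight shift eligibility earlier, matching the behavior described above):

```rust
const LATENCY_NICE_0_WEIGHT: i64 = 1024; // assumed scale, mirroring NICE_0_WEIGHT
const SLICE_NS: i64 = 750_000;

/// Signed amount subtracted from eligible_vtime. Positive result = earlier
/// eligibility (latency-sensitive); negative = later (latency-tolerant).
fn latency_shift(latency_weight: i64, task_weight: i64) -> i64 {
    (latency_weight - LATENCY_NICE_0_WEIGHT) * SLICE_NS / task_weight
}
```

The shift moves only the eligible time, never the vruntime accounting, which is why total CPU share is unaffected.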

Data structures.

/// Lock level constant for per-CPU run queue locks.
///
/// Placed above the task lock level (1) and below the priority-inheritance
/// lock level (3) in the system-wide lock hierarchy defined in Section 3.1.5.
/// Level 2 is shared by all run queue locks regardless of which CPU they protect.
///
/// # Same-level ordering
///
/// Because all RunQueue locks share level 2, the type system alone cannot prevent
/// an ABBA deadlock between two run queues. The lock_two_runqueues() function
/// closes this gap: it is the only function permitted to hold two run queue locks
/// simultaneously, and it always acquires them in CPU-ID order.
pub const RQ_LOCK_LEVEL: u8 = 2;

/// Per-CPU run queue — the top-level schedulable entity for one logical CPU.
///
/// RunQueue is the owner of the SpinLock that protects EevdfRunQueue
/// and the associated RT/deadline queues for one CPU. Callers that need to lock
/// two run queues at once must use lock_two_runqueues() — direct chained
/// calls to lock() on two different RunQueues is a compile-time error when
/// the lock-level type system detects two concurrent level-2 acquisitions.
pub struct RunQueue {
    /// Identity of the CPU that owns this run queue.
    /// Used by lock_two_runqueues() to establish a canonical acquisition order.
    pub cpu_id: CpuId,

    /// Typed spinlock carrying the EEVDF and RT/deadline state.
    /// The level-2 type parameter participates in the compile-time lock hierarchy
    /// (Section 3.1.5); it prevents acquiring this lock while already holding a
    /// level-2 lock except through lock_two_runqueues().
    lock: SpinLock<RunQueueData, RQ_LOCK_LEVEL>,
}

/// RT priority valid range for SCHED_FIFO and SCHED_RR: 1–99.
/// Priority 0 is EINVAL for real-time policies (reserved for non-RT policies
/// per POSIX and Linux ABI). sched_get_priority_min(SCHED_FIFO) == 1.
/// Slot 0 in the priority bitmap is allocated but always empty.
pub const RT_PRIORITY_MIN: u8 = 1;
pub const RT_PRIORITY_MAX: u8 = 99;
pub const RT_PRIORITY_LEVELS: usize = 100; // Indexed 0–99; slot 0 is unused (priority 0 = EINVAL).

// Parameter validation for sched_setscheduler / sched_setattr:
//
//     if policy == SCHED_FIFO || policy == SCHED_RR {
//         if param.sched_priority == 0 { return EINVAL; }
//         if param.sched_priority > RT_PRIORITY_MAX { return EINVAL; }
//     }
//
// Equivalently: valid range is RT_PRIORITY_MIN..=RT_PRIORITY_MAX (1–99).
// sched_get_priority_min(SCHED_FIFO) and sched_get_priority_min(SCHED_RR)
// both return 1. This matches the Linux ABI and POSIX SCHED_FIFO/SCHED_RR
// semantics: priority 0 is defined only for non-RT policies (SCHED_OTHER,
// SCHED_BATCH, SCHED_IDLE) where it is the only legal value.
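The validation rule above can be made runnable in a few lines (a sketch; the enum and function name are illustrative stand-ins for the syscall-layer types):

```rust
/// EINVAL errno value (Linux ABI).
const EINVAL: i32 = 22;
const RT_PRIORITY_MAX: u32 = 99;

/// Scheduling policy stand-in for the validation rule quoted above.
enum Policy { Fifo, Rr, Other }

fn validate_priority(policy: Policy, sched_priority: u32) -> Result<(), i32> {
    match policy {
        // RT policies: priority must be in 1..=99.
        Policy::Fifo | Policy::Rr => {
            if sched_priority == 0 || sched_priority > RT_PRIORITY_MAX {
                return Err(EINVAL);
            }
            Ok(())
        }
        // Non-RT policies: 0 is the only legal value.
        Policy::Other => {
            if sched_priority != 0 {
                return Err(EINVAL);
            }
            Ok(())
        }
    }
}
```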

/// Absolute deadline timestamp in nanoseconds (monotonic clock).
///
/// Used as the key type for the DlRunQueue ordered map. Two tasks with the
/// same absolute deadline share equal priority under EDF; BTreeMap will
/// store them with distinct keys only if their deadline values differ. In
/// practice, two tasks with identical absolute deadlines are exceedingly rare
/// and the implementation may break ties by task ID when they occur.
pub type AbsDeadlineNs = u64;

/// Fixed-point scale for deadline bandwidth accounting.
///
/// DL_BW_SCALE = 1 << 20 (exactly 1,048,576). Deadline bandwidth
/// fractions are stored as runtime_ns * DL_BW_SCALE / period_ns, giving
/// 20 bits of sub-unit precision. A task consuming 100% of the CPU has
/// bw = DL_BW_SCALE. The sum of all per-task bandwidths must not exceed
/// capacity_ns * DL_BW_SCALE / period_ns.
///
/// This matches Linux's BW_SHIFT = 20 / BW_UNIT = 1 << BW_SHIFT
/// convention so that bandwidth values computed from SCHED_DEADLINE
/// parameters are directly comparable.
pub const DL_BW_SCALE: u64 = 1 << 20;
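The fraction encoding is direct to compute (minimal sketch of the stated formula):

```rust
const DL_BW_SCALE: u64 = 1 << 20;

/// Fixed-point bandwidth fraction of one SCHED_DEADLINE reservation:
/// runtime_ns * DL_BW_SCALE / period_ns.
fn dl_bw(runtime_ns: u64, period_ns: u64) -> u64 {
    runtime_ns * DL_BW_SCALE / period_ns
}
```

A reservation of 500 µs every 1 ms encodes as exactly half of DL_BW_SCALE, and full utilization encodes as DL_BW_SCALE itself.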

/// Per-CPU real-time (SCHED_FIFO / SCHED_RR) run queue.
///
/// UmkaOS improves on Linux's struct rt_rq in two ways:
///
/// 1. Per-queue CBS bandwidth accounting instead of a single global
///    sched_rt_runtime_us knob. Each RtRunQueue tracks its own consumed
///    runtime and replenishment period, so individual CPUs can be throttled
///    independently without a global lock. This is particularly valuable on
///    heterogeneous platforms where P-core and E-core CPUs have different
///    RT capacity budgets.
///
/// 2. Typed priority bitmap — the 100-bit occupancy map uses two u64
///    words (128 bits allocated, top 28 unused), with the highest set bit
///    indicating the next priority level to schedule. A leading-zeros scan
///    (locating the most-significant set bit) finds the highest occupied
///    queue in a single instruction on all supported architectures.
///
/// Linux reference: struct rt_rq in kernel/sched/sched.h (Linux 6.x).
/// Key differences: no rt_nr_boosted (PI boost is tracked per-task in
/// UmkaOS, not per-queue), no pushable_tasks plist (UmkaOS uses a separate
/// per-CPU migration candidate set managed by the load balancer), and no
/// embedded group-scheduling pointers (tg, rq).
pub struct RtRunQueue {
    /// Two-word occupancy bitmap for priority levels 0–99.
    ///
    /// Bit p is set when priority queue p is non-empty.
    /// Word 0 covers priorities 0–63; word 1 covers priorities 64–99
    /// (bits 36–63 of word 1, which would map to priorities 100–127,
    /// are always zero).
    /// The highest-priority non-empty queue is found by scanning for the
    /// most-significant set bit across both words (word 1 first).
    pub bitmap: [u64; 2],

/// Per-priority intrusive task lists.
///
/// `queues[p]` holds all runnable tasks at RT priority `p` in
/// FIFO order. SCHED_RR tasks are rotated to the tail of their
/// queue on time-slice expiry. SCHED_FIFO tasks are never rotated.
/// Indexed 0 (lowest RT priority) through 99 (highest RT priority),
/// matching Linux's `SCHED_FIFO`/`SCHED_RR` priority numbering.
pub queues: [IntrusiveList<Task>; RT_PRIORITY_LEVELS],

/// Number of runnable RT tasks on this CPU.
///
/// Incremented on enqueue, decremented on dequeue. Does not include
/// tasks that are throttled (removed from all queues). The value
/// equals the number of set bits in `bitmap` summed over all queues.
pub nr_running: u32,

/// Accumulated runtime consumed during the current throttle period, in
/// nanoseconds.
///
/// Increased by the scheduler tick handler on every tick that an RT task
/// is running. When `rt_time_ns >= rt_runtime_ns` the queue is throttled:
/// all RT tasks are dequeued and `throttled` is set. This is a per-CPU
/// improvement over Linux's `sched_rt_runtime_us` global knob.
pub rt_time_ns: u64,

/// Maximum RT runtime allowed per period, in nanoseconds.
///
/// Defaults to `950_000_000` (950 ms per second — reserving 5% of the CPU
/// for non-RT work, matching Linux's default `sched_rt_runtime_us`).
/// Configurable at runtime via `sysctl umka.sched.rt_runtime_ns`.
/// Set to `u64::MAX` to disable throttling (equivalent to
/// `sched_rt_runtime_us = -1` in Linux).
pub rt_runtime_ns: u64,

/// `true` when this queue is currently throttled due to bandwidth
/// exhaustion (`rt_time_ns >= rt_runtime_ns`).
///
/// While throttled, no RT task from this queue may be selected by
/// `pick_next_task`. The period replenishment timer resets this flag
/// and re-enqueues all previously throttled tasks.
pub throttled: bool,

/// Monotonic timestamp (nanoseconds) of the start of the current
/// throttle accounting period.
///
/// The period length is 1 second (`1_000_000_000 ns`), matching Linux's
/// `sched_rt_period_us` default of 1 s. At the end of each period,
/// `rt_time_ns` is reset to zero, `throttled` is cleared, and
/// `period_start_ns` is advanced by the period length.
pub period_start_ns: u64,

}
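The two-word bitmap scan described above can be sketched as follows (the function name is illustrative; `leading_zeros` compiles to the single-instruction scan on the supported architectures):

```rust
/// Find the highest-priority non-empty queue from the two-word bitmap.
/// Word 0 covers priorities 0-63, word 1 covers 64-99. Word 1 is scanned
/// first because it holds the higher priorities; the most-significant set
/// bit is located via a leading-zeros count.
fn highest_rt_priority(bitmap: [u64; 2]) -> Option<u32> {
    if bitmap[1] != 0 {
        return Some(64 + 63 - bitmap[1].leading_zeros());
    }
    if bitmap[0] != 0 {
        return Some(63 - bitmap[0].leading_zeros());
    }
    None
}
```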

/// Per-CPU deadline (SCHED_DEADLINE) run queue.
///
/// UmkaOS improves on Linux's struct dl_rq in two ways:
///
/// 1. BTreeMap instead of rb_root_cached — Rust's BTreeMap provides
///    cleaner EDF ordering, safe iteration, and automatic rebalancing without
///    hand-written augmentation. The map key is the absolute deadline (in ns),
///    giving O(log n) enqueue/dequeue and O(log n) earliest-deadline lookup via
///    BTreeMap::first_key_value(). The separately cached earliest_deadline_ns: u64
///    field provides O(1) preemption checks without tree traversal — updated on
///    every enqueue/dequeue.
///
/// 2. Explicit bandwidth tracking — total_bw and capacity_ns are
///    first-class fields rather than derived from per-task dl_bw entries.
///    This makes admission control O(1): check total_bw + new_task_bw <=
///    DL_BW_SCALE before accepting a new SCHED_DEADLINE task.
///
/// Linux reference: struct dl_rq in kernel/sched/sched.h (Linux 6.x).
/// Key differences: BTreeMap replaces rb_root_cached, explicit
/// capacity_ns replaces per-rq dl_bw struct, and earliest_deadline_ns
/// is a cached O(1) copy of the earliest deadline, providing O(1) preemption
/// checks without a tree traversal (since BTreeMap::first_key_value() is O(log n)).
pub struct DlRunQueue {
    /// EDF-ordered map from absolute deadline (ns) to task handle.
    ///
    /// Invariant: every task in tasks is runnable (not sleeping) and has
    /// been admitted through the bandwidth test. The map key equals the
    /// task's sched_dl_entity.deadline field at the time of enqueue;
    /// if a task's deadline is updated (e.g., on new job arrival), it is
    /// removed and re-inserted with the new key.
    pub tasks: BTreeMap<AbsDeadlineNs, TaskHandle>,

/// Sum of fixed-point bandwidth fractions for all admitted tasks.
///
/// Each task contributes `runtime_ns * DL_BW_SCALE / period_ns` to
/// `total_bw`. Admission control accepts a new task only if
/// `total_bw + new_bw <= capacity_ns * DL_BW_SCALE / period_ns`.
/// Maintained incrementally: increased on task admission, decreased
/// on task departure (sleep, termination, or policy change away from
/// SCHED_DEADLINE).
pub total_bw: u64,

/// CPU capacity in nanoseconds per period (default: 1_000_000_000 for a
/// fully available CPU over a 1-second window).
///
/// On heterogeneous CPUs, `capacity_ns` is scaled by the CPU's capacity
/// factor (from `CpuCapacity.capacity / 1024`) so that a 512-capacity
/// efficiency core exposes only 512 ms of deadline capacity per second.
/// This prevents over-admission on low-capacity cores.
pub capacity_ns: u64,

/// Number of runnable deadline tasks on this CPU.
///
/// Equals `tasks.len()`. Maintained as a separate `u32` to avoid the
/// overhead of `BTreeMap::len()` (which is O(1) in Rust's `BTreeMap` but
/// requires an extra load; keeping `nr_running` in the hot struct avoids
/// pointer chasing to the map's length field).
pub nr_running: u32,

/// Cached absolute deadline of the earliest-deadline task, or
/// `u64::MAX` when the queue is empty.
///
/// Mirrors `tasks.first_key_value().map(|(k, _)| *k).unwrap_or(u64::MAX)`.
/// Updated on every enqueue and dequeue. Used by `pick_next_task` and
/// the preemption check (`resched_curr`) to compare against the currently
/// running task's deadline without a map traversal.
pub earliest_deadline_ns: u64,

}
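The O(1) admission test described in the struct comment can be sketched as follows (illustrative; the 1-second window constant and function name are assumptions matching this section's defaults):

```rust
const DL_BW_SCALE: u64 = 1 << 20;
const PERIOD_NS: u64 = 1_000_000_000; // 1-second accounting window

/// O(1) admission test: accept the reservation only if the running bandwidth
/// sum stays within the capacity-scaled limit; update total_bw on success.
fn try_admit(total_bw: &mut u64, runtime_ns: u64, period_ns: u64, capacity_ns: u64) -> bool {
    let new_bw = runtime_ns * DL_BW_SCALE / period_ns;
    let limit = capacity_ns * DL_BW_SCALE / PERIOD_NS;
    if *total_bw + new_bw <= limit {
        *total_bw += new_bw;
        true
    } else {
        false
    }
}
```

On a full-capacity CPU two 60%-utilization reservations cannot both be admitted, and a 500 ms-capacity efficiency core rejects the 60% reservation outright, which is exactly the over-admission protection the capacity scaling is for.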

/// Data protected by RunQueue.lock.
pub struct RunQueueData {
    /// EEVDF scheduling queues (eligible and ineligible trees).
    pub eevdf: EevdfRunQueue,
    /// RT FIFO/RR run queue for this CPU.
    pub rt: RtRunQueue,
    /// CBS deadline server list for this CPU.
    pub dl: DlRunQueue,
    /// The per-CPU idle task. Always runnable; never enqueued in any
    /// scheduling class. Returned by pick_next_task() when all three
    /// class queues are empty. Statically allocated at boot — one per CPU.
    pub idle_task: Arc<Task>,
}

/// Lock two run queues in a deadlock-free order (lower CPU ID first).
///
/// This is the only function that may hold two run queue locks simultaneously.
/// The load balancer (work stealing) must call this function instead of acquiring
/// RunQueue.lock directly while already holding another run queue's lock.
///
/// # Deadlock prevention — compile-time enforced
///
/// RunQueue.lock is a SpinLock<RunQueueData, LEVEL=2>. The SpinLock type
/// parameter prevents a caller that already holds a level-2 lock from acquiring
/// a second one without going through this function. Calling lock_two_runqueues
/// is therefore the only legal path to holding two run queue locks simultaneously,
/// and it always acquires them in CPU-ID order — eliminating ABBA deadlock at
/// compile time rather than relying on code review to uphold an ordering convention.
///
/// The old Linux approach (documented rule "always lock min CPU first") has
/// produced real deadlocks in distribution kernels. UmkaOS's type-level enforcement
/// eliminates this class of bug entirely.
pub fn lock_two_runqueues<'a>(
    rq_a: &'a RunQueue,
    rq_b: &'a RunQueue,
) -> (
    SpinGuard<'a, RunQueueData, RQ_LOCK_LEVEL>,
    SpinGuard<'a, RunQueueData, RQ_LOCK_LEVEL>,
) {
    // Always acquire the run queue whose CPU has the lower numeric ID first.
    // Every pair of CPUs has a unique total order under this relation, so no
    // two threads can form an acquisition cycle.
    if rq_a.cpu_id < rq_b.cpu_id {
        let g_a = rq_a.lock.lock();
        let g_b = rq_b.lock.lock();
        (g_a, g_b)
    } else {
        let g_b = rq_b.lock.lock();
        let g_a = rq_a.lock.lock();
        (g_a, g_b)
    }
}

/// A reference to a runnable task held by the EEVDF scheduler.
/// Ownership: the scheduler holds one Arc<Task> per runnable task.
/// Using this alias makes ownership semantics explicit in data structure definitions.
pub type TaskHandle = Arc<Task>;

/// A red-black tree augmented with a per-subtree minimum eligible virtual time field.
///
/// The augmentation enables O(log n) eligible-task queries without a full tree scan:
///
/// - Each node caches min_eligible_vtime: the minimum eligible_vtime of ALL nodes
///   in its subtree (including itself).
/// - On rotation (left or right): call recompute_min_eligible_vtime() on the
///   affected nodes bottom-up — first the child, then the parent.
/// - On insertion or key update: walk from the modified node to the root, calling
///   recompute_min_eligible_vtime() at each ancestor.
///
/// Augmentation invariant:
///
/// ```text
/// node.min_eligible_vtime = min(
///     node.eligible_vtime,
///     node.left .map_or(u64::MAX, |l| l.min_eligible_vtime),
///     node.right.map_or(u64::MAX, |r| r.min_eligible_vtime),
/// )
/// ```
///
/// pick_eevdf() uses this to prune ineligible subtrees:
/// if subtree.min_eligible_vtime > min_vruntime, the entire subtree is skipped.
pub struct AugmentedRBTree<K, V> {
    root: Option<Box<RBNode<K, V>>>,
    /// Total node count. O(1).
    pub len: usize,
}

pub struct RBNode<K, V> {
    pub key: K,
    pub value: V,
    color: RBColor,
    left: Option<Box<RBNode<K, V>>>,
    right: Option<Box<RBNode<K, V>>>,
    /// Augmented field: minimum eligible_vtime in this subtree.
    /// For AugmentedRBTree<u64, EevdfNode>, updated on every structural change.
    pub min_subtree_key: K,
}

/// Non-augmented red-black tree (standard ordered map).
pub struct RBTree<K, V> {
    root: Option<Box<RBNode<K, V>>>,
    pub len: usize,
}

#[derive(Clone, Copy, PartialEq)]
enum RBColor { Red, Black }

/// Per-CPU EEVDF run queue state.
pub struct EevdfRunQueue {
    /// Red-black tree of eligible tasks, keyed by vdeadline.
    /// Each node caches min_deadline (see EevdfNode) so that pick_eevdf()
    /// can locate the eligible entity with the earliest deadline in O(log n)
    /// via the augmented-tree walk described in Section 6.1.2.1.
    /// pick_next_task uses pick_eevdf() rather than the raw leftmost node.
    eligible_tree: AugmentedRBTree<u64, EevdfNode>,

    /// Red-black tree of ineligible tasks, keyed by vruntime.
    /// Tasks migrate to eligible_tree when min_vruntime advances past
    /// their eligible_vtime. Deferred tasks (OnRqState::Deferred) may
    /// reside in either tree; they are skipped by pick_eevdf().
    ineligible_tree: RBTree<u64, TaskHandle>,

    /// Monotonically increasing minimum virtual runtime across all
    /// runnable tasks. Equal to the vruntime of the leftmost node in
    /// eligible_tree (or ineligible_tree when eligible_tree is empty).
    /// Used as the eligibility threshold and as V_floor in the
    /// avg_vruntime two-accumulator formula.
    min_vruntime: u64,

    /// First avg_vruntime accumulator: V_hat = Σ(w_i × (v_i − min_vruntime)).
    /// Maintained as a running signed sum updated on every enqueue and dequeue
    /// (including deferred enqueue/dequeue). Never divided — the division-free
    /// eligibility test uses this value directly. See entity_eligible().
    avg_vruntime_v: i64,

    /// Second avg_vruntime accumulator: W = Σ(w_i) (sum of all entity weights).
    /// Updated in lockstep with avg_vruntime_v. Deferred tasks contribute
    /// their weight until they are removed from the tree.
    avg_vruntime_w: i64,

    /// Sum of weights of all tasks on this run queue (non-deferred tasks only).
    /// Distinct from avg_vruntime_w which includes deferred tasks.
    /// Updated on enqueue/dequeue. Used for lag computation.
    total_weight: i64,

    /// Number of runnable (non-deferred) tasks on this run queue.
    /// AtomicU32 so that work-stealing CPUs may read this field lock-free
    /// using Relaxed ordering — an approximate count is sufficient for
    /// steal candidate selection. Updated with Relaxed on enqueue and dequeue.
    pub task_count: AtomicU32,

    /// Timer for CBS bandwidth replenishment checks.
    bandwidth_timer: HrTimer,
}

/// A node in the augmented eligible RB-tree.
///
/// Each node caches the minimum virtual deadline reachable from this subtree:
///
/// ```text
/// min_deadline = min(self.deadline,
///                    left.min_deadline if left exists,
///                    right.min_deadline if right exists)
/// ```
///
/// The min_deadline field is maintained by the RB-tree rebalancing hooks:
/// every rotation or color change that restructures the tree must call
/// recompute_min_deadline() on each affected node, bottom-up. This is
/// the standard augmented-RB-tree update protocol.
///
/// pick_eevdf() exploits min_deadline to prune entire subtrees during
/// its eligible-minimum-deadline walk, achieving O(log n) selection even
/// when eligibility filters out many candidates.
pub struct EevdfNode {
    /// Handle to the task this node represents.
    task: TaskHandle,
    /// The virtual deadline of this specific task.
    deadline: u64,
    /// Cached minimum virtual deadline in the subtree rooted at this node.
    /// Must equal min(deadline, left.min_deadline, right.min_deadline).
    /// Invariant maintained by recompute_min_deadline() on every tree mutation.
    min_deadline: u64,
}

/// EEVDF scheduling state embedded in each Task struct.
pub struct EevdfTask {
    /// Virtual runtime: accumulated CPU consumption in virtual time units.
    /// Scales inversely with task weight (higher weight = slower accumulation).
    vruntime: u64,

    /// Virtual deadline: eligible_vtime + scaled slice. The task with the
    /// minimum vdeadline among eligible tasks is scheduled next.
    /// This is also the RB-tree key when the task is in eligible_tree.
    vdeadline: u64,

    /// Virtual eligible time: vruntime adjusted by lag. Compared against
    /// min_vruntime to determine eligibility.
    eligible_vtime: u64,

    /// Lag: deviation from ideal fair share in virtual-time units, scaled by weight.
    /// Positive = task has been underserved (owed CPU). Negative = task over-served.
    /// Clamped to ±2×slice_ns on sleep (deferred path) or ±slice_ns on immediate
    /// dequeue. Preserved across sleep/wake cycles.
    lag: i64,

    /// Time slice in nanoseconds. Default 750_000 (750 µs). Configurable per
    /// task via the sched_base_slice_ns sysctl.
    slice_ns: u64,

    /// Run queue membership state. Uses the four-variant enum to correctly
    /// capture the deferred-dequeue state (sleeping but still in the tree).
    on_rq: OnRqState,

    /// True when the task has gone to sleep with negative lag and is still
    /// physically resident in eligible_tree or ineligible_tree pending
    /// lag decay. While sched_delayed is set, pick_eevdf() skips this
    /// task, but the task still contributes its weight to the avg_vruntime
    /// accumulators. Cleared on wake-up (before re-enqueue) or when lag
    /// decays to zero at pick time and the task is physically removed.
    sched_delayed: bool,
}

// ---------------------------------------------------------------------------
// Augmented-tree pick algorithm (Gap 2.4)
// ---------------------------------------------------------------------------

/// Select the eligible entity with the earliest virtual deadline in O(log n).
///
/// The eligible_tree is an augmented min-RB-tree keyed by vdeadline. Each
/// node caches min_deadline — the smallest vdeadline reachable in its
/// subtree. This lets the walk prune entire subtrees that cannot contain a
/// better candidate.
///
/// # Algorithm
///
/// ```text
/// fn search(node: Option<&EevdfNode>, rq: &EevdfRunQueue,
///           best: Option<&EevdfNode>) -> Option<&EevdfNode> {
///     let Some(node) = node else { return best };
///
///     // Prune: no node in this subtree can beat the current best,
///     // because every deadline in it is >= node.min_deadline.
///     if let Some(b) = best {
///         if node.min_deadline >= b.deadline { return best; }
///     }
///
///     let mut best = search(node.left, rq, best);
///
///     // Skip deferred nodes — they are sleeping.
///     if !task(node).sched_delayed && entity_eligible(rq, task(node)) {
///         if best.is_none() || node.deadline < best.unwrap().deadline {
///             best = Some(node);
///         }
///     }
///
///     search(node.right, rq, best)
/// }
///
/// fn pick_eevdf(rq: &EevdfRunQueue) -> Option<TaskHandle> {
///     search(rq.eligible_tree.root(), rq, None).map(|n| n.task.clone())
/// }
/// ```
///
/// # Fallback when eligible_tree is empty
///
/// If no eligible task exists (a transient condition: lag accounting may momentarily
/// make all tasks ineligible due to numerical precision), pick_eevdf() returns
/// curr — the currently-running task — rather than selecting from a tree node.
/// Retaining the current task avoids a wasteful context switch and a scheduling
/// decision on transiently stale state; the condition self-corrects on the next
/// tick as min_vruntime advances.
///
/// # Invariant maintenance
///
/// Every tree mutation (insert, delete, rotation) must call
/// recompute_min_deadline() on affected nodes bottom-up. Violating this
/// causes pick_eevdf() to return a suboptimal or incorrect result.
pub fn pick_eevdf(rq: &EevdfRunQueue) -> Option<TaskHandle> { /* in sched/eevdf.rs */ }

````rust
// ---------------------------------------------------------------------------
// avg_vruntime two-accumulator maintenance (Gap 2.12)
// ---------------------------------------------------------------------------

/// Update the `avg_vruntime` accumulators when a task is enqueued or dequeued.
///
/// `avg_vruntime` is maintained without division using two running sums:
///
/// ```text
/// V_floor = rq.min_vruntime            (vruntime of the leftmost RB-tree node)
/// V_hat   = Σ(w_i × (v_i − V_floor))   = rq.avg_vruntime_v
/// W       = Σ(w_i)                     = rq.avg_vruntime_w
///
/// avg_vruntime = V_floor + V_hat / W   (conceptual — never computed with division)
/// ```
///
/// On enqueue (task entering the tree, including deferred re-enqueue):
/// ```text
/// rq.avg_vruntime_v += (task.vruntime as i64 - rq.min_vruntime as i64) * task.weight as i64
/// rq.avg_vruntime_w += task.weight as i64
/// ```
///
/// On dequeue (task leaving the tree, including deferred removal at lag=0):
/// ```text
/// rq.avg_vruntime_v -= (task.vruntime as i64 - rq.min_vruntime as i64) * task.weight as i64
/// rq.avg_vruntime_w -= task.weight as i64
/// ```
///
/// On `min_vruntime` advance by Δ (called after every pick, when the
/// leftmost node moves):
/// ```text
/// rq.avg_vruntime_v -= Δ as i64 * rq.avg_vruntime_w
/// ```
///
/// This correction keeps offsets relative to the moving floor, preventing i64
/// overflow over long run-queue lifetimes.
///
/// Deferred tasks remain in the accumulators until their lag decays to zero
/// (`sched_delayed` removal), ensuring over-served sleeping tasks continue to
/// push `avg_vruntime` upward and accelerate their own lag decay.
pub fn update_avg_vruntime(rq: &mut EevdfRunQueue, task: &EevdfTask, enqueue: bool) { /* implemented in sched/eevdf.rs */ }
````

````rust
/// Division-free O(1) eligibility check.
///
/// A task is eligible when `se.vruntime ≤ avg_vruntime`, i.e.:
///
/// ```text
/// se.vruntime ≤ V_floor + V_hat / W
/// ⟺ (se.vruntime − V_floor) × W ≤ V_hat
/// ⟺ rq.avg_vruntime_w × (se.vruntime − rq.min_vruntime) ≤ rq.avg_vruntime_v
/// ```
///
/// No division is performed. The left side can be negative (eligible) or
/// positive (ineligible).
///
/// Precondition: `rq.avg_vruntime_w > 0` (at least one task on the run queue).
pub fn entity_eligible(rq: &EevdfRunQueue, se: &EevdfTask) -> bool {
    let vlag = se.vruntime as i64 - rq.min_vruntime as i64;
    rq.avg_vruntime_w * vlag <= rq.avg_vruntime_v
}
````
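The two-accumulator bookkeeping and the division-free predicate can be exercised with a small self-contained model (the `Rq` struct below is an illustrative stand-in, not the kernel's `EevdfRunQueue`):

```rust
// Illustrative model of the avg_vruntime accumulators (not the kernel types).
struct Rq {
    min_vruntime: u64,
    avg_vruntime_v: i64, // Σ w_i × (v_i − min_vruntime)
    avg_vruntime_w: i64, // Σ w_i
}

impl Rq {
    fn enqueue(&mut self, vruntime: u64, weight: u64) {
        self.avg_vruntime_v += (vruntime as i64 - self.min_vruntime as i64) * weight as i64;
        self.avg_vruntime_w += weight as i64;
    }

    /// Division-free: (v − V_floor) × W ≤ V_hat  ⟺  v ≤ V_floor + V_hat / W.
    fn eligible(&self, vruntime: u64) -> bool {
        let vlag = vruntime as i64 - self.min_vruntime as i64;
        self.avg_vruntime_w * vlag <= self.avg_vruntime_v
    }
}

fn main() {
    let mut rq = Rq { min_vruntime: 1000, avg_vruntime_v: 0, avg_vruntime_w: 0 };
    rq.enqueue(1000, 1024); // nice-0 task at the floor
    rq.enqueue(1400, 1024); // nice-0 task 400 ns ahead of the floor
    // Weighted average vruntime = 1000 + (1024×0 + 1024×400) / 2048 = 1200.
    assert!(rq.eligible(1000));  // behind the average: eligible
    assert!(rq.eligible(1200));  // exactly at the average: eligible
    assert!(!rq.eligible(1400)); // ahead of the average: ineligible
    println!("ok");
}
```

With two equal-weight tasks at vruntimes 1000 and 1400 the average is 1200, and the predicate accepts exactly the tasks at or behind that average without ever dividing.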


### 6.1.3 Key Properties

- **Preemptible locks by default**: Mutexes and rwlocks are always sleeping locks with
  priority inheritance. Under `PreemptionModel::Realtime` ([Section 7.2.2](07-process.md#722-design-bounded-latency-paths)), spinlocks
  also become sleeping locks (RT-safe). Under `Voluntary` and `Full` preemption modes,
  `SpinLock` is a true spinlock that disables preemption for its critical section —
  but all spinlock-protected critical sections are bounded and O(1) in duration.
  Per-CPU data is protected by short IRQ-disabling guards (`PerCpuMutGuard`) that hold
  for bounded durations only — never across blocking operations. There are no
  unbounded preemption-disabled regions.
- **NUMA-aware load balancing**: The load balancer models migration cost (cache
  invalidation, memory latency) and only migrates tasks when the imbalance exceeds the
  migration cost.
- **Per-CPU run queues**: No global run queue lock. Each CPU manages its own queues
  independently. Each per-CPU run queue is protected by a per-CPU spinlock (`rq->lock`).
- **Work stealing**: Idle CPUs steal tasks from busy CPUs at low frequency (~4ms interval)
  to avoid thundering-herd effects. The work stealing algorithm is specified below.

  **Target CPU selection.** When a CPU goes idle and its local run queue is empty, it
  initiates a steal attempt. The target CPU is selected as follows:

  1. **Same-NUMA-node preference**: Scan CPUs within the same NUMA node first. Cross-node
     steals incur higher migration cost (remote memory latency, cache invalidation of NUMA-local
     pages) and are only attempted if no same-node candidate has stealable work.
  2. **Highest load first**: Among candidate CPUs, prefer the one with the highest run queue
     load (measured as `rq.eevdf.task_count`). This maximizes the probability that the
     target can spare a task without becoming underloaded itself.
  3. **Cache topology tiebreak**: When multiple candidates have equal load, prefer the CPU
     that shares the closest cache level (L2 > L3 > cross-package). Tasks migrated within a
     shared cache domain retain warm cache lines, reducing post-migration stall cycles.
  4. **Cross-node fallback**: If no same-node CPU has more than one runnable task, scan
     remote NUMA nodes in distance order (nearest first, using SLIT/SRAT distances from
     firmware). The migration cost threshold is higher for cross-node steals — the load
     imbalance must exceed `NUMA_MIGRATION_THRESHOLD` (default: 2 tasks) to justify the
     cross-node penalty.

  **Lock-free load observation**: Each `EevdfRunQueue` exposes:
  ```rust
  pub task_count: AtomicU32,  // incremented on enqueue, decremented on dequeue
  ```
  Ordering: `Relaxed` — an exact count is not required; approximate load is sufficient
  for steal decisions. The stealing CPU reads `task_count` atomically without acquiring
  the target runqueue's lock. This gives a snapshot that may be 1-2 operations stale,
  which is acceptable: the goal is finding a CPU with available work, not exact balance.

  **False positive handling**: If the stealing CPU reads non-zero `task_count` but finds
  no stealable task after acquiring the target lock (due to intervening dequeue), it
  counts as one steal attempt.

  **Steal attempt limit**: After `STEAL_ATTEMPT_LIMIT = 4` failed candidates (same-NUMA
  probes first, cross-NUMA second), the idle CPU enters halted state via
  `cpu::halt()` and awaits the next IPI or timer tick.
  ```rust
  pub const STEAL_ATTEMPT_LIMIT: usize = 4;
  ```

  **Task selection.** From the target CPU's EEVDF eligible tree, steal the task with the
  **largest vdeadline** (rightmost node in the tree). Rationale: the task with the largest
  vdeadline is the one furthest from being scheduled next on the source CPU — it has the
  most remaining virtual runtime before its next turn. Stealing it causes the least
  disruption to the source CPU's fairness invariants and avoids stealing a task that was
  about to run (which would waste the source CPU's scheduling decision). The stolen task's
  `eligible_vtime` and `lag` are preserved; its `vruntime` is adjusted relative to the
  destination run queue's `min_vruntime` to maintain fairness on the new CPU.

  RT and deadline tasks are **not** stolen by the normal work-stealing path. RT task
  migration uses a separate push/pull mechanism triggered by RT priority changes (see
  [Section 7.2](07-process.md#72-real-time-guarantees)).

  **Lock ordering.** The work stealer must hold two run queue locks simultaneously
  (source and destination). Deadlock prevention is enforced at compile time via
  `lock_two_runqueues()` (see below). However, the idle CPU also uses a **trylock with
  exponential backoff** strategy to avoid priority inversion:

  1. The idle CPU acquires its own run queue lock first (guaranteed success — it is local).
  2. It calls `trylock()` on the target CPU's run queue lock. If the lock is contended
     (the target CPU is in a scheduling critical section), the steal attempt is abandoned
     for this cycle rather than spinning.
  3. On `trylock` failure, the idle CPU backs off: it doubles the steal retry interval
     (from the base 4ms up to a cap of 32ms) and re-enters the idle loop. The backoff
     resets to 4ms on a successful steal or when a local wake-up occurs.

  This trylock approach ensures the work stealer never blocks a busy CPU's scheduler
  path. In practice, `trylock` succeeds on the first attempt >95% of the time because
  run queue critical sections are bounded and short (O(1), typically < 1 microsecond).

  **Maximum steal count.** Each steal attempt moves **exactly 1 task**. Stealing multiple
  tasks per attempt risks over-correcting the load imbalance and causing ping-pong
  migration between CPUs. The periodic 4ms steal interval provides natural convergence:
  a 4-task imbalance resolves in ~16ms (4 steal cycles), which is well within acceptable
  load-balancing latency for non-RT workloads.

- **Run queue lock ordering — compile-time enforced**: The load balancer (work stealing)
  acquires remote run queue locks. ABBA deadlock prevention is **type-enforced**, not a
  runtime convention. All run queue locks share `RQ_LOCK_LEVEL = 2` in the compile-time
  lock hierarchy ([Section 3.1.5](03-concurrency.md#315-locking-strategy)). The
  `SpinLock<_, LEVEL=2>` type prevents a caller holding one level-2 lock from acquiring a
  second one directly. The only legal path to holding two run queue locks simultaneously is
  `lock_two_runqueues(rq_a, rq_b)`, which always acquires the lock for the lower CPU ID
  first — making CPU-ID-ordered acquisition the sole valid code path rather than a
  convention that code review must uphold. The load balancer never holds more than two run
  queue locks simultaneously. Cross-subsystem ordering is:
  `TASK_LOCK (level 1) < RQ_LOCK (level 2) < PI_LOCK (level 3)`.
- **Real-time guarantees**: Dedicated RT cores can be reserved (isolcpus equivalent).
  Threaded interrupts ensure deterministic scheduling latency.
- **CPU frequency/power**: Integration with cpufreq governors for power management.

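The CPU-ID-ordered acquisition that `lock_two_runqueues()` makes the sole legal path can be sketched as follows; the `Mutex`-based `RunQueue` here is an illustrative stand-in for the kernel's level-2 spinlocks, not the real type:

```rust
use std::sync::{Mutex, MutexGuard};

// Illustrative stand-in for a per-CPU run queue protected by a level-2 lock.
struct RunQueue {
    cpu_id: u32,
    lock: Mutex<Vec<u64>>, // task IDs, for demonstration only
}

/// Acquire two run queue locks in CPU-ID order. Every caller takes the
/// lower-numbered CPU's lock first, so two CPUs locking the same pair can
/// never hold one lock each while waiting on the other (no ABBA deadlock).
/// Guards are returned in argument order regardless of acquisition order.
fn lock_two_runqueues<'a>(
    a: &'a RunQueue,
    b: &'a RunQueue,
) -> (MutexGuard<'a, Vec<u64>>, MutexGuard<'a, Vec<u64>>) {
    assert_ne!(a.cpu_id, b.cpu_id, "cannot double-lock one run queue");
    if a.cpu_id < b.cpu_id {
        let ga = a.lock.lock().unwrap();
        let gb = b.lock.lock().unwrap();
        (ga, gb)
    } else {
        let gb = b.lock.lock().unwrap();
        let ga = a.lock.lock().unwrap();
        (ga, gb)
    }
}

fn main() {
    let rq0 = RunQueue { cpu_id: 0, lock: Mutex::new(vec![1, 2, 3]) };
    let rq1 = RunQueue { cpu_id: 1, lock: Mutex::new(vec![]) };
    // Steal one task from rq0 into rq1. Acquisition order is identical no
    // matter which queue is passed first, which is what defeats ABBA.
    let (mut src, mut dst) = lock_two_runqueues(&rq0, &rq1);
    if let Some(task) = src.pop() {
        dst.push(task);
    }
    assert_eq!(dst.len(), 1);
    println!("ok");
}
```

In the kernel the same idea is enforced at the type level: holding one level-2 lock forbids acquiring another except through this helper.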
### 6.1.4 Scheduler Classes

The scheduler is modular. Each scheduling class implements a standard interface:

```rust
pub trait SchedClass: Send + Sync {
    fn enqueue_task(&mut self, task: &mut Task, flags: EnqueueFlags);
    fn dequeue_task(&mut self, task: &mut Task, flags: DequeueFlags);
    fn pick_next_task(&mut self, cpu: CpuId) -> Option<&mut Task>;
    fn check_preempt(&self, current: &Task, incoming: &Task) -> bool;
    fn task_tick(&mut self, task: &mut Task, cpu: CpuId, queued: bool);
    fn balance(&mut self, cpu: CpuId, flags: BalanceFlags) -> BalanceResult;
}

```

Classes are checked in priority order: Deadline > RT > EEVDF. The highest-priority class with a runnable task wins.

**`pick_next_task()` dispatch algorithm.** The per-CPU scheduler entry point traverses scheduling classes in strict priority order. Each class's `pick_next_task` is called at most once; the first class that returns a runnable task wins. This is O(1) in the number of classes (three fixed classes, not a dynamic list).

```rust
/// Select the highest-priority runnable task on this CPU's run queue.
///
/// Called from the scheduler core on every context switch, timer tick
/// preemption, and explicit `schedule()` invocation. The caller holds
/// `rq.lock` for the local CPU.
///
/// # Priority order
///
/// 1. **Deadline (CBS)** — tasks with active bandwidth reservations and
///    unexpired deadlines. Scheduled earliest-deadline-first within the
///    CBS server.
/// 2. **RT (FIFO / RR)** — real-time tasks. FIFO tasks run until they
///    yield or block; RR tasks rotate within their priority level on
///    each time slice expiry.
/// 3. **EEVDF (normal)** — the eligible task with the smallest virtual
///    deadline (`min vdeadline` in the eligible tree).
///
/// If all three classes are empty, the CPU enters the idle task — a
/// per-CPU kernel thread that executes the architecture's halt/wait
/// instruction (`hlt` on x86, `wfi` on ARM, `wfi` on RISC-V) until
/// the next interrupt.
fn pick_next_task(rq: &mut RunQueueData) -> &mut Task {
    // 1. Deadline class: highest priority. CBS tasks with active
    //    reservations whose deadline has not yet expired are checked
    //    first. `dl.pick_next_task()` returns the task with the
    //    earliest absolute deadline (EDF within the CBS server).
    if let Some(task) = rq.dl.pick_next_task() {
        return task;
    }

    // 2. RT class: FIFO and RR tasks. The highest-priority RT task
    //    is returned. Within a priority level, FIFO tasks are ordered
    //    by arrival time; RR tasks rotate on slice expiry.
    if let Some(task) = rq.rt.pick_next_task() {
        return task;
    }

    // 3. EEVDF class: normal (SCHED_NORMAL / SCHED_BATCH) tasks.
    //    Returns the eligible task with the smallest vdeadline from
    //    the eligible_tree (leftmost node in the red-black tree).
    //    Before selection, any tasks in ineligible_tree whose
    //    eligible_vtime <= min_vruntime are promoted to eligible_tree.
    if let Some(task) = rq.eevdf.pick_next_task() {
        return task;
    }

    // 4. All classes empty: return the per-CPU idle task.
    //    The idle task is always runnable and never enqueued in any
    //    scheduling class. It is a sentinel — the run queue is never
    //    truly "empty" because the idle task is always available.
    &mut rq.idle_task
}
```

**Per-CPU run queue interaction.** Each CPU calls `pick_next_task()` independently on its own `RunQueueData` while holding the local run queue lock. There is no cross-CPU coordination in the pick path — load balancing and work stealing (Section 6.1.3) are separate, asynchronous operations that move tasks between run queues. This ensures the scheduling hot path is lock-local and O(1) in the number of CPUs.

**Idle task behavior.** The idle task is a statically allocated per-CPU kernel thread that does not participate in any scheduling class. When selected, it:

1. Checks for pending softirqs and processes them before halting.
2. Invokes the cpuidle governor to select the deepest safe C-state (Section 6.4) based on expected idle duration and latency constraints.
3. Executes the architecture halt instruction. The CPU remains halted until the next interrupt (timer tick, IPI from work stealing, device interrupt).
4. On wake, immediately calls `pick_next_task()` again — the idle task never runs application logic.

**`SchedClass` dispatch mechanism.** UmkaOS uses static enum dispatch (not a vtable or `dyn Trait`). The scheduling class of each task is stored as a `SchedClass` enum field in `Task`. The scheduler's hot path uses a `match` statement on the enum, which the compiler can optimize to a direct jump table.

Rationale for static enum dispatch over a vtable:

- **Zero indirection**: an enum `match` compiles to a jump table (O(1), branch-predictor-friendly); vtable dispatch requires a pointer dereference before the call. On x86-64, this eliminates one cache miss per scheduling decision.
- **LTO-friendly**: the compiler can inline small per-class operations (e.g., `SCHED_IDLE.pick_next_task()` always returns `None` if the idle task is the only runnable task). Vtable calls prevent inlining across compilation units.
- **No runtime registration**: `SchedClass` is fixed at compile time. New scheduling classes require a kernel rebuild, not a runtime module. This avoids the race conditions and validation overhead of dynamically registered scheduling classes.

```rust
/// Scheduling class. Stored in Task; determines all scheduling decisions.
#[repr(u8)]
pub enum SchedClass {
    /// EEVDF (Earliest Eligible Virtual Deadline First).
    /// For all normal (CFS) tasks. Provides fair-share CPU time with
    /// latency-nice configurable slice sizes.
    Eevdf  = 0,
    /// POSIX SCHED_FIFO. Run until preempted by higher-priority RT task,
    /// blocked, or explicitly yields. Static priority 1-99.
    RtFifo = 1,
    /// POSIX SCHED_RR. Like SCHED_FIFO but with timeslices.
    RtRr   = 2,
    /// POSIX SCHED_DEADLINE. CBS (Constant Bandwidth Server). Specified
    /// by (runtime_us, deadline_us, period_us) at sched_setattr() time.
    Deadline = 3,
    /// SCHED_IDLE. Lower priority than any Eevdf task. Used for background
    /// maintenance tasks (garbage collection, defragmentation, telemetry).
    Idle   = 4,
}

// In the scheduler hot path:
fn pick_next_task(rq: &mut RunQueue) -> Option<&mut Task> {
    // Priority order: Deadline > RtFifo > RtRr > Eevdf > Idle
    if let Some(t) = rq.deadline_queue.pick_eligible() { return Some(t); }
    if let Some(t) = rq.rt_queue.pick_highest_priority() { return Some(t); }
    if let Some(t) = rq.eevdf_queue.pick_eligible() { return Some(t); }
    rq.idle_queue.pick_any()
}
```

Per-class operations (called via `match` in the scheduler):

- `enqueue(task)`: Add to the class-specific queue.
- `dequeue(task)`: Remove from the class-specific queue.
- `pick_next()`: Select the next task to run.
- `put_prev(task)`: Task is being descheduled; update per-class bookkeeping (e.g., EEVDF virtual time advance).
- `check_preempt(task, new_task)`: Can `new_task` preempt `task`? Called when a new task becomes runnable.
- `task_tick(task)`: Called on each scheduler tick for the running task.
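As a toy illustration of the static enum dispatch and class priority ordering described above (types are simplified and the per-class queues are omitted; only the `match` on `SchedClass` is the point):

```rust
// Toy model: a match on SchedClass compiles to a jump table, with no
// vtable indirection. Types here are illustrative, not the kernel's.
#[derive(Clone, Copy)]
#[repr(u8)]
enum SchedClass {
    Eevdf = 0,
    RtFifo = 1,
    RtRr = 2,
    Deadline = 3,
    Idle = 4,
}

struct Task {
    class: SchedClass,
    name: &'static str,
}

/// Pick-order rank: Deadline > RT (FIFO/RR) > EEVDF > Idle.
fn class_rank(c: SchedClass) -> u8 {
    match c {
        SchedClass::Deadline => 0,
        SchedClass::RtFifo | SchedClass::RtRr => 1,
        SchedClass::Eevdf => 2,
        SchedClass::Idle => 3,
    }
}

/// Return the runnable task whose class has the highest priority.
fn pick_next<'a>(runnable: &'a [Task]) -> Option<&'a Task> {
    runnable.iter().min_by_key(|t| class_rank(t.class))
}

fn main() {
    let runnable = [
        Task { class: SchedClass::Eevdf, name: "editor" },
        Task { class: SchedClass::RtFifo, name: "audio" },
        Task { class: SchedClass::Idle, name: "gc" },
    ];
    // The RT task wins over the normal and idle-class tasks.
    assert_eq!(pick_next(&runnable).unwrap().name, "audio");
    println!("ok");
}
```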

See also: Section 7.2 (Real-Time Guarantees) extends deadline scheduling with bounded-latency paths, threaded interrupts, and PREEMPT_RT-style priority inheritance for hard real-time workloads.

**ML tuning**: Key EEVDF parameters (`eevdf_weight_scale`, `migration_benefit_threshold`, `eas_energy_bias`, `preemption_latency_budget`) are registered in the Kernel Tunable Parameter Store and may be adjusted at runtime by Tier 2 AI/ML policy services via the closed-loop framework defined in Section 22.1. The scheduler emits `SchedObs` observations (task wakeup latency, EAS decisions, runqueue stats) that feed the `umka-ml-sched` Tier 2 service. All parameters revert to defaults within 60 seconds if the ML service stops sending updates.

### 6.1.5 Heterogeneous CPU Support (big.LITTLE / Intel Hybrid / RISC-V)

Modern SoCs are no longer symmetric. ARM big.LITTLE (2011+), Intel Alder Lake P-core/E-core (2021+), and RISC-V platforms with mixed hart types all present the scheduler with CPUs of different performance, power, and ISA capabilities. A scheduler that treats all CPUs as identical will either waste power (running background tasks on performance cores) or starve throughput (placing compute-heavy tasks on efficiency cores).

This section extends the scheduler with Energy-Aware Scheduling (EAS), per-CPU capacity tracking, and heterogeneous topology awareness.

#### 6.1.5.1 CPU Capacity Model

Every CPU has a capacity value normalized to a 0–1024 scale, where the fastest core at its highest frequency = 1024. This is the fundamental abstraction that makes the scheduler heterogeneity-aware.

```rust
// umka-core/src/sched/capacity.rs

/// Per-CPU capacity descriptor.
/// Populated at boot from firmware tables (ACPI PPTT, devicetree, CPPC).
/// Updated at runtime when frequency changes.
pub struct CpuCapacity {
    /// Maximum capacity of this CPU at its highest OPP (Operating Performance Point).
    /// Normalized: fastest core in the system = 1024.
    /// An efficiency core might be 512 (half the throughput of a performance core).
    pub capacity: u32,

    /// Original (boot-time) maximum capacity. Does not change.
    pub capacity_max: u32,

    /// Current capacity, adjusted for current frequency.
    /// If a 1024-capacity core is running at 50% frequency, capacity_curr = 512.
    /// Updated by cpufreq governor on frequency change.
    pub capacity_curr: AtomicU32,

    /// Core type classification.
    pub core_type: CoreType,

    /// Frequency domain this CPU belongs to.
    /// All CPUs in a frequency domain share the same clock.
    pub freq_domain: FreqDomainId,

    /// ISA capabilities of this CPU.
    /// On heterogeneous ISA systems (RISC-V), different cores may support
    /// different extensions.
    pub isa_caps: IsaCapabilities,

    /// Microarchitecture ID (for Intel Thread Director).
    /// Different core types have different uarch IDs.
    pub uarch_id: u32,
}

/// Core type classification.
#[repr(u32)]
pub enum CoreType {
    /// ARM Cortex-X/A7x, Intel P-core.
    /// High single-thread performance, high power.
    Performance = 0,

    /// ARM Cortex-A5x, Intel E-core.
    /// Lower performance, significantly lower power.
    Efficiency  = 1,

    /// ARM Cortex-A7x mid-tier (e.g., Cortex-A78 in a system with X3 and A510).
    Mid         = 2,

    /// Traditional SMP — all cores identical.
    /// When all cores are Symmetric, EAS is disabled (unnecessary).
    Symmetric   = 3,
}

/// ISA capability flags.
/// On heterogeneous ISA systems, the scheduler must ensure a task only runs
/// on a CPU that supports the ISA features the task uses.
bitflags! {
    pub struct IsaCapabilities: u64 {
        // ARM
        const ARM_SVE       = 1 << 0;   // Scalable Vector Extension
        const ARM_SVE2      = 1 << 1;   // SVE2
        const ARM_SME       = 1 << 2;   // Scalable Matrix Extension
        const ARM_MTE       = 1 << 3;   // Memory Tagging Extension

        // x86
        const X86_AVX512    = 1 << 16;  // AVX-512 (P-cores only on some Intel)
        const X86_AMX       = 1 << 17;  // Advanced Matrix Extensions (P-cores only)
        const X86_AVX10     = 1 << 18;  // AVX10 (unified AVX across core types)

        // RISC-V
        const RV_V          = 1 << 32;  // Vector extension
        const RV_B          = 1 << 33;  // Bit manipulation
        const RV_H          = 1 << 34;  // Hypervisor extension
        const RV_CRYPTO     = 1 << 35;  // Cryptography extensions
    }
}

/// Vector length metadata — companion to IsaCapabilities for variable-length
/// vector ISAs (ARM SVE/SVE2, RISC-V V). The bitflags above indicate *presence*
/// of the extension; this struct encodes the *vector register width* that the
/// thread actually uses, which determines migration constraints and XSAVE area size.
#[repr(C)]
pub struct VectorLengthInfo {
    /// ARM SVE/SVE2 vector length in bits (128-2048, must be power of 2).
    /// 0 = thread does not use SVE. Discovered per-core via `rdvl` at boot.
    pub sve_vl_bits: u16,
    /// RISC-V VLEN in bits (32-65536). 0 = thread does not use RVV.
    /// Discovered per-hart via `vlenb` CSR at boot.
    /// Uses u32 because the RISC-V V spec allows VLEN up to 65536 bits,
    /// which equals u16::MAX + 1 and would overflow a u16.
    pub rvv_vlen_bits: u32,
}
```

**Key design property**: On a fully symmetric system (all cores `CoreType::Symmetric`), the capacity model is a no-op. All CPUs have capacity 1024 and the same ISA capabilities. The scheduler fast path sees `capacity_curr == 1024` on every CPU and skips all heterogeneous logic. Zero overhead on symmetric systems.
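The `capacity_curr` adjustment on a frequency change can be sketched as a linear rescaling, consistent with the "capacity scales linearly with frequency within a core type" assumption of the OPP model (the helper name here is illustrative, not a kernel API):

```rust
/// Recompute a CPU's current capacity when the cpufreq governor changes
/// frequency: capacity_curr = capacity_max × freq_curr / freq_max.
/// Widen to u64 for the intermediate product to avoid u32 overflow.
fn capacity_at_freq(capacity_max: u32, freq_curr_khz: u32, freq_max_khz: u32) -> u32 {
    ((capacity_max as u64 * freq_curr_khz as u64) / freq_max_khz as u64) as u32
}

fn main() {
    // A 1024-capacity performance core running at half its 2.4 GHz max:
    assert_eq!(capacity_at_freq(1024, 1_200_000, 2_400_000), 512);
    // A 400-capacity efficiency core at its 1.6 GHz max is unchanged:
    assert_eq!(capacity_at_freq(400, 1_600_000, 1_600_000), 400);
    println!("ok");
}
```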

#### 6.1.5.2 Energy Model

The energy model describes the power cost of running a workload at each performance level on each core type. It is the foundation of Energy-Aware Scheduling.

```rust
// umka-core/src/sched/energy.rs

/// Maximum number of Operating Performance Points per frequency domain.
/// 16 entries accommodates all known hardware OPP tables. If hardware exposes
/// more than 16 OPPs, the driver selects the 16 entries with the widest
/// frequency spread (lowest, highest, and 14 evenly distributed intermediate
/// points).
pub const MAX_OPP_ENTRIES: usize = 16;

/// Energy model for one frequency domain.
/// A frequency domain is a group of CPUs that share the same clock.
/// All CPUs in a domain have the same core type and OPP table.
pub struct EnergyModel {
    /// Which frequency domain this model covers.
    pub freq_domain: FreqDomainId,

    /// Core type of CPUs in this domain.
    pub core_type: CoreType,

    /// Number of CPUs in this domain.
    pub cpu_count: u32,

    /// Operating Performance Points, sorted by frequency (ascending).
    /// Each OPP maps a frequency to a capacity and power cost.
    /// Fixed-capacity inline array avoids heap allocation and keeps OPP
    /// data cache-local.
    pub opps: ArrayVec<OppEntry, MAX_OPP_ENTRIES>,
}

/// One Operating Performance Point.
pub struct OppEntry {
    /// Frequency in kHz.
    pub freq_khz: u32,

    /// Capacity at this frequency (0–1024 scale).
    /// Capacity scales linearly with frequency within a core type.
    pub capacity: u32,

    /// Power consumption at this frequency (milliwatts).
    /// This is the DYNAMIC power for one CPU running at 100% utilization.
    /// Power scales roughly as V²×f (voltage² × frequency).
    pub power_mw: u32,
}
```

**Example: ARM big.LITTLE system (Cortex-X3 + Cortex-A510)**

```text
Performance cores (Cortex-X3), freq_domain 0:
  OPP 0:  600 MHz, capacity  256, power   80 mW
  OPP 1: 1200 MHz, capacity  512, power  280 mW
  OPP 2: 1800 MHz, capacity  768, power  650 mW
  OPP 3: 2400 MHz, capacity 1024, power 1200 mW

Efficiency cores (Cortex-A510), freq_domain 1:
  OPP 0:  400 MHz, capacity  100, power  15 mW
  OPP 1:  800 MHz, capacity  200, power  50 mW
  OPP 2: 1200 MHz, capacity  300, power 110 mW
  OPP 3: 1600 MHz, capacity  400, power 200 mW
```

Observation: for a task with utilization 200 (out of 1024), EAS compares the lowest OPP that fits on each core type:

```text
Performance core, OPP 0 (capacity 256):  80 mW
Efficiency core,  OPP 1 (capacity 200):  50 mW
→ Efficiency core wins. EAS places the task on the efficiency core.

A task with utilization 500:
→ Does not fit on any efficiency-core OPP (max capacity 400).
→ Must run on a performance core. EAS picks the lowest OPP that fits:
  OPP 1 (capacity 512), 280 mW.
```

#### 6.1.5.3 Energy-Aware Scheduling Algorithm

EAS runs at task wakeup time (the most impactful scheduling decision). It answers: "Which CPU should this task run on to minimize total system energy while meeting performance requirements?"

```rust
// umka-core/src/sched/eas.rs

pub struct EnergyAwareScheduler {
    /// Energy models for all frequency domains.
    /// Boot-allocated contiguous array (one per frequency domain), not heap Vec.
    /// Indexed by frequency domain ID. Length = number of frequency domains
    /// discovered at boot. Stored as a `BootVec<EnergyModel>` (boot-time-allocated,
    /// fixed-size-after-init, no heap pointer indirection on the wakeup fast path).
    energy_models: BootVec<EnergyModel>,

    /// Per-CPU utilization (PELT, see Section 6.1.5.4).
    /// Boot-allocated contiguous array (one per CPU), not heap Vec.
    /// Indexed by CPU ID. Stored as `PerCpu<AtomicU32>` for cache-line-aligned
    /// per-CPU access without pointer indirection on the wakeup hot path.
    cpu_util: PerCpu<AtomicU32>,

    /// Threshold: a task is "misfit" if its utilization exceeds
    /// the capacity of the CPU it's running on.
    /// Misfit tasks are migrated to higher-capacity CPUs.
    misfit_threshold: u32,

    /// EAS is disabled on fully symmetric systems (no benefit).
    enabled: bool,
}

impl EnergyAwareScheduler {
    /// Find the most energy-efficient CPU for a waking task.
    /// Called from EEVDF enqueue path when EAS is enabled.
    ///
    /// Algorithm:
    ///   1. For each frequency domain:
    ///      a. Compute the new utilization if this task were placed here.
    ///      b. Find the lowest OPP that can handle the new utilization.
    ///      c. Compute energy cost = OPP power × (new_util / capacity).
    ///   2. Pick the frequency domain with the lowest energy cost.
    ///   3. Within that domain, pick the CPU with the most spare capacity
    ///      (to avoid unnecessary frequency increases).
    ///
    /// Complexity: O(domains × OPPs). Typically 2-3 domains × 4-6 OPPs = 8-18 iterations.
    /// Time: ~200-500ns. Acceptable for task wakeup path (~2000ns total).
    pub fn find_energy_efficient_cpu(&self, task_util: u32) -> CpuId {
        let mut best_energy = u64::MAX;
        let mut best_cpu = CpuId(0);

        for model in &self.energy_models {
            // Can this domain handle the task at all?
            let max_capacity = model.opps.last().map(|o| o.capacity).unwrap_or(0);
            if task_util > max_capacity {
                continue; // Task doesn't fit on this core type
            }

            // Compute energy cost for placing task in this domain.
            let energy = self.compute_energy(model, task_util);
            if energy < best_energy {
                best_energy = energy;
                best_cpu = self.find_idlest_cpu_in_domain(model.freq_domain);
            }
        }

        best_cpu
    }

    /// Estimate energy cost of adding `task_util` to a frequency domain.
    ///
    /// OPP selection uses the maximum per-CPU utilization in the domain (not
    /// the aggregate), because frequency is shared across all CPUs in a DVFS
    /// domain — the OPP must be high enough for the most loaded CPU.
    fn compute_energy(&self, model: &EnergyModel, task_util: u32) -> u64 {
        // Find max per-CPU utilization in this domain. Assumes the task
        // will be placed on the idlest CPU (same heuristic as
        // find_idlest_cpu_in_domain), so task_util is added to that
        // CPU's utilization when computing the domain's max.
        let max_cpu_util = self.max_cpu_utilization(model.freq_domain, task_util);

        // Find lowest OPP whose capacity can handle the busiest CPU.
        let opp = model.opps.iter()
            .find(|o| o.capacity >= max_cpu_util)
            .unwrap_or(model.opps.last().unwrap());

        // Energy = power × (sum of all CPU utilizations) / capacity.
        // Power is determined by the OPP (selected by max CPU), but energy
        // is proportional to total work done across all CPUs in the domain.
        let domain_util = self.domain_utilization(model.freq_domain) + task_util;
        (opp.power_mw as u64) * (domain_util as u64) / (opp.capacity as u64)
    }
}
```

When EAS is **not** used (symmetric systems, or when all cores are of the same type): the standard EEVDF load balancer runs instead. EAS adds zero overhead because `enabled == false` and the check is a single branch at the top of the wakeup path.

#### 6.1.5.4 Per-Entity Load Tracking (PELT)

EAS needs accurate, up-to-date utilization data for each task and each CPU. PELT provides this with an exponentially-decaying average that balances responsiveness with stability.

// umka-core/src/sched/pelt.rs

// ---------------------------------------------------------------------------
// PELT constants and decay lookup table (Gap 2.13)
// ---------------------------------------------------------------------------

/// One PELT period in nanoseconds.
///
/// Chosen as 1024 × 1000 = 1,024,000 ns ≈ 1.024 ms. The power-of-two
/// factor (1024) means the division `delta_ns / PERIOD_NS` and the modulo
/// `delta_ns % PERIOD_NS` can be computed with a right-shift and a bitmask
/// on architectures where the compiler elides the integer division.
pub const PELT_PERIOD_NS: u64 = 1_024_000;

/// Converged maximum load average.
///
/// A task that has been 100% runnable for effectively infinite time converges
/// to `LOAD_AVG_MAX`. This is the closed-form sum of the geometric series:
///
/// ```text
/// LOAD_AVG_MAX = 1024 × Σ_{n=0}^{∞} y^n = 1024 / (1 − y) ≈ 47742
/// ```
///
/// where `y = 0.5^(1/32) ≈ 0.97857` is the per-period decay factor.
/// Used to normalise the internal `*_sum` accumulators to the `*_avg` fields
/// (0–1024 scale): `util_avg = util_sum × 1024 / LOAD_AVG_MAX`.
pub const LOAD_AVG_MAX: u64 = 47742;

/// Number of periods for the geometric series to converge.
///
/// After `LOAD_AVG_MAX_N` periods at 100% utilisation the internal sum
/// reaches `LOAD_AVG_MAX` to within 1 ULP. Any periods beyond this index
/// need not be tracked — `decay_load()` returns 0 for `n ≥ LOAD_AVG_MAX_N`.
pub const LOAD_AVG_MAX_N: u64 = 345;

/// Sub-period fractional decay coefficients for PELT.
///
/// `RUNNABLE_AVG_YN_INV[i]` is the fixed-point (Q32) representation of `y^i`
/// where `y = 0.5^(1/32) ≈ 0.97857` and `i ∈ [0, 31]`:
///
/// ```text
/// RUNNABLE_AVG_YN_INV[i] = round(y^i × 2^32)
/// ```
///
/// Entry 0 = `2^32 - 1` (full weight, zero elapsed sub-periods).
/// Entry 31 = `round(y^31 × 2^32)` (nearly one full period of decay).
///
/// Used by `decay_load()` for the fractional-period component of decay:
///
/// ```text
/// val = (val * RUNNABLE_AVG_YN_INV[n % 32]) >> 32
/// ```
///
/// This avoids floating-point arithmetic at runtime; the table is computed
/// once at compile time from the analytic formula.
pub const RUNNABLE_AVG_YN_INV: [u32; 32] = [
    0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a,
    0xeac0c6e6, 0xe5b906e6, 0xe0ccdeeb, 0xdbfbb796,
    0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
    0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46,
    0xb504f333, 0xb123f581, 0xad583ee9, 0xa9a15ab4,
    0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
    0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a,
    0x8b95c1e3, 0x88980e80, 0x85aac367, 0x82cd8698,
];

/// Decay a PELT accumulator value by `n` elapsed periods.
///
/// Applies the compound decay factor `y^n` using integer arithmetic:
///
/// 1. If `n > LOAD_AVG_MAX_N` (345), return 0 — the value is fully decayed.
/// 2. Halve `val` for each complete group of 32 periods: `val >>= n / 32`.
///    Each group of 32 periods reduces the value by exactly 50% (`y^32 = 0.5`).
/// 3. Apply the remaining sub-period fractional decay using the lookup table:
///    `val = (val * RUNNABLE_AVG_YN_INV[n % 32]) >> 32`.
/// 4. Return the decayed value.
///
/// # Precision
///
/// The fixed-point multiply in step 3 is a Q32 multiply: the result is the
/// upper 32 bits of the 64-bit product. On 64-bit platforms this is a single
/// `mulhi` or equivalent instruction. On 32-bit platforms it requires a
/// 32×32→64 widening multiply.
///
/// # Usage
///
/// Called for each PELT accumulator (`load_sum`, `runnable_sum`, `util_sum`)
/// when a state transition spans `n ≥ 1` complete periods.
pub fn decay_load(val: u64, n: u64) -> u64 {
    if n > LOAD_AVG_MAX_N {
        return 0;
    }
    // Halve for each complete group of 32 periods.
    let val = val >> (n / 32);
    // Fractional sub-period decay via Q32 multiply with lookup table.
    // Widen to u128 so the multiply cannot overflow for callers passing
    // `val >= 2^32`; in-range PELT sums never get close, so this is free.
    ((val as u128 * RUNNABLE_AVG_YN_INV[(n % 32) as usize] as u128) >> 32) as u64
}

// ---------------------------------------------------------------------------
// PeltState — per-entity load tracking state
// ---------------------------------------------------------------------------

/// Per-Entity Load Tracking state.
///
/// Attached to every schedulable entity (task) and every CPU run queue.
/// Maintains exponentially-decaying averages of CPU utilisation, runnability,
/// and weighted load over a ~32 ms half-life window.
///
/// ## Internal representation
///
/// Three raw accumulators (`load_sum`, `runnable_sum`, `util_sum`) hold the
/// un-normalised geometric sums. Three derived averages (`load_avg`,
/// `runnable_avg`, `util_avg`) are the normalised 0–1024 values consumed by
/// EAS, load balancing, and cpufreq. The averages are recomputed from the
/// sums whenever a state transition occurs:
///
/// ```text
/// util_avg     = util_sum     * NICE_0_WEIGHT / LOAD_AVG_MAX   (clamped to 1024)
/// runnable_avg = runnable_sum * NICE_0_WEIGHT / LOAD_AVG_MAX   (clamped to 1024)
/// load_avg     = load_sum     * task.weight   / LOAD_AVG_MAX
/// ```
///
/// `NICE_0_WEIGHT = 1024` and `LOAD_AVG_MAX = 47742`.
///
/// ## Half-life
///
/// The decay factor `y = 0.5^(1/32) ≈ 0.97857` per 1.024 ms period gives a
/// half-life of 32 periods ≈ 32.768 ms. A task that stops running drops to
/// 50% utilisation after ~32 ms and is effectively zero after ~345 periods
/// (~353 ms).
pub struct PeltState {
    /// Raw load accumulator: `Σ(runnable_time_in_period × y^n)`, un-weighted.
    /// The task's weight is applied only when deriving the normalised average:
    /// `load_avg = load_sum × task.weight / LOAD_AVG_MAX`.
    pub load_sum: u64,

    /// Raw runnable accumulator: `Σ(runnable_time_in_period × y^n)`.
    /// Counts time the entity was either running or waiting in the run queue.
    /// Not normalised. Divide by `LOAD_AVG_MAX` to obtain `runnable_avg`.
    pub runnable_sum: u64,

    /// Raw utilisation accumulator: `Σ(running_time_in_period × y^n)`.
    /// Counts only time the entity was executing on a CPU (not queued).
    /// Not normalised. Divide by `LOAD_AVG_MAX` to obtain `util_avg`.
    pub util_sum: u64,

    /// Sub-period carry-forward in nanoseconds.
    ///
    /// Nanoseconds elapsed since the start of the current (incomplete) period.
    /// Preserved across state transitions so that sub-period time accumulates
    /// correctly. Range: `[0, PELT_PERIOD_NS)`.
    pub period_contrib: u32,

    /// Normalised weighted load average (0 = idle, `task.weight` = fully loaded).
    /// `load_avg = load_sum * task.weight / LOAD_AVG_MAX`.
    /// Used by the load balancer and NUMA placement.
    pub load_avg: u64,

    /// Normalised runnable average (0–1024).
    /// Includes both running and queued (waiting) time.
    /// `runnable_avg = runnable_sum * NICE_0_WEIGHT / LOAD_AVG_MAX`.
    pub runnable_avg: u64,

    /// Normalised utilisation average (0–1024).
    /// Pure execution time only, excluding queued time.
    /// `util_avg = util_sum * NICE_0_WEIGHT / LOAD_AVG_MAX`.
    /// This is the primary signal consumed by EAS and cpufreq.
    pub util_avg: u64,

    /// Monotonic timestamp of the last `update()` call (nanoseconds since boot).
    /// Used to compute `delta_ns` on the next call.
    pub last_update_time: u64,
}

impl PeltState {
    /// Update PELT state with a new time sample.
    ///
    /// Must be called at every scheduling event that changes entity state:
    /// task switch (running→queued, queued→running), sleep (queued→off),
    /// wake-up (off→queued), and scheduler tick. The caller must ensure
    /// `running`, `runnable`, and `delta_ns` accurately reflect the entity's
    /// state for the entire interval since the last call.
    ///
    /// **State-transition contract**: Between consecutive calls, the entity's
    /// state must be constant — exactly one of {running, runnable-but-not-running,
    /// sleeping}. Calling `update()` mid-interval and then again with a different
    /// state for the remainder is the correct pattern; calling `update()` with a
    /// blended state is incorrect and will misattribute the `period_contrib`
    /// carry-forward.
    ///
    /// `running`: was this entity executing on a CPU for the entire `delta_ns`?
    /// `runnable`: was this entity on the run queue (running or waiting)?
    ///   Invariant: `running → runnable` (every running task is runnable).
    /// `delta_ns`: nanoseconds elapsed since `last_update_time`.
    /// `task_weight`: the task's `sched_prio_to_weight` value (for `load_avg`).
    pub fn update(
        &mut self,
        running: bool,
        runnable: bool,
        delta_ns: u64,
        task_weight: u64,
    ) {
        debug_assert!(!running || runnable, "running implies runnable");

        // Accumulate carry-forward from the previous update with the new delta.
        let total_ns = delta_ns + self.period_contrib as u64;
        let n_periods = total_ns / PELT_PERIOD_NS;
        let remainder_ns = (total_ns % PELT_PERIOD_NS) as u32;

        if n_periods > 0 {
            // Decay all three raw sums by n_periods elapsed periods.
            self.load_sum     = decay_load(self.load_sum,     n_periods);
            self.runnable_sum = decay_load(self.runnable_sum, n_periods);
            self.util_sum     = decay_load(self.util_sum,     n_periods);

            // Add the contribution of the completed periods. Each complete period
            // where the entity was runnable/running contributes exactly 1024 units
            // (one full period of maximum accumulation), scaled by the decayed
            // geometric sum `accumulate_sum(n_periods)`:
            //
            //   accumulate_sum(n) = Σ_{k=0}^{n-1} y^k × 1024
            //                     = 1024 × (1 - y^n) / (1 - y)
            //
            // (computed without floating-point using the same lookup table).
            let contrib = accumulate_sum(n_periods);
            if runnable {
                self.runnable_sum += contrib;
                // `load_sum` is un-weighted; the task's weight is applied once,
                // when the normalised `load_avg` is derived below.
                self.load_sum += contrib;
            }
            if running {
                self.util_sum += contrib;
            }
        }

        // Carry the sub-period remainder into the next call. Do NOT add it to
        // the raw sums yet — it will be incorporated when it completes a full
        // period, preventing double-counting.
        self.period_contrib = remainder_ns;
        self.last_update_time += delta_ns;

        // Recompute the normalised averages from the updated sums.
        self.load_avg     = self.load_sum     * task_weight / LOAD_AVG_MAX;
        self.runnable_avg = self.runnable_sum * NICE_0_WEIGHT / LOAD_AVG_MAX;
        self.util_avg     = self.util_sum     * NICE_0_WEIGHT / LOAD_AVG_MAX;

        // Clamp averages to their valid ranges.
        self.util_avg     = self.util_avg.min(NICE_0_WEIGHT);
        self.runnable_avg = self.runnable_avg.min(NICE_0_WEIGHT);
    }
}

/// Accumulate the decayed geometric series for `n` complete periods.
///
/// Returns the sum `Σ_{k=0}^{n-1} (1024 × y^k)`, which is the total
/// contribution of `n` complete periods of 100% activity to a PELT sum.
/// Implemented using the same two-step lookup as `decay_load()`:
///
/// ```text
/// // Full 32-period groups each contribute LOAD_AVG_MAX × (1 - y^32) = LOAD_AVG_MAX × 0.5
/// // but the incremental sum is easier to compute as LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, n).
/// accumulate_sum(n) = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, n)
/// ```
///
/// This identity holds because the converged sum minus the decayed tail is
/// exactly the contribution of `n` periods from a starting value of 0.
pub fn accumulate_sum(n: u64) -> u64 {
    LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, n)
}

Relationship to EAS: When a task wakes up, the scheduler reads task.pelt.util_avg to know the task's CPU demand. EAS uses this to find the core type where the task fits most efficiently. Without PELT, EAS would have no utilization data to work with.

6.1.5.5 Frequency Domain Awareness and Cpufreq Integration

CPUs within a frequency domain share a clock — changing one CPU's frequency changes all of them. The scheduler must be aware of this grouping.

// umka-core/src/sched/cpufreq.rs

/// Frequency domain: a group of CPUs sharing a clock source.
pub struct FreqDomain {
    /// Domain identifier.
    pub id: FreqDomainId,

    /// CPUs in this domain.
    pub cpus: CpuMask,

    /// Core type of all CPUs in this domain (always uniform within a domain).
    pub core_type: CoreType,

    /// Available OPPs for this domain. Fixed-capacity inline array avoids
    /// heap allocation and keeps OPP data cache-local. 16 entries
    /// accommodates all known hardware OPP tables. If hardware exposes more
    /// than MAX_OPP_ENTRIES OPPs, the driver selects the 16 entries with
    /// the widest frequency spread (lowest, highest, and 14 evenly
    /// distributed intermediate points), which preserves DVFS fidelity for
    /// all practical workloads.
    pub opps: ArrayVec<OppEntry, MAX_OPP_ENTRIES>,  // MAX_OPP_ENTRIES = 16

    /// Current OPP index (into `opps`).
    pub current_opp: AtomicU32,

    /// Aggregate utilization of all CPUs in this domain (sum of PELT util_avg).
    /// Updated at scheduler tick.
    pub domain_util: AtomicU32,

    /// Cpufreq governor for this domain.
    pub governor: CpufreqGovernor,
}
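The OPP-thinning rule described in the `opps` field comment can be sketched as follows. This is an illustrative helper, assuming a sorted input table; it keeps the lowest and highest frequencies and evenly distributed intermediate points:

```rust
const MAX_OPP_ENTRIES: usize = 16;

/// Thin an oversized OPP table down to MAX_OPP_ENTRIES frequencies.
/// `all_khz` must be sorted ascending; the first and last entries
/// (min and max frequency) are always retained.
fn thin_opps(all_khz: &[u64]) -> Vec<u64> {
    let n = all_khz.len();
    if n <= MAX_OPP_ENTRIES {
        return all_khz.to_vec();
    }
    // Evenly spaced indices from 0 to n-1 inclusive.
    (0..MAX_OPP_ENTRIES)
        .map(|i| all_khz[i * (n - 1) / (MAX_OPP_ENTRIES - 1)])
        .collect()
}
```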

/// Cpufreq governor — decides when to change frequency.
pub enum CpufreqGovernor {
    /// Schedutil: frequency tracks utilization (default for EAS).
    /// New frequency = (util / capacity) × max_freq.
    /// Tight integration with scheduler — runs from scheduler context.
    Schedutil,

    /// Performance: always run at max frequency.
    Performance,

    /// Powersave: always run at min frequency.
    Powersave,

    /// Ondemand: legacy userspace sampling (Linux compat).
    Ondemand,

    /// Conservative: like ondemand but ramps gradually.
    Conservative,
}

Schedutil integration: On every scheduler tick, the schedutil governor reads the domain's aggregate utilization and adjusts frequency:

new_freq = (domain_util / domain_capacity) × max_freq

If domain has 4 CPUs at capacity 1024 each:
  domain_capacity = 4096
  If domain_util = 2048 (50% utilized):
    new_freq = (2048 / 4096) × max_freq = 50% of max_freq

Frequency change latency: ~10-50 μs (hardware-dependent).
The governor rate-limits changes to avoid oscillation (~4ms minimum interval).
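The arithmetic above can be sketched directly. The helper names and the OPP table here are illustrative, not the kernel's API; the target is rounded up to the next available OPP so the domain never runs slower than utilization demands:

```rust
/// Raw schedutil target: (util / capacity) × max_freq.
fn schedutil_target_khz(domain_util: u64, domain_capacity: u64, max_freq_khz: u64) -> u64 {
    max_freq_khz * domain_util / domain_capacity
}

/// Round the target up to the next available OPP (table sorted ascending).
/// Falls back to the highest OPP if demand exceeds the table.
fn pick_opp(opps_khz: &[u64], target_khz: u64) -> u64 {
    *opps_khz
        .iter()
        .find(|&&f| f >= target_khz)
        .unwrap_or(opps_khz.last().expect("non-empty OPP table"))
}
```

With the worked example above (util 2048 of capacity 4096, max 3 GHz), the raw target is 1.5 GHz, which then snaps to the nearest OPP at or above that frequency.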

6.1.5.6 Intel Thread Director (ITD) Integration

Intel Thread Director (Hardware Feedback Interface):

The HFI table is memory-mapped:

1. UmkaOS Core allocates a 4KB-aligned physical buffer at boot.
2. Writes the physical address to the IA32_HW_FEEDBACK_PTR MSR (0x17D0).
3. Hardware fills the table with per-class performance/efficiency scores using normal memory stores (not MSR writes).
4. When HFI data is updated, hardware fires a Thermal Interrupt (bit 26 set in IA32_PACKAGE_THERM_STATUS).
5. The interrupt handler reads the updated table via normal memory loads.

Per-thread class ID: read via RDMSR from IA32_THREAD_FEEDBACK_CHAR (per-logical-processor MSR, address 0x17D2). Each thread has a hardware-assigned classification (e.g., integer-heavy, floating-point-heavy, memory-bound) that informs EAS core assignment decisions.

Intel Thread Director is a hardware feature on Alder Lake+ that classifies running workloads and provides hints about which core type is optimal. The hardware monitors instruction mix in real-time and populates the HFI table in memory.

// umka-core/src/sched/itd.rs

/// Intel Thread Director hint (decoded from the memory-mapped HFI table).
/// Hardware populates the table via normal memory stores; UmkaOS reads it
/// on Thermal Interrupt (bit 26 of IA32_PACKAGE_THERM_STATUS).
/// Per-thread class ID is obtained via RDMSR IA32_THREAD_FEEDBACK_CHAR (0x17D2).
pub struct ItdHint {
    /// Hardware's assessment: how much this task benefits from a P-core.
    /// 0 = no benefit (pure memory-bound), 255 = maximum benefit (compute-bound).
    pub perf_capability: u8,

    /// Hardware's assessment: energy efficiency on an E-core.
    /// 0 = poor efficiency on E-core, 255 = excellent efficiency on E-core.
    pub energy_efficiency: u8,

    /// Workload classification.
    pub workload_class: ItdWorkloadClass,
}

#[repr(u8)]
pub enum ItdWorkloadClass {
    /// Scalar integer code — runs well on E-cores.
    ScalarInt       = 0,
    /// Scalar floating-point — runs well on E-cores.
    ScalarFp        = 1,
    /// Vectorized (SSE/AVX) — may benefit from P-cores (wider execution units).
    Vector          = 2,
    /// AVX-512 / AMX — P-core only (E-cores may lack these).
    HeavyVector     = 3,
    /// Branch-heavy — benefits from P-core branch predictor.
    BranchHeavy     = 4,
    /// Memory-bound — core type doesn't matter, memory is the bottleneck.
    MemoryBound     = 5,
}

Integration with EAS: ITD hints are a refinement. EAS uses PELT utilization to pick the energy-optimal core. ITD overrides this when hardware detects a mismatch:

EAS decision: task has low utilization (200/1024) → place on E-core.
ITD override: task is HeavyVector (AVX-512) → E-core lacks AVX-512 → force P-core.

EAS decision: task has high utilization (800/1024) → place on P-core.
ITD override: task is MemoryBound → P-core is wasted → suggest E-core.

ITD hints are read from the memory-mapped HFI table on Thermal Interrupt (not per scheduler tick). On interrupt: one memory load per table row (~4ns), no RDMSR needed for the table itself. Per-thread class ID is fetched via RDMSR IA32_THREAD_FEEDBACK_CHAR (0x17D2H) on context switch: cost ~30ns per switch, only on Intel Alder Lake+. On non-Intel or pre-Alder Lake: ITD is disabled, zero overhead.
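The override logic above can be expressed as a small decision function. A minimal sketch, assuming a simplified two-level `CoreType` (the real scheduler also models mid-tier cores):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum CoreType { Performance, Efficiency }

#[derive(Clone, Copy)]
#[allow(dead_code)]
enum ItdWorkloadClass { ScalarInt, ScalarFp, Vector, HeavyVector, BranchHeavy, MemoryBound }

/// Apply the two ITD overrides on top of the EAS choice:
/// HeavyVector forces a P-core (E-cores lack AVX-512/AMX);
/// MemoryBound releases a P-core back to an E-core (memory is the bottleneck).
fn apply_itd_override(eas_choice: CoreType, class: ItdWorkloadClass) -> CoreType {
    match class {
        ItdWorkloadClass::HeavyVector => CoreType::Performance,
        ItdWorkloadClass::MemoryBound => CoreType::Efficiency,
        _ => eas_choice,
    }
}
```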

Other architectures: ARM and RISC-V do not have an ITD equivalent. On ARM big.LITTLE/DynamIQ, the scheduler relies on the capacity-dmips-mhz device tree property and runtime PELT utilization to make core placement decisions (Section 14.5.3). On RISC-V heterogeneous harts, the riscv,isa device tree property and per-hart ISA capability flags drive placement (Section 6.1.5.9). Neither architecture provides hardware-level workload classification hints — the scheduler's software heuristics (PELT + EAS) perform this role. This is architecturally acceptable: ITD is an optimization (~15-25% better placement for mixed workloads on Intel hybrid), not a correctness requirement.

6.1.5.7 Asymmetric Packing

On heterogeneous systems, idle CPU selection must be topology-aware:

Symmetric (traditional):
  Spread tasks across all CPUs evenly for maximum parallelism.

Asymmetric (big.LITTLE):
  Pack tasks onto efficiency cores first.
  Only spill onto performance cores when efficiency cores are full
  or when a task is too large (misfit).

  Why: an idle performance core at its lowest OPP still draws more power
  than a busy efficiency core. Packing onto efficiency cores first
  minimizes total system power.

Misfit migration: A task is "misfit" if its utilization (pelt.util_avg) exceeds the capacity of the CPU it's currently running on. Misfit tasks are migrated to a higher-capacity CPU at the next load balance opportunity.

/// Check if a task is misfit on its current CPU.
pub fn is_misfit(task: &Task, cpu: &CpuCapacity) -> bool {
    task.pelt.util_avg > cpu.capacity_curr.load(Ordering::Relaxed)
}

/// Misfit migration is checked at every load balance interval (~4ms).
/// If a task is misfit:
///   1. Find the closest (topology-wise) CPU with enough capacity.
///   2. Migrate the task.
///   3. Mark the source CPU's rq->misfit_task flag for faster detection.
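Step 1 of the misfit path can be sketched as follows. This illustrative helper ignores topology distance for brevity and simply picks the smallest-capacity CPU that still fits the task:

```rust
/// Return the index of the smallest-capacity CPU whose capacity covers
/// `util`, or None if no CPU fits (the task stays put, capped by hardware).
fn find_misfit_target(capacities: &[u64], util: u64) -> Option<usize> {
    capacities
        .iter()
        .copied()
        .enumerate()
        .filter(|&(_, cap)| cap >= util)
        .min_by_key(|&(_, cap)| cap)
        .map(|(i, _)| i)
}
```

Choosing the smallest sufficient capacity (rather than the largest available) keeps P-cores free for tasks that genuinely need them, consistent with asymmetric packing.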

6.1.5.8 Cgroup Integration

Cgroups can constrain which core types a group of tasks may use:

/sys/fs/cgroup/<group>/cpu.core_type
# Allowed core types for this cgroup.
# "all"         — any core type (default)
# "performance" — only P-cores (latency-critical workloads)
# "efficiency"  — only E-cores (background/batch workloads)
# Multiple: "performance mid" — P-cores and mid-tier cores

/sys/fs/cgroup/<group>/cpu.capacity_min
# Minimum per-CPU capacity for tasks in this cgroup.
# Tasks will not be placed on CPUs with capacity below this value.
# Default: 0 (no minimum).
# Example: "512" — only run on CPUs with at least half maximum capacity.

/sys/fs/cgroup/<group>/cpu.capacity_max
# Maximum per-CPU capacity for tasks in this cgroup.
# Tasks will not be placed on CPUs with capacity above this value.
# Default: 1024 (no maximum).
# Example: "400" — only run on efficiency cores.

Use case examples:

# Kubernetes: latency-critical pod on P-cores only
echo "performance" > /sys/fs/cgroup/k8s-pod-frontend/cpu.core_type

# Background log processing: E-cores only (save P-cores for real work)
echo "efficiency" > /sys/fs/cgroup/k8s-pod-logshipper/cpu.core_type

# ML training: needs AVX-512 (P-cores on Intel, ISA-gated)
echo "512" > /sys/fs/cgroup/k8s-pod-training/cpu.capacity_min

6.1.5.9 RISC-V Heterogeneous Hart Support

RISC-V takes heterogeneity further: different harts (hardware threads) may have different ISA extensions. One hart may have the Vector extension (RVV), another may not. One hart may support the Hypervisor extension (H), another may not.

// umka-core/src/sched/riscv.rs

/// RISC-V ISA extension discovery per hart.
/// Read from the devicetree `riscv,isa` property for each hart.
///
/// Example devicetree:
///   cpu@0 { riscv,isa = "rv64imafdc_zba_zbb_v"; };  // Vector-capable
///   cpu@1 { riscv,isa = "rv64imafd"; };               // No vector
pub fn discover_hart_capabilities(dt: &DeviceTree) -> BootVec<IsaCapabilities> {
    // Parse each hart's ISA string.
    // Set IsaCapabilities flags accordingly.
    // The scheduler uses these to ensure tasks that use Vector
    // instructions only run on Vector-capable harts.
}

ISA gating in the scheduler:

Task affinity includes ISA requirements:

struct TaskAffinityHint {
    /// ISA extensions this task requires (detected from ELF header
    /// or set by userspace via prctl).
    pub isa_required: IsaCapabilities,

    /// Core type preference (from cgroup or auto-detected).
    pub core_preference: CorePreference,

    /// PELT utilization (for EAS).
    pub util_avg: u32,
}

Scheduler check:
  if !cpu.isa_caps.contains(task.affinity.isa_required) {
      // This CPU lacks ISA extensions the task needs.
      // Skip this CPU. Do NOT schedule here.
      continue;
  }

This prevents the illegal-instruction faults that result from scheduling a Vector task on a non-Vector hart, a real hazard on Linux today: Linux began adding per-hart ISA detection in 6.2+ and capability tracking in 6.4+, but it does not yet feed per-hart ISA awareness into the scheduler's task placement decisions, so a Vector-using task can still be scheduled on a non-Vector hart.
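The containment check itself is a bitmask subset test. A minimal sketch with illustrative flag values (the real `IsaCapabilities` type lives in umka-core):

```rust
// Illustrative ISA capability flags; bit assignments are hypothetical.
const ISA_F: u64 = 1 << 0; // single-precision float
const ISA_D: u64 = 1 << 1; // double-precision float
const ISA_V: u64 = 1 << 2; // Vector extension (RVV)

/// A CPU is eligible only if it implements every extension the task needs:
/// the task's requirement set must be a subset of the CPU's capability set.
fn cpu_eligible(cpu_caps: u64, task_required: u64) -> bool {
    cpu_caps & task_required == task_required
}
```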

6.1.5.10 Topology Discovery

The scheduler builds its heterogeneous topology model from firmware:

Sources (checked in order):

1. ACPI PPTT (Processor Properties Topology Table):
   — Provides core type, cache hierarchy, frequency domain.
   — Available on ARM SBSA servers and Intel Alder Lake+.

2. ACPI CPPC (Collaborative Processor Performance Control):
   — Provides per-CPU performance range (highest/lowest/nominal).
   — The ratio highest_perf / nominal_perf indicates core type:
     P-cores have higher highest_perf than E-cores.
   — Used on Intel hybrid platforms.

3. Devicetree:
   — ARM and RISC-V embedded systems.
   — `capacity-dmips-mhz` property gives relative core performance.
   — RISC-V `riscv,isa` property gives per-hart ISA extensions.

4. Intel CPUID leaf 0x1A (Hybrid Information):
   — Reports core type (Atom = E-core, Core = P-core) for the running CPU.
   — Each CPU reads its own CPUID at boot.

5. Fallback: runtime measurement
   — If no firmware data: run a calibration loop on each CPU at boot.
   — Measure instructions-per-second to derive relative capacity.
   — Last resort. ~100ms at boot.
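The normalization in the fallback path (source 5) can be sketched as: anchor the fastest CPU at capacity 1024 and scale the rest proportionally. The helper name is illustrative:

```rust
/// Convert measured per-CPU instructions-per-second into the 0–1024
/// capacity scale, anchored at the fastest CPU.
fn capacities_from_ips(ips: &[u64]) -> Vec<u64> {
    // Guard against an empty or all-zero measurement set.
    let max = ips.iter().copied().max().unwrap_or(1).max(1);
    ips.iter().map(|&v| v * 1024 / max).collect()
}
```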

6.1.5.11 Linux Compatibility

All Linux interfaces for heterogeneous CPU systems are supported:

/sys/devices/system/cpu/cpuN/cpu_capacity
# Read-only. Capacity of CPU N (0–1024).
# Written by the kernel at boot. Used by userspace tools.
# e.g., "1024" for P-core, "512" for E-core.

/sys/devices/system/cpu/cpuN/topology/core_type
# "performance", "efficiency", or "unknown".
# New in Linux 6.x, used by systemd and schedulers.

/sys/devices/system/cpu/cpufreq/policyN/
# Standard cpufreq interface (per frequency domain):
# scaling_governor, scaling_cur_freq, scaling_max_freq, etc.

sched_setattr(pid, &attr):
# SCHED_FLAG_UTIL_CLAMP: set min/max utilization clamp.
# util_min/util_max affect EAS placement decisions.
# Fully supported with same semantics as Linux.

prctl(PR_SCHED_CORE, ...):
# Core scheduling (co-scheduling related tasks on the same core).
# Supported.

Kernel command line:
# isolcpus=2-3 (reserve CPUs, same as Linux)
# nohz_full=2-3 (tickless for RT, same as Linux)
# nosmt (disable SMT, same as Linux)

sched_ext compatibility: Linux 6.12+ allows BPF scheduling policies via sched_ext. UmkaOS provides the foundation for sched_ext through the eBPF subsystem (Section 18.1.4) and the sched_setattr interface (Section 18.1). Full sched_ext support requires the BPF struct_ops framework and sched_ext-specific kfuncs (scx_bpf_dispatch, scx_bpf_consume, etc.), which are part of the eBPF subsystem implementation. BPF schedulers that use sysfs topology files and sched_setattr for configuration will work without modification once the struct_ops infrastructure is in place.

6.1.5.12 Performance Impact

Symmetric systems (all cores identical):
  EAS: disabled (enabled == false). One branch check at task wakeup: ~1 cycle.
  Capacity model: all CPUs = 1024. No capacity checks affect scheduling decisions.
  PELT: runs regardless (already exists in the standard EEVDF scheduler). Zero additional cost.
  Total overhead vs Linux on symmetric: ZERO.

Heterogeneous systems (big.LITTLE, Intel hybrid):
  EAS: ~200-500ns per task wakeup (iterate 2-3 domains × 4-6 OPPs).
  Task wakeup total (without EAS): ~1500-2000ns.
  Task wakeup total (with EAS): ~1700-2500ns.
  Overhead: ~15-25% of wakeup path. Same as Linux EAS.

  Large topology scaling: On 128-core NUMA systems with 4 NUMA nodes and 3
  frequency domains per node, wakeup scanning covers up to 12 frequency
  domains × 128 CPUs = up to 1536 capacity lookups in the worst case. With
  cache misses on remote NUMA nodes, this can reach 4-40 μs — exceeding the
  200-500 ns estimate for small systems. UmkaOS mitigates this with:
    (1) per-domain capacity caches refreshed on topology change, not per-wakeup;
    (2) early termination when a suitable idle CPU is found;
    (3) the eas_max_domains sysctl (default 8) caps the search depth.
  On systems where EAS latency exceeds 1 μs P99, set eas_max_domains=4 or
  disable EAS entirely (kernel.sched_energy_aware=0). The 200-500 ns estimate
  applies to systems with ≤64 cores and ≤4 frequency domains.

  ITD class ID fetch (RDMSR IA32_THREAD_FEEDBACK_CHAR): ~30ns per context switch.
  HFI table read: on Thermal Interrupt only (infrequent, hardware-triggered).
  Combined overhead: negligible (context-switch-bound, not tick-bound).

  Misfit check: one comparison per load balance (~4ms). Negligible.

  Benefit: 20-40% power reduction for mixed workloads (measured on
  Linux EAS vs non-EAS on ARM big.LITTLE). Same benefit expected.

  This is not overhead — it's a power optimization. The CPU time spent
  on EAS decisions is recovered many times over in power savings.

Summary: UmkaOS's heterogeneous scheduling has identical performance
to Linux EAS on the same hardware. The algorithms are the same (PELT,
EAS energy computation, schedutil). The implementation is clean-sheet
Rust but the scheduling mathematics are equivalent.

See also: Section 6.4 (Power Budgeting) extends EAS with system-level power caps and per-domain throttling. Section 21.6 (Unified Compute Model) generalizes the CpuCapacity scalar into a multi-dimensional capacity vector spanning CPUs, GPUs, and accelerators.

6.1.6 Extended Register State Management

Modern x86 CPUs carry large amounts of extended register state beyond the basic GPRs and x87 FPU. Blindly saving and restoring all of this on every context switch is wasteful — most threads never touch AVX-512 or AMX.

The cost problem:

| State component   | Size                          | Save/restore cost |
|-------------------|-------------------------------|-------------------|
| x87 + SSE (XMM)   | 576 bytes                     | ~20 ns            |
| AVX (YMM)         | 256 bytes                     | ~10 ns            |
| AVX-512 (ZMM)     | 2048 bytes                    | ~80 ns            |
| Intel AMX (tiles) | 8192 bytes                    | ~300 ns           |
| ARM SVE (Z regs)  | 256–8192 bytes (VL-dependent) | ~100–500 ns       |

On a server running thousands of threads with microsecond-scale scheduling, 300ns of AMX save/restore overhead per switch is significant.

Lazy XSAVE policy:

UmkaOS tracks per-thread which extended state components have actually been used via an xstate_used bitmap that mirrors the hardware XSTATE_BV field:

/// Per-thread extended state tracking.
struct ThreadXState {
    /// Bitmap of XSAVE components this thread has used since creation.
    /// Mirrors hardware XSTATE_BV layout (bit 0 = x87, bit 1 = SSE, bit 2 = AVX, etc.)
    xstate_used: u64,

    /// Dynamically-allocated XSAVE area. Starts as None; allocated on first use.
    /// Size depends on which components are enabled (CPUID leaf 0xD).
    xsave_area: Option<XSaveArea>,
}

Context switch optimization:

On context_switch(prev, next):
  1. Determine prev's dirty components: xstate_dirty = prev.xstate_used & XSTATE_MODIFIED_BITS
  2. XSAVE only the dirty components (XSAVES with prev.xstate_used as the mask)
     — If prev never used AVX-512, the ZMM state is NOT saved (zero cost)
  3. XRSTOR next's components (XRSTORS with next.xstate_used as the mask)
     — Components not in next's mask are initialized to their reset state by hardware

Modified Optimization:
  - If a thread hasn't executed any AVX-512/AMX instruction since the last context switch,
    the corresponding XSTATE_BV bits are clear — XSAVES skips those components automatically.
  - The kernel does NOT need to track this manually; it falls out of the hardware XSAVE
    optimized mode (XSAVES/XRSTORS with INIT optimization).

Init optimization (demand allocation):

Threads that never use extended SIMD pay zero XSAVE cost:

1. New thread starts with xstate_used = 0, xsave_area = None
2. CR0.TS is set (or, for AMX, the IA32_XFD MSR disables tile state) — any use
   of SIMD triggers #NM (Device Not Available exception)
3. #NM handler:
   a. Allocate xsave_area (sized per CPUID leaf 0xD for the used components)
   b. Set appropriate bits in xstate_used
   c. Clear CR0.TS (or the relevant IA32_XFD bit)
   d. Return — the faulting SIMD instruction re-executes successfully
4. Subsequent SIMD use proceeds without trapping

This means a thread that only does integer arithmetic and memory copies never allocates an XSAVE area and never incurs XSAVE/XRSTOR cost during context switch.
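The demand-allocation life cycle can be simulated in miniature. In this sketch the XSAVE area is modeled as a plain byte buffer and the size argument is illustrative; the real kernel sizes the area from CPUID leaf 0xD:

```rust
// Illustrative XSTATE component bits (mirroring XSTATE_BV layout).
const XSTATE_SSE: u64 = 1 << 1;
const XSTATE_AVX: u64 = 1 << 2;

struct ThreadXState {
    /// Components this thread has used since creation.
    xstate_used: u64,
    /// Demand-allocated save area (modeled here as a byte buffer).
    xsave_area: Option<Vec<u8>>,
}

impl ThreadXState {
    /// Step 1: a new thread has no area and no components — zero switch cost.
    fn new() -> Self {
        Self { xstate_used: 0, xsave_area: None }
    }

    /// Step 3: the #NM handler allocates on first use and records the component.
    /// (The real handler also clears CR0.TS / the IA32_XFD bit before IRET.)
    fn on_nm_fault(&mut self, component: u64, area_size: usize) {
        if self.xsave_area.is_none() {
            self.xsave_area = Some(vec![0u8; area_size]);
        }
        self.xstate_used |= component;
    }
}
```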

AMX special case:

Intel AMX tile registers (8KB) are especially expensive. Additional optimization: - AMX has a TILERELEASE instruction that explicitly marks tile state as unused. - UmkaOS's kernel scheduler can hint userspace (via prctl or arch_prctl) to call TILERELEASE when exiting a compute-intensive section, so the next context switch doesn't save 8KB of dead tile state. - If a thread hasn't used AMX tiles in the last N context switches (configurable, default N=8), the kernel deallocates the AMX portion of the XSAVE area to reclaim memory.

ARM SVE/SME (AArch64):

ARM's Scalable Vector Extension has a variable vector length (128–2048 bits). The same lazy-allocation strategy applies, with ARM-specific mechanisms:

SVE state components and sizes (VL-dependent):

| Component        | Size at VL=128 | Size at VL=512 | Size at VL=2048 |
|------------------|----------------|----------------|-----------------|
| Z registers (Z0-Z31) | 512 bytes | 2048 bytes    | 8192 bytes      |
| P predicates (P0-P15) | 32 bytes  | 128 bytes     | 512 bytes       |
| FFR (first-fault) | 2 bytes       | 8 bytes        | 32 bytes        |
| Total             | 546 bytes     | 2184 bytes     | 8736 bytes      |

Lazy SVE allocation policy:

1. New thread starts with SVE disabled (CPACR_EL1.ZEN = 0b00).
2. First SVE instruction traps to EL1 (SVE access trap, ESR_EL1.EC = 0x19).
3. Trap handler:
   a. Read current VL from ZCR_EL1 (or inherit from parent thread).
   b. Allocate SVE save area sized for current VL.
   c. Set CPACR_EL1.ZEN = 0b11 (untrap SVE for EL0 and EL1).
   d. Return — the faulting SVE instruction re-executes.
4. Context switch saves/restores only if thread has SVE enabled:
   a. Check CPACR_EL1.ZEN — if SVE was never used, skip (zero cost).
   b. If used: SVE_ST (store Z/P/FFR) and SVE_LD (load) to save area.
   c. Cost: proportional to VL, not fixed. VL=128: ~50ns. VL=512: ~200ns.

VL management:
  - Per-thread VL via prctl(PR_SVE_SET_VL, new_vl).
  - VL change takes effect on next context restore (no immediate effect).
  - If new_vl > old_vl, the save area is reallocated (grown, not shrunk).
  - System default VL set via sysctl: kernel.sve_default_vl = 256.
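The VL-dependent sizing and the grow-only reallocation rule can be sketched directly from the table above (helper names are illustrative):

```rust
/// SVE save-area size for a given vector length in bits:
/// 32 Z registers of VL/8 bytes, 16 P predicates of VL/64 bytes,
/// and the FFR at VL/64 bytes. Matches the table: 546 bytes at VL=128.
fn sve_state_bytes(vl_bits: usize) -> usize {
    32 * (vl_bits / 8) + 16 * (vl_bits / 64) + vl_bits / 64
}

/// PR_SVE_SET_VL semantics from above: the save area is grown when the
/// new VL needs more space, and never shrunk.
fn resize_for_vl(area: &mut Vec<u8>, new_vl_bits: usize) {
    let needed = sve_state_bytes(new_vl_bits);
    if needed > area.len() {
        area.resize(needed, 0);
    }
}
```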

SME (Scalable Matrix Extension, ARMv9.2) provides matrix tiles analogous to AMX:

SME state:
  - ZA tile register: (SVL/8) x (SVL/8) bytes, where SVL is the Streaming
    Vector Length in BITS. At SVL=512 bits: 64x64 = 4KB. At the maximum
    SVL=2048 bits: 256x256 = 64KB.
  - Streaming SVE mode (SSVE): uses SVE registers at streaming VL.

Lazy SME allocation:
  1. SMSTART (enter streaming mode) traps while CPACR_EL1.SMEN keeps SME disabled.
  2. Handler allocates ZA storage and enables SME.
  3. SMSTOP (exit streaming mode) marks ZA as inactive.
     If ZA is inactive for N switches (default 4), deallocate ZA storage.
     This matters for memory pressure: ZA at SVL=2048 is 64KB per thread.

Context switch for SME:
  - Check PSTATE.SM (streaming mode active) and PSTATE.ZA (ZA active).
  - If neither: zero cost.
  - If ZA active: save ZA tile (up to 64KB at max SVL=2048). This is expensive.
  - The scheduler deprioritizes SME-heavy threads from migration to minimize
    ZA save/restore on context switch (locality preference).
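The ZA sizing formula and the inactivity-based reclaim policy can be sketched as follows (the struct and method names are illustrative):

```rust
/// ZA tile size: (SVL/8) × (SVL/8) bytes, with SVL given in bits.
/// At SVL=512: 64 × 64 = 4KB. At SVL=2048: 256 × 256 = 64KB.
fn za_bytes(svl_bits: usize) -> usize {
    let dim = svl_bits / 8;
    dim * dim
}

/// Track ZA inactivity across context switches; after `limit` consecutive
/// switches with ZA inactive, the (possibly 64KB) buffer is reclaimed.
struct ZaState {
    buf: Option<Vec<u8>>,
    inactive_switches: u32,
}

impl ZaState {
    fn on_context_switch(&mut self, za_active: bool, limit: u32) {
        if za_active {
            self.inactive_switches = 0;
        } else {
            self.inactive_switches += 1;
            if self.inactive_switches >= limit {
                self.buf = None; // reclaim ZA storage under memory pressure
            }
        }
    }
}
```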

ARMv7 VFP/NEON:

ARMv7 extended register state is simpler than x86 or AArch64 — only VFP (Vector Floating Point) and NEON (SIMD) use extended registers:

ARMv7 extended state:

| Component        | Size       | Save/restore cost |
|------------------|------------|-------------------|
| VFP/NEON (D0-D31)| 256 bytes  | ~15-30 ns         |
| FPSCR            | 4 bytes    | ~5 ns             |

Total: 260 bytes per thread. Always the same size (no variable-length
extensions like SVE or AVX-512).

Lazy allocation policy for ARMv7:

1. New thread starts with VFP/NEON disabled (FPEXC.EN = 0).
2. First VFP/NEON instruction triggers #UND trap.
3. Handler:
   a. Allocate 260-byte VFP save area.
   b. Set FPEXC.EN = 1 (enable VFP/NEON).
   c. Return — faulting instruction re-executes.
4. Context switch: VSTM/VLDM to save/restore D0-D31 + FPSCR.
   Cost is fixed (~30ns) regardless of which registers were used.

ARMv7 has no equivalent of XSAVE's selective save — all 32 double-word
registers are saved/restored as a unit. The fixed 260-byte size means no
dynamic allocation complexity. Threads that never use floating-point or
NEON pay zero VFP save/restore cost (same lazy trap approach as x86 #NM).

RISC-V Vector Extension (RVV):

RISC-V Vector extension (RVV 1.0, ratified 2021) has variable vector length (VLEN: 128–65536 bits), making it the most flexible — and most complex to manage — of any supported architecture:

RVV state components:

| Component           | Size at VLEN=128 | Size at VLEN=256 | Size at VLEN=1024 |
|---------------------|------------------|------------------|-------------------|
| V registers (v0-v31)| 512 bytes        | 1024 bytes       | 4096 bytes        |
| vtype, vl, vstart   | 24 bytes         | 24 bytes         | 24 bytes          |
| vcsr (vxrm, vxsat)  | 8 bytes          | 8 bytes          | 8 bytes           |
| Total               | 544 bytes        | 1056 bytes       | 4128 bytes        |

Lazy RVV allocation policy:

1. New thread starts with Vector disabled (mstatus.VS = Off).
2. First vector instruction triggers illegal-instruction trap.
3. Handler:
   a. Read VLEN from hart capabilities (discovered at boot from DT or CSR).
   b. Allocate vector save area: 32 × (VLEN/8) + overhead bytes.
   c. Set mstatus.VS = Initial (vector state clean).
   d. Return — faulting instruction re-executes.
4. Context switch:
   a. Check mstatus.VS — if Off, skip entirely (zero cost).
   b. If Dirty: save all 32 V registers using 4× vs8r.v (v0-v7, v8-v15,
      v16-v23, v24-v31). The RVV spec maximum whole-register store is 8
      registers per instruction; there is no vs32r.v.
   c. If Clean: skip save (state hasn't changed since last restore).
   d. Restore: 4× vl8re8.v to load all 32 registers from next thread's area.
      Total: 8 whole-register instructions (4 stores + 4 loads).
      Set mstatus.VS = Clean after restore.

Per-hart VLEN variation:
  On heterogeneous RISC-V systems, different harts may have different VLEN.
  The scheduler tracks per-hart VLEN (discovered at boot). A thread that
  uses vector instructions with VLEN=256 can only run on harts with
  VLEN >= 256. This is integrated with the ISA-gating mechanism
  (Section 6.1.5.9): VectorLengthInfo (companion to IsaCapabilities) encodes
  per-thread VLEN. The scheduler uses rvv_vlen_bits to constrain hart placement.
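
The sizing rule and the placement constraint above can be sketched together (helper names are illustrative):

```rust
/// RVV save-area size: 32 V registers of VLEN/8 bytes, plus vtype/vl/vstart
/// (24 bytes) and vcsr (8 bytes), per the state-component table above.
fn rvv_save_area_bytes(vlen_bits: usize) -> usize {
    32 * (vlen_bits / 8) + 24 + 8
}

/// A vector-using thread may only run on harts whose VLEN is at least the
/// VLEN the thread was bound to on first vector use.
fn hart_eligible(thread_vlen: usize, hart_vlen: usize) -> bool {
    hart_vlen >= thread_vlen
}

fn main() {
    assert_eq!(rvv_save_area_bytes(128), 544);
    assert_eq!(rvv_save_area_bytes(256), 1056);
    assert_eq!(rvv_save_area_bytes(1024), 4128);
    assert!(hart_eligible(128, 256));   // wider hart is fine
    assert!(!hart_eligible(512, 256));  // narrower hart is excluded
}
```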

Alignment with scheduler hints:

The scheduler already tracks ISA capability usage per-thread (Section 6.1.5, lines 3053–3059: IsaCapabilities bitflags including X86_AVX512, X86_AMX, ARM_SME). The XSAVE policy uses the same flags:
  - A thread with X86_AVX512 set in its IsaCapabilities is known to use AVX-512. The scheduler places it on a P-core (avoiding frequency throttling on E-cores).
  - The XSAVE subsystem uses the same flag to pre-allocate the ZMM XSAVE area, avoiding the #NM trap latency on the first AVX-512 instruction for threads that are known to use it (e.g., after an execve of a binary with AVX-512 in its ELF .note.gnu.property).

6.1.6.1 Saved State Composition by Architecture

This subsection enumerates exactly which registers UmkaOS saves and restores on each supported architecture during a context switch. For every architecture, state is divided into three categories: (1) general-purpose and control registers that are always saved; (2) extended floating-point/SIMD state that is saved lazily (only when the thread has actually used the relevant unit); and (3) debug registers that are saved only when hardware breakpoints are active. The lazy FP/SIMD policy is described in the preceding subsection; the per-architecture enable bits that control laziness are called out explicitly below.

x86-64:

Always saved (integer/control registers):

| Register group      | Registers | Notes |
|---------------------|-----------|-------|
| General-purpose     | RAX, RBX, RCX, RDX, RSI, RDI, RBP, R8–R15 | 15 explicit GPRs; RSP is implicit in the kernel stack switch |
| Stack pointer       | RSP       | Saved via kernel stack pointer swap in the switch stub |
| Instruction pointer | RIP       | Saved via the call/ret discipline in the switch stub; not written explicitly |
| Flags               | RFLAGS    | Saved with pushfq/popfq |
| TLS base registers  | FS.base, GS.base | Written via WRMSR/RDMSR (IA32_FS_BASE / IA32_GS_BASE); userspace TLS and kernel per-CPU pointer respectively. SWAPGS handles the kernel↔user GS.base exchange on entry/exit |

Extended state (lazy — saved only when xstate_used bitmap indicates use):

The extended state area is allocated as a single XSAVE-formatted buffer, size determined at boot from CPUID leaf 0xD subleaf 0 (ECX = full XSAVE area size for all enabled components). UmkaOS uses XSAVES/XRSTORS (compacted format) to save only the components indicated by the task's xstate_used bitmap.

| Component             | Registers | XSAVE component bit | Trigger |
|-----------------------|-----------|---------------------|---------|
| x87 FPU               | ST0–ST7, FIP, FDP, FOP, FCW, FSW, FTW | Bit 0 | Any x87 instruction |
| SSE                   | XMM0–XMM15, MXCSR | Bit 1 | Any SSE/SSE2 instruction |
| AVX                   | YMM0–YMM15 upper 128-bit halves | Bit 2 | Any AVX instruction |
| AVX-512 opmask        | K0–K7 | Bit 5 | Any AVX-512 masked operation |
| AVX-512 ZMM hi256     | ZMM0–ZMM15 upper 256-bit halves | Bit 6 | Any AVX-512 ZMM instruction |
| AVX-512 ZMM hi16      | ZMM16–ZMM31 full 512-bit | Bit 7 | Any AVX-512 ZMM16–31 instruction |
| Intel AMX tile config | TILECFG | Bit 17 | LDTILECFG instruction |
| Intel AMX tile data   | Up to 8 tiles × 1024 bytes = 8192 bytes | Bit 18 | Any AMX tile compute instruction |

Whether AVX, AVX-512, or AMX are present is determined at boot from CPUID leaf 0x7 and leaf 0xD; UmkaOS enables only the components reported by the hardware.
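
The trigger bits above form the task's xstate_used bitmap; the mask actually passed to XSAVES is the intersection of that bitmap with the components enabled at boot. A sketch using the component bit numbers from the table (constant and function names are illustrative):

```rust
// XSAVE component bits, per the table above.
const X87: u64       = 1 << 0;
const SSE: u64       = 1 << 1;
const AVX: u64       = 1 << 2;
const OPMASK: u64    = 1 << 5;
const ZMM_HI256: u64 = 1 << 6;
const HI16_ZMM: u64  = 1 << 7;
const TILECFG: u64   = 1 << 17;
const TILEDATA: u64  = 1 << 18;

/// Requested-feature bitmap for XSAVES: only components both used by the
/// task and enabled by the kernel at boot are saved.
fn xsaves_rfbm(xstate_used: u64, boot_enabled: u64) -> u64 {
    xstate_used & boot_enabled
}

fn main() {
    // CPU without AMX: AVX-512 family enabled, tile components not.
    let enabled = X87 | SSE | AVX | OPMASK | ZMM_HI256 | HI16_ZMM;
    // Task touched SSE, AVX, and (speculatively) AMX.
    let used = SSE | AVX | TILECFG | TILEDATA;
    // The AMX bits drop out because the CPU never enabled them.
    assert_eq!(xsaves_rfbm(used, enabled), SSE | AVX);
}
```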

Debug registers (lazy — saved only when hardware breakpoints are active):

DR0–DR3 (break address), DR6 (status), DR7 (control). Saved on context switch out and restored on context switch in for any thread that has set hardware breakpoints (indicated by a per-task debug_active flag). DR4 and DR5 are aliases of DR6 and DR7; DR4/DR5 are not saved separately.

AArch64:

Always saved:

| Register group      | Registers | Notes |
|---------------------|-----------|-------|
| General-purpose     | X0–X30    | 31 integer registers; XZR (X31) is hardwired zero and never saved |
| User stack pointer  | SP_EL0    | User-mode stack pointer; saved as a plain 64-bit value |
| Instruction pointer | PC        | Saved via the ret target in the switch stub (stored in the thread's saved x30/LR slot pointing to the resume label) |
| Process state       | SPSR_EL1  | Saved process state register; encodes NZCV flags, DAIF mask, execution state, SP selection |
| User TLS pointer    | TPIDR_EL0 | User-readable thread pointer register; holds glibc TLS base |

UmkaOS's per-CPU kernel pointer lives in TPIDR_EL1; it is not a per-thread value and is not saved/restored on context switch.

Extended state (lazy):

NEON/FP state is controlled by CPACR_EL1.FPEN. If FPEN is set to 0b00 (trapping), any FP/NEON instruction from EL0 or EL1 takes a trap that triggers allocation and enable. Once enabled, NEON/FP is always saved (there is no hardware equivalent to XSAVE's component-level selective save on non-SVE AArch64).

| Component      | Registers | Save size | Enable bit | Trigger |
|----------------|-----------|-----------|------------|---------|
| NEON/FP        | V0–V31 (128-bit each), FPSR, FPCR | 528 bytes | CPACR_EL1.FPEN | Any FP or NEON instruction |
| SVE (FEAT_SVE) | Z0–Z31 (variable, up to 2048 bits each), P0–P15 (predicates), FFR | VL-dependent (see Section 6.1.6) | CPACR_EL1.ZEN | Any SVE instruction; first use traps to #UND |
| SME (FEAT_SME) | ZA tile array (SVL-dependent, up to 64 KB at SVL=2048 bits), streaming SVE state | SVL-dependent | SMCR_EL1.ENA | SMSTART instruction; traps if ENA=0 |

SVE and SME presence is determined from CPUID registers ID_AA64PFR0_EL1 and ID_AA64SMFR0_EL1 at boot. SVE vector length (VL) is read from ZCR_EL1; it may differ per-CPU cluster on heterogeneous SoCs, which is reflected in IsaCapabilities. SME streaming vector length (SVL) is read from SMCR_EL1.

Debug registers (lazy):

DBGBVR0–DBGBVR15 (breakpoint value registers) and DBGBCR0–DBGBCR15 (breakpoint control registers), plus DBGWVR0–DBGWVR15 and DBGWCR0–DBGWCR15 (watchpoints). The number of implemented breakpoints and watchpoints (up to 16 each) is read from ID_AA64DFR0_EL1 at boot. Saved only when the per-task debug_active flag is set.

ARMv7:

Always saved:

| Register group   | Registers | Notes |
|------------------|-----------|-------|
| General-purpose  | R0–R14    | 15 integer registers; R15 (PC) is handled by the switch stub's bx lr return |
| Process state    | CPSR      | Current Program Status Register (condition flags, mode bits, interrupt masks) |
| User TLS pointer | TPIDRURW  | User-read/write thread pointer register; holds glibc TLS base |

Extended state (lazy):

VFP/NEON state is controlled by the FPEXC.EN bit. With EN=0, any VFP or NEON instruction from any privilege level traps to the undefined-instruction handler, which allocates the save area and sets EN=1.

| Component | Registers | Save size | Enable bit | Trigger |
|-----------|-----------|-----------|------------|---------|
| VFP/NEON  | D0–D31 (64-bit each), FPSCR | 260 bytes | FPEXC.EN | Any VFP or NEON instruction |

ARMv7 has no hardware equivalent to XSAVE's component-level selective save: all 32 doubleword registers are saved and restored as a unit using VSTMIA/VLDMIA. The fixed 260-byte save area is allocated on first VFP/NEON use and is never resized. Threads that never use floating-point or NEON pay zero VFP save cost.

Presence of the VFP and NEON units is detected from FPSID and MVFR0 at boot. Some ARMv7 implementations (e.g., Cortex-M targets) omit VFP entirely; UmkaOS skips all VFP save/restore logic on those cores.

Debug registers (lazy):

BVR0–BVR15 (Breakpoint Value Registers) and BCR0–BCR15 (Breakpoint Control Registers), plus WVR/WCR pairs for watchpoints. Count is read from DBGDIDR at boot. Saved only when debug_active is set.

RISC-V (RV64GC):

Always saved:

| Register group      | Registers | Notes |
|---------------------|-----------|-------|
| General-purpose     | x1–x31    | 31 integer registers; x0 is hardwired zero and is never saved |
| Instruction pointer | PC        | Saved via the ra (x1) convention in the switch stub; the stub stores the resume label in ra before saving and jumps via ret on restore |
| Thread pointer      | x4 (tp)   | Holds the UmkaOS per-task kernel TLS base when in kernel mode; userspace tp is preserved in the task's saved register frame |

Extended state (lazy):

Floating-point and vector state laziness is implemented using the FS and VS fields in the sstatus CSR. Hardware sets FS/VS to Dirty whenever the corresponding register set is written; UmkaOS checks this flag at context switch time rather than maintaining a separate software bitmap.

| Component            | Registers | Save size | sstatus field | Trigger |
|----------------------|-----------|-----------|---------------|---------|
| F extension (single) | f0–f31, fcsr | 132 bytes | FS | Any F-extension instruction when FS = Off → traps; when FS = Initial or Dirty → no trap |
| D extension (double) | f0–f31 (64-bit view), fcsr | 260 bytes | FS (shared with F) | Any D-extension instruction |
| V extension (vector) | v0–v31 (variable length), vcsr, vl, vtype, vstart | VLEN-dependent (see Section 6.1.6) | VS | Any V-extension instruction when VS = Off → illegal-instruction trap |

The F and D extensions share the same register file and FS field; D is a superset of F. If both are present, UmkaOS always saves 64-bit doubles. The presence of F, D, and V extensions is determined from the misa CSR and from the ISA string in the device tree or SBI firmware at boot.

Context switch policy for float/vector:
  - If FS = Off: no FP registers are saved (zero cost).
  - If FS = Initial: registers are in reset state; skip save, but restore from a canonical all-zeros area on next thread's restore (or leave as Initial).
  - If FS = Dirty: save all f0–f31 and fcsr. After save, set FS = Clean.
  - If VS = Off: no vector registers are saved.
  - If VS = Dirty: save all v0–v31 plus vcsr/vl/vtype/vstart. After save, set VS = Clean.
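
This policy amounts to a small state machine over the privileged-spec field encodings (Off=0, Initial=1, Clean=2, Dirty=3). A sketch (names illustrative):

```rust
/// sstatus.FS / sstatus.VS two-bit field encodings (RISC-V privileged spec).
#[derive(Clone, Copy, PartialEq, Debug)]
enum ExtStatus { Off = 0, Initial = 1, Clean = 2, Dirty = 3 }

/// Only Dirty state requires a save on context switch out.
fn save_needed(s: ExtStatus) -> bool {
    s == ExtStatus::Dirty
}

/// After saving a Dirty unit, hardware state is marked Clean.
fn after_save(s: ExtStatus) -> ExtStatus {
    if s == ExtStatus::Dirty { ExtStatus::Clean } else { s }
}

fn main() {
    assert!(!save_needed(ExtStatus::Off));      // unit never used: zero cost
    assert!(!save_needed(ExtStatus::Initial));  // reset state: skip save
    assert!(!save_needed(ExtStatus::Clean));    // unchanged since last restore
    assert!(save_needed(ExtStatus::Dirty));
    assert_eq!(after_save(ExtStatus::Dirty), ExtStatus::Clean);
}
```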

On heterogeneous RISC-V systems where different harts have different VLEN, a thread's VLEN is fixed at the VLEN of the hart that first executed a vector instruction. The scheduler then constrains that thread to harts with matching VLEN (see VectorLengthInfo in Section 6.1.5).

Debug registers (lazy):

The RISC-V debug trigger module provides tselect, tdata1, tdata2, and optionally tdata3 CSRs for configuring hardware breakpoints and watchpoints. The number of implemented triggers is determined at boot by iterating tselect until it wraps. Saved only when debug_active is set.

PPC32:

Always saved:

| Register group      | Registers | Notes |
|---------------------|-----------|-------|
| General-purpose     | R0–R31    | 32 integer registers |
| Special-purpose     | LR, CTR, XER, CR | Link register, count register, integer exception register, condition register |
| Instruction pointer | SRR0      | Machine state save/restore register 0 holds the saved PC (restored via rfi) |
| Machine state       | SRR1      | Machine state save/restore register 1 holds the saved MSR (restored via rfi) |

User TLS is managed by convention: the ABI designates R2 as the small-data area pointer and R13 as the read-only TLS base; both are part of the general GPR save above. The kernel per-CPU pointer lives in SPRG3 and is not per-task.

Extended state (lazy):

FPU state is controlled by MSR.FP. With MSR.FP = 0, any floating-point instruction from any privilege level causes a floating-point unavailable exception. The handler allocates the save area, sets MSR.FP = 1, and returns.

| Component | Registers | Save size | Enable bit | Trigger |
|-----------|-----------|-----------|------------|---------|
| FPR       | FPR0–FPR31 (64-bit each), FPSCR | 264 bytes | MSR.FP | Any FP instruction when MSR.FP = 0 |

AltiVec/VMX is not universally present on PPC32 targets supported by UmkaOS (primarily embedded e500/e500mc class cores). On embedded PPC32 cores that do implement SPE (Signal Processing Engine) floating-point, the SPE save area (32 × 32-bit upper halves of the 64-bit SPE GPRs, plus SPEFSCR) replaces the classical FPR block. Presence of SPE is detected from the PVR (Processor Version Register) at boot.

Debug registers (lazy):

DBCR0, DBCR1, DAC1, DAC2 (data address compare), IAC1, IAC2 (instruction address compare). Count and capability are read from DBCR0 at boot. Saved only when debug_active is set.

PPC64LE:

Always saved:

| Register group      | Registers | Notes |
|---------------------|-----------|-------|
| General-purpose     | R0–R31    | 32 integer registers |
| Special-purpose     | LR, CTR, XER, CR, DSCR, AMR | Link, count, exception, condition, data stream control, authority mask registers |
| Instruction pointer | SRR0 / HSRR0 | SRR0 for normal exceptions; HSRR0 for hypervisor exceptions (used in KVM context) |
| Machine state       | SRR1 / HSRR1 | Saved MSR (restored via rfid / hrfid) |

The AMR (Authority Mask Register) implements a hardware equivalent to memory protection keys on POWER9+ in Radix mode; it is always saved to preserve per-task memory domain state. DSCR controls the hardware prefetch engine and is saved to avoid polluting one task's prefetch hints into another.

User TLS follows the ELFv2 ABI: R13 holds the thread pointer. This is part of the general GPR save. The kernel per-CPU pointer lives in SPRG3; it is not per-task.

Extended state (lazy):

Three overlapping extended state components, each controlled by a separate MSR bit:

| Component   | Registers | Save size | MSR bit | Trigger |
|-------------|-----------|-----------|---------|---------|
| FPR         | FPR0–FPR31 (64-bit each), FPSCR | 264 bytes | MSR.FP | Any FP instruction when MSR.FP = 0 |
| VMX/AltiVec | VR0–VR31 (128-bit each), VRSAVE, VSCR | 528 bytes | MSR.VEC | Any VMX instruction when MSR.VEC = 0 |
| VSX         | VS0–VS63 (the VSX register file overlays FPR0–31 and VR0–31) | Covered by FPR + VMX saves | MSR.VSX | Any VSX instruction when MSR.VSX = 0 |

The VSX register file (VS0–VS63) is not an additional 64 independent registers: VS0–VS31 are the same physical registers as FPR0–FPR31 (double-precision view), and VS32–VS63 are the same physical registers as VR0–VR31. Saving FPR and VMX captures the complete VSX state; there is no additional VSX-specific save area.

All three components are saved independently and lazily: a task that uses FPR but not VMX pays only the 264-byte FPR save cost. MSR.VSX enables the xvmaddadp class instructions that cross the FPR/VMX boundary; it requires both MSR.FP and MSR.VEC to be set first.
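
The per-component costs compose additively, so a task's per-switch extended-state save size can be sketched as (sizes from the table above; helper name illustrative):

```rust
/// Per-switch extended-state save cost on PPC64LE: 264 bytes for FPR+FPSCR,
/// 528 bytes for VR+VRSAVE+VSCR. VSX adds nothing extra, since VS0-VS63
/// overlay the FPR and VR register files.
fn ext_save_bytes(uses_fp: bool, uses_vec: bool) -> usize {
    (if uses_fp { 264 } else { 0 }) + (if uses_vec { 528 } else { 0 })
}

fn main() {
    assert_eq!(ext_save_bytes(false, false), 0);   // integer-only task
    assert_eq!(ext_save_bytes(true, false), 264);  // FPR only
    assert_eq!(ext_save_bytes(true, true), 792);   // full VSX state
}
```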

Debug registers (lazy):

DAWR0 and DAWRX0 (data address watchpoint register, introduced POWER9) for hardware memory watchpoints. Hardware instruction breakpoints via CIABR (Completed Instruction Address Breakpoint Register). Saved only when debug_active is set.


Lazy FP/SIMD save — unified policy statement:

UmkaOS uses lazy FP/SIMD save on all supported architectures. Extended state is saved and restored only for threads that have actually used the corresponding unit. The mechanism by which laziness is enforced is architecture-specific:

| Architecture | Lazy FP mechanism | Lazy vector mechanism |
|--------------|-------------------|-----------------------|
| x86-64  | CR0.TS=1 causes #NM on first FP/SSE use; XSAVES saves only dirty XSAVE components | Same XSAVE component mask; AVX/AVX-512/AMX each have independent bits |
| AArch64 | CPACR_EL1.FPEN=0b00 causes trap on first NEON/FP use | CPACR_EL1.ZEN=0b00 causes #UND on first SVE use; SMCR_EL1.ENA=0 traps SMSTART |
| ARMv7   | FPEXC.EN=0 causes undefined-instruction trap on first VFP/NEON use | N/A (no vector extension beyond NEON) |
| RISC-V  | sstatus.FS=Off causes illegal-instruction trap; hardware sets FS=Dirty on write | sstatus.VS=Off causes illegal-instruction trap; hardware sets VS=Dirty on write |
| PPC32   | MSR.FP=0 causes floating-point unavailable exception | N/A (SPE if present uses same exception mechanism) |
| PPC64LE | MSR.FP=0 causes FP unavailable; MSR.VEC=0 causes VMX unavailable | MSR.VSX=0 causes VSX unavailable; requires FP+VEC first |

A task that never touches FP or SIMD registers pays zero extended-state save cost on every context switch across all architectures.

6.1.7 CPU Hotplug Integration

CPU hotplug (CPUs going offline/online at runtime) must be handled by the scheduler to migrate tasks and maintain invariants. UmkaOS supports full CPU hotplug on all architectures.

Offline sequence (CPU N going offline):

1. Mark CPU N as draining: set runqueue[N].state = DRAINING.
   New tasks are no longer scheduled onto CPU N (load balancer skips it).

2. Migrate tasks from runqueue[N]:
   For each runnable task in runqueue[N] (EEVDF tree + RT queues + DL queues):
   a. Select migration target: prefer same-NUMA-node CPU with lowest load
      (EAS-aware, Section 6.1.5).
   b. Dequeue from runqueue[N], set task.cpu = target_cpu,
      enqueue on runqueue[target_cpu].
   c. If a task is currently running on CPU N: wait for it to yield
      (it will find DRAINING state on next preemption point and yield).

3. Drain RCU quiescent state:
   Call rcu_barrier() to process all pending RCU callbacks that reference
   CPU N's per-CPU data. CPU N reports a final quiescent state.

4. Drain per-CPU slab magazines:
   Flush CPU N's per-CPU slab magazines to their per-NUMA partial lists
   (returns cached pages to the system).

5. Drain per-CPU writeback queue:
   Flush any pending writeback work on CPU N.

6. Mark CPU N offline:
   clear_bit(N, cpu_online_mask).
   CPU N executes arch::current::cpu::park() (HLT loop on x86-64,
   WFI in low-power state on AArch64, pause loop on RISC-V).

7. Notify subsystems:
   Fire cpu_hotplug_notifier(OFFLINE, N) to allow subsystems (networking,
   RCU, scheduler) to clean up CPU-N-specific state.

Online sequence (CPU N coming online):

1. Architecture-specific bring-up:
   x86-64: INIT-SIPI-SIPI sequence via LAPIC.
   AArch64: PSCI CPU_ON call.
   RISC-V: SBI HSM hart_start call.

2. Per-CPU data initialization:
   CPU N's PerCpu data structures are NOT re-allocated (they were sized
   at boot for all possible CPUs — see Section 3.1.3). Only state is reset:
   - runqueue[N]: initialize as empty, state = ACTIVE.
   - CpuLocal block for CPU N: zero-fill state fields.
   - RCU: set gp_start_qs[N] = 0; set rcu_passed_quiescent = false.

3. Mark CPU N online:
   set_bit(N, cpu_online_mask).

4. Fire cpu_hotplug_notifier(ONLINE, N).

5. Load balancer picks up CPU N in the next balance interval and starts
   migrating tasks to it.

Design note: UmkaOS's per-CPU arrays are sized at boot for num_possible_cpus() (all CPUs that could ever be brought online, including hotplugged ones). This means CPU offline/online is a pure state machine transition with no memory allocation, matching the "no hardcoded MAX_CPUS" principle and enabling sub-millisecond hotplug transitions.
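
The offline/online sequences above reduce to a three-state machine per runqueue. A sketch of the legal transitions (type and function names are illustrative):

```rust
/// Runqueue hotplug states from the offline/online sequences above.
#[derive(Clone, Copy, PartialEq, Debug)]
enum RqState { Active, Draining, Offline }

/// Legal transitions: Active -> Draining (offline step 1),
/// Draining -> Offline (offline step 6), Offline -> Active (online step 2).
/// Anything else indicates a hotplug sequencing bug.
fn transition_ok(from: RqState, to: RqState) -> bool {
    use RqState::*;
    matches!(
        (from, to),
        (Active, Draining) | (Draining, Offline) | (Offline, Active)
    )
}

fn main() {
    assert!(transition_ok(RqState::Active, RqState::Draining));
    assert!(transition_ok(RqState::Draining, RqState::Offline));
    assert!(transition_ok(RqState::Offline, RqState::Active));
    // A CPU must drain before it can go offline.
    assert!(!transition_ok(RqState::Active, RqState::Offline));
}
```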


6.2 Platform Power Management

Standards: ACPI 6.5 Section 1.2 (Power Management), Intel SDM Vol 3B Section 17.9 (RAPL MSRs), AMD PPR (Zen2+) Section 2.1.9 (RAPL), ARM Energy Model (Documentation/power/energy-model.rst), IPMI v2.0 Section 10.2 / DCMI v1.5. IP status: All interfaces are open standards or documented hardware interfaces. No proprietary implementations referenced.

6.2.1 Problem and Scope

Power management is a kernel responsibility — not because policy belongs in the kernel, but because the mechanisms that enforce policy require ring-0 privileges and sub-millisecond response latency:

Why ring-0 is required for power management mechanisms:

  • RAPL MSR writes require ring-0 access. Intel and AMD RAPL power-limit registers (e.g., MSR_PKG_POWER_LIMIT at 0x610) are privileged MSRs. A WRMSR instruction executed from ring-3 causes a #GP(0) fault. There is no userspace API that provides equivalent direct hardware control; powercap sysfs writes go through the kernel driver.
  • Thermal trip point response must be sub-millisecond. A thermal Critical trip point (typically 5–10 °C below the hardware PROCHOT shutdown temperature) requires an immediate forced poweroff. The kernel cannot wait for a userspace daemon to wake up, read a netlink event, and issue a shutdown ioctl — that path has unbounded latency. The kernel's thermal interrupt handler must act directly.
  • cgroup power accounting requires kernel-side energy counter integration. Attributing energy consumption to a cgroup requires reading RAPL energy counters at the same scheduler tick that records CPU time — these are indivisible from a correctness standpoint. A userspace poller cannot atomically correlate energy deltas to the task that was running.
  • VM power budgets must be enforced even if the VM misbehaves. A guest OS cannot be trusted to self-limit its power consumption. The hypervisor (umka-kvm) must enforce power caps from outside the VM, using kernel-level RAPL and cgroup mechanisms.

Scope: This section covers mechanisms only:

Policy — which power profile a user selects, when to throttle a VM for economic reasons, how to balance performance against energy cost — is a userspace/orchestrator concern. The kernel provides the enforcement hooks; daemons (e.g., tuned, power-profiles-daemon, umka-kvm's scheduler) invoke them.


6.2.2 RAPL — Running Average Power Limit

6.2.2.1 Domain Taxonomy

RAPL partitions the platform into named power domains. Each domain has independent power-limit registers and energy-status counters:

| Domain | Scope | Availability |
|--------|-------|--------------|
| Pkg | Entire CPU socket including uncore (LLC, memory controller, PCIe root complex, integrated graphics on client SKUs) | Intel SNB+, AMD Zen2+ |
| Core (PP0) | CPU cores only (excluding uncore). Useful for isolating compute vs memory-bandwidth workloads. | Intel SNB+ |
| Uncore (PP1) | Integrated GPU / GT on Intel client SKUs. Not present on server SKUs (Xeon). | Intel client only |
| Dram | Memory controller and attached DIMMs. Separate power rail on server platforms. | Intel IVB-EP+, AMD Zen2+ server |
| Platform (PSYS) | Entire platform as measured from the charger/PSU side. Introduced on Intel Skylake+ client. Captures power not visible to PKG (PCH, NVMe, display). | Intel SKL+ client only |

The Core domain is always ≤ Pkg. Platform is ≥ Pkg because it includes peripheral power not counted by the socket energy counter.

6.2.2.2 MSR Interface (x86-64 / x86)

Intel RAPL is exposed via Model-Specific Registers readable/writable with RDMSR/WRMSR from ring-0. The register layout is documented in Intel SDM Vol 3B Section 17.9.

Key registers for the Pkg domain (other domains follow the same pattern at different base addresses):

| MSR address | Name | Direction | Purpose |
|-------------|------|-----------|---------|
| 0x610 | MSR_PKG_POWER_LIMIT | R/W | Set short-window and long-window power limits |
| 0x611 | MSR_PKG_ENERGY_STATUS | R | Read cumulative energy counter (wraps at ~65 kJ for typical units) |
| 0x613 | MSR_PKG_PERF_STATUS | R | Throttle duty cycle (fraction of time spent in power throttle) |
| 0x614 | MSR_PKG_POWER_INFO | R | Thermal Design Power (TDP), minimum, and maximum power |

MSR_PKG_POWER_LIMIT bit layout:
  - Bits 14:0 — Long-window power limit (in hardware power units from MSR_RAPL_POWER_UNIT)
  - Bit 15 — Enable long-window limit
  - Bit 16 — Clamping enable (allow limit to go below TDP; requires CLAMPING_SUPPORT flag)
  - Bits 23:17 — Long-window time window (tau_x, encoded as 2^Y × (1 + Z/4) time units, where Y is bits 21:17 and Z is bits 23:22; typically ≤ 28 s)
  - Bits 30:24 — Reserved
  - Bit 31 — Reserved
  - Bits 46:32 — Short-window power limit
  - Bit 47 — Enable short-window limit
  - Bit 48 — Short-window clamping enable
  - Bits 55:49 — Short-window time window (tau_y, ≤ 10 ms)
  - Bits 62:56 — Reserved
  - Bit 63 — Lock bit (locks the entire register until next RESET; kernel must not set this)

The short-window limit (tau_y ≤ 10 ms) is the primary mechanism for burst suppression. The long-window limit (tau_x ≈ 28 s) enforces sustained average power. Setting both gives a two-tier policy: allow short bursts up to short_limit_W for up to 10 ms, but enforce long_limit_W on average.
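
Composing a two-tier limit under this layout is a matter of shifting the two limit/window pairs into place. A sketch, assuming the limit and window values have already been converted to hardware units via MSR_RAPL_POWER_UNIT (function name illustrative):

```rust
/// Pack a two-tier power limit into the MSR_PKG_POWER_LIMIT (0x610) layout.
/// Inputs are raw hardware-unit encodings, not watts or milliseconds.
fn encode_pkg_power_limit(long_units: u64, long_tw: u64,
                          short_units: u64, short_tw: u64) -> u64 {
    (long_units & 0x7FFF)                // bits 14:0  long-window limit
        | (1 << 15)                      // bit 15     enable long window
        | ((long_tw & 0x7F) << 17)       // bits 23:17 long time window
        | ((short_units & 0x7FFF) << 32) // bits 46:32 short-window limit
        | (1 << 47)                      // bit 47     enable short window
        | ((short_tw & 0x7F) << 49)      // bits 55:49 short time window
    // Bit 63 (lock) is deliberately never set by the kernel.
}

fn main() {
    let v = encode_pkg_power_limit(200, 0x2A, 400, 0x05);
    assert_eq!(v & 0x7FFF, 200);          // long limit field round-trips
    assert_eq!((v >> 32) & 0x7FFF, 400);  // short limit field round-trips
    assert_eq!(v >> 63, 0);               // lock bit stays clear
}
```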

Energy units are encoded in MSR_RAPL_POWER_UNIT (address 0x606). The driver must read this at boot and convert all values accordingly.
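
The unit conversion can be sketched as follows, assuming the common default energy-status-unit (ESU) exponent of 16, i.e. one count = 2^-16 J (the ESU field lives in MSR_RAPL_POWER_UNIT; helper name illustrative):

```rust
/// Convert a raw MSR_PKG_ENERGY_STATUS count to microjoules:
/// one count = 2^-ESU joules.
fn raw_energy_to_uj(raw: u64, esu: u32) -> u64 {
    ((raw as u128 * 1_000_000) >> esu) as u64
}

fn main() {
    // With ESU = 16 (~15.26 uJ per count), 65536 counts is exactly one joule.
    assert_eq!(raw_energy_to_uj(65_536, 16), 1_000_000);
    // The 32-bit counter thus wraps at 2^32 * 2^-16 J = 65536 J (~65 kJ).
    assert_eq!(raw_energy_to_uj(1 << 32, 16), 65_536 * 1_000_000);
}
```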

6.2.2.3 AMD Equivalent

AMD Zen2 and later processors implement RAPL-compatible energy counters, but at AMD-specific MSR addresses (MSR_AMD_RAPL_POWER_UNIT at 0xC001_0299, core and package energy status at 0xC001_029A and 0xC001_029B) with an Intel-compatible unit encoding. The AmdRaplInterface driver therefore reuses the Intel driver's unit-conversion and wrap-handling logic over its own register map; power-limit writes are not exposed through these MSRs, so set_power_limit is routed through platform firmware where supported or returns PowerError::NotSupported.

Older AMD processors (pre-Zen2) expose power telemetry through the System Management Unit (SMU), a co-processor reached through the northbridge's PCI config space (device/function assignment varies by generation). The SMU interface is only partially documented; UmkaOS follows the same access sequences as the Linux fam15h_power driver (drivers/hwmon/fam15h_power.c).

The RaplInterface abstraction (Section 6.2.2.5) hides this difference from upper layers.

6.2.2.4 ARM and RISC-V Equivalents

ARM Energy Model (EM): ARM SoCs do not expose hardware energy counters equivalent to RAPL. Instead, the ARM Energy Model framework provides estimated power consumption based on empirically measured power coefficients per CPU frequency operating point (OPP). Each OPP has a power_mW coefficient stored in the device tree (operating-points-v2 table). The kernel integrates over active OPPs to estimate energy. This is less accurate than RAPL but enables the same cgroup accounting interface (Section 6.2.5).

RISC-V: There is no standardised RAPL equivalent in the RISC-V ISA or the SBI specification as of SBI v2.0. Platform-specific power management is exposed via vendor SBI extensions (e.g., T-HEAD/Alibaba extensions for their RISC-V SoCs). UmkaOS implements a NoopRaplInterface for RISC-V that returns PowerError::NotSupported for all limit-setting operations and provides zero energy readings. Cgroup accounting falls back to CPU-time-weighted estimation.

6.2.2.5 Kernel Abstraction

All RAPL consumers (cgroup accounting, VM power budgets, thermal passive cooling, DCMI enforcement) interact with power domains through the RaplInterface trait, never touching MSRs directly:

/// The type of a RAPL power domain.
/// See also the comprehensive PowerDomainType at [Section 6.4.2](#642-design-power-as-a-schedulable-resource) which extends
/// this to non-CPU domains (accelerators, NICs, storage).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[repr(u32)]
pub enum PowerDomainType {
    /// Entire CPU socket including uncore (LLC, memory controller, PCIe root).
    CpuPackage  = 0,
    /// CPU cores only (PP0). Excludes uncore.
    CpuCore     = 1,
    /// Memory controller and attached DIMMs. Server platforms only.
    Dram        = 2,
    /// GPU / accelerator.
    Accelerator = 3,
    /// NIC (if power-metered).
    Nic         = 4,
    /// NVMe SSD (if power-metered).
    Storage     = 5,
    /// Entire system (platform-level RAPL or BMC).
    Platform    = 6,
}

/// A single RAPL power domain with its hardware interface and energy accumulator.
///
/// This is the x86/RAPL-specific domain object used for the powercap sysfs hierarchy
/// and energy accounting (Section 6.2.4). The generic cross-architecture abstraction
/// is `GenericPowerDomain` defined in Section 6.4.2.
pub struct RaplDomain {
    /// The type of power domain this represents.
    pub domain_type: PowerDomainType,
    /// The hardware driver implementing this domain's register interface.
    pub hw_interface: Arc<dyn RaplInterface>,
    /// Cumulative energy consumed by this domain in microjoules.
    ///
    /// Updated periodically by the kernel power accounting thread (Section 6.2.5).
    /// Wraps at `u64::MAX`; readers must handle wrap-around by tracking deltas.
    pub energy_uj: AtomicU64,
    /// Socket index (0-based) this domain belongs to.
    pub socket_id: u32,
}

/// Hardware interface for reading and controlling a RAPL power domain.
///
/// Implementations exist for: Intel MSR (`IntelRaplMsr`), AMD MSR/SMU
/// (`AmdRaplInterface`), ARM Energy Model (`ArmEmInterface`), and no-op
/// (`NoopRaplInterface` for platforms without hardware support).
///
/// # Safety
///
/// Implementations that write MSRs must only do so from ring-0 kernel context.
/// MSR writes from interrupt context are permitted but must be idempotent and
/// must not acquire locks that could be held by non-interrupt code.
pub trait RaplInterface: Send + Sync {
    /// Set a power limit on the given domain.
    ///
    /// `limit_mw` is the power limit in milliwatts.
    /// `window_ms` is the averaging window in milliseconds. Hardware may
    /// round to the nearest supported window; callers must not assume exact values.
    ///
    /// Returns `PowerError::NotSupported` if the domain or windowed limiting
    /// is not available on this platform.
    fn set_power_limit(
        &self,
        domain: PowerDomainType,
        limit_mw: u32,
        window_ms: u32,
    ) -> Result<(), PowerError>;

    /// Remove a previously set power limit on the given domain, restoring
    /// the hardware default (TDP-derived limit).
    ///
    /// Returns `PowerError::NotSupported` if the domain is not available.
    fn clear_power_limit(&self, domain: PowerDomainType) -> Result<(), PowerError>;

    /// Read the cumulative energy consumed by the given domain in microjoules.
    ///
    /// The counter wraps at `max_energy_range_uj()`. Callers must track
    /// previous values and compute deltas to handle wrap-around correctly.
    ///
    /// Returns `PowerError::NotSupported` if the domain is not available.
    fn read_energy_uj(&self, domain: PowerDomainType) -> Result<u64, PowerError>;

    /// Read the Thermal Design Power (TDP) of the given domain in milliwatts.
    ///
    /// This is the sustained power level the platform is designed to dissipate.
    /// It is used as the upper bound for VM admission control (Section 6.2.6).
    ///
    /// Returns `PowerError::NotSupported` if TDP information is not available.
    fn read_tdp_mw(&self, domain: PowerDomainType) -> Result<u32, PowerError>;

    /// Return the maximum value of the energy counter before it wraps, in microjoules.
    ///
    /// Callers use this to correctly handle wrap-around in `read_energy_uj`.
    fn max_energy_range_uj(&self, domain: PowerDomainType) -> Result<u64, PowerError>;
}

/// Errors returned by `RaplInterface` operations.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PowerError {
    /// The requested domain or operation is not supported on this platform.
    NotSupported,
    /// The requested power limit is below the hardware minimum or above the TDP.
    OutOfRange { min_mw: u32, max_mw: u32 },
    /// MSR or SMU access failed (hardware error or driver not initialised).
    HardwareFault,
    /// The domain's power limit register is locked until next RESET.
    Locked,
}
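The wrap-around contract documented on `read_energy_uj` can be illustrated with a pure helper (the name `energy_delta_uj` is hypothetical, not part of the trait):

```rust
/// Compute the energy consumed between two counter readings, handling
/// wrap-around at `max_range_uj` (the value returned by
/// `max_energy_range_uj()`). Illustrative sketch only.
fn energy_delta_uj(prev_uj: u64, curr_uj: u64, max_range_uj: u64) -> u64 {
    if curr_uj >= prev_uj {
        curr_uj - prev_uj
    } else {
        // The counter wrapped: the domain consumed the remainder of the
        // range, then accumulated up to `curr_uj`.
        (max_range_uj - prev_uj) + curr_uj
    }
}
```

Callers sample the counter periodically and sum deltas; a single missed wrap (sampling slower than the counter's full range) is undetectable, which is why the accounting thread samples well below the wrap period.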

The platform boot sequence probes for available RAPL domains (by attempting RDMSR and checking for #GP) and registers each discovered domain with the global PowerDomainRegistry. Upper layers iterate the registry rather than hard-coding which domains exist.


6.2.2a Per-Architecture Power Management Interfaces

The RaplInterface abstraction in Section 6.2.2 covers the common interface for energy reading and power-limit setting. The hardware mechanisms that back that interface differ substantially across architectures. This section specifies what those mechanisms are so that the platform boot driver for each architecture knows which registers, protocols, and firmware services to initialise.

All per-architecture power interfaces are accessed through the PlatformPowerOps trait, which is registered at boot by each architecture's power driver (see Section 6.2.2.5 for the RaplInterface trait that PlatformPowerOps builds on). Upper layers — the cgroup power controller, the scheduler's Energy-Aware Scheduling path (Section 6.1.5), and the FMA health subsystem (Chapter 19) — use this trait exclusively. They never call architecture-specific MSRs, SCMI mailboxes, or SBI extensions directly.

6.2.2a.1 x86-64 (Intel and AMD)

Energy reporting:

Intel and AMD Zen2+ both expose energy via RAPL MSRs. The complete register map and bit layout is specified in Section 6.2.2.2. Summary of the energy-status registers used by the IntelRaplMsr and AmdRaplInterface implementations:

MSR address Domain Availability
MSR_PKG_ENERGY_STATUS (0x611) CPU socket (cores + uncore) Intel SNB+; AMD Zen2+
MSR_PP0_ENERGY_STATUS (0x639) CPU cores only Intel SNB+
MSR_PP1_ENERGY_STATUS (0x641) Integrated GPU (client only) Intel client SKUs
MSR_DRAM_ENERGY_STATUS (0x619) Memory controller + DIMMs Intel IVB-EP+; AMD Zen2+ server
MSR_PLATFORM_ENERGY_STATUS (0x64D) Entire platform (PSU side) Intel SKL+ client only

The energy unit is encoded in MSR_RAPL_POWER_UNIT (0x606); it must be read at boot before any energy delta computation.
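A sketch of that unit decode, assuming the Intel SDM bit layout for MSR_RAPL_POWER_UNIT (energy status unit, ESU, in bits 12:8; energy unit = 1/2^ESU joules). The function name is illustrative:

```rust
/// Decode the RAPL energy unit from a raw MSR_RAPL_POWER_UNIT (0x606) value,
/// returning microjoules per counter tick. ESU is in bits 12:8; the unit is
/// 1/2^ESU joules (typical ESU = 16, i.e. ~15.26 µJ per tick).
fn rapl_energy_unit_uj(power_unit_msr: u64) -> f64 {
    let esu = ((power_unit_msr >> 8) & 0x1F) as u32;
    1_000_000.0 / (1u64 << esu) as f64
}
```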

Frequency and voltage control:

  • Intel P-states with HWP (Hardware-controlled Performance States): On Broadwell+ processors, HWP is enabled by writing bit 0 of IA32_PM_ENABLE (MSR 0x770). The scheduler then controls per-CPU performance hints via IA32_HWP_REQUEST (MSR 0x774), which encodes minimum performance, maximum performance, desired performance, and energy-performance preference (EPP) in a single 64-bit write. UmkaOS uses HWP when available in preference to legacy ACPI P-state (_PSS) switching.
  • AMD P-states (CPPC): On Zen2+, frequency scaling uses the Collaborative Processor Performance Control (CPPC) interface exposed through ACPI CPPC objects or directly via MSR_AMD_CPPC_REQ (0xC00102B3). The desired_perf field in CPPC maps to a CPU frequency in the same role as HWP's desired_perf.
  • Legacy ACPI _PSS: On older hardware without HWP or CPPC, UmkaOS falls back to ACPI P-state switching via _PSS/_PPC/_PCT methods, which the ACPI driver evaluates and translates into MSR writes (e.g., IA32_PERF_CTL at 0x199).

Power caps and TDP:

  • MSR_PKG_POWER_LIMIT (0x610): dual-window power limit (see Section 6.2.2.2 for the full bit layout). UmkaOS sets this register to enforce VM power budgets (Section 6.2.6) and rack-level DCMI caps (Section 6.2.7).
  • MSR_PKG_POWER_INFO (0x614): read-only; provides TDP, minimum, and maximum power. The TDP value is used as the default admission-control ceiling for VM watt-budget enforcement.
  • Package power-limit lock bit (bit 63 of MSR_PKG_POWER_LIMIT, 0x610): UmkaOS never sets this bit. Setting it prevents further limit changes until the next platform RESET.

Thermal:

  • IA32_THERM_STATUS (MSR 0x19C): per-core thermal status register. Bit 0 is the thermal status flag (set while the core is throttling) and bit 3 is the sticky PROCHOT# log (set once the core has throttled due to heat). Bits 22:16 encode the "thermal margin" (degrees Celsius below TjMax). UmkaOS reads this MSR periodically in the thermal polling loop (Section 6.2.3.5) and maps it to a ThermalZone temperature reading.
  • IA32_PACKAGE_THERM_STATUS (MSR 0x1B1): package-level equivalent of the per-core thermal status; includes the PROCHOT log and package thermal margin.
  • DTS (Digital Thermal Sensor): the thermal margin field combined with TjMax (read from MSR_TEMPERATURE_TARGET, address 0x1A2, bits 23:16) gives the absolute die temperature: T_die = TjMax - thermal_margin.
  • ACPI _TSS/_TPC throttling methods remain as a fallback on platforms where DTS MSRs are not accessible from the OS.
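The T_die formula above can be sketched directly over the raw MSR values (hypothetical helper; field masks follow the bit positions listed in the bullets):

```rust
/// Derive the absolute die temperature in °C from raw IA32_THERM_STATUS
/// and MSR_TEMPERATURE_TARGET values: T_die = TjMax - thermal_margin.
fn die_temp_c(therm_status: u64, temperature_target: u64) -> u32 {
    let tjmax = ((temperature_target >> 16) & 0xFF) as u32; // bits 23:16
    let margin = ((therm_status >> 16) & 0x7F) as u32;      // bits 22:16
    tjmax - margin
}
```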

Runtime device power management:

Device power state transitions (D0 to D3, and back) on x86-64 are driven by ACPI _PS0/_PS3 control methods evaluated by the ACPI interpreter in umka-kernel. The ACPI runtime PM path is common to all ACPI platforms (x86-64 and ACPI-based AArch64 servers); it is not x86-specific beyond the x86 ACPI initialisation sequence.

6.2.2a.2 AArch64 (ARM Servers: Graviton, Neoverse, Ampere)

AArch64 server platforms do not expose RAPL-equivalent MSRs. Energy reporting and frequency control are provided by a combination of hardware activity counters and a firmware-mediated control channel (SCMI).

Energy reporting:

The Activity Monitor Unit (AMU, FEAT_AMU, introduced in Armv8.4) provides a set of per-core hardware event counters accessible from EL1 via the AMEVCNTR0_EL0 and AMEVTYPER0_EL0 register families. Architecturally defined group-0 counters include:

Counter index Event Use
0 CPU cycles Total core cycles consumed
1 Constant-frequency cycles Frequency-invariant time base
2 Instructions retired IPC computation
3 Memory stall cycles DRAM latency pressure indicator (Neoverse V1/V2, Graviton3+)

AMU counters provide activity data, not joules. To derive energy, UmkaOS integrates AMU cycle counts against the per-OPP power coefficients from the ARM Energy Model (stored in the device tree operating-points-v2 table as opp-microwatt values). This gives an estimated energy per task, analogous to the delta-integration used on RAPL platforms, but with lower accuracy (±10–20% typical).

AMU is present on Neoverse V1, Neoverse V2, Neoverse N2, Cortex-A78, Cortex-X1, and later cores. On older cores (Neoverse N1, Graviton2), the ARM Energy Model estimation falls back to CPU-time-weighted power at the current OPP frequency, with no per-core AMU data.
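The cycle-against-coefficient integration can be sketched as follows, under the simplifying assumption that all of a task's cycles in the accounting interval ran at a single OPP (the helper name and signature are illustrative, not the real driver API):

```rust
/// Estimate the energy (µJ) attributable to `cycles` AMU-counted cycles
/// executed at an OPP with frequency `freq_hz` and device-tree power
/// coefficient `opp_microwatt` (the `opp-microwatt` property).
/// time_s = cycles / freq_hz; energy_uJ = power_uW × time_s.
fn amu_energy_estimate_uj(cycles: u64, freq_hz: u64, opp_microwatt: u64) -> u64 {
    // Widen to u128 so the intermediate product cannot overflow.
    ((opp_microwatt as u128 * cycles as u128) / freq_hz as u128) as u64
}
```

In practice the scheduler samples the AMU cycle counter at context switch and accumulates one such term per OPP residency interval, which is where the ±10–20% estimation error arises.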

Frequency and voltage control — SCMI:

On AArch64 server platforms (Graviton, Ampere Altra/Altra Max, Neoverse RD series), CPU frequency and power-domain gating are controlled via the System Control and Management Interface (SCMI, ARM specification DEN0056, currently version 3.2). SCMI runs between the OS and a dedicated System Control Processor (SCP) or equivalent firmware agent (e.g., the Nitro controller on AWS Graviton instances) over a shared-memory mailbox (doorbell register + shared SRAM buffer).

SCMI protocols used by UmkaOS:

SCMI protocol Protocol ID UmkaOS use
SCMI_PERF 0x13 P-state (OPP) transitions per CPU cluster or per-core. PERF_LEVEL_SET maps to the frequency request analogous to IA32_HWP_REQUEST on x86
SCMI_POWER 0x11 Power-domain gating (power on/off entire CPU clusters, peripherals). Used during CPU hotplug (Section 6.1.7) and system suspend (Section 6.2.10)
SCMI_SENSOR 0x15 Read platform sensor values (die temperature, supply voltage); used by the thermal framework (Section 6.2.3) as the sensor backend on SCMI platforms
SCMI_PERF_CAP 0x13 (cap sub-cmd) Power capping per performance domain, where supported by the SCP firmware

SCMI message exchange is asynchronous on multi-channel implementations (one shared-memory channel per CPU cluster); UmkaOS posts a request and either polls or waits for a doorbell interrupt (platform-dependent). Latency is typically 100–500 µs for a PERF_LEVEL_SET round-trip to the SCP. This is too slow for per-task frequency switching; SCMI frequency control is therefore applied at the granularity of runqueue load-balance intervals (typically 4–10 ms), not on every context switch.

Power caps:

On Graviton2/3 instances, AWS exposes the Nitro hypervisor's power budget to the guest OS via a platform-specific MMIO register or ACPI DSDT method; there is no standard SCMI power-capping channel available from within a Graviton VM. On bare metal (non-VM) AArch64 servers with SCMI power-capping protocol support, UmkaOS uses SCMI_PERF_CAP to enforce rack-level power budgets.

The PlatformPowerOps::set_power_limit implementation for SCMI platforms translates the limit_mw parameter into a performance level ceiling via the OPP table and issues a SCMI_PERF_LEVEL_SET with that ceiling as the maximum. This achieves power capping by constraining achievable frequency, not by a hardware power clamp as on x86-64 RAPL.
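A sketch of that translation, assuming the OPP table is sorted ascending by frequency and that power draw rises with frequency (the `(freq_khz, power_mw)` representation and helper name are hypothetical):

```rust
/// Choose the highest OPP index whose power draw fits under `limit_mw`.
/// Returns `None` if even the lowest OPP exceeds the limit, in which case
/// the caller would report the achievable range to the requester.
fn perf_level_ceiling(opps: &[(u32, u32)], limit_mw: u32) -> Option<usize> {
    // rposition scans from the high end, so the first match is the
    // fastest OPP that still satisfies the power limit.
    opps.iter().rposition(|&(_freq_khz, power_mw)| power_mw <= limit_mw)
}
```

The returned index becomes the maximum level in the SCMI_PERF_LEVEL_SET request, which is why the resulting cap is a frequency ceiling rather than a true power clamp.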

Thermal:

Temperature data on SCMI platforms is obtained via:

  1. SCMI_SENSOR protocol: the SCP reads die-temperature sensors and exposes them to the OS as named sensors. UmkaOS's thermal zone driver registers these as SensorBackend::Scmi entries (Section 6.2.3.6).
  2. Device tree thermal-zones nodes with thermal-sensors references: on embedded and mobile AArch64 SoCs (Qualcomm, MediaTek, Samsung), temperature sensors are MMIO-mapped and described in the device tree; UmkaOS's thermal zone driver reads them directly.
  3. ACPI _TSS/_TPC: on ACPI-enumerated AArch64 servers (those following the ACPI for Arm specification, SBSA/SBBR), the ACPI thermal zone path is the same as on x86-64.

Thermal trip-point response on SCMI platforms follows the same framework as x86-64 (Section 6.2.3): the thermal interrupt (or polling timer) fires the trip-point callback, which issues a cooling action via CoolingDevice::set_state. On SCMI platforms, the FrequencyScalingCooler implementation translates the cooling state to a SCMI_PERF_LEVEL_SET call.

Runtime device power management:

Device power gating on AArch64 uses one of:

  • Device tree power-domains nodes backed by the SCMI SCMI_POWER protocol: the generic power-domain framework calls SCMI_POWER_DOMAIN_STATE_SET to transition devices between POWER_ON and POWER_OFF.
  • PSCI SYSTEM_SUSPEND (function ID 0x8400_000E): used for system-wide suspend to RAM (Section 6.2.10). Per-CPU idle states also use PSCI CPU_SUSPEND.
  • On embedded/mobile platforms: operating-points-v2 device tree nodes with regulator framework bindings allow the kernel to request voltage changes alongside frequency changes, forming a complete DVFS (Dynamic Voltage and Frequency Scaling) path.

6.2.2a.3 RISC-V

The RISC-V ISA specification and the SBI (Supervisor Binary Interface, specification v2.0) do not define a standardised energy reporting or frequency scaling interface equivalent to RAPL or SCMI. Power management on RISC-V platforms is therefore entirely platform-specific.

Energy reporting:

The RISC-V ISA defines hardware performance monitor (HPM) counters: hpmcounter3 through hpmcounter31 (CSRs 0xC03–0xC1F), each counting a platform-defined event selected by the corresponding mhpmevent machine-mode CSR. Whether any HPM counter counts an energy-proxy event (e.g., CPU cycles at a known voltage-frequency point) is platform-defined. On platforms where such a counter exists, UmkaOS's RISC-V energy driver reads it and converts to milliwatts using a boot-time calibration coefficient from the device tree or SBI vendor extension.

On platforms with no HPM energy counter, UmkaOS falls back to CPU-time-weighted power estimation: power (mW) = active_fraction × OPP_power_mW, where OPP_power_mW comes from the operating-points-v2 device tree node. Cgroup energy accounting uses this estimated power.
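The fallback estimate reduces to one multiply-divide per sampling interval; a sketch with integer arithmetic (the helper name is illustrative):

```rust
/// CPU-time-weighted power estimate in milliwatts:
/// power = active_fraction × OPP power at the current frequency, where
/// active_fraction = active_ns / interval_ns over the sampling interval.
fn estimated_power_mw(active_ns: u64, interval_ns: u64, opp_power_mw: u32) -> u32 {
    // opp_power_mw × active_fraction, kept in integer arithmetic.
    ((opp_power_mw as u64 * active_ns) / interval_ns) as u32
}
```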

Frequency and voltage control:

  • SBI HSM (Hart State Management) extension (EID 0x48534D): the HART_SUSPEND function (FID 0x3) requests per-hart low-power clock-gating. This is the only standardised per-hart power state transition in SBI v2.0. It is used for idle (cpu_idle) and for offline CPUs (Section 6.1.7), not for frequency scaling.
  • Frequency scaling: there is no standardised RISC-V frequency scaling interface in the SBI specification as of version 2.0. On server-class RISC-V platforms following the RISC-V Server Platform specification (published 2023), ACPI CPPC is required; UmkaOS uses the ACPI CPPC driver path (same as on x86-64 legacy CPPC platforms). On embedded RISC-V SoCs, frequency scaling uses platform-specific MMIO registers described in the device tree, accessed through a platform-specific clock driver.
  • SBI vendor extensions: some RISC-V SoC vendors (e.g., T-HEAD/Alibaba for their C906/C910 series cores) define private SBI vendor extensions for DVFS. UmkaOS implements these as optional platform drivers registered at boot if the SBI probe reports the vendor extension ID.

Power caps:

No standardised RISC-V power-capping interface exists in the base SBI specification or RISC-V platform specifications as of 2025. The PlatformPowerOps::set_power_limit implementation for RISC-V returns PowerError::NotSupported unless a platform-specific driver (loaded via device tree compatible string matching) implements a power-capping MMIO interface.

Thermal:

Temperature sensors on RISC-V platforms are described in the device tree using the standard thermal-zones binding with thermal-sensors references pointing to platform-specific thermal sensor nodes (e.g., compatible = "sifive,fu740-temp"). UmkaOS's thermal zone driver reads them via the platform's sensor driver. Trip-point response uses the same thermal framework as other architectures (Section 6.2.3).

Runtime device power management:

SBI HSM HART_SUSPEND provides per-hart suspend (with and without local context retention, depending on the suspend_type field). System-wide suspend follows the platform-specific mechanism (ACPI S3 on RISC-V ACPI platforms; device-tree power domains on embedded platforms).

6.2.2a.4 PPC32 and PPC64LE

IBM POWER and PowerPC platforms have two distinct power management environments: bare-metal (directly running on the hardware, including OpenPOWER) and LPAR (Logical Partition, running under the PowerVM or KVM hypervisor). The mechanisms differ between these environments.

Energy reporting:

  • LPAR on IBM POWER (PowerVM hypervisor): Energy data is exposed via the PHYP (PowerVM Hypervisor) H-call H_GET_EM_PARMS, which returns the partition's current power consumption as measured by the system's power meters. This is the LPAR equivalent of RAPL: the hypervisor aggregates physical PSU data and attributes a share to each partition.
  • Bare metal (OpenPOWER, POWER9/POWER10 with OPAL): The OPAL (OpenPOWER Abstraction Layer) firmware exposes power data via opal_sensor_read (OPAL call 0x30) and opal_sensor_read_u64 (0x52). The ibm,opal-sensors device tree node lists available sensors (die temperature, core power, memory power) by sensor handle. UmkaOS's OPAL sensor driver iterates this list at boot and registers each as a GenericPowerDomain in the PowerDomainRegistry.
  • Bare metal without OPAL (classic PPC32 embedded): No hardware power counters accessible from the OS. CPU-time-weighted OPP estimation is the only option.
  • PMU counters: Both PPC32 and PPC64LE have hardware performance monitor facilities (configurable via MMCR0/MMCR1/MMCR2 and PMCx registers). These can count CPU cycles, L2/L3 misses, and memory bandwidth — useful energy proxies — but require platform-specific calibration. UmkaOS optionally uses PMC0 (total cycles) as a proxy if an OPAL/PHYP energy interface is not available.

Frequency and voltage control:

  • LPAR (PowerVM): The H_SET_PPP (set Partition Performance Parameters) H-call allows a partition to request a change in its CPU capacity entitlement and weight relative to other partitions on the same physical POWER system. This is not a direct frequency knob; the hypervisor honors the request subject to available capacity. UmkaOS issues H_SET_PPP from the scheduler's EAS path when the workload shifts between low and high throughput modes.
  • Bare metal OpenPOWER (OPAL): On POWER8 and POWER9 systems with OPAL, CPU frequency is controlled via the opal_set_freq call or by writing to the EPS (Energy Management) registers via OPAL. UmkaOS uses the OPAL cpufreq driver for OpenPOWER platforms.
  • ACPI on OpenPOWER: POWER9 and POWER10 systems running the Little-Endian Linux ABI (ppc64le) and ACPI-enumerated (ACPI is supported on OpenPOWER via SBSA-like profile) can use ACPI CPPC for frequency control, the same as ARM ACPI servers.
  • Embedded PPC32 (e500/e500mc): Frequency scaling is platform-specific; most embedded PPC32 SoCs use a simple PLL register write, described in the device tree.

Thermal:

  • OPAL platforms: Temperature sensor data is read via opal_sensor_read using the sensor handles discovered from the ibm,opal-sensors node. UmkaOS registers these as SensorBackend::Opal entries in the thermal framework.
  • LPAR (PowerVM): Thermal management is entirely hypervisor-controlled; the guest OS has no visibility into die temperature and cannot control throttling. UmkaOS does not register thermal zones in LPAR mode.
  • Server platforms with IPMI: Both PPC32 and PPC64LE rack servers typically have a Baseboard Management Controller (BMC) accessible via IPMI. Temperature sensors reported by the BMC are accessed through the IPMI thermal zone backend (Section 6.2.3.6). This is the same DCMI/IPMI path as on x86-64 rack servers (Section 6.2.7).

Runtime device power management:

  • LPAR: Device power gating is hypervisor-managed. UmkaOS does not control device power state directly in LPAR mode; the hypervisor handles it transparently.
  • OPAL bare metal: OPAL exposes device power domains via opal_pci_set_power_state for PCIe devices and via the ibm,opal device tree node's power-management subnode for on-chip devices.
  • Device tree power domains: Embedded PPC32/PPC64 platforms follow standard device tree power-domains bindings, identical to ARM embedded.

6.2.3 Thermal Framework

6.2.3.1 Thermal Zones

A thermal zone is a region of the system that has one or more temperature sensors and a set of trip points. Physical examples:

  • CPU die (one per socket; typically uses the TCONTROL MSR or PECI for temperature)
  • GPU die (integrated or discrete)
  • Battery (reported via ACPI _BTP or Smart Battery System)
  • Skin/chassis (NTC thermistor on laptop lid; used to prevent burns)
  • NVMe drive (SMART temperature, reported via hwmon, Section 10.10.2.1)

/// A thermal zone: a named region with a temperature sensor and trip points.
pub struct ThermalZone {
    /// Human-readable name (e.g., `"cpu0-die"`, `"battery"`, `"skin"`).
    /// Must be unique within the system. Used as the sysfs directory name.
    pub name: &'static str,

    /// The temperature sensor for this zone.
    pub sensor: Arc<dyn TempSensor>,

    /// Ordered list of trip points, sorted by `temp_mc` ascending.
    ///
    /// The thermal monitor evaluates all trip points on each poll cycle and
    /// fires actions for any whose threshold has been crossed.
    ///
    /// **Boot-time only**: populated by the ACPI/DT thermal zone parser at boot
    /// and never resized after the thermal subsystem initializes. `Vec` is used
    /// for owned, contiguous storage — not for dynamic growth.
    pub trip_points: Vec<TripPoint>,

    /// Cooling devices bound to this zone with their maximum cooling state
    /// and the trip point(s) that activate them.
    ///
    /// **Boot-time only**: populated at boot alongside `trip_points` and never
    /// modified at runtime. Typical zone has 1–4 bindings.
    pub cooling_devices: Vec<CoolingBinding>,

    /// Current polling interval in milliseconds.
    ///
    /// Starts at 1000 ms (normal), drops to 100 ms when the zone temperature
    /// is within 5 °C of any trip point, and drops to 10 ms when within 1 °C
    /// of a `Hot` or `Critical` trip point.
    pub polling_interval_ms: AtomicU32,
}

6.2.3.2 Trip Points

A trip point is a temperature threshold with an associated action type:

/// A temperature threshold that triggers a thermal action when crossed.
pub struct TripPoint {
    /// Temperature at which this trip point fires, in millidegrees Celsius.
    ///
    /// For example, 95000 = 95 °C.
    pub temp_mc: i32,

    /// The action to take when this trip point is crossed.
    pub trip_type: TripType,

    /// Hysteresis in millidegrees Celsius.
    ///
    /// The trip point is considered cleared only when the temperature drops
    /// below `temp_mc - hysteresis_mc`. This prevents oscillation around the
    /// threshold. Typical value: 2000 (2 °C).
    pub hysteresis_mc: i32,
}
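The hysteresis rule in `hysteresis_mc`'s documentation can be sketched as a small state predicate (hypothetical helper, not part of the kernel API):

```rust
/// Evaluate a trip point with hysteresis: it fires at `trip_mc` and, once
/// active, clears only when the temperature drops below
/// `trip_mc - hyst_mc`. The asymmetric thresholds prevent oscillation.
fn trip_active(temp_mc: i32, trip_mc: i32, hyst_mc: i32, was_active: bool) -> bool {
    if was_active {
        temp_mc >= trip_mc - hyst_mc // stay active until we cool past hysteresis
    } else {
        temp_mc >= trip_mc           // fire only at the threshold itself
    }
}
```

With trip_mc = 95000 and hyst_mc = 2000, a zone oscillating between 93.5 °C and 94.9 °C stays in whichever state it last entered rather than toggling every poll cycle.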

/// The action taken when a thermal trip point threshold is crossed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TripType {
    /// Reduce power consumption by notifying the cpufreq governor to lower
    /// the maximum CPU frequency. Does not forcefully reduce frequency;
    /// relies on the governor to converge. This is the primary mechanism
    /// for sustained thermal management.
    Passive,

    /// Activate a cooling device (e.g., spin up a fan to a higher speed).
    /// The bound `CoolingDevice` is set to its next higher state.
    Active,

    /// The temperature has reached a dangerous level. Post a `ThermalEvent`
    /// to userspace monitoring daemons via the thermal netlink socket.
    /// Userspace may respond by reducing workload. No kernel-side action.
    Hot,

    /// Emergency condition. The kernel immediately forces a system poweroff
    /// (equivalent to `kernel_power_off()`). This happens synchronously in
    /// the thermal interrupt handler or poll loop — userspace is not consulted.
    /// Data integrity is not guaranteed; this is a last resort before hardware
    /// thermal shutdown.
    Critical,
}

The Critical trip point is typically set 5–10 °C below the hardware's own THERMTRIP# shutdown temperature (the point at which the CPU powers off autonomously) to give the kernel a chance to shut down cleanly (flushing journal, unmounting filesystems) before the hardware forcibly powers off.

6.2.3.3 Cooling Devices

A cooling device is something the kernel can actuate to reduce heat generation or increase heat dissipation:

/// A device that can reduce thermal load on a thermal zone.
///
/// Cooling states are represented as integers from 0 (no cooling) to
/// `max_state()` (maximum cooling). The mapping from state number to physical
/// action is device-specific.
///
/// # Examples
///
/// - `CpufreqCooler`: state 0 = max frequency, state N = minimum frequency.
/// - `FanCooler`: state 0 = fan off, state N = 100% PWM duty cycle.
pub trait CoolingDevice: Send + Sync {
    /// Return the maximum cooling state this device supports.
    ///
    /// The device can be set to any state in `[0, max_state()]`.
    fn max_state(&self) -> u32;

    /// Return the current cooling state.
    fn current_state(&self) -> u32;

    /// Set the cooling state to `state`.
    ///
    /// Must be idempotent if `state == current_state()`.
    /// Returns `ThermalError::OutOfRange` if `state > max_state()`.
    fn set_state(&self, state: u32) -> Result<(), ThermalError>;

    /// Human-readable name for this cooling device (e.g., `"cpufreq-cpu0"`,
    /// `"fan-chassis0"`). Used as the sysfs `type` file content.
    fn name(&self) -> &'static str;
}

/// Binding between a thermal zone and a cooling device.
pub struct CoolingBinding {
    /// The cooling device to actuate.
    pub device: Arc<dyn CoolingDevice>,

    /// The trip point index (into `ThermalZone::trip_points`) that activates
    /// this binding. The cooling device is stepped up one state each time the
    /// thermal zone crosses this trip point.
    pub trip_point_index: usize,

    /// The cooling state to apply when the trip point is in the active
    /// (crossed) state. When the zone cools below `temp_mc - hysteresis_mc`,
    /// the device is stepped back down toward 0.
    pub target_state: u32,
}

Standard cooling device types provided by UmkaOS:

Type Description State mapping
CpufreqCooler Limits max CPU frequency via cpufreq (Section 6.1.5.5) 0 = cpu_max_freq, N = cpu_min_freq, linear interpolation
GpufreqCooler Limits max GPU frequency via drm/gpu driver Same as above
FanCooler Sets fan PWM duty cycle via hwmon (Section 10.10.2.1) 0 = fan off, max_state() = 100% PWM
UsbCurrentCooler Reduces USB charging current to lower battery heat 0 = max current, N = 0 mA
RaplCooler Reduces RAPL PKG limit directly 0 = TDP, N = minimum supported limit
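The CpufreqCooler state mapping in the table above is linear interpolation between the CPU's maximum and minimum frequencies; a sketch (illustrative helper; assumes max_state > 0):

```rust
/// Map a cooling state to a target maximum frequency (kHz) for a
/// CpufreqCooler: state 0 = cpu_max_freq, state max_state = cpu_min_freq,
/// linear interpolation in between.
fn cooler_freq_khz(state: u32, max_state: u32, min_khz: u32, max_khz: u32) -> u32 {
    let span = (max_khz - min_khz) as u64;
    max_khz - (span * state as u64 / max_state as u64) as u32
}
```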

6.2.3.4 Cooling Map Discovery

The binding between thermal zones and cooling devices is discovered at boot from:

  • ACPI: _TZD (thermal zone devices), _PSL (passive cooling list), _AL0–_AL9 (active cooling lists). The ACPI thermal driver evaluates these control methods and populates the cooling_devices list in each ThermalZone.
  • Device tree: cooling-maps node under the thermal zone node (binding documented in Linux kernel Documentation/devicetree/bindings/thermal/thermal-zones.yaml). UmkaOS parses this during DTB processing (Section 3.2).
  • Static board description: For platforms without ACPI or DTB thermal tables, a board-specific Rust module in umka-kernel/src/arch/ can register zones and bindings at compile time.

6.2.3.5 Polling and Interrupt-Driven Monitoring

The thermal monitor uses two mechanisms:

Polling (always available): A kernel timer fires periodically to call TempSensor::read_temp_mc() and evaluate all trip points. The polling interval is adaptive:

Temperature distance from nearest trip point Polling interval
> 5 °C below any trip point 1000 ms
1–5 °C below a Passive or Active trip 100 ms
< 1 °C below a Hot or Critical trip 10 ms
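The adaptive interval selection in the table can be sketched as follows (a simplification: trips already at or below the current temperature are treated the same as trips within range of their threshold; the enum and helper are illustrative):

```rust
#[derive(Clone, Copy)]
enum Trip { Passive, Active, Hot, Critical }

/// Choose the polling interval (ms) from the distance between the current
/// temperature and each trip point, per the adaptive table above.
/// `trips` are (temp_mc, type) pairs in millidegrees Celsius.
fn polling_interval_ms(temp_mc: i32, trips: &[(i32, Trip)]) -> u32 {
    let mut interval = 1000; // default: far from every trip point
    for &(trip_mc, ty) in trips {
        let dist_mc = trip_mc - temp_mc; // millidegrees below the trip
        if dist_mc < 1000 && matches!(ty, Trip::Hot | Trip::Critical) {
            return 10; // within 1 °C of a dangerous trip: fastest polling
        }
        if dist_mc < 5000 {
            interval = interval.min(100); // within 5 °C of any trip
        }
    }
    interval
}
```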

Interrupt-driven (when available): Some platforms provide hardware thermal interrupts that fire when a temperature threshold is crossed:

  • Intel PROCHOT interrupt: the CPU asserts PROCHOT# when the die temperature reaches the factory-programmed limit. The kernel registers an interrupt handler on APIC vector 0xFA (Linux convention for thermal LVT). This fires before RAPL-based throttling takes effect.
  • AMD SB-TSI alert: an SMBus alert from the SB-TSI temperature sensor on AMD platforms. Handled by the amd_sb_tsi I2C driver.
  • ACPI _HOT / _CRT notify: the firmware sends an ACPI notify event when a thermal zone crosses its Hot or Critical temperature. The ACPI event handler evaluates the zone immediately rather than waiting for the next poll cycle.

Interrupt-driven monitoring reduces the latency from temperature threshold crossing to kernel response from ≤ 1000 ms (polling) to ≤ 100 µs (interrupt).

6.2.3.6 Temperature Sensor Abstraction

/// A hardware temperature sensor.
///
/// Implementations include: x86 PECI (Platform Environment Control Interface),
/// ACPI `_TMP` control method, I2C/SMBus sensors (LM75, TMP102, etc.),
/// and ARM SoC on-die sensors.
pub trait TempSensor: Send + Sync {
    /// Read the current temperature in millidegrees Celsius.
    ///
    /// Returns `ThermalError::SensorFault` if the hardware sensor reports
    /// an error condition (e.g., I2C NACK, PECI timeout).
    fn read_temp_mc(&self) -> Result<i32, ThermalError>;

    /// Human-readable name for this sensor (e.g., `"peci-cpu0"`, `"acpi-tz0"`).
    fn name(&self) -> &'static str;
}

/// Errors returned by thermal framework operations.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ThermalError {
    /// The sensor or cooling device is not available or not initialised.
    NotAvailable,
    /// The sensor returned an error condition (hardware fault or communication error).
    SensorFault,
    /// The requested cooling state is outside `[0, max_state()]`.
    OutOfRange,
    /// The cooling device is currently locked by another subsystem.
    DeviceBusy,
}

6.2.3.7 Linux sysfs Compatibility

UmkaOS exposes the thermal framework under the same sysfs paths as the Linux kernel thermal framework, enabling unmodified Linux monitoring tools:

/sys/class/thermal/
  thermal_zone0/
    type          # zone name (e.g., "x86_pkg_temp")
    temp          # current temperature in millidegrees (e.g., "52000")
    mode          # "enabled" or "disabled"
    trip_point_0_temp    # first trip point temperature
    trip_point_0_type    # "passive", "active", "hot", or "critical"
    trip_point_0_hyst    # hysteresis in millidegrees
    policy        # cooling policy: "step_wise" or "user_space"
  cooling_device0/
    type          # cooling device name (e.g., "Processor")
    max_state     # maximum cooling state
    cur_state     # current cooling state

The type file content for CPU cooling devices uses the string "Processor" for compatibility with lm_sensors, thermald, and similar tools that match on this string.


6.2.4 Powercap Interface (sysfs)

The powercap sysfs hierarchy provides a unified interface for reading energy counters and setting power limits. UmkaOS's layout is byte-for-byte compatible with Linux's intel_rapl_msr driver output, ensuring that existing power monitoring and management tools work without modification.

6.2.4.1 Directory Structure

/sys/devices/virtual/powercap/
  intel-rapl/                          # Control type: Intel RAPL
    intel-rapl:0/                      # Socket 0 PKG domain
      name                             # "package-0"
      energy_uj                        # Cumulative energy (µJ, read-only, wraps)
      max_energy_range_uj              # Counter wrap value in µJ
      constraint_0_name                # "long_term"
      constraint_0_power_limit_uw      # Long-window limit in µW (read-write)
      constraint_0_time_window_us      # Long-window duration in µs (read-write)
      constraint_0_max_power_uw        # Maximum settable limit (TDP) in µW
      constraint_1_name                # "short_term"
      constraint_1_power_limit_uw      # Short-window limit in µW (read-write)
      constraint_1_time_window_us      # Short-window duration in µs (read-write)
      constraint_1_max_power_uw        # Maximum settable short-term limit
      enabled                          # "1" to enable limits, "0" to disable
      intel-rapl:0:0/                  # Socket 0 Core (PP0) sub-domain
        name                           # "core"
        energy_uj
        max_energy_range_uj
        constraint_0_name              # "long_term"
        constraint_0_power_limit_uw
        constraint_0_time_window_us
        constraint_0_max_power_uw
        enabled
      intel-rapl:0:1/                  # Socket 0 Uncore (PP1) sub-domain (client only)
        name                           # "uncore"
        ...
    intel-rapl:1/                      # Socket 1 PKG domain (dual-socket servers)
      ...

The DRAM domain appears as a separate top-level entry on server platforms:

    intel-rapl:0:2/                    # DRAM sub-domain of socket 0
      name                             # "dram"

On AMD Zen2+ systems, the same layout is used with the control type still named intel-rapl for compatibility (Linux uses the same driver name). AMD-specific extensions (if any) appear in an amd-rapl control type directory.

6.2.4.2 Tool Compatibility

The following tools work against UmkaOS's powercap hierarchy without modification:

Tool Use
powerstat Per-socket power consumption over time
turbostat CPU frequency, power, and temperature combined
s-tui Terminal UI showing frequency and power
powertop Process-level power attribution (reads /proc for per-process activity and energy_uj from powercap)
Prometheus node_exporter --collector.powersupplyclass and powercap collector
rapl-read Low-level RAPL register dump

6.2.4.3 Write Semantics

Writing constraint_N_power_limit_uw calls RaplInterface::set_power_limit() on the corresponding RaplDomain. Writes from unprivileged userspace are rejected with EPERM. Root (or a process with CAP_SYS_ADMIN) may write any domain.

Writing a limit that exceeds the domain's constraint_N_max_power_uw returns EINVAL. Writing 0 is equivalent to calling RaplInterface::clear_power_limit() (removes the software limit, restoring hardware default).
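The validation order above can be sketched as a small helper. This is a minimal sketch: `RaplDomain`, `WriteError`, and the field names here are illustrative stand-ins, not the kernel's actual types.

```rust
// Illustrative sketch of the constraint_N_power_limit_uw write path
// (Section 6.2.4.3 rules). Types and names are hypothetical.

pub enum WriteError { Eperm, Einval }

pub struct RaplDomain {
    pub max_power_uw: u64,          // constraint_N_max_power_uw
    pub limit_uw: Option<u64>,      // None = hardware default (no software limit)
}

impl RaplDomain {
    /// Handle a write to constraint_N_power_limit_uw.
    pub fn write_power_limit(&mut self, caller_is_privileged: bool, value_uw: u64)
        -> Result<(), WriteError>
    {
        if !caller_is_privileged {
            return Err(WriteError::Eperm);  // unprivileged writes rejected
        }
        if value_uw == 0 {
            self.limit_uw = None;           // clear_power_limit(): restore HW default
            return Ok(());
        }
        if value_uw > self.max_power_uw {
            return Err(WriteError::Einval); // above constraint_N_max_power_uw
        }
        self.limit_uw = Some(value_uw);     // set_power_limit()
        Ok(())
    }
}
```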


6.2.5 Cgroup Power Accounting

6.2.5.1 Design

Energy consumption is attributed to cgroups using a sampling-based model that parallels CPU time accounting. A dedicated kernel thread (the power accounting thread) wakes every 10 ms (configurable via /proc/sys/kernel/power_sample_interval_ms, range 1–1000 ms) and:

  1. Reads energy_uj from all active RAPL domains (all sockets, all sub-domains).
  2. Computes the delta from the previous sample, handling counter wrap-around.
  3. Queries the scheduler to get, for each cgroup, the CPU time consumed in the last 10 ms interval.
  4. Distributes the energy delta across cgroups proportional to their CPU time share.
  5. Accumulates the attributed energy into each cgroup's power.energy_uj counter.

Clarification on per-cgroup overhead: The 10 ms interval is the RAPL sampling and reporting period — a single kernel thread reads the hardware energy counters and distributes the delta. This is NOT per-cgroup polling. The power accounting thread performs one RAPL read per domain per interval (typically 4–8 RAPL domains total), then a single O(n) pass over active cgroups to distribute the delta. The per-cgroup CPU time data is already maintained by the scheduler's existing accounting (updated on context switch and dequeue events, not by polling).

Even with 4096 active cgroups, the accounting thread's cost per interval is therefore ~4–8 RAPL reads (~200 ns each) plus one linear scan of cgroup time deltas (4096 × ~20 ns ≈ 80 µs): under 100 µs per 10 ms interval, or about 1% of a single core and a negligible fraction of total CPU on a many-core server. The naive concern that 4096 cgroups × 10 ms polling would cost ~4% CPU assumes each cgroup requires independent hardware polling; in reality, the hardware counters are per-socket (not per-cgroup) and the per-cgroup attribution is a lightweight arithmetic distribution.

This is the same weighted attribution model used by Linux's cpuacct cgroup controller and, more recently, by Intel's Energy Aware Scheduling patches.
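Steps 1–4 above can be sketched with two helpers: one for wrap-safe counter deltas, one for the proportional distribution. This is a minimal model; `energy_delta_uj` and `distribute` are hypothetical names, and the real code operates on per-cgroup accounting structures rather than slices.

```rust
// Sketch of one accounting interval. The wrap rule follows step 2:
// counters wrap at max_energy_range_uj.

/// Energy delta between two raw counter reads, handling one wrap-around.
pub fn energy_delta_uj(prev_uj: u64, cur_uj: u64, max_energy_range_uj: u64) -> u64 {
    if cur_uj >= prev_uj {
        cur_uj - prev_uj
    } else {
        // Counter wrapped: range remaining before the wrap plus the new value.
        (max_energy_range_uj - prev_uj) + cur_uj
    }
}

/// Distribute a domain's energy delta across cgroups proportional to CPU
/// time (step 4). `cpu_time_us` holds each cgroup's CPU time in the
/// interval; the result is each cgroup's attributed energy in µJ.
pub fn distribute(delta_uj: u64, cpu_time_us: &[u64]) -> Vec<u64> {
    let total: u64 = cpu_time_us.iter().sum();
    if total == 0 {
        return vec![0; cpu_time_us.len()]; // all idle: charged to the idle cgroup
    }
    cpu_time_us.iter().map(|&t| delta_uj * t / total).collect()
}
```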

6.2.5.2 Attribution Model

Let E_delta be the total PKG energy delta in the current interval (µJ), and let T_i be the CPU time consumed by cgroup i in the interval (µs). The energy attributed to cgroup i is:

E_i = E_delta × (T_i / Σ T_j)

where the sum is over all cgroups with T_j > 0. Idle time (no cgroup running) is attributed to a synthetic idle cgroup and not charged to any user cgroup.

Limitation: This model has two known imprecisions:

  1. It does not account for memory bandwidth differences between cgroups sharing a socket. A cgroup running a memory-bandwidth-intensive workload consumes more power per CPU cycle than one running a compute-bound workload, but they receive the same energy charge per CPU time unit. This is acceptable for accounting and billing; it is not suitable for precise per-process energy metering.

  2. On a multi-socket server, PKG energy from socket 0 may be attributed to a cgroup whose threads ran on socket 1 if the sampling window captures a migration. The error is bounded by one sampling interval (10 ms default).

6.2.5.3 Cgroup Interface

The power cgroup controller provides the following files:

File Mode Description
power.energy_uj R Cumulative energy attributed to this cgroup in µJ. Wraps at u64::MAX.
power.stat R Per-domain energy breakdown: pkg_energy_uj, core_energy_uj, dram_energy_uj.
power.limit_uw RW Power limit for this cgroup in µW. 0 = no limit. Setting a non-zero value enables power cap enforcement (Section 6.2.5.4).
power.limit_window_ms RW Averaging window for power.limit_uw enforcement, in ms. Default: 100.

These files are created under the cgroup hierarchy directory, e.g.:

/sys/fs/cgroup/my-vm/power.energy_uj
/sys/fs/cgroup/my-vm/power.limit_uw

6.2.5.4 Per-Cgroup Power Limit Enforcement

When power.limit_uw is non-zero, the power accounting thread checks, at each sample interval, whether the cgroup's rolling-average power consumption (calculated from power.energy_uj deltas over power.limit_window_ms) exceeds the limit.

If the limit is exceeded:

  1. The cgroup's effective RAPL PKG short-window limit is reduced proportionally to bring the cgroup's power consumption within budget. This is implemented by adjusting the cpu.max bandwidth (Section 6.3) for the cgroup's tasks — reducing their CPU time allocation reduces their power consumption.
  2. A PowerLimitEvent is posted to the cgroup's event fd (readable via cgroup.events), allowing userspace monitoring daemons to observe throttling.
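The rolling-average check can be sketched as follows. The permille reduction scheme mirrors `VmPowerBudget::update()` (Section 6.2.6.2), but `power_cap_check` itself is an illustrative name, and the rolling average is modeled simply as the energy delta over the window.

```rust
// Hypothetical sketch of the per-interval power cap check (Section 6.2.5.4).

/// Returns the throttle reduction in permille (0 = within budget), given the
/// cgroup's energy delta over `window_ms` and its limit in µW.
pub fn power_cap_check(energy_delta_uj: u64, window_ms: u64, limit_uw: u64) -> u64 {
    if limit_uw == 0 {
        return 0;                                  // 0 = no limit configured
    }
    // µJ over N ms → average power in µW (1 µJ/ms = 1000 µW).
    let avg_uw = energy_delta_uj * 1000 / window_ms;
    if avg_uw <= limit_uw {
        return 0;
    }
    // Reduction proportional to overage fraction, as in VmPowerBudget::update().
    (avg_uw - limit_uw) * 1000 / avg_uw
}
```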

Limitation: RAPL enforcement at sub-PKG granularity (per-cgroup, per-core) is not directly supported by hardware. Per-cgroup limits are enforced indirectly via CPU time throttling (Section 6.3). True per-cgroup hardware power isolation would require per-core RAPL (available on some Intel Xeon generations as MSR_PP0_POWER_LIMIT) combined with strict core pinning — a configuration that umka-kvm uses for VM power budgets (Section 6.2.6), but which is not the general case.


6.2.6 VM Power Budget Enforcement

6.2.6.1 Motivation

Traditional VM resource accounting models (CPU cores, RAM) do not capture actual power consumption. A VM running a STREAM memory-bandwidth benchmark or a dense linear algebra kernel (e.g., BLAS DGEMM with AVX-512) can consume 2–3× the power of a VM running a web server at equivalent CPU utilisation. In a datacenter where the binding constraint is rack PDU amperage, not CPU cores, CPU-count quotas systematically mis-model the actual cost of workloads.

Watt-based quotas reflect actual rack power budget more honestly:

  • A 500W rack PDU can host either ten 50W VMs or five 100W VMs regardless of how many vCPUs each is assigned.
  • A burst-capable VM (bursty ML inference job) can be allocated 80W sustained with a 150W burst cap for 10 ms — mirroring the RAPL two-tier limit model.
  • Overcommit is detectable and rejectable at admission time by comparing sum(vm_power_limit_mw) against measured or rated socket TDP.

6.2.6.2 Mechanism

When umka-kvm creates a VM with a vm_power_limit_mw budget:

  1. Dedicated cgroup: A cgroup is created at /sys/fs/cgroup/umka-vms/<vm-id>/ for the VM's vCPU threads. power.limit_uw is set to vm_power_limit_mw * 1000.
  2. Core pinning: The VM's vCPU threads are pinned to a CPU set on a single socket (or across sockets if vm_numa_topology specifies multi-socket). This ensures energy counter attribution is accurate (Section 6.2.5.2 limitation 2).
  3. Socket RAPL coordination: The PKG short-window limit for each socket is set to sum(vm_power_limit_mw for all VMs pinned to that socket) + headroom_mw, where headroom_mw is a configurable per-socket constant (default: 10% of TDP) reserved for host kernel overhead.
  4. Monitoring: The VmPowerBudget::update() method is called by umka-kvm's power accounting thread every 100 ms. If the VM exceeds its budget, vCPU scheduling quota is reduced.

/// Power budget tracking for a single VM.
pub struct VmPowerBudget {
    /// Allocated sustained power budget for this VM in milliwatts.
    pub limit_mw: u32,

    /// Allocated burst power limit in milliwatts, enforced for windows ≤ 10 ms.
    ///
    /// Maps to the RAPL short-window limit. Set to `limit_mw` if no burst
    /// allowance is configured (conservative mode).
    pub burst_limit_mw: u32,

    /// Measured average power consumption over the last 1-second sliding window,
    /// in milliwatts. Updated by `update()` every 100 ms.
    pub measured_mw: AtomicU32,

    /// Number of times vCPU quota was reduced due to power budget violation.
    ///
    /// Monotonically increasing. Used for throttle-rate monitoring and alerting.
    pub throttle_count: AtomicU64,

    /// Handle to the cgroup backing this VM's power accounting and enforcement.
    cgroup: CgroupHandle,
}

impl VmPowerBudget {
    /// Called every 100 ms by umka-kvm's power accounting thread.
    ///
    /// Reads the energy delta from the VM's cgroup, updates `measured_mw`,
    /// and reduces vCPU scheduling quota if the budget is exceeded.
    ///
    /// `sched` is the CPU bandwidth controller for this VM's vCPU threads (Section 6.3).
    pub fn update(&self, sched: &CpuBandwidth) {
        let delta_uj = self.cgroup.read_energy_delta_uj();
        // 100 ms window: delta_uj / 100 ms = µJ/ms = mW.
        let measured = (delta_uj / 100) as u32;
        self.measured_mw.store(measured, Ordering::Relaxed);

        if measured > self.limit_mw {
            // Reduce vCPU CBS quota proportional to overage fraction.
            // Uses fixed-point arithmetic (percentage × 10) to avoid FPU in kernel.
            // Example: measured = 120 mW, limit = 100 mW →
            //   overage = 20, reduction_permille = 20 * 1000 / 120 = 166 (16.6%).
            let overage = measured - self.limit_mw;
            let reduction_permille = overage * 1000 / measured;
            sched.reduce_quota_permille(reduction_permille);
            self.throttle_count.fetch_add(1, Ordering::Relaxed);
        }
    }
}

6.2.6.3 Admission Control

Before umka-kvm creates a new VM with a vm_power_limit_mw budget, it checks the admission constraint:

sum(vm.limit_mw for all VMs on target socket) + new_vm.limit_mw
    ≤ socket_tdp_mw - host_headroom_mw

where socket_tdp_mw is read from RaplInterface::read_tdp_mw(PowerDomainType::Pkg) at boot, and host_headroom_mw defaults to 10% of TDP (configurable via /sys/module/umka_kvm/parameters/power_headroom_pct).

If the constraint is violated, umka-kvm returns ENOSPC to the VM creation ioctl. The caller (e.g., an orchestrator) must either reduce the requested budget, migrate an existing VM to another socket/host, or reject the workload.
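The constraint can be sketched as a pure predicate. This is a hypothetical helper; the real check runs inside umka-kvm's VM-creation path and reads committed budgets from per-socket state.

```rust
// Admission-control sketch per Section 6.2.6.3. Integer mW arithmetic.

/// Returns true if a new VM with `new_limit_mw` may be admitted to a socket
/// already hosting VMs with `existing_limits_mw`, given the socket TDP and
/// the host headroom percentage (default 10).
pub fn admit_vm(existing_limits_mw: &[u32], new_limit_mw: u32,
                socket_tdp_mw: u32, headroom_pct: u32) -> bool {
    let committed: u64 = existing_limits_mw.iter().map(|&l| l as u64).sum();
    let headroom = socket_tdp_mw as u64 * headroom_pct as u64 / 100;
    committed + new_limit_mw as u64 <= socket_tdp_mw as u64 - headroom
}
```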

This is an admission control gate, not a guarantee. Actual power consumption may exceed TDP temporarily due to:

  • Turbo Boost / AMD Precision Boost (transient power above TDP for ≤ 10 ms)
  • RAPL enforcement latency (the hardware enforces limits over the configured window; instantaneous power can spike)

These transient overages are expected and handled by the hardware's own thermal and power-delivery circuitry. UmkaOS's admission control operates at the sustained (long-window) level.

6.2.6.4 Observability

Per-VM power accounting is exposed via:

/sys/fs/cgroup/umka-vms/<vm-id>/power.energy_uj    # Cumulative energy (µJ)
/sys/fs/cgroup/umka-vms/<vm-id>/power.stat          # Per-domain breakdown
/sys/fs/cgroup/umka-vms/<vm-id>/power.limit_uw      # Current limit (µW)

umka-kvm also exposes power metrics via the KVM statistics interface (/sys/bus/event_source/devices/kvm/), enabling Prometheus node_exporter KVM collector to report per-VM power consumption.


6.2.7 DCMI / IPMI Rack Power Management

6.2.7.1 Overview

In server deployments managed by a Baseboard Management Controller (BMC), the BMC may impose a platform-level power cap via the Data Center Manageability Interface (DCMI), an extension of IPMI v2.0 (specification: DCMI v1.5, published by Intel/DMTF).

DCMI provides the following power management commands over the IPMI channel:

DCMI Command NetFn/Cmd Description
Get Power Reading 2C/02h Current platform power in watts (instantaneous, min, max, average over a rolling window)
Get Power Limit 2C/03h Read the currently configured platform power cap
Set Power Limit 2C/04h Set a platform power cap and exception action (hard power-off or OEM-defined)
Activate/Deactivate Power Limit 2C/05h Enable or disable the configured power cap
Get DCMI Capabilities 2C/01h Enumerate which DCMI features the BMC supports

These commands are sent by the datacenter management infrastructure (e.g., OpenBMC, Redfish, Dell iDRAC, HP iLO) to impose a rack-level power budget on individual servers.

6.2.7.2 UmkaOS Integration

The UmkaOS IPMI driver (Tier 1; KCS, SMIC, and BT system interfaces over LPC or I2C, as described in Section 10.10.2) handles incoming DCMI commands from the BMC. When the BMC asserts a power cap via Set Power Limit + Activate Power Limit, the kernel responds as follows:

BMC sets cap C_bmc (watts)
  │
  ▼
umka-ipmi driver receives DCMI Set/Activate Power Limit
  │
  ├─► Reduce aggregate RAPL PKG limits across all sockets (TDP-proportional)
  │     for each socket i:
  │       new_pkg_limit[i] = C_bmc × (socket_tdp[i] / Σ socket_tdp)
  │     TDP values are read from MSR_PKG_POWER_INFO (x86) or ACPI PPTT at boot.
  │     On heterogeneous systems (mixed socket SKUs), this gives higher-TDP
  │     sockets a proportionally larger share of the cap. On homogeneous systems,
  │     this reduces to C_bmc / num_sockets.
  │     → RaplInterface::set_power_limit(Pkg, new_pkg_limit[i], long_window_ms)
  │
  ├─► Notify umka-kvm to reduce VM watt budgets proportionally
  │     reduction_factor = C_bmc / current_total_vm_budget
  │     → for each VM: VmPowerBudget::limit_mw *= reduction_factor
  │     → Re-run admission control check (may trigger VM migration signal)
  │
  └─► Post PowerCapEvent to userspace monitoring channel
        → sysfs: /sys/bus/platform/drivers/dcmi/power_cap_uw (updated)
        → netlink thermal event (for `thermald` compatibility)
        → KVM statistics update (for Prometheus collector)
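The TDP-proportional split in the first step can be sketched as a small helper (illustrative name; integer mW arithmetic, as elsewhere in this chapter):

```rust
// Sketch of the TDP-proportional cap split from Section 6.2.7.2.

/// Split a BMC cap across sockets proportional to each socket's TDP.
/// On homogeneous sockets this reduces to cap_mw / num_sockets.
pub fn split_cap_mw(cap_mw: u64, socket_tdp_mw: &[u64]) -> Vec<u64> {
    let total_tdp: u64 = socket_tdp_mw.iter().sum();
    socket_tdp_mw.iter().map(|&tdp| cap_mw * tdp / total_tdp).collect()
}
```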

6.2.7.3 Escalation Hierarchy

Power management operates at three levels, each enforced by a different actor:

Level Mechanism Enforced by Override possible?
Software power limit RAPL MSR_PKG_POWER_LIMIT UmkaOS kernel Yes (root can raise within TDP)
BMC power cap DCMI Set Power Limit BMC firmware Only by BMC admin
Physical current limit PSU OCP / PDU circuit breaker Hardware No

The kernel controls only the first level. The BMC cap (second level) is communicated to the kernel via DCMI but ultimately enforced by the BMC's power management controller, which can throttle the server via SYS_THROT# or force a hard power-off regardless of OS state. The kernel's DCMI integration is cooperative, not authoritative.

6.2.7.4 DcmiPowerCap Interface

/// Interface for the DCMI power cap enforcement callback.
///
/// Implemented by the IPMI driver. Called when the BMC asserts or modifies
/// a DCMI power limit.
pub trait DcmiPowerCap: Send + Sync {
    /// Called when the BMC sets a new platform power cap.
    ///
    /// `cap_mw` is the new cap in milliwatts. `0` indicates the cap has been
    /// deactivated (no limit). Implementors must update RAPL limits and notify
    /// umka-kvm within this call or schedule it for immediate async processing.
    fn on_cap_set(&self, cap_mw: u32);

    /// Return the currently active BMC-imposed cap in milliwatts.
    ///
    /// Returns `None` if no cap is currently active.
    fn current_cap_mw(&self) -> Option<u32>;

    /// Return the last measured platform power reading from the BMC in milliwatts.
    ///
    /// This is the BMC's own measurement, which may differ from RAPL's
    /// (BMC measures at the PSU, RAPL measures at the socket).
    fn last_reading_mw(&self) -> u32;
}

6.2.8 Battery and SMBus Monitoring

SMBus (System Management Bus) is a subset of I2C used for battery and charger chips. Example: Smart Battery System (SBS) batteries expose registers at I2C address 0x0B:

  • 0x08: Temperature (in 0.1 K units)
  • 0x09: Voltage (in mV)
  • 0x0A: Current (in mA, signed)
  • 0x0D: Relative State of Charge (0–100%)
  • 0x0F: Remaining Capacity (in mAh)

The battery driver (Tier 1, probed via ACPI PNP0C0A device) reads these registers periodically (every 5 seconds when on battery, every 60 seconds when on AC) and exposes them via sysfs/umkafs (see Section 6.2.9.5 for the userspace interface).
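The raw-to-sysfs unit conversions implied by these registers can be sketched as follows. These helpers are illustrative (not the driver's actual API); the 0.1 K and signed-mA encodings follow the SBS register list above.

```rust
// Unit conversions from raw SBS register words to sysfs-style units.

/// SBS register 0x08: temperature in 0.1 K units → millidegrees Celsius.
pub fn sbs_temp_to_mdeg_c(raw_decikelvin: u16) -> i32 {
    raw_decikelvin as i32 * 100 - 273_150
}

/// SBS register 0x0A: current in mA (signed; negative = discharging) → µA.
pub fn sbs_current_to_ua(raw: u16) -> i32 {
    (raw as i16) as i32 * 1000
}
```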


6.2.9 Consumer Power Profiles

This subsection defines the user-facing policy layer; Sections 6.2.1–6.2.8 define the underlying mechanisms.

6.2.9.1 Power Profile Enumeration

// umka-core/src/power/profile.rs

/// User-facing power profile (consumer policy; translates to Section 6.2 mechanisms).
#[repr(u32)]
pub enum PowerProfile {
    /// Maximum performance. AC adapter expected.
    Performance  = 0,
    /// Balanced performance and power. Default on AC.
    Balanced     = 1,
    /// Aggressive power saving. Default on battery.
    BatterySaver = 2,
    /// User-defined constraints loaded from /System/Power/CustomProfile.
    Custom       = 3,
}

6.2.9.2 Profile → Mechanism Translation

Each profile maps to concrete Section 6.2 parameters:

Profile RAPL PKG limit CPU turbo GPU freq cap WiFi PSM Display brightness
Performance None (HW TDP) Enabled 100% Disabled 100%
Balanced 80% TDP Enabled 80% PSM 75%
BatterySaver 50% TDP Disabled 40% Aggressive 40%

set_profile() calls RaplInterface::set_power_limit() (Section 6.2.2), the cpufreq governor, and WirelessDriver::set_power_save() (Section 12.1.1). No RAPL MSR writes happen in consumer-layer code — all hardware access goes through Section 6.2.
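The translation table can be sketched as a pure mapping. This is a sketch only: `ProfileParams` and its fields are illustrative, and the real `set_profile()` additionally drives the cpufreq governor, WiFi PSM, and display brightness.

```rust
// Sketch of the profile → mechanism translation (Section 6.2.9.2 table).

pub struct ProfileParams {
    pub rapl_pkg_pct_tdp: Option<u32>, // None = hardware TDP (no software limit)
    pub turbo_enabled: bool,
    pub gpu_freq_pct: u32,
}

/// Map a PowerProfile discriminant (Section 6.2.9.1) to mechanism parameters.
pub fn profile_params(profile: u32) -> ProfileParams {
    match profile {
        // Performance: no RAPL limit, turbo on, GPU uncapped.
        0 => ProfileParams { rapl_pkg_pct_tdp: None,     turbo_enabled: true,  gpu_freq_pct: 100 },
        // BatterySaver: 50% TDP, turbo off, GPU capped at 40%.
        2 => ProfileParams { rapl_pkg_pct_tdp: Some(50), turbo_enabled: false, gpu_freq_pct: 40 },
        // Balanced (default): 80% TDP, turbo on, GPU capped at 80%.
        _ => ProfileParams { rapl_pkg_pct_tdp: Some(80), turbo_enabled: true,  gpu_freq_pct: 80 },
    }
}
```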

6.2.9.3 AC/Battery Auto-Switch

The PowerManager listens for ACPI AC adapter events (from the battery driver, Section 6.2.8) and automatically applies the user's preferred profile for each power source. On critical battery (≤5%), BatterySaver is forced. A BatteryCritical event is posted to the Section 6.6 event ring so userspace can display a notification.

6.2.9.4 Per-Process Power Attribution

Per-process energy attribution (for desktop power managers like GNOME Settings, KDE Powerdevil) is provided by the Section 6.2.5 cgroup power accounting. Per-process granularity: each process lives in a cgroup; power.energy_uj on that cgroup gives its energy consumption. Exposed via /proc/<pid>/power_consumed_uj in umka-compat procfs.

6.2.9.5 Userspace Interface

Kernel exposes power management state via the following paths:

Path Description
/sys/kernel/umka/power/profile Read/write power profile selection (performance, balanced, battery-saver). Writable by processes with CAP_SYS_ADMIN.
/proc/<pid>/power_consumed_uj Per-process energy consumption in microjoules (RAPL cgroup attribution, Section 6.2.5.3).
/sys/class/power_supply/BAT0/capacity Battery charge percentage (0–100).
/sys/class/power_supply/BAT0/energy_now Remaining energy in µWh.
/sys/class/power_supply/BAT0/current_now Discharge/charge current in µA.
/sys/class/power_supply/BAT0/cycle_count Charge cycle count.
/sys/class/power_supply/BAT0/status Charging, Discharging, Full, Unknown.

Time-remaining estimation, low-battery notifications, and battery health display are handled by userspace daemons (UPower or equivalent) reading these paths.
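As an illustration of what such a daemon computes, time remaining follows directly from energy_now and an instantaneous power draw. Here the draw is assumed precomputed in µW (e.g. the product of voltage_now and current_now); the helper name is hypothetical.

```rust
// Userspace-style estimate from the sysfs fields above. Assumes a constant
// discharge rate over the estimation horizon.

/// Minutes remaining, from energy_now (µWh) and instantaneous draw (µW).
pub fn minutes_remaining(energy_now_uwh: u64, power_now_uw: u64) -> u64 {
    if power_now_uw == 0 {
        return 0; // not discharging: no meaningful estimate
    }
    // µWh / µW = hours; × 60 → minutes.
    energy_now_uwh * 60 / power_now_uw
}
```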


6.2.10 Suspend and Resume Protocol

Context: the S4Hibernate variant in SleepState (Section 6.2.10.1) points to the full hibernate specification. This section specifies S3 (Suspend-to-RAM) and S0ix (Modern Standby), which are the primary suspend mechanisms on consumer laptops.

6.2.10.1 Sleep State Enumeration

// umka-core/src/power/suspend.rs

/// ACPI sleep state.
#[repr(u32)]
pub enum SleepState {
    /// S3: Suspend-to-RAM. CPU powered off, DRAM refreshing, ~2-5W, wake in <2s.
    S3SuspendToRam = 3,
    /// S0ix: Modern Standby (S0 Low Power Idle). CPU in deep C-states, OS "running",
    /// network alive, <1W, instant wake. Intel 6th gen+, AMD Ryzen 3000+, ARM.
    S0ixModernStandby = 0x0F, // Not a standard ACPI state, vendor-specific
    /// S4: Hibernate. See [Section 17.2](17-virtualization.md#172-suspend-and-resume) for the
    /// full specification (snapshot creation, dm-crypt signed snapshot, resume protocol).
    S4Hibernate = 4,
}

/// Power state machine states.
#[repr(u32)]
pub enum SuspendPhase {
    /// System running normally.
    Running = 0,
    /// Pre-suspend: freeze userspace, sync filesystems.
    PreSuspend = 1,
    /// Device suspend: call driver suspend callbacks.
    DeviceSuspend = 2,
    /// CPU suspend: save CPU state, enter ACPI sleep state.
    CpuSuspend = 3,
    /// (system asleep, this state is never observed by running code)
    Asleep = 4,
    /// CPU resume: restore CPU state.
    CpuResume = 5,
    /// Device resume: call driver resume callbacks.
    DeviceResume = 6,
    /// Post-resume: thaw userspace.
    PostResume = 7,
}

6.2.10.2 Power State Machine

impl SuspendManager {
    /// Initiate suspend to a given sleep state.
    pub fn suspend(&self, state: SleepState) -> Result<(), SuspendError> {
        // Phase 1: PreSuspend
        self.set_phase(SuspendPhase::PreSuspend);
        self.freeze_userspace()?; // Stop all userspace tasks
        self.sync_filesystems()?; // Flush all dirty pages, journal commits

        // Phase 2: DeviceSuspend
        self.set_phase(SuspendPhase::DeviceSuspend);
        self.suspend_devices(state)?; // Call driver suspend callbacks in reverse probe order

        // Phase 3: CpuSuspend
        self.set_phase(SuspendPhase::CpuSuspend);
        self.save_cpu_state()?; // Save registers, page tables, GDT, IDT
        self.enter_acpi_sleep_state(state)?; // Write to ACPI PM1a_CNT, CPU halts

        // ... (system asleep, wake event occurs) ...

        // Phase 4: CpuResume (code resumes here after wake)
        self.set_phase(SuspendPhase::CpuResume);
        self.restore_cpu_state()?; // Restore registers, reload CR3, GDTR, IDTR

        // Phase 5: DeviceResume
        self.set_phase(SuspendPhase::DeviceResume);
        self.resume_devices(state)?; // Call driver resume callbacks in probe order

        // Phase 6: PostResume
        self.set_phase(SuspendPhase::PostResume);
        self.thaw_userspace()?; // Unfreeze userspace tasks

        self.set_phase(SuspendPhase::Running);
        Ok(())
    }
}

6.2.10.3 Driver Suspend/Resume Callbacks

Every driver (Tier 1 and Tier 2) must implement suspend/resume:

// umka-driver-sdk/src/suspend.rs

/// Driver suspend/resume trait.
pub trait SuspendResume {
    /// Suspend the device to the given sleep state.
    ///
    /// # Contract
    /// - Flush all pending I/O to the device.
    /// - Disable interrupts (deregister interrupt handler or mask at device level).
    /// - Power down the device (write to PCI PM registers, or device-specific power control).
    /// - Save any device state that cannot be reconstructed (e.g., firmware upload not repeatable).
    ///
    /// # Timeout
    /// If suspend does not complete within 500ms, the kernel may force-kill the driver
    /// (Tier 2) or mark it as failed (Tier 1).
    fn suspend(&self, target: SleepState) -> Result<(), SuspendError>;

    /// Resume the device from the given sleep state.
    ///
    /// # Contract
    /// - Restore device state (e.g., re-upload firmware, reconfigure registers).
    /// - Re-enable interrupts.
    /// - Re-establish any connections (WiFi: reconnect to AP, NVMe: reinitialize controller).
    ///
    /// # Failure handling
    /// If resume fails, return Err. The kernel will attempt recovery (Section 6.2.10.6).
    fn resume(&self, from: SleepState) -> Result<(), SuspendError>;
}

6.2.10.4 Device Suspend Ordering

Devices must suspend in reverse dependency order and resume in dependency order:

Suspend order (leaves first, roots last):
1. Display controller (depends on GPU for framebuffer)
2. GPU (no dependencies)
3. Filesystem (depends on NVMe)
4. NVMe (no dependencies)
5. Network stack (depends on NIC)
6. Network interface (WiFi, Ethernet)

Resume order (roots first, leaves last):
1. GPU
2. Display controller
3. NVMe
4. Filesystem
5. Network interface
6. Network stack

The device registry (Section 10.5) tracks dependencies. Before suspend, the registry computes a topological sort of the device tree and calls suspend callbacks in that order.
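The topological sort can be sketched with Kahn's algorithm over the dependency edges. This is a minimal model using device indices in place of registry handles; an edge (provider, dependent) means the dependent needs the provider, so resume visits providers first and suspend uses the reverse of the returned order.

```rust
// Kahn's-algorithm sketch of the registry's topological sort.

/// Compute a resume order (providers before dependents) for `n` devices
/// given dependency edges `(provider, dependent)`.
pub fn resume_order(n: usize, deps: &[(usize, usize)]) -> Vec<usize> {
    let mut indegree = vec![0usize; n];
    let mut adj: Vec<Vec<usize>> = vec![Vec::new(); n];
    for &(provider, dependent) in deps {
        adj[provider].push(dependent);
        indegree[dependent] += 1;
    }
    // Start from devices with no unresumed providers.
    let mut ready: Vec<usize> = (0..n).filter(|&i| indegree[i] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(dev) = ready.pop() {
        order.push(dev);
        for &next in &adj[dev] {
            indegree[next] -= 1;
            if indegree[next] == 0 {
                ready.push(next);
            }
        }
    }
    order // suspend order is this order reversed
}
```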

6.2.10.5 Tier 2 Driver Suspend

Tier 2 drivers are separate processes. Suspending them requires IPC:

impl SuspendManager {
    /// Suspend a Tier 2 driver (send message via ring buffer).
    fn suspend_tier2_driver(&self, driver_pid: Pid, state: SleepState) -> Result<(), SuspendError> {
        // Send DRIVER_SUSPEND message to the driver's control ring (Section 10.7.2).
        let msg = DriverControlMessage::Suspend { state };
        self.control_ring(driver_pid).push(msg)?;

        // Wait for response (DRIVER_SUSPEND_ACK) with 500ms timeout.
        match self.wait_for_response(driver_pid, Duration::from_millis(500)) {
            Ok(DriverControlMessage::SuspendAck) => Ok(()),
            Ok(other) => Err(SuspendError::UnexpectedResponse(other)),
            Err(SuspendError::Timeout) => {
                // Driver did not respond: force terminate.
                process::kill(driver_pid, Signal::SIGKILL)?;
                // Mark device as unavailable until resume attempts to restart the driver.
                self.device_registry.mark_unavailable(driver_pid)?;
                Ok(()) // Continue suspend; device is orphaned but system suspends
            }
            Err(e) => Err(e),
        }
    }
}

6.2.10.6 Resume Failure Recovery

If a driver fails to resume:

Tier 1 driver failure:

  1. Attempt Function Level Reset (FLR) via PCI config space (PCI_EXP_DEVCTL_BCR_FLR).
  2. If FLR succeeds, reload the driver module (call probe() again).
  3. If FLR fails or the reload fails, mark the device unavailable and continue resume. Log the error to the console and /var/log/kernel.log.

Tier 2 driver failure:

  1. Kill the driver process.
  2. Restart the driver process (spawn a new process, re-initialize ring buffers).
  3. If the restart succeeds, the device resumes normal operation (~10–50 ms total recovery).
  4. If the restart fails 3 times, mark the device unavailable and continue resume.

Critical device failure (NVMe root filesystem, or the display controller on the only display):

  • If the root NVMe fails to resume, resume fails and the system must reboot (no recovery is possible without the filesystem).
  • If the display fails to resume, the system continues to run but shows a VT panic message (via the Tier 0 VGA fallback).
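The Tier 2 retry ladder can be modeled as a small helper (illustrative; the closure stands in for the actual spawn-and-handshake restart step):

```rust
// Sketch of the Tier 2 resume recovery ladder: up to 3 restart attempts,
// then the device is marked unavailable.

pub enum Recovery { Recovered, Unavailable }

/// `restart` returns true if the respawned driver process initialized
/// successfully.
pub fn tier2_resume_recovery<F: FnMut() -> bool>(mut restart: F) -> Recovery {
    for _ in 0..3 {
        if restart() {
            return Recovery::Recovered;
        }
    }
    Recovery::Unavailable
}
```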

6.2.10.7 S0ix Modern Standby

S0ix is not a true suspend state — the OS remains "running", but the CPU enters deep C-states (C10 on Intel, CC6 on AMD) in which the cores are powered off while the SoC stays alive.

Differences from S3:

  • CPU does not power off: the scheduler still runs and interrupts still fire, but all tasks are idle (blocked on I/O or sleeping).
  • WiFi stays live: the driver keeps the radio in D3hot (low power but still connected to the AP) and wakes on packet.
  • Display off: the panel enters DPMS Off (Section 20.4) with the backlight off, but the display controller stays powered.
  • Devices in D3hot, not D3cold: devices enter D3hot (low power, quick wake) instead of D3cold (unpowered).

Enter S0ix:

impl SuspendManager {
    pub fn enter_s0ix(&self) -> Result<(), SuspendError> {
        // 1. Notify all drivers to enter D3hot (not D3cold).
        for driver in &self.drivers {
            driver.set_power_state(PciPowerState::D3hot)?;
        }

        // 2. Set CPU P-state to minimum frequency.
        self.cpu_freq_governor.set_min_freq()?;

        // 3. Program CPU package C-state limit to C10 (deepest).
        self.cpu_cstate_governor.set_max_cstate(CState::C10)?;

        // 4. Idle all CPUs (all tasks blocked or sleeping).
        //    Scheduler tick timer set to 1 Hz (extremely long idle periods).
        self.scheduler.set_idle_mode(true)?;

        // System now in S0ix. CPUs enter C10, wake on interrupt (timer, GPIO, PCIe PME).
        Ok(())
    }
}

Exit S0ix: Any interrupt (lid open, network packet, RTC alarm, USB device activity) wakes the CPU from C10 back to C0, resume is instant (~1-5ms).


6.2.11 Integration Points

The following table maps each Section 6.2 mechanism to its consumers in other sections:

Mechanism Defined in Consumed by
RaplInterface::set_power_limit() Section 6.2.2.5 Section 6.2.9 consumer power profiles, Section 6.2.6 VM power budgets, Section 6.2.7 DCMI enforcement, thermal passive cooling (Section 6.2.3.3 RaplCooler)
PowerDomainRegistry Section 6.2.2.5 powercap sysfs (Section 6.2.4), cgroup power accounting (Section 6.2.5), VM admission control (Section 6.2.6.3)
ThermalZone trip points Section 6.2.3.1–3.2 Scheduler passive cooling (Section 6.1): Passive trips reduce cpufreq max; hwmon fan control (Section 10.10.2.1): Active trips actuate FanCooler
TripType::Critical handler Section 6.2.3.2 kernel_power_off() — no other dependencies
cgroup power.energy_uj Section 6.2.5.3 Billing/monitoring userspace agents, umka-kvm per-VM accounting (Section 6.2.6.2), Section 6.2.9.4 per-process attribution
cgroup power.limit_uw Section 6.2.5.3–5.4 umka-kvm VmPowerBudget (sets this file on VM creation), Section 6.2.9 power profiles (sets this on cgroup creation)
VmPowerBudget struct Section 6.2.6.2 umka-kvm/src/power.rs; interacts with Section 6.3 CpuBandwidth::reduce_quota()
DcmiPowerCap::on_cap_set() Section 6.2.7.4 BMC-driven cap propagation to RAPL (Section 6.2.2) and umka-kvm (Section 6.2.6)
SMBus battery registers Section 6.2.8 Battery driver, sysfs/umkafs (Section 6.2.9.5), Section 6.2.9.3 AC/battery auto-switch
PowerProfile enum Section 6.2.9.1 Section 6.2.9.2 profile translation, userspace power managers
SuspendManager::suspend() Section 6.2.10.2 System suspend/resume, Section 6.2.10.5 Tier 2 driver IPC
SuspendResume trait Section 6.2.10.3 Tier 1/Tier 2 driver implementations

6.3 CPU Bandwidth Guarantees

Inspired by: QNX Adaptive Partitioning (concept only). IP status: Built from academic scheduling theory (CBS/EDF, 1998) and Linux cgroup v2 interface. QNX-specific implementation NOT referenced. Term "adaptive partitioning" NOT used.

6.3.1 Problem

Section 6.1 defines three scheduler classes: EEVDF (normal), RT (FIFO/RR), and Deadline (EDF/CBS). Cgroups v2 provides resource limits (cpu.max caps the ceiling, cpu.weight sets relative priority).

What is missing: guaranteed minimum CPU bandwidth under overload. Current mechanisms:

  • cpu.weight is proportional sharing — if one cgroup has weight 100 and another has weight 900, the first gets 10% of whatever is available. But if the system is fully loaded, "10% of available" might not meet the minimum requirement.
  • cpu.max is a ceiling, not a floor. It limits maximum, does not guarantee minimum.
  • Deadline scheduler provides guarantees, but only for individual tasks, not for groups.

Use case: A server runs a database (needs guaranteed 40% CPU), a web frontend (needs guaranteed 20%), and batch jobs (uses the rest). Under overload, the batch jobs must not be able to starve the database below 40%, even if the batch jobs are numerous.

6.3.2 Design: CBS-Based Group Bandwidth Reservation

The solution combines the existing Deadline scheduler's Constant Bandwidth Server (CBS) algorithm with cgroup v2's group hierarchy.

CBS (Abeni & Buttazzo, 1998) provides bandwidth isolation: each server (task or group) is assigned a bandwidth Q/P (Q microseconds of CPU time every P microseconds). The server is guaranteed this bandwidth regardless of other servers' behavior. Unused bandwidth is redistributed (work-conserving).

This is a well-established academic algorithm with no IP encumbrance.

6.3.3 Cgroup v2 Interface

New control file in the cpu controller, additive to the existing interface:

/sys/fs/cgroup/<group>/cpu.guarantee

Format: $QUOTA $PERIOD (microseconds), identical to cpu.max format.

Example:

# Database cgroup: guaranteed 40% CPU bandwidth
echo "400000 1000000" > /sys/fs/cgroup/database/cpu.guarantee
# = 400ms of CPU time every 1000ms = 40% guaranteed

# Web frontend: guaranteed 20%
echo "200000 1000000" > /sys/fs/cgroup/web/cpu.guarantee

# Batch jobs: no guarantee (uses whatever is left)
# cpu.guarantee defaults to "max" (no guarantee)

# Total guaranteed: 60%. Remaining 40% is shared by weight among all.
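Parsing the cpu.guarantee format can be sketched as follows (hypothetical helper; the outer `Option` is the parse result, and `Some(None)` encodes "max", i.e. no guarantee):

```rust
// Sketch of parsing "$QUOTA $PERIOD" (µs) or "max" from cpu.guarantee.

/// Returns None on a malformed write (the kernel would reject it with
/// -EINVAL), Some(None) for "max", and Some(Some((quota, period))) otherwise.
pub fn parse_guarantee(s: &str) -> Option<Option<(u64, u64)>> {
    let s = s.trim();
    if s == "max" {
        return Some(None); // no guarantee
    }
    let mut it = s.split_whitespace();
    let quota: u64 = it.next()?.parse().ok()?;
    let period: u64 = it.next()?.parse().ok()?;
    if it.next().is_some() || period == 0 {
        return None; // trailing tokens or zero period are invalid
    }
    Some(Some((quota, period)))
}
```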

Semantics:

  • A group with cpu.guarantee set is backed by a CBS server at the specified bandwidth.
  • The CBS server ensures the group receives at least its guaranteed bandwidth even under full system load.
  • When the group is idle, its unused bandwidth is redistributed to other groups (work-conserving). This is inherent to CBS — no special logic needed.
  • cpu.guarantee cannot exceed cpu.max (if both are set, guarantee <= max).
  • Sum of all cpu.guarantee across the system must not exceed total CPU bandwidth. Attempting to overcommit returns -ENOSPC.
  • Nested cgroups: a child's guarantee comes out of its parent's guarantee budget.

Multi-core accounting: cpu.guarantee specifies system-wide bandwidth, not per-CPU. The implementation uses a global budget pool with per-CPU runtime slices (same pattern as Linux EEVDF bandwidth throttling for cpu.max). A CBS server with a 40% guarantee on a 4-CPU system gets a global budget of 400ms per 1000ms period. This budget is drawn down as tasks in the group run on any CPU. When exhausted, all tasks in the group are throttled until the next period. Per-NUMA-node variants are tracked in BandwidthAccounting.per_node_guaranteed for NUMA-aware scheduling hints, but the guarantee is enforced globally.
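The global-pool-with-per-CPU-slices pattern can be sketched as follows. This is a minimal illustration of the drawdown step, not the kernel code; `GlobalBudget`, `draw_slice`, and the 5 ms slice size are illustrative names and values:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Per-CPU slice size drawn from the group's global pool (illustrative).
const SLICE_US: i64 = 5_000;

pub struct GlobalBudget {
    remaining_us: AtomicI64, // global budget for the current period
}

impl GlobalBudget {
    pub fn new(quota_us: i64) -> Self {
        Self { remaining_us: AtomicI64::new(quota_us) }
    }

    /// A CPU calls this when its local slice is empty. Returns the granted
    /// slice in microseconds (possibly partial near exhaustion), or 0 to
    /// signal that the group must be throttled until replenishment.
    pub fn draw_slice(&self) -> i64 {
        let prev = self.remaining_us.fetch_sub(SLICE_US, Ordering::AcqRel);
        if prev <= 0 {
            // Pool was already empty: undo the draw and throttle.
            self.remaining_us.fetch_add(SLICE_US, Ordering::AcqRel);
            0
        } else if prev < SLICE_US {
            // Partial slice: return the over-drawn remainder to the pool.
            self.remaining_us.fetch_add(SLICE_US - prev, Ordering::AcqRel);
            prev
        } else {
            SLICE_US
        }
    }
}
```

Because each CPU charges task runtime against its local slice, the shared atomic is touched only once per slice rather than on every accounting update.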

RT and DL task handling: RT and DL tasks within a CBS-guaranteed cgroup bypass the CBS server's budget accounting. Their guarantees come from the RT/DL schedulers directly. The CBS guarantee applies only to EEVDF-class (normal) tasks within the group. This matches Linux's behavior where cpu.max throttling does not apply to RT tasks.

6.3.4 Kernel-Internal Design

// umka-core/src/sched/cbs_group.rs (kernel-internal)

/// A CBS bandwidth server attached to a cgroup.
pub struct CbsGroupServer {
    /// Bandwidth: quota microseconds per period microseconds.
    pub quota_us: u64,
    pub period_us: u64,

    /// Current budget remaining in this period. Signed because CBS
    /// allows transient overspend: if a task's time slice crosses the
    /// budget boundary (e.g., budget was 50μs remaining but the tick
    /// granularity is 1ms), the budget goes negative. When negative:
    ///   1. The server is immediately throttled (`throttled` set to true).
    ///   2. All tasks in this server's runqueue are dequeued from the
    ///      CPU's run queue and parked until the next period.
    ///   3. The deficit carries forward: at period replenishment,
    ///      `budget_remaining_us = quota_us + budget_remaining_us`
    ///      (adding a negative value reduces the next period's budget).
    ///   4. The deadline is pushed back by one period regardless of
    ///      deficit magnitude — CBS guarantees temporal isolation by
    ///      limiting each server to its declared bandwidth over any
    ///      sliding window.
    /// Admission control (Section 6.3.5) ensures the sum of all servers'
    /// quota/period ratios does not exceed the system's total CPU
    /// bandwidth, preventing starvation even when individual servers
    /// overshoot within a period.
    ///
    /// **Deficit cap**: To prevent a buggy or malicious task from accumulating
    /// an unbounded deficit that would starve it for many periods, the budget
    /// is clamped to a minimum of `-quota_us` (one full period's worth of deficit).
    /// At replenishment, if `budget_remaining_us < -quota_us`, it is set to
    /// `-quota_us` before adding `quota_us`, ensuring the server always receives
    /// at least `max(0, quota_us - |deficit|)` budget. This bounds the recovery
    /// time to at most one period regardless of how negative the deficit became.
    /// (Implementation note: since `quota_us` is `u64`, the comparison is
    /// `budget_remaining_us < -(quota_us as i64)` — the cast is safe because
    /// `quota_us` never exceeds `i64::MAX` in practice; a 1-second period uses
    /// 1_000_000, well within range.)
    pub budget_remaining_us: AtomicI64,

    /// Absolute deadline of current server period.
    pub current_deadline_ns: u64,

    /// Whether this server is currently throttled (budget exhausted).
    pub throttled: AtomicBool,

    /// Run queue of EEVDF-class (normal) tasks belonging to this cgroup.
    /// RT/DL tasks bypass this server and are scheduled by their
    /// respective schedulers directly (see Section 6.3.3).
    pub runqueue: EevdfRunQueue,  // Reuses existing EEVDF tree

    /// Total CPU time consumed by this server (for accounting).
    pub total_runtime_ns: AtomicU64,

    /// High-resolution timer for period-boundary replenishment.
    ///
    /// Armed to fire at `current_deadline_ns`. When a CBS task is blocked
    /// (idle before exhausting its budget), this timer ensures replenishment
    /// still occurs at the period boundary so the next period begins with a
    /// full budget. See CBS Replenishment below.
    pub replenish_timer: HrTimer,
}

CBS Replenishment

When a CBS task exhausts its budget (budget_remaining_us reaches zero or goes negative):

  1. Set budget_remaining_us = quota_us + max(budget_remaining_us, -quota_us) (replenish by one quota while carrying forward the capped deficit, per the deficit-cap mechanism described above; a budget that reached exactly zero replenishes to the full quota).
  2. Set current_deadline_ns = current_deadline_ns + period_us * 1000 (postpone the absolute deadline by one period, converting microseconds to nanoseconds).
  3. If the task is still runnable, re-enqueue it in the CBS server's internal EEVDF tree with the new deadline.

The replenish_timer fires at current_deadline_ns to handle the case where a blocked task does not exhaust its budget within a period. On timer fire:

if budget_remaining_us < quota_us as i64:
    budget_remaining_us = quota_us as i64   // replenish
    current_deadline_ns += period_us * 1000 // advance deadline
    if task is runnable: re-enqueue in CBS EEVDF tree
rearm replenish_timer to fire at new current_deadline_ns

This ensures both exhaustion-triggered and timer-triggered replenishment follow the same CBS invariant: the server's deadline advances by exactly one period per replenishment, bounding the guaranteed bandwidth to quota_us / period_us over any sliding window.
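The replenishment rule, including the deficit cap from the struct documentation above, reduces to a small pure function. This is a hedged sketch with an illustrative name, not the kernel implementation:

```rust
/// Compute the budget for the next period from the budget at replenishment
/// time (possibly negative) and the configured quota. Illustrative helper.
pub fn replenish(budget_remaining_us: i64, quota_us: u64) -> i64 {
    let quota = quota_us as i64;
    if budget_remaining_us >= 0 {
        // Unused positive budget is not banked across periods (no backlog).
        quota
    } else {
        // Deficit cap: at most one full period's deficit carries forward,
        // so recovery takes at most one period.
        quota + budget_remaining_us.max(-quota)
    }
}
```

For a 40%/1s server (quota_us = 400_000): a 100 ms overshoot leaves 300 ms of budget for the next period, while any overshoot beyond one full quota is clamped so the next period starts at zero rather than deeper in debt.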

Integration with existing scheduler (Section 6.1):

Per-CPU run queue structure (updated):

    +------------------+
    | RT Queue         |   <- Highest priority (unchanged)
    +------------------+
    | DL Queue         |   <- Deadline tasks (unchanged)
    +------------------+
    | CBS Group Servers|   <- NEW: CBS servers for guaranteed groups
    |  +-- db_server   |      Each server has its own EEVDF tree inside
    |  +-- web_server  |
    +------------------+
    | EEVDF Tree       |   <- Normal tasks without guarantee (unchanged)
    +------------------+

Scheduling decision:

  1. Check RT queue (highest priority) — unchanged.
  2. Check DL queue (deadline tasks) — unchanged.
  3. Check CBS group servers (ordered by earliest deadline):
     - If a server has budget and runnable tasks: pick its next task.
     - CBS guarantees each server receives its bandwidth.
  4. Check EEVDF tree (normal tasks without guarantee) — unchanged.

Unguaranteed tasks (step 4) run when all CBS servers are idle or throttled. In addition, CBS servers that under-utilize their budget donate the slack back to step 4 (work-conserving).
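The four-step pick order can be sketched as a pure function. The types below are simplified stand-ins (task ids instead of task structs, tuples instead of CBS server objects); this is an illustration of the ordering, not the scheduler code:

```rust
/// Pick the next task id following the class order described above.
/// CBS servers are modeled as (deadline_ns, has_budget, next_task) tuples.
pub fn pick_next(
    rt: Option<u32>,
    dl: Option<u32>,
    cbs_servers: &[(u64, bool, Option<u32>)],
    eevdf: Option<u32>,
) -> Option<u32> {
    if rt.is_some() { return rt; }   // 1. RT queue first
    if dl.is_some() { return dl; }   // 2. then deadline tasks
    // 3. CBS servers by earliest deadline, skipping throttled/idle servers.
    let mut best: Option<(u64, u32)> = None;
    for &(deadline, has_budget, task) in cbs_servers {
        if let (true, Some(t)) = (has_budget, task) {
            if best.map_or(true, |(d, _)| deadline < d) {
                best = Some((deadline, t));
            }
        }
    }
    if let Some((_, t)) = best { return Some(t); }
    eevdf                            // 4. plain EEVDF tree last
}
```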

6.3.5 Overcommit Prevention

// umka-core/src/sched/cbs_group.rs

/// System-wide guarantee accounting.
pub struct BandwidthAccounting {
    /// Total guaranteed bandwidth across all CBS servers.
    /// Stored as fraction * 1_000_000 (e.g., 400000 = 40%).
    pub total_guaranteed: AtomicU64,

    /// Maximum allowable guarantee (default: 95%).
    /// Reserves 5% for kernel threads, interrupts, housekeeping.
    pub max_guarantee: u64,

    /// Per-NUMA-node guaranteed bandwidth (for NUMA-aware scheduling).
    /// Dynamically sized at boot based on discovered NUMA node count,
    /// following the same pattern as `NumaTopology` (Section 4.1.8).
    pub per_node_guaranteed: &'static [AtomicU64],
}

Setting cpu.guarantee fails with -ENOSPC if total_guaranteed + new_guarantee would exceed max_guarantee. This prevents overcommit and guarantees all promises are simultaneously satisfiable.
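A sketch of this admission check, assuming guarantees are tracked as parts-per-million in an atomic counter as in `BandwidthAccounting`; the `try_reserve` helper and its compare-exchange loop are illustrative, not the kernel API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Atomically add a new guarantee (parts-per-million) only if the total
/// stays within max_guarantee. The CAS loop prevents two concurrent
/// cpu.guarantee writers from jointly overcommitting.
pub fn try_reserve(total: &AtomicU64, max_guarantee: u64, new_ppm: u64) -> Result<(), i32> {
    const ENOSPC: i32 = 28;
    let mut cur = total.load(Ordering::Acquire);
    loop {
        if cur + new_ppm > max_guarantee {
            return Err(-ENOSPC); // would overcommit: reject the write
        }
        match total.compare_exchange_weak(
            cur, cur + new_ppm, Ordering::AcqRel, Ordering::Acquire,
        ) {
            Ok(_) => return Ok(()),
            Err(seen) => cur = seen, // raced with another writer: retry
        }
    }
}
```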

6.3.6 Interaction with Existing Controls

| Control | Meaning | Interaction with cpu.guarantee |
|---------|---------|--------------------------------|
| cpu.weight | Relative share of excess CPU | Distributes CPU beyond guaranteed minimums |
| cpu.max | Maximum CPU ceiling | Guarantee cannot exceed max; max still enforced |
| cpu.guarantee | Minimum CPU floor | NEW: CBS-backed guaranteed bandwidth |
| cpu.pressure | PSI pressure info | Reports pressure relative to guarantee |

Example: a cgroup with cpu.guarantee=40%, cpu.max=60%, cpu.weight=100:

  - Always gets at least 40% CPU (even under full system load)
  - Never gets more than 60% CPU (even if system is idle — ceiling applies)
  - Between 40% and 60%, shares proportionally with other cgroups by weight

6.3.7 Use Case: Driver Tier Isolation

CPU guarantees integrate naturally with the driver tier model:

# Ensure Tier 1 drivers always have CPU bandwidth for I/O processing
echo "200000 1000000" > /sys/fs/cgroup/umka-tier1/cpu.guarantee  # 20%

# Ensure Tier 2 drivers have some guaranteed bandwidth
echo "50000 1000000" > /sys/fs/cgroup/umka-tier2/cpu.guarantee   # 5%

A misbehaving Tier 2 driver process spinning in a loop cannot starve Tier 1 NVMe or NIC drivers of CPU time.


6.4 Power Budgeting

6.4.1 Problem

Datacenters in 2026 are power-wall limited. A rack has a fixed power budget (typically 20-40 kW). Power, not compute, is the scarce resource.

Linux has power management (cpufreq, DVFS, C-states, RAPL readout) but no power budgeting. There is no way to say "this container gets at most 150W total across CPU, GPU, memory, and NIC." There is no way for the scheduler to make holistic power-performance tradeoffs.

Relationship to Section 6.1.5 (Heterogeneous CPU / EAS): Section 6.1.5 covers Energy-Aware Scheduling at the per-task level — selecting the most energy-efficient core type (P-core vs E-core) for each task using OPP tables and PELT utilization. This section covers a complementary concern: per-cgroup power budgeting — enforcing total watt caps across all power domains (CPU + GPU + DRAM + NIC). The two mechanisms interact: EAS picks the optimal core, power budgeting enforces the envelope.

6.4.2 Design: Power as a Schedulable Resource

Power joins CPU time, memory, and accelerator time as a kernel-managed resource with cgroup integration.

// umka-core/src/power/budget.rs

/// Maximum number of power domains tracked by the power budgeting subsystem.
/// A typical datacenter server has:
///   - 1-2 CPU packages (CpuPackage)
///   - 8-128 CPU cores (CpuCore, if per-core RAPL is available)
///   - 1-2 DRAM controllers (Dram)
///   - 0-8 GPUs/accelerators (Accelerator)
///   - 1-4 NICs (Nic, if power-metered)
///   - 1-8 NVMe SSDs (Storage, if power-metered)
///   - 1 platform-level domain (Platform)
/// Setting 256 covers high-end servers with per-core monitoring enabled.
/// The ArrayVec avoids heap allocation on the tick hot path.
pub const MAX_POWER_DOMAINS: usize = 256;

/// Maximum number of cgroups tracked by the power budgeting subsystem.
/// Cgroups are typically hierarchical; a large server may have:
///   - 1 root cgroup
///   - 10-100 system.slice cgroups (systemd services)
///   - 10-1000 user.slice cgroups (user sessions, containers)
/// Setting 4096 covers large container hosts without excessive memory.
/// The FixedHashMap uses open addressing with power-of-two sizing.
pub const MAX_POWER_CGROUPS: usize = 4096;

/// Platform-agnostic power domain.
///
/// This is the generic cross-architecture power domain object used by the
/// power-budget enforcer (Section 6.4.4) and cgroup power accounting
/// (Section 6.2.5). It identifies a device by `DeviceNodeId` and tracks
/// current and maximum power draw regardless of the underlying measurement
/// mechanism (RAPL, SCMI, ACPI, or estimation).
///
/// Contrast with `RaplDomain` (Section 6.2.2.5), which is x86/RAPL-specific
/// and carries a `RaplInterface` hardware handle. `GenericPowerDomain` is the
/// unified abstraction that upper layers use after the architecture-specific
/// boot driver has populated the `PowerDomainRegistry`.
pub struct GenericPowerDomain {
    /// Domain identifier (matches device registry node).
    pub device_id: DeviceNodeId,

    /// Domain type.
    pub domain_type: PowerDomainType,

    /// Current power draw (milliwatts, updated every tick).
    pub current_mw: AtomicU32,

    /// Maximum power this domain can draw (TDP or configured limit).
    /// Initialized from ACPI PPCC (Participant Power Control Capabilities)
    /// tables where available; these define hardware power limits that
    /// the OS must respect. Falls back to TDP from CPUID/ACPI otherwise.
    pub max_mw: u32,

    /// Current performance level (0 = lowest power, 100 = maximum).
    pub perf_level: AtomicU32,

    /// Power measurement source.
    pub measurement: PowerMeasurement,
}

// Canonical definition — see Section 6.2.2.5 above.
#[repr(u32)]
pub enum PowerDomainType {
    /// CPU package (includes all cores and uncore).
    CpuPackage  = 0,
    /// CPU core subset (per-core RAPL on AMD Zen 2+; proportional estimate on Intel).
    CpuCore     = 1,
    /// DRAM (memory controller).
    Dram        = 2,
    /// GPU / accelerator.
    Accelerator = 3,
    /// NIC (if power-metered).
    Nic         = 4,
    /// NVMe SSD (if power-metered).
    Storage     = 5,
    /// Entire system (platform-level RAPL or BMC).
    Platform    = 6,
}

#[repr(u32)]
pub enum PowerMeasurement {
    /// Intel RAPL (Running Average Power Limit) via MSR.
    IntelRapl       = 0,
    /// AMD RAPL equivalent.
    AmdRapl         = 1,
    /// ARM SCMI (System Control and Management Interface).
    ArmScmi         = 2,
    /// ACPI Power Meter device.
    AcpiPowerMeter  = 3,
    /// BMC/IPMI (out-of-band, lower frequency).
    BmcIpmi         = 4,
    /// Estimated from utilization (no hardware meter).
    Estimated       = 5,
}

Per-architecture power measurement details:

Intel/AMD RAPL (x86):
  - Read via MSR: MSR_PKG_ENERGY_STATUS (package), MSR_PP0_ENERGY_STATUS (cores),
    MSR_DRAM_ENERGY_STATUS (DRAM), MSR_PP1_ENERGY_STATUS (GPU/uncore).
  - Resolution: ~15.3 μJ per LSB (Intel), ~15.6 μJ (AMD).
  - Read cost: ~100ns per MSR read. 6 domains × 100ns = 600ns per tick.
  - Per-core RAPL (AMD Zen 2+ via `MSR_CORE_ENERGY_STAT`, MSR `0xC001_029A`):
    per-CPU energy attribution. Enables precise per-cgroup power accounting without
    proportional estimation. **Intel does not provide per-core energy counters** —
    Intel RAPL PP0 is an all-cores aggregate for the entire package. On Intel
    platforms, per-CPU energy is estimated proportionally from PP0 using utilization
    weights (less precise than AMD's direct per-core counters).
  - Overflow: 32-bit energy counters. Overflow interval depends on the CPU model's
    energy unit and current power draw — it MUST be computed at runtime. At boot,
    the kernel reads MSR_RAPL_POWER_UNIT to extract the energy unit (bits 12:8),
    giving energy_unit_joules = 2^(-ESU) (e.g., ESU=14 → ~61 μJ on Haswell
    and later (including Skylake), ESU=16 → ~15.3 μJ on Sandy Bridge / Ivy
    Bridge (the architecture default)). The overflow interval is then:
      overflow_seconds = 2^32 * energy_unit_joules / current_power_watts
    For a 200W package with ESU=16: 2^32 * 15.3e-6 / 200 ≈ 329 seconds.
    For a 500W package with ESU=14: 2^32 * 61e-6 / 500 ≈ 524 seconds.
    The kernel sets the RAPL polling interval to min(overflow_seconds / 2, tick)
    to guarantee no counter wraparound is missed. With a 4ms tick and typical
    overflow intervals of 329-524 seconds, the formula always evaluates to `tick`
    (4ms) on current hardware — the overflow margin is enormous. The calculation
    is still performed at runtime (not assumed) to handle hypothetical hardware
    where very high power draw or very coarse energy units could produce an
    overflow interval shorter than twice the tick period.
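The overflow-interval formula can be checked with a few lines; the function names are illustrative, and the energy unit is derived directly from the ESU field as 2^(-ESU) joules:

```rust
/// Overflow interval of the 32-bit RAPL energy counter in seconds.
/// `esu` is the Energy Status Unit field (bits 12:8 of MSR_RAPL_POWER_UNIT).
pub fn rapl_overflow_seconds(esu: u32, power_watts: f64) -> f64 {
    let energy_unit_j = 2f64.powi(-(esu as i32)); // 2^-ESU joules per LSB
    (u32::MAX as f64 + 1.0) * energy_unit_j / power_watts
}

/// Polling interval per the rule in the text: min(overflow/2, tick).
pub fn rapl_poll_interval_s(esu: u32, power_watts: f64, tick_s: f64) -> f64 {
    (rapl_overflow_seconds(esu, power_watts) / 2.0).min(tick_s)
}
```

With ESU=16 at 200 W the exact value is 2^32 · 2^-16 / 200 ≈ 327.7 s, and with ESU=14 at 500 W it is 2^18 / 500 ≈ 524.3 s, matching the figures above; the resulting poll interval collapses to the 4 ms tick in both cases.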

ARM SCMI (AArch64/ARMv7):
  - SCMI (System Control and Management Interface, ARM DEN 0056) is a standardized
    protocol for communication between the OS and a System Control Processor (SCP).
  - Power domains are discovered via SCMI_POWER_DOMAIN_ATTRIBUTES (protocol 0x11, message 0x03).
  - Power measurement: SCMI_SENSOR_READING_GET (protocol 0x15, message 0x06) reads sensor values
    from the SCP. Sensor types include POWER (watts), ENERGY (joules), CURRENT (amps).
  - Read cost: ~1-5 μs per SCMI message (shared memory + doorbell interrupt to SCP).
    Higher than RAPL (~100ns) but still within the 4ms tick budget.
  - Available on: ARM SBSA servers (AWS Graviton, Ampere), Cortex-M SCP-based
    platforms, and any SoC implementing SCMI power management.
  - Fallback: If SCMI is not available (e.g., simple embedded boards without SCP),
    fall back to Estimated mode.

  Power domain mapping:
    SCMI domain ID → UmkaOS GenericPowerDomain:
      - SCMI_POWER_DOMAIN type "CPU" → PowerDomainType::CpuPackage or CpuCore
      - SCMI_POWER_DOMAIN type "GPU" → PowerDomainType::Accelerator
      - SCMI_POWER_DOMAIN type "MEM" → PowerDomainType::Dram
      - Platform-level SCMI sensor → PowerDomainType::Platform

RISC-V SBI PMU:
  - RISC-V has no standard power measurement interface. The SBI PMU extension
    (ratified) provides performance counters but not power counters.
  - On platforms with BMC/IPMI (e.g., datacenter RISC-V): use BmcIpmi source.
  - On platforms without any power measurement: use Estimated mode.
  - Future: the RISC-V power management task group is defining power management
    extensions. UmkaOS will adopt these when ratified.

Estimated (fallback for all architectures):
  - When no hardware power meter is available, UmkaOS estimates power from:
    a. CPU utilization × TDP per core (linear model, ~10% accuracy).
    b. Frequency scaling: power ∝ V² × f. Frequency from cpufreq.
    c. C-state residency: idle cores at deep C-states draw ~0.5-2W.
  - Estimation runs in the scheduler tick handler (zero additional overhead).
  - Accuracy: ±20-30% vs actual hardware measurement. Sufficient for coarse
    power budgeting (e.g., "keep this rack under 30kW") but not for fine-grained
    per-cgroup accounting.
  - The estimated source is logged at boot:
    umka: power: No hardware power meter detected, using estimated power model
    umka: power: Estimated power accuracy: ±25%. Consider RAPL/SCMI hardware.
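A minimal sketch of the linear model (a) combined with the V²·f scaling (b), under the simplifying assumption that voltage scales roughly linearly with frequency (so active power scales as (f/f_max)³); all names and constants are illustrative, not measured:

```rust
/// Estimate per-core power in milliwatts from utilization, frequency,
/// a per-core TDP share, and a deep C-state idle floor. Illustrative only;
/// the real model's accuracy is ±20-30% as noted above.
pub fn estimate_core_power_mw(
    util_percent: u32,   // 0..=100, from PELT-style utilization
    freq_mhz: u32,       // current frequency from cpufreq
    max_freq_mhz: u32,
    tdp_per_core_mw: u32,
    idle_mw: u32,        // deep C-state floor, ~500-2000 mW
) -> u32 {
    let f = freq_mhz as f64 / max_freq_mhz as f64;
    // Active power ~ utilization * TDP share * (f/f_max)^3.
    let active = tdp_per_core_mw as f64 * (util_percent as f64 / 100.0) * f * f * f;
    idle_mw + active as u32
}
```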

6.4.3 Cgroup Integration

/sys/fs/cgroup/<group>/power.max
# Maximum total power budget for this cgroup (milliwatts).
# Enforced across ALL power domains (CPU + GPU + memory + NIC).
# Format: "150000" (150W)
# "max" = no limit (default)

/sys/fs/cgroup/<group>/power.current
# Current power draw by this cgroup (milliwatts, read-only).
# Sum of all power domains attributed to this cgroup's processes.

/sys/fs/cgroup/<group>/power.stat
# Power statistics:
#   energy_uj <total energy consumed in microjoules>
#   throttle_count <times power budget was exceeded and throttled>
#   throttle_us <total microseconds spent throttled>
#   avg_power_mw <average power over last 10 seconds>

/sys/fs/cgroup/<group>/power.weight
# Relative share of excess power budget (like cpu.weight).
# Default: 100. Higher = more power when contended.

/sys/fs/cgroup/<group>/power.domains
# Per-domain power limits (optional, for fine-grained control).
# Format: "cpu 80000 gpu 60000 dram 10000"
# If not set, the global power.max is split by the kernel
# based on workload demand.

6.4.4 Power-Aware Scheduler

// umka-core/src/sched/power.rs

pub use crate::power::budget::MAX_POWER_CGROUPS; // 4096; must be a power of two (canonical definition in budget.rs)

/// A fixed-capacity open-addressing hash map with compile-time maximum size.
/// Uses power-of-two capacity with linear probing. Never allocates — backed
/// by a static or slab-allocated array. Insertion returns Err if at capacity.
/// N must be a power of two; load factor is capped at 75% (0.75 * N entries).
///
/// Hash function: FxHash (Firefox's fast integer hash — multiply by a golden
/// ratio constant, shift right). FxHash is ideal for small integer keys
/// (CgroupId, DeviceId) and has no allocation or state. For string keys,
/// SipHash-1-3 is used (DoS-resistant).
///
/// Collision resolution: linear probing with step size 1. At 75% load factor
/// and power-of-two sizing, expected probe length is ~2 (Birthday paradox
/// bound). At capacity (75% of N), insert returns `Err(MapFull)`.
///
/// The capacity `N` is fixed at compile time; the CBS path instantiates the
/// map with `N` at least twice `max_concurrent_cbs_tasks` (effective load
/// factor <= 0.5), where `max_concurrent_cbs_tasks` is derived from the
/// system's admission-control limit (Section 6.3.5). This sizing guarantees
/// that no insertion fails during the tick hot path as long as the admitted
/// task count does not exceed the limit enforced at `cpu.guarantee` write time.
///
/// On the rare case of map overflow (should not occur with correct
/// pre-allocation; indicates a kernel bug or admission-control bypass):
/// charge `budget_remaining_us = 0` for the current tick as a conservative
/// fallback — the task is considered to have exhausted its budget for this
/// tick. This is safe: it errs on the side of throttling the task rather than
/// allowing unaccounted CPU consumption, preserving CBS bandwidth isolation
/// guarantees. A kernel warning is emitted unconditionally on overflow
/// (not gated on debug_assertions) since this path should never be reached.
struct FixedHashMap<K: Hash + Eq, V, const N: usize> {
    entries: [Option<(K, V)>; N],
    count: usize,
}
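An illustrative implementation of the probing scheme, simplified to a Vec-backed table with u64 keys and a Fibonacci-hash stand-in for FxHash; `FixedMap` is a sketch, not the kernel type:

```rust
/// Vec-backed variant of the fixed-capacity open-addressing map described
/// above. Capacity must be a power of two, as in the kernel type.
pub struct FixedMap<V: Clone> {
    entries: Vec<Option<(u64, V)>>,
    count: usize,
}

impl<V: Clone> FixedMap<V> {
    pub fn new(capacity_pow2: usize) -> Self {
        assert!(capacity_pow2.is_power_of_two());
        Self { entries: vec![None; capacity_pow2], count: 0 }
    }

    /// Multiplicative hash: multiply by 2^64/phi, mask to the capacity.
    fn slot(&self, key: u64) -> usize {
        (key.wrapping_mul(0x9E37_79B9_7F4A_7C15) as usize) & (self.entries.len() - 1)
    }

    /// Linear-probing insert; Err(()) models MapFull at the 75% load cap.
    pub fn insert(&mut self, key: u64, val: V) -> Result<(), ()> {
        if self.count * 4 >= self.entries.len() * 3 {
            return Err(());
        }
        let mask = self.entries.len() - 1;
        let mut i = self.slot(key);
        // Walk until an empty slot or an existing entry with the same key.
        loop {
            match &self.entries[i] {
                Some((k, _)) if *k != key => i = (i + 1) & mask,
                _ => break,
            }
        }
        if self.entries[i].is_none() {
            self.count += 1;
        }
        self.entries[i] = Some((key, val));
        Ok(())
    }

    pub fn get(&self, key: u64) -> Option<&V> {
        let mask = self.entries.len() - 1;
        let mut i = self.slot(key);
        // The load-factor cap guarantees an empty slot terminates the probe.
        loop {
            match &self.entries[i] {
                Some((k, v)) if *k == key => return Some(v),
                None => return None,
                _ => i = (i + 1) & mask,
            }
        }
    }
}
```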

/// Power budget enforcer.
/// Runs at scheduler tick frequency (~4ms). NOT per-scheduling-decision.
pub struct PowerBudgetEnforcer {
    /// Power domains on this machine.
    /// Populated at boot from ACPI/DT hardware discovery. The number of power
    /// domains is small and bounded (typically <=16: package + cores + DRAM +
    /// accelerators). Uses a fixed-capacity array sized to MAX_POWER_DOMAINS.
    domains: ArrayVec<GenericPowerDomain, MAX_POWER_DOMAINS>,

    /// Per-cgroup power accounting.
    /// Uses a fixed-size hash table (open addressing, power-of-two size) rather
    /// than BTreeMap: this runs at tick frequency (~4ms) so O(1) average-case
    /// lookup is required. Maximum cgroup count is bounded by the cgroup hierarchy
    /// (typically <1024 active cgroups). Entries are inserted and removed only on
    /// cgroup creation/deletion, never on the tick hot path.
    cgroup_power: FixedHashMap<CgroupId, CgroupPowerState, MAX_POWER_CGROUPS>,

    /// Global power budget (rack-level, from BMC or admin-configured).
    global_budget_mw: Option<u32>,
}

pub struct CgroupPowerState {
    /// Budget for this cgroup (from power.max).
    budget_mw: u32,

    /// Current attributed power draw.
    current_mw: u32,

    /// Running energy counter (microjoules).
    energy_uj: u64,

    /// Is this cgroup currently throttled?
    throttled: bool,

    /// Throttle mechanism:
    /// 1. Reduce CPU frequency (cpufreq) for this cgroup's cores.
    /// 2. Reduce accelerator clock (AccelBase set_performance_level).
    /// 3. As last resort: CPU throttling (delay scheduling).
    /// At most one action per PowerDomainType variant (7 variants), so bounded to 8.
    /// Using 8 instead of MAX_POWER_DOMAINS (256) saves ~3 KB per cgroup entry.
    throttle_actions: ArrayVec<ThrottleAction, 8>,
}

pub enum ThrottleAction {
    /// Reduce CPU frequency to this level (MHz).
    CpuFrequency(u32),
    /// Reduce accelerator performance level.
    AccelPerformance { device_id: DeviceNodeId, level: u32 },
    /// Throttle CPU time (insert idle cycles).
    CpuThrottle { duty_cycle_percent: u32 },
}

Enforcement flow:

Every scheduler tick (~4ms):
  1. Read power counters from all domains (1 MSR read per domain, ~100ns each).
  2. Attribute power to cgroups based on CPU time + accelerator time share.
     Note: power attribution is an APPROXIMATION. RAPL gives package-level
     power, not per-process. Attribution model:
       Per-core RAPL (AMD Zen 2+): precise per-CPU attribution. Intel: proportional.
       Package-level RAPL: proportional to (cgroup CPU time / total CPU time)
         weighted by frequency at time of execution.
       Accelerator: proportional to AccelBase get_utilization() per cgroup.
     This is the same limitation as Linux (perf energy-cores event).
     Exact per-process power metering requires hardware not yet available.
  3. For each cgroup:
     a. Is current_mw > budget_mw?
     b. Yes → rebuild throttle_actions array (see selection algorithm below).
     c. No → clear throttle_actions and release any active throttles.
  4. Total overhead: ~1 μs per tick. Fraction of the 4ms tick: 0.025%.

Throttle action selection algorithm (step 3b):
  Input: excess_mw = current_mw - budget_mw for the cgroup.
  Output: throttle_actions filled in priority order.

  Step 1 — CPU frequency reduction:
    For each CPU frequency domain that contains CPUs running this cgroup's tasks:
      current_pstate = cpufreq_get_current_pstate(domain)
      If current_pstate > PSTATE_MIN:
        Add ThrottleAction::CpuFrequency(pstate_to_mhz(current_pstate - 1))
        Estimated power reduction: (current_mw * pstate_freq_ratio_drop) mw
    If estimated reduction ≥ excess_mw: stop here (frequency alone is enough).

  Step 2 — Accelerator clock reduction:
    Only if remaining_excess_mw > 0 after Step 1.
    For each accelerator context used by this cgroup:
      current_level = accel_vtable.get_performance_level(device_id)
      If current_level > 0:
        Add ThrottleAction::AccelPerformance { device_id, level: current_level - 1 }
    If estimated reduction ≥ remaining_excess_mw: stop here.

  Step 3 — CPU duty-cycle throttle (last resort):
    Only if remaining_excess_mw > 0 after Steps 1 and 2.
    duty_cycle = max(50, 100 - (remaining_excess_mw * 100 / current_mw))
    Add ThrottleAction::CpuThrottle { duty_cycle_percent: duty_cycle }

  Invariants:
    - throttle_actions is rebuilt from scratch every tick (no incremental state).
    - At most one CpuFrequency entry per frequency domain.
    - At most one AccelPerformance entry per device_id.
    - At most one CpuThrottle entry (covers all CPUs for this cgroup).
    - Array capacity 8 is sufficient: at most one entry per PowerDomainType (7 types)
      plus the CpuThrottle fallback = 8 maximum.
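Step 3's duty-cycle arithmetic as a standalone sketch (illustrative helper name, integer math as a kernel would use):

```rust
/// Duty cycle for the last-resort CPU throttle: shed just enough CPU time
/// to cover the remaining excess, but never drop below the 50% floor.
pub fn duty_cycle_percent(remaining_excess_mw: u32, current_mw: u32) -> u32 {
    // Fraction of current power we need to shed, in percent.
    let reduction = remaining_excess_mw.saturating_mul(100) / current_mw.max(1);
    (100u32.saturating_sub(reduction)).max(50)
}
```

For a cgroup drawing 150 W with 30 W of unresolved excess, the duty cycle is 80%; an excess beyond half the current draw clamps to the 50% floor (frequency and accelerator throttling are expected to have absorbed most of the excess before this step).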

6.4.5 System-Level Power Accounting

/sys/kernel/umka/power/
    energy_total_uj        # Total system energy since boot (microjoules)
    budget_mw              # System-wide power budget (admin-set)

Carbon policy is NOT the kernel's job. The kernel measures watts and enforces watt budgets. Carbon intensity depends on grid mix, geography, time of day, renewable contracts — all external to the machine. Orchestrators (Kubernetes, Nomad, custom fleet managers) can read energy_total_uj and compute carbon externally. This is the correct separation of concerns: kernel provides accurate power telemetry, userspace applies policy.

Thermal Throttling Coordination:

When the hardware thermal throttle engages, the power budget enforcer backs off to prevent double-throttling. Detection is architecture-specific:

  - x86: MSR IA32_THERM_STATUS (PROCHOT assertion) or ACPI thermal zone events.
  - AArch64/ARMv7: SCMI thermal notifications from SCP, or ACPI thermal zones on SBSA-compliant servers. On DT-based platforms: thermal zone DT nodes with trip points.
  - RISC-V: Platform-specific (BMC/IPMI thermal events, or DT thermal zones).

If hardware is already throttling a domain, the kernel does not apply additional software throttling to that domain — doing so would reduce performance below what the thermal situation requires. The kernel logs the thermal event and adjusts its power model to account for reduced headroom.

Hardware/software throttle coordination: To prevent double-throttling during the detection window, the power enforcer reads the hardware throttle status BEFORE applying its own throttle. On x86, IA32_THERM_STATUS bit 0 (PROCHOT active) and IA32_PACKAGE_THERM_STATUS indicate active hardware throttling. On ARM, SCMI notifications deliver thermal events asynchronously. The coordination protocol:

  1. Before each enforcement tick, read hardware throttle status.
  2. If hardware throttling is active, skip software throttling for this tick (hardware is already reducing power draw).
  3. If hardware throttling was active on the previous tick but is now inactive, re-evaluate software throttle based on current power measurement (the RAPL or SCMI reading now reflects the hardware-throttled power level).
  4. Race window: between hardware engaging thermal throttle and the next enforcement tick (~4 ms worst case), both throttles may be active simultaneously. This is safe — double-throttling reduces performance temporarily but does not cause correctness issues. The next tick detects the hardware throttle and removes the software throttle.

Battery Systems:

Power budgeting for battery-powered systems (laptops, edge devices) is out of scope for v1. Battery charge level, discharge rate, and remaining runtime are platform-management concerns handled by ACPI/UPower in userspace. The power budgeting system provides the watt-level telemetry that battery management software can consume, but does not implement battery-specific policies.

6.4.6 Performance Impact

Per-architecture overhead per scheduler tick (~4ms):

| Architecture | Read mechanism | Cost per domain | 6-domain system | Overhead |
|--------------|----------------|-----------------|-----------------|----------|
| x86 (RAPL) | MSR read | ~100ns | 600ns | 0.015% |
| AArch64 (SCMI) | SCP mailbox | ~1-5 μs | 6-30 μs | 0.15-0.75% |
| ARMv7 (SCMI) | SCP mailbox | ~1-5 μs | 6-30 μs | 0.15-0.75% |
| RISC-V (Estimated) | Calculation | ~50ns | 300ns | 0.008% |
| PPC32 (Estimated) | Calculation | ~50ns | 300ns | 0.008% |
| PPC64LE (OCC) | OPAL sensor read | ~1-5 μs | 6-30 μs | 0.15-0.75% |
| Any (BMC/IPMI) | OOB polling | ~10-50 μs | 60-300 μs | 0.006-0.03% (rate-limited to 1/s) |

SCMI overhead is higher than RAPL but still well within budget. For BMC/IPMI sources, the kernel rate-limits reads to 1 per second (not per tick) to avoid I2C/IPMI bus saturation, using the last-read value for inter-read ticks. The overhead percentage for BMC/IPMI reflects amortization over the 1-second read interval (60-300 μs / 1s), not per-tick cost.

When power throttling is active: performance reduction is intentional and configured. It replaces uncontrolled thermal throttling (which is worse — it's sudden and undifferentiated).

When power throttling is NOT active: zero overhead beyond the power reads.


6.5 Timekeeping and Clock Management

Accurate, low-latency timekeeping is foundational: the scheduler needs monotonic timestamps for CBS deadlines (Section 6.3), real-time tasks need bounded timer latency (Section 7.2), and userspace applications call clock_gettime() millions of times per second. This section describes how UmkaOS reads hardware clocks, maintains system time, exposes fast timestamps to userspace, and manages timer events.

6.5.1 Clock Source Hierarchy

Each architecture provides one or more hardware cycle counters. UmkaOS selects the best available source at boot and can switch at runtime if a source proves unstable (Section 6.5.5).

| Architecture | Primary Source | Secondary | Resolution | Access |
|--------------|----------------|-----------|------------|--------|
| x86-64 | TSC (Time Stamp Counter) | HPET, ACPI PM Timer | sub-ns | rdtsc (user/kernel) |
| AArch64 | Generic Timer (CNTPCT_EL0) | — | typically 1-10 ns | mrs (EL0 if enabled) |
| ARMv7 | Generic Timer (CNTPCT via cp15) | — | typically 1-10 ns | mrc (PL0 if enabled) |
| RISC-V | mtime (MMIO) | rdtime CSR | implementation-defined | rdtime (U-mode) |
| PPC32 | Timebase (TBL/TBU) | Decrementer (DEC) | typically 1-10 ns | mftb / mfspr |
| PPC64LE | Timebase (TB) | Decrementer (DEC) | ~2 ns (POWER9: 512 MHz timebase) | mftb (user/kernel) |

x86-64 notes: Modern processors (Intel Nehalem+, AMD Zen+) provide an invariant TSC that runs at a constant rate regardless of frequency scaling or C-state transitions. CPUID leaf 0x8000_0007 EDX bit 8 advertises this. When invariant TSC is available, it is the preferred source: zero-cost reads (rdtsc is unprivileged), sub-nanosecond resolution, and monotonicity guaranteed across cores. When invariant TSC is absent, UmkaOS falls back to HPET (~100 ns read latency, MMIO) or the ACPI PM Timer (~800 ns read latency, port I/O).

AArch64 / ARMv7 notes: The ARM Generic Timer is architecturally defined and always present. The kernel configures CNTKCTL_EL1 to allow EL0 (userspace) reads of CNTPCT_EL0, enabling a vDSO fast path identical in spirit to x86 rdtsc.

RISC-V notes: The rdtime pseudo-instruction reads the platform-provided real-time counter. Frequency is discoverable from the device tree (timebase-frequency property). Resolution varies by implementation.

All clock sources implement a common abstraction:

// umka-core/src/time/clocksource.rs

/// Hardware clock source abstraction.
/// Implementations are per-architecture; the best source is selected at boot.
pub trait ClockSource: Send + Sync {
    /// Read the current cycle count from hardware.
    fn read_cycles(&self) -> u64;

    /// Nominal frequency of this clock source in Hz.
    fn frequency_hz(&self) -> u64;

    /// Quality rating: higher values are preferred when multiple sources exist.
    /// TSC invariant = 350, HPET = 250, ACPI PM Timer = 100.
    fn rating(&self) -> u32;

    /// Whether this source continues counting through CPU sleep states.
    fn is_continuous(&self) -> bool;

    /// Upper bound on single-read uncertainty in nanoseconds.
    /// Accounts for read latency and synchronization jitter.
    fn uncertainty_ns(&self) -> u32;
}

At boot, umka-core enumerates available sources, sorts by rating(), and activates the highest-rated continuous source. The secondary source (if any) is retained for watchdog cross-validation (Section 6.5.5).
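
The selection logic can be sketched as follows. This is illustrative: `Candidate` stands in for a registered `ClockSource` implementation, and the real kernel operates on `&'static dyn ClockSource` trait objects rather than a plain struct.

```rust
/// Illustrative boot-time selection: sort candidates by rating (descending)
/// and activate the first continuous one.
#[derive(Clone, Copy)]
struct Candidate {
    name: &'static str,
    rating: u32,
    continuous: bool,
}

fn select_best(mut sources: Vec<Candidate>) -> Option<Candidate> {
    // Highest rating first (TSC invariant = 350 beats HPET = 250).
    sources.sort_by_key(|s| std::cmp::Reverse(s.rating));
    // Only sources that keep counting through sleep states qualify.
    sources.into_iter().find(|s| s.continuous)
}
```

On a typical x86 machine this picks the invariant TSC over HPET and the ACPI PM Timer; the runner-up is kept as the watchdog's reference.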

6.5.2 Timekeeping Subsystem

UmkaOS maintains four clocks, matching POSIX semantics:

Clock Semantics Adjustable?
CLOCK_MONOTONIC Time since boot, NTP-adjusted rate No (monotonic)
CLOCK_MONOTONIC_RAW Time since boot, raw hardware rate No
CLOCK_REALTIME Wall clock (UTC), NTP-adjusted Yes (clock_settime, NTP)
CLOCK_BOOTTIME Like CLOCK_MONOTONIC but includes suspend time No

Timestamp representation: All internal timestamps use a (seconds: u64, nanoseconds: u64) tuple. Both fields are 64-bit to avoid overflow in intermediate arithmetic (nanoseconds may temporarily exceed 10^9 during computation and are normalized before storage).
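
A minimal sketch of the normalization step (the helper name is assumed, not part of the specification):

```rust
/// Normalize a (seconds, nanoseconds) pair so that nanoseconds < 10^9.
/// Both fields are u64, so intermediate values above 10^9 are harmless.
fn normalize(mut sec: u64, nsec: u64) -> (u64, u64) {
    sec += nsec / 1_000_000_000;
    (sec, nsec % 1_000_000_000)
}
```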

Global timekeeper state is protected by a seqlock — the same pattern used in Linux timekeeping.c. Readers (including the vDSO) retry if they observe a torn update. Writers (the timer interrupt handler) are serialized by holding the seqlock write side.

// umka-core/src/time/timekeeper.rs

/// Global timekeeping state, updated on every tick or clocksource event.
pub struct Timekeeper {
    pub seq: SeqLock,                       // seqlock protecting all fields
    pub clock: &'static dyn ClockSource,    // active clock source
    pub cycle_last: u64,                    // last cycle count at update
    pub mask: u64,                          // counter wrap bitmask
    pub mult: u32,                          // ns = (cycles * mult) >> shift
    pub shift: u32,
    pub wall_sec: u64,                      // CLOCK_REALTIME
    pub wall_nsec: u64,
    pub mono_sec: u64,                      // CLOCK_MONOTONIC
    pub mono_nsec: u64,
    pub boot_offset_sec: u64,              // CLOCK_BOOTTIME delta
    pub boot_offset_nsec: u64,
    pub freq_adj: i64,                      // NTP/PTP frequency correction (scaled ppm)
    pub phase_adj: i64,                     // NTP/PTP phase correction (ns)
}

NTP/PTP discipline: An adjtimex()-compatible interface accepts frequency and phase corrections from userspace NTP or PTP daemons. Frequency adjustment modifies mult slightly so that cycles-to-nanoseconds conversion drifts at the requested rate. Phase adjustment is applied as a slew (at most 500 ppm rate adjustment) to avoid wall clock jumps. CLOCK_MONOTONIC_RAW is immune to both adjustments — it reflects raw hardware cycles converted at the nominal rate.
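
To make the mult adjustment concrete, here is a hedged sketch of how a nominal mult is derived from the counter frequency and how a ppm correction perturbs it (helper names are assumptions, not the kernel's actual symbols):

```rust
const NSEC_PER_SEC: u64 = 1_000_000_000;

/// Nominal conversion factor for ns = (cycles * mult) >> shift.
fn clocksource_mult(freq_hz: u64, shift: u32) -> u32 {
    ((NSEC_PER_SEC << shift) / freq_hz) as u32
}

/// Apply an NTP frequency correction in ppm. A positive correction makes
/// the conversion yield more nanoseconds per cycle (the clock was slow).
fn adjust_mult(mult: u32, freq_adj_ppm: i64) -> u32 {
    (mult as i64 + mult as i64 * freq_adj_ppm / 1_000_000) as u32
}
```

For a 1 GHz counter with shift = 24, the nominal mult is 2^24 = 16777216; a +100 ppm correction raises it by 1677 units. CLOCK_MONOTONIC_RAW always uses the unadjusted value.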

6.5.3 vDSO Fast Path

Linux problem: clock_gettime() is the most frequently invoked syscall in many workloads (databases, trading systems, telemetry). A kernel entry costs ~100-200 ns due to mode switch, KPTI page table reload, and speculative execution mitigations. At millions of calls per second, this adds up.

UmkaOS design: Like Linux, UmkaOS maps a vDSO (virtual Dynamic Shared Object) into every process's address space. The vDSO contains userspace implementations of clock_gettime(), gettimeofday(), and time() that read the hardware clocksource directly and apply precomputed conversion parameters — no syscall needed.

The kernel maintains a read-only shared page (the vDSO data page) that it updates on every timer tick and on NTP adjustments. Userspace vDSO code reads this page under seqlock protection.

// umka-core/src/time/vdso.rs

/// Shared page mapped read-only into every process.
/// Updated by the kernel under seqlock protection.
#[repr(C)]
pub struct VdsoData {
    // Maps to the `SeqLock` counter in `Timekeeper`; the kernel writes this
    // field as part of the seqlock protocol (odd = update in progress,
    // even = consistent). Userspace reads `seq` before and after reading
    // fields to detect torn reads and retry.
    pub seq: u32,
    pub clock_mode: u32,           // which clocksource (TSC, HPET, Generic Timer, ...)
    pub cycle_last: u64,           // cycle count at last kernel update
    pub mask: u64,                 // clocksource bitmask
    pub mult: u32,                 // ns = ((cycles - cycle_last) & mask) * mult >> shift
    pub shift: u32,
    pub wall_time_sec: u64,        // CLOCK_REALTIME base
    pub wall_time_nsec: u64,
    pub monotonic_time_sec: u64,   // CLOCK_MONOTONIC base
    pub monotonic_time_nsec: u64,
    pub boottime_sec: u64,         // CLOCK_BOOTTIME base
    pub boottime_nsec: u64,
}

vDSO read path (userspace, per-architecture):

  1. Read seq. If odd, spin (kernel is mid-update).
  2. Read cycle_last, mult, shift, mask, and the relevant base time.
  3. Read the hardware counter (rdtsc / mrs CNTPCT_EL0 / rdtime).
  4. Compute delta = (now - cycle_last) & mask.
  5. Compute ns = base_nsec + ((delta * mult) >> shift). Normalize into seconds.
  6. Re-read seq. If it changed, go to step 1.

Cost: ~5-20 ns depending on architecture (dominated by the clocksource read instruction itself). This is 10-40x faster than a syscall path.

Fallback: If clock_mode indicates no userspace-readable source is available (e.g., HPET on x86, which requires MMIO the kernel has not mapped into user address space), the vDSO falls back to a real syscall instruction.
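
The read path above can be sketched in Rust. This is a simplification: `VdsoSnapshot` mirrors the relevant `VdsoData` fields, the counter read is a closure standing in for rdtsc / mrs CNTPCT_EL0 / rdtime, and the real code uses volatile reads with acquire/release ordering.

```rust
struct VdsoSnapshot {
    seq: u32,
    cycle_last: u64,
    mask: u64,
    mult: u32,
    shift: u32,
    mono_sec: u64,
    mono_nsec: u64,
}

fn read_monotonic(data: &VdsoSnapshot, read_counter: impl Fn() -> u64) -> (u64, u64) {
    loop {
        let seq = data.seq;                                          // step 1
        if seq & 1 != 0 {
            std::hint::spin_loop();                                  // kernel mid-update
            continue;
        }
        let delta = (read_counter() - data.cycle_last) & data.mask;  // steps 2-4
        let ns = (delta as u128 * data.mult as u128) >> data.shift;  // step 5
        let mut sec = data.mono_sec;
        let mut nsec = data.mono_nsec + ns as u64;
        while nsec >= 1_000_000_000 {                                // normalize
            nsec -= 1_000_000_000;
            sec += 1;
        }
        if data.seq == seq {                                         // step 6
            return (sec, nsec);
        }
        // seq changed: torn read, retry.
    }
}
```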

6.5.4 Timer Infrastructure

UmkaOS provides two timer mechanisms, matching the Linux split between coarse and high-resolution timers.

Timer wheel (coarse-grained, jiffies resolution):

Used for network retransmission timeouts, poll/epoll timeouts, and other events where millisecond precision is sufficient. Implemented as a hierarchical timer wheel with O(1) insertion and O(1) per-tick processing (cascading is amortized). The wheel uses 8 levels with 256 slots each, covering timeouts from 1 tick to ~50 days at HZ=250.

High-resolution timers (hrtimers, nanosecond precision):

Used for timer_create() (POSIX per-process timers), nanosleep() / clock_nanosleep(), timerfd_create(), and scheduler deadline enforcement. Implemented as a per-CPU red-black tree keyed by absolute expiry time. The nearest expiry programs the hardware timer (local APIC on x86, Generic Timer on ARM, mtimecmp on RISC-V) to fire at the exact time.

// umka-core/src/time/hrtimer.rs

/// A high-resolution timer.
pub struct HrTimer {
    /// Absolute expiry time (CLOCK_MONOTONIC nanoseconds).
    pub expires_ns: u64,

    /// Callback invoked on expiry. Runs in hard-IRQ context.
    pub callback: fn(&mut HrTimer),

    /// Opaque context value passed to the callback. Typically a pointer to
    /// the enclosing structure (cast via `as usize`), allowing the callback
    /// to recover its context via `unsafe { &*(context as *const T) }`.
    /// This is the Rust equivalent of Linux's `container_of` pattern for
    /// timer callbacks.
    ///
    /// # Safety
    ///
    /// Using `context` as a pointer requires the following invariants:
    ///
    /// 1. **Pinning**: The `HrTimer` must be embedded in a `Pin`-ned
    ///    allocation. The enclosing structure must not move while the timer
    ///    is armed, since `context` stores a raw pointer to it.
    /// 2. **Drop ordering**: The enclosing structure's `Drop` implementation
    ///    must cancel the timer (`hrtimer_cancel()`) before deallocation,
    ///    ensuring the callback never fires with a dangling `context`.
    /// 3. **Type agreement**: The callback is responsible for casting
    ///    `context` back to the correct type via
    ///    `unsafe { &*(self.context as *const T) }`. The caller that sets
    ///    `context` and the callback must agree on the type `T`.
    /// 4. These invariants match Linux's `container_of` + `hrtimer` pattern,
    ///    adapted for Rust's ownership model. The timer subsystem enforces
    ///    invariant (1) by requiring `Pin<&mut HrTimer>` for
    ///    `hrtimer_start()`.
    pub context: usize,

    /// Timer state.
    pub state: HrTimerState,

    /// Owning CPU (timers are per-CPU to avoid cross-CPU synchronization).
    pub cpu: u32,
}

Per-CPU timer queues: Each CPU maintains its own timer wheel and hrtimer tree. Timer insertion targets the local CPU by default. Expiry processing happens in the local timer interrupt — no cross-CPU IPI is needed. This eliminates contention and provides deterministic latency on isolated CPUs (Section 7.2.5).

Timer coalescing: When a timer is inserted with a slack tolerance (e.g., a 100 ms timeout with 10 ms acceptable slack), the kernel may delay it to coalesce with nearby timers. This reduces wakeups on idle CPUs, improving power efficiency. Coalescing is disabled for hrtimers with zero slack (RT workloads). The timer_slack_ns per-process tunable controls default slack, identical to the Linux interface.
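
The coalescing decision can be illustrated with a sketch; the selection policy here is a deliberate simplification of the real heuristics (which also weigh CPU idle state, see Section 6.5.6):

```rust
/// If some already-armed timer fires within [expiry, expiry + slack],
/// piggyback on it; otherwise keep the requested expiry.
/// slack = 0 (RT workloads) disables coalescing entirely.
fn coalesced_expiry(expiry_ns: u64, slack_ns: u64, pending: &[u64]) -> u64 {
    pending
        .iter()
        .copied()
        .filter(|&t| t >= expiry_ns && t <= expiry_ns + slack_ns)
        .min()
        .unwrap_or(expiry_ns)
}
```

A 100 ms timeout with 10 ms slack joins an existing 105 ms expiry instead of programming a separate wakeup.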

6.5.5 Clocksource Watchdog

A clocksource that reports incorrect time is worse than a slow one — it causes silent data corruption in timestamps, incorrect scheduler decisions, and broken network protocols.

Cross-validation: Every 500 ms (configurable), the kernel reads both the primary and secondary clocksource and compares the elapsed interval. If the primary's elapsed time deviates from the secondary's by more than a threshold (default: 100 ppm sustained over 5 consecutive checks), the primary is marked unstable.
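
The deviation check reduces to a ppm comparison over the watchdog interval. A hedged sketch (function names assumed):

```rust
/// Drift of the primary clocksource relative to the secondary over one
/// watchdog interval, in parts per million. The caller guarantees the
/// secondary interval is non-zero.
fn drift_ppm(primary_elapsed_ns: i64, secondary_elapsed_ns: i64) -> i64 {
    (primary_elapsed_ns - secondary_elapsed_ns) * 1_000_000 / secondary_elapsed_ns
}

/// One watchdog check: true if this interval counts toward the
/// five-consecutive-checks instability threshold (default 100 ppm).
fn exceeds_threshold(primary_ns: i64, secondary_ns: i64) -> bool {
    drift_ppm(primary_ns, secondary_ns).abs() > 100
}
```

Over a 500 ms interval, 100 ppm corresponds to 50 μs of divergence, comfortably above the read jitter of HPET or the ACPI PM Timer.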

TSC instability detection: On x86, the TSC can be unreliable in several scenarios:

  • Non-invariant TSC (pre-Nehalem Intel, pre-Zen AMD): frequency changes with P-state transitions.
  • TSC halts during deep C-states on some older processors.
  • TSC desynchronization across sockets on early multi-socket systems.

The watchdog detects all three cases. When the TSC is marked unstable:

  1. The kernel logs a warning: clocksource: TSC marked unstable (drift >100ppm vs HPET).
  2. The active clocksource switches to HPET (or ACPI PM Timer if HPET is absent).
  3. The vDSO clock_mode is updated so userspace falls back to the syscall path (HPET is not readable from userspace without kernel MMIO mapping).
  4. The switch is atomic from the perspective of seqlock readers — one consistent snapshot uses TSC parameters, the next uses HPET parameters.

Capability-gated calibration: TSC frequency calibration (reading MSRs like MSR_PLATFORM_INFO or calibrating against PIT/HPET) requires privileged operations. Only umka-core holds the capability to read/write MSRs. Tier 1 drivers cannot influence clocksource selection — a compromised driver cannot subvert system timekeeping.

6.5.6 Interaction with RT and Power Management

RT timer latency: Real-time tasks (Section 7.2) depend on bounded timer expiry. On CPUs designated for RT workloads (isolcpus, nohz_full), hrtimer expiry is serviced directly in hard-IRQ context with a preemption-disabled path of bounded length. The worst-case path from hardware interrupt to hrtimer callback execution is: interrupt entry (~200 cycles) + hrtimer tree lookup (O(1) for the nearest timer) + callback invocation. On x86 with a local APIC timer and an isolated CPU (no frequency scaling, shallow C-states, nohz_full), the software path completes in under 1 μs. However, hardware-level non-determinism (DRAM refresh cycles ~350ns worst-case, cache miss penalties, memory controller contention) means the end-to-end observed latency on real hardware is typically 1-5 μs under favorable conditions and up to 10 μs under worst-case memory pressure. These figures match measured PREEMPT_RT Linux performance on isolated cores. Section 7.2.9 details the hardware resource partitioning (CAT, MBA, RDT) that UmkaOS uses to minimize hardware-level jitter.

When PreemptionModel::Realtime is active (Section 7.2.2), softirq-context timers are promoted to hard-IRQ context for RT-priority hrtimers, ensuring they cannot be delayed by threaded interrupt processing.

C-state interaction with clocksources: CPU power states affect timer behavior:

C-state Invariant TSC Non-Invariant TSC Generic Timer (ARM) mtime (RISC-V) Timebase (PPC)
C1 (halt) Continues May stop Continues Continues Continues
C3+ (deep sleep) Continues Stops Continues Continues Continues

When a non-invariant TSC is detected and the system supports deep C-states, the kernel forces HPET as the clocksource and disables the vDSO fast path for timestamp reads. This is a correctness requirement, not a performance choice.

Tickless (nohz) mode: When a CPU has no pending timers and is running a single task (or is idle), the periodic tick is stopped entirely. The kernel reprograms the hardware timer to fire at the next actual event (nearest hrtimer expiry, or infinity if none). This eliminates unnecessary wakeups on isolated RT CPUs and idle CPUs.

Resuming the tick happens when: (a) a new timer is inserted on the CPU, (b) a second task becomes runnable (the scheduler needs periodic load balancing), or (c) an interrupt wakes the CPU from idle. The nohz implementation reuses Linux's nohz_full semantics: user code on an isolated CPU can run for arbitrarily long periods without a single kernel interrupt.

Power-aware timer placement: When timer coalescing (Section 6.5.4) groups timers, the kernel prefers placing them on CPUs that are already awake. Waking a CPU from C3+ costs ~100 μs and defeats the purpose of coalescing. The timer subsystem queries the per-CPU idle state before choosing a coalescing target.


6.6 System Event Bus

The event bus is a core kernel facility that enables kernel subsystems and drivers to notify userspace of hardware and system state changes via a capability-gated, lock-free ring buffer mechanism. Netlink compatibility (for udev/systemd) is implemented in umka-compat (Section 18.3).

6.6.1 Event Subscription Model

// umka-core/src/event/mod.rs

/// System event types.
#[derive(Clone, Copy, PartialEq, Eq, Hash)] // map key in EventManager::subscriptions
#[repr(u32)]
pub enum EventType {
    /// Battery level changed.
    BatteryLevelChanged = 1,
    /// AC adapter state changed (plugged/unplugged).
    AcStateChanged = 2,
    /// WiFi connection state changed.
    WifiStateChanged = 3,
    /// Bluetooth device paired/unpaired.
    BluetoothDeviceChanged = 4,
    /// USB device inserted/removed.
    UsbDeviceChanged = 5,
    /// Display hotplug (connected/disconnected).
    DisplayHotplug = 6,
    /// Thermal event (warning, critical).
    ThermalEvent = 7,
    /// Power profile changed.
    PowerProfileChanged = 8,
}

/// Event payload (exactly 256 bytes, cache-line friendly).
///
/// Layout verification with `#[repr(C)]`:
///   Offset 0:  event_type (EventType = u32)  = 4 bytes
///   Offset 4:  _pad0 ([u8; 4])               = 4 bytes (explicit alignment padding)
///   Offset 8:  timestamp_ns (u64)            = 8 bytes
///   Offset 16: data (EventData, 240 bytes)   = 240 bytes
///   Total: 4 + 4 + 8 + 240 = 256 bytes.
///
/// **Compile-time assertion**: `const_assert!(size_of::<Event>() == 256);`
#[repr(C)]
pub struct Event {
    /// Event type.
    pub event_type: EventType,
    /// Explicit padding for u64 alignment of timestamp_ns.
    pub _pad0: [u8; 4],
    /// Timestamp (monotonic ns).
    pub timestamp_ns: u64,
    /// Event-specific data.
    pub data: EventData,
}

/// Event-specific data (union of all possible payloads).
#[repr(C)]
pub union EventData {
    pub battery: BatteryEvent,
    pub ac: AcEvent,
    pub wifi: WifiEvent,
    pub bluetooth: BluetoothEvent,
    pub usb: UsbEvent,
    pub display: DisplayEvent,
    pub thermal: ThermalEvent,
    pub power_profile: PowerProfileEvent,
    _pad: [u8; 240], // 256 - 4 (event_type) - 4 (_pad0) - 8 (timestamp_ns) = 240
}

/// Battery event data.
#[repr(C)]
pub struct BatteryEvent {
    /// Battery percentage (0-100).
    pub percent: u8,
    /// Charging state (0=discharging, 1=charging, 2=full).
    pub charging: u8,
    /// Time remaining in minutes (0xFFFF = unknown).
    pub time_remaining_min: u16,
}

/// Display hotplug event data.
#[repr(C)]
pub struct DisplayEvent {
    /// Connector ID.
    pub connector_id: u32,
    /// Connection state (0=disconnected, 1=connected).
    /// Declared `u8` rather than `bool` so the `#[repr(C)]` wire layout is fully defined.
    pub connected: u8,
}
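
As a usage sketch, a battery driver might assemble an event like this. Types are abbreviated copies of the definitions above; the `Clone`/`Copy` derives are required because Rust union fields must be `Copy`, and in the real kernel the timestamp would come from the timekeeper (CLOCK_MONOTONIC) rather than a parameter.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(u32)]
enum EventType { BatteryLevelChanged = 1 }

#[derive(Clone, Copy)]
#[repr(C)]
struct BatteryEvent { percent: u8, charging: u8, time_remaining_min: u16 }

#[derive(Clone, Copy)]
#[repr(C)]
union EventData { battery: BatteryEvent, _pad: [u8; 240] }

#[derive(Clone, Copy)]
#[repr(C)]
struct Event {
    event_type: EventType,
    _pad0: [u8; 4],
    timestamp_ns: u64,
    data: EventData,
}

/// Build a BatteryLevelChanged event (0xFFFF = time remaining unknown).
fn battery_event(percent: u8, charging: u8, timestamp_ns: u64) -> Event {
    Event {
        event_type: EventType::BatteryLevelChanged,
        _pad0: [0; 4],
        timestamp_ns,
        data: EventData {
            battery: BatteryEvent { percent, charging, time_remaining_min: 0xFFFF },
        },
    }
}
```

Reading a payload back requires `unsafe` because the live union variant is selected by the `event_type` tag: `unsafe { ev.data.battery.percent }`. The 256-byte layout claim holds: 4 + 4 + 8 + 240 = 256.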

6.6.2 Subscription via Capability

Processes subscribe to events via the capability system (Section 8.1):

impl EventManager {
    /// Subscribe to a class of events. Returns an EventSubscription capability.
    ///
    /// # Security
    /// Requires `CAP_SYS_ADMIN` for system-wide events (thermal, power profile).
    /// Requires `CAP_NET_ADMIN` for network events (WiFi, Bluetooth).
    /// Battery, AC, USB, display events are unrestricted (visible to all processes).
    pub fn subscribe(&self, event_type: EventType, process: &Process) -> Result<CapabilityToken, EventError> {
        // Check capability grants.
        match event_type {
            EventType::ThermalEvent | EventType::PowerProfileChanged => {
                if !process.has_capability(Capability::SysAdmin) {
                    return Err(EventError::PermissionDenied);
                }
            }
            EventType::WifiStateChanged | EventType::BluetoothDeviceChanged => {
                if !process.has_capability(Capability::NetAdmin) {
                    return Err(EventError::PermissionDenied);
                }
            }
            _ => {} // Unrestricted
        }

        // Allocate event ring buffer (per-process, 4 KB = 16 events of 256 bytes each).
        let ring = EventRing::allocate(process)?;

        // Mint capability token.
        let cap_token = self.capability_manager.mint(
            CapabilityType::EventSubscription,
            CapabilityRights::READ,
            ring.ring_id(),
        )?;

        // Register subscription (multiple subscribers may share an event type).
        self.subscriptions.entry(event_type).or_default().push(SubscriptionInfo {
            process_id: process.pid(),
            ring: ring.downgrade(), // weak handle: dies with the subscribing process
            dropped_events: AtomicU64::new(0),
        });

        Ok(cap_token)
    }

    /// Post an event to all subscribers of its type.
    pub fn post_event(&self, event: Event) {
        let Some(subs) = self.subscriptions.get(&event.event_type) else {
            return; // no subscribers for this event type
        };
        for sub in subs {
            if let Some(ring) = sub.ring.upgrade() {
                // Write event to subscriber's ring buffer (lock-free push).
                if ring.push(event).is_err() {
                    // Ring full: drop event (subscriber is too slow).
                    sub.dropped_events.fetch_add(1, Ordering::Relaxed);
                }
            }
        }
    }
}

For Netlink compatibility (udev, systemd integration), see Section 18.3.

6.6.3 Integration Points

Subsystem Events posted
Battery driver (Section 6.2.8) BatteryLevelChanged, AcStateChanged
WiFi driver (Section 12.1.1, 10-drivers.md) WifiStateChanged
Bluetooth (Section 12.2, 10-drivers.md) BluetoothDeviceChanged
USB bus (Section 10.9) UsbDeviceChanged
Display driver (Section 20.4) DisplayHotplug
Thermal framework (Section 6.2.3) ThermalEvent
Power profiles (Section 6.2.9) PowerProfileChanged

6.7 Intent-Based Resource Management

6.7.1 The Abstraction Gap

UmkaOS has all the mechanisms for smart resource management:

  • In-kernel inference engine (Section 21.4) for learned decisions
  • Per-device utilization tracking (Section 21.1)
  • Topology awareness (device registry, Section 10.5)
  • Power metering (Section 6.4)
  • Memory tier tracking (PageLocationTracker, Section 21.2)
  • Network fabric topology (Section 5.2)

What's missing is the abstraction that ties these together. Currently, resources are managed imperatively: "give me 4 cores and 16GB RAM." The alternative: declare goals, let the kernel optimize.

6.7.2 Design: Resource Intents

// umka-core/src/intent/mod.rs

/// A resource intent declares WHAT the workload needs,
/// not HOW to allocate resources.
#[repr(C)]
pub struct ResourceIntent {
    /// Target P99 SCHEDULING latency (nanoseconds).
    /// This is the time from task becoming runnable to task getting CPU.
    /// The kernel cannot measure application-level latency (it doesn't know
    /// what an "operation" is). This metric is scheduling + I/O completion
    /// latency — both kernel-observable.
    /// Kernel adjusts CPU priority, memory placement, I/O scheduling.
    /// 0 = no latency target (best-effort).
    pub target_latency_ns: u64,

    /// Target throughput (operations per second).
    /// Kernel adjusts CPU allocation, I/O queue depth, batch sizes.
    /// 0 = no throughput target (best-effort).
    pub target_ops_per_sec: u64,

    /// Availability requirement (basis points: 9999 = 99.99%).
    /// Kernel adjusts redundancy, crash recovery priority.
    /// 0 = no availability target.
    /// Used by: cgroup knob `intent.availability` (Section 6.7.3),
    /// crash recovery priority in Section 19.1 (higher availability_bp
    /// = faster restart, more aggressive health monitoring).
    pub availability_bp: u32,

    /// Power efficiency preference (0 = max performance, 100 = max efficiency).
    /// Kernel adjusts DVFS, core parking, accelerator clock.
    /// 50 = balanced (default).
    pub efficiency_preference: u32,

    /// Data locality hint: where does this workload's data live?
    /// Kernel uses this for NUMA placement and distributed scheduling.
    /// Used by: cgroup knob `intent.data_affinity` (Section 6.7.3),
    /// NUMA placement optimizer (Section 6.7.5 step 2b), and distributed
    /// scheduling in Section 5.1.
    pub data_affinity: DataAffinityHint,

    /// Struct layout version. Enables future extension without breaking binary
    /// compatibility: the kernel checks this field and interprets fields beyond
    /// the base layout only if version >= the version that introduced them.
    /// v1 = initial layout (this definition). Future versions extend into _reserved.
    pub version: u32,

    /// Reserved for future fields. New versions of ResourceIntent consume bytes
    /// from this region. Zero-initialized by callers; the kernel ignores
    /// non-zero bytes in positions it does not recognize for the given version.
    /// Sized to make the struct exactly 64 bytes (u64-aligned) with no implicit
    /// tail padding: 8+8+4+4+4+4+32 = 64.
    pub _reserved: [u8; 32],
}

#[repr(u32)]
pub enum DataAffinityHint {
    /// No preference. Kernel decides based on observation.
    Auto            = 0,
    /// Data is primarily local (disk-bound workload).
    Local           = 1,
    /// Data is distributed across nodes (distributed workload).
    Distributed     = 2,
    /// Data is on accelerators (GPU-bound workload).
    Accelerator     = 3,
}
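
A sketch of constructing a v1 intent for a latency-sensitive service, with the 64-byte layout claim checked at compile time (types are abbreviated copies of the definitions above):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(u32)]
enum DataAffinityHint { Auto = 0, Local = 1, Distributed = 2, Accelerator = 3 }

#[repr(C)]
struct ResourceIntent {
    target_latency_ns: u64,
    target_ops_per_sec: u64,
    availability_bp: u32,
    efficiency_preference: u32,
    data_affinity: DataAffinityHint,
    version: u32,
    _reserved: [u8; 32],
}

// The document's layout claim: 8+8+4+4+4+4+32 = 64, no tail padding.
const _: () = assert!(std::mem::size_of::<ResourceIntent>() == 64);

/// A latency-focused v1 intent: 5 ms P99 scheduling latency, 99.99%
/// availability, balanced power, local data.
fn latency_intent() -> ResourceIntent {
    ResourceIntent {
        target_latency_ns: 5_000_000,
        target_ops_per_sec: 0,     // best-effort throughput
        availability_bp: 9_999,    // 99.99%
        efficiency_preference: 50, // balanced (default)
        data_affinity: DataAffinityHint::Local,
        version: 1,
        _reserved: [0u8; 32],      // zeroed for forward compatibility
    }
}
```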

6.7.3 Cgroup Integration

/sys/fs/cgroup/<group>/intent.latency_ns
# Target P99 latency in nanoseconds.
# "0" = no target (default, pure imperative mode).
# "5000000" = target 5ms P99 latency.

/sys/fs/cgroup/<group>/intent.throughput
# Target operations per second.
# "0" = no target.

/sys/fs/cgroup/<group>/intent.efficiency
# 0 = max performance, 100 = max efficiency, 50 = balanced.
# Default: 50.

/sys/fs/cgroup/<group>/intent.availability
# Availability target in basis points (0 = no target, 9999 = 99.99%).
# Kernel adjusts crash recovery priority and health monitoring
# frequency for drivers serving this cgroup. Higher values trigger
# faster driver restart (Section 19.1) and redundant I/O path selection.
# Default: 0 (no availability target).
# Maps to: ResourceIntent.availability_bp

/sys/fs/cgroup/<group>/intent.data_affinity
# Data locality hint for NUMA placement and distributed scheduling.
# Values: "auto" (default), "local", "distributed", "accelerator"
# "auto" = kernel observes memory access patterns and decides.
# "local" = data is primarily on local storage (optimize for disk I/O).
# "distributed" = data spans cluster nodes (optimize for network).
# "accelerator" = data lives on accelerator memory (minimize transfers).
# Maps to: ResourceIntent.data_affinity (DataAffinityHint enum)

/sys/fs/cgroup/<group>/intent.status
# Read-only. Current intent satisfaction:
#   latency_met: true|false
#   latency_p99_actual_ns: <value>
#   throughput_met: true|false
#   throughput_actual: <value>
#   power_actual_mw: <value>
#   optimizer_action: <last action taken, e.g., "raised cpu.weight to 200">
#   adjustments_last_hour: <count>
#   contradiction: <none|description>
Multi-tenant access control: In K8s multi-tenant clusters, intent.status exposes internal workload metrics (actual latency, throughput, power draw) for each cgroup. A process in one container reading /sys/fs/cgroup/other-tenant/intent.status would expose the other tenant's workload profile, which is an information disclosure risk. Access control: intent.status is readable only by processes with CAP_SYS_ADMIN in the cgroup's user namespace, or by the cgroup owner (matching the cgroup's uid). This matches the access model for /proc/PID/status — visible to owner and root only. Non-owner reads return EACCES. The same access policy applies to intent.explain and intent.adjustment_history, which also contain tenant-specific operational data.

6.7.4 Objective Function and Conflict Resolution

Clarification: SCHED_INTENT is NOT a scheduling class.

SCHED_INTENT is an annotation layered on top of existing scheduling classes (EEVDF, RT, Deadline). A task's SchedClass is unchanged by intent assignment — an EEVDF task with intent::LATENCY_SENSITIVE remains EEVDF class.

The annotation modifies the task's effective eligible_vtime calculation: a latency-sensitive EEVDF task receives a forward-eligible offset (effectively a negative lag boost) that prioritizes it WITHIN EEVDF without promoting it to RT class.

The "priority level 4" in the multi-class selection table refers to the selection order among scheduler classes: a LATENCY_SENSITIVE EEVDF task is considered before standard EEVDF tasks but after all RT tasks. This is implemented by adjusting eligible_vtime on wakeup — there is no separate SCHED_INTENT class in the runqueue.
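
A hedged sketch of the wakeup-time adjustment (`LATENCY_BOOST_NS` is an assumed tunable, not a value from this specification):

```rust
/// Assumed tunable: virtual-time credit granted to LATENCY_SENSITIVE tasks
/// on wakeup. Pulling eligible_vtime backward makes the task eligible
/// earlier within EEVDF without changing its scheduling class.
const LATENCY_BOOST_NS: u64 = 1_000_000;

fn effective_eligible_vtime(eligible_vtime: u64, latency_sensitive: bool) -> u64 {
    if latency_sensitive {
        eligible_vtime.saturating_sub(LATENCY_BOOST_NS)
    } else {
        eligible_vtime
    }
}
```

Because only the eligibility threshold moves, within-class ordering among eligible tasks is still by virtual deadline, so an annotated task cannot starve ordinary EEVDF tasks.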

Intent Scheduling: Objective Function and Conflict Resolution

Objective function: minimize total latency for latency-sensitive tasks subject to meeting all SCHED_DEADLINE deadlines.

  minimize: Σ latency(task_i) for all SCHED_INTENT/SCHED_NORMAL tasks
  subject to: deadline_j met for all SCHED_DEADLINE tasks j

Scheduling class priority (highest to lowest):

Priority Class Condition
1 (highest) SCHED_DEADLINE CBS task with remaining budget and active deadline
2 SCHED_RT FIFO Real-time, FIFO policy, priority 1-99
3 SCHED_RT RR Real-time, round-robin, priority 1-99
4 SCHED_INTENT (latency) Task with intent::LATENCY_SENSITIVE annotation and recent wakeup
5 SCHED_NORMAL (EEVDF) Standard tasks (Section 6.1)
6 SCHED_BATCH CPU-bound batch jobs (EEVDF with longer time slices)
7 (lowest) SCHED_IDLE Run only when nothing else is runnable

Intent conflict resolution rules:

  1. Two SCHED_INTENT tasks competing for the same CPU: Schedule by EEVDF virtual time (same as SCHED_NORMAL). Annotations affect eligibility but not within-class ordering.

  2. LATENCY_SENSITIVE vs THROUGHPUT on same CPU: LATENCY_SENSITIVE runs first in the current scheduling quantum. THROUGHPUT tasks use the remaining time in the quantum.

  3. POWER_EFFICIENT hint: Migrate to an efficiency core (EAS decision, Section 6.1.5) if: (a) latency budget allows the migration cost (~10-50μs), AND (b) the efficiency core is not already at capacity. POWER_EFFICIENT is never honored if it would cause a LATENCY_SENSITIVE task to miss its target latency.

  4. EXCLUSIVE_CPU hint (two tasks competing): Round-robin at equal EEVDF priority. The hint is advisory — UmkaOS does not dedicate a CPU to a single SCHED_INTENT task unless it has an explicit CPU affinity set via sched_setaffinity().

  5. SCHED_INTENT degradation: If the requested intent is unachievable (e.g., all CPUs committed to RT tasks), the task degrades to SCHED_NORMAL for that scheduling quantum. Degradation is logged to the observability layer (Section 16) and exposed via /proc/PID/sched_intent_stats.

6.7.5 The Optimization Loop

Every ~1 second (configurable):

1. IntentOptimizer collects metrics:
   - Per-cgroup: actual latency (P50, P99), throughput, power
   - Per-device: utilization, temperature, power
   - Cluster-wide: node loads, memory pressure, network utilization

2. For each cgroup with intents:
   a. Is the intent being met?
      - latency_p99_actual <= target_latency_ns?
      - throughput_actual >= target_ops_per_sec?
   b. If not met → need more resources:
      - Increase CPU allocation (raise cpu.weight or cpu.guarantee)
      - Improve NUMA placement (migrate pages closer to running CPUs)
      - Increase accelerator allocation (raise accel.compute.guarantee)
      - Increase I/O priority
   c. If met with headroom → can release resources:
      - Reduce CPU allocation (lower cpu.weight)
      - Lower frequency (save power)
      - Free accelerator time for other workloads

3. Apply adjustments via existing cgroup knobs.
   Intent layer is an OPTIMIZER that writes to existing imperative knobs.
   It does NOT replace the imperative interface — it sits above it.

4. **Optimization algorithm** (gradient-descent-inspired, bounded):
   - For each unmet intent, compute the **deficit**: e.g.,
     `latency_deficit = latency_p99_actual - target_latency_ns`.
   - Map deficit to resource adjustment via a **PD controller**:
     `delta_weight = K_p × (deficit / target) + K_d × d(deficit / target) / dt`,
     where `K_p` is a per-resource-type proportional gain constant (default:
     K_p=0.5 for CPU weight, K_p=0.3 for frequency) and `K_d` is the
     derivative gain (default: K_d=1.0 for CPU weight, K_d=0.6 for frequency).
     See Section 6.7.5.1 for the complete control law and gain derivation.
   - **Clamp** adjustments to avoid oscillation: each iteration adjusts
     by at most ±20% of the current allocation (`MAX_INTENT_ADJUSTMENT = 0.20`).
     Multiple iterations converge geometrically.
   - **Convergence criterion**: intent is "met" when the metric is within
     10% of the target for 3 consecutive measurement windows. Once met,
     the optimizer enters a **hold** state for that cgroup (no further
     adjustments until the metric drifts outside the 10% band).
   - **`compute.weight` decomposition**: When the intent specifies
     `accel.compute.weight`, the optimizer distributes across CPU and
     accelerator proportionally to their current utilization ratio:
     `cpu_share = cpu_util / (cpu_util + accel_util)`.
   - **Safety bound**: The optimizer never reduces allocation below the
     cgroup's `*.min` guarantee or above the `*.max` ceiling.

   Stability controls (prevent oscillation/hunting):
     - Hysteresis: don't adjust unless delta exceeds 10% of current value.
     - Minimum hold time: no changes within 5 seconds of last adjustment.
     - Damping: exponential backoff if last 3 adjustments didn't converge.
     - Max adjustment rate: at most ±20% change per optimization cycle.

4. Conflicting intents: when multiple cgroups declare intents that cannot
   all be satisfied simultaneously (insufficient resources):
     - Intents are BEST-EFFORT, not guarantees.
     - Priority follows existing cpu.weight / accel.compute.weight hierarchy.
     - Higher-weight cgroups get intent satisfaction first.
     - Unsatisfied intents are reported in intent.status (latency_met: false).
     - The optimizer does NOT starve low-priority cgroups — it respects
       existing cgroup min guarantees (cpu.min, memory.min).

5. Policy priority ordering (prevents conflicts between subsystems):
     Priority (highest to lowest):
       a. Hardware limits (thermal throttle, voltage limits) — immutable
       b. Admin-configured cgroup limits (cpu.max, power.max) — hard ceiling
       c. Power budget enforcement (Section 6.4) — watt cap
       d. Intent optimization (this section) — soft optimization
       e. EAS energy optimization (Section 6.1.5) — per-task core selection
     Each layer can only adjust WITHIN the ceiling set by the layer above.
     Power budget is a hard constraint; intents work within it. No oscillation.
     When power budgeting and EAS conflict (e.g., power budgeting throttles a
     CPU domain that EAS prefers), power budgeting takes precedence — the EAS
     migration is deferred until the power budget is satisfied. This may
     temporarily route tasks to less energy-efficient cores, but prevents
     thermal throttling and power supply overload, which are correctness
     constraints rather than optimization goals.

6. Log adjustments to /sys/kernel/umka/intent/adjustment_log
   for observability and debugging.
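
The stability controls above (hysteresis, minimum hold time, damping, max adjustment rate) can be sketched as a single gate applied before any adjustment is committed. This is an illustrative userspace sketch; `StabilityState`, `gate_adjustment`, and the constant names other than `MAX_INTENT_ADJUSTMENT` are hypothetical, not the in-kernel API:

```rust
/// Illustrative per-cgroup stability-control state (hypothetical names).
pub struct StabilityState {
    pub last_adjustment_ns: u64,
    pub backoff_count: u32, // consecutive non-convergent adjustments
}

pub const HYSTERESIS_BAND: f64 = 0.10;       // 10% deadband
pub const MIN_HOLD_NS: u64 = 5_000_000_000;  // 5 s minimum hold time
pub const MAX_INTENT_ADJUSTMENT: f64 = 0.20; // ±20% per cycle

/// Returns the adjustment actually applied this cycle, or `None` if the
/// stability controls suppress it.
pub fn gate_adjustment(
    state: &mut StabilityState,
    now_ns: u64,
    normalized_error: f64, // (target - actual) / target
    raw_adjustment: f64,   // fraction of current allocation
) -> Option<f64> {
    // Hysteresis: ignore errors inside the 10% band.
    if normalized_error.abs() < HYSTERESIS_BAND {
        return None;
    }
    // Minimum hold time: no changes within 5 s of the last adjustment.
    if now_ns.saturating_sub(state.last_adjustment_ns) < MIN_HOLD_NS {
        return None;
    }
    // Damping: halve the effective gain after 3 non-convergent adjustments.
    let gain = if state.backoff_count >= 3 { 0.5 } else { 1.0 };
    // Max adjustment rate: clamp to ±20% per optimization cycle.
    let clamped = (raw_adjustment * gain)
        .clamp(-MAX_INTENT_ADJUSTMENT, MAX_INTENT_ADJUSTMENT);
    state.last_adjustment_ns = now_ns;
    Some(clamped)
}
```

Note that the hysteresis and hold-time checks run before any gain computation, so a converged or recently adjusted cgroup costs only two comparisons per tick.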

The in-kernel inference engine (Section 21.4) powers the optimization. The "Intent I/O Scheduler" and "Intent Page Prefetch" models (Section 21.4.5) are use cases of intent-based management.
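
The policy priority ordering reduces to a clamp: an intent-driven request is confined between the cgroup's min guarantee and the tightest ceiling imposed by the layers above it. A minimal sketch, with all function and parameter names hypothetical (shown here for a CPU frequency value in MHz):

```rust
/// Each layer may only move a value within the ceiling set by the layer
/// above it: hardware limit, then admin cgroup limit, then power budget.
pub fn effective_ceiling(hw_limit_mhz: u64, admin_max_mhz: u64, power_cap_mhz: u64) -> u64 {
    hw_limit_mhz.min(admin_max_mhz).min(power_cap_mhz)
}

/// An intent adjustment (layer d) is confined to [min_guarantee, ceiling];
/// it can never override the layers above, only optimize within them.
pub fn apply_intent_request(
    requested_mhz: u64,
    min_guarantee_mhz: u64,
    hw_limit_mhz: u64,
    admin_max_mhz: u64,
    power_cap_mhz: u64,
) -> u64 {
    requested_mhz.clamp(
        min_guarantee_mhz,
        effective_ceiling(hw_limit_mhz, admin_max_mhz, power_cap_mhz),
    )
}
```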

IntentOptimizer data structures.

The IntentOptimizer is a singleton kernel subsystem that owns the optimization loop state. It runs as a dedicated kernel thread (kthread/intent_optimizer) and is never instantiated more than once.

/// Top-level state for the intent optimization subsystem.
///
/// One instance exists system-wide, owned by the `intent_optimizer` kernel
/// thread. All fields are accessed only from that thread except where noted.
pub struct IntentOptimizerState {
    /// PD controller state per resource dimension.
    /// Each dimension (CPU weight, frequency, accelerator allocation, I/O
    /// priority) has independent gain constants and error history.
    pub controller: [PdControllerState; ResourceDim::COUNT],

    /// Per-cgroup intent control state. Keyed by cgroup ID.
    /// Allocated on first intent assignment, freed when all intents are cleared.
    pub cgroup_states: BTreeMap<CgroupId, IntentControlState>,

    /// Monotonic nanosecond timestamp of the last optimizer tick.
    /// Used to detect missed ticks and adjust derivative computation.
    pub last_tick_ns: u64,

    /// Whether the optimizer is in the "hold" state (all intents converged).
    /// When true, the optimizer still runs on its timer but skips computation
    /// unless a metric drifts outside the convergence band.
    pub all_converged: bool,

    /// Configuration: optimizer tick period in nanoseconds. Default: 1_000_000_000 (1s).
    /// Adjustable at runtime via `/sys/kernel/umka/intent/tick_period_ns`.
    pub tick_period_ns: u64,
}

/// PD controller state for one resource dimension (CPU weight, frequency, etc.).
///
/// # Floating-Point Usage
///
/// This struct contains `f64` fields. Floating-point arithmetic in kernel context
/// requires FPU state to be saved/restored across preemption points and context
/// switches. To avoid this overhead on the common path, this struct is only ever
/// accessed from the `IntentOptimizer` kthread — a dedicated kernel thread that
/// runs with FPU context enabled in task (non-interrupt) context.
///
/// **Forbidden contexts**: interrupt handlers, RCU callbacks, softirq handlers,
/// spinlock critical sections, or any preemption-disabled section. Violations cause
/// FPU state corruption on preemptible kernels.
///
/// Compile-time enforcement: `PdControllerState` is `!Send` (via `PhantomData<*mut ()>`)
/// — only the single `IntentOptimizer` kthread may hold a reference to it.
pub struct PdControllerState {
    /// Proportional gain constant. Default depends on dimension
    /// (0.5 for CPU weight, 0.3 for frequency). May be halved by the
    /// exponential backoff mechanism after 3 consecutive non-convergent ticks.
    pub k_p: f64,

    /// Derivative gain constant. Set to `k_p * tau_d` where `tau_d` is the
    /// dominant feedback delay for this dimension (see Section 6.7.5.1).
    pub k_d: f64,

    /// Setpoint: the target value that the controller drives toward.
    /// For latency dimensions, this is `target_latency_ns`. For throughput,
    /// `target_ops_per_sec`. Updated when the cgroup's intent changes.
    pub setpoint: f64,

    /// Normalized error from the previous optimizer tick.
    /// Used to compute the discrete derivative `d_error[k]`.
    /// Initialized to 0.0 on first tick after intent assignment.
    pub prev_error: f64,

    /// Accumulated integral term (reserved for future PID extension).
    /// Currently unused — the optimizer runs a PD controller only.
    /// Retained in the struct to avoid a layout change if PID is needed.
    pub integral: f64,

    /// Current effective gain multiplier, reduced by exponential backoff.
    /// Starts at 1.0, halved after 3 consecutive non-convergent adjustments,
    /// reset to 1.0 on convergence.
    pub gain_multiplier: f64,

    /// `!Send` marker: prevents this struct from being moved across threads.
    /// `PdControllerState` must only be accessed from the `IntentOptimizer`
    /// kthread, which holds FPU context. Transferring it to another thread
    /// would bypass this invariant and risk FPU state corruption.
    _not_send: core::marker::PhantomData<*mut ()>,
}
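
One controller step, following the control law specified in Section 6.7.5.1, can be sketched in userspace as below. The field names mirror `PdControllerState`; the `step` method itself is an illustrative assumption, not the in-kernel signature:

```rust
/// Illustrative PD controller step (field names mirror PdControllerState).
pub struct PdStep {
    pub k_p: f64,
    pub k_d: f64,
    pub prev_error: f64,
    pub gain_multiplier: f64, // 1.0, halved by exponential backoff
}

impl PdStep {
    /// `error` is the normalized deficit at this tick; `tick_s` is the
    /// optimizer tick period T in seconds. Returns the clamped adjustment.
    pub fn step(&mut self, error: f64, tick_s: f64) -> f64 {
        const MAX_INTENT_ADJUSTMENT: f64 = 0.20;
        // Discrete derivative: d_error[k] = (error[k] - error[k-1]) / T.
        let d_error = (error - self.prev_error) / tick_s;
        self.prev_error = error;
        // raw = K_p * error + K_d * d_error, scaled by the backoff multiplier.
        let raw = self.gain_multiplier * (self.k_p * error + self.k_d * d_error);
        raw.clamp(-MAX_INTENT_ADJUSTMENT, MAX_INTENT_ADJUSTMENT)
    }
}
```

With the default CPU-weight gains (K_p = 0.5, K_d = 1.0) and T = 1 s, a steady error of 0.1 yields 0.15 on the first tick (derivative kick from prev_error = 0) and 0.05 on subsequent ticks.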

Observation flow: observe_scheduler_metric().

Scheduler observations flow into the optimizer through per-CPU observation rings. Each CPU writes metrics locally without contention; the optimizer thread reads them in bulk during its tick.

/// Write a scheduler observation to the per-CPU observation ring.
///
/// Called from the scheduler hot path (task tick, context switch, wakeup)
/// with preemption disabled. The write is O(1) — a single slot in a
/// pre-allocated per-CPU ring buffer. If the ring is full, the oldest
/// unread observation is silently overwritten (lossy under extreme load,
/// but the optimizer is statistical and tolerates dropped samples).
///
/// # Arguments
///
/// * `cpu` — The CPU producing the observation. Must be the current CPU
///   (enforced by the `PreemptGuard` the caller holds).
/// * `metric_type` — The type of metric being reported.
/// * `value` — The metric value (nanoseconds for latency, count for
///   throughput, 0-1024 for utilization).
///
/// # Performance
///
/// ~5-10 ns per call (single cache-line write to per-CPU ring). Zero
/// allocation. No locks. The per-CPU ring is sized to hold 256 entries
/// (one cache line per entry), sufficient for ~250ms of observations at
/// 1000 observations/second before wraparound.
pub fn observe_scheduler_metric(cpu: CpuId, metric_type: SchedMetricType, value: u64) {
    let ring = per_cpu!(sched_observation_ring, cpu);
    ring.push(SchedObservation {
        timestamp_ns: arch::current::cpu::read_timestamp(),
        metric_type,
        value,
    });
}

/// Scheduler metric types reported to the intent optimizer.
#[derive(Copy, Clone, Debug)]
pub enum SchedMetricType {
    /// Task wakeup-to-run latency in nanoseconds.
    WakeupLatency,
    /// Run queue depth (number of runnable tasks) at observation time.
    RunQueueDepth,
    /// CPU utilization (PELT-smoothed, 0-1024 scale).
    CpuUtilization,
    /// Context switch count since last observation.
    ContextSwitchCount,
    /// EAS migration decision (1 = migrated to efficiency core, 0 = stayed).
    EasMigration,
}

/// A single scheduler observation written to the per-CPU ring.
#[repr(C, align(64))]
pub struct SchedObservation {
    /// Monotonic timestamp in nanoseconds.
    pub timestamp_ns: u64,
    /// The metric being reported.
    pub metric_type: SchedMetricType,
    /// The metric value.
    pub value: u64,
}
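
The overwrite-oldest semantics described in the doc comment above can be sketched in userspace. The real ring is a fixed-capacity, lock-free per-CPU structure holding `SchedObservation` entries; this simplified single-threaded version (with `u64` payloads) only illustrates the lossy push/pop behavior:

```rust
/// Minimal ring with overwrite-on-full semantics (illustrative only).
pub struct ObservationRing {
    buf: Vec<Option<u64>>, // stand-in for SchedObservation slots
    head: usize,           // next write slot
    tail: usize,           // next read slot
    len: usize,
}

impl ObservationRing {
    pub fn new(capacity: usize) -> Self {
        Self { buf: vec![None; capacity], head: 0, tail: 0, len: 0 }
    }

    /// O(1) push; silently overwrites the oldest unread entry when full.
    pub fn push(&mut self, obs: u64) {
        let cap = self.buf.len();
        self.buf[self.head] = Some(obs);
        self.head = (self.head + 1) % cap;
        if self.len == cap {
            self.tail = (self.tail + 1) % cap; // oldest entry dropped
        } else {
            self.len += 1;
        }
    }

    pub fn pop(&mut self) -> Option<u64> {
        if self.len == 0 {
            return None;
        }
        let obs = self.buf[self.tail].take();
        self.tail = (self.tail + 1) % self.buf.len();
        self.len -= 1;
        obs
    }
}
```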

These per-CPU observation rings are part of the KernelObservation bus defined in Section 22.1. The observe_kernel! macro (zero-cost when no consumer is attached) is the generic entry point; observe_scheduler_metric() is the scheduler-specific typed wrapper.

Parameter update propagation: apply_policy_update().

When the optimizer (or an external Tier 2 ML policy service) computes a new parameter value, it flows through a validated, bounded update path:

/// Apply a policy-driven parameter update to a kernel tunable.
///
/// This is the sole entry point for runtime parameter changes from both
/// the in-kernel intent optimizer and external Tier 2 policy services.
/// All updates are validated and bounded before application.
///
/// # Validation
///
/// 1. The parameter must be registered in the Kernel Tunable Parameter
///    Store ([Section 22.1](22-ml-policy.md#221-aiml-policy-framework-closed-loop-kernel-intelligence)).
/// 2. The new value must be within the parameter's declared `[min, max]`
///    bounds. Out-of-bounds values are clamped (not rejected) and the
///    clamping is logged.
/// 3. The rate of change must not exceed `MAX_INTENT_ADJUSTMENT` (20%)
///    per optimizer tick. If the requested change exceeds this, it is
///    clamped to the maximum rate.
/// 4. The caller must hold `CAP_ML_TUNE` (for Tier 2 services) or be
///    the intent optimizer kernel thread (implicitly authorized).
///
/// # Propagation
///
/// After validation, the update is written to the parameter's backing
/// store (an `AtomicU64` or `AtomicF64` in the tunable registry). The
/// affected subsystem reads the new value on its next access — there is
/// no explicit notification. This is safe because all tunable parameters
/// are designed to be read speculatively (the scheduler reads
/// `eevdf_weight_scale` on every `pick_next_task`, not cached).
///
/// # Auto-decay
///
/// Every parameter update carries an expiry timestamp. If no new update
/// arrives within the expiry window (default: 60 seconds), the parameter
/// reverts to its compiled-in default. This prevents a crashed ML service
/// from leaving the kernel in a mis-tuned state indefinitely.
pub fn apply_policy_update(
    param: &KernelTunableParam,
    value: f64,
    expiry_ns: u64,
) -> Result<(), PolicyUpdateError> {
    // 1. Bounds check and clamp.
    let clamped = value.clamp(param.min_value, param.max_value);
    if (clamped - value).abs() > f64::EPSILON {
        log_clamped_update(param, value, clamped);
    }

    // 2. Rate-of-change check and clamp.
    let current = param.current_value();
    let max_delta = current.abs() * MAX_INTENT_ADJUSTMENT;
    let delta = (clamped - current).clamp(-max_delta, max_delta);
    let final_value = current + delta;

    // 3. Write to backing store with expiry.
    param.set_with_expiry(final_value, expiry_ns);

    Ok(())
}
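
The auto-decay rule can be sketched as follows. `Tunable` and its methods are hypothetical stand-ins for a tunable-registry entry; the key point is that readers go through an accessor that reverts to the compiled-in default once the expiry passes:

```rust
/// Illustrative tunable with expiring updates (hypothetical names).
pub struct Tunable {
    pub default_value: f64, // compiled-in default
    pub value: f64,         // last policy-applied value
    pub expires_at_ns: u64, // update is valid until this timestamp
}

impl Tunable {
    pub fn set_with_expiry(&mut self, value: f64, expires_at_ns: u64) {
        self.value = value;
        self.expires_at_ns = expires_at_ns;
    }

    /// Readers always use this accessor: a stale update decays to default,
    /// so a crashed policy service cannot leave the kernel mis-tuned.
    pub fn current_value(&self, now_ns: u64) -> f64 {
        if now_ns >= self.expires_at_ns {
            self.default_value
        } else {
            self.value
        }
    }
}
```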

Wake mechanism: the optimizer kernel thread.

The intent optimizer runs as a dedicated kernel thread (kthread/intent_optimizer) created during boot. It does not busy-poll — it sleeps and is woken by one of two mechanisms:

  1. Periodic timer. A high-resolution timer fires every tick_period_ns (default: 1 second). This is the primary wake source during normal operation. The timer is set with HRTIMER_MODE_REL and re-armed at the end of each optimizer tick to avoid drift accumulation.

  2. Threshold-crossing event. When a per-CPU observation ring detects that a metric has crossed a critical threshold (e.g., wakeup latency exceeds 2x the target for any cgroup with an active latency intent), it sets an atomic flag (intent_optimizer_wake_pending) and sends an IPI to the CPU running the optimizer thread. This wakes the optimizer ahead of the next timer tick, enabling faster response to sudden load changes. The threshold check is a single comparison in the observe_scheduler_metric() hot path and is skipped entirely when no cgroup has active intents (checked via a global atomic active_intent_count).

Threshold Crossing Detection and Wake Protocol:

The intent optimizer is a background kernel thread (umka_intent_optimizer) that re-evaluates scheduling policy hints based on observed workload behavior. It runs at SCHED_NORMAL with a +5 nice value (see the thread description below), so it never competes with RT or deadline workloads.

Threshold monitoring: Each schedulable entity tracks a set of behavioral metrics updated by the scheduler hot path:

  - runtime_last_window_us: CPU time consumed in the last 100 ms window.
  - iowait_fraction: fraction of runnable time spent in I/O wait.
  - cache_miss_rate: L3 miss rate sampled via the per-CPU PMU (updated every 10 ms tick).
  - wakeup_rate_hz: wakeups per second (exponential moving average).

Threshold crossing detection (lock-free, hot path): On each scheduler tick, a fast check compares each metric against its registered threshold. The comparison uses a hysteresis band (the "enter" threshold sits 5% above the nominal value, the "exit" threshold 5% below it) to prevent flapping. The check is implemented as:

if (metric > threshold_enter && !already_pending):
    per_cpu(intent_sample_pending) = true
    // No IPI yet — piggyback on the next scheduler yield point.
    // The pending flag clears only once the metric drops below threshold_exit.

IPI targeting (deferred, not on hot path): The intent optimizer thread sleeps on intent_wait_queue. It is woken by:

  1. Scheduler yield points: schedule() checks per_cpu(intent_sample_pending) for the current CPU and calls intent_wakeup_if_pending() — a non-IPI local wakeup (enqueues the optimizer thread on the current CPU's runqueue if it is not already runnable). O(1), no cross-CPU traffic.
  2. Cross-CPU threshold: if a threshold was crossed on CPU B but CPU B's scheduler is idle (no local yield point imminent), the timer tick on CPU B sends a SCHEDULER_KICK IPI to the optimizer thread's home CPU (the optimizer is pinned to CPU 0 by default, or to the least-loaded CPU if configured). IPIs are rate-limited to 1 per 100 ms per crossing to prevent IPI storms.

Batch processing: The optimizer wakes at most 100 times per second (configurable). Each wakeup processes up to 64 pending threshold crossings from all CPUs before returning to sleep.
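
The 1-per-100 ms IPI rate limit can be sketched as a simple timestamp check. The type and method names here are hypothetical:

```rust
/// Illustrative rate limiter for SCHEDULER_KICK IPIs: at most one IPI per
/// 100 ms per threshold crossing, preventing IPI storms under load spikes.
pub struct IpiRateLimit {
    pub last_ipi_ns: u64,
}

pub const IPI_MIN_INTERVAL_NS: u64 = 100_000_000; // 100 ms

impl IpiRateLimit {
    /// Returns true if an IPI may be sent now; records the send if so.
    pub fn try_send(&mut self, now_ns: u64) -> bool {
        if now_ns.saturating_sub(self.last_ipi_ns) >= IPI_MIN_INTERVAL_NS {
            self.last_ipi_ns = now_ns;
            true
        } else {
            false
        }
    }
}
```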

/// The intent optimizer kernel thread entry point.
///
/// This function runs in an infinite loop, sleeping between optimizer
/// ticks. It is created once during boot and runs for the lifetime of
/// the kernel.
fn intent_optimizer_thread() -> ! {
    let mut state = IntentOptimizerState::new();
    let timer = HrTimer::new(state.tick_period_ns, HrTimerMode::Rel);

    loop {
        // Sleep until timer fires or threshold-crossing wakes us early.
        timer.wait_or_wake(&INTENT_OPTIMIZER_WAKE_PENDING);

        // Drain all per-CPU observation rings into aggregated metrics.
        for cpu in 0..num_online_cpus() {
            let ring = per_cpu!(sched_observation_ring, cpu);
            while let Some(obs) = ring.pop() {
                state.aggregate_observation(cpu, &obs);
            }
        }

        // Run the optimization loop (steps 1-6 from the pseudocode above).
        // Snapshot the cgroup IDs first so per-cgroup state can be updated
        // without holding a long-lived borrow of `state.cgroup_states`.
        let cgroup_ids: Vec<CgroupId> = state.cgroup_states.keys().copied().collect();
        for cgroup_id in cgroup_ids {
            let metrics = state.collect_cgroup_metrics(cgroup_id);
            let adjustments = state.compute_adjustments(cgroup_id, &metrics);
            state.apply_cgroup_adjustments(cgroup_id, &adjustments);
        }

        // Re-arm the periodic timer.
        timer.rearm(state.tick_period_ns);
    }
}

The optimizer thread runs at SCHED_NORMAL priority with a low nice value (+5). It is not a real-time thread — it must not interfere with RT or deadline workloads. If the optimizer tick takes longer than expected (e.g., due to a large number of intent cgroups), the next tick is simply delayed — there is no attempt to "catch up" missed ticks, as the PD controller is designed for variable tick rates.

6.7.5.1 Stability Analysis

Control law.

The proportional control law from step 3 above is augmented with a derivative term to provide active damping. The complete PD control law for each resource dimension r is:

error[k]      = deficit[k] / target[r]          // normalized error at tick k
d_error[k]    = (error[k] - error[k-1]) / T     // discrete derivative (T = tick period)

raw_adjustment = K_p[r] × error[k]  +  K_d[r] × d_error[k]

adjustment[r] = clamp(raw_adjustment, -MAX_INTENT_ADJUSTMENT, +MAX_INTENT_ADJUSTMENT)

where:

| Parameter | Value | Meaning |
|-----------|-------|---------|
| T | 1 s | Optimizer tick period (see Section 6.7.6) |
| K_p (CPU weight) | 0.5 | Proportional gain, CPU weight dimension |
| K_p (frequency) | 0.3 | Proportional gain, frequency dimension |
| K_d (CPU weight) | 1.0 | Derivative gain = K_p × τ_d where τ_d = 2 s (dominant delay, see below) |
| K_d (frequency) | 0.6 | Derivative gain for frequency dimension |
| MAX_INTENT_ADJUSTMENT | 0.20 | Maximum fractional change per tick (20% of current allocation) |

MAX_INTENT_ADJUSTMENT = 0.20 means the optimizer will never increase or decrease a cgroup's allocation by more than 20% of its current value in a single tick, regardless of the magnitude of the computed adjustment. For example, a cgroup at cpu.weight = 100 can move at most to [80, 120] in one optimizer cycle.
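
As a worked example of the clamp's effect on convergence speed: doubling an allocation requires ln 2 / ln 1.2 ≈ 3.8, i.e. 4 full-rate optimizer cycles. A hypothetical helper makes this concrete:

```rust
/// Number of full-rate optimizer cycles needed to grow `weight` to at
/// least `target` when every cycle applies the maximum upward adjustment.
/// Illustrative helper, not an in-kernel function.
pub fn ticks_to_reach(mut weight: f64, target: f64, max_adj: f64) -> u32 {
    let mut ticks = 0;
    while weight < target {
        weight *= 1.0 + max_adj; // one cycle at the ±20% clamp
        ticks += 1;
    }
    ticks
}
```

Starting from cpu.weight = 100 with a target of 200, the trajectory is 120 → 144 → 172.8 → 207.4, i.e. geometric growth at rate 1.20 per cycle.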

Dominant delay estimate.

The feedback signals observed by the optimizer are:

| Signal | Measurement latency | Settling time |
|--------|---------------------|---------------|
| CPU utilization (PELT) | ~40 ms (PELT decay constant) | ≤ 200 ms |
| P99 latency (sliding window) | Configurable; default 1 s window | ≤ 1 s |
| Memory pressure (PSI) | 10 s exponential window | ≤ 10 s |
| Temperature (RAPL / ACPI thermal) | 1 s read interval (Section 6.4.1) | ≤ 2 s |
| I/O utilization (iostat-style) | 250 ms sampling | ≤ 500 ms |

The dominant (largest) delay in the loop is the PSI memory pressure signal with a settling time of up to 10 seconds. However, the intent optimizer only acts on PSI as a secondary, advisory input — it does not directly adjust memory allocation in response to PSI (that is the responsibility of the memory reclaim path, Section 4). For the primary control dimensions (CPU weight and frequency), the dominant delay is the P99 latency window of at most 1 second, and the RAPL/thermal signal at ≤ 2 seconds. Setting τ_d = 2 s for the derivative gain is therefore conservative (larger than needed for CPU control, but correct for the temperature feedback path).

Nyquist stability condition.

For a proportional-only controller in a sampled-data loop with tick period T and signal delay τ_d, the Nyquist stability criterion requires:

T ≥ 2 × τ_d_primary

For the primary CPU-weight control path: T = 1 s, τ_d_primary ≤ 1 s (P99 window). The Nyquist condition 1 s ≥ 2 × 1 s is not satisfied by the P99 window alone, which is why the derivative term is required.

The PD controller transforms the open-loop transfer function. With K_d = K_p × τ_d:

G_PD(z) = K_p × (1 + τ_d / T) - K_p × (τ_d / T) × z⁻¹

For K_p = 0.5, τ_d = 2, T = 1:

G_PD(z) = 0.5 × 3 - 0.5 × 2 × z⁻¹  =  1.5 - 1.0 × z⁻¹

Stability analysis for discrete-time PD controller:

Closed-loop characteristic equation: 1 + G_PD(z) × z⁻¹ = 0
  where G_PD(z) = k_p + k_d × (1 - z⁻¹) = 1.5 - 1.0z⁻¹  [example gains: k_p=0.5, τ_d=2, T=1]

Expanding: 1 + (1.5 - z⁻¹) × z⁻¹ = 0
→ 1 + 1.5z⁻¹ - z⁻² = 0
→ multiply by z²: z² + 1.5z - 1.0 = 0  [characteristic polynomial, degree 2]

Jury stability criterion for z² + bz + c (b = 1.5, c = -1.0):
  Condition 1: |c| < 1               →  |-1.0| = 1.0  [violated: exactly on the boundary]
  Condition 2: P(1)  = 1 + b + c > 0 →  1.5 > 0       [satisfied]
  Condition 3: P(-1) = 1 - b + c > 0 →  -1.5 < 0      [violated]
  Roots: z = 0.5 and z = -2.0. The root at -2.0 lies outside the unit
  circle, so these example gains are unstable under the simplified
  unity-gain, one-tick-delay plant model used above.

Note: The gains above (k_p = 0.5, τ_d = 2) are illustrative only. Production gain selection is performed numerically at deployment time via the umka-intent-tuner tool, which sweeps the gain space and verifies Jury stability for the measured plant delay on each hardware configuration. The architecture document shows the stability analysis methodology, not fixed production gains.
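
The example characteristic polynomial can be checked numerically. This illustrative sketch applies the quadratic formula to z² + 1.5z − 1 = 0; the roots come out as 0.5 and −2.0, whose magnitudes multiply to the |c| = 1.0 boundary value. The root at −2.0 has magnitude greater than 1, which is exactly why production gains must be selected numerically per the note above:

```rust
/// Roots of z^2 + b*z + c = 0 for non-negative discriminant (quadratic
/// formula). Used here to check the example polynomial z^2 + 1.5z - 1.
pub fn quadratic_roots(b: f64, c: f64) -> (f64, f64) {
    let disc = (b * b - 4.0 * c).sqrt();
    ((-b + disc) / 2.0, (-b - disc) / 2.0)
}
```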

Gain margin and stability boundary.

The gain margin is the factor by which K_p can increase before the closed loop becomes unstable. For the CPU-weight PD controller (K_d = 2 K_p), numerical analysis gives a gain margin of approximately 3× (i.e., K_p up to ~1.5 before instability). The chosen K_p = 0.5 is well within this margin.

If feedback pathology causes error[k] to grow without bound — for example, a bug in the P99 latency measurement that returns extreme values — the MAX_INTENT_ADJUSTMENT clamp prevents runaway:

// In every optimizer tick, regardless of computed adjustment magnitude:
let adjustment = raw_adjustment.clamp(-MAX_INTENT_ADJUSTMENT, MAX_INTENT_ADJUSTMENT);

Concretely: even if the proportional and derivative terms compute a raw adjustment of +5.0 (500% increase), the clamped adjustment is +0.20 (20% increase). The system cannot diverge faster than geometric growth at rate 1.20 per second, and the existing 5-second minimum hold time (stability controls in step 3) further limits the maximum achievable divergence rate to 1.20^(1/5) ≈ 1.037 per second — less than 4% per second even in a fully pathological feedback scenario.
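
The divergence bound is easy to verify numerically (a sanity check, not kernel code):

```rust
/// Worst-case growth per second under pathological feedback: the ±20% clamp
/// allows at most a 1.20x change per adjustment, and the 5-second minimum
/// hold time permits at most one adjustment per 5 s, so the per-second rate
/// is 1.20^(1/5).
pub fn worst_case_growth_per_second() -> f64 {
    let max_rate: f64 = 1.20; // 1 + MAX_INTENT_ADJUSTMENT
    max_rate.powf(1.0 / 5.0) // amortized over the 5 s minimum hold time
}
```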

State carried across ticks.

Each controlled cgroup retains:

struct IntentControlState {
    /// Normalized error from the previous optimizer tick (for derivative computation).
    prev_error: [f64; ResourceDim::COUNT],
    /// Number of consecutive ticks within the 10% convergence band.
    converged_ticks: u32,
    /// Monotonic nanosecond timestamp of the last applied adjustment.
    last_adjustment_ns: u64,
    /// Number of consecutive adjustments that failed to improve the metric (for backoff).
    backoff_count: u32,
}

prev_error is initialized to 0.0 on the first tick after an intent is set, so the derivative term contributes zero on the first control step. This avoids a derivative kick from the initial large error value.
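
The convergence criterion (metric within the 10% band for 3 consecutive windows, then hold) can be sketched as a small update function over `converged_ticks`; the function name is hypothetical:

```rust
/// Update the per-cgroup convergence counter for one tick.
/// Returns true once the intent is "met" (3 consecutive in-band ticks),
/// at which point the optimizer enters the hold state for this cgroup.
pub fn update_convergence(converged_ticks: &mut u32, normalized_error: f64) -> bool {
    if normalized_error.abs() <= 0.10 {
        *converged_ticks += 1;
    } else {
        *converged_ticks = 0; // drifted outside the band: restart the count
    }
    *converged_ticks >= 3
}
```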

Interaction with existing stability controls.

The stability controls in step 3 complement the PD controller:

  • Hysteresis (10% band): prevents the derivative term from amplifying measurement noise. When |error[k]| < 0.10, both K_p × error and K_d × d_error are zeroed (no adjustment applied). This is equivalent to a deadband in the control law.
  • 5-second minimum hold time: provides a floor on effective tick rate, giving the plant time to respond before the next adjustment. This is equivalent to adding latency to the feedback path, making the system behave as if T_eff = max(T, 5 s) for purposes of the Nyquist criterion.
  • Exponential backoff (3-miss rule): after 3 consecutive adjustments without convergence, K_p is halved for subsequent ticks until convergence is achieved. This adaptive gain reduction handles unmodelled plant dynamics (e.g., a workload whose latency is bottlenecked on I/O, not CPU — increasing CPU weight does nothing).
  • ±20% max rate (MAX_INTENT_ADJUSTMENT): as analyzed above, provides the runaway-prevention bound.

Together, these mechanisms give the intent optimizer the stability properties of an overdamped second-order system: it converges monotonically (no overshoot in the normal case) with a time constant of approximately 3–5 optimizer ticks (3–5 seconds). This is appropriate for a slow-path resource allocation controller — fast enough to respond to workload shifts within 10–15 seconds, slow enough to avoid the thrashing that a tighter controller would produce.

Intent Admission Control:

Intents are advisory, not guaranteed. When a cgroup sets intent.latency_ns = 5000000 (5 ms), the kernel attempts to meet it but does not reject the intent if resources are insufficient. Instead:

  - If the intent cannot be met: intent.status reports latency_met: false with the actual observed P99 latency.
  - Clamping: intent values are clamped to physically achievable bounds. An intent.latency_ns = 1 (1 nanosecond) is silently clamped to the system's minimum achievable scheduling latency (~10 μs on a typical x86 system).
  - Contradictions (e.g., intent.latency_ns = 1000 with intent.efficiency = 100) are logged as warnings in intent.status with contradiction: latency_vs_efficiency. The optimizer prioritizes the latency target.
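
Clamping to achievable bounds can be sketched as below. The ~10 μs floor is the example value from the text; the constant and function names are hypothetical:

```rust
/// Minimum achievable scheduling latency on the running system
/// (~10 µs used here as the example value from the text).
pub const MIN_ACHIEVABLE_LATENCY_NS: u64 = 10_000;

/// Silently clamp an intent.latency_ns value to the achievable floor.
pub fn clamp_latency_intent(requested_ns: u64) -> u64 {
    requested_ns.max(MIN_ACHIEVABLE_LATENCY_NS)
}
```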

Intent Feedback:

The intent.status file (defined above) provides real-time feedback on intent satisfaction. See the intent.status definition in the cgroup interface listing above for the full field set.

Multi-Tenant Isolation:

The cgroup hierarchy IS the authority for resource isolation. Intents operate within existing cgroup limits:

  - A child cgroup's intent cannot cause resource consumption exceeding the parent's cpu.max, memory.max, or power.max limits.
  - Parent limits are a hard ceiling. Intents are soft optimization within that ceiling.
  - Cross-tenant interference is prevented by existing cgroup isolation — the intent optimizer adjusts knobs for one cgroup without affecting other cgroups' guarantees (cpu.min, memory.min are respected).

6.7.6 Performance Impact

The optimization loop runs once per second as a background kernel thread. Each iteration reads per-cgroup metrics, runs the inference engine, and writes adjusted parameters. The total cost scales linearly with the number of cgroups that have active intents:

| Active intent cgroups | Typical iteration cost | Notes |
|-----------------------|------------------------|-------|
| ≤ 16 | ~100–500 μs | Sub-millisecond; inference engine inline |
| 16–100 | ~1–5 ms | Still negligible as a fraction of the 1-second period |
| > 100 | ~10–50 ms | Should use intent_optimizer_batch_size to split |

For the default case (≤50 cgroups), the amortized overhead is well under 0.5% CPU.

The actual scheduling/allocation decisions use the same fast paths as before. Only the cgroup parameters change. Hot-path performance: identical to Linux.

6.7.7 Explainability Interface

Intent optimization (Section 6.7.5) reports whether intents are met and what adjustments were made, but administrators also need to understand why a specific performance target is not being met, what the system tried and rejected, and what action they could take to help. The explainability interface provides this deep diagnostic view.

sysfs interface (per-cgroup, read-only):

/sys/fs/cgroup/<group>/intent.explain
    bottleneck: cpu|memory|io|accelerator|power|network
    bottleneck_detail: "CPU saturated: 4/4 cores at 100%, cpu.max reached"
    adjustments_attempted: 5
    adjustments_rejected: 2
    rejected_reasons: ["cpu.max ceiling reached", "power budget Section 6.4 constraint"]
    recommendation: "Increase cpu.max from 400000 to 600000"
    conflicting_intents: ["cgroup:/prod/db has higher cpu.weight, consuming 3/4 cores"]

Each field is populated by the optimization loop (Section 6.7.5) at the end of each cycle. The bottleneck field identifies the single most constrained resource. The recommendation field suggests the smallest configuration change that would allow the intent to be met. The conflicting_intents field lists other cgroups whose intents are competing for the same resource.

Structured tracepoint: umka_tp_stable_intent_explain is emitted every optimization cycle for cgroups with unmet intents. Fields: cgroup path, intent type, target value, actual value, bottleneck type, attempted adjustments (count), rejection reasons (array). This enables perf / BPF-based monitoring of intent optimization across the system.

Adjustment history log: Per-cgroup ring buffer of the last 64 adjustments, exposed via:

/sys/fs/cgroup/<group>/intent.adjustment_history
# Each entry:
#   timestamp: 1708012345.123456
#   parameter: cpu.max
#   old_value: 400000
#   new_value: 500000
#   reason: "latency_p99 target 5ms, actual 8ms, cpu was bottleneck"
#   effect: "latency_p99 dropped from 8ms to 4.2ms at next cycle"

The ring buffer is fixed-size (64 entries, ~8 KiB per cgroup) and wraps around. It provides a complete causal trail: what changed, why, and what happened as a result.

Integration with islectl: islectl intent explain <cgroup> provides a human-readable summary combining intent.status + intent.explain + intent.adjustment_history into a single diagnostic view. Example output:

$ islectl intent explain /prod/web
Intent: latency_p99 ≤ 5ms
Status: NOT MET (actual: 8.1ms)
Bottleneck: CPU (4/4 cores at 100%, cpu.max = 400000)
Recommendation: Increase cpu.max to 600000
Conflicting: /prod/db (cpu.weight=200, consuming 3/4 cores)
Last 3 adjustments:
  [12:01:05] cpu.weight 100→150 — effect: p99 9.3ms→8.5ms
  [12:01:10] io.weight  100→200 — effect: p99 8.5ms→8.3ms (not bottleneck)
  [12:01:15] cpu.weight 150→200 — rejected: power budget Section 6.4 constraint