Chapter 7: Scheduling and Power Management¶
EEVDF, RT, deadline scheduling, per-CPU runqueues, EAS, power budgeting, CPU bandwidth, timekeeping
EEVDF (Earliest Eligible Virtual Deadline First) is the default scheduler class. Real-time (FIFO/RR) and deadline classes are fully supported. Per-CPU runqueues eliminate global contention. Energy-Aware Scheduling (EAS) drives power management on heterogeneous platforms. All scheduling policy is replaceable via live kernel evolution.
7.1 Scheduler¶
7.1.1 Multi-Policy Design¶
The scheduler supports three scheduling policies simultaneously:
| Policy | Algorithm | Use case | Priority range |
|---|---|---|---|
| Normal | EEVDF | General-purpose workloads (Section 7.1) | Nice -20 to 19 |
| Real-Time | FIFO / RR | Latency-sensitive applications | RT 1-99 |
| Deadline | EDF (CBS) | Guaranteed CPU time (audio, etc.) | Runtime/Period |
Live evolution: Scheduler policy is replaceable via the SchedPolicy trait as an
EvolvableComponent (Section 13.18). Runqueue data structures (EEVDF
red-black tree, DL intrusive RB tree, RT bitmap) are non-replaceable verified data — only the
pick/enqueue/dequeue policy logic is swappable. This means a live kernel evolution cycle
can replace the scheduling algorithm (e.g., swap EEVDF for a future policy) without
draining or rebuilding the runqueue. The replacement module imports the existing runqueue
state via EvolvableComponent::import_state() and resumes scheduling immediately. See
also the SchedPolicy trait definition in Section 19.9 and the
SchedClassOps trait documentation in Section 7.1 below.
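The swap flow can be sketched with plain trait objects. This is a toy illustration only — the real SchedPolicy and EvolvableComponent traits are defined in Sections 19.9 and 13.18; every name and signature below is an assumption for illustration, not the kernel API:

```rust
/// Toy stand-in for the SchedPolicy trait: picks the next entity index from
/// runqueue state that it does NOT own. (Illustrative only — not the real
/// UmkaOS trait.)
trait SchedPolicy {
    fn pick_next(&self, keys: &[u64]) -> Option<usize>;
}

/// EEVDF-flavoured stand-in: pick the smallest key (earliest deadline).
struct EarliestDeadline;
impl SchedPolicy for EarliestDeadline {
    fn pick_next(&self, keys: &[u64]) -> Option<usize> {
        keys.iter().enumerate().min_by_key(|&(_, k)| *k).map(|(i, _)| i)
    }
}

/// A hypothetical replacement policy installed by a live evolution cycle.
struct LatestDeadline;
impl SchedPolicy for LatestDeadline {
    fn pick_next(&self, keys: &[u64]) -> Option<usize> {
        keys.iter().enumerate().max_by_key(|&(_, k)| *k).map(|(i, _)| i)
    }
}

/// Swap the policy object in place. The runqueue state (`keys`) is untouched:
/// no drain, no rebuild — mirroring the import_state() handover in the text.
fn swap_policy(slot: &mut Box<dyn SchedPolicy>, new: Box<dyn SchedPolicy>) {
    *slot = new;
}
```

The point of the sketch is the separation of concerns: the data the policy operates on outlives the policy object, so replacing the policy is a pointer swap rather than a queue rebuild.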
7.1.2 Architecture¶
```
            Global Load Balancer
             (runs every ~4ms)
                    |
      +-------------+-------------+
      |             |             |
    CPU 0         CPU 1         CPU N
 +----------+  +----------+  +----------+
 | RT Queue |  | RT Queue |  | RT Queue |  <- Highest priority
 +----------+  +----------+  +----------+
 | DL Queue |  | DL Queue |  | DL Queue |  <- Deadline tasks
 +----------+  +----------+  +----------+
 |EEVDF Tree|  |EEVDF Tree|  |EEVDF Tree|  <- Normal tasks (red-black tree, EEVDF)
 +----------+  +----------+  +----------+
```
7.1.2.1 EEVDF Algorithm Specification¶
Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules.
`&self` methods use interior mutability for mutation. Atomic fields use `.store()`/`.load()`. See CLAUDE.md Spec Pseudocode Quality Gates.
The EEVDF (Earliest Eligible Virtual Deadline First) scheduler is the primary
scheduling algorithm for normal (non-RT, non-deadline) tasks. This section
specifies the complete algorithm, matching Linux 6.12+ EEVDF mathematical
semantics exactly for the five core functions (avg_vruntime, entity_eligible,
pick_eevdf, update_entity_lag, place_entity), while using UmkaOS-native
integration (CpuLocal, Evolvable, ML policy).
Reference: Stoica & Abdel-Wahab, "Earliest Eligible Virtual Deadline First:
A Flexible and Accurate Mechanism for Proportional Share Resource Allocation",
TR-95-22, Old Dominion University, 1995. Linux implementation:
kernel/sched/fair.c (v6.12+, Peter Zijlstra).
Evolvable component classification. The EEVDF scheduler is split into non-replaceable (Nucleus) and replaceable (Evolvable) components per Section 13.18:
| Component | Classification | Rationale |
|---|---|---|
| `VruntimeTree` data structure | Nucleus (data) | Shared tree + accumulator layout; embedded by `EevdfRunQueue`, `CbsCpuServer`, `GroupEntity` |
| `EevdfRunQueue` data structure | Nucleus (data) | Wraps `VruntimeTree` + root-only fields (`curr`, `next`, `bandwidth_timer`); `import_state` operates on these fields |
| `EevdfTask` scheduling fields | Nucleus (data) | Task state layout must survive policy replacement |
| RB-tree node layout | Nucleus (data) | Data structure layout integrity; tree operations are Evolvable code |
| `VruntimeTree` accumulators (`sum_w_vruntime`, `sum_weight`, `zero_vruntime`) | Nucleus (data + invariant) | Accumulator integrity invariants (division-free eligibility, overflow bounds) are checked by `EevdfInvariantChecker`. The accumulators themselves are Nucleus data (part of `VruntimeTree`); the code that updates them is Evolvable. |
| `avg_vruntime()` computation | Evolvable | Formula code, not data. Correctness ensured by Nucleus invariant checker (`EevdfInvariantChecker`) before swap is committed. Making formulas Nucleus prevents fixing a math bug without reboot — worse for 50-year correctness. |
| `entity_eligible()` check | Evolvable | Formula code operating on Nucleus data. Invariant checker validates division-free semantics consistency with `avg_vruntime`. |
| `update_entity_lag()` | Evolvable | Vlag clamping formula. Invariant checker validates clamping bounds and monotonicity. |
| `update_curr()` vruntime advance | Evolvable | Accounting formula. Invariant checker validates weight-proportional vruntime advance and monotonic vruntime for running entity. |
| `eevdf_task_tick()` | Evolvable | Per-tick logic: delegates to `update_curr()`, check_preempt_tick, cgroup reweight. All policy code. |
| `set_protect_slice()` | Evolvable | Protection fraction is policy (ML-tunable via `ParamId::SchedProtectFraction`). |
| PELT (`update_load_avg()`) | Evolvable | Exponential decay formula. Invariant checker validates decay constant and signal bounds. |
| CBS charge path (`cbs_charge()`) | Evolvable | Budget accounting formula. Invariant checker validates deficit bounds. |
| RB-tree insert/delete/augment ops | Evolvable | Code operating on Nucleus RB-tree node layout. O(log N) invariant checked. |
| `pick_eevdf()` tree walk | Evolvable | Tie-breaking, PICK_BUDDY, protect_slice heuristics swappable |
| `place_entity()` wake policy | Evolvable | Lag inflation, initial placement, wake bonus swappable |
| Slice computation | Evolvable | ML-tunable via `ParamId::SchedEevdfWeightScale` |
| Preemption threshold | Evolvable | ML-tunable via `ParamId::SchedPreemptionLatencyBudget` |
| Load balancer heuristics | Evolvable | Migration benefit threshold ML-tunable |
Phase assignment: EEVDF Nucleus data structures and all Evolvable scheduler code (formulas + policy) are Phase 2 (required for basic scheduling). ML integration of tunable parameters is Phase 4 (Section 23.1).
Virtual runtime and weights. Each task accumulates virtual runtime proportional to its CPU consumption, inversely scaled by its weight:

delta_vruntime = delta_exec_ns * NICE_0_WEIGHT / weight
where NICE_0_WEIGHT = 1024 (the weight of a nice-0 task) and delta_exec_ns
is the wall-clock nanoseconds the task ran since the last accounting update.
Higher-weight tasks accumulate vruntime more slowly (they are entitled to more
CPU), and lower-weight tasks accumulate vruntime more quickly.
The sched_prio_to_weight table maps nice values to weights (identical to
Linux CFS/EEVDF). The 40 entries (nice -20 to +19):
| Nice | Weight | Nice | Weight | Nice | Weight | Nice | Weight |
|---|---|---|---|---|---|---|---|
| -20 | 88761 | -10 | 9548 | 0 | 1024 | 10 | 110 |
| -19 | 71755 | -9 | 7620 | 1 | 820 | 11 | 87 |
| -18 | 56483 | -8 | 6100 | 2 | 655 | 12 | 70 |
| -17 | 46273 | -7 | 4904 | 3 | 526 | 13 | 56 |
| -16 | 36291 | -6 | 3906 | 4 | 423 | 14 | 45 |
| -15 | 29154 | -5 | 3121 | 5 | 335 | 15 | 36 |
| -14 | 23254 | -4 | 2501 | 6 | 272 | 16 | 29 |
| -13 | 18705 | -3 | 1991 | 7 | 215 | 17 | 23 |
| -12 | 14949 | -2 | 1586 | 8 | 172 | 18 | 18 |
| -11 | 11916 | -1 | 1277 | 9 | 137 | 19 | 15 |
Each step of +1 nice reduces CPU share by approximately 10% (weight ratio
~1.25 between adjacent nice levels). The inverse weight table
(sched_prio_to_wmult) is precomputed for fixed-point division:
wmult[i] = 2^32 / weight[i].
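As a sanity check on the fixed-point identity, here is a minimal standalone sketch — not the kernel code; `wmult` is recomputed here rather than read from the precomputed table, and the truncated `wmult` can make the result differ from exact division by a unit or so for slice-scale deltas:

```rust
const NICE_0_WEIGHT: u64 = 1024;

/// sched_prio_to_wmult entry: 2^32 / weight, computed once at table-build time.
fn wmult(weight: u64) -> u64 {
    (1u64 << 32) / weight
}

/// Hot-path conversion: delta * NICE_0_WEIGHT / weight, with the division
/// replaced by a multiply and a 32-bit right shift.
fn delta_fair_fixed(delta_ns: u64, weight: u64) -> u64 {
    ((delta_ns as u128 * NICE_0_WEIGHT as u128 * wmult(weight) as u128) >> 32) as u64
}
```

For nice-0 (weight 1024) the shift cancels exactly and the delta passes through unchanged, which is why the real `calc_delta_fair()` short-circuits that case.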
calc_delta_fair: Converts wall-clock nanoseconds to virtual-time
nanoseconds for a given entity. Used by update_curr(), update_entity_lag(),
and place_entity():
```rust
/// Convert wall-clock delta to virtual-time delta for an entity.
/// For nice-0 (weight 1024): returns delta unchanged.
/// For nice +19 (weight 15): returns delta * 1024 / 15 ≈ 68× delta.
/// For nice -20 (weight 88761): returns delta * 1024 / 88761 ≈ 0.012× delta.
///
/// Linux equivalent: `calc_delta_fair()` + `__calc_delta()`.
#[inline]
fn calc_delta_fair(delta_ns: u64, weight: u32) -> u64 {
    if weight == NICE_0_WEIGHT {
        delta_ns
    } else {
        // Fixed-point: delta * NICE_0_WEIGHT * 2^32 / weight / 2^32
        // Using precomputed wmult = 2^32 / weight:
        //   result = (delta * NICE_0_WEIGHT * wmult) >> 32
        //
        // For weights within the sched_prio_to_weight table range (15..=88761),
        // use the precomputed table for speed. For group entities with weights
        // outside the table range (e.g., cpu.weight=10000 maps to group_weight
        // = 102400), compute wmult dynamically. This matches Linux's
        // `__update_inv_weight()` which computes inv_weight on-the-fly for
        // group entities.
        let wmult = if weight <= MAX_TABLE_WEIGHT {
            SCHED_PRIO_TO_WMULT[weight_to_idx(weight)]
        } else {
            // Dynamic computation for out-of-range weights (group entities).
            (u32::MAX as u64 / weight as u64) as u32
        };
        ((delta_ns as u128 * NICE_0_WEIGHT as u128 * wmult as u128) >> 32) as u64
    }
}
```
cgroup cpu.weight and task weight: Task weight is always derived from the task's
nice value via sched_prio_to_weight[nice + 20]. The cgroup's cpu.weight affects
the GroupEntity's weight in the hierarchical scheduler tree, not individual task
weights. The cpu.weight value [1, 10000] is converted to an EEVDF group weight via
group_weight = (cpu.weight * NICE_0_WEIGHT) / 100 and applied to the GroupEntity
that represents the cgroup in its parent's run queue. This hierarchical model means
that tasks in a cgroup with cpu.weight = 200 get twice the CPU share of tasks in a
sibling cgroup with cpu.weight = 100, but each task's individual vruntime
accumulation rate is still governed by its nice-derived weight within the group.
See Section 17.2 for the full hierarchical weight model.
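The conversion above can be sketched in one line (constants from this section; illustrative, not the kernel code):

```rust
const NICE_0_WEIGHT: u64 = 1024;

/// cgroup cpu.weight in [1, 10000] → EEVDF group weight, per the formula in
/// the text: group_weight = (cpu.weight * NICE_0_WEIGHT) / 100.
fn group_weight(cpu_weight: u64) -> u64 {
    (cpu_weight * NICE_0_WEIGHT) / 100
}
```

The default cpu.weight = 100 maps to NICE_0_WEIGHT (1024), and a sibling cgroup with cpu.weight = 200 gets exactly twice the group weight, giving its tasks twice the aggregate CPU share.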
Weight change propagation: Weight changes propagate lazily. When a cgroup's
cpu.weight is modified, the new value is stored atomically in
CpuController.weight (Relaxed store — no ordering fence needed). The scheduler
recomputes the GroupEntity weight on the next enqueue, dequeue, or scheduler
tick that touches a task in the affected cgroup. No reschedule IPI is sent for
weight changes on ticked cores — the weight update is picked up lazily at the
next tick. Tickless cores with running tasks from the affected cgroup receive a
reschedule IPI to ensure timely weight application (see
Section 17.2). Tasks that are sleeping when the weight
changes pick up the new weight on their next wakeup (enqueue path). Tasks that
are currently running on ticked cores pick it up on the next task_tick()
(1-4 ms granularity). This matches Linux CFS behavior, where
reweight_entity() is called lazily for ticked cores.
Eligibility.
A task is eligible to run when it has not received more than its fair share of
CPU time. The eligibility test compares the task's virtual runtime against
the run queue's weighted average virtual runtime (avg_vruntime):

eligible(se) ⟺ se.vruntime <= avg_vruntime, where avg_vruntime = zero_vruntime + sum_w_vruntime / sum_weight
A task with vruntime < avg_vruntime has received less CPU than its fair share
(positive virtual lag) and is eligible. A task with vruntime > avg_vruntime has
received more than its fair share (negative virtual lag) and must wait until
avg_vruntime advances past it. This ensures fairness: tasks that have been
underserved are prioritized, while tasks that have been overserved must wait.
Division-free eligibility test. To avoid division on the hot path, the
eligibility check is algebraically transformed (see entity_eligible() below)
into a pure integer comparison against the two accumulators (sum_w_vruntime,
sum_weight):

(se.vruntime - zero_vruntime) * sum_weight <= sum_w_vruntime

This uses only subtraction and multiplication — no division.
Tree pruning bound. The augmented RB-tree stores a per-subtree
min_vruntime field (the minimum se.vruntime of any entity in the subtree).
pick_eevdf() prunes subtrees using the same division-free eligibility check:
vruntime_eligible(rq, subtree.min_vruntime). If the subtree's minimum vruntime
is not eligible (i.e., greater than avg_vruntime), then no entity in that
subtree can be eligible either. This achieves O(log n) selection.
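The division-free test can be sketched as follows, assuming the accumulator layout described above (sum_w_vruntime kept relative to zero_vruntime; the struct and field names are illustrative):

```rust
/// Per-runqueue accumulators (illustrative layout).
struct VtAccum {
    zero_vruntime: i64,  // reference offset that keeps sums small
    sum_w_vruntime: i64, // Σ w_i · (v_i − zero_vruntime) over queued entities
    sum_weight: i64,     // Σ w_i
}

/// eligible ⟺ vruntime ≤ avg_vruntime, where
/// avg_vruntime = zero_vruntime + sum_w_vruntime / sum_weight.
/// Multiplying both sides by sum_weight removes the division:
fn vruntime_eligible(a: &VtAccum, vruntime: i64) -> bool {
    (vruntime - a.zero_vruntime) * a.sum_weight <= a.sum_w_vruntime
}
```

pick_eevdf() applies the same predicate to a subtree's min_vruntime: if the subtree minimum fails the test, every entity in the subtree fails it, so the whole subtree is pruned.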
Virtual deadline. Each task's virtual deadline determines its scheduling priority among eligible tasks:

vdeadline = vruntime + calc_delta_fair(slice_ns, weight)

This is set by place_entity() on wakeup and by update_deadline() on slice
expiry. Linux equivalent: se->deadline = se->vruntime + calc_delta_fair(se->slice, se).
The default slice is 750 us (slice_ns = 750_000), configurable via
sched_base_slice_ns. Deliberate divergence from Linux: Linux mainline uses
700 us (sysctl_sched_base_slice = 700000 in kernel/sched/fair.c). UmkaOS
uses 750 us based on internal analysis of container workload scheduling latency
— the extra 50 us reduces context-switch rate by ~7% on database/web-server
mixes without measurably increasing tail latency. This is an intentional UmkaOS
design choice, not a stale value from an earlier Linux version.
ML-tunable slice: The ML policy framework can adjust the effective slice via
ParamId::SchedEevdfWeightScale (default 100, range [50, 200]):
```rust
let weight_scale = PARAM_STORE.get(ParamId::SchedEevdfWeightScale)
    .map_or(100, |p| p.current.load(Relaxed));
let effective_slice = base_slice_ns * weight_scale as u64 / 100;
```

The scaled slice is consumed by place_entity() (Evolvable) when computing the
virtual slice for deadline assignment. A weight_scale of 150 increases slices
by 50%, reducing context-switch overhead at the cost of tail latency.
Lower-weight tasks get longer virtual deadlines (lower scheduling urgency);
higher-weight tasks get shorter virtual deadlines (higher urgency). The task
with the minimum vdeadline among all eligible tasks is selected by
pick_next_task.
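Putting the slice and weight together, the deadline assignment can be sketched as follows (plain division in place of the fixed-point path; illustrative only):

```rust
const NICE_0_WEIGHT: u64 = 1024;

/// Slice in virtual time for this entity (simplified calc_delta_fair).
fn vslice(slice_ns: u64, weight: u64) -> u64 {
    slice_ns * NICE_0_WEIGHT / weight
}

/// vdeadline = vruntime + vslice: higher weight → shorter virtual deadline
/// offset → higher urgency among eligible tasks.
fn virtual_deadline(vruntime: u64, slice_ns: u64, weight: u64) -> u64 {
    vruntime + vslice(slice_ns, weight)
}
```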
Red-black tree organization.
Each per-CPU run queue maintains a single intrusive red-black tree keyed by
vdeadline (virtual deadline), matching the current Linux tasks_timeline design:
- Tree key: `deadline` (virtual deadline). Ties broken by `TaskId` (lower ID sorts first). Linux equivalent: `entity_before()` compares `a->deadline < b->deadline`.
- Augmented field: `min_vruntime` — per-subtree minimum of `se.vruntime`. Linux equivalent: `sched_entity.min_vruntime` augmented field.
- Additional augmented field: `max_slice` — per-subtree maximum of `se.slice`. Used by `cfs_rq_max_slice()` for lag clamping.

This tree organization is identical to current Linux EEVDF: Linux also keys by deadline and augments with min_vruntime. (The initial 6.6 EEVDF merge keyed the tree by vruntime with a min_deadline augmentation; that layout was later replaced by the deadline-keyed design. The pre-EEVDF CFS keyed by vruntime with no deadline augmentation.)

tasks_timeline: An intrusive augmented RB-tree containing ALL runnable tasks (both eligible and ineligible). Each node embeds an `EevdfRbLink` with the augmented `min_vruntime` field. `pick_eevdf()` prunes ineligible subtrees using the division-free `vruntime_eligible()` check on the subtree's `min_vruntime`.
Eligibility is computed dynamically during the pick_eevdf() walk, not stored
as a tree membership property. The deadline-keyed single-tree design means
pick_eevdf() performs a left-descent walk that naturally visits earlier-deadline
tasks first.
Intrusive link: Each EevdfTask embeds an EevdfRbLink directly. Enqueue and
dequeue are O(log n) intrusive insert/remove operations with zero heap allocation.
This is critical for the scheduler hot path (per-syscall, per-tick).
Eligibility filter in pick_eevdf().
The EEVDF pick_eevdf() must find the eligible task with the earliest virtual
deadline. A task is eligible when:

se.vruntime <= avg_vruntime

(evaluated via the division-free vruntime_eligible() form above).
pick_eevdf() walks the deadline-keyed tree using vruntime_eligible() on
subtree min_vruntime fields to prune entire subtrees that cannot contain
an eligible task. The currently-running entity (curr) is also considered as a
candidate: if eligible and has an earlier deadline than the tree-walk result,
curr wins.
If no eligible task exists (a transient condition due to numerical precision),
pick_eevdf() returns curr if eligible, otherwise None (the caller falls
through to the idle task).
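The pruned walk can be illustrated with a greatly simplified toy tree. The real kernel structure is an intrusive augmented red-black tree and also considers `curr` as a candidate; this sketch only shows the pruning rule (all names illustrative):

```rust
/// Toy deadline-keyed binary tree node with a per-subtree min_vruntime
/// augmentation. Left children hold earlier deadlines.
struct Node {
    deadline: u64,
    vruntime: u64,
    min_vruntime: u64,       // minimum vruntime anywhere in this subtree
    left: Option<Box<Node>>, // earlier deadlines
    right: Option<Box<Node>>,// later deadlines
}

/// Return (deadline, vruntime) of the earliest-deadline eligible entity
/// (vruntime ≤ avg). Subtrees whose min_vruntime is ineligible are skipped.
fn pick(node: Option<&Node>, avg: u64) -> Option<(u64, u64)> {
    let n = node?;
    if n.min_vruntime > avg {
        return None; // nothing in this subtree can be eligible — prune
    }
    // Left subtree holds earlier deadlines; prefer its candidate if any.
    if let Some(best) = pick(n.left.as_deref(), avg) {
        return Some(best);
    }
    if n.vruntime <= avg {
        return Some((n.deadline, n.vruntime));
    }
    pick(n.right.as_deref(), avg)
}
```

Because the left subtree always holds earlier deadlines, the first eligible hit on the left-descent is the answer, which is what makes the single deadline-keyed tree sufficient.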
Lag tracking (vlag).
Virtual lag measures how far a task's vruntime deviates from the run queue's
weighted average. Linux stores this as se->vlag — an unweighted,
signed virtual-time quantity:

vlag = avg_vruntime - se.vruntime
Positive vlag means se.vruntime < avg_vruntime (task is owed CPU — eligible).
Negative vlag means se.vruntime > avg_vruntime (task over-served — ineligible).
Vlag is updated on every dequeue (when a task blocks, yields, or is preempted)
by update_entity_lag(). It is clamped to prevent starvation or unbounded credit:

limit = calc_delta_fair(cfs_rq_max_slice(rq) + TICK_NSEC, se)
se.vlag = clamp(vlag, -limit, limit)
The clamp limit is per-entity (depends on the entity's weight) and per-rq
(depends on the maximum slice of any entity on the rq, plus one tick for timing
granularity). For a nice-0 task with default slice: limit = 750_000 + TICK_NSEC.
For a nice +19 task (weight 15): limit = calc_delta_fair(750_000 + 4_000_000, 15) =
(750_000 + 4_000_000) * 1024 / 15 = 324_266_666 — much larger, allowing low-weight
tasks to accumulate proportionally more virtual credit.
Linux equivalent: update_entity_lag() in fair.c.
On enqueue (when a task wakes up), the saved vlag from the previous dequeue is
used by place_entity() to position the task in virtual time, ensuring that a
task owed CPU time before sleeping is prioritized when it wakes.
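The update-and-clamp step can be sketched as follows, with plain division standing in for the fixed-point calc_delta_fair and a 4 ms TICK_NSEC as in the example above:

```rust
const NICE_0_WEIGHT: i64 = 1024;
const TICK_NSEC: i64 = 4_000_000; // 4 ms tick, as in the worked example above

/// Simplified calc_delta_fair (plain division instead of fixed point).
fn calc_delta_fair(delta_ns: i64, weight: i64) -> i64 {
    delta_ns * NICE_0_WEIGHT / weight
}

/// vlag = avg_vruntime − vruntime, clamped to the per-entity/per-rq limit
/// limit = calc_delta_fair(max_slice + TICK_NSEC, weight).
fn update_entity_lag(avg_vruntime: i64, vruntime: i64, max_slice_ns: i64, weight: i64) -> i64 {
    let limit = calc_delta_fair(max_slice_ns + TICK_NSEC, weight);
    (avg_vruntime - vruntime).clamp(-limit, limit)
}
```

The low-weight case reproduces the 324_266_666 figure from the text: the same wall-clock bound converts to far more virtual time at weight 15.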
Deferred dequeue (over-served sleep path).
When a task sleeps while carrying negative vlag — meaning it has consumed more
CPU than its fair share — a naive immediate removal from the run queue would let
it re-enter with a head-start on avg_vruntime computation. UmkaOS uses a
deferred dequeue mechanism, matching the Linux 6.12 sched_delayed design:
- On sleep, if `vlag < 0` (equivalently, `!entity_eligible(rq, se)` — the two conditions are algebraically identical since `vlag = avg_vruntime - se.vruntime`), the task is NOT immediately removed from the `tasks_timeline` tree. The field `sched_delayed` is set to `true` and `on_rq` remains `Queued`. This deferral is gated on the `DELAY_DEQUEUE` scheduling feature flag (default: enabled). Special dequeue paths bypass the delay: `DEQUEUE_SPECIAL` (signal-induced dequeue, task exit) and `DEQUEUE_THROTTLE` (CBS bandwidth exhaustion) always remove immediately regardless of vlag.
- While deferred, a task remains in the tree and may be selected if eligible. When picked, `requeue_delayed_entity()` clears the delay and re-integrates it. The tree walk does NOT filter `sched_delayed` — this matches Linux `kernel/sched/fair.c pick_eevdf()` which has no `sched_delayed` guard.
- While deferred, the task still contributes its weight to the two `avg_vruntime` accumulators (`sum_w_vruntime`, `sum_weight`). This ensures that over-served sleeping tasks continue to push `avg_vruntime` upward, naturally decaying their negative vlag without requiring any explicit timer.
- The vlag clamp is the standard `update_entity_lag()` clamp: `limit = calc_delta_fair(cfs_rq_max_slice(rq) + TICK_NSEC, se)`. No separate widened clamp exists — the single clamping formula handles all cases. Linux does not have a separate widened clamp for deferred entities.
- Deferred removal is triggered by two events:
  - Vlag reaches zero at pick time: `pick_eevdf()` scans the tree and, for each deferred candidate, checks whether `entity_eligible(rq, se)` (i.e., `avg_vruntime` has advanced past the task's `vruntime`). When true, the task is removed from the tree (transitioned to `OnRqState::Off`) without any wake-up.
  - Task wakes up before vlag decays: On the wake-up path, if `sched_delayed == true`, the task is first removed from its deferred tree position, then re-enqueued via `place_entity()` which uses the saved vlag to position it correctly.
- On wake-up from a deferred state, `place_entity()` sets:

  se.vruntime = avg_vruntime - inflated_vlag

  where `inflated_vlag = se.vlag * (W + w_i) / W` compensates for the effect of adding this entity on `avg_vruntime` (see `place_entity()` below).

If `vlag < 0` (still over-served at wake time), `se.vruntime > avg_vruntime`,
so the task remains ineligible and waits until `avg_vruntime` catches up.
If `vlag >= 0` (fully decayed by the time it wakes), `se.vruntime <= avg_vruntime`,
so the task is immediately eligible for selection.
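The wake-time placement reduces to a few integer operations. A minimal sketch (W = sum of queued weights, w = the waking entity's weight; names illustrative):

```rust
/// place_entity() placement for a waking (possibly deferred) entity:
/// se.vruntime = avg_vruntime − inflated_vlag, with
/// inflated_vlag = vlag · (W + w) / W compensating for the shift the newly
/// added entity itself causes in avg_vruntime.
fn place_vruntime(avg_vruntime: i64, vlag: i64, sum_weight: i64, w: i64) -> i64 {
    let inflated_vlag = vlag * (sum_weight + w) / sum_weight;
    avg_vruntime - inflated_vlag
}
```

Negative vlag lands the entity above avg_vruntime (still ineligible); positive vlag lands it below (immediately eligible), exactly as described above.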
requeue_delayed_entity() — called from enqueue_task_fair() when re-enqueueing
a task with ENQUEUE_DELAYED flag, or when walking up the hierarchy and finding a
delayed ancestor. NOT called from pick_eevdf() — the transition happens on the
enqueue side, not the pick side. Matches Linux kernel/sched/fair.c
requeue_delayed_entity() (line ~6909):
```rust
fn requeue_delayed_entity(rq: &mut RunQueue, se: &mut EevdfTask) {
    debug_assert!(se.sched_delayed && se.on_rq == OnRqState::Deferred);
    // If DELAY_ZERO feature is enabled, zero positive vlag on requeue.
    // This prevents over-served tasks from accumulating unlimited vlag
    // credit during deferred sleep.
    if sched_feat(DELAY_ZERO) {
        update_entity_lag(rq, se);
        if se.vlag > 0 {
            // Dequeue from current position, place with vlag = 0, re-enqueue.
            __dequeue_entity(rq, se);
            se.vlag = 0;
            place_entity(rq, se, PlaceFlag::Requeue);
            __enqueue_entity(rq, se);
        }
    }
    update_load_avg(rq, se);
    se.sched_delayed = false;
    se.on_rq = OnRqState::Queued;
}
```
The sched_delayed field is added to EevdfTask and OnRqState gains a
Deferred variant to unambiguously represent the deferred state (distinct from
CbsThrottled which represents CBS bandwidth exhaustion — see below):
```rust
/// Whether and how a task appears on the run queue — extended for deferred
/// dequeue and CBS throttling.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum OnRqState {
    /// Task is sleeping and not physically present in any run queue tree.
    Off,
    /// Task is runnable and present in the `tasks_timeline` tree. Eligibility
    /// (whether `se.vruntime <= avg_vruntime`) is computed dynamically by
    /// `pick_eevdf()` via the division-free `entity_eligible()` check — it is
    /// NOT a stored state. This single variant replaces the former
    /// `Eligible`/`Ineligible` split, matching the Linux 6.6+ design where
    /// all runnable tasks reside in one `tasks_timeline` tree.
    Queued,
    /// **EEVDF deferred dequeue**: task is sleeping but still physically
    /// present in `tasks_timeline` because it had negative vlag at sleep time
    /// (`sched_delayed == true`). Remains in `tasks_timeline` and may be
    /// selected by `pick_eevdf()` if eligible; still contributes weight to
    /// `sum_w_vruntime` and `sum_weight`. Transitions to `Off` when vlag
    /// decays to zero or the task wakes and is re-enqueued.
    ///
    /// **Not the same as CBS throttling.** A `Deferred` task is an
    /// EEVDF-internal optimization: the task voluntarily slept while
    /// over-served (negative vlag), so it remains on the tree to let its vlag
    /// decay passively. It is NOT waiting for any timer or budget
    /// replenishment — it is waiting for a future wakeup event (I/O, signal,
    /// timer) at which point `place_entity()` repositions it.
    Deferred,
    /// **CBS bandwidth throttled**: task's cgroup CPU bandwidth budget is
    /// exhausted. The task is NOT in `tasks_timeline` — it has been fully
    /// dequeued with its `vruntime` and `vlag` preserved in the task struct.
    /// The task waits for the CBS period replenishment timer to fire, at
    /// which point it is re-enqueued with `OnRqState::Queued`.
    ///
    /// Unlike `Deferred`, a `CbsThrottled` task:
    /// - Is NOT physically present in any run queue tree.
    /// - Does NOT contribute weight to `avg_vruntime` accumulators.
    /// - Is INTERRUPTIBLE by signals: SIGKILL immediately re-enqueues the
    ///   task (bypassing the throttle with bandwidth debt). Other signals set
    ///   `TIF_SIGPENDING` and the task processes them on replenishment.
    /// - Transitions to `Queued` on budget replenishment, not on vlag decay.
    ///
    /// See `CbsCpuServer.throttled` and `CbsCpuServer.max_throttled` in
    /// [Section 7.6](#cpu-bandwidth-guarantees) for the per-CPU server state
    /// that controls when tasks enter/exit this state.
    CbsThrottled,
}
```
PELT interaction with deferred dequeue.
When a task is deferred (sched_delayed = true, still in the tree but sleeping), its
PELT accounting pauses: last_update_time is NOT advanced
during the deferred period. The task is sleeping — it is not consuming CPU — so neither
util_sum nor runnable_sum should accumulate.
When the task is eventually picked (because its vlag decayed to zero) or explicitly
dequeued (wake-up from deferred state), PELT decay catches up with a single call to
update_load_avg() using the true elapsed time since the last update. The geometric
decay naturally handles the gap: if the deferred period was D nanoseconds, the call
applies decay_load(sum, D / PELT_PERIOD_NS) to all three accumulators, correctly
reducing the stale contribution.
If the deferred period exceeds 32 ms (one PELT half-life), the stale contribution decays by more than 50%, which is correct behavior: the task was not actually consuming CPU during that time. A task deferred for 100 ms (~97 periods, about three half-lives) has its PELT signal decayed to ~12% of its pre-sleep value, accurately reflecting its recent CPU demand.
This design avoids two failure modes:
1. No load spike on resume: without the catch-up decay, a task resuming from
deferred state would appear to have full utilization (stale util_avg), causing
EAS to over-provision and the load balancer to make unnecessary migrations.
2. No double-counting: period_contrib is preserved during the deferred period
(not reset to zero), so the Phase 1 head completion in update_load_avg() correctly
finishes the partial period that was in progress when the task entered deferred state.
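The catch-up decay can be illustrated numerically. This f64 sketch is not the kernel code — real PELT uses fixed-point arithmetic with a per-period factor y where y^32 = 1/2, and the constants here are the ones quoted in this section:

```rust
const PELT_PERIOD_NS: u64 = 1_024_000; // 1024 µs accounting period
const HALF_LIFE_PERIODS: f64 = 32.0;   // signal halves every 32 periods (~32 ms)

/// Decay an accumulator across `elapsed_ns` of sleep in a single catch-up
/// call, as update_load_avg() does when a deferred task is finally picked.
fn decay_catch_up(sum: f64, elapsed_ns: u64) -> f64 {
    let periods = elapsed_ns as f64 / PELT_PERIOD_NS as f64;
    sum * 0.5f64.powf(periods / HALF_LIFE_PERIODS)
}
```

A 100 ms deferred period spans roughly three half-lives, leaving about 12% of the pre-sleep signal — the figure used above.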
Bandwidth throttling interaction.
When a cgroup's CPU bandwidth is exhausted (CBS budget depleted per
Section 7.6), all tasks in the cgroup are
dequeued from their respective per-CPU run queues (removed from
tasks_timeline) and their on_rq is set to
OnRqState::CbsThrottled. Each task's vruntime and lag are
preserved in the task struct.
On budget replenishment (when the CBS period timer fires and the cgroup receives
a new quota), all CbsThrottled tasks are re-enqueued with their saved
vruntime and vlag intact, transitioning to Queued.
This ensures that bandwidth throttling does not cause fairness distortion —
a task that was owed CPU time before throttling remains owed after throttling
ends.
Signal delivery during CBS throttling:
When a cgroup's CPU bandwidth is exhausted and tasks are dequeued:
- SIGKILL / SIGSTOP (uncatchable): Delivered immediately. The task is re-enqueued with its `vruntime` intact, bypassing bandwidth throttling. This ensures `kill -9` always works regardless of bandwidth state. The bandwidth consumed is recorded as debt, repaid on next quota replenishment.
- Other signals (SIGTERM, SIGHUP, user signals): Queued in the task's pending signal set (`TIF_SIGPENDING` is set). The task is NOT re-enqueued. Signal delivery completes when the task is re-enqueued on quota replenishment and returns to userspace through the signal check path.
- KILLABLE tasks (UNINTERRUPTIBLE | WAKEKILL) that are CBS-throttled: SIGKILL wakes immediately with bandwidth debt. Non-fatal signals remain pending.
This design preserves bandwidth isolation (a cgroup cannot escape its quota by receiving signals) while ensuring liveness (fatal signals always terminate within one scheduling tick + IPI latency).
Latency-nice.
Tasks with latency_nice < 0 (latency-sensitive, e.g., interactive or audio)
have their virtual deadline shortened, making them scheduled sooner among
eligible tasks. Latency-nice affects the virtual deadline offset, not the
slice duration: a more latency-sensitive task gets a shorter virtual deadline
offset. The effective slice used for deadline calculation:

effective_slice = base_slice_ns * LATENCY_NICE_0_WEIGHT / latency_weight
vslice = calc_delta_fair(effective_slice, weight)

where latency_weight = LATENCY_NICE_TO_WEIGHT[(latency_nice + 20)]. A task
with latency_nice = -20 (latency_weight = 88818) gets
effective_slice = 750_000 * 1024 / 88818 ≈ 8_650 ns, giving it a very short
virtual deadline offset — it is picked first among equal-vruntime peers. A task
with latency_nice = +19 (latency_weight = 15) gets
effective_slice = 750_000 * 1024 / 15 = 51_200_000 ns, pushing its deadline
far out — it is picked last.
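The effective-slice arithmetic is small enough to sketch directly (constants from this section; function name illustrative):

```rust
const LATENCY_NICE_0_WEIGHT: u64 = 1024;
const BASE_SLICE_NS: u64 = 750_000; // sched_base_slice_ns default

/// effective_slice = base_slice_ns * LATENCY_NICE_0_WEIGHT / latency_weight:
/// higher latency_weight → shorter virtual deadline offset.
fn effective_slice(latency_weight: u64) -> u64 {
    BASE_SLICE_NS * LATENCY_NICE_0_WEIGHT / latency_weight
}
```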
NOT a Linux feature. latency_nice is an UmkaOS-original design inspired by
the proposed (but never merged) Linux latency_nice patchset (Vincent Guittot / Parth
Shah, LKML 2022-2024). The patchset was discussed on LKML multiple times but never
accepted into torvalds/linux mainline. As of Linux 6.17+, there is no
latency_nice field in struct task_struct, no sched_latency_nice field in
struct sched_attr, and no SCHED_FLAG_LATENCY_NICE bit.
The LATENCY_NICE_TO_WEIGHT table, the effective_slice formula, and the
sched_latency_nice extension to sched_attr are all UmkaOS-specific.
UmkaOS defines a new flag SCHED_FLAG_LATENCY_NICE = 0x80 for sched_setattr(2)
and extends struct sched_attr with a sched_latency_nice: i32 field (see
Section 19.1 for the extended sched_attr layout). Applications that
use latency_nice are UmkaOS-only and will not work on upstream Linux kernels.
```rust
/// Latency-nice to latency_weight mapping. Higher weight = shorter virtual
/// deadline offset = more latency-sensitive. Geometric ratio ~1.25x per step.
/// latency_nice 0 maps to LATENCY_NICE_0_WEIGHT (1024).
///
/// **UmkaOS-original.** This table is part of the UmkaOS latency_nice
/// extension (NOT a Linux feature — see the latency-nice note above).
/// UmkaOS maintains a separate latency weight table (distinct from
/// `sched_prio_to_weight`) to enable per-cgroup ML policy overrides via
/// `SubsystemId::Scheduler` parameters. See
/// [Section 23.1](23-ml-policy.md#aiml-policy-framework-closed-loop-kernel-intelligence).
///
/// **Intentionally different from `sched_prio_to_weight`**: The nice-to-weight
/// table (`sched_prio_to_weight`) controls CPU bandwidth allocation (time
/// slices). This table controls scheduling latency (virtual deadline offset).
/// They use the same 1024 base and the same ~1.25x geometric ratio per step,
/// but the two dimensions are independent: a thread can have nice=0 (normal
/// bandwidth) but latency_nice=-20 (maximum latency sensitivity).
///
/// **Formula**: `entry[i] = round(1024 * 1.25^(20 - i))` for i = 0..39.
/// The base value 1024 at index 20 (latency_nice 0) matches `NICE_0_WEIGHT`.
/// Each step toward latency_nice -20 multiplies by 1.25 (more
/// latency-sensitive); each step toward +19 divides by 1.25 (less
/// latency-sensitive).
///
/// Index 0 = latency_nice -20 (most latency-sensitive),
/// index 20 = latency_nice 0 (baseline),
/// index 39 = latency_nice +19 (least latency-sensitive).
///
/// Lookup: latency_weight = LATENCY_NICE_TO_WEIGHT[(latency_nice + 20) as usize]
///
/// Note: LATENCY_NICE_TO_WEIGHT[-20] = 88818 differs from
/// sched_prio_to_weight[-20] = 88761. Both use ~1.25x geometric ratios but
/// different rounding. The latency table uses exact `round(1024 * 1.25^20)` =
/// 88818; the weight table uses the Linux-compatible rounded approximation
/// 88761. Intentionally different, as noted above.
pub const LATENCY_NICE_TO_WEIGHT: [u32; 40] = [
    // latency_nice -20    -19    -18    -17    -16
    88818, 71054, 56843, 45475, 36380,
    // latency_nice -15    -14    -13    -12    -11
    29104, 23283, 18626, 14901, 11921,
    // latency_nice -10     -9     -8     -7     -6
    9537, 7629, 6104, 4883, 3906,
    // latency_nice  -5     -4     -3     -2     -1
    3125, 2500, 2000, 1600, 1280,
    // latency_nice  +0     +1     +2     +3     +4
    1024, 819, 655, 524, 419,
    // latency_nice  +5     +6     +7     +8     +9
    336, 268, 215, 172, 137,
    // latency_nice +10    +11    +12    +13    +14
    110, 88, 70, 56, 45,
    // latency_nice +15    +16    +17    +18    +19
    36, 29, 23, 18, 15,
];

/// Base latency weight corresponding to latency_nice 0.
pub const LATENCY_NICE_0_WEIGHT: u32 = 1024;
```
Latency-nice does NOT affect CPU bandwidth — only scheduling latency. A latency-nice -20 task at nice 0 receives the same total CPU share as a latency-nice 0 task at nice 0; it simply gets scheduled sooner when it wakes up.
Data structures.
/// Lock level constant for per-CPU run queue locks.
///
/// Placed above the task lock level (20) and below the priority-inheritance
/// lock level (60) in the system-wide lock hierarchy defined in
/// [Section 3.4](03-concurrency.md#cumulative-performance-budget). The lock hierarchy uses x10 spacing
/// (0, 10, 20, ..., 260). Level 50 is shared by all run queue locks regardless
/// of which CPU they protect.
/// (Level 30 is `SIGHAND_LOCK`, level 40 is `SIGLOCK`/`FDTABLE_LOCK`.)
///
/// Runqueue lock is acquired via `rq.lock()` which returns
/// `SpinGuard<RunQueueData, RQ_LOCK_LEVEL>`. The lock level prevents holding a
/// higher-level lock (e.g., CAP_TABLE_LOCK at level 70) while holding the runqueue lock
/// — compile-time enforced. PI_LOCK (level 45) is BELOW RQ_LOCK (level 50).
///
/// # Same-level ordering
/// Because all `RunQueue` locks share level 50, the type system alone cannot prevent
/// an ABBA deadlock between two run queues. The `lock_two_runqueues()` function
/// closes this gap: it is the **only** function permitted to hold two run queue locks
/// simultaneously, and it always acquires them in CPU-ID order.
pub const RQ_LOCK_LEVEL: u32 = 50;
/// Logical CPU identifier. Monotonic index assigned during topology enumeration
/// (BSP = 0, APs numbered in ACPI MADT / DT order). Used as a key for per-CPU
/// data structures and as a lock-ordering tiebreaker.
pub type CpuId = u32;
/// Per-CPU run queue — the top-level schedulable entity for one logical CPU.
///
/// `RunQueue` is the owner of the `SpinLock` that protects `EevdfRunQueue`
/// and the associated RT/deadline queues for one CPU. Callers that need to lock
/// two run queues at once **must** use `lock_two_runqueues()` — direct chained
/// calls to `lock()` on two different `RunQueue`s is a compile-time error when
/// the lock-level type system detects two concurrent level-50 acquisitions.
pub struct RunQueue {
/// Identity of the CPU that owns this run queue.
/// Used by `lock_two_runqueues()` to establish a canonical acquisition order.
pub cpu_id: CpuId,
/// Typed spinlock carrying the EEVDF and RT/deadline state.
/// The level-50 type parameter participates in the compile-time lock hierarchy
/// ([Section 3.4](03-concurrency.md#cumulative-performance-budget)); it prevents acquiring this lock while
/// already holding a level-50 lock except through `lock_two_runqueues()`.
lock: SpinLock<RunQueueData, RQ_LOCK_LEVEL>,
}
/// RT priority valid range for SCHED_FIFO and SCHED_RR: 1–99.
/// Priority 0 is EINVAL for real-time policies (reserved for non-RT policies
/// per POSIX and Linux ABI). `sched_get_priority_min(SCHED_FIFO) == 1`.
/// Slot 0 in the priority bitmap is allocated but always empty.
pub const RT_PRIORITY_MIN: u8 = 1;
pub const RT_PRIORITY_MAX: u8 = 99;
pub const RT_PRIORITY_LEVELS: usize = 100; // Indexed 0–99; slot 0 is unused (priority 0 = EINVAL).
// Parameter validation for sched_setscheduler / sched_setattr:
//
// if policy == SCHED_FIFO || policy == SCHED_RR {
// if param.sched_priority == 0 { return EINVAL; }
// if param.sched_priority > RT_PRIORITY_MAX { return EINVAL; }
// }
//
// Equivalently: valid range is RT_PRIORITY_MIN..=RT_PRIORITY_MAX (1–99).
// sched_get_priority_min(SCHED_FIFO) and sched_get_priority_min(SCHED_RR)
// both return 1. This matches the Linux ABI and POSIX SCHED_FIFO/SCHED_RR
// semantics: priority 0 is defined only for non-RT policies (SCHED_OTHER,
// SCHED_BATCH, SCHED_IDLE) where it is the only legal value.
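The validation rule above can be made runnable; a minimal sketch (the `SCHED_FIFO`/`SCHED_RR`/`EINVAL` constants mirror the Linux ABI values, and `validate_rt_priority` is an illustrative name, not the in-tree function):

```rust
// Runnable sketch of the sched_setscheduler parameter validation above.
const SCHED_FIFO: u32 = 1;
const SCHED_RR: u32 = 2;
const EINVAL: i32 = 22;
const RT_PRIORITY_MIN: u8 = 1;
const RT_PRIORITY_MAX: u8 = 99;

fn validate_rt_priority(policy: u32, sched_priority: u8) -> Result<(), i32> {
    if policy == SCHED_FIFO || policy == SCHED_RR {
        // Priority 0 is reserved for non-RT policies; 1..=99 is the valid range.
        if sched_priority < RT_PRIORITY_MIN || sched_priority > RT_PRIORITY_MAX {
            return Err(EINVAL);
        }
    }
    Ok(())
}
```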
/// Absolute deadline timestamp in nanoseconds (monotonic clock).
///
/// Used as the ordering key for the `DlRunQueue` intrusive red-black tree.
/// Two tasks with the same absolute deadline share equal priority under EDF;
/// tie-breaking uses `TaskId` (lower ID wins). In practice, two tasks with
/// identical absolute deadlines are exceedingly rare.
pub type AbsDeadlineNs = u64;
/// Fixed-point scale for deadline bandwidth accounting.
///
/// `DL_BW_SCALE = 1 << 20` (1,048,576 exactly). Deadline bandwidth
/// fractions are stored as `runtime_ns * DL_BW_SCALE / period_ns`, giving
/// 20 bits of sub-unit precision. A task consuming 100% of the CPU has
/// `bw = DL_BW_SCALE`. The sum of all per-task bandwidths must not exceed
/// the queue's capacity fraction (`capacity_ns` scaled to `DL_BW_SCALE` over
/// the 1-second accounting window; see `DlRunQueue::capacity_ns`).
///
/// This matches Linux's `BW_SHIFT = 20` / `BW_UNIT = 1 << BW_SHIFT`
/// convention so that bandwidth values computed from `SCHED_DEADLINE`
/// parameters are directly comparable. CBS admission
/// ([Section 7.6](#cpu-bandwidth-guarantees)) uses the same `BW_SCALE = 1 << 20`
/// constant, ensuring DL and CBS bandwidths can be summed directly in
/// the system-wide overcommit check.
pub const DL_BW_SCALE: u64 = 1 << 20;
/// Per-CPU real-time (SCHED_FIFO / SCHED_RR) run queue.
///
/// UmkaOS improves on Linux's `struct rt_rq` in two ways:
///
/// 1. **Per-queue CBS bandwidth accounting** instead of a single global
/// `sched_rt_runtime_us` knob. Each `RtRunQueue` tracks its own consumed
/// runtime and replenishment period, so individual CPUs can be throttled
/// independently without a global lock. This is particularly valuable on
/// heterogeneous platforms where P-core and E-core CPUs have different
/// RT capacity budgets.
///
/// 2. **Typed priority bitmap** — the 100-bit occupancy map uses two `u64`
/// words (128 bits allocated, top 28 unused), with the highest set bit
/// indicating the next priority level to schedule. A leading-zeros scan
/// (`CLZ` / `BSR`) on the bitmap locates the highest occupied priority in
/// a single instruction on all supported architectures. Scan word 1 first
/// (priorities 64-99); if zero, scan word 0 (priorities 0-63).
///
/// Linux reference: `struct rt_rq` in `kernel/sched/sched.h` (Linux 6.x).
/// Key differences: no `rt_nr_boosted` (PI boost is tracked per-task in
/// UmkaOS, not per-queue), no `pushable_tasks` plist (UmkaOS uses a separate
/// per-CPU migration candidate set managed by the load balancer), and no
/// embedded group-scheduling pointers (`tg`, `rq`).
pub struct RtRunQueue {
/// Two-word occupancy bitmap for priority levels 0–99.
///
/// Bit `p` is set when priority queue `p` is non-empty.
/// Bit 0 is never set (priority 0 is invalid for RT policies;
/// `RT_PRIORITY_MIN == 1`). Word 0 covers priorities 0–63; word 1
/// covers priorities 64–99 (bits 36–63 of word 1 are always zero).
/// The highest-priority non-empty queue is found by scanning for the
/// most-significant set bit across both words (word 1 first).
pub bitmap: [u64; 2],
/// Per-priority intrusive task lists.
///
/// `queues[p]` holds all runnable tasks at RT priority `p` in
/// FIFO order. SCHED_RR tasks are rotated to the tail of their
/// queue on time-slice expiry. SCHED_FIFO tasks are never rotated.
/// Indexed by user-visible `sched_priority` (0-99), where 99 is the
/// highest RT priority. This differs from Linux's internal kernel
/// priority numbering (where 0 is the highest internal priority);
/// UmkaOS uses the user-facing numbering directly to avoid the
/// `99 - sched_priority` translation.
/// Note: `queues[0]` is always empty (`RT_PRIORITY_MIN == 1`) and
/// `bitmap` bit 0 is never set; the slot exists only for O(1) direct indexing.
pub queues: [IntrusiveList<Task>; RT_PRIORITY_LEVELS],
/// Number of runnable RT tasks on this CPU.
///
/// Incremented on enqueue, decremented on dequeue. Does not include
/// tasks that are throttled (removed from all queues). The value
/// equals the sum of task counts across all priority level queues:
/// `nr_running = sum(queues[p].len() for all p)`.
pub nr_running: u32,
/// Accumulated runtime consumed during the current throttle period, in
/// nanoseconds.
///
/// Increased by the scheduler tick handler on every tick that an RT task
/// is running. When `rt_time_ns >= rt_runtime_ns` the queue is throttled:
/// all RT tasks are dequeued and `throttled` is set. This is a per-CPU
/// improvement over Linux's `sched_rt_runtime_us` global knob.
pub rt_time_ns: u64,
/// Maximum RT runtime allowed per period, in nanoseconds.
///
/// Defaults to `950_000_000` (950 ms per second — reserving 5% of the CPU
/// for non-RT work, matching Linux's default `sched_rt_runtime_us`).
/// Configurable at runtime via `sysctl umka.sched.rt_runtime_ns`.
/// Set to `u64::MAX` to disable throttling (equivalent to
/// `sched_rt_runtime_us = -1` in Linux).
pub rt_runtime_ns: u64,
/// `true` when this queue is currently throttled due to bandwidth
/// exhaustion (`rt_time_ns >= rt_runtime_ns`).
///
/// While throttled, no RT task from this queue may be selected by
/// `pick_next_task`. The period replenishment timer resets this flag
/// and re-enqueues all previously throttled tasks.
pub throttled: bool,
/// Monotonic timestamp (nanoseconds) of the start of the current
/// throttle accounting period.
///
/// The period length is 1 second (`1_000_000_000 ns`), matching Linux's
/// `sched_rt_period_us` default of 1 s. At the end of each period,
/// `rt_time_ns` is reset to zero, `throttled` is cleared, and
/// `period_start_ns` is advanced by the period length.
pub period_start_ns: u64,
}
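The two-word MSB scan described in the `bitmap` docs can be sketched as a standalone helper (illustrative, not the in-tree function; `u64::leading_zeros` compiles to `CLZ`/`LZCNT` on the supported architectures):

```rust
// Standalone sketch of the priority-bitmap scan: word 1 (priorities 64-99)
// first, then word 0 (priorities 0-63). Returns the highest occupied RT
// priority, or None when no RT task is runnable.
fn highest_rt_priority(bitmap: &[u64; 2]) -> Option<u8> {
    if bitmap[1] != 0 {
        // Bit b of word 1 corresponds to priority 64 + b.
        Some(64 + (63 - bitmap[1].leading_zeros()) as u8)
    } else if bitmap[0] != 0 {
        // Bit b of word 0 corresponds to priority b.
        Some((63 - bitmap[0].leading_zeros()) as u8)
    } else {
        None
    }
}
```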
/// Per-CPU deadline (SCHED_DEADLINE) run queue.
///
/// UmkaOS improves on Linux's `struct dl_rq` in two ways:
///
/// 1. **Intrusive red-black tree with cached leftmost** — each `Task` embeds
/// a `DlRbLink` node (see `EevdfTask.dl_rb_link`), so enqueue and dequeue
/// manipulate only the embedded pointers — **zero heap allocation**. The
/// tree is ordered by `(AbsDeadlineNs, TaskId)` giving O(log n) insert /
/// remove and O(1) earliest-deadline pickup via the cached `leftmost`
/// pointer (updated on every structural change). This matches Linux's
/// `rb_root_cached` + embedded `rb_node` pattern used in `struct dl_rq`.
///
/// Unlike `BTreeMap`, which allocates B-tree nodes on the heap during
/// `insert()`, the intrusive tree never allocates — making it safe to call
/// under the per-CPU runqueue spinlock on the scheduler hot path.
///
/// 2. **Explicit bandwidth tracking** — `total_bw` and `capacity_ns` are
/// first-class fields rather than derived from per-task `dl_bw` entries.
/// This makes admission control O(1): check `total_bw + new_task_bw`
/// against the queue's capacity fraction (which equals `DL_BW_SCALE` on a
/// fully available CPU) before accepting a new SCHED_DEADLINE task.
///
/// Linux reference: `struct dl_rq` in `kernel/sched/sched.h` (Linux 6.x).
/// Key differences: explicit `capacity_ns` replaces per-rq `dl_bw` struct,
/// and `earliest_deadline_ns` is a cached O(1) copy of the earliest deadline
/// (equivalent to `rb_first_cached(&dl_rq->root)` in Linux).
pub struct DlRunQueue {
/// Intrusive red-black tree of runnable deadline tasks, ordered by
/// `(AbsDeadlineNs, TaskId)`.
///
/// Each task embeds a `DlRbLink` in `EevdfTask.dl_rb_link`. The tree
/// owns no heap-allocated nodes — all storage lives inside the `Task`
/// structs themselves. Insert and remove are O(log n) with no allocator
/// calls, safe to execute under the per-CPU runqueue spinlock.
///
/// Invariant: every task linked into this tree is runnable (not sleeping)
/// and has been admitted through the bandwidth test. The ordering key
/// combines the task's absolute deadline with its `TaskId` to prevent
/// key collisions when two tasks share the same absolute deadline.
/// If a task's deadline is updated (e.g., on new job arrival), it is
/// unlinked and re-inserted with the new key.
pub root: IntrusiveRbRoot<DlRbLink>,
/// Cached pointer to the leftmost (earliest-deadline) node, or `None`
/// when the tree is empty. Maintained on every insert / remove — provides
/// O(1) earliest-deadline pickup without a tree traversal.
pub leftmost: Option<NonNull<DlRbLink>>,
/// Sum of fixed-point bandwidth fractions for all admitted tasks.
///
/// Each task contributes `runtime_ns * DL_BW_SCALE / period_ns` to
/// `total_bw`. Admission control accepts a new task only if
/// `total_bw + new_bw <= capacity_ns * DL_BW_SCALE / period_ns`.
/// Maintained incrementally: increased on task admission, decreased
/// on task departure (sleep, termination, or policy change away from
/// SCHED_DEADLINE).
pub total_bw: u64,
/// CPU capacity in nanoseconds per period (default: 1_000_000_000 for a
/// fully available CPU over a 1-second window).
///
/// On heterogeneous CPUs, `capacity_ns` is scaled by the CPU's capacity
/// factor (from `CpuCapacity.capacity / 1024`) so that a 512-capacity
/// efficiency core exposes only 500 ms of deadline capacity per second.
/// This prevents over-admission on low-capacity cores.
pub capacity_ns: u64,
/// Number of runnable deadline tasks on this CPU.
///
/// Maintained as a separate `u32` on every enqueue/dequeue. Keeping
/// `nr_running` in the hot struct avoids pointer chasing to count tree
/// nodes.
pub nr_running: u32,
/// Cached absolute deadline of the earliest-deadline task, or
/// `u64::MAX` when the queue is empty.
///
/// Mirrors the deadline component of `leftmost`'s key. Updated on every
/// enqueue and dequeue. Used by `pick_next_task` and the preemption
/// check (`resched_curr(rq, ReschedUrgency::Eager)`) to compare against the currently running
/// task's deadline without a tree traversal.
pub earliest_deadline_ns: u64,
}
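Putting `DL_BW_SCALE`, `total_bw`, and `capacity_ns` together, the O(1) admission test can be sketched as below. This assumes the 1-second accounting window used throughout this section; the helper names are illustrative, not from the kernel source:

```rust
// Illustrative O(1) CBS admission check for a DlRunQueue.
const DL_BW_SCALE: u64 = 1 << 20;
const WINDOW_NS: u64 = 1_000_000_000; // 1-second accounting window

fn dl_task_bw(runtime_ns: u64, period_ns: u64) -> u64 {
    // Fixed-point bandwidth fraction: runtime/period scaled by 2^20.
    runtime_ns * DL_BW_SCALE / period_ns
}

fn dl_admit(total_bw: u64, capacity_ns: u64, runtime_ns: u64, period_ns: u64) -> bool {
    // Accept only if the summed bandwidth stays within this CPU's capacity
    // fraction (DL_BW_SCALE on a fully available CPU, less on E-cores).
    total_bw + dl_task_bw(runtime_ns, period_ns) <= capacity_ns * DL_BW_SCALE / WINDOW_NS
}
```

For example, a 25%-utilization task (250 ms runtime every 1 s period) contributes `DL_BW_SCALE / 4 = 262_144` to `total_bw`.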
/// Intrusive red-black tree link embedded in each deadline task.
///
/// Contains the left/right/parent pointers and color bit needed for the
/// red-black tree, plus the ordering key `(AbsDeadlineNs, TaskId)`.
/// The `container_of!` macro recovers the owning `Task` from a `DlRbLink`
/// pointer. No heap allocation is performed — the link lives inside the
/// `Task` struct (via `EevdfTask.dl_rb_link`).
///
/// A `DlRbLink` is in exactly one of two states:
/// - **Linked**: the task is in a `DlRunQueue` tree. `parent`/`left`/`right`
/// may be non-null.
/// - **Unlinked**: the task is not in any tree. All pointer fields are null
/// and `is_linked()` returns `false`.
pub struct DlRbLink {
/// Ordering key: `(absolute_deadline_ns, task_id)`.
pub key: (AbsDeadlineNs, TaskId),
parent: Option<NonNull<DlRbLink>>,
left: Option<NonNull<DlRbLink>>,
right: Option<NonNull<DlRbLink>>,
color: RBColor,
}
/// Root sentinel for an intrusive red-black tree.
///
/// Contains only the root pointer. The tree does not own any nodes — node
/// lifetime is tied to the containing struct (e.g., `Task`). All operations
/// (insert, remove, find) take `&mut IntrusiveRbRoot` and `&mut DlRbLink`
/// references, never allocate, and are safe to call under spinlock.
pub struct IntrusiveRbRoot<L> {
root: Option<NonNull<L>>,
}
/// Data protected by `RunQueue.lock`.
pub struct RunQueueData {
/// Single EEVDF tree (tasks filtered dynamically by vruntime_eligible() pruning).
pub eevdf: EevdfRunQueue,
/// RT FIFO/RR run queue for this CPU.
pub rt: RtRunQueue,
/// SCHED_DEADLINE (EDF) run queue for this CPU.
pub dl: DlRunQueue,
/// Per-cgroup CBS (Constant Bandwidth Server) instances for this CPU.
///
/// Keyed by `CgroupId` (integer key → XArray, per
/// [Section 3.13](03-concurrency.md#collection-usage-policy)). Each entry tracks the per-CPU budget
/// remaining, deadline, and throttle state for one cgroup's cpu.max
/// enforcement on this CPU.
///
/// Each CBS server embeds its own local EEVDF tree for scheduling
/// tasks assigned to that server. Tasks in a CBS-guaranteed cgroup
/// are enqueued in the server's EEVDF tree, not the main per-CPU
/// `eevdf` tree. `pick_next_task()` checks CBS servers (in EDF order
/// by deadline) between the DL and plain-EEVDF steps. When a CBS
/// server's budget is exhausted and steal fails, the server is
/// throttled and skipped (OnRqState → CbsThrottled). Replenishment
/// timer un-throttles and re-enables the server.
///
/// See [Section 7.6](#cpu-bandwidth-guarantees) for `CbsCpuServer` struct,
/// replenishment/steal protocol, and cpu.max/cpu.guarantee semantics.
///
/// `XArray<CbsCpuServer>` keyed by `CgroupId` — per-CPU mapping from cgroup IDs to
/// their CBS servers on this CPU. One entry per cgroup with a `cpu.guarantee`
/// on this CPU. Allocated lazily on first task enqueue.
pub cbs_servers: XArray<CbsCpuServer>,
/// Per-cgroup group scheduling entities for this CPU.
///
/// Keyed by `CgroupId` (integer key → XArray). Each entry is the per-CPU
/// `GroupEntity` ([Section 7.2](#heterogeneous-cpu-scheduling--hierarchical-eevdf-via-group-scheduling-entities))
/// for one cgroup on this CPU. Created lazily on first task enqueue into
/// a cgroup on this CPU; removed when the last task from that cgroup
/// leaves this CPU.
///
/// `dequeue_task` and `enqueue_task` in `SchedClassOps` update the
/// appropriate `GroupEntity` (decrement/increment `nr_running`, update
/// `load_avg`). Cgroup migration (step 15) uses this to move a task
/// between GroupEntities on the same CPU.
pub group_entities: XArray<GroupEntity>,
/// Cached per-CPU task clock (nanoseconds, monotonic). Updated once
/// per scheduler entry by `update_rq_clock()`. Excludes IRQ time.
/// Used by `update_curr()` as the canonical scheduler time source
/// (`rq_clock_task(rq)` returns this value). Avoids per-entity
/// sched_clock() calls — the clock is read once and shared.
/// Linux equivalent: `rq->clock_task` in `kernel/sched/sched.h`.
pub clock_task: u64,
/// Accumulated IRQ time on this CPU (nanoseconds, monotonic).
/// Updated by `irq_time_start()`/`irq_time_end()` at IRQ entry/exit.
/// Subtracted from the raw clock in `update_rq_clock()` to produce
/// the task-only clock (`clock_task`). Without this, IRQ processing
/// time is charged to the running task's vruntime.
/// Linux equivalent: `rq->prev_irq_time` + irq accounting in
/// `irqtime_account_irq()` / `account_irq_enter_time()`.
pub irq_time_ns: u64,
/// Timestamp of the most recent IRQ entry (nanoseconds, monotonic).
/// Set by `irq_time_start()`, consumed by `irq_time_end()` to
/// compute the delta for `irq_time_ns` accumulation.
pub irq_entry_timestamp: u64,
/// Countdown timer for load-balance trigger. Decremented every tick
/// by `scheduler_tick()`. When it reaches zero, `trigger_load_balance()`
/// raises `SCHED_SOFTIRQ`. Reset to `load_balance_interval(rq)` after
/// triggering. Initialized to `load_balance_interval(rq)` at boot.
/// Typical values: 4-32 ticks depending on sched_domain depth.
pub next_balance_tick: u32,
/// The per-CPU idle task. Always runnable; never enqueued in any
/// scheduling class. Returned by `pick_next_task()` when all three
/// class queues are empty. Statically allocated at boot — one per CPU.
pub idle_task: Arc<Task>,
}
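The fields above imply the class ordering that `pick_next_task()` walks. The stub below makes that order explicit; all names are illustrative, and throttling, deferred-entity retry, and CBS EDF ordering are omitted:

```rust
// Hedged stub of the scheduling-class walk implied by RunQueueData:
// RT first, then SCHED_DEADLINE, then CBS servers, then plain EEVDF, else idle.
#[derive(Debug, PartialEq)]
enum PickedClass { Rt, Dl, Cbs, Eevdf, Idle }

fn pick_class(rt_runnable: bool, dl_runnable: bool,
              cbs_runnable: bool, eevdf_runnable: bool) -> PickedClass {
    if rt_runnable { PickedClass::Rt }            // rt: highest-priority class
    else if dl_runnable { PickedClass::Dl }       // dl: EDF, earliest deadline
    else if cbs_runnable { PickedClass::Cbs }     // cbs_servers: checked between DL and EEVDF
    else if eevdf_runnable { PickedClass::Eevdf } // eevdf: main per-CPU tree
    else { PickedClass::Idle }                    // idle_task: always runnable
}
```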
/// Lock two run queues in a deadlock-free order (lower CPU ID first).
///
/// This is the **only** function that may hold two run queue locks simultaneously.
/// The load balancer (work stealing) must call this function instead of acquiring
/// `RunQueue.lock` directly while already holding another run queue's lock.
///
/// # Deadlock prevention — compile-time enforced
///
/// `RunQueue.lock` is a `SpinLock<RunQueueData, LEVEL=50>`. The `SpinLock` type
/// parameter prevents a caller that already holds a level-50 lock from acquiring
/// a second one without going through this function. Calling `lock_two_runqueues`
/// is therefore the only legal path to holding two run queue locks simultaneously,
/// and it always acquires them in CPU-ID order — eliminating ABBA deadlock at
/// compile time rather than relying on code review to uphold a naming convention.
///
/// The old Linux approach (documented rule "always lock min CPU first") has
/// produced real deadlocks in distribution kernels. UmkaOS's type-level enforcement
/// eliminates this class of bug entirely.
pub fn lock_two_runqueues<'a>(
rq_a: &'a RunQueue,
rq_b: &'a RunQueue,
) -> (SpinGuard<'a, RunQueueData, RQ_LOCK_LEVEL>, SpinGuard<'a, RunQueueData, RQ_LOCK_LEVEL>) {
// Guard: same-CPU case would deadlock (locking the same spinlock twice).
// Callers must check before calling. This is a debug_assert rather than
// a runtime branch because all call sites are in the load balancer which
// never steals from itself.
debug_assert!(rq_a.cpu_id != rq_b.cpu_id,
"lock_two_runqueues called with same CPU {} — would deadlock", rq_a.cpu_id);
// Always acquire the run queue whose CPU has the lower numeric ID first.
// Every pair of CPUs has a unique total order under this relation, so no
// two threads can form an acquisition cycle.
if rq_a.cpu_id < rq_b.cpu_id {
let g_a = rq_a.lock.lock();
let g_b = rq_b.lock.lock();
(g_a, g_b)
} else {
let g_b = rq_b.lock.lock();
let g_a = rq_a.lock.lock();
(g_a, g_b)
}
}
/// A reference to a runnable task held by the EEVDF scheduler.
/// Ownership: the scheduler holds one `Arc<Task>` per runnable task.
/// Using this alias makes ownership semantics explicit in data structure definitions.
///
/// **Type safety invariant**: Because `Arc<Task>` implies shared ownership,
/// scheduler methods receive `&Task` (shared reference), not `&mut Task`.
/// All scheduler-mutable fields in `Task` — `vruntime`, `vdeadline`,
/// `on_rq_state`, `cpu`, `sched_entity.*` — use interior mutability
/// (atomic types or fields protected by the per-CPU run queue lock).
/// The run queue lock provides mutual exclusion for non-atomic mutable
/// fields; the `&Task` reference type makes this explicit in the type system.
pub type TaskHandle = Arc<Task>;
/// A generic red-black tree with optional per-subtree augmented fields.
///
/// The EEVDF hot path uses intrusive `EevdfRbLink` (below) with `min_vruntime`
/// and `max_slice` augmentation, and the deadline run queue uses the intrusive
/// `DlRbLink` tree; this generic `RBNode`/`RBTree` type is reserved for
/// cold-path ordered containers.
///
/// # Path Classification: COLD ONLY
///
/// This type heap-allocates nodes via `Box`. It MUST NOT be used on hot
/// paths (scheduler tick, context switch, enqueue/dequeue). Hot-path
/// ordered containers use intrusive `EevdfRbLink` / `IntrusiveRbRoot`.
pub struct RBNode<K, V> {
pub key: K,
pub value: V,
color: RBColor,
left: Option<Box<RBNode<K, V>>>,
right: Option<Box<RBNode<K, V>>>,
/// Augmented field: minimum key in this subtree.
pub min_subtree_key: K,
}
/// Non-augmented red-black tree (standard ordered map).
///
/// # Path Classification: COLD ONLY
///
/// See `RBNode` for allocation constraints. Hot-path trees use intrusive links.
pub struct RBTree<K: Ord, V> {
root: Option<Box<RBNode<K, V>>>,
pub len: usize,
}
#[derive(Clone, Copy, PartialEq)]
enum RBColor { Red, Black }
/// Intrusive red-black tree link embedded in `EevdfTask`. Zero-allocation
/// insert/remove — the link is part of the task struct, not heap-allocated.
/// Keyed by `vdeadline` (virtual deadline) so that a left-descent walk in
/// `pick_eevdf()` naturally visits earlier-deadline tasks first. Augmented
/// with `min_vruntime` for eligibility pruning during `pick_eevdf()`.
pub struct EevdfRbLink {
/// RB-tree key: task's vdeadline (virtual deadline).
///
/// The tree is ordered by vdeadline so that `pick_eevdf()` finds the
/// eligible task with the earliest deadline via a left-descent walk.
/// Tie-breaking: when two tasks have identical vdeadline values, the
/// task with the smaller `TaskId` sorts first (ensures deterministic
/// ordering — `entity_before(a, b) = a.vdeadline < b.vdeadline ||
/// (a.vdeadline == b.vdeadline && a.task_id < b.task_id)`).
pub key: u64,
/// Augmented field: minimum `vruntime` in this subtree. Used by
/// `pick_eevdf()` to prune ineligible subtrees via the division-free
/// `vruntime_eligible(rq, node.min_vruntime)` check. If the subtree's
/// minimum vruntime is not eligible (greater than avg_vruntime), then
/// no entity in the subtree can be eligible.
///
/// Note: this is `vruntime` (not `vdeadline`). The tree is keyed by
/// `vdeadline` for deadline-first selection, but the augmented field
/// tracks `vruntime` for eligibility pruning. Linux equivalent:
/// `sched_entity.min_vruntime` augmented field.
pub min_vruntime: u64,
/// Augmented field: maximum `slice` in this subtree. Used by
/// `cfs_rq_max_slice()` for lag clamp computation. Linux equivalent:
/// `sched_entity.max_slice` augmented field.
pub max_slice: u64,
/// Intrusive RB-tree link pointers (parent, left, right, color).
/// Managed by the intrusive RB-tree implementation; not directly
/// accessed by scheduler logic.
///
/// Uses `Option<NonNull<EevdfRbLink>>` (matching `DlRbLink` pattern)
/// for niche optimization and null-safety. `None` = no child/parent.
///
/// # Safety
///
/// - **Ownership**: The `EevdfTask` struct owns the `EevdfRbLink` via
/// embedding. The link's lifetime is tied to the task.
/// - **Aliasing**: Multiple nodes' `parent`/`left`/`right` point to the
/// same node. No mutable aliasing — tree mutations happen exclusively
/// under `rq.lock` (lock level 50).
/// - **Thread safety**: All pointer dereferences require `rq.lock` held.
parent: Option<NonNull<EevdfRbLink>>,
left: Option<NonNull<EevdfRbLink>>,
right: Option<NonNull<EevdfRbLink>>,
color: RBColor,
}
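The `min_vruntime` augmentation is what lets `pick_eevdf()` skip whole subtrees. The toy model below uses an owned `Box` tree and a plain `vruntime <= avg_vruntime` eligibility test instead of the intrusive, weighted kernel version, but it demonstrates the same pruned left-descent:

```rust
// Toy model (heap-allocated, NOT the intrusive kernel tree): nodes keyed by
// vdeadline, augmented with the subtree-minimum vruntime.
struct Node {
    vdeadline: u64,
    vruntime: u64,
    min_vruntime: u64, // min vruntime over this whole subtree (augmented field)
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

// Earliest-vdeadline node whose vruntime is eligible (<= avg_vruntime).
// In-order position equals vdeadline order, so the first eligible node found
// by this left-first walk is the pick.
fn pick(node: &Option<Box<Node>>, avg_vruntime: u64) -> Option<&Node> {
    let n = node.as_ref()?;
    if n.min_vruntime > avg_vruntime {
        return None; // nothing in this subtree can be eligible — prune it
    }
    if let Some(found) = pick(&n.left, avg_vruntime) {
        return Some(found);
    }
    if n.vruntime <= avg_vruntime {
        return Some(n);
    }
    pick(&n.right, avg_vruntime)
}
```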
/// Root of the intrusive augmented RB-tree containing all runnable tasks.
pub struct IntrusiveAugmentedRbRoot<L> {
/// Root of the augmented red-black tree, or `None` if empty.
/// Consistent with `IntrusiveRbRoot<L>` which also uses
/// `Option<NonNull<L>>` for the nullable tree root.
root: Option<NonNull<L>>,
/// Total node count, maintained on insert/remove for O(1) size queries.
pub len: usize,
}
/// Number of tasks queued on this runqueue, **NOT** including `curr`.
///
/// This is a UmkaOS design choice: Linux's `cfs_rq->nr_queued` INCLUDES `curr`,
/// so Linux threshold `nr_queued == 1` means "only curr, no waiters." UmkaOS
/// `WaiterCount` counts only tasks in the tree (excluding `curr`), so the
/// equivalent check is `!has_waiters()` (count == 0).
///
/// The newtype prevents agents from accidentally using Linux threshold values.
/// Use the semantic methods (`has_waiters()`, `single_waiter()`) instead of
/// raw numeric comparisons.
///
/// **Threshold translation guide** (Linux → UmkaOS):
/// - Linux `nr_queued == 1` (only curr) → `!waiter_count.has_waiters()`
/// - Linux `nr_queued > 1` (peers exist) → `waiter_count.has_waiters()`
/// - Linux `nr_queued <= 1` (no peers) → `!waiter_count.has_waiters()`
pub struct WaiterCount(u32);
impl WaiterCount {
/// True if any task is waiting (curr has peers in the tree).
pub fn has_waiters(&self) -> bool { self.0 > 0 }
/// True if exactly one task is waiting.
pub fn single_waiter(&self) -> bool { self.0 == 1 }
/// Raw count (use sparingly — prefer semantic methods).
pub fn count(&self) -> u32 { self.0 }
/// Increment when a task is enqueued.
pub fn inc(&mut self) { self.0 += 1; }
/// Decrement when a task is dequeued. Debug-asserts non-zero.
pub fn dec(&mut self) { debug_assert!(self.0 > 0); self.0 -= 1; }
}
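A standalone demo of the threshold translation guide (the `WaiterCount` subset is re-declared here so the snippet compiles on its own; `should_check_preempt` is an illustrative name):

```rust
// Minimal re-declaration of WaiterCount for a self-contained example.
struct WaiterCount(u32);
impl WaiterCount {
    fn has_waiters(&self) -> bool { self.0 > 0 }
}

// Linux: `if (cfs_rq->nr_queued > 1) ...` ("curr has peers") translates to:
fn should_check_preempt(wc: &WaiterCount) -> bool {
    wc.has_waiters() // curr has at least one peer in the tree
}
```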
/// Shared vruntime-ordered augmented RB tree with EEVDF accumulators.
///
/// `VruntimeTree` is the base type shared by the root per-CPU EEVDF run queue
/// (`EevdfRunQueue`), CBS bandwidth servers (`CbsCpuServer`), and hierarchical
/// group scheduling entities (`GroupEntity`). It contains the augmented RB tree
/// and the two-accumulator state needed for division-free eligibility checks.
///
/// All EEVDF helper functions that operate only on the tree and accumulators
/// (`avg_vruntime_update`, `entity_key`, `update_zero_vruntime`,
/// `__enqueue_entity`, `__dequeue_entity`) take `&VruntimeTree` or
/// `&mut VruntimeTree` directly. Functions that additionally need the
/// currently-running entity (`avg_vruntime`, `vruntime_eligible`,
/// `pick_eevdf`, `update_curr`, `place_entity`) take `&EevdfRunQueue`,
/// which embeds `VruntimeTree` as its `base` field.
///
/// **Classification**: Nucleus (data). Struct layout is non-replaceable; all
/// Evolvable code (formulas, policy) operates on these fields.
pub struct VruntimeTree {
/// Single intrusive augmented red-black tree of ALL runnable tasks,
/// keyed by `vdeadline` (virtual deadline). Matches the Linux 6.6+
/// `tasks_timeline` design. Each task embeds an `EevdfRbLink` with
/// augmented `min_vruntime` and `max_slice` fields so that
/// `pick_eevdf()` can prune ineligible subtrees in O(log n).
/// Eligibility is computed dynamically during the walk, not stored as
/// tree membership. Zero heap allocation on enqueue/dequeue.
///
/// EEVDF-deferred tasks (`OnRqState::Deferred`) reside in this tree;
/// they are NOT filtered by `pick_eevdf()` during the tree walk.
/// If a deferred entity is picked, the caller (`pick_next_task()`)
/// handles it: dequeue without wake-up, then retry the pick.
/// See `pick_next_task()` deferred-entity handling below.
/// CBS-throttled tasks (`OnRqState::CbsThrottled`) are NOT in this tree.
pub tasks_timeline: IntrusiveAugmentedRbRoot<EevdfRbLink>,
/// Tracking reference point for the `avg_vruntime` two-accumulator
/// computation. Updated on every call to `avg_vruntime()` to track
/// close to the true weighted average. This is NOT the minimum vruntime
/// — it is approximately equal to avg_vruntime after each update.
///
/// Linux equivalent: `cfs_rq->zero_vruntime`. Note: Linux's old CFS
/// `cfs_rq.min_vruntime` was REMOVED when EEVDF replaced CFS. The field
/// `min_vruntime` exists only as a per-node augmented field (subtree
/// minimum), not on the runqueue struct.
///
/// `zero_vruntime` keeps accumulator deltas `(v_i - zero_vruntime)` small,
/// preventing overflow in the `key * weight` products. After
/// `avg_vruntime()` returns, `sum_w_vruntime` is approximately zero
/// (the residual from integer division).
pub zero_vruntime: u64,
/// First avg_vruntime accumulator: Σ(w_i × (v_i − zero_vruntime)) for all
/// enqueued entities (NOT including `curr` — it is added transiently).
/// Updated on every enqueue and dequeue (including deferred). Never
/// divided on the hot path — the division-free eligibility test uses this
/// value directly.
///
/// Linux equivalent: `cfs_rq->sum_w_vruntime` (type `s64`).
///
/// **Overflow analysis**: `zero_vruntime` tracks close to avg_vruntime,
/// so each `(v_i - zero_vruntime)` offset is bounded by approximately
/// `±2 * lag_limit`. The lag limit for a nice -20 task is
/// `calc_delta_fair(max_slice + TICK_NSEC, 88761) ≈ 750_000 + 4_000_000
/// ≈ 4_750_000 ns` (virtual time). So each term is bounded by
/// `88761 * 4_750_000 ≈ 4.2 × 10^11`. With 1000 tasks at maximum weight:
/// `4.2 × 10^14`, well within i64 range `±9.22 × 10^18`.
pub sum_w_vruntime: i64,
/// Second avg_vruntime accumulator: Σ(w_i) for all enqueued entities
/// (NOT including `curr`). Updated in lockstep with `sum_w_vruntime`.
/// Deferred tasks contribute their weight until removed from the tree.
///
/// Linux equivalent: `cfs_rq->sum_weight` (type `unsigned long`).
/// UmkaOS uses `i64` to keep the same signedness as `sum_w_vruntime`
/// for the division-free eligibility multiplication.
///
/// **Overflow analysis**: Maximum per-task weight is 88761. With 1000
/// tasks: `W = 88_761_000`, negligible relative to i64 range.
pub sum_weight: i64,
/// Number of entities on this tree (including deferred, excluding
/// `curr` when used in EevdfRunQueue context). See `WaiterCount` docs
/// for threshold semantics.
/// Linux equivalent: `cfs_rq->nr_queued` (but Linux includes `curr`).
pub nr_queued: WaiterCount,
/// Number of runnable (non-deferred) tasks on this tree.
/// `AtomicU32` so that work-stealing CPUs may read this field lock-free
/// using `Relaxed` ordering — an approximate count is sufficient for
/// steal candidate selection. Updated with `Relaxed` on enqueue and dequeue.
///
/// **Design tradeoff**: AtomicU32 is required for the root per-CPU
/// `EevdfRunQueue` (lock-free work-stealing reads). CBS server
/// (`CbsCpuServer.tree`) and `GroupEntity` sub-trees pay the atomic cost
/// unnecessarily (~5-20 cycles per enqueue/dequeue on x86-64 for
/// `lock xadd` vs plain `add`). This is accepted as the cost of a unified
/// `VruntimeTree` base type — duplicating into atomic and non-atomic
/// variants would increase maintenance burden for marginal benefit. If
/// profiling shows this is a bottleneck, split `VruntimeTree` into
/// `VruntimeTreeAtomic` (root per-CPU) and `VruntimeTreeLocal` (CBS,
/// GroupEntity) variants.
pub task_count: AtomicU32,
}
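The two accumulators exist so eligibility can be tested without a division on the hot path. A standalone sketch (simplified: `curr`'s transient contribution is omitted, and the free function mirrors the field names above):

```rust
// Division-free eligibility sketch: v is eligible iff v <= avg_vruntime,
// where avg = zero_vruntime + sum_w_vruntime / sum_weight. Multiplying both
// sides by sum_weight avoids the division:
//   (v - zero_vruntime) * sum_weight <= sum_w_vruntime
fn vruntime_eligible(sum_w_vruntime: i64, sum_weight: i64, zero_vruntime: u64, v: u64) -> bool {
    let key = v.wrapping_sub(zero_vruntime) as i64; // signed offset from reference point
    key * sum_weight <= sum_w_vruntime
}
```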
/// Per-CPU EEVDF run queue state.
///
/// Embeds a `VruntimeTree` (the shared augmented RB tree and accumulators)
/// plus EEVDF-specific fields that are only meaningful for the root per-CPU
/// run queue: the currently-running entity, the wakeup buddy, and the CBS
/// bandwidth timer.
///
/// **CpuLocal access**: Accessed via `CpuLocal::get::<EevdfRunQueue>()` on the
/// hot path. The per-CPU runqueue is protected by `RunQueue.lock` at level 50
/// in the lock hierarchy ([Section 3.4](03-concurrency.md#cumulative-performance-budget)).
pub struct EevdfRunQueue {
/// Shared vruntime-ordered tree and accumulators. CBS servers and group
/// entities embed `VruntimeTree` directly; the root per-CPU run queue
/// wraps it here with additional EEVDF-specific state.
pub base: VruntimeTree,
/// Currently-running entity on this CPU, or None if the CPU is idle.
/// `curr` is NOT included in `base.sum_w_vruntime`/`base.sum_weight` —
/// it is added transiently by `avg_vruntime()` and `vruntime_eligible()`.
/// This matches Linux's design where `curr` is dequeued from the tree
/// while running.
///
/// **Root-only field**: In CBS sub-trees (`CbsCpuServer.tree`) and group
/// entity sub-trees (`GroupEntity.child_rq`), there is no `curr` — the
/// currently running task is tracked solely by the root per-CPU
/// `EevdfRunQueue.curr`. Those sub-trees embed `VruntimeTree` directly,
/// avoiding this unused field.
///
/// Updated by `pick_next_task()` (set to picked entity) and
/// `put_prev_task()` (set to None before the next pick).
pub curr: Option<NonNull<EevdfTask>>,
/// Cache-affinity wakeup buddy. When a task is woken by `try_to_wake_up()`
/// with `WakeFlags::SYNC`, `next` is set to the wakee. `pick_eevdf()` checks
/// `next` before the tree walk — if `next` is eligible, it is picked
/// immediately (PICK_BUDDY optimization). This improves cache locality for
/// producer-consumer patterns (pipe, futex, unix socket).
///
/// **Root-only field**: CBS and group entity sub-trees do not use wakeup
/// buddies — PICK_BUDDY is applied only at the root per-CPU level.
///
/// Linux equivalent: `cfs_rq->next`.
pub next: Option<NonNull<EevdfTask>>,
/// Timer for CBS bandwidth replenishment checks.
pub bandwidth_timer: HrTimer,
// NOTE: Section 7.8 (Core Provisioning) adds `cgroup_filter: Option<CgroupId>`
// to this struct for CG-core task filtering in `pick_next_task()`.
}
/// Augmented RB-tree invariants for `EevdfRbLink`:
///
/// Note: the tree is keyed by `vdeadline`, but the augmented fields track
/// `vruntime` (subtree minimum) and `slice` (subtree maximum).
///
/// **min_vruntime invariant** (the entity's vruntime stored alongside the
/// key, NOT `node.key` which is vdeadline):
///
/// ```text
/// node.min_vruntime = min(node.vruntime,
/// left.min_vruntime if left exists,
/// right.min_vruntime if right exists)
/// ```
///
/// **max_slice invariant**:
///
/// ```text
/// node.max_slice = max(node.slice,
/// left.max_slice if left exists,
/// right.max_slice if right exists)
/// ```
///
/// Both fields are maintained by the RB-tree rebalancing hooks:
/// every rotation or color change that restructures the tree must call
/// `recompute_augmented()` on each affected node, bottom-up. This is
/// the standard augmented-RB-tree update protocol.
///
/// `pick_eevdf()` exploits `min_vruntime` to prune ineligible subtrees
/// via `vruntime_eligible(rq, subtree.min_vruntime)`. If the subtree's
/// minimum vruntime is not eligible (greater than avg_vruntime), then
/// no entity in the subtree can be eligible — achieving O(log n)
/// selection even when many tasks are ineligible.
///
/// `cfs_rq_max_slice()` uses `max_slice` from the tree root to compute
/// the lag clamp limit without a full tree scan.
/// EEVDF scheduling state embedded in each Task struct.
///
/// This struct is embedded directly in `Task` (via `pub type SchedEntity = EevdfTask`)
/// so that the scheduler hot path can access all scheduling-relevant fields without
/// pointer indirection through `Task`. Fields like `weight`, `sched_class`, and `nice`
/// are here (not in `Task`) because the scheduler dispatch and accounting code reads
/// them on every tick, pick, and enqueue/dequeue — keeping them cache-local with
/// `vruntime` and `vdeadline` avoids a pointer chase on every scheduling decision.
pub struct EevdfTask {
/// Virtual runtime: accumulated CPU consumption in virtual time units.
/// Scales inversely with task weight (higher weight = slower accumulation).
/// Updated by `update_curr()` on every tick and scheduling event.
/// Linux equivalent: `sched_entity.vruntime`.
///
/// **Overflow invariant**: The difference between any entity's `vruntime`
/// and `VruntimeTree.zero_vruntime` must fit in an `i64` (i.e.,
/// `|se.vruntime - tree.zero_vruntime| < i64::MAX`). This is maintained by
/// the `zero_vruntime` normalization which periodically rebases all
/// vruntimes when `zero_vruntime` drifts. The `as i64` casts in
/// `entity_eligible()` and `avg_vruntime()` rely on this invariant.
/// Since `zero_vruntime` tracks the run queue's minimum vruntime and
/// all active entities have bounded lag (clamped by `update_entity_lag()`),
/// the invariant is maintained for any realistic workload.
vruntime: u64,
/// Virtual deadline: `vruntime + calc_delta_fair(slice, weight)`. The task
/// with the minimum vdeadline among eligible tasks is scheduled next.
/// Set by `place_entity()` on wakeup and `update_deadline()` on slice expiry.
/// Linux equivalent: `sched_entity.deadline`.
vdeadline: u64,
/// Virtual lag: `avg_vruntime(cfs_rq) - se.vruntime` at last dequeue.
/// **Unweighted, signed, in virtual-time units** — NOT multiplied by weight.
/// Positive = task has been underserved (owed CPU, eligible).
/// Negative = task over-served (ineligible).
/// Clamped by `update_entity_lag()` to
/// `±calc_delta_fair(cfs_rq_max_slice + TICK_NSEC, se)`.
/// Used by `place_entity()` on wakeup to position the task in virtual time.
/// Preserved across sleep/wake cycles.
/// Linux equivalent: `sched_entity.vlag` (type `s64`).
vlag: i64,
/// Time slice in nanoseconds. Default `sysctl_sched_base_slice` = 750_000
/// (750 us; UmkaOS diverges from Linux's 700 us — see design note above).
/// Configurable per task via `sched_setattr()` by setting `sched_runtime`
/// to a non-zero value in `struct sched_attr`. When `sched_runtime != 0`,
/// `custom_slice = true` and `slice_ns = sched_runtime`. When
/// `custom_slice` is false, `place_entity()` resets this to
/// `sysctl_sched_base_slice` on each wakeup.
/// Linux equivalent: `sched_entity.slice`.
slice_ns: u64,
/// Whether this entity has a custom slice set via `sched_setattr()`.
/// If false, `place_entity()` resets `slice_ns` to the sysctl default.
/// Linux equivalent: `sched_entity.custom_slice`.
custom_slice: bool,
/// Relative deadline flag. Set by `reweight_entity()` when weight changes
/// require deadline recalculation. `place_entity()` converts the relative
/// deadline to absolute: `vdeadline += vruntime`. Cleared after conversion.
/// Linux equivalent: `sched_entity.rel_deadline`.
rel_deadline: bool,
/// Protected slice threshold. `pick_eevdf()` retains `curr` if
/// `curr.vruntime < curr.vprot`, preventing excessive preemption when the
/// current task has not yet consumed a minimum quantum.
/// Set by `set_protect_slice()` on each scheduling event.
/// Linux equivalent: `sched_entity.vprot`.
vprot: u64,
/// Run queue membership state. Uses the four-variant enum to correctly
/// capture both the EEVDF deferred-dequeue state (sleeping but still in
/// the tree) and CBS throttled state (dequeued, awaiting replenishment).
on_rq: OnRqState,
/// True when the task has gone to sleep with negative vlag and is still
/// physically resident in `tasks_timeline` pending vlag decay. While
/// `sched_delayed` is set, the task remains eligible for selection by
/// `pick_eevdf()` (matching Linux, which has no `sched_delayed` filter
/// in the tree walk). The task still contributes its weight to the
/// `avg_vruntime` accumulators.
/// Cleared on wake-up (before re-enqueue) or when vlag decays to zero
/// at pick time and the task is physically removed.
sched_delayed: bool,
/// Wall-clock execution time accumulated by this entity (nanoseconds).
/// Incremented by `delta_exec` in `update_curr()` on every tick and
/// scheduling event. Used by `check_preempt_tick` in `eevdf_task_tick()`
/// to compare against `prev_sum_exec_runtime` for ideal-runtime checks,
/// and by `getrusage(2)` / `/proc/[pid]/sched` for user-visible accounting.
/// Linux equivalent: `sched_entity.sum_exec_runtime`.
sum_exec_runtime: u64,
/// Value of `sum_exec_runtime` at the time the entity was last enqueued.
/// The difference `sum_exec_runtime - prev_sum_exec_runtime` gives the
/// wall-clock time consumed in the current scheduling quantum.
/// **Initialization**: Set to `sum_exec_runtime` on each enqueue
/// (`enqueue_task()` / `set_next_task()` assign
/// `prev_sum_exec_runtime = sum_exec_runtime`). Not updated by
/// `eevdf_task_tick()` or `update_curr()` -- once set at enqueue, it
/// is fixed for the duration of the task's on-CPU residence. The
/// `check_preempt_tick` heuristic in `eevdf_task_tick()` step 3 reads
/// this field to compute `delta_exec`.
/// Linux equivalent: `sched_entity.prev_sum_exec_runtime`.
prev_sum_exec_runtime: u64,
/// Intrusive red-black tree link for the EEVDF `tasks_timeline` tree.
/// Keyed by `vdeadline`, augmented with `min_vruntime` for eligibility
/// pruning in `pick_eevdf()`. Embedded directly in the task struct for
/// zero-allocation enqueue/dequeue on the scheduler hot path.
eevdf_rb_link: EevdfRbLink,
/// Intrusive red-black tree link for the `DlRunQueue` tree. When this
/// task has `sched_class == SchedClass::Deadline`, `dl_rb_link` is
/// inserted into the per-CPU `DlRunQueue.root` tree on enqueue and
/// removed on dequeue. For non-deadline tasks this link remains in the
/// unlinked state (`dl_rb_link.is_linked() == false`) and is never
/// touched by the scheduler. Embedding the link here avoids any heap
/// allocation when inserting into / removing from the DL run queue.
dl_rb_link: DlRbLink,
/// EEVDF weight derived from the nice value via `sched_prio_to_weight[]`.
/// Determines the rate of vruntime accumulation: higher weight → slower
/// accumulation → more CPU share. Updated when nice is changed via
/// `setpriority(2)` or `sched_setattr(2)`. Must be kept in sync with
/// `nice` — the canonical mapping is `weight = sched_prio_to_weight[nice + 20]`.
/// Stored here (not derived on-the-fly) because the scheduler reads it
/// on every `vruntime += delta * NICE_0_WEIGHT / weight` computation.
weight: u32,
/// Per-Entity Load Tracking state. Maintains exponentially-decaying averages
/// of CPU utilisation (`util_avg`), runnability (`runnable_avg`), and weighted
/// load (`load_avg`). Updated on every scheduling state transition (run→sleep,
/// sleep→run, tick). Consumed by EAS for task placement, by cpufreq for
/// frequency selection, and by the load balancer for migration decisions.
/// See the `PeltState` definition above for field-level details.
///
    /// **Long-term precision**: PELT uses fixed-point (u32, 20-bit fractional)
    /// geometric decay with a 32 ms half-life. Inactive entities decay to a
    /// negligible contribution within ~192 ms (6 half-lives, ~1.5% remaining).
    /// Accumulated rounding errors do not drift over time because the decay is
    /// self-correcting — any rounding error in an active entity's load average
    /// is dominated by the next period's fresh contribution.
pelt: PeltState,
/// Latency-nice value (-20 to 19). Controls the EEVDF virtual deadline slack:
/// lower latency_nice → shorter virtual deadline → task is picked sooner among
/// eligible entities at the cost of reduced throughput for co-scheduled tasks.
///
/// **UmkaOS-original extension** — NOT present in Linux mainline. Set via
/// `sched_setattr(2)` with `sched_flags |= SCHED_FLAG_LATENCY_NICE` (0x80)
/// and `sched_latency_nice` field in the extended `sched_attr`. Default 0
/// (no adjustment).
///
/// The effective slice used for deadline calculation:
/// `effective_slice = base_slice_ns * LATENCY_NICE_0_WEIGHT / latency_weight`
/// where `latency_weight = LATENCY_NICE_TO_WEIGHT[(latency_nice + 20)]`.
/// A task with `latency_nice = -20` gets a ~88× shorter effective slice,
/// giving it the earliest deadline among peers.
latency_nice: i32,
/// Scheduling class. Determines which per-CPU queue this task is managed by
/// and which class-specific operations (enqueue/dequeue/pick/tick) apply.
/// The `pick_next_task()` dispatch uses the class priority ordering:
/// Deadline > RtFifo/RtRr > Eevdf > Idle. Changed via `sched_setscheduler(2)`
/// or `sched_setattr(2)`. Stored in `EevdfTask` (not `Task`) because the
/// scheduler dispatch reads it on every pick decision — keeping it cache-local
/// with `vruntime` avoids an extra cache line fetch.
sched_class: SchedClass,
/// Scheduling policy. Encodes the user-visible POSIX policy that the task
/// was configured with. Maps to `SchedClass` but preserves the distinction
/// between policies within the same class (e.g., `UserSchedPolicy::Batch` and
/// `UserSchedPolicy::Normal` both map to `SchedClass::Eevdf` but differ in
/// preemption behavior — Batch tasks are never preempted by newly woken
/// Eevdf peers, only by RT/DL tasks).
sched_policy: UserSchedPolicy,
/// Nice value (-20 to 19). The POSIX nice value set by `setpriority(2)` or
/// `nice(2)`. Determines `weight` via `sched_prio_to_weight[nice + 20]`.
/// Stored alongside `weight` because `getpriority(2)` must return it and
/// `/proc/[pid]/stat` field 19 exposes it. Changing nice updates both `nice`
/// and `weight` atomically under the runqueue lock.
nice: i8,
/// Accumulated RT CPU time in microseconds for RLIMIT_RTTIME enforcement.
/// Incremented on every scheduler tick while the task is running under an
/// RT scheduling class (SCHED_FIFO or SCHED_RR). When this value reaches
/// the task's `rlimit(RLIMIT_RTTIME)`, the kernel sends SIGXCPU. If the
/// task continues running for one additional second, SIGKILL is sent.
/// Reset to zero each time the task voluntarily relinquishes the CPU
/// (blocks, yields, or sleeps). AtomicU64 because it is read by the
/// signal delivery path without holding the runqueue lock.
rt_runtime_us: AtomicU64,
/// Wall-clock timestamp (nanoseconds, monotonic) of the last scheduling
/// accounting update. Set to `rq_clock_task(rq)` by `update_curr()` on
/// every tick/scheduling event, and on enqueue. The delta
/// `rq_clock_task(rq) - exec_start` gives task-only time since the last
/// accounting update (excludes IRQ time). Used by `update_curr()`,
/// `rt_bandwidth_tick()`, and `check_preempt_tick()` for time accounting.
/// Linux equivalent: `sched_entity.exec_start` (uses `rq_clock_task()`).
exec_start: u64,
/// Dirty flag set by cgroup cpu.weight write path
/// (`sched_group_set_weight()`) to signal that this task's GroupEntity
/// weight needs recalculation on the next tick. AtomicBool because the
/// cgroup migration path (cross-CPU) may set this flag while the task
/// is running on a different CPU — a cross-context access that is a
/// data race under the Rust abstract machine with `Cell`.
/// Checked and cleared by `eevdf_task_tick()` step 4.
cgroup_weight_dirty: AtomicBool,
    /// True while the task is marked idle by `sched_idle_enter()`. While set,
    /// `update_curr()` skips vruntime and PELT accumulation for this task
    /// (wall-clock `sum_exec_runtime` still advances). AtomicBool because it
    /// is read by `update_curr()` on the owning CPU while it may be set or
    /// cleared from another context.
    sched_idle_marked: AtomicBool,
    /// Saved `entity_key()` value at `sched_idle_enter()` time. Used by
    /// `sched_idle_exit()` to restore the exact accumulator contribution,
    /// preventing drift from `zero_vruntime` shifts during the idle interval.
    /// Only valid when `sched_idle_marked` is true. AtomicI64 for cross-context
    /// visibility (same rationale as `sched_idle_marked`).
    saved_idle_key: AtomicI64,
}
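For concreteness, the endpoints of the nice-to-weight mapping used by `weight` and `nice` above, taken from the well-known `sched_prio_to_weight[]` values (only three of the 40 entries are shown; `weight_for_nice` is a sketch helper, not a kernel function):

```rust
/// Selected sched_prio_to_weight entries; index is nice + 20.
fn weight_for_nice(nice: i8) -> u32 {
    match nice {
        -20 => 88_761,
        0 => 1_024, // NICE_0_WEIGHT
        19 => 15,
        _ => unimplemented!("only the table endpoints are shown in this sketch"),
    }
}

fn main() {
    assert_eq!(weight_for_nice(0), 1_024);
    // Each nice step changes CPU share by roughly 25%, so the ratio
    // between the extremes and nice 0 is large:
    assert!(weight_for_nice(-20) / weight_for_nice(0) > 80);  // ~87x
    assert!(weight_for_nice(0) / weight_for_nice(19) > 60);   // ~68x
}
```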
// ---------------------------------------------------------------------------
// avg_vruntime — Weighted Average Virtual Runtime
// ---------------------------------------------------------------------------
/// Compute and return the weighted average virtual runtime of all entities
/// on this run queue. This function has a SIDE EFFECT: it updates
/// `zero_vruntime` and adjusts `sum_w_vruntime` so that `zero_vruntime`
/// tracks close to the true weighted average. After this call,
/// `sum_w_vruntime` is approximately zero (the residual from integer
/// division).
///
/// **Classification**: Evolvable (replaceable). Correctness ensured by
/// `EevdfInvariantChecker` before swap is committed. See classification
/// table at top of this section.
///
/// Linux equivalent: `avg_vruntime()` in `fair.c`.
///
/// # Algorithm
///
/// ```text
/// fn avg_vruntime(rq: &mut EevdfRunQueue) -> u64 {
/// let curr = rq.curr;
/// let mut weight: i64 = rq.base.sum_weight;
/// let mut delta: i64 = 0;
///
/// // Only include curr if it is on the runqueue.
/// let curr = match curr {
/// Some(c) if unsafe { c.as_ref() }.on_rq != OnRqState::Off => Some(c),
/// _ => None,
/// };
///
/// if weight > 0 {
/// let mut runtime: i64 = rq.base.sum_w_vruntime;
///
/// // Transiently add curr's contribution (curr is not in the
/// // accumulators while running).
/// if let Some(c) = curr {
/// let se = unsafe { c.as_ref() };
/// let w = se.weight as i64;
/// runtime += entity_key(&rq.base, se) * w;
/// weight += w;
/// }
///
/// // Floor-division bias: round toward negative infinity.
/// // This ensures that avg_vruntime() + 0 always yields
/// // entity_eligible() == true (a task placed exactly at the
/// // average must be eligible). Without this bias, truncation
/// // toward zero could make the average appear slightly too high,
/// // causing a correctly-placed task to be deemed ineligible.
/// if runtime < 0 {
/// runtime -= weight - 1;
/// }
///
/// delta = runtime / weight; // integer division
/// } else if let Some(c) = curr {
/// // When only curr exists (no tree entities), it IS the average.
/// let se = unsafe { c.as_ref() };
/// delta = se.vruntime as i64 - rq.base.zero_vruntime as i64;
/// }
///
/// update_zero_vruntime(&mut rq.base, delta);
/// rq.base.zero_vruntime
/// }
/// ```
///
/// **`entity_key`**: Returns the signed offset of an entity's vruntime from
/// `zero_vruntime`. Takes `&VruntimeTree` — operates only on accumulator state.
/// Linux equivalent: `entity_key()`.
/// ```text
/// fn entity_key(tree: &VruntimeTree, se: &EevdfTask) -> i64 {
/// se.vruntime as i64 - tree.zero_vruntime as i64
/// }
/// ```
///
/// **`update_zero_vruntime`**: Shifts the tracking reference point by `delta`.
/// Takes `&mut VruntimeTree` — operates only on accumulator state.
/// Linux equivalent: `update_zero_vruntime()`.
/// ```text
/// fn update_zero_vruntime(tree: &mut VruntimeTree, delta: i64) {
/// // v' = v + d ==> sum_w_vruntime' = sum_w_vruntime - d * sum_weight
/// tree.sum_w_vruntime -= tree.sum_weight * delta;
/// tree.zero_vruntime = (tree.zero_vruntime as i64 + delta) as u64;
/// }
/// ```
pub fn avg_vruntime(rq: &mut EevdfRunQueue) -> u64 { /* in sched/eevdf.rs */ }
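The `update_zero_vruntime` identity can be exercised standalone: shifting the reference point must leave the reconstructed average `zero_vruntime + sum_w_vruntime / sum_weight` unchanged. A minimal sketch with toy accumulator values chosen so the division is exact:

```rust
struct Acc { zero_vruntime: u64, sum_w_vruntime: i64, sum_weight: i64 }

fn update_zero_vruntime(t: &mut Acc, delta: i64) {
    // v' = v + d  ==>  sum_w_vruntime' = sum_w_vruntime - d * sum_weight
    t.sum_w_vruntime -= t.sum_weight * delta;
    t.zero_vruntime = (t.zero_vruntime as i64 + delta) as u64;
}

fn avg(t: &Acc) -> i64 {
    t.zero_vruntime as i64 + t.sum_w_vruntime / t.sum_weight
}

fn main() {
    // Two entities: (v=1000, w=1024) and (v=1600, w=2048), zero at 1000.
    let mut t = Acc {
        zero_vruntime: 1000,
        sum_w_vruntime: 0 * 1024 + 600 * 2048, // sum of w_i * (v_i - zero)
        sum_weight: 1024 + 2048,
    };
    let before = avg(&t); // 1000 + 1_228_800 / 3072 = 1400
    update_zero_vruntime(&mut t, 400); // move zero onto the average
    assert_eq!(avg(&t), before);       // reconstructed average unchanged
    assert_eq!(t.zero_vruntime, 1400);
    assert_eq!(t.sum_w_vruntime, 0);   // residual is zero after rebasing
}
```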
// ---------------------------------------------------------------------------
// Accumulator maintenance
// ---------------------------------------------------------------------------
/// Update the avg_vruntime accumulators when a task is enqueued or dequeued.
///
/// `avg_vruntime` is maintained without division using two running sums:
///
/// ```text
/// zero_vruntime = tracking reference (approximately = avg_vruntime)
/// sum_w_vruntime = Σ(w_i × (v_i − zero_vruntime))
/// sum_weight = Σ(w_i)
///
/// avg_vruntime = zero_vruntime + sum_w_vruntime / sum_weight
/// ```
///
/// **On enqueue** (task entering the tree, including deferred re-enqueue):
/// ```text
/// tree.sum_w_vruntime += entity_key(tree, se) * se.weight as i64;
/// tree.sum_weight += se.weight as i64;
/// ```
///
/// **On dequeue** (task leaving the tree, including deferred removal):
/// ```text
/// tree.sum_w_vruntime -= entity_key(tree, se) * se.weight as i64;
/// tree.sum_weight -= se.weight as i64;
/// ```
///
/// Linux equivalent: called from `enqueue_entity()` and `dequeue_entity()`.
///
/// Takes `&mut VruntimeTree` — operates only on accumulator state, no access
/// to `curr` or other root-only fields. This allows CBS servers and group
/// entities to call this function directly on their embedded `VruntimeTree`.
///
/// **ML observation point**: After dequeue, emit runtime metrics:
/// ```text
/// observe_kernel!(SubsystemId::Scheduler, SchedObs::RunqueueStats,
/// tree.nr_queued.count(), /* ... */);
/// ```
pub fn avg_vruntime_update(tree: &mut VruntimeTree, se: &EevdfTask, enqueue: bool) {
let key = entity_key(tree, se);
let w = se.weight as i64;
if enqueue {
tree.sum_w_vruntime += key * w;
tree.sum_weight += w;
} else {
tree.sum_w_vruntime -= key * w;
tree.sum_weight -= w;
}
}
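The enqueue/dequeue bookkeeping above can be checked against a brute-force weighted average; a standalone sketch with toy stand-ins for `VruntimeTree` and `EevdfTask`:

```rust
struct Tree { zero_vruntime: u64, sum_w_vruntime: i64, sum_weight: i64 }
struct Se { vruntime: u64, weight: u32 }

fn entity_key(t: &Tree, se: &Se) -> i64 {
    se.vruntime as i64 - t.zero_vruntime as i64
}

fn avg_vruntime_update(t: &mut Tree, se: &Se, enqueue: bool) {
    let key = entity_key(t, se);
    let w = se.weight as i64;
    if enqueue {
        t.sum_w_vruntime += key * w;
        t.sum_weight += w;
    } else {
        t.sum_w_vruntime -= key * w;
        t.sum_weight -= w;
    }
}

fn main() {
    let mut t = Tree { zero_vruntime: 100, sum_w_vruntime: 0, sum_weight: 0 };
    let a = Se { vruntime: 100, weight: 1024 };
    let b = Se { vruntime: 400, weight: 3072 };
    avg_vruntime_update(&mut t, &a, true);
    avg_vruntime_update(&mut t, &b, true);
    // Brute force: (1024*100 + 3072*400) / 4096 = 325.
    assert_eq!(t.zero_vruntime as i64 + t.sum_w_vruntime / t.sum_weight, 325);
    // Dequeue b: the average collapses back to a's vruntime.
    avg_vruntime_update(&mut t, &b, false);
    assert_eq!(t.zero_vruntime as i64 + t.sum_w_vruntime / t.sum_weight, 100);
}
```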
// ---------------------------------------------------------------------------
// Eligibility — division-free O(1) check
// ---------------------------------------------------------------------------
/// Division-free O(1) eligibility check for a specific vruntime value.
///
/// Returns true if `vruntime <= avg_vruntime(rq)`, computed without division:
///
/// ```text
/// vruntime <= zero_vruntime + sum_w_vruntime / sum_weight
/// ⟺ (vruntime - zero_vruntime) * sum_weight <= sum_w_vruntime
/// ```
///
/// **Curr transient inclusion**: The currently-running entity (`curr`) is NOT
/// in `sum_w_vruntime`/`sum_weight`. This function transiently adds `curr`'s
/// contribution before comparing. This matches Linux's `vruntime_eligible()`.
///
/// **Classification**: Evolvable (replaceable). Correctness ensured by
/// `EevdfInvariantChecker` before swap is committed. See classification
/// table at top of this section.
///
/// Linux equivalent: `vruntime_eligible()` in `fair.c`.
///
/// ```text
/// fn vruntime_eligible(rq: &EevdfRunQueue, vruntime: u64) -> bool {
/// let mut avg: i64 = rq.base.sum_w_vruntime;
/// let mut load: i64 = rq.base.sum_weight;
///
/// // Transiently include curr if on the runqueue.
/// if let Some(c) = rq.curr {
/// let se = unsafe { c.as_ref() };
/// if se.on_rq != OnRqState::Off {
/// let w = se.weight as i64;
/// avg += entity_key(&rq.base, se) * w;
/// load += w;
/// }
/// }
///
/// avg >= (vruntime as i64 - rq.base.zero_vruntime as i64) * load
/// }
/// ```
pub fn vruntime_eligible(rq: &EevdfRunQueue, vruntime: u64) -> bool {
/* in sched/eevdf.rs */
}
/// Check whether a specific entity is eligible.
///
/// Delegates to `vruntime_eligible(rq, se.vruntime)`.
///
/// Linux equivalent: `entity_eligible()` in `fair.c`.
///
/// ```text
/// fn entity_eligible(rq: &EevdfRunQueue, se: &EevdfTask) -> bool {
/// vruntime_eligible(rq, se.vruntime)
/// }
/// ```
pub fn entity_eligible(rq: &EevdfRunQueue, se: &EevdfTask) -> bool {
vruntime_eligible(rq, se.vruntime)
}
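The division-free rewrite can be cross-checked against the divide form across the eligibility boundary; a standalone sketch with toy accumulator values (`curr` omitted for brevity, and the reference form uses f64 division that is exact for these small integers):

```rust
struct Acc { zero_vruntime: u64, sum_w_vruntime: i64, sum_weight: i64 }

/// Division-free form:
///   vruntime <= zero + sum_w/W  <=>  (vruntime - zero) * W <= sum_w
fn eligible_mul(t: &Acc, vruntime: u64) -> bool {
    t.sum_w_vruntime >= (vruntime as i64 - t.zero_vruntime as i64) * t.sum_weight
}

/// Reference form using explicit division.
fn eligible_div(t: &Acc, vruntime: u64) -> bool {
    let avg = t.zero_vruntime as f64 + t.sum_w_vruntime as f64 / t.sum_weight as f64;
    (vruntime as f64) <= avg
}

fn main() {
    // avg = 1000 + 921_600 / 4096 = 1225 exactly.
    let t = Acc { zero_vruntime: 1000, sum_w_vruntime: 921_600, sum_weight: 4_096 };
    for v in 1200..1250 {
        assert_eq!(eligible_mul(&t, v), eligible_div(&t, v), "disagree at v={v}");
    }
    assert!(eligible_mul(&t, 1225));  // exactly at the average: eligible
    assert!(!eligible_mul(&t, 1226)); // one past it: ineligible
}
```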
// ---------------------------------------------------------------------------
// update_entity_lag — Lag Tracking
// ---------------------------------------------------------------------------
/// Update and clamp the entity's virtual lag on dequeue.
///
/// Called on every dequeue (sleep, yield, preemption). Stores the virtual lag
/// (`vlag = V - v_i`) for use by `place_entity()` on the next wakeup.
///
/// **Classification**: Evolvable (replaceable). Correctness ensured by
/// `EevdfInvariantChecker` before swap is committed. See classification
/// table at top of this section.
///
/// Linux equivalent: `update_entity_lag()` in `fair.c`.
///
/// ```text
/// fn update_entity_lag(rq: &mut EevdfRunQueue, se: &mut EevdfTask) {
/// debug_assert!(se.on_rq != OnRqState::Off);
///
/// let vlag: i64 = avg_vruntime(rq) as i64 - se.vruntime as i64;
/// let limit: i64 = calc_delta_fair(
/// cfs_rq_max_slice(rq) + TICK_NSEC, se.weight
/// ) as i64;
///
/// se.vlag = vlag.clamp(-limit, limit);
/// }
/// ```
///
/// **`cfs_rq_max_slice`**: Returns the maximum `slice_ns` across the tree root's
/// `max_slice` augmented field and `curr` (if on_rq). Linux equivalent:
/// `cfs_rq_max_slice()`.
///
/// ```text
/// fn cfs_rq_max_slice(rq: &EevdfRunQueue) -> u64 {
/// let mut max_slice: u64 = 0;
/// if let Some(c) = rq.curr {
/// let se = unsafe { c.as_ref() };
/// if se.on_rq != OnRqState::Off {
/// max_slice = se.slice_ns;
/// }
/// }
/// if let Some(root) = rq.base.tasks_timeline.root() {
/// max_slice = max_slice.max(root.max_slice);
/// }
/// max_slice
/// }
/// ```
pub fn update_entity_lag(rq: &mut EevdfRunQueue, se: &mut EevdfTask) {
/* in sched/eevdf.rs */
}
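The clamp arithmetic can be sketched with `calc_delta_fair` implemented as the inverse-weight scaling used throughout this chapter (`delta * NICE_0_WEIGHT / weight`); the 1 kHz `TICK_NSEC` here is an assumption for illustration:

```rust
const NICE_0_WEIGHT: u64 = 1_024;
const TICK_NSEC: u64 = 1_000_000; // assumed 1 kHz tick for this sketch

/// Wall-clock delta -> virtual-time delta, scaled inversely by weight.
fn calc_delta_fair(delta_ns: u64, weight: u32) -> u64 {
    delta_ns * NICE_0_WEIGHT / weight as u64
}

fn main() {
    // Nice-0 task: virtual time advances at wall-clock rate.
    assert_eq!(calc_delta_fair(750_000, 1_024), 750_000);
    // Double-weight task accumulates vruntime at half the rate.
    assert_eq!(calc_delta_fair(750_000, 2_048), 375_000);

    // update_entity_lag clamp:
    //   |vlag| <= calc_delta_fair(cfs_rq_max_slice + TICK_NSEC, weight)
    let max_slice = 750_000u64;
    let limit = calc_delta_fair(max_slice + TICK_NSEC, 1_024) as i64;
    assert_eq!(limit, 1_750_000);
    let vlag: i64 = 5_000_000; // wildly underserved: clamped to the limit
    assert_eq!(vlag.clamp(-limit, limit), 1_750_000);
}
```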
// ---------------------------------------------------------------------------
// place_entity — Wakeup Placement
// ---------------------------------------------------------------------------
/// Position a waking (or newly forked) entity in virtual time.
///
/// This is the most critical function in EEVDF — it determines where a task
/// appears in the virtual timeline on every wakeup, fork, and reweight. It
/// MUST match Linux's mathematical semantics for fairness correctness.
///
/// **Classification**: Evolvable (replaceable). The lag inflation formula and
/// initial placement policy are swappable via `EvolvableComponent`. The ML
/// framework can tune the effective slice via `ParamId::SchedEevdfWeightScale`.
///
/// Linux equivalent: `place_entity()` in `fair.c`.
///
/// # Algorithm
///
/// ```text
/// fn place_entity(rq: &mut EevdfRunQueue, se: &mut EevdfTask, flags: EnqueueFlags) {
/// // Step 1: Get the current weighted average (also updates zero_vruntime).
/// let vruntime: u64 = avg_vruntime(rq);
/// let mut lag: i64 = 0;
///
/// // Step 2: Reset slice to sysctl default unless custom.
/// if !se.custom_slice {
/// // ML-tunable slice via ParamId::SchedEevdfWeightScale.
/// let weight_scale = PARAM_STORE.get(ParamId::SchedEevdfWeightScale)
/// .map_or(100, |p| p.current.load(Relaxed));
/// se.slice_ns = SYSCTL_SCHED_BASE_SLICE * weight_scale as u64 / 100;
/// }
/// let vslice: u64 = calc_delta_fair(se.slice_ns, se.weight);
///
/// // Step 3: PLACE_LAG — restore saved virtual lag with inflation.
/// // Only applies if the queue has entities and the entity has saved lag.
/// if rq.base.nr_queued.has_waiters() && se.vlag != 0 {
/// lag = se.vlag;
///
/// // Lag inflation: compensate for the effect of adding this entity on V.
/// // Adding a task with positive vlag moves V backwards, reducing the
/// // effective lag. The inflation formula:
/// // inflated_lag = vlag * (W + w_i) / W
/// // prevents lag "evaporation" on sleep/wake cycles. Without this,
/// // frequently-sleeping interactive tasks gradually lose their
/// // accumulated service credits — a systematic fairness bias.
/// //
/// // Linux equivalent: the PLACE_LAG block in place_entity().
/// let mut load: i64 = rq.base.sum_weight;
/// if let Some(c) = rq.curr {
/// let curr_se = unsafe { c.as_ref() };
/// if curr_se.on_rq != OnRqState::Off {
/// load += curr_se.weight as i64;
/// }
/// }
///
/// // lag *= (load + w_i) / load
/// lag = lag * (load + se.weight as i64);
/// debug_assert!(load > 0, "sum_weight == 0 with nr_queued.has_waiters(): corrupted runqueue");
/// if load == 0 { load = 1; } // defensive fallback prevents div-by-zero in release
/// lag = lag / load;
/// }
///
/// // Step 4: Set vruntime. A task with positive lag (owed CPU) gets
/// // vruntime < avg_vruntime (eligible). A task with negative lag
/// // (over-served) gets vruntime > avg_vruntime (ineligible).
/// se.vruntime = (vruntime as i64 - lag) as u64;
///
/// // Step 5: Handle relative deadline (from reweight).
/// if se.rel_deadline {
/// se.vdeadline += se.vruntime;
/// se.rel_deadline = false;
/// return;
/// }
///
/// // Step 6: Initial placement (fork) — start with half a slice to ease
/// // into the competition. Existing tasks are, on average, halfway
/// // through their slice.
/// let effective_vslice = if flags.contains(EnqueueFlags::ENQUEUE_INITIAL) {
/// vslice / 2
/// } else {
/// vslice
/// };
///
/// // Step 7: Set virtual deadline.
/// // EEVDF: vd_i = ve_i + r_i/w_i (in virtual time units).
/// se.vdeadline = se.vruntime + effective_vslice;
///
/// // ML observation: emit placement metrics.
/// // observe_kernel!(SubsystemId::Scheduler, SchedObs::TaskWoke,
/// // se.cgroup_id, latency_ns as i32, rq.base.nr_queued.count(), prev_cpu);
/// }
/// ```
///
/// **Error paths**: `place_entity()` cannot fail — all inputs are bounded by
/// the lag clamp and the `sysctl_sched_base_slice` range. Division by zero
/// is prevented by the `load == 0` guard (defensive; `sum_weight == 0`
/// implies an empty runqueue, and `place_entity` is called when enqueueing
/// into a non-empty queue).
pub fn place_entity(
rq: &mut EevdfRunQueue,
se: &mut EevdfTask,
flags: EnqueueFlags,
) { /* in sched/eevdf.rs */ }
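The claim that the inflation formula prevents lag evaporation can be verified with exact integers: after the task is placed at `V - inflated_lag` and its weight joins the queue, its lag relative to the new average equals the saved lag. A sketch with toy values chosen so every division is exact:

```rust
/// place_entity() step 3: inflated_lag = lag * (W + w) / W.
fn inflate_lag(lag: i64, w_queue: i64, w_task: i64) -> i64 {
    lag * (w_queue + w_task) / w_queue
}

fn main() {
    // Queue before wakeup: total weight W = 100, weighted average V = 1000.
    let (w_queue, v_avg) = (100i64, 1000i64);
    // Waking task: weight w = 50, saved (uninflated) lag = 10.
    let (w_task, lag) = (50i64, 10i64);

    let inflated = inflate_lag(lag, w_queue, w_task);
    assert_eq!(inflated, 15); // 10 * 150 / 100

    // Step 4 placement: vruntime = V - inflated_lag.
    let v_task = v_avg - inflated; // 985

    // New weighted average once the task's weight is counted:
    let v_avg_new = (w_queue * v_avg + w_task * v_task) / (w_queue + w_task);
    assert_eq!(v_avg_new, 995);

    // Effective lag after insertion equals the saved lag: no evaporation.
    assert_eq!(v_avg_new - v_task, lag);
}
```

Without the inflation (placing at `V - lag` directly), the same arithmetic yields an effective lag of `lag * W / (W + w)`, which shrinks on every sleep/wake cycle — the systematic bias the comment above describes.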
// ---------------------------------------------------------------------------
// update_curr — Core Accounting Loop
// ---------------------------------------------------------------------------
/// Update the currently running entity's virtual runtime, deadline, PELT,
/// and CBS budget. This is the most frequently called EEVDF function — it
/// runs on EVERY scheduler tick and EVERY voluntary dequeue.
///
/// **Classification**: Evolvable. The vruntime advance formula, PELT update,
/// CBS charge path, and preemption check are all policy code, hot-swappable
/// via `EvolvableComponent`. The `EevdfRunQueue` and `EevdfTask` struct
/// layouts are Nucleus (data). The invariant checker validates that any
/// replacement `update_curr()` preserves weight-proportional vruntime
/// advance and that `vruntime` is monotonically non-decreasing for the
/// currently running entity.
///
/// Linux equivalent: `update_curr()` in `fair.c`.
///
/// # Algorithm
///
/// ```text
/// fn update_curr(rq: &mut EevdfRunQueue) {
/// let curr = match rq.curr {
/// Some(c) => unsafe { c.as_mut() },
/// None => return, // No current entity (idle CPU).
/// };
///
/// // Step 1: Compute execution delta using the runqueue task clock.
/// // rq_clock_task(rq) is the runqueue-cached clock that excludes IRQ
/// // time, updated once per scheduler entry by update_rq_clock().
/// // This matches Linux's update_curr() which uses rq_clock_task(rq),
/// // NOT sched_clock() or ktime_get_ns(). Using the task clock avoids
/// // charging IRQ time to the running task's vruntime.
/// let now = rq_clock_task(rq);
/// let delta_exec = now as i64 - curr.exec_start as i64;
/// if delta_exec <= 0 { return; } // Clock went backwards / no time elapsed.
/// curr.exec_start = now;
///
/// // Step 1a: Accumulate wall-clock execution time (used by
/// // check_preempt_tick, getrusage, /proc/[pid]/sched).
/// curr.sum_exec_runtime += delta_exec as u64;
///
/// // Step 2: Skip vruntime accumulation for idle-marked tasks.
/// // See sched_idle_enter() / sched_idle_exit().
/// // NOTE: sum_exec_runtime continues accumulating during idle-spin
/// // (step 1a runs before this check) because the task IS consuming
/// // CPU time — getrusage() and /proc/[pid]/sched should reflect
/// // wall-clock CPU usage. Only vruntime and PELT are frozen.
/// if curr.sched_idle_marked.load(Relaxed) { return; }
///
/// // Step 3: Advance vruntime by wall-clock delta scaled by weight.
/// // NOTE: The avg_vruntime accumulators (`sum_w_vruntime`,
/// // `sum_weight`) are NOT updated here. They use curr's stale
/// // vruntime from the last enqueue/dequeue, not the updated
/// // value. This is correct (matches Linux): accumulator updates
/// // are performed on enqueue/dequeue via avg_vruntime_add/sub.
/// // Updating them on every tick would break the O(1) tick path
/// // (accumulator updates involve subtraction and re-addition of
/// // key * weight). The stale contribution is corrected when the
/// // entity is dequeued, and `avg_vruntime()` compensates for
/// // curr by subtracting the stale and adding the current value.
/// curr.vruntime += calc_delta_fair(delta_exec as u64, curr.weight);
///
/// // Step 3b: Propagate vruntime to ancestor GroupEntities (hierarchical
/// // EEVDF). Walk from the task's innermost cgroup up to the root cgroup,
/// // advancing each GroupEntity's vruntime in its parent's EEVDF tree by
/// // delta_exec_ns * NICE_0_WEIGHT / group_weight. All entities are on
/// // this CPU's runqueue — no cross-CPU locking needed.
/// // See [Section 7.2](#heterogeneous-cpu-scheduling--virtual-runtime-propagation).
/// // `group_entity_for(curr)` is an XArray lookup:
/// // `rq.group_entities.get(curr.cgroup_id)` -> Option<&GroupEntity>
/// // `propagate_group_vruntime()` is fully specified in
/// // [Section 7.2](#heterogeneous-cpu-scheduling--virtual-runtime-propagation):
/// // walks from innermost cgroup to root, advancing each GroupEntity's
/// // vruntime by calc_delta_fair(delta_ns, group_weight).
/// if let Some(ge) = group_entity_for(curr) {
/// propagate_group_vruntime(ge, delta_exec as u64);
/// }
///
/// // Step 4: Update deadline if slice expired. Returns true when the
/// // entity's deadline was renewed (meaning a reschedule check is needed).
/// let deadline_renewed = update_deadline(rq, curr);
///
/// // Step 5: Update PELT for the current entity (running=true, runnable=true).
/// curr.pelt.update(/* running */ true, /* runnable */ true, now, curr.weight as u64);
///
/// // Step 6: CBS budget charge (if task is in a CBS-guaranteed cgroup).
/// // See [Section 7.6](#cpu-bandwidth-guarantees--cbs-charge).
/// if let Some(server) = cbs_server_for(curr) {
/// cbs_charge(server, &cbs_config_for(curr), delta_exec as u64);
/// }
///
/// // Step 6b: cpu.max ceiling charge (if task's cgroup has cpu.max set).
/// // See [Section 7.6](#cpu-bandwidth-guarantees--cpumax-ceiling-enforcement-bandwidth-throttling).
/// // `charge_cpu_max()` accepts nanoseconds (matching CBS budget_remaining_ns
/// // to avoid the persistent truncation bias of `delta_exec / 1000`).
/// if let Some(bw) = cpu_bandwidth_for(curr) {
/// if bw.quota_ns != u64::MAX {
/// charge_cpu_max(curr, bw, delta_exec as u64);
/// }
/// }
///
/// // Step 7: Fast path — only curr on the rq, no preemption needed.
/// // UmkaOS nr_queued excludes curr, so "no waiters" == "only curr running".
/// if !rq.base.nr_queued.has_waiters() { return; }
///
/// // Step 8: Preemption check. Request reschedule if the current entity's
/// // deadline expired or it has exhausted its protected slice.
/// if deadline_renewed || !protect_slice(curr) {
/// resched_curr(rq, ReschedUrgency::Lazy);
/// }
///
/// // ML observation: emit per-tick scheduling metrics (gated by static key).
/// // observe_kernel!(SubsystemId::Scheduler, SchedObs::UpdateCurr,
/// // delta_exec, curr.vruntime, rq.base.nr_queued.count());
/// }
/// ```
///
/// **Error paths**: `update_curr()` cannot fail. All arithmetic is bounded:
/// `delta_exec` is clamped to non-negative by the guard in Step 1,
/// `calc_delta_fair` uses u128 intermediate to prevent overflow, and PELT
/// sums are bounded by `LOAD_AVG_MAX`. CBS charge uses saturating arithmetic.
///
/// **Locking**: Called with the local CPU's `rq.lock` held. All fields
/// accessed are either local to the current CPU's runqueue (no contention)
/// or use interior mutability (AtomicXX fields in `CbsCpuServer`).
pub fn update_curr(rq: &mut EevdfRunQueue) { /* in sched/eevdf.rs */ }
// ---------------------------------------------------------------------------
// update_deadline — Slice Expiry
// ---------------------------------------------------------------------------
/// Update the entity's virtual deadline when its slice expires.
///
/// **Precondition**: Called only from `update_curr()` after `se.vruntime`
/// has been advanced by `calc_delta_fair(delta_exec, se)`. Must not be
/// called independently — the `vruntime >= vdeadline` check assumes
/// vruntime reflects the entity's current accumulated execution.
///
/// Called by `update_curr()` when `se.vruntime >= se.vdeadline`. Assigns a
/// new slice and recomputes the deadline from the current vruntime.
///
/// Linux equivalent: `update_deadline()` in `fair.c`.
///
/// Returns `true` if a new deadline was assigned (slice expired), `false` if
/// the entity's current deadline has not yet been reached.
///
/// ```text
/// fn update_deadline(rq: &EevdfRunQueue, se: &mut EevdfTask) -> bool {
/// if se.vruntime < se.vdeadline { return false; }
///
/// if !se.custom_slice {
/// let weight_scale = PARAM_STORE.get(ParamId::SchedEevdfWeightScale)
/// .map_or(100, |p| p.current.load(Relaxed));
/// se.slice_ns = SYSCTL_SCHED_BASE_SLICE * weight_scale as u64 / 100;
/// }
/// se.vdeadline = se.vruntime + calc_delta_fair(se.slice_ns, se.weight);
///
/// // Update weighted average virtual runtime after deadline renewal.
/// // Linux `update_deadline()` calls `avg_vruntime(cfs_rq)` here to
/// // rebase zero_vruntime after the entity's vruntime has advanced past
/// // the old deadline. Without this call, entity_key() deltas grow,
/// // reducing precision of the division-free eligibility check.
/// avg_vruntime(rq);
///
/// // Update the protect slice threshold.
/// // NOTE: UmkaOS deliberately refreshes vprot on every slice expiry
/// // (via update_deadline()), NOT only at context-switch time as Linux
/// // does (Linux calls set_protect_slice() from set_next_entity()).
/// // Rationale: per-slice-expiry refresh provides tighter protection
/// // tracking for long-running tasks that span many slices without
/// // being context-switched out. The formula also diverges from Linux
/// // — see set_protect_slice() documentation below.
/// set_protect_slice(se);
/// true
/// }
/// ```
pub fn update_deadline(rq: &EevdfRunQueue, se: &mut EevdfTask) -> bool {
/* in sched/eevdf.rs */
}
/// Set the protected slice threshold for an entity. `pick_eevdf()` retains
/// `curr` if `curr.vruntime < curr.vprot`, preventing excessive preemption.
///
/// **UmkaOS-original improvement over Linux.**
///
/// **Linux** (`kernel/sched/fair.c set_protect_slice()`, called from
/// `set_next_entity()`): with `RUN_TO_PARITY` enabled (default-true per
/// `features.h`: `SCHED_FEAT(RUN_TO_PARITY, true)`):
/// 1. `slice = cfs_rq_min_slice(cfs_rq)` -- the minimum slice of all
/// enqueued entities.
/// 2. `slice = min(slice, se->slice)`.
/// 3. If `slice != se->slice` (i.e., a shorter-slice entity exists on the
/// runqueue), compute `vprot = min(se->deadline, se->vruntime +
/// calc_delta_fair(slice, se))`.
/// 4. If `slice == se->slice` (the common case when all tasks use the
/// default base slice), `vprot = se->deadline` -- the FULL virtual
/// slice is protected.
/// Without `RUN_TO_PARITY`: `slice = min(base_slice_ns, se->slice)`, which
/// similarly protects the full slice for default-slice tasks.
///
/// **UmkaOS**: configurable fraction (default 50%) of each task's virtual
/// slice. This makes UmkaOS **more preemptive** than Linux under default
/// settings for same-slice tasks. The tradeoff is deliberate: server
/// throughput workloads benefit from shorter protection windows, and the
/// fraction is ML-tunable per-cgroup.
///
/// Tradeoff: Linux's full-deadline protection for same-slice tasks provides
/// complete run-to-completion guarantees within a slice. UmkaOS's fractional
/// protection enables more frequent preemption opportunities, reducing
/// tail latency at the cost of throughput. The fraction is tunable:
/// `pct=20` for latency-sensitive workloads, `pct=70` for throughput.
///
/// **Tunability (two levels)**:
/// 1. **Boot parameter**: `umka.sched.protect_pct=50` (system-wide default,
/// range 10-90). Sysadmin sets once: 70 for database clusters, 20 for
/// latency-sensitive trading systems.
/// 2. **ML-tunable per-cgroup**: `ParamId::SchedProtectFraction` (default
/// from boot param, range [10, 90]). ML framework adapts per-cgroup at
/// runtime: batch cgroups get 70% (more throughput), latency-sensitive
/// cgroups get 20% (more responsive).
///
/// **Classification**: Evolvable. The fraction is policy, hot-swappable.
///
/// ```text
/// fn set_protect_slice(se: &mut EevdfTask) {
/// let pct = PARAM_STORE.get(ParamId::SchedProtectFraction)
/// .map_or(BOOT_PROTECT_PCT.load(Relaxed), |p| p.current.load(Relaxed));
/// // Clamp to [10, 90] to prevent degenerate behavior.
/// let pct = pct.clamp(10, 90);
/// let vslice = calc_delta_fair(se.slice_ns, se.weight);
/// let protect_vtime = vslice * pct as u64 / 100;
/// se.vprot = se.vdeadline - protect_vtime;
/// }
/// ```
///
/// Default: `BOOT_PROTECT_PCT = 50` (protect half the virtual slice).
/// At `pct=50`, a nice-0 task with 750μs slice has ~375μs of protection
/// (virtual time). A nice -20 task's proportionally smaller vslice gives
/// proportionally smaller but still meaningful protection. Setting `pct=20`
/// approaches Linux-like behavior (short protection, fast preemption).
fn set_protect_slice(se: &mut EevdfTask) { /* in sched/eevdf.rs */ }
// ---------------------------------------------------------------------------
// pick_eevdf — Eligible Earliest Virtual Deadline First Selection
// ---------------------------------------------------------------------------
/// Select the eligible entity with the earliest virtual deadline in O(log n).
///
/// **Classification**: Evolvable (replaceable). The tie-breaking logic,
/// PICK_BUDDY optimization, and protect_slice check are policy decisions
/// that can be hot-swapped. The tree walk structure and eligibility check
/// are Nucleus-correct.
///
/// `tasks_timeline` is a single intrusive augmented RB-tree keyed by
/// `vdeadline`. Each `EevdfRbLink` caches `min_vruntime` — the minimum
/// `vruntime` in that subtree. The walk uses `vruntime_eligible()` on
/// subtree `min_vruntime` fields to prune ineligible subtrees.
///
/// Linux equivalent: `__pick_eevdf()` in `fair.c`.
///
/// # Algorithm
///
/// ```text
/// fn pick_eevdf(rq: &mut EevdfRunQueue) -> Option<&EevdfTask> {
/// let mut node = rq.base.tasks_timeline.root();
/// let se = pick_first_entity(rq); // leftmost (earliest deadline)
/// let mut curr = rq.curr;
/// let mut best: Option<&EevdfTask> = None;
///
/// // Fast path: single entity (only curr, no waiters in tree).
/// if !rq.base.nr_queued.has_waiters() {
/// return match curr {
/// Some(c) if unsafe { c.as_ref() }.on_rq != OnRqState::Off =>
/// Some(unsafe { c.as_ref() }),
/// _ => se,
/// };
/// }
///
/// // PICK_BUDDY: prefer cache-affinity wakeup buddy if eligible.
/// // This is an Evolvable heuristic — can be disabled or replaced.
/// // Linux WARN_ON_ONCE(cfs_rq->next->sched_delayed): the buddy is set
/// // during wakeup, so it must not be delayed. UmkaOS uses debug_assert.
/// if let Some(next) = rq.next {
/// let next_se = unsafe { next.as_ref() };
/// debug_assert!(!next_se.sched_delayed, "buddy must not be sched_delayed");
/// if entity_eligible(rq, next_se) {
/// return Some(next_se);
/// }
/// }
///
/// // Filter curr: only consider if on_rq and eligible.
/// let curr_ref = curr.and_then(|c| {
/// let se = unsafe { c.as_ref() };
/// if se.on_rq != OnRqState::Off && entity_eligible(rq, se) {
/// Some(se)
/// } else {
/// None
/// }
/// });
///
/// // protect_slice: if curr has not exhausted its minimum quantum, retain it.
/// // This prevents excessive preemption with very short slices.
/// // Evolvable — the threshold can be tuned by ML policy.
/// if let Some(c) = curr_ref {
/// if c.vruntime < c.vprot {
/// return Some(c);
/// }
/// }
///
/// // Leftmost shortcut: if the earliest-deadline entity is eligible, pick it.
/// if let Some(leftmost) = se {
/// if entity_eligible(rq, leftmost) {
/// best = Some(leftmost);
/// }
/// }
///
/// // Tree walk: find the earliest-deadline eligible entity.
/// // The walk is ITERATIVE (not recursive) — UmkaOS improvement over
/// // potential stack depth issues.
/// if best.is_none() {
/// while let Some(n) = node {
/// let left = n.left;
///
/// // If left subtree contains eligible entities, descend left
/// // (earlier deadlines are always better).
/// if let Some(l) = left {
/// if vruntime_eligible(rq, l.min_vruntime) {
/// node = Some(l);
/// continue;
/// }
/// }
///
/// // Left subtree empty or no eligible entities. Check current node.
/// let node_se = task_from_link(n);
/// if entity_eligible(rq, node_se) {
/// best = Some(node_se);
/// break;
/// }
///
/// // Current node not eligible. Try right subtree.
/// node = n.right;
/// }
/// }
///
/// // Compare tree result against curr. If curr has an earlier deadline
/// // than best, prefer curr (it is already running — no context switch).
/// match (best, curr_ref) {
/// (None, c) => c,
/// (Some(b), Some(c)) if entity_before(c, b) => Some(c),
/// (b, _) => b,
/// }
/// }
/// ```
///
/// **`entity_before`**: Compares two entities by deadline (lower = earlier).
/// Tie-breaking by TaskId for determinism.
/// ```text
/// fn entity_before(a: &EevdfTask, b: &EevdfTask) -> bool {
/// a.vdeadline < b.vdeadline
/// || (a.vdeadline == b.vdeadline && a.task_id < b.task_id)
/// }
/// ```
///
/// # Invariant maintenance
///
/// Every tree mutation (insert, delete, rotation) must call
/// `recompute_augmented()` on affected nodes bottom-up. Violating this
/// causes `pick_eevdf()` to return a suboptimal or incorrect result.
///
/// **ML observation point**: After pick, emit task selection metrics:
/// ```text
/// observe_kernel!(SubsystemId::Scheduler, SchedObs::PreemptionEvent,
/// picked_task_prio, preempted_task_prio, cgroup_id);
/// ```
pub fn pick_eevdf(rq: &mut EevdfRunQueue) -> Option<&EevdfTask> {
/* in sched/eevdf.rs */
}
7.1.3 Key Properties¶
- Preemptible locks by default: Mutexes and rwlocks are always sleeping locks with
priority inheritance. Under `PreemptionModel::Realtime` (Section 8.4), spinlocks also become sleeping locks (RT-safe). Under `Voluntary` and `Full` preemption modes, `SpinLock` is a true spinlock that disables preemption for its critical section — but all spinlock-protected critical sections are bounded and O(1) in duration. Per-CPU data is protected by short IRQ-disabling guards (`PerCpuMutGuard`) that hold for bounded durations only — never across blocking operations. There are no unbounded preemption-disabled regions.
- NUMA-aware load balancing: The load balancer models migration cost (cache invalidation, memory latency) and only migrates tasks when the imbalance exceeds the migration cost. The topology hierarchy is architecture-dependent:
- x86-64: Core → LLC (L3) → Package → NUMA node (2-3 levels typical)
- AArch64: Core → Cluster (DynamIQ/big.LITTLE) → Package → NUMA node
- PPC64LE: Thread → Core → Chip → Drawer → Node (POWER9/10 can have 4 levels)
- s390x: Thread → Core → Book → Drawer → Node (5 scheduling domain levels — z15 up to 190 PUs, z16 up to 200 PUs; migration cost increases sharply at each boundary due to private L2/L3/L4 caches and interconnect hop penalties)
- RISC-V: Core → Cluster → NUMA node (DT-described, varies per SoC)
- LoongArch64: Core → Package → NUMA node (3A5000/6000 topology)
The scheduler builds a SchedDomain hierarchy at boot from firmware topology data
(ACPI SRAT/SLIT on x86/ARM server, DT on embedded/RISC-V, STSI instruction on s390x).
Each level has a migration cost threshold — work stealing only crosses a boundary when
the imbalance exceeds that level's threshold. On s390x, the 5-level topology means the
scheduler must distinguish intra-book steals (fast, shared L4) from inter-drawer steals
(slow, remote memory), using the topology distance from STSI to set per-level migration
cost thresholds.
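The per-boundary gating described above can be sketched as follows. `DomainLevelView`, the level names, and the threshold values are illustrative assumptions; the real thresholds are derived from firmware topology distances (SRAT/SLIT, DT, STSI) at boot, not hard-coded:

```rust
/// One level of the SchedDomain hierarchy (illustrative sketch).
struct DomainLevelView {
    name: &'static str,
    /// Minimum task imbalance required to migrate across this boundary,
    /// set from firmware topology distance at boot.
    migration_threshold: u32,
}

/// A steal that crosses several boundaries must clear every threshold
/// on the path (equivalently, the largest one).
fn steal_allowed(imbalance: u32, crossed: &[DomainLevelView]) -> bool {
    crossed.iter().all(|l| imbalance > l.migration_threshold)
}

fn main() {
    let path = [
        DomainLevelView { name: "LLC", migration_threshold: 0 },
        DomainLevelView { name: "NUMA node", migration_threshold: 2 },
    ];
    // A 2-task imbalance does not exceed the NUMA threshold (default 2),
    // so the cross-node steal is rejected; a 3-task imbalance clears it.
    assert!(!steal_allowed(2, &path));
    assert!(steal_allowed(3, &path));
    println!("ok");
}
```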
- Per-CPU run queues: No global run queue lock. Each CPU manages its own queues
independently. Each per-CPU run queue is protected by a per-CPU spinlock (`rq.lock`).
- Work stealing: Idle CPUs steal tasks from busy CPUs at low frequency (~4ms interval)
to avoid thundering-herd effects. The work stealing algorithm is specified below.
Target CPU selection. When a CPU goes idle and its local run queue is empty, it initiates a steal attempt. The target CPU is selected as follows:
- Same-NUMA-node preference: Scan CPUs within the same NUMA node first. Cross-node steals incur higher migration cost (remote memory latency, cache invalidation of NUMA-local pages) and are only attempted if no same-node candidate has stealable work.
- Highest load first: Among candidate CPUs, prefer the one with the highest run queue load (measured as `rq.eevdf.base.task_count`). This maximizes the probability that the target can spare a task without becoming underloaded itself.
- Cache topology tiebreak: When multiple candidates have equal load, prefer the CPU that shares the closest cache level (L2 > L3 > cross-package). Tasks migrated within a shared cache domain retain warm cache lines, reducing post-migration stall cycles.
- Cross-node fallback: If no same-node CPU has more than one runnable task, scan remote NUMA nodes in distance order (nearest first, using SLIT/SRAT distances from firmware). The migration cost threshold is higher for cross-node steals — the load imbalance must exceed `NUMA_MIGRATION_THRESHOLD` (default: 2 tasks) to justify the cross-node penalty.
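The candidate ordering above can be sketched as below; the `Candidate` struct and its field names are illustrative assumptions, not the actual runqueue types:

```rust
/// Illustrative steal-candidate view (not the real runqueue type).
struct Candidate {
    cpu: u32,
    same_node: bool,
    /// Approximate runnable count, read lock-free from the target.
    task_count: u32,
    /// Cache closeness tiebreak: 0 = shared L2, 1 = shared L3,
    /// 2 = cross-package.
    cache_distance: u8,
}

/// Order candidates: same-NUMA-node first, then highest load, then
/// closest shared cache. CPUs that cannot spare a task are filtered.
fn pick_steal_target(mut cands: Vec<Candidate>) -> Option<u32> {
    cands.retain(|c| c.task_count > 1);
    cands.sort_by(|a, b| {
        b.same_node
            .cmp(&a.same_node)
            .then(b.task_count.cmp(&a.task_count))
            .then(a.cache_distance.cmp(&b.cache_distance))
    });
    cands.first().map(|c| c.cpu)
}

fn main() {
    let cands = vec![
        // Only one runnable task: nothing to spare, filtered out.
        Candidate { cpu: 1, same_node: true, task_count: 1, cache_distance: 0 },
        Candidate { cpu: 2, same_node: true, task_count: 3, cache_distance: 1 },
        // Remote node: loses despite higher load.
        Candidate { cpu: 7, same_node: false, task_count: 5, cache_distance: 2 },
    ];
    assert_eq!(pick_steal_target(cands), Some(2));
    println!("ok");
}
```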
Lock-free load observation: Each VruntimeTree (embedded in EevdfRunQueue) exposes an
atomic `task_count` counter whose loads and stores use `Relaxed` ordering — an exact
count is not required; approximate load is sufficient for steal decisions. The stealing
CPU reads `task_count` atomically without acquiring the target runqueue's lock. This
gives a snapshot that may be 1-2 operations stale, which is acceptable: the goal is
finding a CPU with available work, not exact balance.
False positive handling: If the stealing CPU reads a non-zero `task_count` but finds
no stealable task after acquiring the target lock (due to an intervening dequeue), the
probe still counts as one steal attempt.
Steal attempt limit: After `STEAL_ATTEMPT_LIMIT = 4` failed candidates (same-NUMA
probes first, cross-NUMA second), the idle CPU enters the halted state via
`cpu::halt()` and awaits the next IPI or timer tick.
Task selection. From the target CPU's EEVDF `tasks_timeline` tree, steal the eligible task with the
largest `vdeadline` (rightmost node in the tree). Rationale: the task with the largest
`vdeadline` is the one furthest from being scheduled next on the source CPU — it has the
most remaining virtual runtime before its next turn. Stealing it causes the least
disruption to the source CPU's fairness invariants and avoids stealing a task that was
about to run (which would waste the source CPU's scheduling decision). The stolen task's
`vlag` is preserved; its `vruntime` is adjusted relative to the
destination run queue's `zero_vruntime` to maintain fairness on the new CPU.
RT and deadline tasks are not stolen by the normal work-stealing path. RT task migration uses a separate push/pull mechanism triggered by RT priority changes (see Section 8.4).
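A minimal sketch of the `zero_vruntime` re-basing on migration; the struct shapes are illustrative assumptions (the real task and runqueue types are defined elsewhere in this chapter):

```rust
/// Illustrative view of a stolen task's fairness state.
struct StolenTask {
    vruntime: u64,
    /// Virtual lag: carried over unchanged across migration.
    vlag: i64,
}

/// Re-express the stolen task's vruntime relative to the destination
/// runqueue's zero point. Wrapping arithmetic mirrors kernel vruntime
/// math, which treats vruntime deltas as mod-2^64 quantities.
fn rebase_on_steal(t: &mut StolenTask, src_zero: u64, dst_zero: u64) {
    let offset = t.vruntime.wrapping_sub(src_zero); // progress past src zero
    t.vruntime = dst_zero.wrapping_add(offset);
    // t.vlag is deliberately untouched: the lag is preserved.
}

fn main() {
    let mut t = StolenTask { vruntime: 1_500, vlag: -42 };
    // 500 ns of progress past the source zero point (1_000) becomes
    // 500 ns past the destination zero point (5_000).
    rebase_on_steal(&mut t, 1_000, 5_000);
    assert_eq!(t.vruntime, 5_500);
    assert_eq!(t.vlag, -42);
    println!("ok");
}
```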
Lock ordering. The work stealer must hold two run queue locks simultaneously
(source and destination). Deadlock prevention is enforced at compile time via
`lock_two_runqueues()` (see below). However, the idle CPU also uses a trylock with
exponential backoff strategy to avoid priority inversion:
- The idle CPU acquires its own run queue lock first (guaranteed success — it is local).
- It calls `trylock()` on the target CPU's run queue lock. If the lock is contended (the target CPU is in a scheduling critical section), the steal attempt is abandoned for this cycle rather than spinning.
- On `trylock()` failure, the idle CPU backs off: it doubles the steal retry interval (from the base 4ms up to a cap of 32ms) and re-enters the idle loop. The backoff resets to 4ms on a successful steal or when a local wake-up occurs.
This trylock approach ensures the work stealer never blocks a busy CPU's scheduler
path. In practice, trylock succeeds on the first attempt >95% of the time because
run queue critical sections are bounded and short (O(1), typically < 1 microsecond).
Maximum steal count. Each steal attempt moves exactly 1 task. Stealing multiple tasks per attempt risks over-correcting the load imbalance and causing ping-pong migration between CPUs. The periodic 4ms steal interval provides natural convergence: a 4-task imbalance resolves in ~16ms (4 steal cycles), which is well within acceptable load-balancing latency for non-RT workloads.
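The retry policy (4ms base interval, doubling to a 32ms cap, reset on success) can be sketched as follows; the type and method names are illustrative, not the kernel's actual API:

```rust
const BASE_STEAL_INTERVAL_MS: u64 = 4;
const MAX_STEAL_INTERVAL_MS: u64 = 32;

/// Per-CPU steal retry interval with exponential backoff (sketch).
struct StealBackoff {
    interval_ms: u64,
}

impl StealBackoff {
    fn new() -> Self {
        Self { interval_ms: BASE_STEAL_INTERVAL_MS }
    }
    /// trylock on the target runqueue failed: double the interval,
    /// capped at 32 ms.
    fn on_trylock_failure(&mut self) {
        self.interval_ms = (self.interval_ms * 2).min(MAX_STEAL_INTERVAL_MS);
    }
    /// Successful steal or local wake-up: reset to the base interval.
    fn reset(&mut self) {
        self.interval_ms = BASE_STEAL_INTERVAL_MS;
    }
}

fn main() {
    let mut b = StealBackoff::new();
    b.on_trylock_failure(); // 8 ms
    b.on_trylock_failure(); // 16 ms
    b.on_trylock_failure(); // 32 ms (cap reached)
    b.on_trylock_failure(); // stays at 32 ms
    assert_eq!(b.interval_ms, 32);
    b.reset();
    assert_eq!(b.interval_ms, 4);
    println!("ok");
}
```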
- Run queue lock ordering — compile-time enforced: The load balancer (work stealing)
acquires remote run queue locks. ABBA deadlock prevention is type-enforced, not a
runtime convention. All run queue locks share `RQ_LOCK_LEVEL = 50` in the compile-time lock hierarchy (Section 3.4). The `SpinLock<_, LEVEL=50>` type prevents a caller holding one level-50 lock from acquiring a second one directly. The only legal path to holding two run queue locks simultaneously is `lock_two_runqueues(rq_a, rq_b)`, which always acquires the lock for the lower CPU ID first — making CPU-ID-ordered acquisition the sole valid code path rather than a convention that code review must uphold. The load balancer never holds more than two run queue locks simultaneously. Cross-subsystem lock ordering (innermost last): `TASK_LOCK (level 20) < PI_LOCK (level 45) < RQ_LOCK (level 50)`. This means a caller holding RQ_LOCK must NOT acquire TASK_LOCK or PI_LOCK.
- Real-time guarantees: Dedicated RT cores can be reserved (isolcpus equivalent). Threaded interrupts ensure deterministic scheduling latency.
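A sketch of CPU-ID-ordered double acquisition, using `std::sync::Mutex` as a stand-in for the level-50 runqueue `SpinLock` (the real function returns typed guards from the lock-hierarchy machinery):

```rust
use std::sync::{Mutex, MutexGuard};

/// Stand-in for a per-CPU runqueue; the Mutex models the level-50 lock.
struct Rq {
    cpu_id: u32,
    lock: Mutex<u64>,
}

/// Acquire two runqueue locks in global CPU-ID order. Because every
/// caller locks the lower CPU ID first, two CPUs contending for the
/// same pair can never hold the locks in opposite orders (no ABBA
/// deadlock). Guards are returned in argument order.
fn lock_two_runqueues<'a>(
    a: &'a Rq,
    b: &'a Rq,
) -> (MutexGuard<'a, u64>, MutexGuard<'a, u64>) {
    assert_ne!(a.cpu_id, b.cpu_id, "a runqueue cannot be locked twice");
    if a.cpu_id < b.cpu_id {
        let ga = a.lock.lock().unwrap();
        let gb = b.lock.lock().unwrap();
        (ga, gb)
    } else {
        let gb = b.lock.lock().unwrap();
        let ga = a.lock.lock().unwrap();
        (ga, gb)
    }
}

fn main() {
    let rq0 = Rq { cpu_id: 0, lock: Mutex::new(0) };
    let rq3 = Rq { cpu_id: 3, lock: Mutex::new(3) };
    // Either argument order acquires rq0's lock first internally;
    // guards still come back in argument order.
    let (g_a, g_b) = lock_two_runqueues(&rq3, &rq0);
    assert_eq!((*g_a, *g_b), (3, 0));
    drop((g_a, g_b));
    let (g_a, g_b) = lock_two_runqueues(&rq0, &rq3);
    assert_eq!((*g_a, *g_b), (0, 3));
    println!("ok");
}
```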
- CPU frequency/power: Integration with cpufreq governors for power management.
- Steal time accounting (paravirtualized guests): Each VCPU has a `steal_time` field in a shared memory page mapped by the hypervisor. On x86, KVM writes cumulative stolen nanoseconds to the page registered via `MSR_KVM_STEAL_TIME` (0x4b564d03). On AArch64, the PV steal time structure is registered via the `SMCCC_HV_PV_SCHED_FEATURES` hypercall (SMCCC v1.1+). The guest scheduler reads this on each timer tick:
/// Per-CPU steal time tracking. Updated on each scheduler tick.
pub struct StealTimeAccounting {
/// Last observed cumulative steal value (nanoseconds) from the
/// hypervisor-mapped shared page.
pub last_steal_ns: u64,
/// Running total of stolen time for /proc/stat reporting.
pub total_steal_ns: u64,
}
/// Called from scheduler_tick() on paravirtualized guests.
/// Returns the stolen nanoseconds since the last call.
fn update_steal_time(cpu: &mut CpuLocalBlock) -> u64 {
    let current_steal = read_pv_steal_time(); // arch-specific read of the shared page
    // saturating_sub guards against a counter reset (e.g. after VM migration).
    let delta = current_steal.saturating_sub(cpu.steal_acct.last_steal_ns);
    cpu.steal_acct.last_steal_ns = current_steal;
    cpu.steal_acct.total_steal_ns += delta;
    delta
}
Stolen time is accumulated in `total_steal_ns` for /proc/stat reporting
but is NOT subtracted from `delta_exec` for vruntime advancement. The
`rq_clock_task(rq)` time base used by `update_curr()` already excludes IRQ
time, and steal time is accounted separately through `account_steal_time()`
called from `timer_tick_handler()` (upstream of `scheduler_tick()`). The
vruntime advance formula (`curr.vruntime += calc_delta_fair(delta_exec, weight)`)
uses the task clock delta directly — no steal adjustment is needed or applied.
This matches Linux's behavior, where steal time affects /proc/stat but not
the CFS vruntime calculation.
Reported via /proc/stat as the steal column (field 9, in USER_HZ ticks).
Per-task steal attribution is not tracked (steal is a per-CPU phenomenon, not
a per-task one); the /proc/stat value is the sum across all CPUs.
On bare metal (no hypervisor), `read_pv_steal_time()` always returns 0 (the MSR
is not registered, so no shared page exists). The check is a single branch on a
per-CPU boolean (`pv_steal_enabled`), which is false on bare metal — zero overhead.
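For illustration, the nanosecond-to-tick conversion behind the /proc/stat steal column, assuming the common `USER_HZ = 100` (the actual value is a build-time constant):

```rust
const USER_HZ: u64 = 100; // assumption: the common configuration
const NSEC_PER_SEC: u64 = 1_000_000_000;

/// Convert accumulated steal nanoseconds to USER_HZ ticks for the
/// /proc/stat steal column (field 9). At USER_HZ = 100 one tick is
/// 10 ms, so this is a division by 10_000_000.
fn steal_ns_to_user_hz_ticks(total_steal_ns: u64) -> u64 {
    total_steal_ns / (NSEC_PER_SEC / USER_HZ)
}

fn main() {
    // 250 ms of accumulated steal time reports as 25 ticks.
    assert_eq!(steal_ns_to_user_hz_ticks(250_000_000), 25);
    assert_eq!(steal_ns_to_user_hz_ticks(0), 0);
    println!("ok");
}
```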
7.1.4 Scheduler Classes¶
The scheduler is modular. Each scheduling class implements a standard interface:
/// Documentation-only trait defining the per-class method signatures.
///
/// UmkaOS uses enum-based class dispatch (`match` on `SchedPolicy`) rather than
/// trait objects. The `SchedClassOps` trait exists only as a documentation aid
/// for the per-class method signatures; it is not used for dynamic dispatch.
/// See the `SchedClass` enum and dispatch mechanism below.
pub trait SchedClassOps: Send + Sync {
/// Enqueue a task onto this class's run queue. Task-mutable fields
/// (vruntime, deadline, on_rq_state) use interior mutability; the
/// run queue lock provides mutual exclusion.
fn enqueue_task(&mut self, task: &Task, flags: EnqueueFlags);
fn dequeue_task(&mut self, task: &Task, flags: DequeueFlags);
fn pick_next_task(&mut self, cpu: CpuId) -> Option<&Task>;
fn check_preempt(&self, current: &Task, incoming: &Task) -> bool;
fn task_tick(&mut self, task: &Task, cpu: CpuId, queued: bool);
fn balance(&mut self, cpu: CpuId, flags: BalanceFlags) -> BalanceResult;
}
Classes are checked in priority order: Deadline > RT > EEVDF. The highest-priority class with a runnable task wins.
pick_next_task() dispatch algorithm. The per-CPU scheduler entry point traverses
scheduling classes in strict priority order. Each class's pick_next_task is called at
most once; the first class that returns a runnable task wins. This is O(1) in the number
of classes (three fixed classes, not a dynamic list).
/// Select the highest-priority runnable task on this CPU's run queue.
///
/// Called from the scheduler core on every context switch, timer tick
/// preemption, and explicit `schedule()` invocation. The caller holds
/// `rq.lock` for the local CPU.
///
/// # Priority order
///
/// 1. **Deadline (CBS)** — tasks with active bandwidth reservations and
/// unexpired deadlines. Scheduled earliest-deadline-first within the
/// CBS server.
/// 2. **RT (FIFO / RR)** — real-time tasks. FIFO tasks run until they
/// yield or block; RR tasks rotate within their priority level on
/// each time slice expiry.
/// 3. **EEVDF (normal)** — the eligible task with the smallest virtual
/// deadline (leftmost eligible node in `tasks_timeline`).
///
/// If all three classes are empty, the CPU enters the idle task — a
/// per-CPU kernel thread that executes the architecture's halt/wait
/// instruction (`hlt` on x86, `wfi` on ARM, `wfi` on RISC-V) until
/// the next interrupt.
fn pick_next_task(rq: &mut RunQueueData) -> &Task {
// 1. Deadline class: highest priority. CBS tasks with active
// reservations whose deadline has not yet expired are checked
// first. `dl.pick_next_task()` returns the task with the
// earliest absolute deadline (EDF within the CBS server).
if let Some(task) = rq.dl.pick_next_task() {
return task;
}
// 2. RT class: FIFO and RR tasks. The highest-priority RT task
// is returned. Within a priority level, FIFO tasks are ordered
// by arrival time; RR tasks rotate on slice expiry.
if let Some(task) = rq.rt.pick_next_task() {
return task;
}
// 3. CBS group servers: cgroups with cpu.guarantee
// ([Section 7.6](#cpu-bandwidth-guarantees)). Iterate CBS servers on this
// CPU in EDF order (earliest deadline first). For each server
// with remaining budget and runnable tasks, pick its next EEVDF
// task. This ensures guaranteed groups receive their reserved
// bandwidth ahead of non-guaranteed EEVDF tasks.
//
// cpu.max interaction (GAP-22): if the cgroup is also
// max-throttled (cpu.max quota exhausted), skip this CBS server
// even if guarantee budget remains. cpu.max is always the hard
// ceiling; cpu.guarantee cannot override it.
//
// RT/DL tasks in CBS-guaranteed cgroups bypass the CBS server
// entirely — they are picked in steps 1-2 above and do not
// consume CBS budget.
// Linear scan of CBS servers on this CPU to find the one with the
// earliest deadline. XArray provides O(1) lookup by cgroup ID but
// does NOT support deadline-ordered iteration natively. The linear
// scan is O(N) where N = number of CBS-guaranteed cgroups on this
// CPU (typically 1-5, max ~20). See [Section 7.6](#cpu-bandwidth-guarantees)
// for the authoritative specification and optimization threshold.
for server in rq.cbs_servers.values() {
if server.throttled.load(Acquire) { continue; }
if server.max_throttled.load(Acquire) { continue; }
if let Some(task) = server.pick_next_eevdf_task() {
return task;
}
}
// 4. EEVDF class: normal (SCHED_NORMAL / SCHED_BATCH) tasks
// WITHOUT cpu.guarantee. Returns the eligible task with the
// smallest vdeadline from `tasks_timeline` via the augmented
// `pick_eevdf()` walk that prunes ineligible subtrees.
//
// Deferred-entity handling: `pick_eevdf()` does NOT filter
// `sched_delayed` entities (matching Linux `fair.c pick_eevdf()`).
// If a deferred entity is picked, dequeue it without wake-up and
// retry. This matches Linux `pick_task_fair()` behavior.
loop {
match rq.eevdf.pick_next_task() {
Some(task) if task.eevdf.sched_delayed => {
// Deferred entity picked: vlag decayed to zero.
// Remove from tree without wake-up, then retry.
dequeue_entity(rq, task, DEQUEUE_SLEEP);
continue;
}
Some(task) => return task,
None => break,
}
}
// 5. All classes empty: return the per-CPU idle task.
// The idle task is always runnable and never enqueued in any
// scheduling class. It is a sentinel — the run queue is never
// truly "empty" because the idle task is always available.
&*rq.idle_task
}
Per-CPU run queue interaction. Each CPU calls pick_next_task() independently on
its own RunQueueData while holding the local run queue lock. There is no cross-CPU
coordination in the pick path — load balancing and work stealing (Section 7.1.3) are
separate, asynchronous operations that move tasks between run queues. This ensures
the scheduling hot path is lock-local and O(1) in the number of CPUs.
Idle task behavior. The idle task is a statically allocated per-CPU kernel thread
that does not participate in any scheduling class. When selected, it:
1. Checks for pending softirqs (Section 3.8)
and processes them before halting.
2. Invokes the cpuidle governor to select the deepest safe C-state
(Section 7.4) based on expected idle duration and latency constraints.
3. Executes the architecture halt instruction. The CPU remains halted until the next
interrupt (timer tick, IPI from work stealing, device interrupt).
4. On wake, the idle task immediately calls pick_next_task() again — it never
"runs" application logic.
7.1.4.1 eevdf_task_tick — Per-Tick Accounting for EEVDF Tasks¶
The EEVDF implementation of SchedClassOps::task_tick(). Called on every timer
interrupt for the currently running CFS/EEVDF task. This is the single most
important function in the scheduler — it runs on every tick on every CPU that
has a running EEVDF task.
Classification: Evolvable. All runtime code (vruntime advance, PELT update,
CBS charge, preemption check) is policy, hot-swappable via EvolvableComponent.
The VruntimeTree, EevdfRunQueue, and EevdfTask struct layouts are Nucleus (data).
/// EEVDF implementation of `task_tick`. Called from the timer interrupt
/// handler on every scheduling tick (1-4 ms depending on `CONFIG_HZ`).
///
/// # Arguments
/// - `rq`: the local CPU's run queue data (caller holds `rq.lock`)
/// - `curr`: the currently running EEVDF task
/// - `queued`: derived from `rq.base.nr_queued.count()` inside the function body
///
/// # Locking
/// Called with the local CPU's `rq.lock` held (IRQs disabled). All
/// field accesses are to the local CPU's runqueue — no cross-CPU locking.
///
/// # Algorithm
///
/// ```text
/// fn eevdf_task_tick(rq: &mut EevdfRunQueue, curr: &EevdfTask) {
/// // Step 1: Core accounting — advance vruntime, check deadline,
/// // update PELT, charge CBS budget, check preemption.
/// update_curr(rq);
///
/// // Step 2: Update exec_start for the next delta_exec computation.
/// // (Already done inside update_curr — listed here for completeness.)
///
/// // Step 3: Check preemption against tree candidates.
/// // If the current task has exhausted its protected slice AND a peer
/// // with an earlier virtual deadline is eligible, request reschedule.
/// // This is the `check_preempt_tick` logic.
/// // UmkaOS WaiterCount excludes curr, so has_waiters() == true
/// // means "at least one peer exists" (equivalent to Linux's
/// // nr_queued > 1). See WaiterCount threshold translation guide.
/// if rq.base.nr_queued.has_waiters() {
/// if curr.vruntime >= curr.vprot {
/// // Protected slice consumed — check if a peer should preempt.
/// // Use the raw slice as the ideal runtime (wall-clock nanoseconds).
/// // The primary EEVDF preemption mechanism is `update_deadline()`
/// // (virtual-time comparison: vruntime >= vdeadline), which handles
/// // weight-proportional fairness. This secondary check just ensures
/// // no task runs beyond its raw slice in wall-clock time.
/// // NOTE: Do NOT use `calc_delta_fair(slice, weight)` here — that
/// // converts to virtual time, which would invert the weight
/// // relationship (high-weight tasks preempted sooner, not later).
/// let ideal_runtime = curr.slice_ns;
/// let delta_exec = curr.sum_exec_runtime - curr.prev_sum_exec_runtime;
/// // If we have run for at least one ideal runtime quantum, request
/// // reschedule. Lazy urgency: the tick fires every 1 ms; using
/// // Eager would cause immediate preemption on the next interrupt
/// // return, potentially interrupting kernel-mode work. Lazy lets
/// // the task finish its current kernel-mode operation and yields at
/// // the next voluntary preemption point (return-to-user,
/// // cond_resched()). This matches Linux 6.12+ scheduler_tick()
/// // behavior which uses resched_curr_lazy().
/// //
///         // Note: update_curr() (step 1) also uses Lazy for its own
/// // preemption check (deadline/protected-slice expiry). Both
/// // paths use Lazy consistently — the tick path does not
/// // override update_curr()'s urgency decision.
/// // Note: Once delta_exec >= ideal_runtime on one tick,
/// // this condition remains true on EVERY subsequent tick
/// // (prev_sum_exec_runtime is set on enqueue, not updated
/// // by the tick). This is intentional: Lazy reschedule is
/// // idempotent (setting TIF_NEED_RESCHED_LAZY again is a
/// // no-op). This matches Linux behavior.
/// if delta_exec >= ideal_runtime {
/// resched_curr(rq, ReschedUrgency::Lazy);
/// }
/// }
/// }
///
/// // Step 4: Propagate cgroup weight changes (lazy reweight).
/// // If the task's cgroup cpu.weight was modified since the last tick,
/// // recompute the GroupEntity weight. No reschedule IPI needed for
/// // ticked cores — the weight update is picked up here.
/// // See [Section 17.2](17-containers.md#control-groups--cpu-controller-state).
/// // AtomicBool because cgroup migration (cross-CPU) may set this
/// // flag while the task is running on a different CPU.
/// if curr.cgroup_weight_dirty.load(Acquire) {
/// reweight_entity(rq, curr);
/// curr.cgroup_weight_dirty.store(false, Release);
/// }
///
/// // ML observation: emit per-tick scheduling metrics (gated by static key).
/// // observe_kernel!(SubsystemId::Scheduler, SchedObs::TaskTick,
/// // curr.vruntime, rq.base.nr_queued.count(), delta_exec);
/// }
/// ```
pub fn eevdf_task_tick(
rq: &mut EevdfRunQueue,
curr: &EevdfTask,
) { /* in sched/eevdf.rs — uses rq.base.nr_queued directly.
* curr is &EevdfTask (shared reference) because it comes from
* Arc<Task> via TaskHandle. All scheduler-mutable fields in EevdfTask
* (vruntime, vdeadline, on_rq, sum_exec_runtime, etc.) are either
* atomics or protected by the rq lock, which this function holds.
* See the interior mutability convention at line ~1037. */ }
Relationship to update_curr(): eevdf_task_tick() delegates the core
accounting to update_curr() which handles vruntime advance, deadline renewal,
PELT update, CBS charge, cpu.max charge, and basic preemption check. The tick
function adds the check_preempt_tick heuristic (step 3) which checks whether
the current task has exceeded its ideal runtime — a coarser preemption trigger
than the deadline-based check inside update_curr().
Why both preemption checks exist: update_curr() step 8 triggers on vprot
expiry alone (curr.vruntime >= curr.vprot). The tick's step 3 adds a
wall-clock runtime check (delta_exec >= ideal_runtime) which is relevant
when vprot has not yet expired but the task has run for more than one full
ideal runtime quantum — possible when a large delta_exec spans multiple
ticks (e.g., tick interrupt delayed by IRQ storm). If update_curr() already
set TIF_NEED_RESCHED_LAZY, the tick check is redundant (harmless — setting
the flag again is a no-op). Both use Lazy urgency consistently.
Error paths: eevdf_task_tick() cannot fail. All arithmetic is bounded.
Idle CPU note: rq.curr is NEVER None -- idle CPUs have rq.curr
pointing to the statically-allocated rq.idle_task (boot-allocated per-CPU,
see idle_task: Arc<Task> in RunQueueData). The idle task has
sched_class = SchedClass::Idle, which dispatches to this function. The
update_curr() early return guard (step 2, idle-marked check) ensures the
idle task's vruntime does not advance. rq.curr is NonNull<EevdfTask> in
practice; the Option wrapper exists for the boot-time initialization window
before the idle task is assigned.
7.1.4.2 sched_idle_enter / sched_idle_exit — Idle Accounting Hooks¶
Non-idle threads that spin-wait (e.g., KVM halt-poll, busy-wait synchronisation primitives) can bracket their idle-wait windows so the scheduler does not penalise them with excess vruntime. These hooks are distinct from the per-CPU idle task — they apply to any runnable thread that is temporarily doing non-productive work.
/// Mark the current task as idle-spinning. While in this state:
///
/// - **vruntime does not advance**: `update_curr()` skips the
/// `vruntime += delta * (NICE_0_WEIGHT / weight)` accumulation for the
/// idle-marked task. The task's `vruntime` is frozen at its value when
/// `sched_idle_enter()` was called.
/// - **Not counted in runqueue load**: The task's weight is subtracted from
/// the two `avg_vruntime` accumulators (`sum_w_vruntime`, `sum_weight`).
/// This prevents an idle-spinning
/// task from inflating the runqueue's apparent load, which would distort
/// load balancing decisions and EAS frequency selection.
/// - **PELT accounting pauses**: `last_update_time` is not advanced for the
/// idle-marked task. Neither `util_sum` nor `runnable_sum` accumulate.
/// When `sched_idle_exit()` is called, a single `update_load_avg()` call
/// applies geometric decay over the idle interval, correctly reducing the
/// stale PELT signal.
/// - **Preemption remains enabled**: The task can still be preempted by
/// higher-priority tasks (RT, DL, or EEVDF peers with earlier deadlines).
/// If preempted, the idle marking persists across the context switch —
/// it is cleared only by an explicit `sched_idle_exit()` call.
///
/// # Panics
/// Calling `sched_idle_enter()` when the task is already idle-marked is a
/// logic error and triggers a `WARN_ON_ONCE` diagnostic (debug builds panic).
///
/// # Usage
/// - **KVM halt-poll** ([Section 18.3](18-virtualization.md#kvm-operational--vcpu-scheduling-integration)):
/// vCPU threads call `sched_idle_enter()` before the halt-poll spin loop
/// and `sched_idle_exit()` when an interrupt arrives or the poll window
/// expires. This prevents latency-sensitive guests from accumulating
/// unfair vruntime during idle polling.
/// - **Busy-wait synchronisation**: Any kernel thread that spin-waits for a
/// bounded duration (≤1 ms) on a condition that does not justify sleeping
/// may use these hooks to avoid scheduler penalties.
/// **Locking precondition**: Must be called with the local rq lock held
/// (i.e., within a `rq_lock_irqsave()` / `rq_unlock_irqrestore()` section).
/// Preemption disabled alone is NOT sufficient — the timer tick IRQ handler
/// calls `update_curr()` which reads `sum_w_vruntime`/`sum_weight`, and
/// these are non-atomic `i64` fields. The rq lock disables IRQs, preventing
/// concurrent modification from the timer tick handler.
pub fn sched_idle_enter(rq: &mut RunQueueGuard) {
let task = current();
debug_assert!(!task.sched_idle_marked.load(Relaxed), "double sched_idle_enter");
task.sched_idle_marked.store(true, Relaxed);
// Save the entity_key at enter time. The saved value is used at exit
// to restore the exact accumulator contribution, preventing drift from
// zero_vruntime shifts during the idle interval. Without this, a
// zero_vruntime update (from update_curr() called via timer tick on
// another task) would cause the exit restoration to add back a
// different key*weight product than was subtracted.
let key = entity_key(&rq.eevdf.base, task);
task.saved_idle_key.store(key, Relaxed);
// Subtract weight from rq accumulators.
// Access the runqueue data through the guard (no separate this_rq() call
// — that would alias the guard's &mut reference, causing UB).
rq.eevdf.base.sum_w_vruntime -= key * task.weight as i64;
rq.eevdf.base.sum_weight -= task.weight as i64;
}
/// Exit idle-spinning state. Reverses `sched_idle_enter()`:
///
/// 1. Re-adds the task's weight to the two `avg_vruntime` accumulators
/// (`sum_w_vruntime`, `sum_weight`).
/// 2. Performs a single `update_load_avg()` call to apply PELT decay for
/// the elapsed idle interval.
/// 3. Clears `task.sched_idle_marked`.
///
/// After this call, the task resumes normal vruntime accumulation and PELT
/// accounting. The task's vruntime is unchanged — it resumes from the frozen
/// value, which means it has not consumed any of its fair share during the
/// idle window.
///
/// # Panics
/// Calling `sched_idle_exit()` without a prior `sched_idle_enter()` triggers
/// `WARN_ON_ONCE` (debug builds panic).
pub fn sched_idle_exit(rq: &mut RunQueueGuard) {
let task = current();
debug_assert!(task.sched_idle_marked.load(Relaxed), "sched_idle_exit without enter");
task.sched_idle_marked.store(false, Relaxed);
// Catch up PELT decay for the idle interval.
// Access through the RunQueueGuard (no separate this_rq() — see sched_idle_enter).
update_load_avg(task, &*rq);
// Restore weight contribution to rq accumulators using the saved key
// from sched_idle_enter(). This prevents accumulator drift from
// zero_vruntime shifts during the idle interval.
let saved_key = task.saved_idle_key.load(Relaxed);
rq.eevdf.base.sum_w_vruntime += saved_key * task.weight as i64;
rq.eevdf.base.sum_weight += task.weight as i64;
}
Invariant: The interval between sched_idle_enter() and sched_idle_exit() must
be bounded. KVM enforces this via halt_poll_ns (default 200 us, max 10 ms). Unbounded
idle marking would distort the runqueue's avg_vruntime: on exit, the task
re-enters the accumulators with a vruntime frozen far behind its peers, dragging
avg_vruntime backward, which could delay ineligible tasks from becoming
eligible. The halt_poll_ns sysctl provides the administrative bound;
WARN_ON_ONCE fires if the idle interval exceeds 10 ms (configurable via
sched_idle_max_ns, default 10_000_000).
Task struct field: sched_idle_marked: AtomicBool is added to the per-task scheduling
state (adjacent to sched_delayed in EevdfTask). AtomicBool with Relaxed ordering
is used instead of Cell<bool> because update_curr() reads this flag from IRQ context
(timer tick handler) while sched_idle_enter() writes it from process context — a
cross-context access that is a data race under the Rust abstract machine with Cell.
On all 8 architectures, Relaxed load/store of a bool compiles to the same instruction
as a non-atomic load/store (naturally aligned single-byte access), so there is zero
performance cost. update_curr() checks this flag before accumulating vruntime — the
check is a single branch that is almost always not-taken (the likely() hint ensures
the branch predictor handles the common case with zero overhead).
7.1.4.3 Timer IRQ → scheduler_tick() Entry Path¶
The hardware timer interrupt (LAPIC timer on x86-64, ARM Generic Timer on
AArch64/ARMv7, SBI timer on RISC-V, decrementer on PPC, CPU timer on s390x,
stable counter on LoongArch64) fires at the configured tick rate (HZ=1000 → 1 ms,
HZ=250 → 4 ms). The per-architecture timer IRQ handler calls
timer_tick_handler() which is the single entry point connecting hardware
timer interrupts to scheduler and RCU tick processing.
/// Timer tick handler. Called from the architecture-specific timer IRQ
/// handler with IRQs disabled on the local CPU.
///
/// This function bridges the hardware timer interrupt to the scheduler
/// and RCU subsystems. It runs on EVERY tick on EVERY online CPU.
fn timer_tick_handler() {
// 1. Advance jiffies and update timekeeping.
update_wall_time();
// 2. Process expired hrtimers and timer wheel entries.
run_local_timers();
// 3. RCU quiescent state reporting.
// Single-task CPUs that never context-switch must report quiescent
// states here to avoid stalling RCU grace periods. Without this,
// a CPU running a single long-running task would never call
// rcu_note_context_switch() and the grace period would stall.
rcu_sched_clock_irq();
// 4. Scheduler tick.
scheduler_tick();
// 5. Subsystem tick hooks.
perf_event_task_tick(); // PMU event multiplexing (rotate groups)
calc_global_load_tick(); // /proc/loadavg update
psi_task_tick(); // Pressure Stall Information accounting
}
7.1.4.4 Top-Level scheduler_tick() Dispatch¶
The timer interrupt handler calls scheduler_tick() on every tick (1 ms HZ=1000,
or 4 ms HZ=250 depending on configuration). This is the top-level dispatch
function that coordinates all per-tick scheduling work across all scheduling
classes.
/// Top-level scheduler tick handler. Called from the timer interrupt
/// handler with IRQs disabled. Dispatches to per-class tick functions
/// and performs period maintenance (RT bandwidth, load balance interval).
///
/// Linux equivalent: `scheduler_tick()` in `kernel/sched/core.c`.
///
/// # Algorithm
///
/// ```text
/// fn scheduler_tick() {
/// let cpu = smp_processor_id();
/// let rq = &mut per_cpu_rq(cpu);
/// let _guard = rq.lock.lock(); // level 50
///
/// // Step 1: Update steal time (paravirtualized guests).
/// // See [Section 18.3](18-virtualization.md#kvm-operational--paravirtual-steal-time).
/// update_steal_time(rq);
///
/// // Step 1a: Update the runqueue task clock (cached, excludes IRQ time).
/// // All subsequent clock reads in this tick use rq.clock_task.
/// update_rq_clock(rq);
///
/// // Step 2: Dispatch to per-class task_tick().
/// let curr = rq.curr;
/// match curr.sched_class {
/// SchedClass::RtFifo | SchedClass::RtRr => {
/// rt_task_tick(rq, curr);
/// }
/// SchedClass::Deadline => {
/// dl_task_tick(rq, curr);
/// }
/// SchedClass::Eevdf | SchedClass::Idle => {
/// eevdf_task_tick(&mut rq.eevdf, curr);
/// }
/// }
///
/// // Step 3: RT bandwidth period accounting.
/// // If an RT task is running, charge delta_exec against the RT
/// // bandwidth budget. Check for period rollover and throttle.
/// if rq.rt_rq.rt_runtime_ns != u64::MAX {
/// rt_bandwidth_tick(rq);
/// }
///
/// // Step 4: Load balance interval check.
/// // Decrement the per-CPU load balance interval counter. When it
/// // reaches zero, trigger_load_balance() schedules the softirq
/// // that runs the load balancer. This amortizes the balancer cost
/// // across many ticks (interval = 4-32 ticks depending on topology).
/// rq.next_balance_tick -= 1;
/// if rq.next_balance_tick == 0 {
/// trigger_load_balance(rq);
/// rq.next_balance_tick = load_balance_interval(rq);
/// }
///
/// // Step 5: Update thermal/frequency pressure.
/// // See [Section 7.7](#power-budgeting--thermal-pressure-propagation).
/// update_thermal_pressure(rq);
///
/// // Step 6: nohz_full re-entry check.
/// // If this CPU was in nohz_full mode (tickless) and a second task
/// // became runnable, the tick was re-enabled. Check if we can return
/// // to tickless mode (only one runnable task again).
/// if is_nohz_full(cpu) && rq.nr_running == 1 {
/// nohz_full_kick_stop(cpu);
/// }
/// }
/// ```
pub fn scheduler_tick() { /* in sched/core.rs */ }
7.1.4.5 RT Bandwidth Period Accounting¶
/// Per-tick RT bandwidth accounting. Called from `scheduler_tick()` when
/// the RT bandwidth limiter is active (`rt_runtime_ns != u64::MAX`).
///
/// The RT bandwidth limiter prevents RT tasks from starving non-RT tasks
/// by limiting total RT execution time per period (default: 950 ms per
/// 1000 ms period, matching Linux's `sched_rt_runtime_us = 950000`).
///
/// # Algorithm
///
/// ```text
/// fn rt_bandwidth_tick(rq: &mut RunQueue) {
/// let rt_rq = &mut rq.rt_rq;
/// let now = rq_clock_task(rq);
///
/// // Step 1: Period rollover check.
/// // Advance period_start_ns by the period length (not set to `now`)
/// // to prevent drift when ticks are delayed. Use a while loop to
/// // handle missed periods (e.g., IRQ storm causing 2+ second gap).
/// // Bound: cap iterations to RT_MAX_ROLLOVER_ITERS (10) to prevent
/// // unbounded latency under the rq lock if the clock jumps (e.g.,
/// // VM pause/resume). If more than 10 periods are missed, the
/// // remaining catch-up happens over subsequent ticks. This matches
/// // Linux's RT_MAX_PERIODS bound in do_sched_rt_period_timer().
/// let mut rollover_iters = 0u32;
/// const RT_MAX_ROLLOVER_ITERS: u32 = 10;
///     while now - rt_rq.period_start_ns >= rt_rq.rt_period_ns
///         && rollover_iters < RT_MAX_ROLLOVER_ITERS
///     {
///         rollover_iters += 1;
///         rt_rq.period_start_ns += rt_rq.rt_period_ns;
/// rt_rq.rt_time_ns = 0;
/// // Un-throttle if previously throttled.
/// if rt_rq.throttled {
/// rt_rq.throttled = false;
/// // Re-enqueue RT tasks that were dequeued during throttle.
/// rt_unthrottle(rq);
/// resched_curr(rq, ReschedUrgency::Eager);
/// }
/// }
///
/// // Step 2: Charge runtime against the RT bandwidth budget.
/// // rt_bandwidth_tick() owns the exec_start update for RT tasks.
/// // rt_task_tick() does NOT update exec_start — it handles per-task
/// // slice management (RR timeslice, RTTIME limit) using the delta
/// // computed here. This single-owner design prevents double-charging.
/// if rq.curr.sched_class == SchedClass::RtFifo
/// || rq.curr.sched_class == SchedClass::RtRr
/// {
/// let delta = now as i64 - rq.curr.exec_start as i64;
/// if delta > 0 {
/// rq.curr.exec_start = now;
/// rt_rq.rt_time_ns += delta as u64;
/// }
/// }
///
/// // Step 3: Throttle check.
/// if !rt_rq.throttled
/// && rt_rq.rt_time_ns >= rt_rq.rt_runtime_ns
/// {
/// // RT budget exhausted for this period. Throttle all RT tasks
/// // on this CPU: dequeue them from the RT runqueue and set the
/// // throttled flag. Non-RT tasks will run until the period resets.
/// rt_rq.throttled = true;
/// rt_throttle_all(rq);
/// resched_curr(rq, ReschedUrgency::Eager);
/// }
/// }
/// ```
///
/// **Locking**: Called with `rq.lock` held (level 50), IRQs disabled.
///
/// **Default values**: `rt_runtime_ns = 950_000_000` (950 ms),
/// period = 1_000_000_000 ns (1 second). Configurable via
/// `/proc/sys/kernel/sched_rt_runtime_us` and `sched_rt_period_us`
/// (Linux compatibility interface in [Section 20.9](20-observability.md#kernel-parameter-store)).
pub fn rt_bandwidth_tick(rq: &mut RunQueue) { /* in sched/rt.rs */ }
7.1.4.6 Per-Class Tick Functions¶
/// RT scheduling class tick handler. Called from `scheduler_tick()` when
/// the current task is SCHED_FIFO or SCHED_RR.
///
/// **exec_start ownership**: This function does NOT update `exec_start`.
/// `rt_bandwidth_tick()` (called after this) owns the `exec_start` update
/// and delta computation for all RT tasks. This prevents double-charging.
///
/// # Algorithm
/// ```text
/// fn rt_task_tick(rq: &mut RunQueueData, curr: &EevdfTask) {
///     // RLIMIT_RTTIME check: accumulated total RT CPU time. Performed
///     // first so that it also applies to SCHED_FIFO; the FIFO early
///     // return below would otherwise skip it.
///     // curr.rt_runtime_us reflects the value accumulated as of the
///     // PREVIOUS tick's rt_bandwidth_tick() (which runs AFTER
///     // rt_task_tick in scheduler_tick). The check is therefore one tick
///     // behind, which is within RLIMIT_RTTIME tolerance (same as Linux).
///     // If it exceeds the task's rlimit, send SIGXCPU; if it exceeds it
///     // by a full second, escalate to SIGKILL.
///     let limit_us = task_rlimit(curr, RLIMIT_RTTIME);
///     if limit_us != u64::MAX {
///         let runtime_us = curr.rt_runtime_us.load(Relaxed);
///         if runtime_us >= limit_us {
///             send_sig(SIGXCPU, curr);
///         }
///         if runtime_us >= limit_us + 1_000_000 {
///             send_sig(SIGKILL, curr);
///         }
///     }
///
///     // SCHED_FIFO: no timeslice — run until preempted or blocked.
///     // No further action needed on tick (FIFO tasks are preempted only
///     // by higher-priority RT tasks, DL tasks, or explicit yield).
///     if curr.sched_policy == UserSchedPolicy::Fifo {
///         return;
///     }
///
///     // SCHED_RR: decrement timeslice and rotate if expired.
///     // The RR timeslice is stored in curr.slice_ns (default:
///     // DEF_TIMESLICE = 100ms, same as Linux).
///     curr.rr_time_remaining -= TICK_NS;
///     if curr.rr_time_remaining > 0 {
///         return;
///     }
///     // Reset timeslice for next quantum.
///     curr.rr_time_remaining = curr.slice_ns;
///     // If other RR/FIFO tasks at the same priority exist, rotate:
///     // move curr to the tail of its priority queue.
///     if rq.rt.has_peers_at_priority(curr.rt_priority) {
///         dequeue_rt_task(rq, curr);
///         enqueue_rt_task_tail(rq, curr);
///         resched_curr(rq, ReschedUrgency::Eager);
///     }
/// }
/// ```
pub fn rt_task_tick(rq: &mut RunQueueData, curr: &EevdfTask) { /* in sched/rt.rs —
* curr is &EevdfTask (shared reference): same rationale as eevdf_task_tick.
* Scheduler-mutable fields use interior mutability (atomics / rq lock). */ }
/// Deadline (EDF/CBS) scheduling class tick handler.
///
/// # Algorithm
/// ```text
/// fn dl_task_tick(rq: &mut RunQueueData, curr: &EevdfTask) {
/// // Step 1: CBS runtime accounting.
/// // The delta_exec for DL tasks is computed from rq.clock_task.
/// let now = rq.clock_task;
/// let delta = now.saturating_sub(curr.exec_start);
/// curr.exec_start = now;
/// curr.dl_runtime_remaining = curr.dl_runtime_remaining.saturating_sub(delta);
///
/// // Step 2: Deadline expiry check.
/// // If the absolute deadline has passed, the task missed its deadline.
/// // Reclaim: reset runtime and advance deadline by one period.
/// // Unit convention: dl_runtime_us and dl_period_us are stored in
/// // microseconds (matching the sched_attr ABI), converted to nanoseconds
/// // by multiplying by 1000 when computing absolute runtime/deadline values.
/// // This diverges from the CBS subsystem which stores budget_remaining_ns
/// // natively in nanoseconds. The divergence is intentional: the DL fields
/// // match the Linux sched_attr struct's dl_runtime/dl_period (microseconds),
/// // while CBS budget_remaining_ns is an internal accounting field not
/// // exposed to userspace.
/// if now >= curr.dl_deadline {
/// curr.dl_runtime_remaining = curr.dl_runtime_us * 1000;
/// curr.dl_deadline += curr.dl_period_us * 1000;
/// }
///
/// // Step 3: Budget exhaustion.
/// // If runtime is exhausted before the deadline, throttle the task
/// // until the next period starts (CBS replenishment).
/// if curr.dl_runtime_remaining == 0 {
/// curr.on_rq = OnRqState::CbsThrottled;
/// dequeue_dl_task(rq, curr);
/// start_dl_replenishment_timer(rq, curr);
/// resched_curr(rq, ReschedUrgency::Eager);
/// }
/// }
/// ```
pub fn dl_task_tick(rq: &mut RunQueueData, curr: &EevdfTask) { /* in sched/dl.rs —
* curr is &EevdfTask (shared reference): same rationale as eevdf_task_tick.
* Scheduler-mutable fields use interior mutability (atomics / rq lock). */ }
7.1.4.7 Scheduler Utility Functions¶
/// Raise SCHED_SOFTIRQ to trigger the load balancer on the next softirq
/// processing point. The load balancer runs in softirq context to avoid
/// blocking the scheduler tick path.
fn trigger_load_balance(rq: &RunQueueData) {
raise_softirq(SoftirqVec::Sched);
}
/// Return the load balance interval in ticks for this CPU based on
/// the sched_domain topology depth. Deeper topologies (more NUMA hops)
/// use longer intervals to reduce cross-node balancing overhead.
/// Range: 4 ticks (single-socket) to 32 ticks (4+ socket NUMA).
fn load_balance_interval(rq: &RunQueueData) -> u32 {
// Base interval scaled by sched_domain depth.
let depth = rq.sched_domain_depth;
core::cmp::min(4 * (1 << depth), 32)
}
/// Read the current thermal pressure from the architecture-specific
/// thermal monitoring interface and update the runqueue's capacity
/// reduction factor. Used by EAS for frequency/capacity decisions.
fn update_thermal_pressure(rq: &mut RunQueueData) {
let pressure = arch::current::thermal::read_pressure(rq.cpu_id);
rq.thermal_pressure = pressure;
}
/// Stop the tick timer for tickless (nohz_full) operation. Called when
/// a CPU returns to single-runnable-task state and can re-enter tickless
/// mode. The next scheduling event (wakeup, migration) will restart the tick.
fn nohz_full_kick_stop(cpu: u32) {
arch::current::timer::stop_tick(cpu);
}
/// Update the per-runqueue task clock. Called once at the start of each
/// scheduler entry (timer_tick_handler, schedule(), try_to_wake_up).
/// Reads the raw monotonic clock and subtracts accumulated IRQ time
/// to produce the task-only clock value stored in rq.clock_task.
fn update_rq_clock(rq: &mut RunQueueData) {
let raw_now = sched_clock_nanos();
rq.clock_task = raw_now - rq.irq_time_ns;
}
/// Called at IRQ entry. Records the timestamp for IRQ time accounting.
/// Must be called from the architecture-specific IRQ entry trampoline,
/// BEFORE the IRQ handler dispatches to device-specific code.
/// Paired with `irq_time_end()` at IRQ exit. Together these functions
/// maintain `rq.irq_time_ns`, which is subtracted from the raw clock
/// in `update_rq_clock()` to produce the task-only clock. Without this
/// accounting, IRQ processing time would be charged to the running
/// task's vruntime, inflating vruntime for tasks that happen to be
/// running during IRQ storms.
/// Linux equivalent: `irqtime_account_irq()` called from IRQ entry/exit.
fn irq_time_start(rq: &mut RunQueueData) {
rq.irq_entry_timestamp = sched_clock_nanos();
}
/// Called at IRQ exit. Accumulates the IRQ duration into `rq.irq_time_ns`.
/// Must be called from the architecture-specific IRQ exit trampoline,
/// AFTER all IRQ handlers have completed and BEFORE returning to the
/// interrupted context (task or idle).
fn irq_time_end(rq: &mut RunQueueData) {
let delta = sched_clock_nanos() - rq.irq_entry_timestamp;
rq.irq_time_ns += delta;
}
/// Return the cached task clock for this runqueue. This is the canonical
/// clock source for all scheduler accounting (update_curr, CBS charge,
/// PELT update). Excludes IRQ time.
/// Linux equivalent: rq_clock_task(rq) in kernel/sched/sched.h.
fn rq_clock_task(rq: &RunQueueData) -> u64 {
rq.clock_task
}
7.1.4.8 reweight_entity — Lazy Weight Update¶
/// Recompute an entity's scheduling weight after a nice/cpu.weight change.
/// Called from `eevdf_task_tick()` step 4 when `cgroup_weight_dirty` is set,
/// and from `sched_setattr()`/`setpriority()` when the nice value changes.
///
/// The entity must be dequeued from the EEVDF accumulators, have its weight
/// updated, and re-enqueued with the new weight to maintain invariants.
///
/// Linux equivalent: `reweight_entity()` in `kernel/sched/fair.c`.
///
/// # Algorithm
/// ```text
/// fn reweight_entity(rq: &mut EevdfRunQueue, se: &mut EevdfTask) {
/// let old_weight = se.weight;
/// let new_weight = sched_prio_to_weight[(se.nice + 20) as usize];
/// if old_weight == new_weight { return; }
///
/// // Step 1: Remove entity's contribution from accumulators.
/// // Uses avg_vruntime_update() to ensure the same entity_key() formula
/// // (se.vruntime - tree.zero_vruntime) is used for both removal and
/// // re-addition. Directly using raw se.vruntime would corrupt the
/// // accumulator by omitting the zero_vruntime offset.
/// let was_on_tree = se.on_rq == OnRqState::Queued;
/// if was_on_tree {
/// __dequeue_entity(&mut rq.base, se);
/// }
/// avg_vruntime_update(&mut rq.base, se, false); // dequeue: subtracts key*weight
///
/// // Step 2: Scale vlag by weight ratio to preserve relative position.
/// // vlag_new = vlag_old * old_weight / new_weight
/// se.vlag = se.vlag * old_weight as i64 / new_weight as i64;
///
/// // Step 3: Scale deadline by weight ratio.
/// // The remaining virtual time until deadline should be preserved
/// // proportionally: vd_remaining_new = vd_remaining_old * old/new.
/// let vd_remaining = se.vdeadline as i64 - se.vruntime as i64;
/// let scaled_remaining = vd_remaining * old_weight as i64 / new_weight as i64;
/// se.vdeadline = (se.vruntime as i64 + scaled_remaining) as u64;
/// se.rel_deadline = true;
///
/// // Step 4: Update weight.
/// se.weight = new_weight;
///
/// // Step 5: Re-add entity's contribution with new weight.
/// // Uses avg_vruntime_update() — same entity_key() path as enqueue/dequeue,
/// // ensuring the accumulator invariant is maintained:
/// // sum_w_vruntime = SUM_i( (v_i - zero_vruntime) * w_i )
/// avg_vruntime_update(&mut rq.base, se, true); // enqueue: adds key*new_weight
/// if was_on_tree {
/// __enqueue_entity(&mut rq.base, se);
/// }
///
/// // Step 6: Update PELT load average for the new weight.
/// se.pelt.update_weight(new_weight as u64);
/// }
/// ```
pub fn reweight_entity(rq: &mut EevdfRunQueue, se: &mut EevdfTask) { /* in sched/eevdf.rs */ }
SchedClass Dispatch Mechanism:
UmkaOS uses static enum dispatch (not a vtable or dyn SchedClassOps). Each task stores a SchedClass enum field. The scheduler's hot path uses a match statement on this enum, which the compiler can optimize to a direct jump table. The SchedClassOps trait (defined above) exists only as a documentation aid for the per-class method signatures; no dyn SchedClassOps or vtable pointer appears at runtime.
Rationale for static enum dispatch over vtable:
- Zero indirection: enum match compiles to a jump table (O(1) branch predictor-friendly); vtable dispatch requires a pointer dereference before the call. On x86-64, this eliminates 1 cache miss per scheduling decision.
- LTO-friendly: The compiler can inline small per-class operations (e.g., SCHED_IDLE.pick_next_task() always returns None if the idle task is the only runnable task). Vtable calls prevent inlining across compilation units.
- No runtime registration: SchedClass is fixed at compile time. New scheduling classes require a kernel rebuild, not a runtime module. This avoids the race conditions and validation overhead of dynamically registered scheduling classes.
/// Scheduling class. Stored in Task; determines all scheduling decisions.
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum SchedClass {
/// EEVDF (Eligible Earliest Virtual Deadline First).
/// For all normal (CFS) tasks. Provides fair-share CPU time with
/// latency-nice configurable slice sizes.
Eevdf = 0,
/// POSIX SCHED_FIFO. Run until preempted by higher-priority RT task,
/// blocked, or explicitly yields. Static priority 1-99.
RtFifo = 1,
/// POSIX SCHED_RR. Like SCHED_FIFO but with timeslices.
RtRr = 2,
/// POSIX SCHED_DEADLINE. CBS (Constant Bandwidth Server). Specified
/// by (runtime_us, deadline_us, period_us) at sched_setattr() time.
Deadline = 3,
/// SCHED_IDLE. Lower priority than any Eevdf task. Used for background
/// maintenance tasks (garbage collection, defragmentation, telemetry).
Idle = 4,
}
/// User-visible scheduling policy. Stored in `EevdfTask.sched_policy` and
/// returned by `sched_getscheduler(2)` / `sched_getattr(2)`. Multiple
/// policies may map to the same `SchedClass` but differ in behavioral details.
///
/// Matches Linux `SCHED_*` constants for ABI compatibility.
#[repr(u32)]
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum UserSchedPolicy {
/// SCHED_NORMAL (0). Standard time-sharing (EEVDF). Preemptible by newly
/// woken peers if they have earlier virtual deadlines.
Normal = 0,
/// SCHED_FIFO (1). Real-time FIFO — runs until blocked or preempted by
/// higher-priority RT task.
Fifo = 1,
/// SCHED_RR (2). Real-time round-robin — FIFO with per-priority timeslices.
Rr = 2,
/// SCHED_BATCH (3). CPU-intensive batch processing. Uses EEVDF but with
/// a behavioral difference: batch tasks are never preempted by newly woken
/// EEVDF peers (only by RT/DL tasks). This avoids unnecessary context
/// switches for throughput-oriented workloads.
Batch = 3,
/// SCHED_IDLE (5). Extremely low priority. Only runs when no other
/// non-idle task is runnable. Linux ABI value 5 (value 4 is unused).
Idle = 5,
/// SCHED_DEADLINE (6). Earliest Deadline First with CBS bandwidth
/// reservation. Linux ABI value 6.
Deadline = 6,
}
// In the scheduler hot path, the canonical `pick_next_task()` defined above
// (§ EEVDF pick_next_task) is called. The enum dispatch `match` statement in
// the scheduler tick and context switch paths calls into that function, which
// already implements the full priority order:
// Deadline > RT > CBS-guaranteed > EEVDF > Idle
// and returns `&Task` (never None -- the idle task is always available).
// See `pick_next_task(rq: &mut RunQueueData) -> &Task` for the full logic.
Per-class operations (called via match in the scheduler):
- enqueue(task): Add to the class-specific queue.
- dequeue(task): Remove from the class-specific queue.
- pick_next(): Select the next task to run.
- put_prev(task): Task is being descheduled; update per-class bookkeeping (e.g., EEVDF virtual time advance).
- check_preempt(task, new_task): Can new_task preempt task? Called when a new task becomes runnable.
- task_tick(task): Called on each scheduler tick for the running task.
Relationship between SchedClass (enum dispatch) and SchedPolicy (replaceable vtable):
The enum-dispatched pick_next_task() above is the fixed scheduling skeleton
that determines class priority ordering (DL > RT > CBS > EEVDF > Idle). This
skeleton is compiled into the kernel and is not replaceable at runtime.
The SchedPolicy trait (Section 19.9)
is a replaceable policy module that affects only the EEVDF scheduling class.
It is called from step 4 of pick_next_task(): the EEVDF branch delegates to the
active SchedPolicy module to select among eligible EEVDF tasks.
| Component | Dispatch | Replaceable? | Scope |
|---|---|---|---|
| DL class | Fixed enum match | No | Step 1 of pick_next_task |
| RT class | Fixed enum match | No | Step 2 |
| CBS servers | Fixed enum match | No | Step 3 |
| EEVDF class | SchedPolicy vtable | Yes (live evolution) | Step 4 |
| Idle | Fixed | No | Step 5 |
CBS servers use their own local EEVDF trees, which also delegate to the active
SchedPolicy for task selection within each server. The SchedPolicy module
receives a read-only SchedPolicyContext snapshot — it never accesses the
runqueue directly. See Section 19.9 for the full policy
module interface and live replacement protocol.
7.1.4.9 Task-to-Runqueue Lookup Protocol (Lockfree cpu_id + Retry)¶
Operations that need to lock a task's runqueue (try_to_wake_up, cgroup migration,
sched_setaffinity) use a lockfree protocol based on task.cpu_id: AtomicU32.
No pi_lock is needed for task pinning — pi_lock stays at level 45 exclusively
for its actual purpose (protecting HeldMutexes and priority inheritance chain
walking).
/// Lock the runqueue that `task` is currently enqueued on. Uses lockfree
/// optimistic lookup with bounded retry. The retry is bounded because
/// migration requires the same rq.lock we are acquiring — once we hold
/// the correct rq.lock, the task cannot escape.
///
/// # Algorithm
///
/// 1. Read `task.cpu_id.load(Acquire)` → `cpu`
/// 2. Acquire `rq_array[cpu].lock`
/// 3. Verify `task.cpu_id.load(Acquire) == cpu`
/// 4. If mismatch: release lock, goto 1
/// 5. If match: proceed with operation under rq.lock
///
/// The Acquire ordering on cpu_id ensures that if we observe cpu=X, all
/// memory writes that were visible to the CPU that last stored X are
/// visible to us. This prevents reading stale scheduling state after the
/// task was migrated.
///
/// # Why pi_lock is not needed
///
/// In Linux, `task_rq_lock()` acquires `pi_lock` then `rq->lock` to
/// prevent the task from being migrated between the CPU lookup and the
/// rq.lock acquisition. UmkaOS's `cpu_id: AtomicU32` + retry achieves
/// the same guarantee without pi_lock:
///
/// - Migration writes `task.cpu_id` under the source `rq.lock` (the
/// migration path acquires source rq.lock, dequeues, updates cpu_id,
/// releases source rq.lock, acquires dest rq.lock, enqueues).
/// - Our retry loop detects the stale cpu by re-reading cpu_id after
/// acquiring the (possibly wrong) rq.lock.
/// - The worst case is one retry per concurrent migration (rare).
fn lock_task_rq(task: &Task) -> RqLockGuard {
loop {
let cpu = task.cpu_id.load(Acquire);
let guard = rq_array[cpu as usize].lock();
// Re-check: did the task migrate while we were acquiring?
if task.cpu_id.load(Acquire) == cpu {
return guard;
}
// Task migrated — release wrong lock, retry.
drop(guard);
}
}
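The retry protocol can be exercised in a userspace model. The sketch below mirrors `lock_task_rq()` with `std` atomics and mutexes standing in for `rq_array` and `rq.lock`; all names and the two-CPU setup are illustrative:

```rust
use std::sync::atomic::{AtomicU32, Ordering::{Acquire, Release}};
use std::sync::{Mutex, MutexGuard};

// Userspace stand-in for rq_array: one lock per "runqueue".
static RQS: [Mutex<()>; 2] = [Mutex::new(()), Mutex::new(())];

struct Task {
    cpu_id: AtomicU32,
}

/// Mirror of lock_task_rq(): optimistic cpu_id read, lock, re-check, retry.
fn lock_task_rq(task: &Task) -> (u32, MutexGuard<'static, ()>) {
    loop {
        let cpu = task.cpu_id.load(Acquire); // optimistic lookup
        let guard = RQS[cpu as usize].lock().unwrap(); // acquire rq.lock
        if task.cpu_id.load(Acquire) == cpu {
            return (cpu, guard); // task cannot migrate while we hold its rq.lock
        }
        drop(guard); // stale cpu_id: task migrated, retry with the new value
    }
}

fn main() {
    let task = Task { cpu_id: AtomicU32::new(0) };
    // Simulate a migration that completed before our lookup: the migrator
    // stored the new cpu_id with Release under the source rq.lock.
    task.cpu_id.store(1, Release);
    let (cpu, _guard) = lock_task_rq(&task);
    assert_eq!(cpu, 1); // we hold the lock of the rq the task is actually on
}
```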
7.1.4.10 wake_up_new_task (Forked Task Activation)¶
After do_fork()/do_clone() completes task creation and cgroup attachment
(Section 8.1),
wake_up_new_task(child) introduces the child to the scheduler. This is the
single entry point that transitions a newly-forked task from "exists but not
runnable" to "eligible for scheduling."
/// Activate a newly-forked task. Called exactly once per fork, after
/// cgroup_post_fork() and before returning to the parent.
///
/// # Steps
///
/// 1. **Set initial vruntime** (EEVDF tasks only):
/// The child is placed via `place_entity(rq, child, ENQUEUE_INITIAL)`.
/// This sets `child.vruntime = avg_vruntime(rq)` (not parent's
/// vruntime — the child starts at the current weighted average).
/// The deadline is set to `vruntime + vslice / 2` (half-slice for
/// initial placement, matching Linux's `PLACE_DEADLINE_INITIAL`).
/// The half-slice gives the child a slight advantage — existing tasks
/// are on average halfway through their slice, so starting the child
/// at half-slice eases it into the competition without starvation.
/// Child inherits parent's nice value. Vlag is initialized to 0
/// (no accumulated credit or debt).
///
/// 2. **Select target CPU**: Place the child on the parent's current
/// CPU runqueue. This maximizes cache locality: the child's COW
/// page tables and TLB entries overlap with the parent's, and
/// running on the same CPU avoids cold-cache penalties. If the
/// parent's CPU is overloaded (runqueue length > 1.5× average),
/// the load balancer may select a nearby idle CPU in the same LLC
/// domain instead.
///
/// 3. **Insert into tree**: `place_entity()` has already computed the
/// child's vruntime and vdeadline. Insert into the `tasks_timeline`
/// tree (keyed by vdeadline). Eligibility is determined dynamically
/// by `pick_eevdf()` using `entity_eligible()`.
///
/// 4. **Enqueue**: Call the scheduling class's enqueue(child) to
/// insert the task into the appropriate per-CPU runqueue data
/// structure. Update rq.nr_running and load averages.
///
/// 5. **Check preemption**: If the child's vdeadline < current
/// task's vdeadline (i.e., the child is more urgent), set
/// `need_resched` on the current CPU's `CpuLocalBlock`
/// ([Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path)). The rescheduling
/// occurs at the next preemption point (syscall return or
/// interrupt exit).
///
/// 6. **Resched IPI** (cross-CPU case only): If step 2 placed the
/// child on a different CPU than the current CPU, send a resched
/// IPI to that CPU so it re-evaluates pick_next_task promptly.
///
/// # RT and Deadline tasks
///
/// RT tasks (SCHED_FIFO/SCHED_RR) inherit the parent's static
/// priority. The child is placed at the tail of the priority's
/// runqueue (FIFO) or gets a fresh timeslice (RR).
///
/// Deadline tasks (SCHED_DEADLINE) are not inherited via fork —
/// the child is demoted to SCHED_NORMAL (EEVDF) unless the parent
/// explicitly sets deadline parameters via sched_setattr() after
/// fork. This prevents accidental bandwidth over-commitment.
pub fn wake_up_new_task(child: &Task) {
let rq = select_task_rq(child);
init_new_task_vruntime(child, rq);
activate_task(rq, child, EnqueueFlags::ENQUEUE_INITIAL);
check_preempt_curr(rq, child);
}
/// Initialize a newly forked task's virtual runtime for EEVDF scheduling.
///
/// Delegates to `place_entity()` with `ENQUEUE_INITIAL`. This sets:
/// - `child.vruntime = avg_vruntime(rq)` — the child starts at the current
/// weighted average of the runqueue, not the parent's vruntime. Starting at
/// the average prevents a fork bomb from obtaining unfairly low vruntimes.
/// - `child.vdeadline = vruntime + vslice / 2` — half-slice initial placement
/// (matching Linux's `PLACE_DEADLINE_INITIAL`). Existing tasks are on
/// average halfway through their slice; half-slice eases the child into
/// competition without starving existing tasks.
/// - `child.vlag = 0` — no accumulated credit or debt.
///
/// For non-EEVDF classes (RT, DL), this function is a no-op — RT tasks inherit
/// the parent's static priority directly, and DL tasks are demoted to
/// SCHED_NORMAL at fork (DL parameters are not inheritable).
///
/// **Classification**: Evolvable (the placement formula is policy).
///
/// Linux equivalent: `task_fork_fair()` in `kernel/sched/fair.c`.
fn init_new_task_vruntime(child: &Task, rq: &mut RunQueue) {
match child.sched_policy {
SchedPolicy::Normal | SchedPolicy::Batch | SchedPolicy::Idle => {
let se = &mut child.sched_entity;
place_entity(&mut rq.cfs, se, EnqueueFlags::ENQUEUE_INITIAL);
}
// RT and DL tasks: no vruntime initialization needed.
// RT inherits parent's static priority (set in do_fork step 7a).
// DL was demoted to SCHED_NORMAL at fork (do_fork step 7a).
_ => {}
}
}
// ---------------------------------------------------------------------------
// resched_curr — Rescheduling Request with Urgency
// ---------------------------------------------------------------------------
/// Urgency level for rescheduling requests.
///
/// **UmkaOS-original abstraction** over Linux's two-function model
/// (`resched_curr()` + `resched_curr_lazy()`). Single function with enum
/// parameter provides the same zero-cost dispatch (branch on immediate)
/// while enabling ML policy and Evolvable scheduler modules to return
/// `ReschedUrgency` values — decoupling the decision (what urgency?)
/// from the mechanism (which flag to set). Extensible to future urgency
/// levels without adding new functions. Exhaustive `match` on the enum
/// ensures all callers handle new variants.
///
/// **Classification**: Evolvable. ML policy can tune per-cgroup urgency
/// via `ParamId::SchedReschedUrgency`.
#[repr(u8)]
pub enum ReschedUrgency {
/// Must reschedule soon. Sets `TIF_NEED_RESCHED` on the target CPU's
/// thread_info, which is checked at ALL preemption points (interrupt
/// return, syscall return, cond_resched(), preempt_enable()).
///
/// Use for: wakeup of higher-priority task, RT task ready, slice expiry,
/// cross-class preemption (RT preempts EEVDF).
Eager = 0,
/// Should reschedule when convenient. Sets `TIF_NEED_RESCHED_LAZY` on
/// the target CPU's thread_info, which is checked ONLY at voluntary
/// preemption points (return-to-user, cond_resched()) but NOT at
/// involuntary preemption points (interrupt return with preemption
/// enabled).
///
/// Use for: EEVDF eligibility change, load balancing suggestion, nice
/// change, cgroup weight update. These events benefit from rescheduling
/// but do not require immediate preemption — allowing the current task
/// to complete its kernel-mode work reduces unnecessary context switches
/// on throughput-oriented workloads.
Lazy = 1,
}
/// Request rescheduling on the CPU that owns `rq`.
///
/// Sets the appropriate thread-info flag on `rq.curr` based on `urgency`:
/// - `Eager`: `TIF_NEED_RESCHED` — checked at all preemption points.
/// Sends IPI to remote CPUs and to nohz_full local CPU.
/// - `Lazy`: `TIF_NEED_RESCHED_LAZY` — checked at voluntary preemption
/// + return-to-user only. Does NOT send IPI to remote CPUs (the target
/// CPU will notice the flag at its next voluntary preemption point).
/// Sends IPI only if the target CPU is nohz_full (no tick to notice the flag).
///
/// **Locking**: Caller must hold `rq.lock`.
///
/// **Classification**: Evolvable. The urgency assignment at each call site
/// is policy that can be tuned by ML. The mechanism (flag set + IPI) is
/// the same for both urgency levels and is unlikely to change.
fn resched_curr(rq: &RunQueue, urgency: ReschedUrgency) {
match urgency {
ReschedUrgency::Eager => {
set_tif_need_resched(rq.curr);
// Eager: IPI needed for remote CPUs (to preempt immediately) and
// nohz_full local CPU (no tick to notice the flag).
if rq.cpu != current_cpu() || rq.is_nohz_full() {
send_resched_ipi(rq.cpu);
}
}
ReschedUrgency::Lazy => {
set_tif_need_resched_lazy(rq.curr);
// Lazy: no IPI for remote CPUs. The target CPU will notice
// TIF_NEED_RESCHED_LAZY at its next voluntary preemption point
// (return-to-user, cond_resched()). Sending an IPI would defeat
// the purpose of lazy rescheduling — the whole point is "reschedule
// when convenient," not "interrupt what you're doing right now."
//
// Exception: nohz_full CPUs have no periodic tick, so they would
// never notice the lazy flag without an IPI. This matches Linux
// 6.12+ resched_curr_lazy() behavior (kernel/sched/core.c).
if rq.is_nohz_full() {
send_resched_ipi(rq.cpu);
}
}
}
}
// ---------------------------------------------------------------------------
// check_preempt_curr — Wakeup Preemption Check
// ---------------------------------------------------------------------------
/// Check whether a newly activated task should preempt the current task.
///
/// Dispatches to the appropriate scheduling class comparison:
/// - If `task`'s class has higher priority than `rq.curr`'s class
/// (RT > DL > Normal), set `need_resched` unconditionally.
/// - If both are in the same class, delegate to the class-specific
/// check (EEVDF: compare virtual deadline; RT: compare static priority;
/// DL: compare absolute deadline).
/// - If the current task's class is higher priority, no preemption.
///
/// Called from `wake_up_new_task()`, `try_to_wake_up()`, and
/// `sched_setscheduler()` (after changing a task's scheduling class).
///
/// Uses `resched_curr(rq, ReschedUrgency::Eager)` for cross-class preemption
/// (higher-priority class waking) and delegates to the class-specific check
/// for intra-class preemption (which may use either `Eager` or `Lazy`
/// depending on the scheduling class's policy).
fn check_preempt_curr(rq: &RunQueue, task: &Task) {
let task_prio = sched_class_priority(task.eevdf.sched_class);
let curr_prio = sched_class_priority(rq.curr.eevdf.sched_class);
if task_prio > curr_prio {
resched_curr(rq, ReschedUrgency::Eager);
return;
}
if task_prio == curr_prio {
// Delegate to class-specific preemption check via match dispatch
// (not a vtable call — consistent with enum-based class dispatch).
match task.eevdf.sched_class {
SchedClass::Eevdf => check_preempt_eevdf(rq, task),
SchedClass::RtFifo | SchedClass::RtRr => check_preempt_rt(rq, task),
SchedClass::Deadline => check_preempt_dl(rq, task),
SchedClass::Idle => { /* idle never preempts idle */ }
}
}
// Lower-priority class: no preemption.
}
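The comparison above assumes `sched_class_priority()` imposes a total order over classes. A minimal sketch consistent with the Deadline > RT > EEVDF > Idle ordering used by `check_preempt_curr()` (the numeric ranks are illustrative; only the ordering matters):

```rust
// Illustrative rank function: higher value = higher-priority class.
// FIFO and RR share a rank; their relative order is an intra-class
// question delegated to check_preempt_rt().
#[derive(Clone, Copy)]
enum SchedClass {
    Eevdf,
    RtFifo,
    RtRr,
    Deadline,
    Idle,
}

fn sched_class_priority(class: SchedClass) -> u8 {
    match class {
        SchedClass::Deadline => 3,
        SchedClass::RtFifo | SchedClass::RtRr => 2,
        SchedClass::Eevdf => 1,
        SchedClass::Idle => 0,
    }
}

fn main() {
    // Cross-class: an RT wakeup preempts a running EEVDF task (Eager resched).
    assert!(sched_class_priority(SchedClass::RtFifo)
        > sched_class_priority(SchedClass::Eevdf));
    // DL outranks RT; Idle never preempts anyone.
    assert!(sched_class_priority(SchedClass::Deadline)
        > sched_class_priority(SchedClass::RtRr));
    assert_eq!(sched_class_priority(SchedClass::Idle), 0);
    // FIFO vs RR is intra-class: equal rank, delegate to the class check.
    assert_eq!(sched_class_priority(SchedClass::RtFifo),
               sched_class_priority(SchedClass::RtRr));
}
```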
resched_curr call site urgency classification (exhaustive list across all
scheduling classes):
| Call site | Class | Urgency | Rationale |
|---|---|---|---|
| `update_curr()` deadline/slice expiry | EEVDF | Lazy | Task exceeded its protected slice; reschedule at next voluntary preemption point. Not urgent — the task is still running fairly. |
| `scheduler_tick()` ideal_runtime check | EEVDF | Lazy | Periodic fairness check. Lazy avoids interrupting kernel-mode work; the 1 ms tick already bounds latency. Matches Linux 6.12+ resched_curr_lazy(). |
| `check_preempt_curr()` cross-class | All | Eager | Higher-priority class task woke up (e.g., RT preempts EEVDF). Must preempt immediately. |
| `check_preempt_curr()` intra-class EEVDF | EEVDF | Lazy | Newly woken task has earlier virtual deadline than current. Preempt at convenience. |
| DL earliest_deadline preemption | DL | Eager | A deadline task with an earlier absolute deadline than the running DL task is ready. Deadline ordering is hard — must preempt now to meet the earlier deadline. |
| RT wakeup preemption | RT | Eager | A higher-static-priority RT task woke up. RT scheduling requires immediate preemption to maintain priority guarantees. |
| RT RR time-slice expiry | RT | Eager | Round-robin RT task exhausted its time quantum. Must yield immediately to the next RR task at the same priority (Linux task_tick_rt() calls resched_curr()). |
| CBS budget exhaustion (throttle) | CBS | Eager | CBS server budget depleted. Running task must yield to prevent bandwidth overrun beyond one tick period. Without Eager reschedule, the currently running task continues for up to one tick (1-4 ms) past budget exhaustion. |
| CBS server replenishment un-throttle | CBS | Eager | Throttled CBS server received budget replenishment. Tasks waiting for bandwidth should run promptly. |
| `sched_setscheduler()` class change | All | Eager | Task moved to a higher-priority class (e.g., SCHED_OTHER → SCHED_FIFO). Check preemption immediately. |
| Load balancer migration | EEVDF | Lazy | Load balance suggests migrating a task. Not urgent — the source CPU's work is not affected. |
| Nice/weight change | EEVDF | Lazy | Task's weight changed (renice, cgroup cpu.weight update). Reschedule when convenient. |
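The Eager/Lazy distinction in the table reduces to which preemption points honor which flag. A userspace model of that check (the struct and field names stand in for the thread_info bits; not kernel code):

```rust
// Model of the two resched flags. TIF_NEED_RESCHED is honored at every
// preemption point; TIF_NEED_RESCHED_LAZY only at voluntary ones
// (return-to-user, cond_resched()).
#[derive(Default)]
struct ThreadInfo {
    need_resched: bool,      // stands in for TIF_NEED_RESCHED
    need_resched_lazy: bool, // stands in for TIF_NEED_RESCHED_LAZY
}

fn should_resched(ti: &ThreadInfo, voluntary_point: bool) -> bool {
    ti.need_resched || (voluntary_point && ti.need_resched_lazy)
}

fn main() {
    // Lazy request (e.g., nice change): survives an interrupt return,
    // honored at the next cond_resched().
    let mut ti = ThreadInfo::default();
    ti.need_resched_lazy = true;
    assert!(!should_resched(&ti, false)); // interrupt return: keep running
    assert!(should_resched(&ti, true));   // cond_resched(): switch now

    // Eager request (e.g., RT wakeup): honored everywhere.
    ti = ThreadInfo::default();
    ti.need_resched = true;
    assert!(should_resched(&ti, false));
}
```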
Fork placement via place_entity(): The child's vruntime is set to
avg_vruntime(rq) (the current weighted average, not the parent's vruntime).
The virtual deadline uses vslice / 2 for initial placement
(PLACE_DEADLINE_INITIAL). The child's vlag is initialized to 0 (no saved
credit/debt). This matches Linux's place_entity() with ENQUEUE_INITIAL.
The ML scheduler policy (Section 23.1)
can tune the effective slice via ParamId::SchedEevdfWeightScale.
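The ENQUEUE_INITIAL arithmetic described above reduces to two assignments. A toy numeric check (the nanosecond values are illustrative, and the real `avg_vruntime()` is load-weighted, not a plain average):

```rust
// Toy model of ENQUEUE_INITIAL placement:
//   vruntime  = avg_vruntime(rq)        (child joins at the average)
//   vdeadline = vruntime + vslice / 2   (PLACE_DEADLINE_INITIAL half-slice)
//   vlag      = 0                       (no inherited credit or debt)
fn place_initial(avg_vruntime_ns: u64, vslice_ns: u64) -> (u64, u64, i64) {
    let vruntime = avg_vruntime_ns;
    let vdeadline = vruntime + vslice_ns / 2;
    let vlag = 0i64;
    (vruntime, vdeadline, vlag)
}

fn main() {
    // Runqueue average at 10 ms of virtual time, 3 ms virtual slice:
    let (vr, vd, vlag) = place_initial(10_000_000, 3_000_000);
    assert_eq!(vr, 10_000_000);
    assert_eq!(vd, 11_500_000); // vruntime + vslice/2
    assert_eq!(vlag, 0);
}
```

Starting at the average (rather than inheriting the parent's vruntime) is what defeats fork bombs: no amount of forking can manufacture an unfairly low vruntime.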
7.1.4.11 select_task_rq (CPU Selection for Task Placement)¶
select_task_rq() selects the optimal CPU runqueue for placing a task. It is
called on three paths: new task activation (wake_up_new_task), task wakeup
(try_to_wake_up), and explicit migration (sched_setaffinity). The function
delegates to the task's scheduling class for class-specific CPU selection logic,
then applies cross-class constraints (cpuset, affinity mask).
/// Select the target CPU runqueue for a task.
///
/// # Algorithm (EEVDF / Normal class)
///
/// 1. **Affinity mask filter**: Restrict candidates to CPUs in the task's
/// `cpus_allowed` mask (set by `sched_setaffinity` or inherited from
/// the cpuset cgroup). If the mask has exactly one CPU, return it
/// immediately (pinned task).
///
/// 2. **Idle CPU preference**: Scan for idle CPUs in the task's last-run
/// LLC (Last-Level Cache) domain first. An idle CPU avoids runqueue
/// contention and provides immediate execution. Idle scan uses the
/// per-LLC `idle_cpumask` bitmap (updated atomically by the idle loop
/// and `pick_next_task`).
///
/// 3. **Cache warmth**: If no idle CPU is found in the LLC domain, prefer
/// the CPU where the task last ran (`task.last_cpu`). Warm cache lines
/// (TLB entries, L1/L2 data) avoid cold-start penalties of ~10-50 us
/// on modern hardware. The benefit is estimated as:
/// cache_benefit_ns = sched_cache_hot_ns * (1 - time_since_last_run / decay_ns)
/// where `sched_cache_hot_ns` defaults to 2,500,000 ns (2.5 ms).
///
/// 4. **NUMA distance**: For NUMA systems, penalize CPUs on remote NUMA
/// nodes proportional to the NUMA distance (from the SLIT table).
/// A remote node adds `remote_penalty_ns` (default: proportional to
/// NUMA distance) to the placement cost. This ensures tasks stay
/// NUMA-local unless a remote CPU is idle and the local node is
/// overloaded.
///
/// 5. **Energy-Aware Scheduling (EAS)**: On heterogeneous platforms
/// (big.LITTLE, hybrid P/E cores), EAS evaluates the energy cost of
/// placing the task on each candidate CPU using the Energy Model
/// ([Section 7.1](#scheduler--energy-aware-scheduling-eas)). The CPU with the
/// lowest incremental energy cost is preferred, subject to a latency
/// constraint: if the energy-optimal CPU would delay the task's
/// wakeup by more than `eas_latency_budget_ns`, the faster CPU wins.
///
/// 6. **Load balance tiebreaker**: Among equally-scored candidates,
/// prefer the CPU with the shortest runqueue (fewest `nr_running`).
///
/// # RT class
/// Scans for the lowest-numbered CPU in the affinity mask that is not
/// currently running a higher-priority RT task (`rt_rq.highest_prio`).
///
/// # Deadline class
/// Uses `dl_bw` (deadline bandwidth) to find a CPU with sufficient
/// remaining bandwidth in the task's root domain. Falls back to the
/// task's affinity mask if all CPUs are over-committed.
pub fn select_task_rq(task: &Task) -> &mut RunQueue;
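The step-3 cache warmth estimate can be computed in integer arithmetic to avoid floating point in the kernel. A sketch (the `decay_ns` window value used here is an assumed tunable, not defined above):

```rust
// Integer form of: cache_benefit_ns = hot * (1 - t/decay)
//                                   = hot * (decay - t) / decay
// Saturates to 0 once the task has been off-CPU longer than the decay window.
fn cache_benefit_ns(sched_cache_hot_ns: u64, since_last_run_ns: u64, decay_ns: u64) -> u64 {
    if since_last_run_ns >= decay_ns {
        return 0; // cache assumed fully cold past the decay window
    }
    sched_cache_hot_ns * (decay_ns - since_last_run_ns) / decay_ns
}

fn main() {
    let hot = 2_500_000; // default sched_cache_hot_ns (2.5 ms)
    let decay = 10_000_000; // assumed 10 ms decay window
    assert_eq!(cache_benefit_ns(hot, 0, decay), hot); // just ran: full benefit
    assert_eq!(cache_benefit_ns(hot, 5_000_000, decay), hot / 2); // half decayed
    assert_eq!(cache_benefit_ns(hot, 20_000_000, decay), 0); // fully cold
}
```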
7.1.4.12 WakeFlags¶
/// Bitflags controlling wakeup behavior in `try_to_wake_up()`.
/// Passed by callers (signal delivery, futex wake, I/O completion) to
/// influence CPU selection and throttle bypass decisions.
bitflags! {
pub struct WakeFlags: u32 {
/// Synchronous wakeup hint: the waker will soon block, so placing
/// the wakee on the waker's CPU improves cache reuse. Used by
/// pipe write, futex wake, and unix socket send. The scheduler
/// treats this as a placement hint, not a guarantee — it may be
/// overridden by load balancing or NUMA distance penalties.
const SYNC = 1 << 0;
        /// Bypass CBS (Constant Bandwidth Server) throttle. Used by
        /// SIGKILL delivery to ensure the target task is scheduled
/// immediately regardless of bandwidth limits. Without this
/// flag, a bandwidth-throttled cgroup could delay SIGKILL
/// processing indefinitely.
const BYPASS_CBS = 1 << 1;
}
}
7.1.4.13 activate_task (Sleeping-to-Runnable Transition)¶
activate_task() transitions a task from sleeping (or newly created) to runnable
by inserting it into the target runqueue. It is the counterpart of deactivate_task()
which removes a task from its runqueue when it blocks.
/// Move a task from sleeping/new state to runnable on the specified runqueue.
///
/// # Steps
///
/// 1. **Set task state**: Transition `task.state` from `TASK_SLEEPING` (or
/// `TASK_NEW` for freshly forked tasks) to `TASK_RUNNING`. The state
/// transition uses `Ordering::Release` to ensure all task initialization
/// writes are visible to the runqueue's CPU before the task becomes
/// schedulable.
///
/// 2. **Enqueue into scheduling class**: Call the task's scheduling class
/// enqueue operation:
/// - **EEVDF**: Call `place_entity(rq, task, flags)` to compute
/// `vruntime` and `vdeadline` from saved vlag. Insert into
/// `tasks_timeline` (keyed by vdeadline). Eligibility is determined
/// dynamically by `pick_eevdf()`. Update `rq.nr_running` and `rq.load_weight`.
/// - **RT (FIFO)**: Insert at the tail of the priority's linked list
/// in the `rt_rq` bitmap-indexed array.
/// - **RT (RR)**: Same as FIFO, plus initialize the RR timeslice to
/// `sched_rr_timeslice_ms` (default: 100 ms).
/// - **Deadline**: Insert into the `dl_rq` red-black tree ordered by
/// absolute deadline. Update `dl_bw` accounting.
///
/// 3. **Update runqueue statistics**: Increment `rq.nr_running`.
/// Update `rq.load_weight` (sum of task weights for EEVDF load
/// balancing). Update `rq.avg_load` PELT (Per-Entity Load Tracking)
/// contribution.
///
/// 4. **NUMA statistics**: If the task's preferred NUMA node differs from
/// the runqueue's CPU NUMA node, increment `rq.nr_numa_foreign` (used
/// by the NUMA balancer to trigger migration).
///
/// # Flags
///
/// `flags: EnqueueFlags` controls placement behavior:
/// - `ENQUEUE_WAKEUP`: Task is waking from sleep. `place_entity()` uses
/// the saved `vlag` with PLACE_LAG inflation to position the task in
/// virtual time. No CFS-era "sleep bonus" — EEVDF uses lag-based
/// placement exclusively.
/// - `ENQUEUE_INITIAL`: New task (fork). `place_entity()` sets vruntime
/// to avg_vruntime and halves the virtual slice for the deadline.
/// - `ENQUEUE_MIGRATED`: Task was migrated from another CPU (skip cache
/// warmth assumptions).
/// - `ENQUEUE_RESTORE`: Re-enqueue after priority change (preserve
/// existing vruntime, do not recompute from vlag).
pub fn activate_task(rq: &mut RunQueue, task: &Task, flags: EnqueueFlags);
/// The inverse of activate_task: remove a task from its runqueue when
/// it blocks (sleep, wait, I/O). Updates rq.nr_running, load_weight,
/// and PELT statistics. The task's state is set to TASK_SLEEPING by
/// the caller before calling deactivate_task.
pub fn deactivate_task(rq: &mut RunQueue, task: &Task);
See also: Section 8.4 (Real-Time Guarantees) extends deadline scheduling with bounded-latency paths, threaded interrupts, and PREEMPT_RT-style priority inheritance for hard real-time workloads.
ML tuning: Key EEVDF parameters (`eevdf_weight_scale`, `migration_benefit_threshold`, `eas_energy_bias`, `preemption_latency_budget`) are registered in the Kernel Tunable Parameter Store and may be adjusted at runtime by Tier 2 AI/ML policy services via the closed-loop framework defined in Section 23.1. The scheduler emits `SchedObs` observations (task wakeup latency, EAS decisions, runqueue stats) that feed the `umka-ml-sched` Tier 2 service. All parameters revert to defaults within 60 seconds if the ML service stops sending updates.
7.2 Heterogeneous CPU Support (big.LITTLE / Intel Hybrid / RISC-V)¶
Modern SoCs are no longer symmetric. ARM big.LITTLE (2011+), Intel Alder Lake P-core/E-core (2021+), and RISC-V platforms with mixed hart types all present the scheduler with CPUs of different performance, power, and ISA capabilities. A scheduler that treats all CPUs as identical will either waste power (running background tasks on performance cores) or starve throughput (placing compute-heavy tasks on efficiency cores).
This section extends the scheduler with Energy-Aware Scheduling (EAS), per-CPU capacity tracking, and heterogeneous topology awareness.
7.2.1 CPU Capacity Model¶
Every CPU has a capacity value normalized to a 0–1024 scale, where the fastest core at its highest frequency = 1024. This is the fundamental abstraction that makes the scheduler heterogeneity-aware.
// umka-core/src/sched/capacity.rs
/// Per-CPU capacity descriptor.
/// Populated at boot from firmware tables (ACPI PPTT, devicetree, CPPC).
/// Updated at runtime when frequency changes.
pub struct CpuCapacity {
/// Maximum capacity of this CPU at its highest OPP (Operating Performance Point).
/// Normalized: fastest core in the system = 1024.
/// An efficiency core might be 512 (half the throughput of a performance core).
pub capacity: u32,
/// Original (boot-time) maximum capacity. Does not change.
pub capacity_max: u32,
/// Current capacity, adjusted for current frequency.
/// If a 1024-capacity core is running at 50% frequency, capacity_curr = 512.
/// Updated by cpufreq governor on frequency change.
///
/// **Memory ordering**: Cpufreq governor writes with `Release` after
/// updating the frequency hardware registers. Scheduler reads with
/// `Relaxed` in the misfit check (stale value for one tick is
/// acceptable; re-evaluated every load balance interval ~4ms).
/// EAS energy computation reads with `Acquire` to pair with the
/// governor's `Release`, ensuring placement decisions reflect the
/// actual frequency state.
pub capacity_curr: AtomicU32,
/// Core type classification.
pub core_type: CoreType,
/// Frequency domain this CPU belongs to.
/// All CPUs in a frequency domain share the same clock.
pub freq_domain: FreqDomainId,
/// ISA capabilities of this CPU.
/// On heterogeneous ISA systems (RISC-V), different cores may support
/// different extensions.
pub isa_caps: IsaCapabilities,
/// Microarchitecture ID (for Intel Thread Director).
/// Different core types have different uarch IDs.
pub uarch_id: u32,
}
/// Core type classification.
#[repr(u32)]
pub enum CoreType {
/// ARM Cortex-X/A7x, Intel P-core.
/// High single-thread performance, high power.
Performance = 0,
/// ARM Cortex-A5x, Intel E-core.
/// Lower performance, significantly lower power.
Efficiency = 1,
/// ARM Cortex-A7x mid-tier (e.g., Cortex-A78 in a system with X3 and A510).
Mid = 2,
/// Traditional SMP — all cores identical.
/// When all cores are Symmetric, EAS is disabled (unnecessary).
Symmetric = 3,
}
/// ISA capability flags.
/// On heterogeneous ISA systems, the scheduler must ensure a task only runs
/// on a CPU that supports the ISA features the task uses.
bitflags! {
pub struct IsaCapabilities: u64 {
// ARM
const ARM_SVE = 1 << 0; // Scalable Vector Extension
const ARM_SVE2 = 1 << 1; // SVE2
const ARM_SME = 1 << 2; // Scalable Matrix Extension
const ARM_MTE = 1 << 3; // Memory Tagging Extension
// x86
const X86_AVX512 = 1 << 16; // AVX-512 (P-cores only on some Intel)
const X86_AMX = 1 << 17; // Advanced Matrix Extensions (P-cores only)
const X86_AVX10 = 1 << 18; // AVX10 (unified AVX across core types)
// RISC-V
const RV_V = 1 << 32; // Vector extension
const RV_B = 1 << 33; // Bit manipulation
const RV_H = 1 << 34; // Hypervisor extension
const RV_CRYPTO = 1 << 35; // Cryptography extensions
}
}
/// Vector length metadata — companion to IsaCapabilities for variable-length
/// vector ISAs (ARM SVE/SVE2, RISC-V V). The bitflags above indicate *presence*
/// of the extension; this struct encodes the *vector register width* that the
/// thread actually uses, which determines migration constraints and XSAVE area size.
#[repr(C)]
pub struct VectorLengthInfo {
/// ARM SVE/SVE2 vector length in bits (128-2048, must be power of 2).
/// 0 = thread does not use SVE. Discovered per-core via `rdvl` at boot.
pub sve_vl_bits: u16,
/// RISC-V VLEN in bits (32-65536). 0 = thread does not use RVV.
/// Discovered per-hart via `vlenb` CSR at boot.
/// Uses u32 because the RISC-V V spec allows VLEN up to 65536 bits,
/// which equals u16::MAX + 1 and would overflow a u16.
pub rvv_vlen_bits: u32,
}
// kernel-internal, not KABI. Layout: 2 + 2(pad) + 4 = 8 bytes.
const_assert!(size_of::<VectorLengthInfo>() == 8);
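The layout comment can be verified standalone. A mirrored definition with the assertion spelled out (with `#[repr(C)]`, Rust inserts two bytes of padding after `sve_vl_bits` so that `rvv_vlen_bits` is 4-byte aligned):

```rust
use core::mem::{align_of, size_of};

/// Mirror of VectorLengthInfo above, reproduced so the layout claim
/// ("2 + 2(pad) + 4 = 8 bytes") can be checked in isolation.
#[repr(C)]
pub struct VectorLengthInfo {
    pub sve_vl_bits: u16,   // 2 bytes
    // 2 bytes of implicit padding here: rvv_vlen_bits needs 4-byte alignment
    pub rvv_vlen_bits: u32, // 4 bytes
}

fn main() {
    assert_eq!(size_of::<VectorLengthInfo>(), 8);
    assert_eq!(align_of::<VectorLengthInfo>(), 4); // driven by the u32 field
}
```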
Key design property: On a fully symmetric system (all cores CoreType::Symmetric),
the capacity model is a no-op. All CPUs have capacity 1024, all have the same ISA
capabilities. The scheduler fast path sees capacity_curr == 1024 on every CPU and
skips all heterogeneous logic. Zero overhead on symmetric systems.
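The `capacity_curr` update on frequency change is a proportional rescale of `capacity_max`, since capacity scales linearly with frequency within a core type. A sketch of the governor-side computation (the function name is illustrative):

```rust
// capacity_curr = capacity_max * freq / max_freq, computed in u64 to
// avoid overflow before narrowing back to the 0-1024 u32 scale.
fn capacity_at_freq(capacity_max: u32, freq_khz: u32, max_freq_khz: u32) -> u32 {
    ((capacity_max as u64 * freq_khz as u64) / max_freq_khz as u64) as u32
}

fn main() {
    // A 1024-capacity core running at 50% of its max frequency:
    assert_eq!(capacity_at_freq(1024, 1_200_000, 2_400_000), 512);
    // A 512-capacity efficiency core at full frequency keeps capacity 512:
    assert_eq!(capacity_at_freq(512, 1_600_000, 1_600_000), 512);
}
```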
7.2.2 Energy Model¶
The energy model describes the power cost of running a workload at each performance level on each core type. It is the foundation of Energy-Aware Scheduling.
// umka-core/src/sched/energy.rs
/// Maximum number of Operating Performance Points per frequency domain.
/// 32 entries exceeds all known hardware OPP tables (typical: 8-20).
/// No downsampling is needed for current hardware.
pub const MAX_OPP_ENTRIES: usize = 32;
/// Energy model for one frequency domain.
/// A frequency domain is a group of CPUs that share the same clock.
/// All CPUs in a domain have the same core type and OPP table.
pub struct EnergyModel {
/// Which frequency domain this model covers.
pub freq_domain: FreqDomainId,
/// Core type of CPUs in this domain.
pub core_type: CoreType,
/// Number of CPUs in this domain.
pub cpu_count: u32,
/// Operating Performance Points, sorted by frequency (ascending).
/// Each OPP maps a frequency to a capacity and power cost.
    /// Fixed-capacity inline array avoids heap allocation and keeps OPP
    /// data cache-local. MAX_OPP_ENTRIES (32) accommodates all known
    /// hardware OPP tables.
pub opps: ArrayVec<OppEntry, MAX_OPP_ENTRIES>, // MAX_OPP_ENTRIES = 32
}
/// One Operating Performance Point.
pub struct OppEntry {
/// Frequency in kHz.
pub freq_khz: u32,
/// Capacity at this frequency (0–1024 scale).
/// Capacity scales linearly with frequency within a core type.
pub capacity: u32,
/// Power consumption at this frequency (milliwatts).
/// This is the DYNAMIC power for one CPU running at 100% utilization.
/// Power scales roughly as V²×f (voltage² × frequency).
pub power_mw: u32,
}
OPP table population at boot: OPP tables are populated during boot Phase 2.3
(post-ACPI/DT parse). Sources: ACPI PPTT (x86), device tree opp-table node
(ARM/RISC-V), CPPC (ACPI 6.0+). For each frequency domain, the boot code
enumerates available OPPs and fills the opps: ArrayVec<OppEntry, MAX_OPP_ENTRIES>
in ascending frequency order. Fallback: if no firmware power data is available
for a frequency domain, power_mw is set to 0 for all OPPs in that domain
and EAS (Energy-Aware Scheduling) is disabled — the scheduler falls back to pure
EEVDF without energy-aware placement. An FMA warning is logged:
"EAS disabled: no power data for CPU cluster {cluster_id}". This ensures EAS
never makes placement decisions based on fabricated power data.
Note: RAPL (Running Average Power Limit) on x86 provides real-time power monitoring for capping/budgeting but does NOT feed into EAS OPP `power_mw` values. EAS uses firmware-provided static power estimates (ACPI PPTT / CPPC / device tree `opp-table`).
Example: ARM big.LITTLE system (Cortex-X3 + Cortex-A510)
Performance cores (Cortex-X3), freq_domain 0:
OPP 0: 600 MHz, capacity 256, power 80 mW
OPP 1: 1200 MHz, capacity 512, power 280 mW
OPP 2: 1800 MHz, capacity 768, power 650 mW
OPP 3: 2400 MHz, capacity 1024, power 1200 mW
Efficiency cores (Cortex-A510), freq_domain 1:
OPP 0: 400 MHz, capacity 100, power 15 mW
OPP 1: 800 MHz, capacity 200, power 50 mW
OPP 2: 1200 MHz, capacity 300, power 110 mW
OPP 3: 1600 MHz, capacity 400, power 200 mW
Observation: for a task with utilization 200 (out of 1024), EAS evaluates the lowest OPP that fits on each core type:
On a performance core, the lowest fitting OPP is OPP 0 (capacity 256): 80 mW.
On an efficiency core, the lowest fitting OPP is OPP 1 (capacity 200): 50 mW.
→ Efficiency core wins (50 mW < 80 mW). EAS places the task on the efficiency core.
But a task with utilization 500:
→ Doesn't fit on any efficiency core OPP (max capacity 400).
→ Must go to performance core. EAS picks lowest OPP that fits: OPP 1 (512), 280 mW.
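The fitting rule in these examples can be sketched as a standalone function. The types and helper below are illustrative (not the kernel's `OppEntry`); OPPs are assumed sorted by ascending capacity, as in the tables above:

```rust
#[derive(Clone, Copy)]
struct Opp {
    capacity: u32,
    power_mw: u32,
}

/// Lowest OPP (ascending capacity order) whose capacity covers `util`, if any.
fn lowest_fitting_opp(opps: &[Opp], util: u32) -> Option<Opp> {
    opps.iter().copied().find(|o| o.capacity >= util)
}

fn main() {
    // Cortex-X3 (performance) and Cortex-A510 (efficiency) tables from the text.
    let perf = [
        Opp { capacity: 256, power_mw: 80 },
        Opp { capacity: 512, power_mw: 280 },
        Opp { capacity: 768, power_mw: 650 },
        Opp { capacity: 1024, power_mw: 1200 },
    ];
    let eff = [
        Opp { capacity: 100, power_mw: 15 },
        Opp { capacity: 200, power_mw: 50 },
        Opp { capacity: 300, power_mw: 110 },
        Opp { capacity: 400, power_mw: 200 },
    ];
    // util = 200: efficiency OPP 1 (50 mW) beats performance OPP 0 (80 mW).
    assert_eq!(lowest_fitting_opp(&perf, 200).unwrap().power_mw, 80);
    assert_eq!(lowest_fitting_opp(&eff, 200).unwrap().power_mw, 50);
    // util = 500: no efficiency OPP fits; performance OPP 1 (280 mW) is chosen.
    assert!(lowest_fitting_opp(&eff, 500).is_none());
    assert_eq!(lowest_fitting_opp(&perf, 500).unwrap().power_mw, 280);
}
```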
7.2.3 Energy-Aware Scheduling Algorithm¶
EAS runs at task wakeup time (the most impactful scheduling decision). It answers: "Which CPU should this task run on to minimize total system energy while meeting performance requirements?"
// umka-core/src/sched/eas.rs
pub struct EnergyAwareScheduler {
/// Energy models for all frequency domains.
/// Boot-allocated contiguous array (one per frequency domain), not heap Vec.
/// Indexed by frequency domain ID. Length = number of frequency domains
/// discovered at boot. Stored as a `BootVec<EnergyModel>` (boot-time-allocated,
/// fixed-size-after-init, no heap pointer indirection on the wakeup fast path).
energy_models: BootVec<EnergyModel>,
/// Per-CPU utilization (PELT, see Section 7.1.5.4).
/// Boot-allocated contiguous array (one per CPU), not heap Vec.
/// Indexed by CPU ID. Stored as `BootVec<CacheAligned<AtomicU32>>` —
/// NOT `PerCpu<T>` because `find_energy_efficient_cpu()` reads remote
/// CPUs' utilization (`PerCpu<T>::get()` returns only the current CPU's
/// copy). `CacheAligned` ensures each entry sits on its own cache line
/// to prevent false sharing between CPUs updating their own utilization.
cpu_util: BootVec<CacheAligned<AtomicU32>>,
/// Threshold: a task is "misfit" if its utilization exceeds
/// the capacity of the CPU it's running on.
/// Misfit tasks are migrated to higher-capacity CPUs.
misfit_threshold: u32,
/// EAS is disabled on fully symmetric systems (no benefit).
enabled: bool,
}
impl EnergyAwareScheduler {
/// Find the most energy-efficient CPU for a waking task.
/// Called from EEVDF enqueue path when EAS is enabled.
///
/// Algorithm:
/// 1. For each frequency domain:
/// a. Compute the new utilization if this task were placed here.
/// b. Find the lowest OPP that can handle the new utilization.
/// c. Compute energy cost = OPP power × (new_util / capacity).
/// 2. Pick the frequency domain with the lowest energy cost.
/// 3. Within that domain, pick the CPU with the most spare capacity
/// (to avoid unnecessary frequency increases).
///
/// Complexity: O(domains × OPPs). Typically 2-3 domains × 4-6 OPPs = 8-18 iterations.
/// Time: ~200-500ns. Acceptable for task wakeup path (~2000ns total).
pub fn find_energy_efficient_cpu(&self, task_util: u32) -> CpuId {
let mut best_energy = u64::MAX;
let mut best_cpu = CpuId(0);
for model in &self.energy_models {
// Can this domain handle the task at all?
let max_capacity = model.opps.last().map(|o| o.capacity).unwrap_or(0);
if task_util > max_capacity {
// Task doesn't fit on this core type. Because `util_avg` is clamped
// to 1024, the highest-capacity domain always fits, so `best_cpu`
// is never left at its `CpuId(0)` default.
continue;
}
// Compute energy cost for placing task in this domain.
let energy = self.compute_energy(model, task_util);
if energy < best_energy {
best_energy = energy;
best_cpu = self.find_idlest_cpu_in_domain(model.freq_domain);
}
}
best_cpu
}
/// Estimate energy cost of adding `task_util` to a frequency domain.
///
/// OPP selection uses the maximum per-CPU utilization in the domain (not
/// the aggregate), because frequency is shared across all CPUs in a DVFS
/// domain — the OPP must be high enough for the most loaded CPU.
fn compute_energy(&self, model: &EnergyModel, task_util: u32) -> u64 {
// Find max per-CPU utilization in this domain. Assumes the task
// will be placed on the idlest CPU (same heuristic as
// find_idlest_cpu_in_domain), so task_util is added to that
// CPU's utilization when computing the domain's max.
let max_cpu_util = self.max_cpu_utilization(model.freq_domain, task_util);
// Find lowest OPP whose capacity can handle the busiest CPU.
// OPPs are sorted by ascending capacity; use binary search (O(log N)).
let idx = model.opps.partition_point(|o| o.capacity < max_cpu_util);
let opp = model.opps.get(idx).unwrap_or(model.opps.last().unwrap());
// Energy = power × (sum of all CPU utilizations) / capacity.
// Power is determined by the OPP (selected by max CPU), but energy
// is proportional to total work done across all CPUs in the domain.
// `domain_utilization()` returns the sum of PELT utilization_avg
// across all CPUs in the frequency domain (a u32 clamped to 1024
// per CPU × number of CPUs in the domain).
let domain_util = self.domain_utilization(model.freq_domain) + task_util;
(opp.power_mw as u64) * (domain_util as u64) / (opp.capacity as u64)
}
}
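A standalone numeric instance of the energy estimate above, using OPP values from the big.LITTLE example tables earlier in this section. The domain utilizations assumed here are illustrative:

```rust
/// energy = OPP power × total domain utilization / OPP capacity,
/// as in compute_energy() above. The OPP is selected by the busiest CPU;
/// energy scales with the total work done across the domain.
fn energy(opp_power_mw: u64, domain_util: u64, opp_capacity: u64) -> u64 {
    opp_power_mw * domain_util / opp_capacity
}

fn main() {
    // Busiest efficiency CPU would reach util 180, so OPP 1 (capacity 200,
    // 50 mW) suffices. Assumed total domain utilization incl. the task = 300.
    assert_eq!(energy(50, 300, 200), 75);
    // The same load on the performance domain needs only OPP 0 (capacity 256,
    // 80 mW), yet still costs more: 80 × 300 / 256 = 93 (integer division).
    assert_eq!(energy(80, 300, 256), 93);
    // EAS picks the cheaper domain: 75 < 93 → efficiency cores.
    assert!(energy(50, 300, 200) < energy(80, 300, 256));
}
```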
When EAS is NOT used (symmetric systems, or when all cores are of the same type),
the standard EEVDF load balancer runs instead. In that case EAS adds negligible
overhead: enabled == false, so the cost is a single well-predicted branch at the
top of the wakeup path.
7.2.4 Per-Entity Load Tracking (PELT)¶
EAS needs accurate, up-to-date utilization data for each task and each CPU. PELT provides this with an exponentially-decaying average that balances responsiveness with stability.
// umka-core/src/sched/pelt.rs
// ---------------------------------------------------------------------------
// PELT constants and decay lookup table (Gap 2.13)
// ---------------------------------------------------------------------------
/// One PELT period in nanoseconds.
///
/// Chosen as 1024 × 1000 = 1,024,000 ns ≈ 1.024 ms, so that 32 periods give
/// the canonical ~32.768 ms half-life window. The constant is not a power of
/// two, so `delta_ns / PERIOD_NS` and `delta_ns % PERIOD_NS` are not pure
/// shift/mask operations; the compiler strength-reduces division by this
/// constant to a multiply-high and shift, which is cheap on all supported
/// architectures.
pub const PELT_PERIOD_NS: u64 = 1_024_000;
/// Converged maximum load average.
///
/// A task that has been 100% runnable for effectively infinite time converges
/// to `LOAD_AVG_MAX`. This is the closed-form sum of the geometric series:
///
/// ```text
/// LOAD_AVG_MAX = 1024 × Σ_{n=0}^{∞} y^n = 1024 / (1 − y) ≈ 47742
/// ```
///
/// where `y = 0.5^(1/32) ≈ 0.97857` is the per-period decay factor.
/// Used to normalise the internal `*_sum` accumulators to the `*_avg` fields
/// (0–1024 scale): `util_avg = util_sum × 1024 / LOAD_AVG_MAX`.
pub const LOAD_AVG_MAX: u64 = 47742;
/// Number of periods for the geometric series to converge.
///
/// After `LOAD_AVG_MAX_N` periods at 100% utilisation the internal sum
/// reaches `LOAD_AVG_MAX` to within 1 ULP. Any periods beyond this index
/// need not be tracked — `decay_load()` returns 0 for `n ≥ LOAD_AVG_MAX_N`.
pub const LOAD_AVG_MAX_N: u64 = 345;
/// Sub-period fractional decay coefficients for PELT.
///
/// `RUNNABLE_AVG_YN_INV[i]` is the fixed-point (Q32) representation of `y^i`
/// where `y = 0.5^(1/32) ≈ 0.97857` and `i ∈ [0, 31]`:
///
/// ```text
/// RUNNABLE_AVG_YN_INV[i] = round(y^i × 2^32)
/// ```
///
/// Entry 0 = `2^32 - 1` (full weight, zero elapsed sub-periods).
/// Entry 31 = `round(y^31 × 2^32)` (nearly one full period of decay).
///
/// Used by `decay_load()` for the fractional-period component of decay:
///
/// ```text
/// val = (val * RUNNABLE_AVG_YN_INV[n % 32]) >> 32
/// ```
///
/// This avoids floating-point arithmetic at runtime; the table is computed
/// once at compile time from the analytic formula.
pub const RUNNABLE_AVG_YN_INV: [u32; 32] = [
0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a,
0xeac0c6e6, 0xe5b906e6, 0xe0ccdeeb, 0xdbfbb796,
0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46,
0xb504f333, 0xb123f581, 0xad583ee9, 0xa9a15ab4,
0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a,
0x8b95c1e3, 0x88980e80, 0x85aac367, 0x82cd8698,
];
/// Decay a PELT accumulator value by `n` elapsed periods.
///
/// Applies the compound decay factor `y^n` using integer arithmetic:
///
/// 1. If `n > LOAD_AVG_MAX_N` (345), return 0 — the value is fully decayed.
/// 2. Halve `val` for each complete group of 32 periods: `val >>= n / 32`.
/// Each group of 32 periods reduces the value by exactly 50% (`y^32 = 0.5`).
/// 3. Apply the remaining sub-period fractional decay using the lookup table:
/// `val = (val * RUNNABLE_AVG_YN_INV[n % 32]) >> 32`.
/// 4. Return the decayed value.
///
/// # Precision
///
/// The fixed-point multiply in step 3 is a Q32 multiply: the result is the
/// upper 32 bits of the 64-bit product. On 64-bit platforms this is a single
/// `mulhi` or equivalent instruction. On 32-bit platforms it requires a
/// 32×32→64 widening multiply.
///
/// # Usage
///
/// Called for each PELT accumulator (`load_sum`, `runnable_sum`, `util_sum`)
/// when a state transition spans `n ≥ 1` complete periods.
pub fn decay_load(val: u64, n: u64) -> u64 {
if n > LOAD_AVG_MAX_N {
return 0;
}
// Halve for each complete group of 32 periods.
let val = val >> (n / 32);
// Fractional sub-period decay via Q32 multiply with lookup table.
(val * RUNNABLE_AVG_YN_INV[(n % 32) as usize] as u64) >> 32
}
// ---------------------------------------------------------------------------
// PeltState — per-entity load tracking state
// ---------------------------------------------------------------------------
/// Per-Entity Load Tracking state.
///
/// Attached to every schedulable entity (task) and every CPU run queue.
/// Maintains exponentially-decaying averages of CPU utilisation, runnability,
/// and weighted load over a ~32 ms half-life window.
///
/// ## Internal representation
///
/// Three raw accumulators (`load_sum`, `runnable_sum`, `util_sum`) hold the
/// un-normalised geometric sums. Three derived averages (`load_avg`,
/// `runnable_avg`, `util_avg`) are the normalised 0–1024 values consumed by
/// EAS, load balancing, and cpufreq. The averages are recomputed from the
/// sums whenever a state transition occurs:
///
/// ```text
/// util_avg = util_sum * NICE_0_WEIGHT / LOAD_AVG_MAX (clamped to 1024)
/// runnable_avg = runnable_sum * NICE_0_WEIGHT / LOAD_AVG_MAX (clamped to 1024)
/// load_avg = load_sum * task.weight / LOAD_AVG_MAX
/// ```
///
/// `NICE_0_WEIGHT = 1024` and `LOAD_AVG_MAX = 47742`.
///
/// ## Half-life
///
/// The decay factor `y = 0.5^(1/32) ≈ 0.97857` per 1.024 ms period gives a
/// half-life of 32 periods ≈ 32.768 ms. A task that stops running drops to
/// 50% utilisation after ~32 ms and is effectively zero after ~345 periods
/// (~353 ms).
pub struct PeltState {
/// Raw load accumulator: `Σ(task.weight × runnable_time_in_period × y^n)`.
/// Not normalised. Divide by `LOAD_AVG_MAX` to obtain `load_avg`.
pub load_sum: u64,
/// Raw runnable accumulator: `Σ(runnable_time_in_period × y^n)`.
/// Counts time the entity was either running or waiting in the run queue.
/// Not normalised. Divide by `LOAD_AVG_MAX` to obtain `runnable_avg`.
pub runnable_sum: u64,
/// Raw utilisation accumulator: `Σ(running_time_in_period × y^n)`.
/// Counts only time the entity was executing on a CPU (not queued).
/// Not normalised. Divide by `LOAD_AVG_MAX` to obtain `util_avg`.
pub util_sum: u64,
/// Sub-period carry-forward in nanoseconds.
///
/// Nanoseconds elapsed since the start of the current (incomplete) period.
/// Preserved across state transitions so that sub-period time accumulates
/// correctly. Range: `[0, PELT_PERIOD_NS)`.
pub period_contrib: u32,
/// Normalised weighted load average (0 = idle, `task.weight` = fully loaded).
/// `load_avg = load_sum * task.weight / LOAD_AVG_MAX`.
/// Used by the load balancer and NUMA placement.
pub load_avg: u64,
/// Normalised runnable average (0–1024).
/// Includes both running and queued (waiting) time.
/// `runnable_avg = runnable_sum * NICE_0_WEIGHT / LOAD_AVG_MAX`.
pub runnable_avg: u64,
/// Normalised utilisation average (0–1024).
/// Pure execution time only, excluding queued time.
/// `util_avg = util_sum * NICE_0_WEIGHT / LOAD_AVG_MAX`.
/// This is the primary signal consumed by EAS and cpufreq.
pub util_avg: u64,
/// Monotonic time (nanoseconds since boot) up to which this entity's
/// activity has been accounted. Advanced by `update()` and used to
/// compute the elapsed delta on the next call.
pub last_update_time: u64,
}
impl PeltState {
/// Update PELT state with a new time sample using the canonical 3-phase
/// algorithm. This is the ONLY PELT update implementation — there is no
/// simplified version. The 3-phase decomposition correctly handles the
/// head partial period contribution at the old decay level, which is
/// critical for Linux accounting compatibility (`perf`, cgroup `cpu.stat`,
/// container runtimes reading `/proc/schedstat`).
///
/// Must be called at every scheduling event that changes entity state:
/// task switch (running→queued, queued→running), sleep (queued→off),
/// wake-up (off→queued), and scheduler tick. The caller must ensure
/// `running`, `runnable`, and `now` accurately reflect the entity's
/// state for the entire interval since the last call.
///
/// **State-transition contract**: Between consecutive calls, the entity's
/// state must be constant — exactly one of {running, runnable-but-not-running,
/// sleeping}. Calling `update()` mid-interval and then again with a different
/// state for the remainder is the correct pattern; calling `update()` with a
/// blended state is incorrect and will misattribute the `period_contrib`
/// carry-forward.
///
/// `running`: was this entity executing on a CPU for the entire interval?
/// `runnable`: was this entity on the run queue (running or waiting)?
/// Invariant: `running → runnable` (every running task is runnable).
/// `now`: current timestamp in nanoseconds (from `ktime_get_ns()`).
/// `task_weight`: the task's `sched_prio_to_weight` value (for `load_avg`).
///
/// **Classification**: Evolvable. The decay formula and accumulation logic
/// are policy code, hot-swappable via `EvolvableComponent`. The `PeltState`
/// struct layout is Nucleus (data). The invariant checker validates that any
/// replacement `update()` preserves the 3-phase decomposition and produces
/// PELT values within `[0, LOAD_AVG_MAX]`.
pub fn update(
&mut self,
running: bool,
runnable: bool,
now: u64,
task_weight: u64,
) {
debug_assert!(!running || runnable, "running implies runnable");
let delta_ns = now - self.last_update_time;
// Work in PELT fixed-point units (UNIT_NS nanoseconds each) so that one
// full period of 100% activity contributes exactly 1024 and the sums
// converge to LOAD_AVG_MAX (keeping the invariant `*_sum ∈ [0, LOAD_AVG_MAX]`
// and the normalisation formulas above exact). Only whole units are
// consumed; the sub-unit remainder is carried to the next call by advancing
// `last_update_time` by the consumed time rather than to `now`, mirroring
// Linux's `delta >>= 10` conversion.
const UNIT_NS: u64 = PELT_PERIOD_NS / 1024; // 1000 ns per unit
let delta_units = delta_ns / UNIT_NS;
if delta_units == 0 { return; }
self.last_update_time += delta_units * UNIT_NS;
// Period boundaries, accounting for the carry from the previous call.
// `period_contrib` holds the nanoseconds already accumulated in the
// current (incomplete) period; it is always a whole number of units.
let total_ns = self.period_contrib as u64 + delta_units * UNIT_NS;
let periods_crossed = total_ns / PELT_PERIOD_NS;
let tail_ns = total_ns % PELT_PERIOD_NS;
let contrib = if periods_crossed == 0 {
// No period boundary crossed: everything lands in the current
// period at weight y^0 = 1.
self.period_contrib = total_ns as u32;
delta_units
} else {
// Phase 1: the head completes the old partial period. That period
// (including the `period_contrib` portion already inside the sums)
// is now `periods_crossed` periods old, so both the existing sums
// and the head contribution are decayed by the full count. This
// matches Linux's `c1 = decay_load(d1, periods)`.
let head_units = (PELT_PERIOD_NS - self.period_contrib as u64) / UNIT_NS;
self.load_sum = decay_load(self.load_sum, periods_crossed);
self.runnable_sum = decay_load(self.runnable_sum, periods_crossed);
self.util_sum = decay_load(self.util_sum, periods_crossed);
let mut c = decay_load(head_units, periods_crossed);
// Phase 2: intermediate complete periods at weights y^1 .. y^(p-1):
// the geometric series over all crossed periods minus the undecayed
// first term. Matches Linux's
// `c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024`.
if periods_crossed > 1 {
c += accumulate_sum(periods_crossed) - 1024;
}
// Phase 3: the new partial period (tail), undecayed at y^0.
// Matches Linux's `c3 = d3`.
self.period_contrib = tail_ns as u32;
c + tail_ns / UNIT_NS
};
// load_sum accumulates UNWEIGHTED runnable time; the task weight is
// applied once during normalisation (load_avg). This matches Linux's
// `___update_load_sum()` where `load` is a 0/1 boolean.
if runnable {
self.load_sum += contrib;
self.runnable_sum += contrib;
}
if running {
self.util_sum += contrib;
}
// Recompute the normalised averages from the updated sums.
self.load_avg = self.load_sum * task_weight / LOAD_AVG_MAX;
self.runnable_avg = (self.runnable_sum * NICE_0_WEIGHT / LOAD_AVG_MAX).min(1024);
self.util_avg = (self.util_sum * NICE_0_WEIGHT / LOAD_AVG_MAX).min(1024);
}
}
/// Accumulate the decayed geometric series for `n` complete periods.
///
/// Returns the sum `Σ_{k=0}^{n-1} (1024 × y^k)`, which is the total
/// contribution of `n` complete periods of 100% activity to a PELT sum.
/// Implemented using the same two-step lookup as `decay_load()`:
///
/// ```text
/// // Full 32-period groups each contribute LOAD_AVG_MAX × (1 - y^32) = LOAD_AVG_MAX × 0.5
/// // but the incremental sum is easier to compute as LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, n).
/// accumulate_sum(n) = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, n)
/// ```
///
/// This identity holds because the converged sum minus the decayed tail is
/// exactly the contribution of `n` periods from a starting value of 0.
///
/// **Bit-identical note**: This identity may differ from Linux's
/// `runnable_avg_yN_sum` lookup table by +/- 1 due to fixed-point
/// rounding in the intermediate `decay_load()` computation. This is
/// acceptable: PELT values are exponentially-weighted averages consumed
/// by `perf`, `/proc/schedstat`, and `cpu.stat` — all of which display
/// them as percentages or averages where +/- 1 in the underlying
/// fixed-point sum is invisible. If exact Linux bit-compatibility is
/// required for a specific regression test, the `runnable_avg_yN_sum`
/// table can be added as a Phase 2 optimization (drop-in replacement,
/// same function signature).
pub fn accumulate_sum(n: u64) -> u64 {
LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, n)
}
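The half-life and convergence properties claimed above can be checked directly. This standalone sketch duplicates the table and arithmetic of `decay_load()` and `accumulate_sum()` so it runs on its own:

```rust
const LOAD_AVG_MAX: u64 = 47742;
const LOAD_AVG_MAX_N: u64 = 345;

// Q32 fixed-point y^i for i in 0..32, y = 0.5^(1/32) (same table as above).
const RUNNABLE_AVG_YN_INV: [u32; 32] = [
    0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a,
    0xeac0c6e6, 0xe5b906e6, 0xe0ccdeeb, 0xdbfbb796,
    0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
    0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46,
    0xb504f333, 0xb123f581, 0xad583ee9, 0xa9a15ab4,
    0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
    0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a,
    0x8b95c1e3, 0x88980e80, 0x85aac367, 0x82cd8698,
];

fn decay_load(val: u64, n: u64) -> u64 {
    if n > LOAD_AVG_MAX_N {
        return 0;
    }
    let val = val >> (n / 32); // halve per complete 32-period group
    (val * RUNNABLE_AVG_YN_INV[(n % 32) as usize] as u64) >> 32
}

fn accumulate_sum(n: u64) -> u64 {
    LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, n)
}

fn main() {
    // 32 periods is one half-life: the value halves (to within 1 ULP).
    assert_eq!(decay_load(1024, 32), 511);
    assert_eq!(decay_load(LOAD_AVG_MAX, 32), 23870);
    // Beyond 345 periods the value is fully decayed.
    assert_eq!(decay_load(LOAD_AVG_MAX, 400), 0);
    // One full period of 100% activity contributes 1024.
    assert_eq!(accumulate_sum(1), 1024);
    // accumulate_sum converges to LOAD_AVG_MAX for large n.
    assert_eq!(accumulate_sum(400), LOAD_AVG_MAX);
}
```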
The decay factor y ≈ 0.97857 gives a half-life of 32 periods (~32.768 ms). decay_load()
uses a precomputed table of y^n values for n = 0..31 (RUNNABLE_AVG_YN_INV) and
halves for each group of 32 periods. accumulate_sum(n) = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, n)
computes the geometric partial sum without floating-point. This matches Linux's PELT
semantics (to within the ±1 fixed-point rounding noted above) for tool compatibility.
Relationship to EAS: When a task wakes up, the scheduler reads task.pelt.util_avg
to know the task's CPU demand. EAS uses this to find the core type where the task fits
most efficiently. Without PELT, EAS would have no utilization data to work with.
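As a concrete instance of the normalization that feeds EAS (a standalone sketch; the kernel reads `task.pelt.util_avg` directly rather than recomputing it):

```rust
const LOAD_AVG_MAX: u64 = 47742;
const NICE_0_WEIGHT: u64 = 1024;

/// util_avg = util_sum × NICE_0_WEIGHT / LOAD_AVG_MAX, clamped to 1024,
/// mapping the raw geometric sum onto the 0-1024 capacity scale.
fn util_avg(util_sum: u64) -> u64 {
    (util_sum * NICE_0_WEIGHT / LOAD_AVG_MAX).min(1024)
}

fn main() {
    assert_eq!(util_avg(LOAD_AVG_MAX), 1024); // 100% runnable forever
    assert_eq!(util_avg(23871), 512);         // converged 50% duty cycle
    assert_eq!(util_avg(0), 0);               // long idle
}
```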
7.2.5 Frequency Domain Awareness and Cpufreq Integration¶
CPUs within a frequency domain share a clock — changing one CPU's frequency changes all of them. The scheduler must be aware of this grouping.
// umka-core/src/sched/cpufreq.rs
/// Frequency domain: a group of CPUs sharing a clock source.
pub struct FreqDomain {
/// Domain identifier.
pub id: FreqDomainId,
/// CPUs in this domain.
pub cpus: CpuMask,
/// Core type of all CPUs in this domain (always uniform within a domain).
pub core_type: CoreType,
/// Available OPPs for this domain. Fixed-capacity inline array avoids
/// heap allocation and keeps OPP data cache-local. 32 entries
/// exceeds all known hardware OPP tables (typical: 8-20).
/// No downsampling is needed for current hardware.
pub opps: ArrayVec<OppEntry, MAX_OPP_ENTRIES>, // MAX_OPP_ENTRIES = 32
/// Current OPP index (into `opps`).
pub current_opp: AtomicU32,
/// Aggregate utilization of all CPUs in this domain (sum of PELT util_avg).
/// Updated at scheduler tick.
pub domain_util: AtomicU32,
/// Cpufreq governor for this domain.
pub governor: CpufreqGovernor,
}
/// Cpufreq governor — decides when to change frequency.
pub enum CpufreqGovernor {
/// Schedutil: frequency tracks utilization (default for EAS).
/// New frequency = (util / capacity) × max_freq.
/// Tight integration with scheduler — runs from scheduler context.
Schedutil,
/// Performance: always run at max frequency.
Performance,
/// Powersave: always run at min frequency.
Powersave,
/// Ondemand: legacy userspace sampling (Linux compat).
Ondemand,
/// Conservative: like ondemand but ramps gradually.
Conservative,
}
Schedutil integration: On every scheduler tick, the schedutil governor reads the domain's aggregate utilization and adjusts frequency:
new_freq = (domain_util / domain_capacity) × max_freq
If domain has 4 CPUs at capacity 1024 each:
domain_capacity = 4096
If domain_util = 2048 (50% utilized):
new_freq = (2048 / 4096) × max_freq = 50% of max_freq
Frequency change latency: ~10-50 μs (hardware-dependent).
The governor rate-limits changes to avoid oscillation (~4ms minimum interval).
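The schedutil formula above, as a standalone sketch. A clamp for saturated domains is added here for illustration; utilization and capacity are on the PELT 0-1024-per-CPU scale, frequencies in kHz:

```rust
/// new_freq = (domain_util / domain_capacity) × max_freq, clamped to max.
fn schedutil_target_khz(domain_util: u64, domain_capacity: u64, max_freq_khz: u64) -> u64 {
    (domain_util * max_freq_khz / domain_capacity).min(max_freq_khz)
}

fn main() {
    // 4 CPUs × capacity 1024 = 4096; 50% utilized → 50% of max frequency.
    assert_eq!(schedutil_target_khz(2048, 4096, 2_400_000), 1_200_000);
    // Oversubscribed domain is clamped to the maximum frequency.
    assert_eq!(schedutil_target_khz(5000, 4096, 2_400_000), 2_400_000);
}
```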
7.2.6 Intel Thread Director (ITD) Integration¶
Intel Thread Director (Hardware Feedback Interface):
The HFI table is memory-mapped:
1. UmkaOS Core allocates a 4KB-aligned physical buffer at boot.
2. Writes the physical address to IA32_HW_FEEDBACK_PTR MSR (0x17D0).
3. Hardware fills the table with per-class performance/efficiency scores
using normal memory stores (not MSR writes).
4. When HFI data is updated, hardware fires a Thermal Interrupt
(bit 26 set in IA32_PACKAGE_THERM_STATUS).
5. The interrupt handler reads the updated table via normal memory loads.
Per-thread class ID: read via RDMSR from IA32_THREAD_FEEDBACK_CHAR
(per-logical-processor MSR, address 0x17D2). Each thread has a
hardware-assigned classification (e.g., integer-heavy, floating-point-heavy,
memory-bound) that informs EAS core assignment decisions.
Intel Thread Director is a hardware feature on Alder Lake+ that classifies running workloads and provides hints about which core type is optimal. The hardware monitors instruction mix in real-time and populates the HFI table in memory.
// umka-core/src/sched/itd.rs
/// Intel Thread Director hint (decoded from the memory-mapped HFI table).
/// Hardware populates the table via normal memory stores; UmkaOS reads it
/// on Thermal Interrupt (bit 26 of IA32_PACKAGE_THERM_STATUS).
/// Per-thread class ID is obtained via RDMSR IA32_THREAD_FEEDBACK_CHAR (0x17D2).
pub struct ItdHint {
/// Hardware's assessment: how much this task benefits from a P-core.
/// 0 = no benefit (pure memory-bound), 255 = maximum benefit (compute-bound).
pub perf_capability: u8,
/// Hardware's assessment: energy efficiency on an E-core.
/// 0 = poor efficiency on E-core, 255 = excellent efficiency on E-core.
pub energy_efficiency: u8,
/// Workload classification.
pub workload_class: ItdWorkloadClass,
}
#[repr(u8)]
pub enum ItdWorkloadClass {
/// Scalar integer code — runs well on E-cores.
ScalarInt = 0,
/// Scalar floating-point — runs well on E-cores.
ScalarFp = 1,
/// Vectorized (SSE/AVX) — may benefit from P-cores (wider execution units).
Vector = 2,
/// AVX-512 / AMX — P-core only (E-cores may lack these).
HeavyVector = 3,
/// Branch-heavy — benefits from P-core branch predictor.
BranchHeavy = 4,
/// Memory-bound — core type doesn't matter, memory is the bottleneck.
MemoryBound = 5,
}
Integration with EAS: ITD hints are a refinement. EAS uses PELT utilization to pick the energy-optimal core. ITD overrides this when hardware detects a mismatch:
EAS decision: task has low utilization (200/1024) → place on E-core.
ITD override: task is HeavyVector (AVX-512) → E-core lacks AVX-512 → force P-core.
EAS decision: task has high utilization (800/1024) → place on P-core.
ITD override: task is MemoryBound → P-core is wasted → suggest E-core.
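These override rules can be sketched as a small decision function. The enum subset and the `refine()` helper below are illustrative, not the actual UmkaOS API:

```rust
#[derive(Clone, Copy)]
enum WorkloadClass {
    ScalarInt,
    HeavyVector, // AVX-512 / AMX
    MemoryBound,
}

#[derive(Clone, Copy, PartialEq, Debug)]
enum Core {
    P,
    E,
}

/// Refine an EAS placement with an ITD hint: ISA constraints are hard
/// overrides; memory-bound work is steered toward E-cores.
fn refine(eas_choice: Core, class: WorkloadClass) -> Core {
    match class {
        WorkloadClass::HeavyVector => Core::P, // E-cores may lack AVX-512/AMX
        WorkloadClass::MemoryBound => Core::E, // a P-core is wasted on memory stalls
        _ => eas_choice,                       // keep the energy-optimal choice
    }
}

fn main() {
    // Low-utilization AVX-512 task: EAS says E-core, ITD forces P-core.
    assert_eq!(refine(Core::E, WorkloadClass::HeavyVector), Core::P);
    // High-utilization memory-bound task: EAS says P-core, ITD suggests E-core.
    assert_eq!(refine(Core::P, WorkloadClass::MemoryBound), Core::E);
    // Everything else keeps the EAS decision.
    assert_eq!(refine(Core::E, WorkloadClass::ScalarInt), Core::E);
}
```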
ITD hints are read from the memory-mapped HFI table on Thermal Interrupt (not per
scheduler tick). On interrupt: one memory load per table row (~4ns), no RDMSR needed
for the table itself. Per-thread class ID is fetched via RDMSR IA32_THREAD_FEEDBACK_CHAR
(0x17D2) on context switch: cost ~30ns per switch, only on Intel Alder Lake+.
On non-Intel or pre-Alder Lake: ITD is disabled, zero overhead.
Other architectures: ARM and RISC-V do not have an ITD equivalent. On ARM
big.LITTLE/DynamIQ, the scheduler relies on the capacity-dmips-mhz device tree
property and runtime PELT utilization to make core placement decisions (Section 7.2). On RISC-V heterogeneous harts, the riscv,isa device tree property and
per-hart ISA capability flags drive placement (Section 7.2). Neither architecture
provides hardware-level workload classification hints — the scheduler's software
heuristics (PELT + EAS) perform this role. This is architecturally acceptable: ITD
is an optimization (~15-25% better placement for mixed workloads on Intel hybrid),
not a correctness requirement.
7.2.7 Asymmetric Packing¶
On heterogeneous systems, idle CPU selection must be topology-aware:
Symmetric (traditional):
Spread tasks across all CPUs evenly for maximum parallelism.
Asymmetric (big.LITTLE):
Pack tasks onto efficiency cores first.
Only spill onto performance cores when efficiency cores are full
or when a task is too large (misfit).
Why: an idle performance core at its lowest OPP still draws more power
than a busy efficiency core. Packing onto efficiency cores first
minimizes total system power.
Misfit migration: A task is "misfit" if its utilization (pelt.util_avg) exceeds
the capacity of the CPU it's currently running on. Misfit tasks are migrated to a
higher-capacity CPU at the next load balance opportunity.
/// Check if a task is misfit on its current CPU.
pub fn is_misfit(task: &Task, cpu: &CpuCapacity) -> bool {
task.pelt.util_avg > cpu.capacity_curr.load(Ordering::Relaxed)
}
/// Misfit migration is checked at every load balance interval (~4ms).
/// If a task is misfit:
/// 1. Find the closest (topology-wise) CPU with enough capacity.
/// 2. Migrate the task.
/// 3. Mark the source CPU's rq->misfit_task flag for faster detection.
7.2.7.1 Hybrid Core Isolation Domain Asymmetry (x86-64, Intel Hybrid)¶
On Intel hybrid CPUs (Alder Lake, Raptor Lake, Meteor Lake), WRPKRU execution time differs significantly between P-cores and E-cores:
| Core Type | WRPKRU Cost (per switch) | Consumer Batch (N=12, amortized) | Impact on Tier 1 I/O |
|---|---|---|---|
| P-core (Golden Cove / Raptor Cove) | ~35 cycles | ~6 cycles/op | ~0.06% overhead on NVMe 4KB read (amortized) |
| E-core (Gracemont) | ~89 cycles | ~15 cycles/op | ~0.15% overhead on NVMe 4KB read (amortized) |
Note: The consumer loop performs one domain switch per batch (N operations dequeued from the ring). The amortized per-operation cost is WRPKRU_cost / N. At N≥12 (typical under NVMe load), the overhead is negligible on both core types.
The scheduler accounts for this asymmetry when placing isolation-heavy workloads
(tasks that frequently cross Tier 1 domain boundaries). Tasks with high domain-switch
rates are preferentially placed on P-cores where WRPKRU is ~2.5x faster. The
X86Errata::HYBRID_ASYMMETRIC flag (set at boot for all hybrid CPUs) enables this
placement heuristic. The ITD (Intel Thread Director) hardware hint
(Section 7.2) classifies
such workloads with a high HFI_CLASS that naturally maps to P-core preference.
7.2.8 Hierarchical Group Scheduling (cpu.weight Backing Mechanism)¶
The EEVDF scheduler as described above is flat: all tasks compete directly in a single
per-CPU EEVDF tree. Cgroups v2 cpu.weight requires hierarchical fair sharing —
tasks within a cgroup share CPU proportional to the cgroup's weight relative to sibling
cgroups, and within a cgroup, tasks share proportional to their individual nice weights.
UmkaOS implements hierarchical EEVDF using group scheduling entities (GroupEntity),
matching Linux's struct task_group / struct sched_entity hierarchy. Each cgroup
with cpu.weight configured has one GroupEntity per CPU in the parent's EEVDF tree.
/// Per-CPU scheduling entity for a cgroup in the parent's EEVDF tree.
///
/// Each cgroup with a cpu.weight has one `GroupEntity` per CPU. The entity
/// participates in the parent cgroup's (or root's) EEVDF tree as if it were
/// a single task, representing all runnable tasks in the child cgroup on that CPU.
///
/// The entity's scheduling parameters (vruntime, vdeadline, lag) are maintained
/// using the same EEVDF algorithm as individual tasks — only the weight comes
/// from `cpu.weight` instead of the nice-to-weight table.
pub struct GroupEntity {
/// Cgroup ID that this entity represents.
pub cgroup_id: u64,
/// Weight from `cpu.weight` (1–10000, default 100). Scales the entity's
/// virtual runtime accumulation in the parent's EEVDF tree, exactly as
/// a task's nice weight scales its vruntime in a flat tree.
pub weight: u32,
/// Virtual runtime in the parent's EEVDF tree. Accumulates inversely
/// proportional to `weight`: `vruntime += delta_exec_ns * NICE_0_WEIGHT / weight`
/// (multiply before divide, to avoid truncating the integer ratio).
pub vruntime: u64,
/// Virtual deadline in the parent's EEVDF tree.
/// `vdeadline = vruntime + calc_delta_fair(slice_ns, weight)`.
/// EEVDF computes eligibility dynamically from `avg_vruntime()`,
/// not from a stored field.
pub vdeadline: u64,
/// Accumulated lag for EEVDF eligibility in the parent's tree.
pub lag: i64,
/// Number of runnable tasks in this cgroup on this CPU.
/// AtomicU32 for lock-free reads by work stealing and load estimation.
/// When `nr_running` drops to zero, the entity is dequeued from the
/// parent's EEVDF tree (no empty groups occupy tree space).
pub nr_running: AtomicU32,
/// Back-pointer to the parent cgroup's `GroupEntity` (or `None` for root).
/// Forms the hierarchy chain for weight propagation.
pub parent: Option<&'static GroupEntity>,
/// The child EEVDF tree: tasks (and nested child GroupEntities) that
/// belong to this cgroup on this CPU. Uses `VruntimeTree` directly
/// (Section 7.1) — the shared base type containing the augmented
/// RB tree and two-accumulator state.
///
/// All accumulator-only EEVDF helpers (`avg_vruntime_update`,
/// `entity_key`, `__enqueue_entity`, `__dequeue_entity`) operate
/// on `&VruntimeTree` and work identically on this group sub-tree.
/// Root-only fields (`curr`, `next`, `bandwidth_timer`) are not
/// present — the currently running task is tracked solely by the
/// root per-CPU `EevdfRunQueue.curr`. Hierarchical pick descends
/// into `child_rq` and picks from the tree without a local `curr`.
pub child_rq: VruntimeTree,
/// PELT state for this group entity. Aggregates the utilization of all
/// tasks in the cgroup on this CPU. Used by EAS and load balancing.
pub pelt: PeltState,
}
Two-level pick algorithm. Task selection in a hierarchical EEVDF tree is recursive:
1. At the root per-CPU EEVDF tree, pick_eevdf() selects the eligible entity with the earliest virtual deadline. This entity may be either a bare EevdfTask (a task not in any cgroup with cpu.weight) or a GroupEntity.
2. If the selected entity is a GroupEntity, descend into its child_rq and repeat: pick the eligible entity with the earliest virtual deadline in the child tree. This entity may again be a GroupEntity (nested cgroups) or a leaf EevdfTask.
3. Repeat until a leaf EevdfTask is reached. This is the task to schedule.
The recursion depth equals the cgroup nesting depth (typically 2–4 levels: root → system.slice → service → container). Each level is an O(log n) tree walk, making the total pick cost O(D × log n) where D is the cgroup depth and n is the maximum tasks per level.
// `rq` is a VruntimeTree (or EevdfRunQueue.base at the root level).
pick_next_eevdf(rq):
entity = rq.tasks_timeline.pick_eevdf() // O(log n) augmented walk
if entity.is_group():
return pick_next_eevdf(entity.child_rq) // recurse into child VruntimeTree
else:
return entity.task // leaf: schedule this task
7.2.8.1 Virtual Runtime Propagation¶
When a task runs for `delta_exec_ns`:

1. The task's own `EevdfTask.vruntime` advances by
   `delta_exec_ns * NICE_0_WEIGHT / task_weight` (within the innermost group's EEVDF tree).
2. Each ancestor `GroupEntity` up to the root also advances its `vruntime` in its
   parent's tree by `delta_exec_ns * NICE_0_WEIGHT / group_weight`. This ensures the
   group's virtual time reflects actual CPU consumption, maintaining fairness among
   sibling groups.
3. The propagation is bottom-up and happens in the `task_tick()` and `put_prev_task()`
   paths, both of which run under the per-CPU runqueue lock. No additional locking is
   needed because all entities in the hierarchy chain reside on the same CPU.
/// Propagate vruntime from a task to all ancestor GroupEntities.
/// Called from update_curr() step 3b after advancing the task's vruntime.
///
/// Walks from the task's innermost cgroup up to the root, advancing each
/// GroupEntity's vruntime in its parent's EEVDF tree. All entities are on
/// this CPU's runqueue — no cross-CPU locking needed.
fn propagate_group_vruntime(ge: &mut GroupEntity, delta_ns: u64) {
let mut current = ge;
loop {
// Advance this GroupEntity's vruntime in its parent's tree.
let vdelta = calc_delta_fair(delta_ns, current.weight);
current.vruntime += vdelta;
// Update avg_vruntime accumulators in the parent tree.
if let Some(parent_rq) = current.parent_eevdf_rq() {
parent_rq.sum_w_vruntime +=
(vdelta as i64) * (current.weight as i64);
}
        // Walk up to the parent GroupEntity, stop at root. `parent_mut()` is
        // the mutable counterpart of the `parent` back-pointer (illustrative:
        // mutation is safe because the whole chain lives on this CPU and the
        // caller holds the runqueue lock).
        match current.parent_mut() {
            Some(p) => current = p,
            None => break,
        }
}
}
Weight mapping. The cpu.weight range (1–10000, default 100) is converted to EEVDF
weights using the same formula as Section 7.1:
group_weight = (cpu_weight * 1024) / 100. At the default cpu.weight = 100, a group
entity has EEVDF weight 1024 (= NICE_0_WEIGHT, equivalent to nice 0). Two sibling
cgroups with cpu.weight 100 and 200 get CPU in 1:2 ratio, regardless of how many
tasks each contains.
Enqueue / dequeue lifecycle:

- When the first task in a cgroup becomes runnable on a CPU: create (or reuse) the
  `GroupEntity` for that cgroup on that CPU, set `nr_running = 1`, and enqueue the
  entity into the parent's EEVDF tree.
- When a task in the cgroup wakes or forks on that CPU: increment `nr_running` and
  enqueue the task into the `GroupEntity.child_rq`. The group entity stays in the
  parent tree.
- When a task sleeps or exits: decrement `nr_running` and dequeue from `child_rq`.
  If `nr_running` reaches zero: dequeue the `GroupEntity` from the parent tree
  (preserving `lag` for re-enqueue fairness, same as the deferred dequeue mechanism
  for individual tasks).
Per-CPU storage. GroupEntity instances are stored in per-CPU XArrays keyed by
cgroup ID: RunQueueData.group_entities: XArray<GroupEntity>. This is O(1) lookup by
cgroup ID when a task is enqueued and the scheduler needs to find or create the group
entity for the task's cgroup on the local CPU.
// (Pseudo-code: this field is part of the RunQueueData struct definition, shown here for clarity)
/// Extension to RunQueueData for group scheduling.
impl RunQueueData {
/// Per-cgroup group scheduling entities on this CPU.
/// Keyed by `CgroupId` (integer key → XArray per collection policy).
/// Lazily populated: an entry exists only when at least one task in
/// the cgroup is runnable on this CPU.
pub group_entities: XArray<GroupEntity>,
}
Runtime cpu.weight propagation. When userspace writes a new value to a cgroup's
cpu.weight file, the kernel must propagate the weight change to every per-CPU
GroupEntity instance for that cgroup. Without explicit propagation, the per-CPU
entities retain the stale weight and the scheduler produces incorrect proportional
sharing. On tickless (nohz_full) cores, a stale weight persists indefinitely because
there is no periodic tick to trigger parameter re-evaluation.
/// Propagate a cpu.weight change to all per-CPU GroupEntity instances.
///
/// Called from the cgroupfs cpu.weight write handler (Section 17.2.3)
/// after the CpuController.weight AtomicU32 is updated.
///
/// The write handler holds the cgroup's `css_set_lock` (read side),
/// preventing concurrent task migration from racing with the weight update.
/// Per-CPU runqueue locks are acquired individually in ascending CPU order
/// to avoid ABBA deadlock with the scheduler's per-CPU lock ordering.
pub fn sched_group_set_weight(cgroup_id: u64, new_weight: u32) {
for cpu_id in 0..nr_cpus_online() {
let rq = per_cpu_runqueue(cpu_id);
let _guard = rq.lock();
if let Some(ge) = rq.group_entities.get_mut(cgroup_id) {
let old_weight = ge.weight;
ge.weight = new_weight;
// Recompute vdeadline from the new weight. The entity's current
// vruntime is preserved — only the rate of future vruntime
// accumulation changes. This matches Linux's reweight_entity()
// semantics: the group retains its accumulated lag (fairness debt)
// and only future scheduling quanta are scaled by the new weight.
//
// vdeadline = vruntime + calc_delta_fair(EEVDF_SLICE_NS, new_weight)
// EEVDF computes eligibility dynamically from avg_vruntime(), not
// from a stored field.
ge.vdeadline = ge.vruntime + calc_delta_fair(EEVDF_SLICE_NS, new_weight);
// If the entity is currently enqueued in the parent's EEVDF tree,
// update the tree's augmented min_vdeadline metadata (the weight
// change may have altered this entity's position in the tree).
if ge.nr_running.load(Ordering::Relaxed) > 0 {
rq.eevdf_tree_update_key(ge);
}
// If this CPU is in nohz_full (tickless) mode and the entity is
// currently running, send a reschedule IPI so the scheduler
// re-evaluates with the new weight. Without this, the running
// task continues at the old weight indefinitely.
if rq.is_nohz_full() && rq.curr_group_id() == Some(cgroup_id) {
rq.resched_ipi();
}
}
// If no GroupEntity exists for this cgroup on this CPU, no action
// is needed — the entity will be created with the new weight when
// the first task in this cgroup becomes runnable on this CPU.
}
}
Interaction with CBS bandwidth servers (Section 7.6): CBS
(cpu.guarantee) and group scheduling (cpu.weight) are orthogonal. A cgroup can
have both: cpu.weight determines its share of available CPU relative to siblings,
while cpu.guarantee sets a minimum floor. The CBS server operates at the cgroup
level — when CBS throttles a cgroup, the GroupEntity is removed from the parent
EEVDF tree (same as OnRqState::CbsThrottled). When CBS un-throttles, the entity is
re-enqueued with its preserved lag.
7.2.9 Cgroup Integration¶
Cgroups can constrain which core types a group of tasks may use:
/sys/fs/cgroup/<group>/cpu.core_type
# Allowed core types for this cgroup.
# "all" — any core type (default)
# "performance" — only P-cores (latency-critical workloads)
# "efficiency" — only E-cores (background/batch workloads)
# Multiple: "performance mid" — P-cores and mid-tier cores

/sys/fs/cgroup/<group>/cpu.capacity_min
# Minimum per-CPU capacity for tasks in this cgroup.
# Tasks will not be placed on CPUs with capacity below this value.
# Default: 0 (no minimum).
# Example: "512" — only run on CPUs with at least half maximum capacity.

/sys/fs/cgroup/<group>/cpu.capacity_max
# Maximum per-CPU capacity for tasks in this cgroup.
# Tasks will not be placed on CPUs with capacity above this value.
# Default: 1024 (no maximum).
# Example: "400" — only run on efficiency cores.
Use case examples:
# Kubernetes: latency-critical pod on P-cores only
echo "performance" > /sys/fs/cgroup/k8s-pod-frontend/cpu.core_type

# Background log processing: E-cores only (save P-cores for real work)
echo "efficiency" > /sys/fs/cgroup/k8s-pod-logshipper/cpu.core_type

# ML training: needs AVX-512 (P-cores on Intel, ISA-gated)
echo "512" > /sys/fs/cgroup/k8s-pod-training/cpu.capacity_min
7.2.10 RISC-V Heterogeneous Hart Support¶
RISC-V takes heterogeneity further: different harts (hardware threads) may have different ISA extensions. One hart may have the Vector extension (RVV), another may not. One hart may support the Hypervisor extension (H), another may not.
// umka-core/src/sched/riscv.rs
/// RISC-V ISA extension discovery per hart.
/// Read from the devicetree `riscv,isa` property for each hart.
///
/// Example devicetree:
/// cpu@0 { riscv,isa = "rv64imafdc_zba_zbb_v"; }; // Vector-capable
/// cpu@1 { riscv,isa = "rv64imafd"; }; // No vector
pub fn discover_hart_capabilities(dt: &DeviceTree) -> BootVec<IsaCapabilities> {
    // Parse each hart's `riscv,isa` string and set the corresponding
    // IsaCapabilities flags. The scheduler uses these to ensure tasks that
    // use Vector instructions only run on Vector-capable harts.
    // (Accessor names below are illustrative.)
    let mut caps = BootVec::new();
    for hart in dt.harts() {
        caps.push(IsaCapabilities::from_isa_string(hart.isa_string()));
    }
    caps
}
ISA gating in the scheduler:
Task affinity includes ISA requirements:
struct TaskAffinityHint {
/// ISA extensions this task requires (detected from ELF header
/// or set by userspace via prctl).
pub isa_required: IsaCapabilities,
/// Core type preference (from cgroup or auto-detected).
pub core_preference: CorePreference,
/// PELT utilization (for EAS).
pub util_avg: u32,
}
Scheduler check:
if !cpu.isa_caps.contains(task.affinity.isa_required) {
// This CPU lacks ISA extensions the task needs.
// Skip this CPU. Do NOT schedule here.
continue;
}
This prevents the illegal-instruction faults that result from scheduling a Vector task on a non-Vector hart — a silent correctness bug on Linux today. Linux began adding per-hart ISA detection in 6.2 and capability tracking in 6.4, but does not yet feed per-hart ISA awareness into the scheduler's task placement decisions: a Vector-tagged task can still be scheduled onto a non-Vector hart.
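The scheduler check above is a plain bitmask containment test; a minimal sketch:

```rust
/// A CPU is eligible only if it provides every ISA extension the task
/// requires (capability bitmasks; bit assignments illustrative).
fn cpu_satisfies(cpu_caps: u64, task_required: u64) -> bool {
    cpu_caps & task_required == task_required
}
```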
7.2.11 Topology Discovery¶
The scheduler builds its heterogeneous topology model from firmware:
Sources (checked in order):
1. ACPI PPTT (Processor Properties Topology Table):
— Provides core type, cache hierarchy, frequency domain.
— Available on ARM SBSA servers and Intel Alder Lake+.
2. ACPI CPPC (Collaborative Processor Performance Control):
— Provides per-CPU performance range (highest/lowest/nominal).
— The ratio highest_perf / nominal_perf indicates core type:
P-cores have higher highest_perf than E-cores.
— Used on Intel hybrid platforms.
3. Devicetree:
— ARM and RISC-V embedded systems.
— `capacity-dmips-mhz` property gives relative core performance.
— RISC-V `riscv,isa` property gives per-hart ISA extensions.
4. Intel CPUID leaf 0x1A (Hybrid Information):
— Reports core type (Atom = E-core, Core = P-core) for the running CPU.
— Each CPU reads its own CPUID at boot.
5. Fallback: runtime measurement
— If no firmware data: run a calibration loop on each CPU at boot.
— Measure instructions-per-second to derive relative capacity.
— Last resort. ~100ms at boot.
7.2.12 Linux Compatibility¶
All Linux interfaces for heterogeneous CPU systems are supported:
/sys/devices/system/cpu/cpuN/cpu_capacity
# Read-only. Capacity of CPU N (0–1024).
# Written by the kernel at boot. Used by userspace tools.
# e.g., "1024" for P-core, "512" for E-core.

/sys/devices/system/cpu/cpuN/topology/core_type
# "performance", "efficiency", or "unknown".
# New in Linux 6.x, used by systemd and schedulers.

/sys/devices/system/cpu/cpufreq/policyN/
# Standard cpufreq interface (per frequency domain):
# scaling_governor, scaling_cur_freq, scaling_max_freq, etc.

sched_setattr(pid, &attr):
# SCHED_FLAG_UTIL_CLAMP: set min/max utilization clamp.
# util_min/util_max affect EAS placement decisions.
# Fully supported with same semantics as Linux.

prctl(PR_SCHED_CORE, ...):
# Core scheduling (co-scheduling related tasks on the same core).
# Supported.

Kernel command line:
# isolcpus=2-3 (reserve CPUs, same as Linux)
# nohz_full=2-3 (tickless for RT, same as Linux)
# nosmt (disable SMT, same as Linux)
sched_ext compatibility: Linux 6.12+ allows BPF scheduling policies via sched_ext. UmkaOS provides the foundation for sched_ext through the eBPF subsystem (Section 19.2) and the sched_setattr interface (Section 19.1). Full sched_ext support requires the BPF struct_ops framework and sched_ext-specific kfuncs (scx_bpf_dispatch, scx_bpf_consume, etc.), which are part of the eBPF subsystem implementation. BPF schedulers that use sysfs topology files and sched_setattr for configuration will work without modification once the struct_ops infrastructure is in place.
7.2.13 Performance Impact¶
Symmetric systems (all cores identical):
EAS: disabled (enabled == false). One branch check at task wakeup: ~1 cycle.
Capacity model: all CPUs = 1024. No capacity checks affect scheduling decisions.
PELT: runs regardless (already exists in the standard EEVDF scheduler). Zero additional cost.
Total overhead vs Linux on symmetric: ZERO.
Heterogeneous systems (big.LITTLE, Intel hybrid):
EAS: ~200-500ns per task wakeup (iterate 2-3 domains × 4-6 OPPs).
Task wakeup total (without EAS): ~1500-2000ns.
Task wakeup total (with EAS): ~1700-2500ns.
Overhead: ~15-25% of wakeup path. Same as Linux EAS.
Large topology scaling: On 128-core NUMA systems with 4 NUMA nodes and 3
frequency domains per node, wakeup scanning covers up to 12 frequency
domains × 128 CPUs = up to 1536 capacity lookups in the worst case. With
cache misses on remote NUMA nodes, this can reach 4-40 μs — exceeding the
200-500 ns estimate for small systems. UmkaOS mitigates this with:
(1) per-domain capacity caches refreshed on topology change, not per-wakeup;
(2) early termination when a suitable idle CPU is found;
(3) the eas_max_domains sysctl (default 8) caps the search depth.
On systems where EAS latency exceeds 1 μs P99, set eas_max_domains=4 or
disable EAS entirely (kernel.sched_energy_aware=0). The 200-500 ns estimate
applies to systems with ≤64 cores and ≤4 frequency domains.
ITD class ID fetch (RDMSR IA32_THREAD_FEEDBACK_CHAR): ~30ns per context switch.
HFI table read: on Thermal Interrupt only (infrequent, hardware-triggered).
Combined overhead: negligible (context-switch-bound, not tick-bound).
Misfit check: one comparison per load balance (~4ms). Negligible.
Benefit: 20-40% power reduction for mixed workloads (measured on
Linux EAS vs non-EAS on ARM big.LITTLE). Same benefit expected.
This is not overhead — it's a power optimization. The CPU time spent
on EAS decisions is recovered many times over in power savings.
Summary: UmkaOS's heterogeneous scheduling has identical performance
to Linux EAS on the same hardware. The algorithms are the same (PELT,
EAS energy computation, schedutil). The implementation is clean-sheet
Rust but the scheduling mathematics are equivalent.
See also: Section 7.7 (Power Budgeting) extends EAS with system-level power caps and per-domain throttling. Section 7.7 specifies `thermal_update_capacity()` — the callback that updates `CpuCapacity.capacity` when thermal throttling reduces a CPU's maximum frequency, ensuring EAS placement decisions account for reduced throughput. Section 22.8 (Unified Compute Model) generalizes the CpuCapacity scalar into a multi-dimensional capacity vector spanning CPUs, GPUs, and accelerators.
7.3 Context Switch and Register State¶
7.3.1 Context Switch Procedure¶
The full context switch sequence when the scheduler selects a new task (next) to
replace the currently running task (prev). This is the hot path executed on every
involuntary preemption, voluntary yield, and explicit schedule() call. The caller
holds the local CPU's run queue lock (rq.lock).
context_switch(prev, next):
1. Update prev's scheduling class state (put_prev_task):
- EEVDF: advance prev.vruntime, update lag, re-insert into eligible/ineligible tree
- RT/DL: update runtime accounting, check timeslice expiry
2. perf_schedule_out(prev_ctx: &PerfEventContext)
— Stop and read all active PMU counters for prev's task context.
`prev_ctx` is the task's `PerfEventContext` (see [Section 20.8](20-observability.md#performance-monitoring-unit)).
Lock-free: iterates active[0..active_count] via Acquire load.
See [Section 20.8](20-observability.md#performance-monitoring-unit--context-switch-fast-path).
3. Save prev's general-purpose registers and stack pointer
— Architecture-specific: pushes callee-saved registers onto prev's kernel stack,
stores prev's stack pointer into prev.thread_struct.
4. Switch address space (switch_mm):
- x86-64: write next's PGD to CR3 (with PCID if available to avoid full TLB flush)
- AArch64: write TTBR0_EL1, issue TLBI if ASID differs
- Other arches: architecture-specific page table base register update
5. Save/restore extended register state (lazy XSAVE/XRSTOR):
— See Extended Register State Management below. Only dirty components are saved.
6. Restore next's general-purpose registers and stack pointer
— Pop callee-saved registers from next's kernel stack.
7. perf_schedule_in(next_ctx: &PerfEventContext)
— Program next's active PMU counters into hardware.
`next_ctx` is the incoming task's `PerfEventContext` (see [Section 20.8](20-observability.md#performance-monitoring-unit)).
Lock-free: same active[] iteration pattern as step 2.
See [Section 20.8](20-observability.md#performance-monitoring-unit--context-switch-fast-path).
8. Update per-CPU current task pointer
— Write next's Task pointer to the per-CPU CpuLocal block.
Steps 2 and 7 invoke the PMU context switch fast path defined in
Section 20.8. These calls are unconditional — they execute on
every context switch regardless of whether perf events are active. When no perf events
are open on the CPU (active_count == 0), both functions reduce to a single atomic
load (the Acquire load of active_count) and an immediate return — no PMU register
access, no loop iteration. The cost of this check is ~1ns per context switch, well
within the performance budget (Section 1.3).
Software event accounting. The context switch path also increments the
PERF_COUNT_SW_CONTEXT_SWITCHES software counter via a per-CPU atomic increment
between steps 1 and 2. This counter is always active (not gated by perf event
attachment) and feeds perf stat -e context-switches. Cost: one AtomicU64::fetch_add
with Relaxed ordering (~1ns).
7.3.1.1 PKRU Management During Context Switch (x86-64)¶
PKRU is excluded from XSAVE/XRSTOR and managed manually. This is a correctness requirement, not an optimization. On all x86 CPUs with PKU — and particularly on AMD Zen processors — the CPU can aggressively clear the XSTATE_BV bit for PKRU when the register value matches the init state (all zeros). A subsequent XRSTOR would then reset PKRU to 0 (init state), silently disabling all protection key enforcement and destroying Tier 1 isolation. Linux fixed this in v5.14 (Thomas Gleixner's 66-patch "Spring Cleaning" series) by completely decoupling PKRU from XSAVE.
UmkaOS stores PKRU in prev.saved_pkru (a u32 field) and switches it
manually during step 5 of the context switch procedure:
PKRU context switch (x86-64, between steps 4 and 5):
a. rdpkru → read current PKRU value
b. If PKRU != prev.saved_pkru: store current value to prev (prev may have
modified PKRU via pkey_mprotect or via Tier 1 domain entry/exit)
c. If next.saved_pkru != current PKRU: wrpkru(next.saved_pkru)
d. Update per-CPU CpuLocalBlock.pkru_shadow = next.saved_pkru
(keeps the isolation shadow in sync with the hardware register — required
for switch_domain() elision correctness; see
[Section 11.2](11-drivers.md#isolation-mechanisms-and-performance-modes--pkru-write-elision-mandatory))
**Memory ordering**: The shadow write uses Release ordering so that any
subsequent switch_domain() on this CPU (which reads the shadow with
Acquire) observes the updated value. The hardware register write
(WRPKRU) acts as a permission barrier — subsequent memory accesses
affected by PKRU will not execute (even speculatively) until WRPKRU
completes (Intel SDM). It is NOT a full serializing instruction
(unlike CPUID or WRMSR); it provides ordering only for PKU-protected
memory accesses. The Release ordering on the shadow store matches
the WRPKRU permission-barrier semantics for the elision comparison
to be correct.
e. The conditional write in step (c) avoids the ~23-89 cycle WRPKRU cost when
adjacent tasks share the same PKRU value (common for non-isolated workloads).
Critical: when reading the target task's PKRU value, always use the explicit next
pointer passed to the switch function — never a cached current pointer. Between Linux
5.2 and 5.13, using this_cpu_read_stable() to access current->flags produced stale
values when switching from a kernel thread to a user thread, leaving PKRU unrestored.
XSAVE exclusion: during XSAVE (step 5), the PKRU component (bit 9 in XCR0/XSTATE_BV) is masked from both save and restore operations. The kernel XSAVE mask explicitly clears bit 9 so that XSAVE/XRSTOR never touch PKRU. This mask is set once during boot and never modified.
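The boot-time mask computation is a single bit clear (constant name illustrative):

```rust
/// PKRU is XSAVE component 9. The kernel's XSAVE/XRSTOR mask is XCR0 with
/// that bit permanently cleared, computed once at boot and never modified.
const XFEATURE_MASK_PKRU: u64 = 1 << 9;

fn kernel_xsave_mask(xcr0: u64) -> u64 {
    xcr0 & !XFEATURE_MASK_PKRU
}
```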
7.3.1.2 Isolation Register Save/Restore (All Architectures)¶
The PKRU management above is x86-64-specific. Every architecture with Tier 1 hardware isolation has an equivalent isolation register that must be saved/restored during context switch (between steps 4 and 5, same position as PKRU). On architectures without Tier 1 isolation (RISC-V, s390x, LoongArch64), this step is a no-op.
AArch64 — POR_EL0 (Permission Overlay Register, ARMv8.9-A / ARMv9.4-A+ with FEAT_S1POE):
POR_EL0 context switch (AArch64 + POE, between steps 4 and 5):
a. MRS x0, POR_EL0 → read current permission overlay value
b. If x0 != prev.saved_por: store current value to prev (prev may have
changed POR_EL0 via Tier 1 domain entry/exit)
c. If next.saved_por != x0: MSR POR_EL0, next.saved_por; ISB
(ISB is required after POR_EL0 writes to ensure the permission overlay
takes effect before any subsequent memory access)
d. Update per-CPU CpuLocalBlock.por_shadow = next.saved_por
e. On AArch64 without POE hardware: this step is skipped entirely.
Tier 1 drivers on non-POE AArch64 use page-table + ASID isolation
which is handled in step 4 (switch_mm).
ARMv7 — DACR (Domain Access Control Register):
DACR context switch (ARMv7, between steps 4 and 5):
a. MRC p15, 0, r0, c3, c0, 0 → read current DACR value
b. If r0 != prev.saved_dacr: store current value to prev
c. If next.saved_dacr != r0: MCR p15, 0, next.saved_dacr, c3, c0, 0
d. Update per-CPU CpuLocalBlock.dacr_shadow = next.saved_dacr
e. ISB barrier after DACR write — ARM Architecture Reference Manual
(B3.7.2) requires ISB after DACR modification to ensure the new
domain permissions are visible to subsequent instructions. Without
ISB, speculative access could use the old DACR value. This matches
Linux kernel behavior (`isb()` after every `set_domain()` call).
PPC32 — Segment Registers:
Segment register context switch (PPC32, between steps 4 and 5):
a. For each active segment register (sr0–sr15):
mfsr rN, srX → read current segment register value
b. If any sr differs from prev.saved_sr[X]: store current values to prev
c. For each segment register that differs between prev and next:
mtsr srX, next.saved_sr[X]
d. Update per-CPU CpuLocalBlock.sr_shadow[0..16] = next.saved_sr[0..16]
e. isync after the last mtsr to ensure new segment translations are visible.
PPC64LE — Radix PID (POWER9+ Radix mode):
Radix PID context switch (PPC64LE, between steps 4 and 5):
a. mfspr r0, SPRN_PID → read current Radix PID value
b. If r0 != prev.saved_rpid: store current value to prev
c. If next.saved_rpid != r0: mtspr SPRN_PID, next.saved_rpid
d. Update per-CPU CpuLocalBlock.rpid_shadow = next.saved_rpid
e. isync after mtspr PID to synchronize the translation context.
RISC-V, s390x, LoongArch64: These architectures have no Tier 1 hardware isolation
register. Tier 1 is unavailable on these platforms; drivers use Tier 0 or Tier 2.
No isolation register save/restore is performed during context switch. The context
switch code gates the entire isolation register save/restore block on
arch::current::isolation::supports_fast_isolation() — a compile-time constant that
evaluates to false on RISC-V, s390x, and LoongArch64, ensuring the dead code is
eliminated by the compiler with zero runtime cost.
7.3.1.3 Return Stack Buffer (RSB) Fill¶
On every context switch, the kernel fills 32 RSB entries with safe return addresses
(pointing to a speculative capture gadget that executes LFENCE; JMP back to itself).
32 entries matches the RSB depth on all current Intel and AMD microarchitectures
(Skylake through Sapphire Rapids, Zen 1 through Zen 5). Future CPUs with deeper
RSBs would require updating this constant; however, on CPUs with eIBRS (see below),
RSB fill is skipped entirely, making the depth moot.
This prevents speculative execution from following stale RSB entries left by the previous
task, which could leak data via cache side channels.
RSB fill (x86-64, after step 6 — register restore):
— Execute 32 CALL instructions (each pushes a return address onto the RSB).
— Adjust RSP to discard the 32 pushed return addresses.
— Total cost: ~20-40 cycles (32 predicted-taken near calls + stack adjustment).
RSB fill is also required on every VM exit (KVM VMEXIT handler) — the guest may have
polluted the RSB. On CPUs with IBRS_ALL (eIBRS), RSB fill after context switch can
be skipped (eIBRS provides RSB protection), but RSB fill after VM exit remains mandatory
because eIBRS does not cover guest→host RSB pollution.
AArch64, ARMv7, and RISC-V do not have an RSB equivalent that requires filling. PPC64 uses a link stack that is flushed via the count cache flush sequence (Section 2.18).
7.3.1.4 LL/SC Reservation Clearing (ARM, RISC-V, PowerPC)¶
On architectures that implement CAS via load-linked/store-conditional (LL/SC) pairs, a thread preempted between the load-linked and the store-conditional retains a hardware reservation. When a different thread resumes on that CPU, its first store-conditional may spuriously succeed on a completely unrelated address if it happens to fall within the stale reservation granule. This is a correctness requirement, not an optimization.
The context switch path must clear any dangling reservation between steps 6 and 7:
| Architecture | Clear instruction | Mechanism |
|---|---|---|
| AArch64 | CLREX |
Explicit reservation clear instruction. Zero cost, no memory access. |
| ARMv7 | CLREX |
Same as AArch64. Present from ARMv6K onward. |
| RISC-V | SC to a dummy word-aligned location |
RISC-V has no CLREX equivalent. Execute SC.W zero, zero, (dummy_addr) where dummy_addr is a per-CPU scratch word. The SC unconditionally clears the reservation regardless of success or failure. |
| PPC32 | stwcx. to a per-CPU dummy word |
Same principle as RISC-V: stwcx. to a word-aligned scratch location clears the reservation. Must be in cacheable memory with M=1 (coherence required). |
| PPC64LE | stdcx. to a per-CPU dummy doubleword |
64-bit equivalent. Same requirement: cacheable, coherent memory. |
| x86-64 | Not needed | x86 uses CMPXCHG (single instruction, no reservation). |
| s390x | Not needed | s390x uses CS/CSG (compare-and-swap, single instruction). |
| LoongArch64 | LL.W/SC.W to per-CPU scratch |
LoongArch LL/SC semantics require a trailing SC to clear the LLbit. |
Dummy word placement: On RISC-V, PPC32, PPC64LE, and LoongArch64, the per-CPU dummy
word used for reservation clearing must be placed on its own cache line (64-byte
aligned) to avoid false sharing. If the dummy word shares a cache line with a hot
variable, the SC/stwcx./stdcx. will write to that cache line on every context
switch (even though the store value is discarded), causing unnecessary cache invalidation
traffic on other CPUs that share the cache line. The dummy word is declared as:
/// Per-CPU scratch word for LL/SC reservation clearing on context switch.
/// Cache-line aligned to prevent false sharing with adjacent per-CPU data.
#[repr(C, align(64))]
struct LlscDummy {
word: u64,
_pad: [u8; 56],
}
// kernel-internal, not KABI. Size = 64 bytes (one cache line).
const_assert!(size_of::<LlscDummy>() == 64);
Placed in CpuLocalBlock (each block is already cache-line aligned).
PowerPC-specific concern: The reservation granule size is implementation-dependent
(minimum 16 bytes, typically 128 bytes = one cache line on POWER8/9). A stale reservation
on one lock can cause a spurious stwcx. success on a different lock in the same granule.
The dummy stwcx. must target a scratch word that is guaranteed to be in a different
cache line from any real lock or atomic variable.
7.3.2 Extended Register State Management¶
Modern x86 CPUs carry large amounts of extended register state beyond the basic GPRs and x87 FPU. Blindly saving and restoring all of this on every context switch is wasteful — most threads never touch AVX-512 or AMX.
The cost problem:
| State component | Size | Save/restore cost |
|---|---|---|
| x87 + SSE (XMM) | 576 bytes | ~20 ns |
| AVX (YMM) | 256 bytes | ~10 ns |
| AVX-512 (ZMM) | 2048 bytes | ~80 ns |
| Intel AMX (tiles) | 8192 bytes | ~300 ns |
| ARM SVE (Z regs) | 256–8192 bytes (VL-dependent) | ~100–500 ns |
On a server running thousands of threads with microsecond-scale scheduling, 300ns of AMX save/restore overhead per switch is significant.
Lazy XSAVE policy:
UmkaOS tracks per-thread which extended state components have actually been used via
an xstate_used bitmap that mirrors the hardware XSTATE_BV field:
/// Per-thread extended state tracking.
struct ThreadXState {
/// Bitmap of XSAVE components this thread has used since creation.
/// Mirrors hardware XSTATE_BV layout (bit 0 = x87, bit 1 = SSE, bit 2 = AVX, etc.)
///
/// **Synchronization**: Plain `u64`, not atomic. This is safe because
/// `xstate_used` is per-thread and only accessed on the thread's current
/// CPU: the #NM handler sets bits (same CPU, IRQ context), and the
/// context switch reads them (same CPU, preemption disabled). No
/// cross-CPU visibility is needed.
xstate_used: u64,
/// Dynamically-allocated XSAVE area. Starts as None; allocated on first use.
/// Size depends on which components are enabled (CPUID leaf 0xD).
xsave_area: Option<XSaveArea>,
}
Context switch optimization:
On context_switch(prev, next):
1. Determine prev's dirty components: xstate_dirty = prev.xstate_used & XSTATE_MODIFIED_BITS
2. XSAVE only the dirty components (XSAVES with prev.xstate_used as the mask)
— If prev never used AVX-512, the ZMM state is NOT saved (zero cost)
3. XRSTOR next's components (XRSTORS with next.xstate_used as the mask)
— Components not in next's mask are initialized to their reset state by hardware
Modified Optimization:
- If a thread hasn't executed any AVX-512/AMX instruction since the last context switch,
the corresponding XSTATE_BV bits are clear — XSAVES skips those components automatically.
- The kernel does NOT need to track this manually; it falls out of the hardware XSAVE
optimized mode (XSAVES/XRSTORS with INIT optimization).
Init optimization (demand allocation):
Threads that never use extended SIMD pay zero XSAVE cost:
1. New thread starts with xstate_used = 0, xsave_area = None
2. CR0.TS bit is set (or XCR0 is restricted) — any use of SIMD triggers #NM
(Device Not Available exception)
3. #NM handler:
a. Allocate xsave_area (sized per CPUID leaf 0xD for the used components)
b. Set appropriate bits in xstate_used
c. Clear CR0.TS (or extend XCR0)
d. Return — the faulting SIMD instruction re-executes successfully
4. Subsequent SIMD use proceeds without trapping
This means a thread that only does integer arithmetic and memory copies never allocates an XSAVE area and never incurs XSAVE/XRSTOR cost during context switch.
AMX special case:
Intel AMX tile registers (8KB) are especially expensive. Additional optimization:
- AMX has a TILERELEASE instruction that explicitly marks tile state as unused.
- UmkaOS's kernel scheduler can hint userspace (via prctl or arch_prctl) to call
TILERELEASE when exiting a compute-intensive section, so the next context switch
doesn't save 8KB of dead tile state.
- If a thread hasn't used AMX tiles in the last N context switches (configurable,
default N=8), the kernel deallocates the AMX portion of the XSAVE area to reclaim
memory.
ARM SVE/SME (AArch64):
ARM's Scalable Vector Extension has a variable vector length (128–2048 bits). The same lazy-allocation strategy applies, with ARM-specific mechanisms:
SVE state components and sizes (VL-dependent):
| Component | Size at VL=128 | Size at VL=512 | Size at VL=2048 |
|------------------|----------------|----------------|-----------------|
| Z registers (Z0-Z31) | 512 bytes | 2048 bytes | 8192 bytes |
| P predicates (P0-P15) | 32 bytes | 128 bytes | 512 bytes |
| FFR (first-fault) | 2 bytes | 8 bytes | 32 bytes |
| Total | 546 bytes | 2184 bytes | 8736 bytes |
Lazy SVE allocation policy:
1. New thread starts with SVE disabled (CPACR_EL1.ZEN = 0b00).
2. First SVE instruction triggers #UND (EL1 undefined instruction trap).
3. #UND handler:
a. Read current VL from ZCR_EL1 (or inherit from parent thread).
b. Allocate SVE save area sized for current VL.
  c. Set CPACR_EL1.ZEN = 0b11 (no trapping; SVE enabled at EL0 and EL1).
d. Return — the faulting SVE instruction re-executes.
4. Context switch saves/restores only if thread has SVE enabled:
a. Check CPACR_EL1.ZEN — if SVE was never used, skip (zero cost).
b. If used: SVE_ST (store Z/P/FFR) and SVE_LD (load) to save area.
c. Cost: proportional to VL, not fixed. VL=128: ~50ns. VL=512: ~200ns.
VL management:
- Per-thread VL via prctl(PR_SVE_SET_VL, new_vl).
- VL change takes effect on next context restore (no immediate effect).
- If new_vl > old_vl, the save area is reallocated (grown, not shrunk).
- System default VL set via sysctl: kernel.sve_default_vl = 256.
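The VL-dependent sizing and the grow-only reallocation rule fall directly out of the table above. A sketch, with hypothetical function names:

```rust
/// SVE save-area size for a vector length given in bits: 32 Z registers
/// of VL bits each, 16 predicates of VL/8 bits each, one FFR of VL/8 bits.
fn sve_area_bytes(vl_bits: usize) -> usize {
    32 * (vl_bits / 8) + 16 * (vl_bits / 64) + vl_bits / 64
}

/// PR_SVE_SET_VL handling: the save area is grown when the new VL needs
/// more space, but never shrunk, matching the policy above.
fn sve_capacity_after_set_vl(current_capacity: usize, new_vl_bits: usize) -> usize {
    current_capacity.max(sve_area_bytes(new_vl_bits))
}
```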
SME (Scalable Matrix Extension, ARMv9.2) provides matrix tiles analogous to AMX:
SME state:
- ZA tile register: (SVL/8) x (SVL/8) bytes, where SVL is the Streaming
Vector Length in BITS. At SVL=512 bits: 64x64 = 4KB. At the maximum
SVL=2048 bits: 256x256 = 64KB.
- Streaming SVE mode (SSVE): uses SVE registers at streaming VL.
Lazy SME allocation:
1. SMSTART (enter streaming mode) traps if SMCR_EL1.ENA = 0.
2. Handler allocates ZA storage and enables SME.
3. SMSTOP (exit streaming mode) marks ZA as inactive.
If ZA is inactive for N switches (default 4), deallocate ZA storage.
This matters for memory pressure: ZA at SVL=2048 is 64KB per thread.
Context switch for SME:
- Check PSTATE.SM (streaming mode active) and PSTATE.ZA (ZA active).
- If neither: zero cost.
- If ZA active: save ZA tile (up to 64KB at max SVL=2048). This is expensive.
- The scheduler deprioritizes SME-heavy threads from migration to minimize
ZA save/restore on context switch (locality preference).
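The ZA sizing and the idle-deallocation rule (threshold default 4) can be sketched as follows; the helper names are hypothetical:

```rust
/// ZA tile storage: (SVL/8) x (SVL/8) bytes, with SVL in bits.
fn za_tile_bytes(svl_bits: usize) -> usize {
    (svl_bits / 8) * (svl_bits / 8)
}

/// Deallocate ZA storage once it has sat inactive (post-SMSTOP) for
/// `threshold` context switches, per the policy above.
fn should_free_za(inactive_switches: u32, threshold: u32) -> bool {
    inactive_switches >= threshold
}
```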
ARMv7 VFP/NEON:
ARMv7 extended register state is simpler than x86 or AArch64 — only VFP (Vector Floating Point) and NEON (SIMD) use extended registers:
ARMv7 extended state:
| Component | Size | Save/restore cost |
|------------------|------------|-------------------|
| VFP/NEON (D0-D31)| 256 bytes | ~15-30 ns |
| FPSCR | 4 bytes | ~5 ns |
Total: 260 bytes per thread. Always the same size (no variable-length
extensions like SVE or AVX-512).
Lazy allocation policy for ARMv7:
1. New thread starts with VFP/NEON disabled (FPEXC.EN = 0).
2. First VFP/NEON instruction triggers #UND trap.
3. Handler:
a. Allocate 260-byte VFP save area.
b. Set FPEXC.EN = 1 (enable VFP/NEON).
c. Return — faulting instruction re-executes.
4. Context switch: VSTM/VLDM to save/restore D0-D31 + FPSCR.
Cost is fixed (~30ns) regardless of which registers were used.
ARMv7 has no equivalent of XSAVE's selective save — all 32 double-word
registers are saved/restored as a unit. The fixed 260-byte size means no
dynamic allocation complexity. Threads that never use floating-point or
NEON pay zero VFP save/restore cost (same lazy trap approach as x86 #NM).
RISC-V Vector Extension (RVV):
RISC-V Vector extension (RVV 1.0, ratified 2021) has variable vector length (VLEN: 128–65536 bits), making it the most flexible — and most complex to manage — of any supported architecture:
RVV state components:
| Component | Size at VLEN=128 | Size at VLEN=256 | Size at VLEN=1024 |
|---------------------|------------------|------------------|-------------------|
| V registers (v0-v31)| 512 bytes | 1024 bytes | 4096 bytes |
| vtype, vl, vstart | 24 bytes | 24 bytes | 24 bytes |
| vcsr (vxrm, vxsat) | 8 bytes | 8 bytes | 8 bytes |
| Total | 544 bytes | 1056 bytes | 4128 bytes |
Lazy RVV allocation policy:
1. New thread starts with Vector disabled (mstatus.VS = Off).
2. First vector instruction triggers illegal-instruction trap.
3. Handler:
a. Read VLEN from hart capabilities (discovered at boot from DT or CSR).
b. Allocate vector save area: 32 × (VLEN/8) + overhead bytes.
c. Set mstatus.VS = Initial (vector state clean).
d. Return — faulting instruction re-executes.
4. Context switch:
a. Check mstatus.VS — if Off, skip entirely (zero cost).
b. If Dirty: save all 32 V registers using 4× vs8r.v (v0-v7, v8-v15,
v16-v23, v24-v31). The RVV spec maximum whole-register store is 8
registers per instruction; there is no vs32r.v.
c. If Clean: skip save (state hasn't changed since last restore).
d. Restore: 4× vl8re8.v to load all 32 registers from next thread's area.
Total: 8 whole-register instructions (4 stores + 4 loads).
Set mstatus.VS = Clean after restore.
Per-hart VLEN variation:
On heterogeneous RISC-V systems, different harts may have different VLEN.
The scheduler tracks per-hart VLEN (discovered at boot). A thread that
uses vector instructions with VLEN=256 can only run on harts with
VLEN >= 256. This is integrated with the ISA-gating mechanism
(Section 7.1.5.9): VectorLengthInfo (companion to IsaCapabilities) encodes
per-thread VLEN. The scheduler uses rvv_vlen_bits to constrain hart placement.
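The sizing and the hart-placement constraint can be sketched as plain Rust (hypothetical helper names; the `>=` rule follows the description above):

```rust
/// RVV save-area size: 32 V registers of VLEN/8 bytes each, plus
/// vtype/vl/vstart (24 bytes) and vcsr (8 bytes).
fn rvv_area_bytes(vlen_bits: usize) -> usize {
    32 * (vlen_bits / 8) + 24 + 8
}

/// A thread bound to `thread_vlen_bits` (0 = never used vectors) may
/// only be placed on harts providing at least that VLEN.
fn hart_can_run(thread_vlen_bits: usize, hart_vlen_bits: usize) -> bool {
    thread_vlen_bits == 0 || hart_vlen_bits >= thread_vlen_bits
}
```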
Alignment with scheduler hints:
The scheduler already tracks ISA capability usage per-thread (Section 7.2,
IsaCapabilities bitflags including X86_AVX512, X86_AMX, ARM_SME).
The XSAVE policy uses the same flags:
- A thread with X86_AVX512 set in its IsaCapabilities is known to use AVX-512.
The scheduler places it on a P-core (avoiding frequency throttling on E-cores).
- The XSAVE subsystem uses the same flag to pre-allocate the ZMM XSAVE area,
avoiding the #NM trap latency on the first AVX-512 instruction for threads that
are known to use it (e.g., after an execve of a binary with AVX-512 in its ELF
.note.gnu.property).
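One way to express that pre-allocation decision, assuming hypothetical flag constants standing in for the IsaCapabilities bitflags named in the text:

```rust
/// Hypothetical stand-ins for the IsaCapabilities flags mentioned above.
const X86_AVX512: u32 = 1 << 0;
const X86_AMX: u32 = 1 << 1;

/// Components to pre-allocate at execve so threads known to use wide
/// SIMD never take the first-use #NM trap. Bit positions follow the
/// XSAVE layout (x87/SSE/AVX in bits 0-2, AVX-512 in bits 5-7,
/// AMX in bits 17-18).
fn preallocated_xstate(isa_caps: u32) -> u64 {
    let mut mask: u64 = 0;
    if isa_caps & X86_AVX512 != 0 {
        mask |= 0b1110_0111; // x87/SSE/AVX plus the three AVX-512 components
    }
    if isa_caps & X86_AMX != 0 {
        mask |= 0b11 << 17; // TILECFG + tile data
    }
    mask
}
```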
7.3.2.1 Saved State Composition by Architecture¶
This subsection enumerates exactly which registers UmkaOS saves and restores on each supported architecture during a context switch. For every architecture, state is divided into three categories: (1) general-purpose and control registers that are always saved; (2) extended floating-point/SIMD state that is saved lazily (only when the thread has actually used the relevant unit); and (3) debug registers that are saved only when hardware breakpoints are active. The lazy FP/SIMD policy is described in the preceding subsection; the per-architecture enable bits that control laziness are called out explicitly below.
x86-64:
Always saved (integer/control registers):
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | RAX, RBX, RCX, RDX, RSI, RDI, RBP, R8–R15 | 15 explicit GPRs; RSP is implicit in the kernel stack switch |
| Stack pointer | RSP | Saved via kernel stack pointer swap in the switch stub |
| Instruction pointer | RIP | Saved via the call/ret discipline in the switch stub; not written explicitly |
| Flags | RFLAGS | Saved with pushfq/popfq |
| TLS base registers | FS.base, GS.base | Written via WRMSR/RDMSR (IA32_FS_BASE / IA32_GS_BASE); userspace TLS and kernel per-CPU pointer respectively. SWAPGS handles the kernel↔user GS.base exchange on entry/exit |
Extended state (lazy — saved only when xstate_used bitmap indicates use):
The extended state area is allocated as a single XSAVE-formatted buffer, size
determined at boot from CPUID leaf 0xD subleaf 0 (ECX = full XSAVE area size for
all enabled components). UmkaOS uses XSAVES/XRSTORS (compacted format) to save
only the components indicated by the task's xstate_used bitmap.
| Component | Registers | XSAVE component bit | Trigger |
|---|---|---|---|
| x87 FPU | ST0–ST7, FIP, FDP, FOP, FCW, FSW, FTW | Bit 0 | Any x87 instruction |
| SSE | XMM0–XMM15, MXCSR | Bit 1 | Any SSE/SSE2 instruction |
| AVX | YMM0–YMM15 upper 128-bit halves | Bit 2 | Any AVX instruction |
| AVX-512 opmask | K0–K7 | Bit 5 | Any AVX-512 masked operation |
| AVX-512 ZMM hi256 | ZMM0–ZMM15 upper 256-bit halves | Bit 6 | Any AVX-512 ZMM instruction |
| AVX-512 ZMM hi16 | ZMM16–ZMM31 full 512-bit | Bit 7 | Any AVX-512 ZMM16–31 instruction |
| Intel AMX tile config | TILECFG | Bit 17 | LDTILECFG instruction |
| Intel AMX tile data | Up to 8 tiles × 1024 bytes = 8192 bytes | Bit 18 | Any AMX tile compute instruction |
Whether AVX, AVX-512, or AMX are present is determined at boot from CPUID leaf 0x7 and leaf 0xD; UmkaOS enables only the components reported by the hardware.
Debug registers (lazy — saved only when hardware breakpoints are active):
DR0–DR3 (break address), DR6 (status), DR7 (control). Saved on context switch out
and restored on context switch in for any thread that has set hardware breakpoints
(indicated by a per-task debug_active flag). DR4 and DR5 are aliases of DR6 and
DR7; DR4/DR5 are not saved separately.
AArch64:
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | X0–X30 | 31 integer registers; XZR (X31) is hardwired zero and never saved |
| User stack pointer | SP_EL0 | User-mode stack pointer; saved as a plain 64-bit value |
| Instruction pointer | PC | Saved via the ret target in the switch stub (stored in the thread's saved x30/LR slot pointing to the resume label) |
| Process state | SPSR_EL1 | Saved process state register; encodes NZCV flags, DAIF mask, execution state, SP selection |
| User TLS pointer | TPIDR_EL0 | User-readable thread pointer register; holds glibc TLS base |
UmkaOS's per-CPU kernel pointer lives in TPIDR_EL1; it is not a per-thread value
and is not saved/restored on context switch.
Extended state (lazy):
NEON/FP state is controlled by CPACR_EL1.FPEN. If FPEN is set to 0b00
(trapping), any FP/NEON instruction from EL0 or EL1 takes a trap that triggers
allocation and enable. Once enabled, NEON/FP is always saved (there is no
hardware equivalent to XSAVE's component-level selective save on non-SVE
AArch64).
| Component | Registers | Save size | Enable bit | Trigger |
|---|---|---|---|---|
| NEON/FP | V0–V31 (128-bit each), FPSR, FPCR | 528 bytes | CPACR_EL1.FPEN | Any FP or NEON instruction |
| SVE (FEAT_SVE) | Z0–Z31 (variable, up to 2048 bits each), P0–P15 (predicates), FFR | VL-dependent (see Section 7.1.6) | CPACR_EL1.ZEN | Any SVE instruction; first use traps to #UND |
| SME (FEAT_SME) | ZA tile array (SVL-dependent, up to 64 KB at SVL=2048 bits), streaming SVE state | SVL-dependent | SMCR_EL1.ENA | SMSTART instruction; traps if ENA=0 |
SVE and SME presence is determined from CPUID registers ID_AA64PFR0_EL1 and
ID_AA64SMFR0_EL1 at boot. SVE vector length (VL) is read from ZCR_EL1; it may
differ per-CPU cluster on heterogeneous SoCs, which is reflected in
IsaCapabilities. SME streaming vector length (SVL) is read from SMCR_EL1.
Debug registers (lazy):
DBGBVR0–DBGBVR15 (breakpoint value registers) and DBGBCR0–DBGBCR15 (breakpoint
control registers), plus DBGWVR0–DBGWVR15 and DBGWCR0–DBGWCR15 (watchpoints).
The number of implemented breakpoints and watchpoints (up to 16 each) is read
from ID_AA64DFR0_EL1 at boot. Saved only when the per-task debug_active flag
is set.
ARMv7:
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | R0–R14 | 15 integer registers; R15 (PC) is handled by the switch stub's bx lr return |
| Process state | CPSR | Current Program Status Register (condition flags, mode bits, interrupt masks) |
| User TLS pointer | TPIDRURW | User-read/write thread pointer register; holds glibc TLS base |
Extended state (lazy):
VFP/NEON state is controlled by the FPEXC.EN bit. With EN=0, any VFP or NEON
instruction from any privilege level traps to the undefined-instruction handler,
which allocates the save area and sets EN=1.
| Component | Registers | Save size | Enable bit | Trigger |
|---|---|---|---|---|
| VFP/NEON | D0–D31 (64-bit each), FPSCR | 260 bytes | FPEXC.EN | Any VFP or NEON instruction |
ARMv7 has no hardware equivalent to XSAVE's component-level selective save: all 32
doubleword registers are saved and restored as a unit using VSTMIA/VLDMIA.
The fixed 260-byte save area (32 × 8 + 4 FPSCR) is allocated on first VFP/NEON use and is never
resized. Threads that never use floating-point or NEON pay zero VFP save cost.
Presence of the VFP and NEON units is detected from FPSID and MVFR0 at boot. Some ARMv7 implementations (e.g., Cortex-M targets) omit VFP entirely; UmkaOS skips all VFP save/restore logic on those cores.
Debug registers (lazy):
BVR0–BVR15 (Breakpoint Value Registers) and BCR0–BCR15 (Breakpoint Control
Registers), plus WVR/WCR pairs for watchpoints. Count is read from DBGDIDR at
boot. Saved only when debug_active is set.
RISC-V (RV64GC):
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | x1–x31 | 31 integer registers; x0 is hardwired zero and is never saved |
| Instruction pointer | PC | Saved via the ra (x1) convention in the switch stub; the stub stores the resume label in ra before saving and jumps via ret on restore |
| Thread pointer | x4 (tp) | Holds the per-CPU CpuLocalBlock pointer when in kernel mode (swapped with sscratch on trap entry); userspace tp is preserved in the task's saved register frame |
Extended state (lazy):
Floating-point and vector state laziness is implemented using the FS and VS
fields in the sstatus CSR. Hardware sets FS/VS to Dirty whenever the
corresponding register set is written; UmkaOS checks this flag at context switch
time rather than maintaining a separate software bitmap.
| Component | Registers | Save size | sstatus field | Trigger |
|---|---|---|---|---|
| F extension (single) | f0–f31, fcsr | 132 bytes | FS | Any F-extension instruction when FS = Off → traps; no trap when FS = Initial or Dirty |
| D extension (double) | f0–f31 (64-bit view), fcsr | 260 bytes | FS (shared with F) | Any D-extension instruction |
| V extension (vector) | v0–v31 (variable length), vcsr, vl, vtype, vstart | VLEN-dependent (see Section 7.1.6) | VS | Any V-extension instruction when VS = Off → illegal-instruction trap |
The F and D extensions share the same register file and FS field; D is a
superset of F. If both are present, UmkaOS always saves 64-bit doubles. The
presence of F, D, and V extensions is determined from the misa CSR and from
the ISA string in the device tree or SBI firmware at boot.
Context switch policy for float/vector:
- If FS = Off: no FP registers are saved (zero cost).
- If FS = Initial: registers are in reset state; skip save, but restore from
a canonical all-zeros area on next thread's restore (or leave as Initial).
- If FS = Dirty: save all f0–f31 and fcsr. After save, set FS = Clean.
- If VS = Off: no vector registers are saved.
- If VS = Dirty: save all v0–v31 plus vcsr/vl/vtype/vstart. After save,
set VS = Clean.
On heterogeneous RISC-V systems where different harts have different VLEN, a
thread's VLEN is fixed at the VLEN of the hart that first executed a vector
instruction. The scheduler then constrains that thread to harts with matching
VLEN (see VectorLengthInfo in Section 7.1.5).
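The FS half of that policy is a small state machine. A sketch (the enum and helper names are hypothetical, not spec identifiers):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum FsState { Off, Initial, Clean, Dirty }

/// Decide whether the outgoing thread's FP registers must be saved, and
/// what FS value to record afterwards, per the policy above.
fn fp_switch_out(fs: FsState) -> (bool, FsState) {
    match fs {
        FsState::Dirty => (true, FsState::Clean), // save f0-f31 + fcsr, mark Clean
        other => (false, other),                  // Off/Initial/Clean: zero cost
    }
}
```

The VS field follows the same shape with `v0–v31` plus `vcsr`/`vl`/`vtype`/`vstart` in place of the FP registers.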
Debug registers (lazy):
The RISC-V debug trigger module provides tselect, tdata1, tdata2, and
optionally tdata3 CSRs for configuring hardware breakpoints and watchpoints.
The number of implemented triggers is determined at boot by iterating tselect
until it wraps. Saved only when debug_active is set.
PPC32:
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | R0–R31 | 32 integer registers |
| Special-purpose | LR, CTR, XER, CR | Link register, count register, integer exception register, condition register |
| Instruction pointer | SRR0 | Machine state save/restore register 0 holds the saved PC (restored via rfi) |
| Machine state | SRR1 | Machine state save/restore register 1 holds the saved MSR (restored via rfi) |
User TLS is managed by convention: the SVR4 ABI designates R13 as the small-data area pointer and R2 as the thread pointer (TLS base); both are part of the general GPR save above. The kernel per-CPU pointer lives in SPRG3 and is not per-task.
Extended state (lazy):
FPU state is controlled by MSR.FP. With MSR.FP = 0, any floating-point
instruction from any privilege level causes a floating-point unavailable exception.
The handler allocates the save area, sets MSR.FP = 1, and returns.
| Component | Registers | Save size | Enable bit | Trigger |
|---|---|---|---|---|
| FPR | FPR0–FPR31 (64-bit each), FPSCR | 264 bytes | MSR.FP | Any FP instruction when MSR.FP = 0 |
AltiVec/VMX is not universally present on PPC32 targets supported by UmkaOS (primarily embedded e500/e500mc class cores). On embedded PPC32 cores that do implement SPE (Signal Processing Engine) floating-point, the SPE save area (32 × 32-bit upper halves of the 64-bit SPE GPRs, plus SPEFSCR) replaces the classical FPR block. Presence of SPE is detected from the PVR (Processor Version Register) at boot.
Debug registers (lazy):
DBCR0, DBCR1, DAC1, DAC2 (data address compare), IAC1, IAC2 (instruction
address compare). Count and capability are read from DBCR0 at boot. Saved only
when debug_active is set.
PPC64LE:
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | R0–R31 | 32 integer registers |
| Special-purpose | LR, CTR, XER, CR, DSCR, AMR | Link, count, exception, condition, data stream control, authority mask registers |
| Instruction pointer | SRR0 / HSRR0 | SRR0 for normal exceptions; HSRR0 for hypervisor exceptions (used in KVM context) |
| Machine state | SRR1 / HSRR1 | Saved MSR (restored via rfid / hrfid) |
The AMR (Authority Mask Register) implements a hardware equivalent to memory
protection keys on POWER9+ in Radix mode; it is always saved to preserve per-task
memory domain state. DSCR controls the hardware prefetch engine and is saved
to avoid polluting one task's prefetch hints into another.
User TLS follows the ELFv2 ABI: R13 holds the thread pointer in user mode. In kernel mode, R13 holds the PACA pointer (per-CPU area base). The kernel saves/restores the userspace R13 on kernel entry/exit. R13 is part of the general GPR save.
Extended state (lazy):
Three overlapping extended state components, each controlled by a separate
MSR bit:
| Component | Registers | Save size | MSR bit | Trigger |
|---|---|---|---|---|
| FPR | FPR0–FPR31 (64-bit each), FPSCR | 264 bytes | MSR.FP | Any FP instruction when MSR.FP = 0 |
| VMX/AltiVec | VR0–VR31 (128-bit each), VRSAVE, VSCR | 528 bytes | MSR.VEC | Any VMX instruction when MSR.VEC = 0 |
| VSX | VS0–VS63 (the VSX register file overlays FPR0–31 and VR0–31) | Covered by FPR + VMX saves | MSR.VSX | Any VSX instruction when MSR.VSX = 0 |
The VSX register file (VS0–VS63) is not an additional 64 independent registers: VS0–VS31 are the same physical registers as FPR0–FPR31 (double-precision view), and VS32–VS63 are the same physical registers as VR0–VR31. Saving FPR and VMX captures the complete VSX state; there is no additional VSX-specific save area.
All three components are saved independently and lazily: a task that uses FPR but
not VMX pays only the 264-byte FPR save cost. MSR.VSX enables the xvmaddadp
class instructions that cross the FPR/VMX boundary; it requires both MSR.FP and
MSR.VEC to be set first.
Debug registers (lazy):
DAWR0 and DAWRX0 (data address watchpoint register, introduced POWER9) for
hardware memory watchpoints. Hardware instruction breakpoints via CIABR
(Completed Instruction Address Breakpoint Register). Saved only when
debug_active is set.
s390x:
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | R0–R15 | 16 general registers (64-bit each) |
| Program Status Word | PSW (instruction address + condition code + system mask) | Saved/restored via LPSWE/EPSW; encodes PC, addressing mode, condition code, interrupt masks, and DAT mode |
| Control registers | CR0–CR15 | 16 control registers governing interrupts, address-space control, tracing, clock comparator, and PER (Program Event Recording). CR1 holds the primary ASCE (Address Space Control Element, the page table root). CR7 holds the secondary ASCE. CR13 holds the home ASCE |
| Access registers | AR0–AR15 | 16 access registers (32-bit each); used for secondary-space addressing (AR mode). Each AR selects which address space (primary, secondary, or home) a corresponding GPR-based address refers to |
| Thread pointer | AR0 (by convention) | glibc on s390x uses AR0 + ALET for TLS base addressing; the kernel saves AR0–AR15 as part of the access register block |
s390x context switching uses the STMG/LMG instructions to save/restore GPRs
in a single instruction pair (store multiple / load multiple). The PSW is not
directly readable as a single register — the current PSW is captured by taking an
interrupt (supervisor call) or by using EPSW (Extract PSW). On context switch,
the kernel stores the interrupted PSW (from the interrupt-old PSW area in the
lowcore) into the task's saved state.
Extended state (lazy):
FP and vector state is controlled by control register bits. CR0 bits 56–63 control the AFP (Additional Floating Point) register facility. The VX (Vector Extension) facility, when present, extends the FP registers to 128-bit vector registers.
| Component | Registers | Save size | Enable mechanism | Trigger |
|---|---|---|---|---|
| FP (BFP/HFP) | FPR0–FPR15 (64-bit each), FPC (FP control register) | 132 bytes | CR0 AFP bit | Any FP instruction when AFP is disabled causes a data exception |
| Vector (VX facility) | V0–V31 (128-bit each; V0–V15 overlay FPR0–FPR15) | 512 bytes total (V0–V15: 256 bytes, including the FPR values in their low halves, plus V16–V31: 256 bytes) | CR0 VX enable bit | Any vector instruction when VX is disabled causes a vector-processing exception |
When the VX facility is present, V0–V15 are the 128-bit extensions of FPR0–FPR15 (the low 64 bits are the classical FPR values). Saving V0–V31 captures all FP state. When VX is not present, only the 16 classical 64-bit FPRs are saved. VX facility presence is detected from STFLE (Store Facility List Extended) bit 129 at boot.
Context switch policy (save and restore are symmetric — both conditional on prior use):
- If the thread has never used FP: CR0 AFP bit is cleared, no FP state is saved or restored (zero cost).
- If the thread uses FP but not VX: save FPR0–FPR15 + FPC (132 bytes) via STD/STFPC;
restore via LD/LFPC. CR0 AFP bit is set for the incoming thread.
- If the thread uses VX: save V0–V31 via VSTM (vector store multiple), which captures
all FP + vector state in one operation; restore via VLM. CR0 VX enable bit is set.
Cost: ~40–80 ns depending on how many registers are dirty.
- On return to userspace: the kernel restores the incoming thread's FP/VX state only if
that thread previously used FP/VX (lazy trap-on-first-use). There is no unconditional
restore on the user-return path — the CR0 AFP/VX bits remain cleared if the thread
has never used FP, causing a data exception on first FP use which triggers allocation.
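The software-conditional decision reduces to a three-way plan. A sketch, with hypothetical enum and function names:

```rust
#[derive(PartialEq, Debug)]
enum S390FpPlan { None, FprOnly, FullVector }

/// s390x has no trap-on-first-use, so the plan is driven by per-task
/// usage flags, per the policy above. Returns the plan and save size.
fn s390_fp_plan(used_fp: bool, used_vx: bool) -> (S390FpPlan, usize) {
    match (used_fp, used_vx) {
        (false, false) => (S390FpPlan::None, 0),          // CR0 AFP clear: zero cost
        (true, false) => (S390FpPlan::FprOnly, 132),      // FPR0-15 + FPC via STD/STFPC
        (_, true) => (S390FpPlan::FullVector, 512),       // V0-V31 via VSTM
    }
}
```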
Debug registers (lazy):
PER (Program Event Recording) registers: CR9 (PER event mask), CR10 (PER starting address),
CR11 (PER ending address). PER provides instruction-address and storage-alteration tracing.
Saved only when debug_active is set.
LoongArch64:
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | R0–R31 | 32 integer registers; R0 is hardwired zero and is never saved. R1 (RA) holds the return address. R3 (SP) is the stack pointer |
| CSR.PRMD | Previous mode register | Saves the privilege level (PLV), interrupt enable (PIE), and watchpoint enable (PWE) state from before the exception. Restored on ertn (exception return) |
| CSR.ERA | Exception Return Address | Holds the PC to return to after exception handling |
| Thread pointer | R2 (TP) | User-mode thread pointer register; holds glibc TLS base |
LoongArch64 uses CSR.PRMD (not a general SPSR equivalent) to record the pre-exception processor state. On context switch, the kernel saves CSR.PRMD and CSR.ERA into the task's saved state. The kernel per-CPU pointer is stored in CSR.KS0 (scratch register 0) and is not per-task.
Extended state (lazy):
FP/SIMD state is controlled by CSR.EUEN (Extended Unit Enable register). Individual bits in CSR.EUEN control access to the FPU, LSX (Loongson SIMD Extension, 128-bit), and LASX (Loongson Advanced SIMD Extension, 256-bit).
| Component | Registers | Save size | CSR.EUEN bit | Trigger |
|---|---|---|---|---|
| FPU | F0–F31 (64-bit each), FCSR0 (FP control/status) | 260 bytes | Bit 0 (FPE) | Any FP instruction when FPE=0 causes a floating-point disabled exception |
| LSX (128-bit SIMD) | VR0–VR31 (128-bit each; overlay F0–F31) | 512 bytes (VR0–VR31 in full; the low 64 bits are the F registers) | Bit 1 (SXE) | Any LSX instruction when SXE=0 causes an LSX disabled exception |
| LASX (256-bit SIMD) | XR0–XR31 (256-bit each; overlay VR0–VR31) | 1024 bytes (XR0–XR31 in full; the low 128 bits are the VR registers) | Bit 2 (ASXE) | Any LASX instruction when ASXE=0 causes a LASX disabled exception |
The register files are hierarchically overlaid: XR0–XR31 (256-bit) contain VR0–VR31 (128-bit) which contain F0–F31 (64-bit). Saving the widest enabled component captures all narrower state. When LASX is enabled, saving XR0–XR31 captures LSX and FP state.
Context switch policy:
- If CSR.EUEN.FPE = 0: no FP/SIMD state is saved (zero cost).
- If FPE = 1 but SXE = 0: save F0–F31 + FCSR0 (260 bytes) via fst.d.
- If SXE = 1 but ASXE = 0: save VR0–VR31 via vst (512 bytes captures FP state too).
- If ASXE = 1: save XR0–XR31 via xvst (1024 bytes captures all FP + LSX state).
LSX and LASX presence is detected from CPUCFG register 2 (bits 6 and 7) at boot.
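The hierarchical widest-wins rule above can be sketched as:

```rust
/// Save size for the widest enabled LoongArch64 extended unit; wider
/// units subsume the narrower overlaid register files.
fn loongarch_ext_save_bytes(fpe: bool, sxe: bool, asxe: bool) -> usize {
    if asxe {
        1024 // XR0-XR31 via xvst: captures LSX and FP state too
    } else if sxe {
        512  // VR0-VR31 via vst: captures FP state too
    } else if fpe {
        260  // F0-F31 + FCSR0 via fst.d
    } else {
        0    // no FP/SIMD use: zero cost
    }
}
```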
Debug registers (lazy):
LoongArch64 provides hardware breakpoint and watchpoint registers via CSR.DB0ADDR–
CSR.DB7ADDR (data breakpoint addresses) and CSR.IB0ADDR–CSR.IB7ADDR (instruction
breakpoint addresses), with corresponding control registers CSR.DB0CTL–CSR.DB7CTL
and CSR.IB0CTL–CSR.IB7CTL. Up to 8 data breakpoints and 8 instruction breakpoints
are supported. The actual count is read from CPUCFG register 6 at boot. Saved only
when debug_active is set.
Lazy FP/SIMD save — unified policy statement:
UmkaOS uses lazy or conditional FP/SIMD save on all supported architectures. Extended state is saved and restored only for threads that have actually used the corresponding unit. On architectures with hardware trap-on-first-use (x86-64, AArch64, ARMv7, RISC-V, PPC32, PPC64LE, LoongArch64), the mechanism is hardware-lazy (trap on first use). On s390x, the mechanism is software-conditional (per-task usage flags, no trap). The mechanism by which laziness is enforced is architecture-specific:
| Architecture | Lazy FP mechanism | Lazy vector mechanism |
|---|---|---|
| x86-64 | CR0.TS=1 causes #NM on first FP/SSE use; XSAVES saves only dirty XSAVE components | Same XSAVE component mask; AVX/AVX-512/AMX each have independent bits |
| AArch64 | CPACR_EL1.FPEN=0b00 causes trap on first NEON/FP use | CPACR_EL1.ZEN=0b00 causes #UND on first SVE use; SMCR_EL1.ENA=0 traps SMSTART |
| ARMv7 | FPEXC.EN=0 causes undefined-instruction trap on first VFP/NEON use | N/A (no vector extension beyond NEON) |
| RISC-V | sstatus.FS=Off causes illegal-instruction trap; hardware sets FS=Dirty on write | sstatus.VS=Off causes illegal-instruction trap; hardware sets VS=Dirty on write |
| PPC32 | MSR.FP=0 causes floating-point unavailable exception | N/A (SPE if present uses same exception mechanism) |
| PPC64LE | MSR.FP=0 causes FP unavailable; MSR.VEC=0 causes VMX unavailable | MSR.VSX=0 causes VSX unavailable; requires FP+VEC first |
| s390x | Software-tracked conditional. z/Architecture has no trap-on-first-use for FP/VX. UmkaOS tracks per-task FP/VX usage via software flags (set when FP state is first dirtied during context switch inspection). If a task has never used FP: CR0 AFP bit is cleared, no FP state is saved or restored (zero cost). If a task has used FP: save/restore only the register ranges actually used (FP only, or FP+VX). This is not hardware-lazy (no trap instruction), but achieves the same outcome: tasks that never touch FP pay zero context-switch cost. | Same — software-conditional; per-task VX usage flag. |
| LoongArch64 | CSR.EUEN.FPE=0 causes FP disabled exception on first FP use | CSR.EUEN.SXE=0 causes LSX disabled exception; CSR.EUEN.ASXE=0 causes LASX disabled exception |
A task that never touches FP or SIMD registers pays zero extended-state save cost on every context switch across all architectures.
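One possible unification of these mechanisms, as a hypothetical trait the switch path could call (an illustration, not part of the spec):

```rust
/// Hypothetical per-architecture hook pair. Each backend would derive
/// `dirty` from its hardware flag (XSTATE_BV, sstatus.FS/VS, MSR bits,
/// CSR.EUEN) or, on s390x, from software usage flags.
trait LazyExtState {
    /// True if the outgoing task dirtied extended state since last save.
    fn dirty(&self) -> bool;
    /// Save the extended state if dirty; returns bytes written.
    fn save(&mut self) -> usize;
}

/// Software-flag model (the s390x-style conditional fallback).
struct SoftFlagState {
    used: bool,
    area_bytes: usize,
}

impl LazyExtState for SoftFlagState {
    fn dirty(&self) -> bool {
        self.used
    }
    fn save(&mut self) -> usize {
        if self.used { self.area_bytes } else { 0 }
    }
}
```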
7.3.3 Post-Context-Switch Cleanup (finish_task_switch)¶
After the hardware context switch completes and the new task begins executing,
finish_task_switch() performs the handshake that completes the switch: releasing
the runqueue lock, re-enabling preemption, and freeing the previous task's
resources if it was exiting. This is the first code the new task executes after
being switched to.
Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules.
&self methods use interior mutability for mutation. Atomic fields use .store()/.load(). See CLAUDE.md Spec Pseudocode Quality Gates.
Call chain: schedule() acquires the runqueue lock, calls context_switch()
(which performs the hardware switch), and the first code the new task executes is
finish_task_switch(prev). The prev parameter is passed through the hardware
switch (saved in a callee-saved register or on the kernel stack before the switch,
restored after).
/// Post-context-switch cleanup. Called by the scheduler after switching
/// from `prev` to the current task. Runs in the context of the NEW task
/// (the one that was switched TO).
///
/// This function completes the context switch handshake: it releases the
/// runqueue lock that was held during the switch, re-enables preemption,
/// and handles cleanup of the previous task if it was exiting.
///
/// # Arguments
/// - `prev`: The task that was switched away from. Its register state was
/// saved by `context_switch()` before the hardware switch.
///
/// # Preconditions
/// - The caller holds the local CPU's runqueue lock (acquired by `schedule()`
/// before `context_switch()`).
/// - Preemption is disabled (disabled by `schedule()` before the switch).
/// - The current task is `next` from the preceding `context_switch(prev, next)`.
///
/// # Postconditions
/// - The runqueue lock is released.
/// - Preemption is re-enabled.
/// - If `prev` was TASK_DEAD, its kernel stack and task struct are freed.
fn finish_task_switch(prev: &Task) {
let rq = this_rq();
// Step 1: Complete the runqueue lock handshake.
// The rq lock was acquired by schedule() before context_switch().
// It is released here, after the switch, by the NEW task.
// This ensures the rq state is consistent across the switch boundary:
// no other CPU can observe a half-switched state (prev still on rq
// but next already running).
//
// SAFETY: The lock was acquired by schedule() and is guaranteed held
// at this point. The hardware switch preserves the lock state because
// the lock is stored in the per-CPU runqueue struct, not on the stack.
unsafe { rq.lock.unlock() };
// Step 2: Re-enable preemption.
// Preemption was disabled by schedule() before the switch. Re-enabling
// it here allows the new task to be preempted by higher-priority tasks
// or IRQs. This MUST happen after the rq lock is released (Step 1) —
// holding a spinlock with preemption enabled is a deadlock risk.
preempt_enable();
// Step 3: Check if the previous task is dead (TASK_DEAD state).
// TASK_DEAD means the task has been fully reaped (past zombie, parent
// called waitpid, release_task() set TASK_DEAD). The task cannot free
// its own kernel stack because it was still executing on it during the
// context_switch(). The NEXT task (us) frees it here, safely.
//
// Note: ZOMBIE tasks are NOT freed here. A zombie retains its Task
// struct and kernel stack until the parent calls waitpid(), which
// invokes release_task() → transitions ZOMBIE to DEAD → the NEXT
// schedule() on that CPU's finish_task_switch() frees the stack.
let prev_state = prev.state.load(Acquire);
if prev_state == TaskState::DEAD.bits() {
// Free the previous task's kernel stack.
// The stack was allocated from the kernel stack slab
// ([Section 4.3](04-memory.md#slab-allocator)) at fork()/clone() time. The slab
// allocator returns the stack pages to the per-CPU magazine
// (hot path, no global lock).
//
// SAFETY: prev is TASK_DEAD and will never be scheduled again.
// No other CPU can reference prev's stack because:
// (a) prev was removed from the runqueue in do_exit() Step 14,
// (b) prev's state is DEAD (Release store in release_task()
// pairs with our Acquire load above), and
// (c) the rq lock that we just released in Step 1 serialized
// the switch — no CPU can be in the middle of switching TO prev.
unsafe {
free_kernel_stack(prev.stack_base, prev.stack_size);
}
// Drop the task struct reference. If this is the last Arc reference
// (typical for TASK_DEAD — the parent's waitpid already dropped its
// reference in release_task()), the Task struct is freed via
// Arc::drop, returning memory to the task slab cache.
//
// PID slot release: the PID was already returned to the PID
// allocator in release_task() (called by the parent's waitpid).
// UID task count was also decremented there.
unsafe { drop(Arc::from_raw(prev as *const Task)) };
}
// Step 4: Fire scheduler notifiers.
// These are lightweight callbacks registered by subsystems that need
// to act on every context switch IN event (from the new task's
// perspective):
//
// - KVM: If the new task is a vCPU thread, re-enter the guest
// (via kvm_sched_in() → vmresume/vmlaunch).
// - Perf: perf_schedule_in() was already called in context_switch()
// step 7 — no additional action here.
// - Cgroup: update per-CPU cgroup tracking for the new task.
//
// The notifier list is a per-CPU ArrayVec (bounded, no allocation).
// Each callback is expected to complete in < 100ns.
for notifier in CpuLocal::get().sched_in_notifiers.iter() {
notifier.on_sched_in();
}
}
Why the rq lock is released by the NEW task, not the OLD task: The rq lock
must be held across the entire context switch to prevent another CPU from observing
an inconsistent state (e.g., the old task still on the rq while the new task is
already running). The old task cannot release the lock because it stops executing
at the hardware switch point. The new task is the first to execute after the switch
and is responsible for releasing the lock. This is the same design as Linux
(finish_task_switch() in kernel/sched/core.c).
Kernel stack deferred free: A task cannot free its own kernel stack — it is still
executing on that stack at the time of schedule(). Linux solves this with the same
finish_task_switch() pattern: the NEXT task that runs on the CPU frees the dead
task's stack. In UmkaOS, the stack was allocated from the kernel stack slab cache
(Section 4.3), so freeing it returns the pages to the per-CPU magazine with
no global lock (hot path, ~10ns).
Architecture notes: finish_task_switch() is architecture-neutral. The
architecture-specific portion of the context switch ends at the hardware switch
point (step 6 in the context switch procedure above). All post-switch cleanup is
generic code. The prev pointer is passed through the switch via an
architecture-specific mechanism:
| Architecture | prev passing mechanism |
|---|---|
| x86-64 | Callee-saved register (r12 or rbx) preserved across __switch_to() |
| AArch64 | Callee-saved register (x19) preserved across cpu_switch_to() |
| ARMv7 | Callee-saved register (r4) preserved across __switch_to() |
| RISC-V | Callee-saved register (s0) preserved across __switch_to() |
| PPC32 | Callee-saved register (r14) preserved across _switch() |
| PPC64LE | Callee-saved register (r14) preserved across _switch() |
| s390x | Callee-saved register (r6) preserved across __switch_to() |
| LoongArch64 | Callee-saved register ($s0/r23) preserved across __switch_to() |
7.3.4 CPU Hotplug Integration¶
CPU hotplug (CPUs going offline/online at runtime) must be handled by the scheduler to migrate tasks and maintain invariants. UmkaOS supports full CPU hotplug on all architectures.
Offline sequence (CPU N going offline):
1. Mark CPU N as draining: set runqueue[N].state = DRAINING.
New tasks are no longer scheduled onto CPU N (load balancer skips it).
2. Migrate tasks from runqueue[N]:
For each runnable task in runqueue[N] (EEVDF tree + RT queues + DL queues):
a. Select migration target: prefer same-NUMA-node CPU with lowest load
(EAS-aware, Section 7.1.5).
b. Dequeue from runqueue[N], set task.cpu = target_cpu,
enqueue on runqueue[target_cpu].
c. If a task is currently running on CPU N: wait for it to yield
(it will find DRAINING state on next preemption point and yield).
3. Drain RCU quiescent state:
Call rcu_barrier() to process all pending RCU callbacks that reference
CPU N's per-CPU data. CPU N reports a final quiescent state.
4. Drain per-CPU slab magazines:
Flush CPU N's per-CPU slab magazines to their per-NUMA partial lists
(returns cached pages to the system).
5. Drain per-CPU writeback queue:
Flush any pending writeback work on CPU N.
6. Mark CPU N offline:
clear_bit(N, cpu_online_mask).
CPU N executes arch::current::cpu::park() (HLT loop on x86-64,
WFI in low-power state on AArch64, pause loop on RISC-V).
7. Notify subsystems:
Fire cpu_hotplug_notifier(OFFLINE, N) to allow subsystems (networking,
RCU, scheduler) to clean up CPU-N-specific state.
Online sequence (CPU N coming online):
1. Architecture-specific bring-up:
x86-64: INIT-SIPI-SIPI sequence via LAPIC.
AArch64: PSCI CPU_ON call.
RISC-V: SBI HSM hart_start call.
2. Per-CPU data initialization:
CPU N's PerCpu data structures are NOT re-allocated (they were sized
at boot for all possible CPUs — see Section 3.1.3). Only state is reset:
- runqueue[N]: initialize as empty, state = ACTIVE.
- CpuLocal block for CPU N: zero-fill state fields.
- RCU: call `rcu_cpu_online(N)` — updates leaf node `online_mask`,
sets `rcu_percpu[N].gp_seq_local`, clears `CpuLocal::rcu_passed_quiesce`.
3. Mark CPU N online:
set_bit(N, cpu_online_mask).
4. Fire cpu_hotplug_notifier(ONLINE, N).
5. Load balancer picks up CPU N in the next balance interval and starts
migrating tasks to it.
Design note: UmkaOS's per-CPU arrays are sized at boot for num_possible_cpus()
(all CPUs that could ever be brought online, including hotplugged ones).
This means CPU offline/online is a pure state machine transition with no
memory allocation, matching the "no hardcoded MAX_CPUS" principle and
enabling sub-millisecond hotplug transitions.
7.4 Platform Power Management¶
Standards: ACPI 6.5 Section 1.3 (Power Management), Intel SDM Vol 3B Section 18.9 (RAPL MSRs), AMD PPR (Zen2+) Section 2.1.9 (RAPL), ARM Energy Model (Documentation/power/energy-model.rst), IPMI v2.0 Section 11.2 / DCMI v1.5. IP status: All interfaces are open standards or documented hardware interfaces. No proprietary implementations referenced.
7.4.1 Problem and Scope¶
Power management is a kernel responsibility — not because policy belongs in the kernel, but because the mechanisms that enforce policy require ring-0 privileges and sub-millisecond response latency:
Why ring-0 is required for power management mechanisms:
- RAPL MSR writes require ring-0 access. Intel and AMD RAPL power-limit registers (e.g., MSR_PKG_POWER_LIMIT at 0x610) are privileged MSRs. A WRMSR instruction executed from ring-3 causes a #GP(0) fault. There is no userspace API that provides equivalent direct hardware control; powercap sysfs writes go through the kernel driver.
- Thermal trip point response must be sub-millisecond. A thermal Critical trip point (typically 5–10 °C below the hardware PROCHOT shutdown temperature) requires an immediate forced poweroff. The kernel cannot wait for a userspace daemon to wake up, read a netlink event, and issue a shutdown ioctl — that path has unbounded latency. The kernel's thermal interrupt handler must act directly. Note: on SMI-mediated platforms (where IA32_MISC_ENABLE.FORCEOP enables firmware-first thermal handling), the kernel's thermal interrupt fires after the SMI handler completes (50–150 µs typical, up to 100 ms on slow BIOS implementations). The "sub-millisecond" target applies to the kernel's response time after receiving the interrupt, not to the end-to-end latency, which includes firmware processing.
- cgroup power accounting requires kernel-side energy counter integration. Attributing energy consumption to a cgroup requires reading RAPL energy counters at the same scheduler tick that records CPU time — these are indivisible from a correctness standpoint. A userspace poller cannot atomically correlate energy deltas to the task that was running.
- VM power budgets must be enforced even if the VM misbehaves. A guest OS cannot be trusted to self-limit its power consumption. The hypervisor (umka-kvm) must enforce power caps from outside the VM, using kernel-level RAPL and cgroup mechanisms.
Scope: This section covers mechanisms only:
- RAPL hardware interface abstraction (RaplInterface trait, Section 7.4)
- Thermal zone and trip-point framework (ThermalZone, Section 7.4)
- Powercap sysfs hierarchy (Section 7.4)
- cgroup power accounting and per-cgroup power limits (Section 7.4)
- VM watt-budget enforcement (Section 7.4)
- DCMI/IPMI rack-level power management (Section 7.4)
Policy — which power profile a user selects, when to throttle a VM for economic
reasons, how to balance performance against energy cost — is a userspace/orchestrator
concern. The kernel provides the enforcement hooks; daemons (e.g., tuned, power-profiles-daemon,
umka-kvm's scheduler) invoke them.
7.4.2 RAPL — Running Average Power Limit¶
7.4.2.1 Domain Taxonomy¶
RAPL partitions the platform into named power domains. Each domain has independent power-limit registers and energy-status counters:
| Domain | Scope | Availability |
|---|---|---|
| Pkg | Entire CPU socket including uncore (LLC, memory controller, PCIe root complex, integrated graphics on server SKUs) | Intel SNB+, AMD Zen2+ |
| Core (PP0) | CPU cores only (excluding uncore). Useful for isolating compute vs memory-bandwidth workloads. | Intel SNB+ |
| Uncore (PP1) | Integrated GPU / GT on Intel client SKUs. Not present on server SKUs (Xeon). | Intel client only |
| Dram | Memory controller and attached DIMMs. Separate power rail on server platforms. | Intel IVB-EP+, AMD Zen2+ server |
| Platform (PSYS) | Entire platform as measured from the charger/PSU side. Introduced on Intel Skylake+ client. Captures power not visible to PKG (PCH, NVMe, display). | Intel SKL+ client only |
The Core domain is always ≤ Pkg. Platform ≥ Pkg because it includes
peripheral power not counted by the socket energy counter.
7.4.2.2 MSR Interface (x86-64 / x86)¶
Intel RAPL is exposed via Model-Specific Registers readable/writable with
RDMSR/WRMSR from ring-0. The register layout is documented in Intel SDM Vol 3B Section 18.9.
Key registers for the Pkg domain (other domains follow the same pattern at
different base addresses):
| MSR Address | Name | Direction | Purpose |
|---|---|---|---|
| 0x610 | MSR_PKG_POWER_LIMIT | R/W | Set short-window and long-window power limits |
| 0x611 | MSR_PKG_ENERGY_STATUS | R | Read cumulative energy counter (32-bit; wraps at 2^32 energy units, roughly 65–260 kJ for typical units) |
| 0x613 | MSR_PKG_PERF_STATUS | R | Throttle duty cycle (fraction of time spent in power throttle) |
| 0x614 | MSR_PKG_POWER_INFO | R | Thermal Design Power (TDP), minimum, and maximum power |
MSR_PKG_POWER_LIMIT bit layout:
- Bits 14:0 — Long-window power limit (in hardware power units from MSR_RAPL_POWER_UNIT)
- Bit 15 — Enable long-window limit
- Bit 16 — Clamping enable (allow limit to go below TDP; requires CLAMPING_SUPPORT flag)
- Bits 23:17 — Long-window time window (tau_x; encoded per the SDM as 2^Y × (1 + Z/4) × base time unit, typically ≤ 28 s)
- Bits 30:24 — Reserved
- Bit 31 — Reserved
- Bits 46:32 — Short-window power limit
- Bit 47 — Enable short-window limit
- Bit 48 — Short-window clamping enable
- Bits 55:49 — Short-window time window (tau_y, ≤ 10 ms)
- Bits 62:56 — Reserved
- Bit 63 — Lock bit (locks the entire register until next RESET; kernel must not set this)
The short-window limit (tau_y ≤ 10 ms) is the primary mechanism for burst
suppression. The long-window limit (tau_x ≈ 28 s) enforces sustained average power.
Setting both gives a two-tier policy: allow short bursts up to short_limit_W for
up to 10 ms, but enforce long_limit_W on average.
Energy units are encoded in MSR_RAPL_POWER_UNIT (address 0x606). The driver
must read this at boot and convert all values accordingly.
7.4.2.3 AMD Equivalent¶
AMD Zen2 and later processors implement RAPL-compatible MSRs at the same addresses
(0x610, 0x611, 0x614) with the same bit layout. This allows the same MSR
driver to serve both Intel and AMD on Zen2+.
Older AMD processors (pre-Zen2) use the System Management Unit (SMU), a
co-processor accessible via PCI config space (bus 0, device 0, function 0, PCI
vendor/device ID varies by generation). The SMU interface is not publicly
documented; UmkaOS uses the same reverse-engineered interface as the Linux
amd_energy driver (kernel/drivers/hwmon/amd_energy.c).
The RaplInterface abstraction (Section 7.4) hides this difference from upper layers.
7.4.2.4 ARM and RISC-V Equivalents¶
ARM Energy Model (EM): ARM SoCs do not expose hardware energy counters equivalent
to RAPL. Instead, the ARM Energy Model framework provides estimated power consumption
based on empirically measured power coefficients per CPU frequency operating point
(OPP). Each OPP has a power_mW coefficient stored in the device tree
(operating-points-v2 table). The kernel integrates over active OPPs to estimate
energy. This is less accurate than RAPL but enables the same cgroup accounting
interface (Section 7.4).
RISC-V: There is no standardised RAPL equivalent in the RISC-V ISA or the SBI
specification as of SBI v2.0. Platform-specific power management is exposed via
vendor SBI extensions (e.g., T-HEAD/Alibaba extensions for their RISC-V SoCs).
UmkaOS implements a NoopRaplInterface for RISC-V that returns
PowerError::NotSupported for all limit-setting operations and provides zero energy
readings. Cgroup accounting falls back to CPU-time-weighted estimation.
7.4.2.5 Kernel Abstraction¶
All RAPL consumers (cgroup accounting, VM power budgets, thermal passive cooling,
DCMI enforcement) interact with power domains through the RaplInterface trait,
never touching MSRs directly:
/// The type of a RAPL power domain.
/// See also the comprehensive PowerDomainType at [Section 7.7](#power-budgeting--design-power-as-a-schedulable-resource) which extends
/// this to non-CPU domains (accelerators, NICs, storage).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[repr(u32)]
pub enum PowerDomainType {
/// Entire CPU socket including uncore (LLC, memory controller, PCIe root).
CpuPackage = 0,
/// CPU cores only (PP0). Excludes uncore.
CpuCore = 1,
/// Memory controller and attached DIMMs. Server platforms only.
Dram = 2,
/// GPU / accelerator.
Accelerator = 3,
/// NIC (if power-metered).
Nic = 4,
/// NVMe SSD (if power-metered).
Storage = 5,
/// Entire system (platform-level RAPL or BMC).
Platform = 6,
}
/// A single RAPL power domain with its hardware interface and energy accumulator.
///
/// This is the x86/RAPL-specific domain object used for the powercap sysfs hierarchy
/// and energy accounting (Section 7.2.4). The generic cross-architecture abstraction
/// is `GenericPowerDomain` defined in Section 7.4.2.
pub struct RaplDomain {
/// The type of power domain this represents.
pub domain_type: PowerDomainType,
/// The hardware driver implementing this domain's register interface.
pub hw_interface: Arc<dyn RaplInterface>,
/// Cumulative energy consumed by this domain in microjoules.
///
/// This is a **software accumulator** (u64), distinct from the hardware
/// energy counter. The hardware counter (Intel MSR `MSR_RAPL_POWER_UNIT`
/// + `MSR_PKG_ENERGY_STATUS`) is typically 32 bits and wraps roughly every
/// 300-1300 seconds at 200 W, depending on the energy unit
/// (wrap period = 2^32 * energy_unit / power).
/// The kernel power accounting thread (Section 7.2.5) polls the hardware
/// counter at `poll_interval = wrap_time / 2` and accumulates the delta
/// into this u64 field, which effectively never wraps (at 200 W sustained,
/// the u64 overflows after roughly 2,900 years).
///
/// **Implementer note**: The hardware poll loop must use `read_volatile`
/// for the MSR read and compute `delta = (new - old) & hw_mask` to
/// handle the 32-bit hardware wrap correctly.
///
/// Readers must still handle wrap-around by tracking deltas (in case
/// the software counter wraps at u64::MAX, which is astronomically
/// unlikely but must be handled for 50-year correctness).
pub energy_uj: AtomicU64,
/// Socket index (0-based) this domain belongs to.
pub socket_id: u32,
}
/// Hardware interface for reading and controlling a RAPL power domain.
///
/// Implementations exist for: Intel MSR (`IntelRaplMsr`), AMD MSR/SMU
/// (`AmdRaplInterface`), ARM Energy Model (`ArmEmInterface`), and no-op
/// (`NoopRaplInterface` for platforms without hardware support).
///
/// # Safety
///
/// Implementations that write MSRs must only do so from ring-0 kernel context.
/// MSR writes from interrupt context are permitted but must be idempotent and
/// must not acquire locks that could be held by non-interrupt code.
pub trait RaplInterface: Send + Sync {
/// Set a power limit on the given domain.
///
/// `limit_mw` is the power limit in milliwatts.
/// `window_ms` is the averaging window in milliseconds. Hardware may
/// round to the nearest supported window; callers must not assume exact values.
///
/// Returns `PowerError::NotSupported` if the domain or windowed limiting
/// is not available on this platform.
fn set_power_limit(
&self,
domain: PowerDomainType,
limit_mw: u32,
window_ms: u32,
) -> Result<(), PowerError>;
/// Remove a previously set power limit on the given domain, restoring
/// the hardware default (TDP-derived limit).
///
/// Returns `PowerError::NotSupported` if the domain is not available.
fn clear_power_limit(&self, domain: PowerDomainType) -> Result<(), PowerError>;
/// Read the cumulative energy consumed by the given domain in microjoules.
///
/// The counter wraps at `max_energy_range_uj()`. Callers must track
/// previous values and compute deltas to handle wrap-around correctly.
///
/// Returns `PowerError::NotSupported` if the domain is not available.
fn read_energy_uj(&self, domain: PowerDomainType) -> Result<u64, PowerError>;
/// Read the Thermal Design Power (TDP) of the given domain in milliwatts.
///
/// This is the sustained power level the platform is designed to dissipate.
/// It is used as the upper bound for VM admission control (Section 7.2.6).
///
/// Returns `PowerError::NotSupported` if TDP information is not available.
fn read_tdp_mw(&self, domain: PowerDomainType) -> Result<u32, PowerError>;
/// Return the maximum value of the energy counter before it wraps, in microjoules.
///
/// Callers use this to correctly handle wrap-around in `read_energy_uj`.
fn max_energy_range_uj(&self, domain: PowerDomainType) -> Result<u64, PowerError>;
}
/// Errors returned by `RaplInterface` operations.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PowerError {
/// The requested domain or operation is not supported on this platform.
NotSupported,
/// The requested power limit is below the hardware minimum or above the TDP.
OutOfRange { min_mw: u32, max_mw: u32 },
/// MSR or SMU access failed (hardware error or driver not initialised).
HardwareFault,
/// The domain's power limit register is locked until next RESET.
Locked,
}
The platform boot sequence probes for available RAPL domains (by attempting
RDMSR and checking for #GP) and registers each discovered domain with the
global PowerDomainRegistry. Upper layers iterate the registry rather than
hard-coding which domains exist.
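Consumers that poll `read_energy_uj` must apply the wrap-handling contract stated in the trait. A minimal sketch of the delta computation (hypothetical helper name; the real accounting thread is described in Section 7.2.5):

```rust
/// Wrap-safe delta between two successive reads of a domain's energy
/// counter, both in microjoules. `max_range_uj` is the value returned by
/// RaplInterface::max_energy_range_uj() for the domain. The poll interval
/// must be shorter than the wrap period so that at most one wrap can
/// occur between the two reads.
fn energy_delta_uj(prev_uj: u64, curr_uj: u64, max_range_uj: u64) -> u64 {
    if curr_uj >= prev_uj {
        curr_uj - prev_uj
    } else {
        // The counter wrapped exactly once between the reads.
        (max_range_uj - prev_uj) + curr_uj
    }
}
```

The accumulated delta is what gets added to `RaplDomain::energy_uj`, so upper layers only ever see a monotonic software counter.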
7.4.3 Per-Architecture Power Management Interfaces¶
The RaplInterface abstraction in Section 7.2.2 covers the common interface for
energy reading and power-limit setting. The hardware mechanisms that back that
interface differ substantially across architectures. This section specifies what
those mechanisms are so that the platform boot driver for each architecture knows
which registers, protocols, and firmware services to initialise.
All per-architecture power interfaces are accessed through the PlatformPowerOps
trait, which is registered at boot by each architecture's power driver (see
Section 7.2.2.5 for the RaplInterface trait that PlatformPowerOps builds on).
Upper layers — the cgroup power controller, the scheduler's Energy-Aware Scheduling
path (Section 7.1.5), and the FMA health subsystem (Chapter 20) — use this trait
exclusively. They never call architecture-specific MSRs, SCMI mailboxes, or SBI
extensions directly.
7.4.3.1 x86-64 (Intel and AMD)¶
Energy reporting:
Intel and AMD Zen2+ both expose energy via RAPL MSRs. The complete register map
and bit layout is specified in Section 7.2.2.2. Summary of the energy-status
registers used by the IntelRaplMsr and AmdRaplInterface implementations:
| MSR address | Domain | Availability |
|---|---|---|
| MSR_PKG_ENERGY_STATUS (0x611) | CPU socket (cores + uncore) | Intel SNB+; AMD Zen2+ |
| MSR_PP0_ENERGY_STATUS (0x639) | CPU cores only | Intel SNB+ |
| MSR_PP1_ENERGY_STATUS (0x641) | Integrated GPU (client only) | Intel client SKUs |
| MSR_DRAM_ENERGY_STATUS (0x619) | Memory controller + DIMMs | Intel IVB-EP+; AMD Zen2+ server |
| MSR_PLATFORM_ENERGY_STATUS (0x64D) | Entire platform (PSU side) | Intel SKL+ client only |
The energy unit is encoded in MSR_RAPL_POWER_UNIT (0x606); it must be read at
boot before any energy delta computation.
Frequency and voltage control:
- Intel P-states with HWP (Hardware-controlled Performance States): On Broadwell+ processors, HWP is enabled by writing bit 0 of IA32_PM_ENABLE (MSR 0x770). The scheduler then controls per-CPU performance hints via IA32_HWP_REQUEST (MSR 0x774), which encodes minimum performance, maximum performance, desired performance, and energy-performance preference (EPP) in a single 64-bit write. UmkaOS uses HWP when available in preference to legacy ACPI P-state (_PSS) switching.
- AMD P-states (CPPC): On Zen2+, frequency scaling uses the Collaborative Processor Performance Control (CPPC) interface exposed through ACPI CPPC objects or directly via MSR_AMD_CPPC_REQ (0xC00102B3). The desired_perf field in CPPC maps to a CPU frequency in the same role as HWP's desired performance field.
- Legacy ACPI _PSS: On older hardware without HWP or CPPC, UmkaOS falls back to ACPI P-state switching via the _PSS/_PPC/_PCT methods, which the ACPI driver evaluates and translates into MSR writes (e.g., IA32_PERF_CTL at 0x199).
Power caps and TDP:
- MSR_PKG_POWER_LIMIT (0x610): dual-window power limit (see Section 7.2.2.2 for the full bit layout). UmkaOS sets this register to enforce VM power budgets (Section 7.2.6) and rack-level DCMI caps (Section 7.2.7).
- MSR_PKG_POWER_INFO (0x614): read-only; provides TDP, minimum, and maximum power. The TDP value is used as the default admission-control ceiling for VM watt-budget enforcement.
- MSR_PKG_POWER_LIMIT lock bit (bit 63 of 0x610): UmkaOS never sets this bit. Setting it prevents further limit changes until the next platform RESET.
Thermal:
- IA32_THERM_STATUS (MSR 0x19C): per-core thermal status register. Bit 0 is the PROCHOT log (set when the core has throttled due to heat). Bits 22:16 encode the "thermal margin" (degrees Celsius below TjMax). UmkaOS reads this MSR periodically in the thermal polling loop (Section 7.2.3.5) and maps it to a ThermalZone temperature reading.
- IA32_PACKAGE_THERM_STATUS (MSR 0x1B1): package-level equivalent of the per-core thermal status; includes the PROCHOT log and package thermal margin.
- DTS (Digital Thermal Sensor): the thermal margin field combined with TjMax (read from MSR_TEMPERATURE_TARGET, address 0x1A2, bits 23:16) gives the absolute die temperature: T_die = TjMax - thermal_margin.
- ACPI _TSS/_TPC throttling methods remain as a fallback on platforms where DTS MSRs are not accessible from the OS.
Runtime device power management:
Device power state transitions (D0 to D3, and back) on x86-64 are driven by ACPI
_PS0/_PS3 control methods evaluated by the ACPI interpreter in umka-kernel.
The ACPI runtime PM path is common to all ACPI platforms (x86-64 and ACPI-based
AArch64 servers); it is not x86-specific beyond the x86 ACPI initialisation
sequence.
7.4.3.1.1 Idle State Management and C-State Restrictions (x86-64)¶
The cpuidle governor selects idle states from a per-model table. Some CPU models have known C-state errata that require restricting the available idle states:
| Errata Flag | CPUs | Restriction | Rationale |
|---|---|---|---|
| X86Errata::BAYTRAIL_CSTATE | Bay Trail / Cherry Trail (Atom Z3xxx, x5-Z8xxx) | Block C6 and deeper states | C6 freeze: CPU fails to wake from C6 on certain steppings, requiring platform reset |
| X86Errata::TSC_C3STOP | Pre-Nehalem Intel, some AMD | tsc_reliable = false | TSC stops in C3+ states, making it unusable as a clocksource when deep idle is allowed |
| X86Errata::HPET_PC10 | Coffee Lake, Ice Lake, Bay Trail | HPET unreliable in PC10 | HPET counter stops or glitches in package C10 |
| X86Errata::MWAIT_BROKEN | Apollo Lake, Ice Lake-X, Lunar Lake | IPI fallback for idle wakeup | MWAIT fails to wake on interrupt, requiring backup IPI delivery |
AMX tile state and deep C-states (Sapphire Rapids+): Before entering any C-state
deeper than C1, the kernel must execute TILERELEASE to release AMX tile data. The
hardware does not automatically save tile state on C-state entry (unlike FP/SSE/AVX
state, which is preserved across all C-states). Entering C6 with active tiles causes
silent data corruption of the tile registers on wake. The cpuidle enter path checks
CpuFeatureSet.xstate_used & XFEATURE_XTILEDATA and calls TILERELEASE when tiles
are active. On C-state exit, the lazy AMX fault (#NM from XFD) re-initializes tiles
on first use.
Multi-socket TSC desynchronization: On multi-socket systems, the TSCs of different
sockets may drift relative to each other (typically ~1-10 ppm, accumulating to
microseconds over hours). UmkaOS maintains per-socket TSC offset values in a
tsc_socket_offset[MAX_SOCKETS] array, calibrated during SMP bringup by measuring
round-trip IPI latency between sockets. The scheduler's idle duration estimation
and the clocksource watchdog (Section 7.6) use socket-local TSC
values adjusted by these offsets. If drift exceeds a threshold (>10μs divergence from
the reference socket), the kernel switches the system clocksource from TSC to HPET or
ACPI PM timer and logs a warning.
7.4.3.2 AArch64 (ARM Servers: Graviton, Neoverse, Ampere)¶
AArch64 server platforms do not expose RAPL-equivalent MSRs. Energy reporting and frequency control are provided by a combination of hardware activity counters and a firmware-mediated control channel (SCMI).
Energy reporting:
The Activity Monitor Unit (AMU, FEAT_AMU, introduced in Armv8.4) provides a set
of per-core hardware event counters accessible from EL1 via the AMEVCNTR0_EL0
and AMEVTYPER0_EL0 register families. Architecturally defined group-0 counters
include:
| Counter index | Event | Use |
|---|---|---|
| 0 | CPU cycles | Total core cycles consumed |
| 1 | Instructions retired | IPC computation |
| 2 | Memory stall cycles | DRAM latency pressure indicator |
| 3 | L3 cache miss stall cycles | LLC pressure (Neoverse V1/V2, Graviton3+) |
AMU counters provide activity data, not joules. To derive energy, UmkaOS integrates
AMU cycle counts against the per-OPP power coefficients from the ARM Energy Model
(stored in the device tree operating-points-v2 table as opp-microwatt values).
This gives an estimated energy per task, analogous to the delta-integration used on
RAPL platforms, but with lower accuracy (±10–20% typical).
AMU is present on Neoverse V1, Neoverse V2, Neoverse N2, Cortex-A78, Cortex-X1, and later cores. On older cores (Neoverse N1, Graviton2), the ARM Energy Model estimation falls back to CPU-time-weighted power at the current OPP frequency, with no per-core AMU data.
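The delta-integration described above reduces to a small computation per accounting tick. The `Opp` type and `amu_energy_uj` helper below are illustrative, assuming the opp-microwatt coefficient from the device tree's operating-points-v2 table:

```rust
/// One operating point from the operating-points-v2 table. Field names
/// are illustrative; `power_uw` corresponds to the opp-microwatt value.
struct Opp {
    freq_hz: u64,
    power_uw: u64,
}

/// Estimated energy (µJ) for `delta_cycles` AMU cycle-counter ticks spent
/// at `opp`: time = cycles / freq, energy = power * time.
/// The intermediate product is widened to u128 to avoid overflow on
/// large cycle deltas.
fn amu_energy_uj(delta_cycles: u64, opp: &Opp) -> u64 {
    (opp.power_uw as u128 * delta_cycles as u128 / opp.freq_hz as u128) as u64
}
```

Per-task attribution then follows the same pattern as RAPL: read the AMU counter at context switch, compute the delta, and charge the estimated microjoules to the outgoing task's cgroup.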
Frequency and voltage control — SCMI:
On AArch64 server platforms (Graviton, Ampere Altra/Altra Max, Neoverse RD series), CPU frequency and power-domain gating are controlled via the System Control and Management Interface (SCMI, ARM specification DEN0056, currently version 3.2). SCMI runs between the OS and a dedicated System Control Processor (SCP) or equivalent firmware agent (e.g., the Nitro controller on AWS Graviton instances) over a shared-memory mailbox (doorbell register + shared SRAM buffer).
SCMI protocols used by UmkaOS:
| SCMI protocol | Protocol ID | UmkaOS use |
|---|---|---|
| SCMI_PERF | 0x13 | P-state (OPP) transitions per CPU cluster or per-core. PERF_LEVEL_SET maps to the frequency request analogous to IA32_HWP_REQUEST on x86 |
| SCMI_POWER | 0x11 | Power-domain gating (power on/off entire CPU clusters, peripherals). Used during CPU hotplug (Section 7.1.7) and system suspend (Section 7.2.10) |
| SCMI_SENSOR | 0x15 | Read platform sensor values (die temperature, supply voltage); used by the thermal framework (Section 7.2.3) as the sensor backend on SCMI platforms |
| SCMI_PERF_CAP | 0x13 (cap sub-cmd) | Power capping per performance domain, where supported by the SCP firmware |
SCMI message exchange is asynchronous on multi-channel implementations (one
shared-memory channel per CPU cluster); UmkaOS posts a request and either polls or
waits for a doorbell interrupt (platform-dependent). Latency is typically
100–500 µs for a PERF_LEVEL_SET round-trip to the SCP. This is too slow for
per-task frequency switching; SCMI frequency control is therefore applied at the
granularity of runqueue load-balance intervals (typically 4–10 ms), not on every
context switch.
Power caps:
On Graviton2/3 instances, AWS exposes the Nitro hypervisor's power budget to the
guest OS via a platform-specific MMIO register or ACPI DSDT method; there is no
standard SCMI power-capping channel available from within a Graviton VM. On bare
metal (non-VM) AArch64 servers with SCMI power-capping protocol support, UmkaOS uses
SCMI_PERF_CAP to enforce rack-level power budgets.
The PlatformPowerOps::set_power_limit implementation for SCMI platforms translates
the limit_mw parameter into a performance level ceiling via the OPP table and
issues a SCMI_PERF_LEVEL_SET with that ceiling as the maximum. This achieves
power capping by constraining achievable frequency, not by a hardware power clamp
as on x86-64 RAPL.
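The cap-to-ceiling translation can be sketched as a lookup over the OPP power table (hypothetical helper name; the real implementation also maps the OPP index onto SCMI performance-level identifiers):

```rust
/// Pick the highest OPP whose modelled power fits under the cap.
/// `opp_power_mw` holds each OPP's modelled power draw in milliwatts,
/// sorted ascending by frequency (and therefore by power). The returned
/// index becomes the ceiling in the SCMI PERF_LEVEL_SET request; None
/// means even the lowest OPP exceeds the cap, so the limit cannot be
/// honoured by frequency capping alone.
fn perf_ceiling(opp_power_mw: &[u32], limit_mw: u32) -> Option<usize> {
    opp_power_mw.iter().rposition(|&p| p <= limit_mw)
}
```

This is why SCMI capping is best-effort compared to RAPL: the cap is quantised to OPP granularity, and leakage or uncore power below the lowest OPP remains uncontrollable.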
Thermal:
Temperature data on SCMI platforms is obtained via:
1. SCMI_SENSOR protocol: the SCP reads die-temperature sensors and exposes them
to the OS as named sensors. UmkaOS's thermal zone driver registers these as
SensorBackend::Scmi entries (Section 7.2.3.6).
2. Device tree thermal-zones nodes with thermal-sensors references: on embedded
and mobile AArch64 SoCs (Qualcomm, MediaTek, Samsung), temperature sensors are
MMIO-mapped and described in the device tree; UmkaOS's thermal zone driver reads
them directly.
3. ACPI _TSS/_TPC: on ACPI-enumerated AArch64 servers (those following the
ACPI for Arm specification, SBSA/SBBR), the ACPI thermal zone path is the same
as on x86-64.
Thermal trip-point response on SCMI platforms follows the same framework as x86-64
(Section 7.2.3): the thermal interrupt (or polling timer) fires the trip-point
callback, which issues a cooling action via CoolingDevice::set_state. On SCMI
platforms, the FrequencyScalingCooler implementation translates the cooling state
to a SCMI_PERF_LEVEL_SET call.
Runtime device power management:
Device power gating on AArch64 uses one of:
- Device tree power-domains nodes backed by SCMI SCMI_POWER protocol: the
generic power-domain framework calls SCMI_POWER_DOMAIN_STATE_SET to transition
devices between POWER_ON and POWER_OFF.
- PSCI SYSTEM_SUSPEND (function ID 0x8400_000E): used for system-wide suspend to
RAM (Section 7.2.10). Per-CPU idle states also use PSCI CPU_SUSPEND.
- On embedded/mobile: operating-points-v2 device tree nodes with regulator
framework bindings allow the kernel to request voltage changes alongside frequency
changes, forming a complete DVFS (Dynamic Voltage and Frequency Scaling) path.
7.4.3.3 RISC-V¶
The RISC-V ISA specification and the SBI (Supervisor Binary Interface, specification v2.0) do not define a standardised energy reporting or frequency scaling interface equivalent to RAPL or SCMI. Power management on RISC-V platforms is therefore entirely platform-specific.
Energy reporting:
The RISC-V ISA defines hardware performance monitor (HPM) counters: hpmcounter3
through hpmcounter31 (CSRs 0xC03–0xC1F), each counting a platform-defined event
selected by the corresponding mhpmevent machine-mode CSR. Whether any HPM counter
counts an energy-proxy event (e.g., CPU cycles at a known voltage-frequency point)
is platform-defined. On platforms where such a counter exists, UmkaOS's RISC-V energy
driver reads it and converts to milliwatts using a boot-time calibration coefficient
from the device tree or SBI vendor extension.
On platforms with no HPM energy counter, UmkaOS falls back to CPU-time-weighted power
estimation: power (mW) = active_fraction × OPP_power_mW, where OPP_power_mW
comes from the operating-points-v2 device tree node. Cgroup energy accounting
uses this estimated power.
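The fallback estimate is simple enough to state in code. A sketch, using integer arithmetic only (kernel context, no FPU); the function name is illustrative:

```rust
/// Estimate power from CPU-time share, as a fallback when no HPM energy
/// counter exists. `active_us` is the busy time observed within a
/// `window_us` sample window; `opp_power_mw` is the rated power of the
/// current OPP from the operating-points-v2 device tree node.
fn estimated_power_mw(active_us: u64, window_us: u64, opp_power_mw: u64) -> u64 {
    // power (mW) = active_fraction × OPP_power_mW, in integer arithmetic.
    active_us * opp_power_mw / window_us
}
```

For example, a hart that was busy 5 ms out of a 10 ms window at an OPP rated 1800 mW is charged 900 mW.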
Frequency and voltage control:
- SBI HSM (Hart State Management) extension (EID 0x48534D): the HART_SUSPEND function (FID 0x0) requests per-hart low-power clock-gating. This is the only standardised per-hart power state transition in SBI v2.0. It is used for idle (cpu_idle) and for offline CPUs (Section 7.1.7), not for frequency scaling.
- Frequency scaling: there is no standardised RISC-V frequency scaling interface in the SBI specification as of version 2.0. On server-class RISC-V platforms following the RISC-V Server Platform specification (published 2023), ACPI CPPC is required; UmkaOS uses the ACPI CPPC driver path (same as on x86-64 legacy CPPC platforms). On embedded RISC-V SoCs, frequency scaling uses platform-specific MMIO registers described in the device tree, accessed through a platform-specific clock driver.
- SBI vendor extensions: some RISC-V SoC vendors (e.g., T-HEAD/Alibaba for their C906/C910 series cores) define private SBI vendor extensions for DVFS. UmkaOS implements these as optional platform drivers registered at boot if the SBI probe reports the vendor extension ID.
Power caps:
No standardised RISC-V power-capping interface exists in the base SBI specification
or RISC-V platform specifications as of 2025. The PlatformPowerOps::set_power_limit
implementation for RISC-V returns PowerError::NotSupported unless a platform-specific
driver (loaded via device tree compatible string matching) implements a power-capping
MMIO interface.
Thermal:
Temperature sensors on RISC-V platforms are described in the device tree using the
standard thermal-zones binding with thermal-sensors references pointing to
platform-specific thermal sensor nodes (e.g., compatible = "sifive,fu740-temp").
UmkaOS's thermal zone driver reads them via the platform's sensor driver. Trip-point
response uses the same thermal framework as other architectures (Section 7.2.3).
Runtime device power management:
SBI HSM HART_SUSPEND provides per-hart suspend (with and without local context
retention, depending on the suspend_type field). System-wide suspend follows the
platform-specific mechanism (ACPI S3 on RISC-V ACPI platforms; device-tree power
domains on embedded platforms).
7.4.3.4 PPC32 and PPC64LE¶
IBM POWER and PowerPC platforms have two distinct power management environments: bare-metal (directly running on the hardware, including OpenPOWER) and LPAR (Logical Partition, running under the PowerVM or KVM hypervisor). The mechanisms differ between these environments.
Energy reporting:
- LPAR on IBM POWER (PowerVM hypervisor): Energy data is exposed via the PHYP (PowerVM Hypervisor) H-call H_GET_EM_PARMS, which returns the partition's current power consumption as measured by the system's power meters. This is the LPAR equivalent of RAPL: the hypervisor aggregates physical PSU data and attributes a share to each partition.
- Bare metal (OpenPOWER, POWER9/POWER10 with OPAL): The OPAL (OpenPOWER Abstraction Layer) firmware exposes power data via opal_sensor_read (OPAL call 0x30) and opal_sensor_read_u64 (0x52). The ibm,opal-sensors device tree node lists available sensors (die temperature, core power, memory power) by sensor handle. UmkaOS's OPAL sensor driver iterates this list at boot and registers each as a GenericPowerDomain in the PowerDomainRegistry.
- Bare metal without OPAL (classic PPC32 embedded): No hardware power counters are accessible from the OS. CPU-time-weighted OPP estimation is the only option.
- PMU counters: Both PPC32 and PPC64LE have hardware performance monitor facilities (configurable via MMCR0/MMCR1/MMCR2 and PMCx registers). These can count CPU cycles, L2/L3 misses, and memory bandwidth — useful energy proxies — but require platform-specific calibration. UmkaOS optionally uses PMC0 (total cycles) as a proxy if an OPAL/PHYP energy interface is not available.
Frequency and voltage control:
- LPAR (PowerVM): The H_SET_PPP (Processor Folding Priority) H-call allows a partition to request a change in its CPU frequency priority relative to other partitions on the same physical POWER system. This is not a direct frequency knob; the hypervisor honors the request subject to available capacity. UmkaOS issues H_SET_PPP from the scheduler's EAS path when the workload shifts between low and high throughput modes.
- Bare metal OpenPOWER (OPAL): On POWER8 and POWER9 systems with OPAL, CPU frequency is controlled via the opal_set_freq call or by writing to the EPS (Energy Management) registers via OPAL. UmkaOS uses the OPAL cpufreq driver for OpenPOWER platforms.
- ACPI on OpenPOWER: POWER9 and POWER10 systems running the Little-Endian Linux ABI (ppc64le) and ACPI-enumerated (ACPI is supported on OpenPOWER via an SBSA-like profile) can use ACPI CPPC for frequency control, the same as ARM ACPI servers.
- Embedded PPC32 (e500/e500mc): Frequency scaling is platform-specific; most embedded PPC32 SoCs use a simple PLL register write, described in the device tree.
Thermal:
- OPAL platforms: Temperature sensor data is read via opal_sensor_read using the sensor handles discovered from the ibm,opal-sensors node. UmkaOS registers these as SensorBackend::Opal entries in the thermal framework.
- LPAR (PowerVM): Thermal management is entirely hypervisor-controlled; the guest OS has no visibility into die temperature and cannot control throttling. UmkaOS does not register thermal zones in LPAR mode.
- Server platforms with IPMI: Both PPC32 and PPC64LE rack servers typically have a Baseboard Management Controller (BMC) accessible via IPMI. Temperature sensors reported by the BMC are accessed through the IPMI thermal zone backend (Section 7.2.3.6). This is the same DCMI/IPMI path as on x86-64 rack servers (Section 7.2.7).
Runtime device power management:
- LPAR: Device power gating is hypervisor-managed. UmkaOS does not control device power state directly in LPAR mode; the hypervisor handles it transparently.
- OPAL bare metal: OPAL exposes device power domains via opal_pci_set_power_state for PCIe devices and via the ibm,opal device tree node's power-management subnode for on-chip devices.
- Device tree power domains: Embedded PPC32/PPC64 platforms follow the standard device tree power-domains bindings, identical to ARM embedded.
7.4.4 Thermal Framework¶
7.4.4.1 Thermal Zones¶
A thermal zone is a region of the system that has one or more temperature sensors and a set of trip points. Physical examples:
- CPU die (one per socket; typically uses the TCONTROL MSR or PECI for temperature)
- GPU die (integrated or discrete)
- Battery (reported via ACPI _BTP or the Smart Battery System)
- Skin/chassis (NTC thermistor on laptop lid; used to prevent burns)
- NVMe drive (SMART temperature, reported via hwmon Section 13.13)
/// A thermal zone: a named region with a temperature sensor and trip points.
pub struct ThermalZone {
/// Human-readable name (e.g., `"cpu0-die"`, `"battery"`, `"skin"`).
/// Must be unique within the system. Used as the sysfs directory name.
pub name: &'static str,
/// The temperature sensor for this zone.
pub sensor: Arc<dyn TempSensor>,
/// Ordered list of trip points, sorted by `temp_mc` ascending.
///
/// The thermal monitor evaluates all trip points on each poll cycle and
/// fires actions for any whose threshold has been crossed.
///
/// **Boot-time only**: populated by the ACPI/DT thermal zone parser at boot
/// and never resized after the thermal subsystem initializes. `Vec` is used
/// for owned, contiguous storage — not for dynamic growth.
pub trip_points: Vec<TripPoint>,
/// Cooling devices bound to this zone with their maximum cooling state
/// and the trip point(s) that activate them.
///
/// **Boot-time only**: populated at boot alongside `trip_points` and never
/// modified at runtime. Typical zone has 1–4 bindings.
pub cooling_devices: Vec<CoolingBinding>,
/// Current polling interval in milliseconds.
///
/// Starts at 1000 ms (normal), drops to 100 ms when the zone temperature
/// is within 5 °C of any trip point, and drops to 10 ms when within 1 °C
/// of a `Hot` or `Critical` trip point.
pub polling_interval_ms: AtomicU32,
}
7.4.4.2 Trip Points¶
A trip point is a temperature threshold with an associated action type:
/// A temperature threshold that triggers a thermal action when crossed.
pub struct TripPoint {
/// Temperature at which this trip point fires, in millidegrees Celsius.
///
/// For example, 95000 = 95 °C.
pub temp_mc: i32,
/// The action to take when this trip point is crossed.
pub trip_type: TripType,
/// Hysteresis in millidegrees Celsius.
///
/// The trip point is considered cleared only when the temperature drops
/// below `temp_mc - hysteresis_mc`. This prevents oscillation around the
/// threshold. Typical value: 2000 (2 °C).
pub hysteresis_mc: i32,
}
/// The action taken when a thermal trip point threshold is crossed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TripType {
/// Reduce power consumption by notifying the cpufreq governor to lower
/// the maximum CPU frequency. Does not forcefully reduce frequency;
/// relies on the governor to converge. This is the primary mechanism
/// for sustained thermal management.
Passive,
/// Activate a cooling device (e.g., spin up a fan to a higher speed).
/// The bound `CoolingDevice` is set to its next higher state.
Active,
/// The temperature has reached a dangerous level. Post a `ThermalEvent`
/// to userspace monitoring daemons via the thermal netlink socket.
/// Userspace may respond by reducing workload. No kernel-side action.
Hot,
/// Emergency condition. The kernel immediately forces a system poweroff
/// (equivalent to `kernel_power_off()`). This happens synchronously in
/// the thermal interrupt handler or poll loop — userspace is not consulted.
/// Data integrity is not guaranteed; this is a last resort before hardware
/// thermal shutdown.
Critical,
}
The Critical trip point is typically set 5–10 °C below the hardware's own
PROCHOT# shutdown temperature to give the kernel a chance to shut down cleanly
(flushing journal, unmounting filesystems) before the hardware forcibly powers off.
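The hysteresis rule in TripPoint can be captured in a few lines. A sketch (the function is illustrative; the real monitor tracks per-trip state inside the thermal zone):

```rust
/// Evaluate whether a trip point is active, applying the hysteresis rule:
/// a crossed trip stays active until the temperature drops below
/// `trip_mc - hysteresis_mc`, preventing oscillation around the threshold.
fn trip_active(temp_mc: i32, trip_mc: i32, hysteresis_mc: i32, was_active: bool) -> bool {
    if was_active {
        // Already fired: clear only after cooling past the hysteresis band.
        temp_mc >= trip_mc - hysteresis_mc
    } else {
        // Not yet fired: activate at the threshold itself.
        temp_mc >= trip_mc
    }
}
```

With a 95 °C trip and 2 °C hysteresis, the trip fires at 95 000 m°C and does not clear until the zone cools below 93 000 m°C.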
7.4.4.3 Cooling Devices¶
A cooling device is something the kernel can actuate to reduce heat generation or increase heat dissipation:
/// A device that can reduce thermal load on a thermal zone.
///
/// Cooling states are represented as integers from 0 (no cooling) to
/// `max_state()` (maximum cooling). The mapping from state number to physical
/// action is device-specific.
///
/// # Examples
///
/// - `CpufreqCooler`: state 0 = max frequency, state N = minimum frequency.
/// - `FanCooler`: state 0 = fan off, state N = 100% PWM duty cycle.
pub trait CoolingDevice: Send + Sync {
/// Return the maximum cooling state this device supports.
///
/// The device can be set to any state in `[0, max_state()]`.
fn max_state(&self) -> u32;
/// Return the current cooling state.
fn current_state(&self) -> u32;
/// Set the cooling state to `state`.
///
/// Must be idempotent if `state == current_state()`.
/// Returns `ThermalError::OutOfRange` if `state > max_state()`.
fn set_state(&self, state: u32) -> Result<(), ThermalError>;
/// Human-readable name for this cooling device (e.g., `"cpufreq-cpu0"`,
/// `"fan-chassis0"`). Used as the sysfs `type` file content.
fn name(&self) -> &'static str;
}
/// Binding between a thermal zone and a cooling device.
pub struct CoolingBinding {
/// The cooling device to actuate.
pub device: Arc<dyn CoolingDevice>,
/// The trip point index (into `ThermalZone::trip_points`) that activates
/// this binding. The cooling device is stepped up one state each time the
/// thermal zone crosses this trip point.
pub trip_point_index: usize,
/// The cooling state to apply when the trip point is in the active
/// (crossed) state. When the zone cools below `temp_mc - hysteresis_mc`,
/// the device is stepped back down toward 0.
pub target_state: u32,
}
Standard cooling device types provided by UmkaOS:
| Type | Description | State mapping |
|---|---|---|
| CpufreqCooler | Limits max CPU frequency via cpufreq (Section 7.2) | 0 = cpu_max_freq, N = cpu_min_freq, linear interpolation |
| GpufreqCooler | Limits max GPU frequency via drm/gpu driver | Same as above |
| FanCooler | Sets fan PWM duty cycle via hwmon (Section 13.13) | 0 = fan off, max_state() = 100% PWM |
| UsbCurrentCooler | Reduces USB charging current to lower battery heat | 0 = max current, N = 0 mA |
| RaplCooler | Reduces RAPL PKG limit directly | 0 = TDP, N = minimum supported limit |
7.4.4.4 Cooling Map Discovery¶
The binding between thermal zones and cooling devices is discovered at boot from:
- ACPI: _TZD (thermal zone devices), _PSL (passive cooling list), _AL0–_AL9 (active cooling lists). The ACPI thermal driver evaluates these control methods and populates the cooling_devices list in each ThermalZone.
- Device tree: the cooling-maps node under the thermal zone node (binding documented in the Linux kernel's Documentation/devicetree/bindings/thermal/thermal-zones.yaml). UmkaOS parses this during DTB processing (Section 3.14).
- Static board description: For platforms without ACPI or DTB thermal tables, a board-specific Rust module in umka-kernel/src/arch/ can register zones and bindings at compile time.
7.4.4.5 Polling and Interrupt-Driven Monitoring¶
The thermal monitor uses two mechanisms:
Polling (always available): A kernel timer fires periodically to call
TempSensor::read_temp_mc() and evaluate all trip points. The polling interval
is adaptive:
| Temperature distance from nearest trip point | Polling interval |
|---|---|
| > 5 °C below any trip point | 1000 ms |
| 1–5 °C below a Passive or Active trip | 100 ms |
| < 1 °C below a Hot or Critical trip | 10 ms |
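The interval selection from the table can be sketched as a pure function (illustrative; the real monitor also distinguishes which trip type is nearest when both bands apply):

```rust
/// Adaptive polling interval. `dist_mc` is the distance in millidegrees
/// from the current temperature up to the nearest trip point;
/// `near_hot_or_critical` says whether that trip is a Hot or Critical one.
fn polling_interval_ms(dist_mc: i32, near_hot_or_critical: bool) -> u32 {
    if dist_mc < 1_000 && near_hot_or_critical {
        10 // within 1 °C of a Hot/Critical trip
    } else if dist_mc < 5_000 {
        100 // within 5 °C of a trip
    } else {
        1_000 // far from all trips: normal cadence
    }
}
```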
Interrupt-driven (when available): Some platforms provide hardware thermal interrupts that fire when a temperature threshold is crossed:
- Intel PROCHOT interrupt: the CPU asserts PROCHOT# when the die temperature reaches the factory-programmed limit. The kernel registers an interrupt handler on APIC vector 0xFA (the Linux convention for the thermal LVT). This fires before RAPL-based throttling takes effect.
- AMD SB-TSI alert: an SMBus alert from the SB-TSI temperature sensor on AMD platforms. Handled by the amd_sb_tsi I2C driver.
- ACPI _HOT/_CRT notify: the firmware sends an ACPI notify event when a thermal zone crosses its Hot or Critical temperature. The ACPI event handler evaluates the zone immediately rather than waiting for the next poll cycle.
Interrupt-driven monitoring reduces the latency from temperature threshold crossing to kernel response from ≤ 1000 ms (polling) to ≤ 100 µs (interrupt).
7.4.4.6 Temperature Sensor Abstraction¶
/// A hardware temperature sensor.
///
/// Implementations include: x86 PECI (Platform Environment Control Interface),
/// ACPI `_TMP` control method, I2C/SMBus sensors (LM75, TMP102, etc.),
/// and ARM SoC on-die sensors.
pub trait TempSensor: Send + Sync {
/// Read the current temperature in millidegrees Celsius.
///
/// Returns `ThermalError::SensorFault` if the hardware sensor reports
/// an error condition (e.g., I2C NACK, PECI timeout).
fn read_temp_mc(&self) -> Result<i32, ThermalError>;
/// Human-readable name for this sensor (e.g., `"peci-cpu0"`, `"acpi-tz0"`).
fn name(&self) -> &'static str;
}
/// Errors returned by thermal framework operations.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ThermalError {
/// The sensor or cooling device is not available or not initialised.
NotAvailable,
/// The sensor returned an error condition (hardware fault or communication error).
SensorFault,
/// The requested cooling state is outside `[0, max_state()]`.
OutOfRange,
/// The cooling device is currently locked by another subsystem.
DeviceBusy,
}
7.4.4.7 Linux sysfs Compatibility¶
UmkaOS exposes the thermal framework under the same sysfs paths as the Linux kernel thermal framework, enabling unmodified Linux monitoring tools:
/sys/class/thermal/
thermal_zone0/
type # zone name (e.g., "x86_pkg_temp")
temp # current temperature in millidegrees (e.g., "52000")
mode # "enabled" or "disabled"
trip_point_0_temp # first trip point temperature
trip_point_0_type # "passive", "active", "hot", or "critical"
trip_point_0_hyst # hysteresis in millidegrees
policy # cooling policy: "step_wise" or "user_space"
cooling_device0/
type # cooling device name (e.g., "Processor")
max_state # maximum cooling state
cur_state # current cooling state
The type file content for CPU cooling devices uses the string "Processor"
for compatibility with lm_sensors, thermald, and similar tools that match
on this string.
7.4.5 Powercap Interface (sysfs)¶
The powercap sysfs hierarchy provides a unified interface for reading energy
counters and setting power limits. UmkaOS's layout is byte-for-byte compatible with
Linux's intel_rapl_msr driver output, ensuring that existing power monitoring
and management tools work without modification.
7.4.5.1 Directory Structure¶
/sys/devices/virtual/powercap/
intel-rapl/ # Control type: Intel RAPL
intel-rapl:0/ # Socket 0 PKG domain
name # "package-0"
energy_uj # Cumulative energy (µJ, read-only, wraps)
max_energy_range_uj # Counter wrap value in µJ
constraint_0_name # "long_term"
constraint_0_power_limit_uw # Long-window limit in µW (read-write)
constraint_0_time_window_us # Long-window duration in µs (read-write)
constraint_0_max_power_uw # Maximum settable limit (TDP) in µW
constraint_1_name # "short_term"
constraint_1_power_limit_uw # Short-window limit in µW (read-write)
constraint_1_time_window_us # Short-window duration in µs (read-write)
constraint_1_max_power_uw # Maximum settable short-term limit
enabled # "1" to enable limits, "0" to disable
intel-rapl:0:0/ # Socket 0 Core (PP0) sub-domain
name # "core"
energy_uj
max_energy_range_uj
constraint_0_name # "long_term"
constraint_0_power_limit_uw
constraint_0_time_window_us
constraint_0_max_power_uw
enabled
intel-rapl:0:1/ # Socket 0 Uncore (PP1) sub-domain (client only)
name # "uncore"
...
intel-rapl:1/ # Socket 1 PKG domain (dual-socket servers)
...
The DRAM domain appears as a separate top-level entry on server platforms.
On AMD Zen2+ systems, the same layout is used with the control type still named
intel-rapl for compatibility (Linux uses the same driver name). AMD-specific
extensions (if any) appear in an amd-rapl control type directory.
7.4.5.2 Tool Compatibility¶
The following tools work against UmkaOS's powercap hierarchy without modification:
| Tool | Use |
|---|---|
| powerstat | Per-socket power consumption over time |
| turbostat | CPU frequency, power, and temperature combined |
| s-tui | Terminal UI showing frequency and power |
| powertop | Process-level power attribution (uses /proc, not powercap, but reads energy_uj) |
| Prometheus node_exporter | --collector.powersupplyclass and powercap collector |
| rapl-read | Low-level RAPL register dump |
7.4.5.3 Write Semantics¶
Writing constraint_N_power_limit_uw calls RaplInterface::set_power_limit() on the
corresponding RaplDomain. Writes from unprivileged userspace are rejected with
EPERM. Root (or a process with CAP_SYS_ADMIN) may write any domain.
Writing a limit that exceeds the domain's constraint_N_max_power_uw returns EINVAL.
Writing 0 is equivalent to calling RaplInterface::clear_power_limit() (removes the
software limit, restoring hardware default).
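These write semantics amount to a small validation ladder. A sketch (the enum and function are illustrative, standing in for the sysfs store handler):

```rust
/// Outcome of a write to constraint_N_power_limit_uw (sketch).
#[derive(Debug, PartialEq)]
enum WriteResult {
    Eperm,          // unprivileged writer
    Einval,         // limit above constraint_N_max_power_uw
    ClearLimit,     // 0 written: restore hardware default
    SetLimit(u64),  // accepted limit in µW
}

/// Validate a powercap constraint write according to the rules above.
fn validate_limit_write(has_cap_sys_admin: bool, value_uw: u64, max_power_uw: u64) -> WriteResult {
    if !has_cap_sys_admin {
        WriteResult::Eperm
    } else if value_uw == 0 {
        WriteResult::ClearLimit
    } else if value_uw > max_power_uw {
        WriteResult::Einval
    } else {
        WriteResult::SetLimit(value_uw)
    }
}
```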
7.4.6 Cgroup Power Accounting¶
7.4.6.1 Design¶
Energy consumption is attributed to cgroups using a sampling-based model that
parallels CPU time accounting. A dedicated kernel thread (the power accounting
thread) wakes every 10 ms (configurable via /proc/sys/kernel/power_sample_interval_ms,
range 1–1000 ms) and:
- Reads energy_uj from all active RAPL domains (all sockets, all sub-domains).
- Computes the delta from the previous sample, handling counter wrap-around.
- Queries the scheduler to get, for each cgroup, the CPU time consumed in the last 10 ms interval.
- Distributes the energy delta across cgroups proportional to their CPU time share.
- Accumulates the attributed energy into each cgroup's power.energy_uj counter.
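The wrap-around handling mentioned above follows from the fact that energy_uj wraps at max_energy_range_uj. A minimal sketch:

```rust
/// Wrap-aware delta between two successive energy_uj readings.
/// A current reading smaller than the previous one means the counter
/// wrapped exactly once within the (short) sampling interval.
fn energy_delta_uj(prev_uj: u64, cur_uj: u64, max_range_uj: u64) -> u64 {
    if cur_uj >= prev_uj {
        cur_uj - prev_uj
    } else {
        (max_range_uj - prev_uj) + cur_uj
    }
}
```

This assumes at most one wrap per interval, which holds because the counter's wrap period (tens of seconds at typical package power) is far longer than the 10 ms sampling interval.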
Clarification on per-cgroup overhead: The 10 ms interval is the RAPL sampling and reporting period — a single kernel thread reads hardware energy counters and distributes the delta. This is NOT per-cgroup polling. The power accounting thread performs one RAPL read per domain per interval (typically 4-8 RAPL domains total), then a single O(n) pass over active cgroups to distribute the delta. The per-cgroup CPU time data is already maintained by the scheduler's existing accounting (updated on context switch and dequeue events, not by polling). Therefore, even with 4096 active cgroups, the accounting thread's cost is: ~4-8 RAPL reads (~200 ns each) + one linear scan of cgroup time deltas (~4096 × ~20 ns = ~80 μs) = under 100 μs per 10 ms interval, or <0.001% CPU. The naive concern that 4096 cgroups × 10 ms polling would cost ~4% CPU assumes each cgroup requires independent hardware polling; in reality, the hardware counters are per-socket (not per-cgroup) and the per-cgroup attribution is a lightweight arithmetic distribution.
This is the same weighted attribution model used by Linux's cpuacct cgroup
controller and, more recently, by Intel's Energy Aware Scheduling patches.
7.4.6.2 Attribution Model¶
Let E_delta be the total PKG energy delta in the current interval (µJ),
and let T_i be the CPU time consumed by cgroup i in the interval (µs).
The energy attributed to cgroup i is:
E_i = E_delta × T_i / Σ T_j
where the sum is over all cgroups with T_j > 0. Idle time (no cgroup running)
is attributed to a synthetic idle cgroup and not charged to any user cgroup.
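The proportional distribution can be sketched directly from the definitions of E_delta and T_i (illustrative function; integer division leaves a small unattributed remainder, which the real accounting carries over between intervals):

```rust
/// Distribute a PKG energy delta across cgroups proportional to their
/// CPU time in the interval: E_i = E_delta × T_i / Σ T_j, all in integers.
fn attribute_energy_uj(e_delta_uj: u64, times_us: &[u64]) -> Vec<u64> {
    let total: u64 = times_us.iter().sum();
    times_us
        .iter()
        .map(|&t| if total == 0 { 0 } else { e_delta_uj * t / total })
        .collect()
}
```

For a 1000 µJ delta with cgroups that ran 3000 µs and 1000 µs, the split is 750 µJ and 250 µJ.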
Limitation: This model has two known imprecisions:
- It does not account for memory bandwidth differences between cgroups sharing a socket. A cgroup running a memory-bandwidth-intensive workload consumes more power per CPU cycle than one running a compute-bound workload, but they receive the same energy charge per CPU time unit. This is acceptable for accounting and billing; it is not suitable for precise per-process energy metering.
- On a multi-socket server, PKG energy from socket 0 may be attributed to a cgroup whose threads ran on socket 1 if the sampling window captures a migration. The error is bounded by one sampling interval (10 ms default).
7.4.6.3 Cgroup Interface¶
The power cgroup controller provides the following files:
| File | Mode | Description |
|---|---|---|
| power.energy_uj | R | Cumulative energy attributed to this cgroup in µJ. Wraps at u64::MAX. |
| power.stat | R | Per-domain energy breakdown: pkg_energy_uj, core_energy_uj, dram_energy_uj. |
| power.limit_uw | RW | Power limit for this cgroup in µW. 0 = no limit. Setting a non-zero value enables power cap enforcement (Section 7.4). |
| power.limit_window_ms | RW | Averaging window for power.limit_uw enforcement, in ms. Default: 100. |
These files are created under the cgroup hierarchy directory, e.g.:
/sys/fs/cgroup/<cgroup-path>/power.energy_uj
7.4.6.4 Per-Cgroup Power Limit Enforcement¶
When power.limit_uw is non-zero, the power accounting thread checks, at each
sample interval, whether the cgroup's rolling-average power consumption (calculated
from power.energy_uj deltas over power.limit_window_ms) exceeds the limit.
The power accounting thread runs at SCHED_NORMAL priority (nice 0) and is
pinned to NUMA node 0's first online CPU. It does not require real-time priority
because power accounting tolerates jitter: the default sample interval is 10 ms,
and a sample delayed by one scheduler tick (1–4 ms) has negligible impact on
energy attribution accuracy.
If the limit is exceeded:
- The cgroup's effective RAPL PKG short-window limit is reduced proportionally to bring the cgroup's power consumption within budget. This is implemented by adjusting the cpu.max bandwidth (Section 7.6) for the cgroup's tasks — reducing their CPU time allocation reduces their power consumption.
- A PowerLimitEvent is posted to the cgroup's event fd (readable via cgroup.events), allowing userspace monitoring daemons to observe throttling.
Limitation: RAPL enforcement at sub-PKG granularity (per-cgroup, per-core) is
not directly supported by hardware. Per-cgroup limits are enforced indirectly via
CPU time throttling (Section 7.6). True per-cgroup hardware power isolation would require
per-core RAPL (available on some Intel Xeon generations as MSR_PP0_POWER_LIMIT)
combined with strict core pinning — a configuration that umka-kvm uses for VM
power budgets (Section 7.4), but which is not the general case.
7.4.7 VM Power Budget Enforcement¶
7.4.7.1 Motivation¶
Traditional VM resource accounting models (CPU cores, RAM) do not capture actual power consumption. A VM running a STREAM memory-bandwidth benchmark or a dense linear algebra kernel (e.g., BLAS DGEMM with AVX-512) can consume 2–3× the power of a VM running a web server at equivalent CPU utilisation. In a datacenter where the binding constraint is rack PDU amperage, not CPU cores, CPU-count quotas systematically mis-model the actual cost of workloads.
Watt-based quotas reflect actual rack power budget more honestly:
- A 500W rack PDU can host either ten 50W VMs or five 100W VMs regardless of how many vCPUs each is assigned.
- A burst-capable VM (bursty ML inference job) can be allocated 80W sustained with a 150W burst cap for 10 ms — mirroring the RAPL two-tier limit model.
- Overcommit is detectable and rejectable at admission time by comparing
sum(vm_power_limit_mw)against measured or rated socket TDP.
7.4.7.2 Mechanism¶
When umka-kvm creates a VM with a vm_power_limit_mw budget:
- Dedicated cgroup: A cgroup is created at /sys/fs/cgroup/umka-vms/<vm-id>/ for the VM's vCPU threads. power.limit_uw is set to vm_power_limit_mw * 1000.
- Core pinning: The VM's vCPU threads are pinned to a CPU set on a single socket (or across sockets if vm_numa_topology specifies multi-socket). This ensures energy counter attribution is accurate (Section 7.4, limitation 2).
- Socket RAPL coordination: The PKG short-window limit for each socket is set to sum(vm_power_limit_mw for all VMs pinned to that socket) + headroom_mw, where headroom_mw is a configurable per-socket constant (default: 10% of TDP) reserved for host kernel overhead.
- Monitoring: The VmPowerBudget::update() method is called by umka-kvm's power accounting thread every 100 ms. If the VM exceeds its budget, vCPU scheduling quota is reduced.
/// Power budget tracking for a single VM.
pub struct VmPowerBudget {
/// Allocated sustained power budget for this VM in milliwatts.
pub limit_mw: u32,
/// Allocated burst power limit in milliwatts, enforced for windows ≤ 10 ms.
///
/// Maps to the RAPL short-window limit. Set to `limit_mw` if no burst
/// allowance is configured (conservative mode).
pub burst_limit_mw: u32,
/// Measured average power consumption over the last 1-second sliding window,
/// in milliwatts. Updated by `update()` every 100 ms.
pub measured_mw: AtomicU32,
/// Number of times vCPU quota was reduced due to power budget violation.
///
/// Monotonically increasing. Used for throttle-rate monitoring and alerting.
pub throttle_count: AtomicU64,
/// Handle to the cgroup backing this VM's power accounting and enforcement.
cgroup: CgroupHandle,
}
impl VmPowerBudget {
/// Called every 100 ms by umka-kvm's power accounting thread.
///
/// Reads the energy delta from the VM's cgroup, updates `measured_mw`,
/// and reduces vCPU scheduling quota if the budget is exceeded.
///
    /// `sched` is the CPU bandwidth controller for this VM's vCPU threads (Section 7.6).
pub fn update(&self, sched: &CpuBandwidth) {
let delta_uj = self.cgroup.read_energy_delta_uj();
// 100 ms window: delta_uj / 100 ms = µJ/ms = mW.
let measured = (delta_uj / 100) as u32;
self.measured_mw.store(measured, Ordering::Relaxed);
if measured > self.limit_mw {
// Reduce vCPU CBS quota proportional to overage fraction.
// Uses fixed-point arithmetic (percentage × 10) to avoid FPU in kernel.
// Example: measured = 120 mW, limit = 100 mW →
// overage = 20, reduction_permille = 20 * 1000 / 120 = 166 (16.6%).
let overage = measured - self.limit_mw;
let reduction_permille = overage * 1000 / measured;
sched.reduce_quota_permille(reduction_permille);
self.throttle_count.fetch_add(1, Ordering::Relaxed);
}
}
}
7.4.7.3 Admission Control¶
Before umka-kvm creates a new VM with a vm_power_limit_mw budget, it checks
the admission constraint:
sum(vm_power_limit_mw over existing VMs on the socket) + new_vm_power_limit_mw + host_headroom_mw ≤ socket_tdp_mw
where socket_tdp_mw is read from RaplInterface::read_tdp_mw(PowerDomainType::Pkg)
at boot, and host_headroom_mw defaults to 10% of TDP (configurable via
/sys/module/umka_kvm/parameters/power_headroom_pct).
If the constraint is violated, umka-kvm returns ENOSPC to the VM creation ioctl.
The caller (e.g., an orchestrator) must either reduce the requested budget, migrate
an existing VM to another socket/host, or reject the workload.
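The admission check reduces to a single comparison. A sketch (illustrative function; the real gate operates per socket inside umka-kvm and returns ENOSPC on failure):

```rust
/// Admission check: a new VM budget is admitted only if the sum of all
/// budgets on the socket plus host headroom still fits under socket TDP.
fn admit_vm(existing_mw: &[u32], new_mw: u32, socket_tdp_mw: u32, headroom_pct: u32) -> bool {
    let headroom_mw = socket_tdp_mw as u64 * headroom_pct as u64 / 100;
    let total: u64 =
        existing_mw.iter().map(|&m| m as u64).sum::<u64>() + new_mw as u64;
    total + headroom_mw <= socket_tdp_mw as u64
}
```

On a 200 W socket with the default 10% headroom (20 W), two existing VMs at 50 W and 60 W leave room for a 50 W VM but not an 80 W one.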
This is an admission control gate, not a guarantee. Actual power consumption may exceed TDP temporarily due to:
- Turbo Boost / AMD Precision Boost (transient power above TDP for ≤ 10 ms)
- RAPL enforcement latency (the hardware enforces limits over the configured window, typically 1-10 ms; instantaneous power can spike). On ARM platforms using SCMI (System Control and Management Interface), enforcement latency is firmware-dependent and typically higher (~10-100 ms) because power limits are communicated via SCMI mailbox messages to the SCP (System Control Processor), which enforces them asynchronously.
These transient overages are expected and handled by the hardware's own thermal and power-delivery circuitry. UmkaOS's admission control operates at the sustained (long-window) level.
7.4.7.4 Observability¶
Per-VM power accounting is exposed via:
/sys/fs/cgroup/umka-vms/<vm-id>/power.energy_uj # Cumulative energy (µJ)
/sys/fs/cgroup/umka-vms/<vm-id>/power.stat # Per-domain breakdown
/sys/fs/cgroup/umka-vms/<vm-id>/power.limit_uw # Current limit (µW)
umka-kvm also exposes power metrics via the KVM statistics interface
(/sys/bus/event_source/devices/kvm/), enabling Prometheus node_exporter
KVM collector to report per-VM power consumption.
7.4.8 DCMI / IPMI Rack Power Management¶
7.4.8.1 Overview¶
In server deployments managed by a Baseboard Management Controller (BMC), the BMC may impose a platform-level power cap via the Data Center Manageability Interface (DCMI), an extension of IPMI v2.0 (specification: DCMI v1.5, published by Intel/DMTF).
DCMI provides the following power management commands over the IPMI channel:
| DCMI Command | NetFn/Cmd | Description |
|---|---|---|
| Get Power Reading | 2C/02h | Current platform power in watts (instantaneous, min, max, average over a rolling window) |
| Get Power Limit | 2C/03h | Read the currently configured platform power cap |
| Set Power Limit | 2C/04h | Set a platform power cap and exception action (hard power-off or OEM-defined) |
| Activate/Deactivate Power Limit | 2C/05h | Enable or disable the configured power cap |
| Get DCMI Capabilities | 2C/01h | Enumerate which DCMI features the BMC supports |
These commands are sent by the datacenter management infrastructure (e.g., OpenBMC, Redfish, Dell iDRAC, HP iLO) to impose a rack-level power budget on individual servers.
7.4.8.2 UmkaOS Integration¶
The UmkaOS IPMI driver (Tier 1; KCS (Keyboard Controller Style), SMIC, or BT system
interfaces over LPC or I2C, as described in Section 13.13) handles incoming DCMI commands from the BMC.
When the BMC asserts a power cap via Set Power Limit + Activate Power Limit,
the kernel responds as follows:
BMC sets cap C_bmc (watts)
│
▼
umka-ipmi driver receives DCMI Set/Activate Power Limit
│
├─► Reduce aggregate RAPL PKG limits across all sockets (TDP-proportional)
│ for each socket i:
│ new_pkg_limit[i] = C_bmc × (socket_tdp[i] / Σ socket_tdp)
│ TDP values are read from MSR_PKG_POWER_INFO (x86) or ACPI PPTT at boot.
│ On heterogeneous systems (mixed socket SKUs), this gives higher-TDP
│ sockets a proportionally larger share of the cap. On homogeneous systems,
│ this reduces to C_bmc / num_sockets.
│ → RaplInterface::set_power_limit(Pkg, new_pkg_limit[i], long_window_ms)
│
├─► Notify umka-kvm to reduce VM watt budgets proportionally
│ reduction_factor = C_bmc / current_total_vm_budget
│ → for each VM: VmPowerBudget::limit_mw *= reduction_factor
│ → Re-run admission control check (may trigger VM migration signal)
│
└─► Post PowerCapEvent to userspace monitoring channel
→ sysfs: /sys/bus/platform/drivers/dcmi/power_cap_uw (updated)
→ netlink thermal event (for `thermald` compatibility)
→ KVM statistics update (for Prometheus collector)
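The TDP-proportional split in the diagram can be sketched directly. Integer milliwatt arithmetic; the function name is illustrative:

```rust
/// new_pkg_limit[i] = cap × (socket_tdp[i] / Σ socket_tdp), in milliwatts.
/// On homogeneous sockets this reduces to cap / num_sockets.
fn split_cap_mw(cap_mw: u64, socket_tdp_mw: &[u64]) -> Vec<u64> {
    let total_tdp: u64 = socket_tdp_mw.iter().sum();
    socket_tdp_mw
        .iter()
        .map(|&tdp| cap_mw * tdp / total_tdp) // higher-TDP sockets get a larger share
        .collect()
}
```

Each per-socket result is then passed to `RaplInterface::set_power_limit(Pkg, …, long_window_ms)` as shown above.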
7.4.8.3 Escalation Hierarchy¶
Power management operates at three levels, each enforced by a different actor:
| Level | Mechanism | Enforced by | Override possible? |
|---|---|---|---|
| Software power limit | RAPL `MSR_PKG_POWER_LIMIT` | UmkaOS kernel | Yes (root can raise within TDP) |
| BMC power cap | DCMI Set Power Limit | BMC firmware | Only by BMC admin |
| Physical current limit | PSU OCP / PDU circuit breaker | Hardware | No |
The kernel controls only the first level. The BMC cap (second level) is
communicated to the kernel via DCMI but ultimately enforced by the BMC's power
management controller, which can throttle the server via SYS_THROT# or force
a hard power-off regardless of OS state. The kernel's DCMI integration is
cooperative, not authoritative.
7.4.8.4 DcmiPowerCap Interface¶
/// Interface for the DCMI power cap enforcement callback.
///
/// Implemented by the IPMI driver. Called when the BMC asserts or modifies
/// a DCMI power limit.
pub trait DcmiPowerCap: Send + Sync {
/// Called when the BMC sets a new platform power cap.
///
/// `cap_mw` is the new cap in milliwatts. `0` indicates the cap has been
/// deactivated (no limit). Implementors must update RAPL limits and notify
/// umka-kvm within this call or schedule it for immediate async processing.
fn on_cap_set(&self, cap_mw: u32);
/// Return the currently active BMC-imposed cap in milliwatts.
///
/// Returns `None` if no cap is currently active.
fn current_cap_mw(&self) -> Option<u32>;
/// Return the last measured platform power reading from the BMC in milliwatts.
///
/// This is the BMC's own measurement, which may differ from RAPL's
/// (BMC measures at the PSU, RAPL measures at the socket).
fn last_reading_mw(&self) -> u32;
}
7.4.9 Battery and SMBus Monitoring¶
SMBus (System Management Bus) is a subset of I2C used for battery/charger chips. Example: Smart Battery System (SBS) batteries expose registers at I2C address 0x0B:
- 0x08: Temperature (in 0.1K units).
- 0x09: Voltage (in mV).
- 0x0A: Current (in mA, signed).
- 0x0D: Relative State of Charge (0-100%).
- 0x0F: Remaining Capacity (in mAh).
The battery driver (Tier 1, probed via ACPI PNP0C0A device) reads these registers periodically (every 5 seconds when on battery, every 60 seconds when on AC) and exposes them via sysfs/umkafs (see Section 7.4 for the userspace interface).
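Two of the registers above need explicit decoding: Temperature is an unsigned word in 0.1 K units, and Current is a two's-complement signed word. A sketch with illustrative helper names (the periodic read loop lives in the battery driver):

```rust
/// SBS register offsets from the list above (battery at I2C address 0x0B).
const SBS_ADDR: u8 = 0x0B;
const REG_TEMPERATURE: u8 = 0x08; // unsigned, 0.1 K units
const REG_CURRENT: u8 = 0x0A; // signed, mA

/// Convert an SBS Temperature word (0.1 K units) to milli-degrees Celsius.
fn sbs_temp_to_millicelsius(raw: u16) -> i32 {
    raw as i32 * 100 - 273_150
}

/// SBS Current is a two's-complement word: negative means discharging.
fn sbs_current_ma(raw: u16) -> i16 {
    raw as i16
}
```

For example, a raw Temperature reading of 2982 (298.2 K) decodes to 25.05 °C, and a raw Current of 0xFF38 decodes to −200 mA (discharging).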
7.4.10 Consumer Power Profiles¶
This subsection defines the user-facing policy layer; Section 7.4–Section 7.4 define the underlying mechanisms.
7.4.10.1 Power Profile Enumeration¶
// umka-core/src/power/profile.rs
/// User-facing power profile (consumer policy; translates to Section 7.2 mechanisms).
#[repr(u32)]
pub enum PowerProfile {
/// Maximum performance. AC adapter expected.
Performance = 0,
/// Balanced performance and power. Default on AC.
Balanced = 1,
/// Aggressive power saving. Default on battery.
BatterySaver = 2,
/// User-defined constraints loaded from /ukfs/power/custom_profile.
Custom = 3,
}
7.4.10.2 Profile → Mechanism Translation¶
Each profile maps to concrete Section 7.4 parameters:
| Profile | RAPL PKG limit | CPU turbo | GPU freq cap | WiFi PSM | Display brightness |
|---|---|---|---|---|---|
| Performance | None (HW TDP) | Enabled | 100% | Disabled | 100% |
| Balanced | 80% TDP | Enabled | 80% | PSM | 75% |
| BatterySaver | 50% TDP | Disabled | 40% | Aggressive | 40% |
set_profile() calls RaplInterface::set_power_limit() (Section 7.4), the cpufreq
governor, and WirelessDriver::set_power_save() (Section 13.2). No RAPL MSR writes
happen in consumer-layer code — all hardware access goes through Section 7.4.
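The table rows translate mechanically into RAPL parameters. A sketch of the PKG-limit column (the local enum mirrors Section 7.4.10.1; `Custom` is omitted because its limits come from a user-supplied profile):

```rust
/// Local mirror of the PowerProfile enum (Section 7.4.10.1), Custom omitted.
#[derive(Clone, Copy)]
enum PowerProfile {
    Performance,
    Balanced,
    BatterySaver,
}

/// PKG RAPL limit in milliwatts for a profile, per the table above.
/// `None` means no software cap (the hardware TDP applies).
fn pkg_limit_mw(profile: PowerProfile, tdp_mw: u64) -> Option<u64> {
    match profile {
        PowerProfile::Performance => None, // HW TDP
        PowerProfile::Balanced => Some(tdp_mw * 80 / 100), // 80% TDP
        PowerProfile::BatterySaver => Some(tdp_mw * 50 / 100), // 50% TDP
    }
}
```

`set_profile()` would feed a `Some` result into `RaplInterface::set_power_limit()`; the GPU, WiFi, and display columns follow the same pattern against their respective interfaces.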
7.4.10.3 AC/Battery Auto-Switch¶
The PowerManager listens for ACPI AC adapter events (from the battery driver,
Section 7.4) and automatically applies the user's preferred profile for each power
source. On critical battery (≤5%), BatterySaver is forced. A BatteryCritical
event is posted to the Section 7.9 event ring so userspace can display a notification.
7.4.10.4 Per-Process Power Attribution¶
Per-process energy attribution (for desktop power managers like GNOME Settings,
KDE Powerdevil) is provided by the Section 7.4 cgroup power accounting. Per-process
granularity: each process lives in a cgroup; power.energy_uj on that cgroup
gives its energy consumption. Exposed via /proc/<pid>/power_consumed_uj in
umka-sysapi procfs.
7.4.10.5 Userspace Interface¶
Kernel exposes power management state via the following paths:
| Path | Description |
|---|---|
| `/sys/kernel/umka/power/profile` | Read/write power profile selection (performance, balanced, battery-saver). Writable by processes with CAP_SYS_ADMIN. |
| `/proc/<pid>/power_consumed_uj` | Per-process energy consumption in microjoules (RAPL cgroup attribution, Section 7.11). |
| `/sys/class/power_supply/BAT0/capacity` | Battery charge percentage (0–100). |
| `/sys/class/power_supply/BAT0/energy_now` | Remaining energy in µWh. |
| `/sys/class/power_supply/BAT0/current_now` | Discharge/charge current in µA. |
| `/sys/class/power_supply/BAT0/cycle_count` | Charge cycle count. |
| `/sys/class/power_supply/BAT0/status` | Charging, Discharging, Full, Unknown. |
Time-remaining estimation, low-battery notifications, and battery health display are handled by userspace daemons (UPower or equivalent) reading these paths.
Continued in Section 7.5: Suspend/resume protocol (S3, S0ix), driver suspend/resume callbacks, device suspend ordering, resume failure recovery, per-device runtime power management (two-counter state machine), and cpuidle governor.
7.5 Suspend, Resume, and Runtime Power Management¶
7.5.1 Suspend and Resume Protocol¶
Context: The S4Hibernate variant in SleepState (below) covers S4 hibernate — see Section 18.4 for the full hibernate specification. This section specifies S3 (Suspend-to-RAM) and S0ix (Modern Standby), which are the primary suspend mechanisms on consumer laptops.
7.5.1.1 Sleep State Enumeration¶
// umka-core/src/power/suspend.rs
/// ACPI sleep state.
#[repr(u32)]
pub enum SleepState {
/// S3: Suspend-to-RAM. CPU powered off, DRAM refreshing, ~2-5W, wake in <2s.
S3SuspendToRam = 3,
/// S0ix: Modern Standby (S0 Low Power Idle). CPU in deep C-states, OS "running",
/// network alive, <1W, instant wake. Intel 6th gen+, AMD Ryzen 3000+, ARM.
S0ixModernStandby = 0x0F, // Not a standard ACPI state, vendor-specific
/// S4: Hibernate. See [Section 18.4](18-virtualization.md#suspend-and-resume) for the
/// full specification (snapshot creation, dm-crypt signed snapshot, resume protocol).
S4Hibernate = 4,
}
/// Power state machine states.
#[repr(u32)]
pub enum SuspendPhase {
/// System running normally.
Running = 0,
/// Pre-suspend: freeze userspace, sync filesystems.
PreSuspend = 1,
/// Device suspend: call driver suspend callbacks.
DeviceSuspend = 2,
/// CPU suspend: save CPU state, enter ACPI sleep state.
CpuSuspend = 3,
/// (system asleep, this state is never observed by running code)
Asleep = 4,
/// CPU resume: restore CPU state.
CpuResume = 5,
/// Device resume: call driver resume callbacks.
DeviceResume = 6,
/// Post-resume: thaw userspace.
PostResume = 7,
/// Unwind in progress due to a failure during suspend.
/// The system transitions through this state while reversing
/// already-completed suspend steps. Returns to `Running` after
/// cleanup completes.
SuspendAbort = 8,
}
7.5.1.2 Power State Machine¶
impl SuspendManager {
/// Initiate suspend to a given sleep state.
pub fn suspend(&self, state: SleepState) -> Result<(), SuspendError> {
// Phase 1: PreSuspend
self.set_phase(SuspendPhase::PreSuspend);
self.freeze_userspace()?; // Stop all userspace tasks
self.sync_filesystems()?; // Flush all dirty pages, journal commits
// Phase 2: DeviceSuspend
self.set_phase(SuspendPhase::DeviceSuspend);
// Before calling SuspendResume::suspend(), the suspend path calls
// rtpm_get_sync(dev) for any device in D2/D3hot state (i.e., runtime PM
// Suspended). This ensures the device is in D0 (active) before
// receiving the suspend callback — drivers must not receive a
// system suspend while in a runtime-suspended power state, because
// their suspend callback assumes an active device context.
//
// Devices in RtpmState::Switching (mid-tier-change) are handled by
// wait_for_tier_switches() — see below.
self.reconcile_runtime_pm()?;
// Wait for any in-progress tier switches to complete before
// suspending devices. A device in RtpmState::Switching is between
// driver teardown and replacement init — it cannot receive a suspend
// callback in this state.
self.wait_for_tier_switches()?;
if let Err(e) = self.suspend_devices(state) {
// Unwind: resume already-suspended devices in reverse order,
// then thaw userspace. Without this, userspace remains frozen
// and the system is deadlocked.
self.set_phase(SuspendPhase::SuspendAbort);
self.resume_devices_partial(state); // resume [0..N-1] in reverse
self.thaw_userspace().ok(); // best-effort thaw
self.set_phase(SuspendPhase::Running);
return Err(e);
}
// Phase 3: CpuSuspend
self.set_phase(SuspendPhase::CpuSuspend);
if let Err(e) = self.save_cpu_state() {
self.set_phase(SuspendPhase::SuspendAbort);
self.resume_devices(state).ok();
self.thaw_userspace().ok();
self.set_phase(SuspendPhase::Running);
return Err(e);
}
self.enter_acpi_sleep_state(state)?; // Write to ACPI PM1a_CNT, CPU halts
// ... (system asleep, wake event occurs) ...
// Phase 4: CpuResume (code resumes here after wake)
self.set_phase(SuspendPhase::CpuResume);
self.restore_cpu_state()?; // Restore registers, reload CR3, GDTR, IDTR
// Phase 5: DeviceResume
self.set_phase(SuspendPhase::DeviceResume);
self.resume_devices(state)?; // Call driver resume callbacks in probe order
// Phase 6: PostResume
self.set_phase(SuspendPhase::PostResume);
self.thaw_userspace()?; // Unfreeze userspace tasks
self.set_phase(SuspendPhase::Running);
Ok(())
}
/// Wait for all devices in `RtpmState::Switching` to complete their
/// tier switch before proceeding with system suspend. A tier switch
/// involves tearing down the old driver domain and loading the
/// replacement driver — the device cannot receive suspend callbacks
/// during this window.
///
/// # Algorithm
///
/// 1. Enumerate all devices in the device registry.
/// 2. For each device whose `rtpm.state == RtpmState::Switching`:
/// a. Poll `rtpm.state` with 1 ms intervals up to a 100 ms timeout.
/// b. If the state transitions to `Active` (or any non-`Switching`
/// state) within the timeout: continue — the tier switch completed
/// and the device is ready for the normal suspend path.
/// c. If 100 ms expires with the device still in `Switching`:
/// abort the tier switch by signaling the replacement driver's
/// `init()` to cancel, revert the device to its previous tier's
/// driver (which is still loaded until the switch commits), and
/// set `rtpm.state` back to `Active`. Log via klog!(Warning).
/// The device proceeds through suspend with its original driver.
///
/// # Rationale
///
/// 100 ms is chosen because tier switches typically complete in 50–150 ms
/// (Tier 1 reload latency). Waiting indefinitely would block system
/// suspend on a potentially stuck driver. Reverting preserves system
/// suspend reliability: the original driver can still suspend the device.
fn wait_for_tier_switches(&self) -> Result<(), SuspendError> {
const TIER_SWITCH_TIMEOUT_MS: u64 = 100;
const POLL_INTERVAL_MS: u64 = 1;
for dev in self.device_registry.iter() {
if dev.rtpm.state.load(Ordering::Acquire) == RtpmState::Switching as u32 {
let deadline = monotonic_ms() + TIER_SWITCH_TIMEOUT_MS;
loop {
if dev.rtpm.state.load(Ordering::Acquire) != RtpmState::Switching as u32 {
break; // Tier switch completed.
}
if monotonic_ms() >= deadline {
klog!(Warning,
"suspend: device {} still in Switching after {}ms, aborting tier switch",
dev.name(), TIER_SWITCH_TIMEOUT_MS);
self.abort_tier_switch(&dev)?;
break;
}
sleep_ms(POLL_INTERVAL_MS);
}
}
}
Ok(())
}
/// Resume all runtime-suspended devices to D0 before system suspend.
///
/// Drivers' suspend callbacks assume the device is in D0 (active).
/// A device in D2/D3hot (runtime-suspended) cannot receive a system
/// suspend callback because the driver's saved context corresponds to
/// the active state, not the low-power state.
///
/// # Algorithm
///
/// 1. Enumerate all devices in the device registry.
/// 2. For each device whose `rtpm.state == RtpmState::Suspended`:
/// a. Call `rtpm_get_sync(dev)` to resume the device to D0.
/// b. If resume fails, log a warning and mark the device as
/// `SuspendError::DeviceResumeFailed`. The system suspend
/// continues — a single device failure does not abort suspend
/// (unless it is the root storage device).
/// 3. For devices in `RtpmState::Active` or `RtpmState::Disabled`: no action.
/// 4. For devices in `RtpmState::Switching`: handled separately by
/// `wait_for_tier_switches()`.
///
/// # Ordering
///
/// `reconcile_runtime_pm()` runs BEFORE `wait_for_tier_switches()` and
/// BEFORE `suspend_devices()`. The ordering is:
/// `reconcile_runtime_pm` → `wait_for_tier_switches` → `suspend_devices`.
/// This ensures all devices are in D0 before tier-switch completion is
/// checked and before driver suspend callbacks are invoked.
fn reconcile_runtime_pm(&self) -> Result<(), SuspendError> {
for dev in self.device_registry.iter() {
let state = dev.rtpm.state.load(Ordering::Acquire);
if state == RtpmState::Suspended as u32 {
if let Err(e) = rtpm_get_sync(&dev) {
klog!(Warning,
"suspend: failed to resume runtime-suspended device {}: {:?}",
dev.name(), e);
// Non-fatal for most devices. Fatal only for root storage.
if dev.is_root_storage() {
return Err(SuspendError::DeviceResumeFailed(dev.id()));
}
}
}
}
Ok(())
}
}
Device suspend ordering vs topology sort: The "reverse dependency order" for suspend
and "dependency order" for resume are both computed as a topological sort of the device
tree maintained by the device registry (Section 11.4). The
device_order field on each DeviceNode is NOT used for suspend ordering — it records
the probe (enumeration) order for debugging. Suspend order is computed dynamically from
the parent-child edges in the device tree at suspend time, because the topology may have
changed since boot (hotplug, tier switches). The topological sort runs once per suspend
cycle and is cached for the corresponding resume; if a device is hot-plugged during suspend_prepare(), the order is recomputed before the device suspend phase proceeds.
S0ix provider-client power edges: S0ix (Modern Standby) adds platform-level power
domain constraints beyond the device tree parent-child edges. The SoC's power management
controller defines provider-client relationships between power domains (e.g., the PCH
provides power to all PCIe root ports; the display controller's power domain depends on
the GPU domain for scanout). These edges are discovered from ACPI _PR0/_PR3 power
resource lists at boot and added to the device registry's dependency graph. During S0ix
entry, enter_s0ix() uses the same topological sort but includes power-domain edges in
addition to device-tree edges, ensuring that a power provider is suspended only after
all its clients have entered low-power state.
7.5.1.3 Driver Suspend/Resume Callbacks¶
Every driver (Tier 1 and Tier 2) must implement suspend/resume:
// umka-driver-sdk/src/suspend.rs
/// Driver suspend/resume trait.
pub trait SuspendResume {
/// Suspend the device to the given sleep state.
///
/// # Contract
/// - Flush all pending I/O to the device.
/// - Disable interrupts (deregister interrupt handler or mask at device level).
/// - Power down the device (write to PCI PM registers, or device-specific power control).
/// - Save any device state that cannot be reconstructed (e.g., firmware upload not repeatable).
///
/// # Timeout
/// If suspend does not complete within the tier-specific timeout (Tier 0/1: 2s,
/// Tier 2: 5s), the kernel may force-kill the driver (Tier 2) or mark it as
/// failed (Tier 0/1).
fn suspend(&self, target: SleepState) -> Result<(), SuspendError>;
/// Resume the device from the given sleep state.
///
/// # Contract
/// - Restore device state (e.g., re-upload firmware, reconfigure registers).
/// - Re-enable interrupts.
/// - Re-establish any connections (WiFi: reconnect to AP, NVMe: reinitialize controller).
///
/// # Failure handling
/// If resume fails, return Err. The kernel will attempt recovery (Section 7.2.10.6).
fn resume(&self, from: SleepState) -> Result<(), SuspendError>;
}
7.5.1.4 Device Suspend Ordering¶
Devices must suspend in reverse dependency order and resume in dependency order:
Suspend order (leaves first, roots last):
1. pci0000:00:02.0 (display controller — depends on the GPU for framebuffer scanout, so it suspends before the GPU)
2. pci0000:01:00.0 (GPU — power provider for the display controller)
3. vda / sda (block device layer — depends on nvme0n1 for I/O)
4. nvme0n1 (NVMe namespace — suspended only after its block-layer dependents)
5. wlan0 / eth0 (NIC — no dependents)
6. platform:rtc0 (RTC — leaf device, no dependents)
Resume order (roots first, leaves last):
1. platform:rtc0
2. wlan0 / eth0
3. nvme0n1
4. vda / sda
5. pci0000:01:00.0
6. pci0000:00:02.0
The device registry (Section 11.4) tracks dependencies. Before suspend, the registry computes a topological sort of the device tree and calls suspend callbacks in that order.
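The registry's sort can be sketched as Kahn's algorithm emitting dependents before their providers — the suspend order above (resume is the reverse). Here `children[i]` lists the devices that depend on device `i`; all names are illustrative, not the registry's actual API:

```rust
use std::collections::VecDeque;

/// Leaves-first topological sort over provider→dependent edges.
/// A device is emitted only once all of its dependents have been emitted.
fn suspend_order(children: &[Vec<usize>]) -> Vec<usize> {
    let n = children.len();
    // out_deg[i] = number of not-yet-emitted dependents of device i.
    let mut out_deg: Vec<usize> = children.iter().map(|kids| kids.len()).collect();
    // Reverse index: parents[c] = providers that device c depends on.
    let mut parents: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (provider, kids) in children.iter().enumerate() {
        for &c in kids {
            parents[c].push(provider);
        }
    }
    // Start from devices with no remaining dependents (safe to suspend now).
    let mut queue: VecDeque<usize> = (0..n).filter(|&i| out_deg[i] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(dev) = queue.pop_front() {
        order.push(dev); // all of dev's dependents are already in `order`
        for &p in &parents[dev] {
            out_deg[p] -= 1;
            if out_deg[p] == 0 {
                queue.push_back(p);
            }
        }
    }
    order
}
```

For the resume pass, the cached result is simply iterated in reverse.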
7.5.1.5 Tier 2 Driver Suspend¶
Tier 2 drivers are separate processes. Suspending them requires IPC:
impl SuspendManager {
/// Suspend a Tier 2 driver (send message via ring buffer).
fn suspend_tier2_driver(&self, driver_pid: Pid, state: SleepState) -> Result<(), SuspendError> {
// Send DRIVER_SUSPEND message to the driver's control ring (Section 11.6.2).
let msg = DriverControlMessage::Suspend { state };
self.driver_control_ring(driver_pid).push(msg)?; // per-driver control ring lookup
// Wait for response (DRIVER_SUSPEND_ACK) with tier-specific timeout
// (Tier 2: 5s — longer because userspace drivers may need to flush
// async I/O and signal completion over the control ring).
match self.wait_for_response(driver_pid, Duration::from_secs(5)) {
Ok(DriverControlMessage::SuspendAck) => Ok(()),
Err(TimeoutError) => {
// Driver did not respond: force terminate.
process::kill(driver_pid, Signal::SIGKILL)?;
// Mark device as unavailable until resume attempts to restart the driver.
self.device_registry.mark_unavailable(driver_pid)?;
Ok(()) // Continue suspend, device is orphaned but system suspends
}
Err(e) => Err(SuspendError::DriverFailed(e)),
}
}
}
7.5.1.6 Resume Failure Recovery¶
If a driver fails to resume:
Tier 1 driver failure:
1. Attempt Function Level Reset (FLR) via PCI config space (PCI_EXP_DEVCTL_BCR_FLR).
2. If FLR succeeds, reload the driver module (call probe() again).
3. If FLR fails or reload fails, mark the device unavailable and continue the resume. Log the error to the console and /var/log/kernel.log.
Tier 2 driver failure:
1. Kill the driver process.
2. Restart the driver process (spawn a new process, re-initialize ring buffers).
3. If restart succeeds, the device resumes normal operation (~10-50ms total recovery).
4. If restart fails 3 times, mark the device unavailable and continue the resume.
Critical device failure (NVMe root filesystem, display controller on the only display):
- If the root NVMe fails to resume, resume fails and the system must reboot (no recovery is possible without the root filesystem).
- If the display fails to resume, the system continues to run but shows a VT panic message (via the Tier 0 VGA fallback).
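The Tier 2 restart policy (up to three attempts, then give up) is simple enough to sketch in isolation; `restart` is a hypothetical stand-in for the real spawn-and-ring-reinit step:

```rust
/// Retry a fallible Tier 2 driver restart up to 3 times, per the policy above.
/// Returns true if the device resumed normal operation, false if it should be
/// marked unavailable and the resume should continue without it.
fn recover_tier2_driver<F: FnMut() -> Result<(), ()>>(mut restart: F) -> bool {
    for _attempt in 0..3 {
        if restart().is_ok() {
            return true; // device resumes normal operation
        }
    }
    false // mark device unavailable, continue resume
}
```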
7.5.1.7 S0ix Modern Standby¶
S0ix is not a true suspend state — the OS remains "running", but the CPU enters deep C-states (C10 on Intel, CC6 on AMD) where cores are powered off but the SoC stays alive.
Differences from S3:
- CPU does not power off: the scheduler still runs and interrupts still fire, but all tasks are idle (blocked on I/O or sleeping).
- WiFi stays live: the driver keeps the radio in D3hot (low power but still connected to the AP), waking on packet arrival.
- Display off: the panel enters DPMS Off (Section 21.5) with the backlight off, but the display controller stays powered.
- Device D3hot, not D3cold: devices enter D3hot (low power, quick wake) instead of D3cold (unpowered).
Enter S0ix:
impl SuspendManager {
pub fn enter_s0ix(&self) -> Result<(), SuspendError> {
// 1. Suspend devices in leaf-to-root topological order.
//
// The device registry ([Section 11.4](11-drivers.md#device-registry-and-bus-management)) maintains
// the device tree with parent-child relationships. enter_s0ix() walks the
// tree from leaves to root, suspending children before parents. This ensures
// that a child device's DMA and interrupts are quiesced before its parent
// power domain enters D3hot.
//
// The ordering is the same reverse-dependency topological sort used by S3
// suspend (see "Device Suspend Ordering" above). For S0ix, the target power
// state is D3hot (not D3cold) because devices must remain wake-capable.
//
// Each device transitions through runtime PM: rtpm_put_sync() decrements
// the usage count and immediately suspends. Devices that support wake
// (PCI PME, GPIO wake) configure their wake source before entering D3hot.
// Devices that do not support wake from D3hot remain in D0 (idle) — the
// driver's RuntimePmOps.idle callback returns NoAction for such devices.
let topo_order = self.device_registry.topological_sort_leaves_first();
for dev in &topo_order {
if dev.supports_runtime_pm() {
// Suspend via runtime PM path — this calls the driver's
// RuntimePmOps.suspend, which programs PCI PMCSR or ACPI _PS3.
rtpm_put_sync(dev);
} else {
// Legacy devices without runtime PM: direct power state write.
dev.set_power_state(PowerState::D3Hot)?;
}
}
// 2. Set CPU P-state to minimum frequency.
self.cpu_freq_governor.set_min_freq()?;
// 3. Program CPU package C-state limit to C10 (deepest).
self.cpu_cstate_governor.set_max_cstate(CState::C10)?;
// 4. Idle all CPUs (all tasks blocked or sleeping).
// Scheduler tick timer set to 1 Hz (extremely long idle periods).
self.scheduler.set_idle_mode(true)?;
// System now in S0ix. CPUs enter C10, wake on interrupt (timer, GPIO, PCIe PME).
Ok(())
}
pub fn exit_s0ix(&self) -> Result<(), SuspendError> {
// Resume in root-to-leaf order (reverse of enter_s0ix).
// Parent power domains must be active before children resume.
self.scheduler.set_idle_mode(false)?;
self.cpu_cstate_governor.set_max_cstate(CState::default())?;
self.cpu_freq_governor.restore_governor()?;
let topo_order = self.device_registry.topological_sort_roots_first();
for dev in &topo_order {
if dev.supports_runtime_pm() {
rtpm_get(dev)?;
} else {
dev.set_power_state(PowerState::D0Active)?;
}
}
Ok(())
}
}
Exit S0ix: Any interrupt (lid open, network packet, RTC alarm, USB device activity) wakes the CPU from C10 back to C0, resume is instant (~1-5ms).
7.5.2 Integration Points¶
The following table maps each Section 7.4 mechanism to its consumers in other sections:
| Mechanism | Defined in | Consumed by |
|---|---|---|
| `RaplInterface::set_power_limit()` | Section 7.4 | Section 7.4 consumer power profiles, Section 7.4 VM power budgets, Section 7.4 DCMI enforcement, thermal passive cooling (Section 7.4 RaplCooler) |
| `PowerDomainRegistry` | Section 7.4 | powercap sysfs (Section 7.4), cgroup power accounting (Section 7.4), VM admission control (Section 7.4) |
| `ThermalZone` trip points | Section 7.4–3.2 | Scheduler passive cooling (Section 7.1): Passive trips reduce cpufreq max; hwmon fan control (Section 13.13): Active trips actuate FanCooler |
| `TripType::Critical` handler | Section 7.4 | kernel_power_off() — no other dependencies |
| cgroup `power.energy_uj` | Section 7.11 | Billing/monitoring userspace agents, umka-kvm per-VM accounting (Section 7.4), Section 7.5 per-process attribution |
| cgroup `power.limit_uw` | Section 7.11–5.4 | umka-kvm VmPowerBudget (sets this file on VM creation), Section 7.4 power profiles (sets this on cgroup creation) |
| `VmPowerBudget` struct | Section 7.4 | umka-kvm/src/power.rs; interacts with Section 7.6 CpuBandwidth::reduce_quota() |
| `DcmiPowerCap::on_cap_set()` | Section 7.4 | BMC-driven cap propagation to RAPL (Section 7.4) and umka-kvm (Section 7.4) |
| SMBus battery registers | Section 7.4 | Battery driver, sysfs/umkafs (Section 7.4), Section 7.4 AC/battery auto-switch |
| `PowerProfile` enum | Section 7.4 | Section 7.4 profile translation, userspace power managers |
| `SuspendManager::suspend()` | Section 7.5 | System suspend/resume, Section 7.5 Tier 2 driver IPC |
| `SuspendResume` trait | Section 7.5 | Tier 1/Tier 2 driver implementations |
7.5.3 Per-Device Runtime Power Management¶
Sections 7.4 and 7.5.1–7.5.2 specify system-level power management: package-level RAPL limits, thermal trip points, cgroup power accounting, rack-level DCMI control, battery monitoring, consumer power profiles, suspend/resume protocols, and integration points. Per-device runtime PM addresses a different concern: controlling the power state of individual peripheral devices when they are idle.
An NIC that has not transmitted or received a packet in 10 seconds should be able to power-gate its PHY. A USB host controller with no active transfers should enter D3cold. A GPU idle between rendering jobs should clock-gate its shader cores. Without per-device runtime PM, these devices stay fully powered regardless of utilization, wasting energy and creating unnecessary thermal load.
Cross-references: workqueue for async suspend/resume (Section 3.11), clock gating (Section 2.24), regulator framework (Section 13.27), PCIe ASPM (Section 11.5), ACPI device power states (Section 2.4).
7.5.3.1 Design: Two-Counter State Machine¶
Linux pm_runtime uses three separate state variables (rpm_status, rpm_enabled,
power.disable_depth) plus multiple lock types. This has documented races and requires
careful lock ordering across different subsystems.
UmkaOS approach: Two atomic counters + one atomic state enum. All state transitions
are serialized through a single ordered work item dispatched to the pm-async workqueue
(Section 3.1.11). This eliminates lock ordering issues: the state machine has exactly
one concurrent writer at any time.
/// Runtime PM state of a device. Transitions are always serialized through pm-async.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RtpmState {
/// Device is fully powered and ready for I/O.
Active,
/// Suspend callback is in progress. Device is transitioning to low-power state.
Suspending,
/// Device is in low-power state. DMA is stopped; clock gates may be closed;
/// voltage may be reduced (depending on device-specific suspend callback).
Suspended,
/// Resume callback is in progress. Device is returning to Active state.
Resuming,
/// Runtime PM is disabled for this device. Device stays in Active state
/// regardless of usage count. Use during device reset or power-critical operations.
Disabled,
/// Device is undergoing a driver tier switch (live evolution). The old driver's
/// domain is being torn down and the replacement driver is being loaded. In this
/// state: (a) new `rtpm_get()` calls return `EBUSY`, (b) stale autosuspend timers
/// from the old driver are silently discarded (the timer callback checks for
/// `Switching` and no-ops), (c) the replacement driver's `init()` transitions
/// the device back to `Active` via `rtpm_get()` after establishing its domain.
/// Tier switching is an operational change (not a crash). The tier-switch protocol
/// is defined in [Section 11.1](11-drivers.md#three-tier-protection-model--tier-mobility). Crash recovery
/// ([Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)) is a separate mechanism for
/// post-crash driver reload.
Switching,
}
/// Runtime PM state embedded in every DeviceNode.
/// All fields are accessed only through the rtpm_* API; never directly.
pub struct RuntimePm {
/// Current state. Encoded as u32 for atomic access; cast to RtpmState on read.
pub state: AtomicU32,
/// Usage count. Must be > 0 for the device to stay Active.
/// Incremented by rtpm_get(); decremented by rtpm_put().
/// When it reaches 0: the autosuspend timer is started.
/// When it is incremented from 0 while Suspended: resume is triggered.
pub active_count: AtomicI32,
/// Time in nanoseconds to wait after active_count reaches 0 before
/// triggering automatic suspend.
/// 0 = suspend immediately when active_count reaches 0.
/// u64::MAX = never autosuspend (device must be manually suspended).
pub autosuspend_delay_ns: AtomicU64,
/// Monotonic timestamp (from clock_monotonic_ns()) when active_count
/// last transitioned to 0. Used by the autosuspend timer to check if
/// the device has been re-acquired since the timer was set.
pub idle_since_ns: AtomicU64,
/// Parent device for power sequencing.
/// If set: when this device enters Active, parent.rtpm_get() is called.
/// When this device enters Suspended, parent.rtpm_put() is called.
/// This ensures a power domain stays active while any child device is active.
pub parent: Option<Weak<DeviceNode>>,
/// Driver-supplied power state callbacks. If None: device has no runtime PM.
pub ops: Option<&'static RuntimePmOps>,
}
/// Driver-supplied callbacks for runtime PM transitions.
/// All callbacks run in the `pm-async` workqueue context:
/// preemptible, may sleep, may allocate with GFP_KERNEL.
pub struct RuntimePmOps {
/// Prepare the device for low-power state.
///
/// The driver must:
/// - Quiesce all outstanding DMA (wait for completion or cancel)
/// - Stop Rx (disable device interrupts, if safe)
/// - Save any state that must be restored on resume
/// - Gate device clocks if managed by the driver (not the clock framework)
///
/// Must complete within RTPM_SUSPEND_TIMEOUT_MS. If it does not: the kernel
/// logs KERN_WARNING and transitions the device back to Active.
pub suspend: fn(dev: &DeviceNode) -> Result<(), KernelError>,
/// Restore the device from low-power state.
///
/// The driver must restore all state saved in `suspend`, re-enable Rx,
/// and confirm the device is ready for I/O before returning.
pub resume: fn(dev: &DeviceNode) -> Result<(), KernelError>,
/// Optional: called when active_count first reaches 0.
///
/// The driver can inspect device state and return:
/// - `Idle`: proceed with the normal autosuspend path.
/// - `NoAction`: driver is not ready to suspend (has pending work, etc.).
/// The autosuspend timer will not be started; the driver calls rtpm_put()
/// again later to re-trigger the idle path.
pub idle: Option<fn(dev: &DeviceNode) -> RtpmIdleAction>,
}
#[repr(u8)]
pub enum RtpmIdleAction {
Idle = 0, // Proceed to autosuspend
NoAction = 1, // Driver not ready; skip autosuspend
}
/// Maximum time the suspend callback may take before the kernel gives up.
pub const RTPM_SUSPEND_TIMEOUT_MS: u64 = 5000;
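The `idle` callback contract above can be illustrated with a minimal, hypothetical driver. This is a sketch only: the `DeviceNode` stand-in and its `pending_work` field are illustrative placeholders, not part of the spec's real `DeviceNode`.

```rust
// Illustrative stand-ins so the sketch compiles outside the kernel.
pub struct DeviceNode {
    /// Hypothetical flag: the driver still has queued work for the device.
    pub pending_work: bool,
}

#[derive(Debug, PartialEq)]
pub enum RtpmIdleAction {
    Idle,     // Proceed to autosuspend
    NoAction, // Driver not ready; skip autosuspend
}

/// Sketch of a driver's idle callback following the contract above:
/// return NoAction while work is pending so the autosuspend timer is
/// not armed; the driver re-triggers the idle path with a later
/// rtpm_put() once the work drains.
pub fn sample_idle(dev: &DeviceNode) -> RtpmIdleAction {
    if dev.pending_work {
        RtpmIdleAction::NoAction
    } else {
        RtpmIdleAction::Idle
    }
}
```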
RtpmState to ACPI device power state mapping (ACPI 6.5 Section 7.2):
The runtime PM subsystem does not directly program ACPI Dx states — the driver's
RuntimePmOps.suspend callback is responsible for writing the appropriate PCI PM
register or ACPI _PSx method. This mapping defines the expected correspondence so
that drivers and the kernel agree on the semantics of each RtpmState:
| RtpmState | ACPI Dx | Description |
|---|---|---|
| Active | D0 | Device fully powered, clocks running, ready for I/O. |
| Suspending | D0 (transitional) | Device still in D0 while the driver quiesces DMA and saves state. No ACPI state change yet. |
| Suspended | D2 or D3hot (device-dependent) | Low-power state. D2: device retains context, reduced power. D3hot: device loses most context but PCI config space is accessible, PME# can wake. The driver's suspend callback selects D2 vs D3hot based on device capabilities (PCI_PM_CAP_PME_D3hot, wake requirement). |
| Resuming | D0 (transitional) | Driver has initiated the transition back to D0. _PS0 method or PCI PMCSR write in progress. |
| Disabled | D0 | Runtime PM is disabled; device stays in D0 regardless of usage count. |
| Switching | D0 | Device is undergoing a driver tier switch. It stays powered (D0), but new runtime PM operations are rejected until the replacement driver completes initialization. |
D3cold: Not reachable via runtime PM alone. D3cold (auxiliary power removed, device
completely off) requires platform-level power rail control (e.g., ACPI _PR3 power
resource, GPIO-controlled regulator). D3cold is used only during system suspend (S3/S4)
or when the platform power manager explicitly removes power. The transition path is:
RtpmState::Suspended (D3hot) then platform removes power rail to reach D3cold.
Re-entering D0 from D3cold requires a full device reset (PCI FLR or bus re-enumeration).
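The D2-vs-D3hot choice described in the mapping table can be sketched as a pure decision function. The function name and parameters below are illustrative (stand-ins for the PCI PM capability bits such as PCI_PM_CAP_PME_D3hot), not a spec API:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum TargetDxState {
    D2,
    D3Hot,
}

/// Sketch: pick the deepest runtime-reachable D-state, falling back to D2
/// when the device must wake the system but cannot signal PME# from D3hot.
/// D3cold is intentionally absent: it is not reachable via runtime PM alone.
pub fn select_low_power_state(
    supports_d2: bool,
    pme_from_d3hot: bool, // PCI_PM_CAP_PME_D3hot capability bit
    wake_required: bool,
) -> TargetDxState {
    if wake_required && !pme_from_d3hot && supports_d2 {
        // Must retain wake capability; D2 keeps context and PME#.
        TargetDxState::D2
    } else {
        // Deepest state per the mapping table above.
        TargetDxState::D3Hot
    }
}
```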
7.5.3.2 API¶
/// Increment usage count. If device is Suspended, triggers resume and blocks
/// until the device is Active (resume callback has completed successfully).
///
/// May sleep; must not be called from atomic context or IRQ handlers.
/// Use rtpm_get_async() from non-sleeping contexts.
pub fn rtpm_get(dev: &DeviceNode) -> Result<(), KernelError>;
/// Async variant: enqueues a resume work item in pm-async and returns immediately.
/// The caller must check rtpm_is_active() (or fall back to the blocking rtpm_get())
/// before performing I/O.
pub fn rtpm_get_async(dev: &DeviceNode) -> Result<WorkHandle, KernelError>;
/// Decrement usage count.
///
/// If count reaches 0:
/// - If ops.idle is set: calls it. If it returns NoAction, stops here.
/// - Else: starts the autosuspend timer (autosuspend_delay_ns from now).
/// Does NOT block; returns immediately.
pub fn rtpm_put(dev: &DeviceNode);
/// Decrement usage count and immediately enqueue a suspend work item,
/// bypassing the autosuspend delay. Blocks until the device is Suspended.
pub fn rtpm_put_sync(dev: &DeviceNode);
/// Disable runtime PM: transitions to Disabled state if currently Active.
/// Device will not be suspended while Disabled.
pub fn rtpm_disable(dev: &DeviceNode);
/// Re-enable runtime PM: transitions from Disabled to Active.
pub fn rtpm_enable(dev: &DeviceNode);
/// Returns true iff the device is currently in Active state.
/// Non-blocking; returns immediately.
pub fn rtpm_is_active(dev: &DeviceNode) -> bool;
/// Set the autosuspend delay.
///
/// - 0: suspend immediately when active_count reaches 0
/// - u64::MAX: never autosuspend (only manual rtpm_put_sync will suspend)
pub fn rtpm_set_autosuspend_delay(dev: &DeviceNode, delay_ns: u64);
/// Mark the device as active WITHOUT incrementing the usage count.
///
/// This resets the autosuspend timer by updating `idle_since_ns` to
/// the current monotonic time. Called by the KABI vtable trampoline
/// ([Section 11.6](11-drivers.md#device-services-and-boot--service-vtable-trampoline-mechanism))
/// on every KABI call that targets this device — it indicates the
/// device is in active use without creating a matching `rtpm_put()`
/// obligation.
///
/// **Cost**: one `Ordering::Relaxed` atomic store to a per-device
/// cache line (`idle_since_ns`). No cross-core invalidation traffic
/// because each device's `RuntimePm` struct is on its own cache line.
///
/// Does NOT trigger resume if the device is Suspended. The KABI
/// trampoline calls `rtpm_get()` separately when dispatching to a
/// suspended device — `rtpm_mark_active()` is an optimization for
/// the common case where the device is already Active and only the
/// autosuspend timer needs resetting.
#[inline]
pub fn rtpm_mark_active(dev: &DeviceNode) {
dev.rtpm.idle_since_ns.store(
clock_monotonic_ns(),
Ordering::Relaxed,
);
}
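The get/put usage-count discipline above can be modeled in miniature. `UsageCount` is an illustrative stand-in for the `active_count` field only; the real `rtpm_get()`/`rtpm_put()` additionally trigger resume, the idle callback, and the autosuspend timer:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Minimal model of the runtime PM usage count (sketch, not the spec type).
pub struct UsageCount(AtomicU32);

impl UsageCount {
    pub fn new() -> Self {
        UsageCount(AtomicU32::new(0))
    }
    /// Models rtpm_get(): increment. A 0 -> 1 transition would trigger resume.
    pub fn get(&self) -> u32 {
        self.0.fetch_add(1, Ordering::AcqRel) + 1
    }
    /// Models rtpm_put(): decrement. A 1 -> 0 transition would start the
    /// autosuspend timer (or call ops.idle if set).
    pub fn put(&self) -> u32 {
        self.0.fetch_sub(1, Ordering::AcqRel) - 1
    }
    pub fn active(&self) -> bool {
        self.0.load(Ordering::Acquire) > 0
    }
}
```

A driver brackets every I/O burst with `get()` before touching the hardware and a matching `put()` afterwards; the device can only autosuspend when every outstanding `get()` has been balanced.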
7.5.3.3 State Machine¶
All state transitions are serialized: only one transition can be in progress at a time,
because they are dispatched as ordered work items to the pm-async workqueue.
State transitions:
Active ──(active_count → 0 AND idle check passes AND delay expires)──► Suspending
Suspending ──(suspend callback returns Ok)──────────────────────────► Suspended
Suspending ──(suspend callback returns Err OR timeout)──────────────► Active
[on error: KERN_WARNING logged; retry after RTPM_RETRY_DELAY_MS = 100 ms]
Suspended ──(rtpm_get() or rtpm_get_async() called)─────────────────► Resuming
Resuming ──(resume callback returns Ok)─────────────────────────────► Active
Resuming ──(resume callback returns Err)────────────────────────────► Active
[on error: KERN_ERR logged; device assumed broken; I/O will fail]
Any state ──(rtpm_disable() called)─────────────────────────────────► Disabled
Disabled ──(rtpm_enable() called)───────────────────────────────────► Active
Autosuspend timer:
Implemented using the UmkaOS timer framework (Section 7.5).
Armed at: idle_since_ns + autosuspend_delay_ns (monotonic time).
On expiry: check if active_count == 0 AND state == Active.
- If yes: enqueue suspend work item in pm-async.
- If no (device re-acquired since timer was armed): no-op; timer is self-canceling.
Timer handler runs in IRQ context and only enqueues work; it does NOT directly
call the suspend callback (which may sleep).
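The expiry check can be expressed as a small predicate. This is a sketch of the logic described above (the function name and flattened parameters are illustrative): it returns whether a suspend work item should be enqueued, and the re-acquisition test relies on `idle_since_ns` having been rewritten since the timer was armed.

```rust
/// Sketch: should the autosuspend timer handler enqueue suspend work?
/// Runs in IRQ context, so it only decides — it never calls the
/// (potentially sleeping) suspend callback itself.
pub fn autosuspend_should_fire(
    active_count: u32,
    state_is_active: bool,
    armed_at_idle_since_ns: u64,   // idle_since_ns captured when armed
    current_idle_since_ns: u64,    // idle_since_ns now
) -> bool {
    // Device re-acquired (or rtpm_mark_active() called) since arming:
    // idle_since_ns moved, so the timer self-cancels as a no-op.
    if current_idle_since_ns != armed_at_idle_since_ns {
        return false;
    }
    active_count == 0 && state_is_active
}
```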
Parent-child power sequencing:
When child device enters Active state:
→ rtpm_get(child.parent) is called automatically by the runtime PM core.
→ This ensures the parent power domain is Active before the child uses it.
When child device enters Suspended state:
→ rtpm_put(child.parent) is called automatically.
→ If parent's active_count reaches 0, parent may also suspend.
Circular dependency detection:
Performed at DeviceNode registration time using a BFS cycle check on the
parent chain. A circular dependency causes a kernel panic with the cycle
description logged. Circular power dependencies cannot be correctly handled
and indicate a driver or DT error.
Maximum parent chain depth: 16 devices. Deeper chains are rejected at registration.
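Because each device has at most one parent, the registration-time check reduces to walking the parent chain with a visited set and a depth bound. The sketch below uses an index-based stand-in for the `DeviceNode` parent links (illustrative, not the spec's data model):

```rust
pub const MAX_PARENT_DEPTH: usize = 16;

/// Sketch of the registration-time parent-chain validation described above.
/// `parent_of[i]` is Some(parent index) or None for a root device.
/// Returns the chain depth on success; rejects cycles and over-deep chains.
pub fn validate_parent_chain(
    parent_of: &[Option<usize>],
    dev: usize,
) -> Result<usize, &'static str> {
    let mut seen = vec![false; parent_of.len()];
    let mut depth = 0;
    let mut cur = dev;
    loop {
        if seen[cur] {
            // Would be a kernel panic in the real registration path.
            return Err("circular power dependency");
        }
        seen[cur] = true;
        match parent_of[cur] {
            None => return Ok(depth),
            Some(p) => {
                depth += 1;
                if depth > MAX_PARENT_DEPTH {
                    return Err("parent chain too deep");
                }
                cur = p;
            }
        }
    }
}
```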
7.5.3.4 Linux External ABI¶
The following sysfs files are required for userspace power management tools (udev, tlp, tuned, gnome-power-manager, systemd):
/sys/bus/<bus>/devices/<device>/power/control
Values: "on" | "auto"
"on" = rtpm_disable() (device stays Active)
"auto" = rtpm_enable() (runtime PM active; device may autosuspend)
/sys/bus/<bus>/devices/<device>/power/runtime_status
Values: "active" | "suspended" | "suspending" | "resuming" | "unsupported"
Read-only. "unsupported" if ops == None.
/sys/bus/<bus>/devices/<device>/power/autosuspend_delay_ms
Read-write integer: autosuspend delay in milliseconds.
Writes map to rtpm_set_autosuspend_delay(dev, value_ms * 1_000_000).
-1 means "never autosuspend" (maps to autosuspend_delay_ns = u64::MAX).
/sys/bus/<bus>/devices/<device>/power/runtime_usage
Read-only integer: current active_count value.
/sys/bus/<bus>/devices/<device>/power/runtime_active_time
Read-only u64: cumulative nanoseconds spent in Active state (for energy accounting).
/sys/bus/<bus>/devices/<device>/power/runtime_suspended_time
Read-only u64: cumulative nanoseconds spent in Suspended state.
These files are identical in layout and semantics to Linux's sysfs PM interface, enabling existing tools to work without modification.
7.5.3.5 Tier 2 Runtime PM API¶
Both Tier 1 and Tier 2 drivers invoke rtpm_get() / rtpm_put() via kabi_call!
which resolves to the appropriate transport: ring buffer commands for cross-domain
dispatch (Section 12.6). Tier 0 modules call the API directly
(same domain, direct vtable call). The driver code is the same regardless of tier —
only the transport differs, selected at bind time.
KABI runtime PM commands:
/// Runtime PM commands sent by Tier 2 drivers over the driver control ring.
/// These are KABI ring buffer message types, not syscalls — Tier 2 drivers
/// use the standard ring buffer push/pop protocol to send them.
#[repr(u32)]
pub enum DriverRtpmCommand {
/// Increment the device's runtime PM usage count. If the device is
/// currently Suspended, the kernel triggers a resume and blocks the
/// driver's ring buffer consumer thread until the device reaches Active
/// state. The response message carries the result.
///
/// Equivalent to the in-kernel rtpm_get().
Get = 0x0010_0001,
/// Decrement the device's runtime PM usage count. If the count reaches
/// zero, the autosuspend timer is started. Non-blocking: the response
/// is sent immediately (suspend happens asynchronously).
///
/// Equivalent to the in-kernel rtpm_put().
Put = 0x0010_0002,
/// Set the autosuspend delay for this device.
/// Payload: u64 delay_ns (0 = immediate, u64::MAX = never).
///
/// Equivalent to the in-kernel rtpm_set_autosuspend_delay().
SetAutosuspendDelay = 0x0010_0003,
/// Query the current runtime PM state. Non-blocking.
/// Response payload: RtpmState as u32.
GetStatus = 0x0010_0004,
}
/// Response to a DriverRtpmCommand, sent back to the Tier 2 driver
/// on its completion ring.
pub struct DriverRtpmResponse {
/// The command this response is for.
pub command: DriverRtpmCommand,
/// 0 = success, negative = error (KernelError encoding).
pub result: i32,
/// For GetStatus: the current RtpmState as u32.
/// For other commands: unused (0).
pub state: u32,
}
Kernel-side dispatch: The Tier 2 driver's KABI control ring consumer (running
as a kernel thread) receives DriverRtpmCommand messages and translates them to
the corresponding in-kernel rtpm_*() calls:
/// Handle a runtime PM command from a Tier 2 driver.
/// Called by the KABI control ring consumer in process context (may sleep).
fn handle_tier2_rtpm(
    dev: &DeviceNode,
    cmd: DriverRtpmCommand,
    payload: &[u8],
) -> DriverRtpmResponse {
    match cmd {
        DriverRtpmCommand::Get => {
            let result = rtpm_get(dev);
            DriverRtpmResponse {
                command: cmd,
                result: result.map(|_| 0i32).unwrap_or_else(|e| e.to_errno()),
                state: 0,
            }
        }
        DriverRtpmCommand::Put => {
            rtpm_put(dev);
            DriverRtpmResponse { command: cmd, result: 0, state: 0 }
        }
        DriverRtpmCommand::SetAutosuspendDelay => {
            // Reject malformed (short) payloads instead of panicking.
            let Some(bytes) = payload.get(..8) else {
                return DriverRtpmResponse { command: cmd, result: -22 /* EINVAL */, state: 0 };
            };
            let delay_ns = u64::from_le_bytes(bytes.try_into().unwrap());
            rtpm_set_autosuspend_delay(dev, delay_ns);
            DriverRtpmResponse { command: cmd, result: 0, state: 0 }
        }
        DriverRtpmCommand::GetStatus => {
            let state = dev.rtpm.state.load(Ordering::Acquire);
            DriverRtpmResponse { command: cmd, result: 0, state }
        }
    }
}
Blocking semantics: DriverRtpmCommand::Get blocks the KABI control ring
consumer thread until the device resume completes (matching the in-kernel
rtpm_get() semantics). During this time, other control ring commands from the
same Tier 2 driver are queued. This is acceptable because runtime PM resume is
bounded (RTPM_SUSPEND_TIMEOUT_MS = 5000 ms) and the data-plane ring (used for
actual I/O) is a separate ring that is not blocked. If the Tier 2 driver needs
non-blocking resume, it can send GetStatus after Get to poll for completion
asynchronously (send Get, continue processing data-plane I/O, periodically send
GetStatus until state == Active).
Security: The kernel validates that the Tier 2 driver process holds a
DeviceHandle for the target device. A Tier 2 driver cannot send runtime PM
commands for devices it does not own. The DeviceHandle is bound to the driver's
KABI session at device probe time and cannot be forged.
7.5.4 Cpuidle Governor¶
When a CPU enters the idle task (no runnable work), the cpuidle governor selects the appropriate processor idle state (C-state). The governor decides how deeply the CPU should sleep based on expected idle duration and wake-up latency constraints.
/// Per-CPU cpuidle governor state.
///
/// Implements a menu-based idle state selection algorithm (matching Linux's
/// menu governor concept but with a simpler, more predictable implementation).
/// Updated on every idle entry and exit.
pub struct CpuidleGovernor {
/// Predicted idle duration in microseconds, based on recent history
/// and the next scheduled timer event.
pub predicted_us: u32,
/// Exponential moving average of actual idle durations (microseconds).
/// Updated on every idle exit: avg = avg * 7/8 + actual * 1/8.
pub avg_idle_us: AtomicU32,
/// History ring of recent idle durations (8 entries, microseconds).
/// Used to detect bimodal idle patterns (e.g., alternating short/long
/// idles from periodic interrupt sources).
pub history: [u32; 8],
/// Current index into the history ring (wraps at 8).
pub history_idx: u8,
/// Correction factor for prediction accuracy (fixed-point Q8, range [25, 256]).
/// 256 = 1.0 (perfect prediction), 25 ≈ 0.1 (worst case).
/// Updated after each idle exit based on prediction error.
pub correction_factor: u32,
}
/// A single processor idle state (C-state), discovered at boot.
///
/// Populated from ACPI `_CST` objects (x86), DT `idle-states` node
/// (ARM/RISC-V), or OPAL idle state table (PPC64).
pub struct CpuidleState {
/// C-state index (0 = poll/mwait, higher = deeper sleep).
pub index: u8,
/// Human-readable name (e.g., "C1", "C6S", "POLL").
pub name: ArrayString<16>,
/// Worst-case exit latency in microseconds. The CPU is unavailable
/// for this duration when woken from this state.
pub exit_latency_us: u32,
/// Target residency in microseconds. The minimum time the CPU must
/// remain in this state for the entry/exit overhead to be worthwhile.
pub target_residency_us: u32,
/// Architecture-specific entry function.
/// x86: MWAIT hint value (Cx sub-state). ARM: WFI or PSCI CPU_SUSPEND.
/// RISC-V: WFI or SBI HSM hart_suspend. PPC: nap/sleep/winkle.
pub enter: fn(state: &CpuidleState),
}
Idle state selection algorithm:

1. Compute expected idle duration: `expected_us = min(next_timer_event_us, avg_idle_us * correction_factor / 256)`. The next timer event is obtained from the per-CPU timer wheel; `avg_idle_us` provides a history-based prediction.
2. Select deepest safe C-state: iterate C-states from deepest to shallowest. Select the deepest state whose `exit_latency_us < expected_us * latency_factor / 256`. The `latency_factor` is 128 (i.e., 0.5), meaning a state is eligible only if its exit latency is at most 50% of the expected idle duration. This prevents selecting states where the wake-up cost dominates.
3. Target residency check: additionally require `target_residency_us <= expected_us`. This ensures the CPU stays in the state long enough for the power savings to offset the entry/exit cost.
4. Update correction factor on wake: after exiting idle, compute `ratio = actual_idle_us * 256 / predicted_us`. Update `correction_factor` as an exponential moving average: `correction_factor = correction_factor * 7/8 + ratio * 1/8`, clamped to `[25, 256]` (i.e., `[0.1, 1.0]`). If the prediction was too optimistic (woken early by an interrupt), the factor decreases, causing shallower states to be selected in the future.
5. History ring update: record `actual_idle_us` in the history ring. The ring is checked for bimodal patterns: if the coefficient of variation exceeds 1.5, the governor uses the shorter mode of the distribution as the predicted value, avoiding deep states that will be interrupted.
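The prediction and selection steps can be sketched as two small functions. This is a simplified illustration of the algorithm above (the flattened `IdleState` and the assumption that states are sorted shallow-to-deep are mine, not the spec's `CpuidleState`):

```rust
#[derive(Clone, Copy)]
pub struct IdleState {
    pub exit_latency_us: u32,
    pub target_residency_us: u32,
}

/// Step 1: expected_us = min(next_timer_event_us, avg * correction / 256).
pub fn predict_idle_us(next_timer_us: u32, avg_idle_us: u32, correction_factor: u32) -> u32 {
    next_timer_us.min(((avg_idle_us as u64 * correction_factor as u64) / 256) as u32)
}

/// Steps 2-3: deepest state whose exit latency is under half the expected
/// idle (latency_factor = 128 in Q8) and whose target residency fits.
/// `states` must be ordered shallowest (index 0, poll) to deepest.
pub fn cpuidle_select(states: &[IdleState], expected_us: u32) -> usize {
    const LATENCY_FACTOR: u64 = 128; // Q8 fixed-point: 0.5
    let latency_budget = (expected_us as u64 * LATENCY_FACTOR) / 256;
    for (i, s) in states.iter().enumerate().rev() {
        if (s.exit_latency_us as u64) < latency_budget
            && s.target_residency_us <= expected_us
        {
            return i;
        }
    }
    0 // Nothing deeper qualifies: fall back to the shallowest (poll) state.
}
```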
Per-architecture C-state table population:

| Architecture | Source | Discovery Time |
|---|---|---|
| x86-64 | ACPI `_CST` method or CPUID MWAIT leaf (0x05) | Boot (ACPI parse) |
| AArch64 | DT `idle-states` node (ARM DEN0022D) | Boot (DTB parse) |
| ARMv7 | DT `idle-states` node | Boot (DTB parse) |
| RISC-V | DT `idle-states` node or SBI HSM extension | Boot (DTB parse) |
| PPC64LE | OPAL idle state table (`ibm,opal/power-mgt`) | Boot (OPAL query) |
| PPC32 | Static table (single WFI state on most e500 cores) | Compile time |
| s390x | STSI facility query + diagnose 0x44 idle | Boot (facility query) |
| LoongArch64 | DT `idle-states` node or CSR-based idle (CPUCFG discovery) | Boot (DTB parse / CPUCFG) |
Integration with scheduler idle path: The idle task (Section 7.1.4) calls
cpuidle_select(governor, states) after processing softirqs. The selected state's
enter function executes the platform halt instruction. On interrupt wake, the idle
task updates the governor with the actual idle duration before calling pick_next_task().
Runtime PM interaction with driver tier transitions: When the FMA engine (Section 20.1) demotes a driver from Tier 1 to Tier 2 (or promotes Tier 2 to Tier 1), the runtime PM state must be coordinated:

- Before tier demotion (Tier 1 → Tier 2): call `rtpm_put_sync(dev)` to suspend the device if it is runtime-active. The device enters a known low-power state before the isolation boundary is reconfigured. After the tier switch completes and the driver is reloaded in the new tier, `rtpm_get(dev)` resumes the device.
- Before tier promotion (Tier 2 → Tier 1): the Tier 2 driver process must call `rtpm_put_sync(dev)` before being terminated. The Tier 1 driver's `init()` calls `rtpm_get(dev)` to resume the device in the new isolation domain.
- Autosuspend timer coordination: the autosuspend timer (`RtpmAutoSuspendTimer`) is cancelled before the tier switch and re-armed by the replacement driver's `init()`. Stale timers from the old driver must not fire against the new driver's domain — the timer callback checks `dev.rtpm_state` and no-ops if the device is in `RtpmState::Switching`.
7.6 CPU Bandwidth Guarantees¶
Inspired by: QNX Adaptive Partitioning (concept only). IP status: Built from academic scheduling theory (CBS/EDF, 1998) and Linux cgroup v2 interface. QNX-specific implementation NOT referenced. Term "adaptive partitioning" NOT used.
7.6.1 Problem¶
Section 7.1 defines three scheduler classes: EEVDF (normal), RT (FIFO/RR), and Deadline (EDF/CBS). Cgroups v2 provides resource limits (`cpu.max` caps the ceiling, `cpu.weight` sets relative priority).
What is missing: guaranteed minimum CPU bandwidth under overload. Current mechanisms:
- `cpu.weight` is proportional sharing — if one cgroup has weight 100 and another has weight 900, the first gets 10% of whatever is available. But if the system is fully loaded, "10% of available" might not meet the minimum requirement.
- `cpu.max` is a ceiling, not a floor. It limits the maximum; it does not guarantee a minimum.
- The Deadline scheduler provides guarantees, but only for individual tasks, not for groups.
Use case: A server runs a database (needs guaranteed 40% CPU), a web frontend (needs guaranteed 20%), and batch jobs (uses the rest). Under overload, the batch jobs must not be able to starve the database below 40%, even if the batch jobs are numerous.
7.6.2 Design: CBS-Based Group Bandwidth Reservation¶
The solution combines the existing Deadline scheduler's Constant Bandwidth Server (CBS) algorithm with cgroup v2's group hierarchy.
CBS (Abeni & Buttazzo, 1998) provides bandwidth isolation: each server (task or group) is assigned a bandwidth Q/P (Q microseconds of CPU time every P microseconds). The server is guaranteed this bandwidth regardless of other servers' behavior. Unused bandwidth is redistributed (work-conserving).
This is a well-established academic algorithm with no IP encumbrance.
7.6.3 Cgroup v2 Interface¶
A new control file, `cpu.guarantee`, is added to the cpu controller, additive to the existing interface.
Format: `$QUOTA $PERIOD` (microseconds), identical in syntax to `cpu.max`. Typical `cpu.guarantee` periods are longer (1 second) than typical `cpu.max` periods (100 ms) because a guarantee is a coarser-grained bandwidth allocation — CBS replenishment overhead is amortized over longer periods.
Example:
# Database cgroup: guaranteed 40% CPU bandwidth
echo "400000 1000000" > /sys/fs/cgroup/database/cpu.guarantee
# = 400ms of CPU time every 1000ms = 40% guaranteed

# Web frontend: guaranteed 20%
echo "200000 1000000" > /sys/fs/cgroup/web/cpu.guarantee

# Batch jobs: no guarantee (uses whatever is left)
# cpu.guarantee defaults to "max" (no guarantee)

# Total guaranteed: 60%. Remaining 40% is shared by weight among all.
Semantics:
- A group with `cpu.guarantee` set is backed by a CBS server at the specified bandwidth.
- The CBS server ensures the group receives at least its guaranteed bandwidth even under full system load.
- When the group is idle, its unused bandwidth is redistributed to other groups (work-conserving). This is inherent to CBS — no special logic needed.
- `cpu.guarantee` cannot exceed `cpu.max` (if both are set, guarantee <= max).
- The sum of all `cpu.guarantee` values across the system must not exceed total CPU bandwidth. Attempting to overcommit returns `-ENOSPC`.
- Nested cgroups: a child's guarantee comes out of its parent's guarantee budget.
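The overcommit rule can be sketched as an admission check. The function below is illustrative (the in-kernel path is `cpu_guarantee_write`; the parts-per-million encoding and the 95% cap default are assumptions drawn from the admission-control description later in this section):

```rust
/// Sketch: admit a new cpu.guarantee only if the system-wide sum of
/// quota/period ratios stays under the utilization cap. Ratios are in
/// parts-per-million to keep the arithmetic in integers.
pub fn admit_guarantee(
    existing_sum_ppm: u64,    // sum of quota/period over all current groups
    new_quota_us: u64,
    new_period_us: u64,
    nr_cpus: u64,
    utilization_cap_ppm: u64, // e.g. 950_000 = 95% per CPU
) -> Result<u64, &'static str> {
    if new_period_us == 0 {
        return Err("invalid period");
    }
    let new_ppm = new_quota_us * 1_000_000 / new_period_us;
    let total = existing_sum_ppm + new_ppm;
    // Guarantees are system-wide, so the budget scales with CPU count.
    if total > nr_cpus * utilization_cap_ppm {
        return Err("ENOSPC: guaranteed bandwidth overcommitted");
    }
    Ok(total)
}
```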
Multi-core accounting: cpu.guarantee specifies system-wide bandwidth, not
per-CPU. The implementation uses a global budget pool with per-CPU runtime slices (same
pattern as Linux EEVDF bandwidth throttling for cpu.max). A CBS server with a 40%
guarantee on a 4-CPU system gets a global budget of 400ms per 1000ms period. This budget
is drawn down as tasks in the group run on any CPU. When exhausted, all tasks in the
group are throttled until the next period. Per-NUMA-node variants are tracked in
BandwidthAccounting.per_node_guaranteed for NUMA-aware scheduling hints, but the
guarantee is enforced globally.
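The per-CPU slice drawn from the global pool follows the proportional-share formula used at replenishment (`local_share = quota * local_weight / total_weight`, per the `CbsGroupConfig` comments below). A minimal sketch, with an illustrative helper name:

```rust
/// Sketch: a CPU's CBS server receives local_weight / total_weight of the
/// cgroup's global budget at each replenishment.
pub fn local_share_ns(quota_ns: u64, local_weight: u64, total_weight: u64) -> u64 {
    if total_weight == 0 {
        return 0; // No runnable tasks anywhere: nothing to replenish.
    }
    // Widen to 128 bits so large quotas * weights cannot overflow.
    ((quota_ns as u128 * local_weight as u128) / total_weight as u128) as u64
}
```

For the 40%-on-4-CPUs example: a CPU holding a quarter of the runnable weight draws a quarter of the 400 ms global budget each 1000 ms period.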
RT and DL task handling: RT and DL tasks within a CBS-guaranteed cgroup bypass the
CBS server's budget accounting. Their guarantees come from the RT/DL schedulers directly.
The CBS guarantee applies only to EEVDF-class (normal) tasks within the group. This matches Linux's
behavior where cpu.max throttling does not apply to RT tasks.
7.6.4 Kernel-Internal Design¶
Evolvable component classification (Section 13.18):

| Component | Classification | Rationale |
|---|---|---|
| `CbsGroupConfig` data structure | Nucleus (data) | Struct layout must survive policy swap; budget fields are data |
| `CbsCpuServer` data structure | Nucleus (data) | Per-CPU server state (budget, deadline) layout must survive replacement |
| `CbsCpuServer.tree` (`VruntimeTree`) | Nucleus (data) | Shared tree + accumulator base type; same Nucleus classification as `VruntimeTree` in Section 7.1 |
| `cbs_replenish()` budget math | Evolvable | Runtime formula code. Correctness ensured by the Nucleus invariant checker: validates that `budget_remaining_ns` post-replenishment is <= `quota_us * 1000` and that `deadline_ns` advances monotonically. Making it Nucleus would prevent fixing a budget formula bug without a reboot. |
| `cbs_charge()` accounting | Evolvable | Runtime formula code. Invariant checker validates that the charge does not exceed `delta_exec` and that the deficit is bounded to `-(quota_us * 1000)`. |
| Admission control (`cpu_guarantee_write`) | Evolvable | Admission formula code. Invariant checker validates the sum invariant (`total_guaranteed <= max_guarantee`). |
| `cbs_try_steal()` heuristics | Evolvable | Steal order (NUMA-first) and steal fraction (1/2) are tunable policy |
| CBS `pick_next_eevdf_task()` | Evolvable | Delegates to the active `SchedPolicy` module (same as main EEVDF) |
| `BandwidthAccounting` data | Nucleus (data) | System-wide admission bookkeeping struct layout; corrupted = overcommit possible |
| `CpuBandwidthThrottle` data | Nucleus (data) | Ceiling enforcement struct layout |
| `CpuMaxLocalSlice` per-CPU cache | Nucleus (data) | Local slice accounting struct layout |
| `cpu.max` burst threshold | Evolvable | Burst tolerance is a tunable heuristic |
Phase assignment: CBS guarantee mechanism is Phase 2 (required for cgroup CPU isolation). ML policy hooks for CBS steal heuristics are Phase 4 (Section 23.1).
// umka-core/src/sched/cbs_group.rs (kernel-internal)
/// Cgroup-wide CBS configuration. Stored in `CpuController.cbs`.
/// Written by the `cpu.guarantee` / `cpu.max` cgroupfs interface.
/// The per-CPU servers read this at replenishment time.
pub struct CbsGroupConfig {
/// Bandwidth: quota microseconds per period microseconds.
pub quota_us: u64,
pub period_us: u64,
/// **Note**: CBS guarantee (floor) does NOT use a burst buffer.
/// Burst semantics apply only to `cpu.max` (ceiling) enforcement via
/// `CpuBandwidthThrottle.burst_us`. The guarantee mechanism provides
/// bandwidth reservation, not burst tolerance — the per-CPU proportional
/// model and atomic steal already absorb scheduling jitter naturally.
/// Total weight of runnable tasks across all CPUs. Updated atomically
/// on enqueue/dequeue. Used by per-CPU servers to compute proportional
/// share: `local_share = quota_us * local_weight / total_weight`.
pub total_weight: AtomicU64,
/// Admission control: sum of guarantee/period ratios across all CBS
/// groups on this system. Must not exceed the configured utilization
/// cap (default 95%, matching Linux SCHED_DEADLINE).
/// Updated atomically when `cpu.guarantee` is written.
pub system_utilization: &'static AtomicU64, // Shared singleton
}
/// Per-CPU CBS server for a single cgroup. Allocated lazily when the
/// cgroup's first task becomes runnable on a CPU. Stored in per-CPU
/// `RunQueueData.cbs_servers` XArray, keyed by cgroup ID (O(1) lookup).
///
/// Each CBS server maintains its own local EEVDF tree for tasks in the
/// guaranteed cgroup. `pick_next_task()` checks CBS servers (in EDF
/// order by `deadline_ns`) between the DL and plain-EEVDF steps. This
/// ensures guaranteed groups receive CPU time ahead of non-guaranteed
/// tasks while budget remains. When budget is exhausted (`throttled`
/// is set), the server is skipped and its tasks wait for replenishment.
///
/// **Enqueue/dequeue**: When a task in a CBS-guaranteed cgroup becomes
/// runnable, it is enqueued in the CBS server's EEVDF tree (not the
/// main EEVDF tree). When it blocks/exits, it is dequeued from the
/// server's tree. If the cgroup's `cpu.guarantee` is removed, tasks
/// are migrated to the main EEVDF tree.
/// **Unit convention**: Budget accounting (`budget_remaining_ns`) uses nanoseconds
/// internally, matching Linux's `cfs_rq->runtime_remaining` precision. This avoids
/// the persistent truncation bias of `delta_exec / 1000` (up to ~0.1% systematic
/// under-charge per second). The user-facing `cpu.guarantee` cgroup knob accepts
/// microseconds; the conversion (`quota_us * 1000 -> quota_ns`) happens once at
/// configuration time. Timestamps and deadlines (`deadline_ns`) also use nanoseconds
/// for compatibility with `ktime_get_ns()` and hrtimer APIs.
///
/// **Kernel-internal struct, not `#[repr(C)]`.** Layout determined by Rust compiler.
/// Never crosses KABI or wire boundaries. No `const_assert!` needed.
pub struct CbsCpuServer {
/// Number of tasks currently CBS-throttled in this cgroup on this CPU.
/// Incremented when a task exhausts its CBS budget and is dequeued from
/// the EEVDF tree (throttle path). Decremented on replenishment wake
/// (timer fires, budget restored) or task exit (`do_exit` →
/// `sched_task_exit`). Used to determine whether the per-CPU
/// replenishment timer can be cancelled: when `nr_throttled_tasks`
/// reaches 0, no tasks on this CPU need replenishment and the timer
/// is disarmed to avoid spurious wakeups.
pub nr_throttled_tasks: AtomicU32,
/// Budget remaining on this CPU for this cgroup (nanoseconds).
/// Signed: CBS allows transient overspend (tick granularity > remaining
/// budget). When negative:
/// 1. Server is throttled, tasks dequeued from EEVDF tree.
/// 2. Scheduler attempts atomic steal from sibling CPUs (see below).
/// 3. If steal fails, tasks remain throttled until replenishment.
/// 4. Deficit carries forward: replenishment adds `share_ns` to current
/// budget, so negative balance reduces next period's effective budget.
///
/// **Deficit cap**: clamped to `-(quota_ns as i64)` to bound recovery
/// time to at most one period.
///
/// **Precision**: All budget arithmetic uses integer nanoseconds (i64).
/// No fixed-point or floating-point — integer subtraction and addition
/// are exact. Nanosecond precision eliminates the truncation bias that
/// microsecond accounting would introduce (up to 999 ns lost per tick).
/// Tick granularity (typically 1-4 ms) bounds the maximum overshoot per
/// period to one tick worth of CPU time.
pub budget_remaining_ns: AtomicI64,
/// CBS deadline for this server (absolute nanoseconds). When budget
/// is exhausted and replenished, deadline is pushed by one period.
/// EDF ordering: the server with the earliest deadline is replenished
/// first when multiple servers compete on the same CPU.
pub deadline_ns: AtomicU64,
/// Weight of runnable tasks on this CPU for this cgroup. Used to
/// compute proportional share at replenishment.
pub local_weight: AtomicU64,
/// Whether this server is currently CBS-throttled (guarantee budget
/// exhausted). When true, the server's tasks are dequeued from the
/// CBS EEVDF tree. Cleared on replenishment or successful steal.
///
/// **Layout note**: `throttled` and `max_throttled` are grouped together
/// at the end of the struct to avoid padding between AtomicBool (align 1)
/// and AtomicU64 (align 8) fields. Both are read together on the CBS
/// pick path, so co-locating them improves cache efficiency.
pub throttled: AtomicBool,
/// Whether this cgroup is cpu.max-throttled (ceiling quota exhausted).
/// cpu.max is always the hard ceiling — when `max_throttled` is true,
/// the CBS server is skipped in `pick_next_task()` even if CBS budget
/// remains. CBS budget consumption is frozen while max-throttled: no
/// replenishment timers fire, no runtime is charged. When cpu.max
/// unthrottles the cgroup (period boundary), CBS resumes with its
/// current budget.
///
/// Set by the cpu.max bandwidth enforcement path (same mechanism as
/// Linux CFS bandwidth throttling). Cleared by the cpu.max period
/// timer. The CBS replenishment timer is armed/re-armed only when
/// `max_throttled` is false.
pub max_throttled: AtomicBool,
/// Cumulative runtime consumed this period (nanoseconds). For
/// `cpu.stat` accounting (converted to microseconds at read time).
pub consumed_ns: AtomicU64,
/// Per-CPU replenishment timer. Fires at `deadline_ns`.
pub replenish_timer: HrTimer,
/// Local EEVDF tree for tasks in this CBS-guaranteed cgroup on this
/// CPU. Tasks are enqueued here (not in the main per-CPU EEVDF tree)
/// while the cgroup has an active `cpu.guarantee`. `pick_next_task()`
/// selects the eligible task from this tree when this server wins
/// EDF arbitration. When the cgroup's guarantee is removed, tasks
/// are migrated back to the main EEVDF tree.
///
/// **Relationship with `GroupEntity`** ([Section 7.2](#heterogeneous-cpu-scheduling--hierarchical-eevdf-via-group-scheduling-entities)):
/// `CbsCpuServer.tree` is a **flat** EEVDF tree containing only
/// leaf tasks (no nested group entities). It is separate from the
/// hierarchical EEVDF tree managed by `GroupEntity`. `GroupEntity`
/// implements proportional weight sharing across cgroups in the main
/// EEVDF tree; `CbsCpuServer` provides CBS minimum-bandwidth
/// guarantees as an orthogonal mechanism. A cgroup may have both
/// a `cpu.weight` (`GroupEntity` in the main tree) and a
/// `cpu.guarantee` (`CbsCpuServer` with its own tree). Tasks with
/// an active CBS guarantee are placed in the CBS tree, not in the
/// `GroupEntity.child_rq`.
///
/// Uses `VruntimeTree` directly ([Section 7.1](#scheduler)) — the shared base
/// type containing the augmented RB tree and two-accumulator state.
/// All accumulator-only EEVDF helpers (`avg_vruntime_update`,
/// `entity_key`, `__enqueue_entity`, `__dequeue_entity`) operate
/// on `&VruntimeTree` and work identically on this CBS sub-tree.
/// Root-only fields (`curr`, `next`, `bandwidth_timer`) are not
/// present — CBS has its own `replenish_timer` and the currently
/// running task is tracked solely by the root per-CPU
/// `EevdfRunQueue.curr`.
pub tree: VruntimeTree,
}
/// **Throttle state summary**: A task is throttled if `throttled || max_throttled`.
/// - `throttled` = CBS guarantee budget exhausted. Cleared by replenishment or steal.
/// - `max_throttled` = cgroup `cpu.max` hard ceiling. Cleared by period boundary timer.
/// - `OnRqState::CbsThrottled` = task-level state reflecting either condition.
/// Unthrottle checks both flags independently: clearing one does not unthrottle
/// if the other is still set.
///
/// **Weight accounting during throttle/unthrottle**: When a task enters
/// `CbsThrottled`, it is dequeued from the CBS server's `VruntimeTree` —
/// the tree's `sum_weight` accumulator is decremented by the task's weight.
/// `CbsGroupConfig.total_weight` is unchanged: the task still belongs to the
/// cgroup, it is just not schedulable. On replenishment, the task is re-enqueued
/// into the `VruntimeTree` (`sum_weight` incremented). This ensures that
/// proportional share calculations in `cbs_replenish()` reflect only runnable
/// (non-throttled) tasks when computing `local_share = quota_us * local_w / total_w`.
///
/// **GroupEntity weight invariant**: CBS throttling does NOT affect the cgroup's
/// `GroupEntity` weight in the hierarchical scheduler. The `GroupEntity` continues
/// contributing its weight to the parent EEVDF tree — only the individual task is
/// dequeued from the `CbsCpuServer`'s `VruntimeTree`. On replenishment, the task
/// re-enters the CBS server's tree with its original weight.
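The two-flag throttle rule above can be sketched with plain `AtomicBool`s. This is an illustrative stand-in — `ThrottleFlags` and its methods are hypothetical, not the actual `CbsCpuServer` API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the two throttle flags on a CBS server.
pub struct ThrottleFlags {
    pub throttled: AtomicBool,     // CBS guarantee budget exhausted
    pub max_throttled: AtomicBool, // cpu.max hard ceiling exhausted
}

impl ThrottleFlags {
    pub fn new() -> Self {
        Self {
            throttled: AtomicBool::new(false),
            max_throttled: AtomicBool::new(false),
        }
    }

    /// A task in this server is throttled if EITHER flag is set.
    pub fn is_throttled(&self) -> bool {
        self.throttled.load(Ordering::Acquire) || self.max_throttled.load(Ordering::Acquire)
    }

    /// Clearing one flag does not unthrottle if the other is still set.
    /// Returns true only if the server is now fully unthrottled.
    pub fn clear_cbs_throttle(&self) -> bool {
        self.throttled.store(false, Ordering::Release);
        !self.is_throttled()
    }
}
```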
7.6.4.1.1 CBS Replenishment (per-CPU, no global pool)¶
Each CbsCpuServer has its own replenishment timer. There is no global pool
and no global lock. Budget is distributed proportionally and reclaimed via
atomic steal.
Period boundary replenishment (per-CPU timer fires at deadline_ns):
fn cbs_replenish(server: &CbsCpuServer, config: &CbsGroupConfig):
// Proportional share for this CPU.
// Memory ordering rationale: local_weight and total_weight are updated
// on the slow path (cgroup migration, weight change) and read here on
// the timer path. Relaxed is safe because stale values cause only a
// transient proportional error (corrected on the next period). No
// invariant depends on these loads being sequentially consistent.
local_w = server.local_weight.load(Relaxed)
total_w = config.total_weight.load(Relaxed)
// Compute this CPU's proportional share in nanoseconds.
// quota_us is in microseconds (cgroup API unit); convert to ns for budget math.
share_ns = if total_w > 0 { config.quota_us * 1000 * local_w / total_w } else { 0 }
// **Transient weight staleness**: Under rapid cgroup migration (many tasks
// moving between CPUs simultaneously), both local_weight and total_weight
// may be transiently stale. Example: a weight-1024 task migrates from CPU A
// to CPU B. If CPU A's replenishment timer fires before local_weight is
// updated, CPU A computes a larger-than-deserved share (includes the departed
// task's weight). Meanwhile CPU B's share does not yet include the arriving
// task. For one period, the total distributed bandwidth may exceed quota_us
// by up to the migrating task's proportional share (e.g., ~1-2% of CPU time
// for a 40% guarantee). This is self-correcting: the next replenishment
// period reads updated weights and normalizes. CBS theory absorbs transient
// bandwidth errors — no correctness invariant is violated.
// Replenish: add share to current budget (may be negative from deficit).
// Relaxed: budget_remaining_ns is only read by this CPU's scheduler tick
// and by cbs_charge() on this CPU (both are serialized by the rq lock).
old = server.budget_remaining_ns.load(Relaxed)
// Deficit cap: clamp to at most one period's deficit (converted to ns).
clamped = max(old, -(config.quota_us as i64 * 1000))
server.budget_remaining_ns.store(clamped + share_ns as i64, Relaxed)
// Advance deadline by one period.
server.deadline_ns.fetch_add(config.period_us * 1000, Relaxed)
// Un-throttle if budget is now positive.
// Release ordering: the store to `throttled` must be visible AFTER the
// budget and deadline updates above. Other CPUs that read `throttled`
// with Acquire will see the updated budget/deadline values.
if clamped + (share_ns as i64) > 0:
server.throttled.store(false, Release)
// Re-enqueue all CbsThrottled tasks for this cgroup on this CPU.
// Dequeue-first semantics: each task is removed from the throttled
// list BEFORE being placed on the runqueue. This prevents double-
// enqueue if a concurrent SIGKILL delivery also tries to enqueue
// the task (signal delivery checks OnRqState before re-enqueue).
//
// migration_pending check: skip tasks with migration_pending == true.
// A task whose migration is in flight will be enqueued on its
// destination CPU by the migration completion path. Enqueuing it
// here would cause a double-enqueue race: the migration path also
// calls enqueue_task() on the destination CPU after the task has
// been dequeued from this CPU. The migration completion path is
// responsible for clearing CbsThrottled state on the destination.
cbs_unthrottle_tasks(this_cpu, cgroup_id)
// cbs_unthrottle_tasks pseudocode:
// for task in throttled_list(this_cpu, cgroup_id):
// if task.migration_pending:
// continue // migration path will handle re-enqueue
// task.on_rq_state.store(Queued, Release)
// throttled_list.remove(task)
// enqueue_task(task)
rearm server.replenish_timer to fire at new deadline_ns
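The budget arithmetic of the replenishment step (proportional share, one-period deficit clamp, deadline advance) can be checked in isolation. A minimal sketch with plain integers standing in for the atomics; `replenish_step` is an illustrative helper, not the kernel function:

```rust
/// One replenishment step: returns (new_budget_ns, new_deadline_ns).
/// Mirrors cbs_replenish(): share = quota * local_w / total_w, the
/// carried deficit is clamped to one period's quota, and the deadline
/// advances by exactly one period.
pub fn replenish_step(
    budget_ns: i64, // may be negative (deficit carried from last period)
    deadline_ns: u64,
    quota_us: u64,
    period_us: u64,
    local_w: u64,
    total_w: u64,
) -> (i64, u64) {
    // Proportional share for this CPU, in nanoseconds.
    let share_ns = if total_w > 0 {
        (quota_us * 1000 * local_w / total_w) as i64
    } else {
        0
    };
    // Deficit cap: at most one period's quota of deficit is carried forward.
    let clamped = budget_ns.max(-((quota_us * 1000) as i64));
    (clamped + share_ns, deadline_ns + period_us * 1000)
}
```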
SIGKILL interaction with CBS throttle: A SIGKILL to a CBS-throttled task must
not cause double-enqueue. The invariant is enforced by OnRqState transitions:
fn signal_wake_up_throttled(task: &Task):
// Atomically transition from CbsThrottled → Queued.
// If the CAS fails, the task was already unthrottled by
// cbs_replenish() — no action needed (the task is already
// on the runqueue or running).
if task.on_rq_state.compare_exchange(
OnRqState::CbsThrottled,
OnRqState::Queued,
AcqRel, Relaxed
).is_ok():
// We won the race: remove from throttled list, enqueue on runqueue.
cbs_throttled_list.remove(task)
enqueue_task(task)
// If CAS failed: cbs_replenish() already handled it. No-op.
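The CbsThrottled → Queued transition can be modelled with a single `AtomicU8` CAS. The numeric encoding of `OnRqState` below is a hypothetical stand-in chosen for illustration:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical numeric encoding of OnRqState for this sketch.
pub const QUEUED: u8 = 0;
pub const CBS_THROTTLED: u8 = 1;

/// Returns true if this caller won the race and must perform the
/// throttled-list removal + runqueue enqueue; false means
/// cbs_replenish() already unthrottled the task (no-op).
pub fn signal_wake_up_throttled(state: &AtomicU8) -> bool {
    state
        .compare_exchange(CBS_THROTTLED, QUEUED, Ordering::AcqRel, Ordering::Relaxed)
        .is_ok()
}
```

Exactly one of the two racing paths (signal delivery vs. replenishment) observes a successful CAS, which is what rules out the double-enqueue.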
Atomic steal (when a CPU exhausts its budget before the period ends):
fn cbs_try_steal(server: &CbsCpuServer, cgroup_id: CgroupId) -> bool:
// Scan sibling CPUs on the same NUMA node first, then remote.
for sibling_cpu in steal_order(this_cpu):
if let Some(donor) = sibling_cpu.cbs_servers.get(cgroup_id):
if donor.throttled.load(Relaxed):
continue // Already exhausted.
// Try to steal half of the donor's remaining budget.
// Bounded retry: 3 CAS attempts per donor. If all fail (high
// contention — many CPUs targeting the same donor), move to the
// next donor in steal_order(). This bounds per-donor spin to
// ~600ns worst-case while preserving steal success rate.
const CBS_STEAL_CAS_RETRIES: usize = 3;
for _ in 0..CBS_STEAL_CAS_RETRIES:
let avail = donor.budget_remaining_ns.load(Acquire)
if avail <= 0:
break // Nothing to steal.
let steal = avail / 2
if steal == 0:
break
if donor.budget_remaining_ns.compare_exchange(
avail, avail - steal, AcqRel, Relaxed).is_ok():
server.budget_remaining_ns.fetch_add(steal, Relaxed)
server.throttled.store(false, Release)
return true
// CAS contention — move to next donor in steal_order().
false // No budget available anywhere — remain throttled.
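The steal-half-with-bounded-retries loop can be extracted as a standalone sketch over a bare `AtomicI64` standing in for the donor's `budget_remaining_ns`:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

const CBS_STEAL_CAS_RETRIES: usize = 3;

/// Try to steal half of the donor's remaining budget with bounded CAS
/// retries. Returns the amount stolen (0 if nothing could be taken).
pub fn try_steal_half(donor: &AtomicI64) -> i64 {
    for _ in 0..CBS_STEAL_CAS_RETRIES {
        let avail = donor.load(Ordering::Acquire);
        if avail <= 0 {
            return 0; // Donor is already exhausted.
        }
        let steal = avail / 2;
        if steal == 0 {
            return 0; // 1 ns left: not worth taking.
        }
        if donor
            .compare_exchange(avail, avail - steal, Ordering::AcqRel, Ordering::Relaxed)
            .is_ok()
        {
            return steal;
        }
        // CAS contention: retry up to the bound, then give up on this donor.
    }
    0
}
```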
CBS budget charging (called from update_curr() on every scheduler tick
and on dequeue when a CBS-guaranteed task has been running):
fn cbs_charge(server: &CbsCpuServer, config: &CbsGroupConfig, delta_ns: u64):
// Decrement this CPU's budget by the consumed runtime (nanoseconds).
// Relaxed ordering rationale: the local rq lock serializes this
// CPU's charge path (scheduler tick) and replenishment timer. However,
// budget_remaining_ns is ALSO modified by cbs_try_steal() from remote
// CPUs via CAS (without holding this CPU's rq lock). The Relaxed
// ordering here is still correct because: (a) fetch_sub is atomic
// regardless of ordering, (b) the rq lock provides happens-before for
// local reads of the updated value, and (c) steal's CAS provides its
// own atomicity guarantee for the remote modification. A steal that
// races with a local charge may observe a slightly stale value, but
// the CAS loop in cbs_try_steal() retries on failure, and any
// resulting budget imprecision is bounded to one tick's worth of
// runtime (~1-4 ms), corrected at the next replenishment.
let prev = server.budget_remaining_ns.fetch_sub(delta_ns as i64, Relaxed)
// Note: `new` is computed locally from `prev`. A concurrent steal may
// add budget between `fetch_sub` and the `if new <= 0` check below.
// In that case, `new` is negative but the actual `budget_remaining_ns`
// is now positive (steal added budget). The code enters the throttle
// path, attempts `cbs_try_steal()` (which succeeds since budget is
// positive), and un-throttles immediately. This is a benign race:
// a spurious throttle-steal-unthrottle cycle with no correctness impact.
let new = prev - delta_ns as i64
// Also charge consumed_ns for accounting (cpu.stat, reported as us to userspace).
// Note: consumed_ns may slightly over-count vs budget_remaining_ns in the
// steal race window. Between the fetch_sub above and this fetch_add, a
// remote steal CAS (cbs_try_steal) could reclaim budget, causing
// consumed_ns to account the full delta_ns even though some budget was
// "reclaimed" by the steal. This is a statistics-only discrepancy (no
// scheduling correctness impact): bounded to one tick's worth (~1-4 ms),
// corrected at the next replenishment. Same approximate accounting as Linux.
server.consumed_ns.fetch_add(delta_ns, Relaxed)
// Check if budget is exhausted.
if new <= 0 && !server.throttled.load(Relaxed):
// Attempt atomic steal from sibling CPUs before throttling.
if !cbs_try_steal(server, cgroup_id):
// Steal failed — throttle this server.
server.throttled.store(true, Release)
// Dequeue all tasks in this cgroup on this CPU from the
// CBS server's EEVDF tree. Set OnRqState to CbsThrottled.
cbs_throttle_tasks(this_cpu, cgroup_id)
// Request immediate reschedule — the currently running task
// must yield. Without this, the running task continues for up
// to one full tick (1-4 ms) past CBS budget exhaustion,
// violating the CBS bandwidth guarantee. The resched_curr
// urgency table entry: "CBS budget exhaustion | Eager".
resched_curr(rq, ReschedUrgency::Eager)
// Arm replenishment timer if not already armed.
if !server.replenish_timer.is_armed():
arm_hrtimer(server.replenish_timer, server.deadline_ns.load(Relaxed))
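The charge-then-check pattern at the top of `cbs_charge()` reduces to a small sketch; `charge_budget` is an illustrative helper. Note that `new` is derived from the `fetch_sub` return value, so a concurrent steal can make the exhaustion check spuriously true, as discussed above:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Charge delta_ns against the budget. Returns true if the budget is now
/// exhausted and the caller should attempt a steal (and throttle on
/// failure). `new` is computed locally from the fetch_sub return value.
pub fn charge_budget(budget: &AtomicI64, delta_ns: u64) -> bool {
    let prev = budget.fetch_sub(delta_ns as i64, Ordering::Relaxed);
    let new = prev - delta_ns as i64;
    new <= 0
}
```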
Relationship with cpu.max charging: CBS guarantee charging and cpu.max
ceiling charging are independent. A single scheduler tick for a CBS-guaranteed
task calls both cbs_charge() (decrements the CBS server's per-CPU budget) and
charge_cpu_max() (decrements the cgroup's global cpu.max budget). If either
budget is exhausted, the task is throttled. The max_throttled and throttled
flags are checked independently in pick_next_task().
Key properties of the per-CPU CBS model:
- No global pool lock: all budget operations are per-CPU atomics or CAS on sibling CPUs.
- No 1ms residual stranding: idle servers are steal targets, not locked pools.
- No thundering herd: each CPU has its own replenishment timer at its own deadline.
- Proportional rebalance: at each replenishment, share is recalculated from live weights.
- NUMA-aware steal: local node first, reducing cross-node atomic CAS traffic.
Lock ordering for CBS operations:

| Operation | Lock held | Reason |
|---|---|---|
| `cbs_charge()` (scheduler tick) | Local `RunQueue.lock` (level 50) | Called from `update_curr()` under rq lock |
| `cbs_replenish()` (hrtimer) | Local `RunQueue.lock` (level 50) | Timer handler acquires rq lock before modifying server state |
| `cbs_try_steal()` (steal from sibling) | Local `RunQueue.lock` only | Reads sibling's `budget_remaining_ns` via atomic CAS — no lock on sibling's rq. The CAS provides atomicity for the budget transfer. |
| `cbs_throttle_tasks()` | Local `RunQueue.lock` (level 50) | Modifies local EEVDF tree and task `OnRqState` |
| `cbs_unthrottle_tasks()` | Local `RunQueue.lock` (level 50) | Re-inserts tasks into local EEVDF tree |
| `cpu_guarantee_write()` | Cgroup config lock + CAS on `system.total_guaranteed` | No rq lock held — only modifies cgroup config. Per-CPU servers read config at next replenishment under their own rq lock. |
| `charge_cpu_max()` (cpu.max tick) | Local `RunQueue.lock` (level 50) | Atomics on global `runtime_remaining`; rq lock for task state changes |
Critical invariant: The RunQueue.lock is NEVER held across CPUs for CBS
operations. cbs_try_steal() accesses sibling CPUs' CbsCpuServer fields via
atomic operations only — it does NOT acquire the sibling's runqueue lock. This
eliminates cross-CPU lock contention on the CBS hot path.
Cgroup task migration (task moves from CPU A → CPU B):
1. Dequeue from CPU A: A.server.local_weight -= task.weight. If last task,
server becomes a steal donor (budget remains, accessible via CAS).
2. Enqueue on CPU B: if no CbsCpuServer for this cgroup on B, create one with
budget = cbs_try_steal(). Set B.server.local_weight += task.weight.
- Zero-budget edge case: If cbs_try_steal() fails (no budget available
on any sibling), the new server is created with budget_remaining_ns = 0
and throttled = true. The task is enqueued as OnRqState::CbsThrottled.
The replenishment timer is armed for the next period boundary
(deadline_ns = ktime_get_ns() + config.period_us * 1000). This means the
migrated task waits at most one period before receiving its first budget
allocation. This is the correct CBS behavior: the server starts fresh with
a new deadline and receives its proportional share at that deadline.
3. Proportional shares are recalculated at next replenishment (no immediate rebalance).
This ensures both exhaustion-triggered and timer-triggered replenishment follow
the same CBS invariant: the server's deadline advances by exactly one period per
replenishment, bounding the guaranteed bandwidth to quota_us / period_us over
any sliding window.
Integration with existing scheduler (Section 7.1):
Per-CPU run queue structure (updated):
+------------------+
| RT Queue | <- Highest priority (unchanged)
+------------------+
| DL Queue | <- Deadline tasks (unchanged)
+------------------+
| CBS Group Servers| <- NEW: CBS servers for guaranteed groups
| +-- db_server | Each server has its own EEVDF tree inside
| +-- web_server |
+------------------+
| EEVDF Tree | <- Normal tasks without guarantee (unchanged)
+------------------+
Scheduling decision:
1. Check RT queue (highest priority) — unchanged.
2. Check DL queue (deadline tasks) — unchanged.
3. Check CBS group servers (ordered by earliest deadline):
- If a server has budget and runnable tasks: pick its next task.
- CBS guarantees each server receives its bandwidth.
- Iteration: CBS servers are stored in RunQueueData.cbs_servers
(XArray keyed by CgroupId). XArray provides O(1) lookup by cgroup ID
but does NOT support deadline-ordered iteration natively. The
pick_next_cbs_task() function performs a linear scan of all
CBS servers on this CPU to find the one with the earliest deadline_ns
that has budget and runnable tasks. For typical server counts (N < 20
CBS-guaranteed cgroups with tasks on a single CPU), this is ~20
comparisons = ~100 cycles, which is acceptable within the scheduler
tick budget. If N exceeds 20 (unlikely in practice — it requires 20+
simultaneously-active guaranteed cgroups on one CPU), a deadline-sorted
intrusive list maintained alongside the XArray should be added as an
optimization. The XArray remains the authoritative store (keyed by
CgroupId for O(1) lookup during enqueue/dequeue); the intrusive list
provides O(1) earliest-deadline access for pick_next_task().
4. Check EEVDF tree (normal tasks without guarantee) — unchanged.
Unguaranteed tasks (step 4) run when all CBS servers are idle or throttled. In addition, CBS servers that under-utilize their budget donate the slack to step 4 (the scheduler is work-conserving).
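The earliest-deadline linear scan in step 3 can be sketched as a filter-then-min over a slice. The `Server` struct here is a simplified stand-in for `CbsCpuServer`:

```rust
/// Minimal model of a CBS server for the pick path (hypothetical fields).
pub struct Server {
    pub deadline_ns: u64,
    pub has_budget: bool,   // !throttled && !max_throttled
    pub has_runnable: bool, // EEVDF sub-tree non-empty
}

/// Linear scan for the eligible server with the earliest deadline,
/// mirroring the O(N) scan pick_next_cbs_task() performs over the XArray.
pub fn pick_earliest(servers: &[Server]) -> Option<usize> {
    servers
        .iter()
        .enumerate()
        .filter(|(_, s)| s.has_budget && s.has_runnable)
        .min_by_key(|(_, s)| s.deadline_ns)
        .map(|(i, _)| i)
}
```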
7.6.5 Overcommit Prevention¶
// umka-core/src/sched/cbs_group.rs
/// System-wide guarantee accounting.
pub struct BandwidthAccounting {
/// Total guaranteed bandwidth across all CBS servers.
/// Stored as fixed-point fraction scaled by `BW_SCALE` (1 << 20 = 1_048_576).
/// Example: 40% bandwidth = `(40 * BW_SCALE) / 100 = 419_430`.
pub total_guaranteed: AtomicU64,
/// Maximum allowable guarantee (default: 95%).
/// Reserves 5% for kernel threads, interrupts, housekeeping.
pub max_guarantee: u64,
/// Per-NUMA-node guaranteed bandwidth (for NUMA-aware scheduling).
/// Allocated at boot via `Box::new_zeroed_slice(numa_node_count)`
/// once NUMA topology is discovered (Section 4.9). Owned by the
/// cgroup subsystem for the lifetime of the cgroup hierarchy.
pub per_node_guaranteed: Box<[AtomicU64]>,
}
Setting cpu.guarantee fails with -ENOSPC if total_guaranteed + new_guarantee would
exceed max_guarantee. This prevents overcommit and guarantees all promises are
simultaneously satisfiable.
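The fixed-point admission arithmetic follows directly from the `BW_SCALE` definition above. A sketch with hypothetical helper names:

```rust
const BW_SCALE: u64 = 1 << 20; // 1_048_576, matching the struct docs

/// Fixed-point bandwidth of a quota/period pair, scaled by BW_SCALE.
pub fn bw(quota_us: u64, period_us: u64) -> u64 {
    quota_us.saturating_mul(BW_SCALE) / period_us
}

/// Admission check: would adding new_bw exceed the max_guarantee cap?
pub fn admit(total_guaranteed: u64, new_bw: u64, max_guarantee: u64) -> bool {
    total_guaranteed + new_bw <= max_guarantee
}
```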
7.6.5.1 CPU Hotplug Interaction¶
When a CPU goes offline, the total system capacity decreases. This interacts with CBS guarantee admission control:
- **CPU offline**: The system recalculates `effective_capacity = online_cpu_count`. If `total_guaranteed > effective_capacity * max_guarantee_ratio`, the system is temporarily over-guaranteed. The guarantees are NOT revoked — CBS servers on the offline CPU are migrated to surviving CPUs (tasks and their `local_weight` are redistributed via the standard task migration path). The proportional share formula (`quota_us * local_w / total_w`) naturally adapts: with fewer CPUs, each surviving CPU's share increases, but the absolute guarantee bandwidth is unchanged. The CBS period timer and budget math are unaffected because they operate on system-wide totals, not per-CPU capacities.

  **Warning condition**: If the sum of guarantees exceeds what the remaining CPUs can physically deliver, guaranteed cgroups will receive less than their promised bandwidth. The kernel emits a `cbs_overcommit_warning` tracepoint and writes a rate-limited `pr_warn` to the kernel log. No guarantees are revoked automatically — the administrator (or Kubernetes) must respond to the warning.

- **CPU online**: New CPU capacity becomes available. `max_guarantee` is NOT automatically increased (it is an absolute ratio, not CPU-count-dependent). The new CPU starts with an empty `cbs_servers` XArray. As tasks are migrated or scheduled on the new CPU, `CbsCpuServer` instances are created lazily. The proportional share formula adapts at the next replenishment period.

- **Admission control during hotplug**: New `cpu.guarantee` writes continue to check against the static `max_guarantee` (default 95%). Admission control does NOT factor in the current online CPU count — it assumes all CPUs are available (matching the steady state). This prevents oscillating admission decisions when CPUs are transiently offline during maintenance.

- **CBS server cleanup on CPU offline**: All `CbsCpuServer` instances on the dying CPU are drained: throttled tasks are unthrottled and migrated, replenishment timers are cancelled, and remaining local budget is returned to sibling CPUs via `cbs_try_steal()` (reverse direction: the dying CPU donates its full remaining budget to the first sibling that accepts it). This matches the Linux pattern in `unthrottle_offline_cfs_rqs()`.
7.6.6 Interaction with Existing Controls¶
| Control | Meaning | Interaction with cpu.guarantee |
|---|---|---|
| `cpu.weight` | Relative share of excess CPU | Distributes CPU beyond guaranteed minimums |
| `cpu.max` | Maximum CPU ceiling | Guarantee cannot exceed max; max still enforced |
| `cpu.guarantee` | Minimum CPU floor | NEW: CBS-backed guaranteed bandwidth |
| `cpu.pressure` | PSI pressure info | Reports pressure relative to guarantee |
Example: a cgroup with cpu.guarantee=40%, cpu.max=60%, cpu.weight=100:
- Always gets at least 40% CPU (even under full system load)
- Never gets more than 60% CPU (even if system is idle — ceiling applies)
- Between 40% and 60%, shares proportionally with other cgroups by weight
7.6.7 Nested Cgroup Hierarchy¶
A child cgroup's cpu.guarantee draws from its parent's guarantee budget. The
kernel enforces this at write time — it is not merely advisory.
Admission control for nested guarantees:
fn cpu_guarantee_write(cgrp: &Cgroup, new_quota_us: u64, period_us: u64) -> Result<()> {
// All bandwidth comparisons use fixed-point arithmetic: bandwidth is
// represented as (quota_us * BW_SCALE) / period_us, yielding a u64
// in the range [0, BW_SCALE]. This avoids FPU save/restore in kernel
// context and eliminates negative-f64-to-u64 truncation bugs.
//
// BW_SCALE = 1 << 20 (1_048_576), matching Linux's BW_UNIT = 1 << BW_SHIFT.
// CBS and DL admission use the SAME scale factor for direct comparison
// in the system-wide overcommit check (total_guaranteed + dl_bandwidth < capacity).
const BW_SCALE: u64 = 1 << 20;
let new_bw = new_quota_us.saturating_mul(BW_SCALE) / period_us;
// 1. Check against cpu.max ceiling (guarantee cannot exceed max).
let max_us = cgrp.cpu.as_ref().map(|c| c.max_us.load(Relaxed)).unwrap_or(u64::MAX);
if max_us != u64::MAX {
let max_period = cgrp.cpu.as_ref().unwrap().period_us.load(Relaxed);
let max_bw = max_us.saturating_mul(BW_SCALE) / max_period;
if new_bw > max_bw { return Err(EINVAL); }
}
// 2. Check against parent's remaining guarantee budget.
if let Some(parent) = cgrp.parent() {
let parent_cfg = &parent.cpu.cbs;
if parent_cfg.is_none() {
// Parent has no guarantee — child cannot have one either.
// Exception: root cgroup implicitly has 100% guarantee.
if !parent.is_root() { return Err(EINVAL); }
} else {
let parent_quota = parent_cfg.unwrap().quota_us;
let parent_period = parent_cfg.unwrap().period_us;
let parent_bw = parent_quota.saturating_mul(BW_SCALE) / parent_period;
// Sum of existing children's guarantees (excluding this cgroup).
// children_guarantee_sum_excluding() returns fixed-point (scaled by BW_SCALE).
let siblings_bw = parent.children_guarantee_sum_excluding(cgrp);
if siblings_bw + new_bw > parent_bw {
return Err(ENOSPC); // Would overcommit parent's budget.
}
}
}
// 3. Check-and-commit system-wide utilization cap (CAS loop).
// total_guaranteed and max_guarantee are both stored in BW_SCALE units.
//
// This MUST be an atomic compare-and-swap loop, not a Relaxed load followed
// by a separate fetch_add. With Relaxed check-then-commit, two concurrent
// writers can both read the same `total`, both pass the overcommit check,
// and both commit — exceeding max_guarantee. The CAS loop ensures exactly
// one writer succeeds per slot of available bandwidth.
let system = &BANDWIDTH_ACCOUNTING;
let current_bw = cgrp.current_guarantee_bw(); // fixed-point, BW_SCALE units
if new_bw >= current_bw {
let increase = new_bw - current_bw;
loop {
let total = system.total_guaranteed.load(Acquire);
if total + increase > system.max_guarantee {
return Err(ENOSPC);
}
if system.total_guaranteed.compare_exchange(
total, total + increase, AcqRel, Acquire
).is_ok() {
break;
}
// CAS failed: another writer committed first. Re-read and retry.
}
} else {
// Shrinking: total decreases — always succeeds, no overcommit risk.
// fetch_sub is safe because we hold the cgroup's config lock, and
// current_bw was committed by a prior successful CAS.
let decrease = current_bw - new_bw;
system.total_guaranteed.fetch_sub(decrease, Release);
}
// 4. Commit: update cgroup config.
    cgrp.cpu.cbs = Some(CbsGroupConfig { quota_us: new_quota_us, period_us, .. });
Ok(())
}
Key invariants:
- sum(children.guarantee) <= parent.guarantee at every level.
- Root cgroup has an implicit 100% guarantee (all CPU bandwidth).
- Removing a child's guarantee immediately frees budget for siblings.
- Guarantee bandwidth is expressed in absolute terms (quota/period), not
relative weights. This makes nesting arithmetic straightforward.
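The per-level nesting invariant reduces to a one-line check over fixed-point bandwidths. A sketch; `children_fit` is an illustrative helper, not the kernel's `children_guarantee_sum_excluding()`:

```rust
/// Nesting invariant check: the sum of the existing children's guarantee
/// bandwidths plus the proposed new child bandwidth must not exceed the
/// parent's guarantee (all values in BW_SCALE fixed-point units).
pub fn children_fit(parent_bw: u64, sibling_bws: &[u64], new_child_bw: u64) -> bool {
    sibling_bws.iter().sum::<u64>() + new_child_bw <= parent_bw
}
```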
7.6.8 cpu.max vs cpu.guarantee Interaction¶
cpu.max (ceiling) and cpu.guarantee (floor) can be set simultaneously.
The interaction is deterministic:
- **cpu.max always wins.** If the cgroup's cpu.max quota is exhausted, all tasks are max-throttled regardless of CBS guarantee budget. The CBS server's `max_throttled` flag is set, and `pick_next_task()` skips the server.

- **CBS budget is frozen during max-throttle.** While `max_throttled` is true:
  - No CBS runtime is charged (tasks aren't running).
  - The CBS replenishment timer is not re-armed.
  - CBS budget remains at its current value.

- **When cpu.max unthrottles** (the cpu.max period boundary timer fires):
  - `max_throttled` is cleared.
  - The CBS replenishment timer is re-armed if not already armed.
  - Tasks become eligible for CBS pick in the next `pick_next_task()`.

- **Validation (guarantee write)**: A `cpu.guarantee` write is rejected with `-EINVAL` if the guarantee bandwidth would exceed `cpu.max` bandwidth. The constraint `guarantee_bw <= max_bw` is checked at write time.

- **Validation (max write — reverse constraint)**: A `cpu.max` write is rejected with `-EINVAL` if the new max bandwidth would fall below the existing `cpu.guarantee` bandwidth. This prevents the system from entering an inconsistent state where `guarantee > max`. The check:

  fn cpu_max_write(cgrp: &Cgroup, new_max_us: u64, period_us: u64) -> Result<()> {
      // ... existing cpu.max validation ...
      if let Some(cbs) = &cgrp.cpu.cbs {
          let guarantee_bw = cbs.quota_us.saturating_mul(BW_SCALE) / cbs.period_us;
          let max_bw = new_max_us.saturating_mul(BW_SCALE) / period_us;
          if max_bw < guarantee_bw {
              return Err(EINVAL); // Cannot lower max below guarantee.
          }
      }
      // ... commit cpu.max change ...
  }

  **Invariant**: At all times, `guarantee_bw <= max_bw` for every cgroup. Both write paths enforce their respective side of this constraint.
State diagram per CBS server:
┌──────────────────┐
│ RUNNING │
│ (budget > 0, │
│ !throttled, │
│ !max_throttled)│
└─┬──────────┬────┘
CBS budget │ │ cpu.max quota
exhausted │ │ exhausted
▼ ▼
┌──────────────┐ ┌──────────────┐
│ CBS_THROTTLED│ │ MAX_THROTTLED│
│ (throttled) │ │ (max_throttled)│
└──────┬───────┘ └──────┬───────┘
replenish │ cpu.max │
or steal │ period │
▼ boundary ▼
┌──────────────┐ ┌──────────────┐
│ RUNNING │ │ RUNNING │
└──────────────┘ │ (or CBS_ │
│ THROTTLED │
│ if budget=0)│
└──────────────┘
7.6.9 Use Case: Driver Tier Isolation¶
CPU guarantees integrate naturally with the driver tier model:
# Ensure Tier 1 drivers always have CPU bandwidth for I/O processing
echo "200000 1000000" > /sys/fs/cgroup/umka-tier1/cpu.guarantee  # 20%
# Ensure Tier 2 drivers have some guaranteed bandwidth
echo "50000 1000000" > /sys/fs/cgroup/umka-tier2/cpu.guarantee   # 5%
A misbehaving Tier 2 driver process spinning in a loop cannot starve Tier 1 NVMe or NIC drivers of CPU time.
7.6.10 CBS Task Migration Between Cores¶
When a task inside a CBS bandwidth server is migrated to another CPU (via load balancing or explicit affinity change), the EEVDF scheduling parameters must be translated to the destination run queue's context:
- **vruntime adjustment**: The migrated task's `vruntime` is rewritten relative to the destination CBS server's `zero_vruntime` (Section 7.1):

  // src_tree / dst_tree are the VruntimeTree in each CPU's CbsCpuServer.
  vruntime_offset = task.vruntime as i64 - src_tree.zero_vruntime as i64
  task.vruntime = (dst_tree.zero_vruntime as i64 + vruntime_offset) as u64

  `zero_vruntime` tracks close to `avg_vruntime` (the weighted average virtual runtime), NOT the minimum vruntime. This preserves the task's relative position in virtual time on the destination CPU. A task that was "behind" (vruntime < avg) on the source remains "behind" on the destination. Note: `vruntime_offset` is distinct from the EEVDF `task.vlag` field (which is the signed, unweighted quantity `avg_vruntime - vruntime`).

- **Eligibility is computed dynamically** — there is no stored `eligible_vtime` field. After adjusting `vruntime`, the task's `vlag` is preserved from the dequeue on the source CPU (set by `update_entity_lag()`). On the destination, `place_entity()` uses the saved `vlag` to position the task correctly in virtual time: `se.vruntime = avg_vruntime(dst_rq) - inflated_vlag`. The task is then inserted into the CBS server's `tree` (keyed by `vdeadline`). Eligibility is determined dynamically by `pick_eevdf()` via `vruntime_eligible()`.

- **CBS server state**: The CBS server's `budget_remaining_ns`, `deadline_ns`, and replenishment timer are not adjusted — they are absolute (nanoseconds since boot) and do not depend on per-CPU vruntime. The migrated task continues consuming from the same CBS budget. If the budget was exhausted on the source CPU, the task remains throttled until the next period boundary (the timer fires on any CPU — hrtimers are per-timer, not per-CPU).

- **avg_vruntime accumulator update**: The source run queue's `sum_w_vruntime` and `sum_weight` accumulators are updated to remove the migrated task's contribution (via `avg_vruntime_update(src_rq, se, false)`); the destination's accumulators add it (via `avg_vruntime_update(dst_rq, se, true)`). This ensures `entity_eligible()` / `vruntime_eligible()` remains correct on both CPUs after migration.
This is the same protocol used for non-CBS EEVDF tasks (Section 7.1), with the CBS-specific invariant that the bandwidth budget is absolute time (not relative to any per-CPU baseline).
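The vruntime translation step reduces to signed-offset arithmetic. A standalone sketch of the rewrite described above (`translate_vruntime` is an illustrative helper):

```rust
/// Translate a migrating task's vruntime from the source tree's
/// zero_vruntime baseline to the destination's, preserving the signed
/// offset — i.e., the task's relative position in virtual time.
pub fn translate_vruntime(task_vruntime: u64, src_zero: u64, dst_zero: u64) -> u64 {
    let offset = task_vruntime as i64 - src_zero as i64; // may be negative ("behind")
    (dst_zero as i64 + offset) as u64
}
```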
7.6.11 cpu.max Ceiling Enforcement (Bandwidth Throttling)¶
cpu.max enforces a hard ceiling on CPU consumption per cgroup. Unlike cpu.guarantee
(which provides a floor via CBS), cpu.max is a ceiling that limits the
maximum CPU time a cgroup can consume in any given period. This matches Linux cgroup v2
cpu.max semantics exactly.
/// Per-cgroup cpu.max bandwidth throttle state.
///
/// Stored in `CpuController` ([Section 17.2](17-containers.md#control-groups--cpu-controller-state)).
/// Tracks the global budget for a cgroup's cpu.max enforcement across all CPUs.
///
/// Design: Unlike CBS guarantee (per-CPU servers with per-CPU budgets), cpu.max
/// uses a **global pool** with per-CPU runtime slices. This matches Linux's CFS
/// bandwidth throttling design. The global pool is necessary because cpu.max is
/// a system-wide ceiling (a cgroup limited to 200ms/1000ms should get at most
/// 200ms total across all CPUs combined).
pub struct CpuBandwidthThrottle {
/// Quota in microseconds per period. From `cpu.max` first field.
/// `u64::MAX` means unlimited (no throttling — the default).
/// Exposed to userspace as microseconds via `cpu.max` cgroup file.
pub quota_us: u64,
/// Quota in nanoseconds (= `quota_us * 1000`). Used internally by
/// `charge_cpu_max()` and replenishment to avoid truncation bias.
/// Computed once at configuration time.
pub quota_ns: u64,
/// Period in microseconds. From `cpu.max` second field (default 100_000 = 100 ms).
pub period_us: u64,
/// Runtime remaining in the current period (**nanoseconds**, signed).
/// Nanoseconds are used internally (matching CBS `budget_remaining_ns`)
/// to avoid the persistent truncation bias of `delta_exec / 1000`.
/// The userspace-visible `cpu.max` interface uses microseconds; the
/// kernel converts at configuration time: `quota_ns = quota_us * 1000`.
///
/// Signed to allow transient overshoot: a task may consume a few
/// nanoseconds beyond zero before the scheduler tick detects exhaustion.
/// The overshoot is carried forward as a deficit into the next period
/// (subtracted from the replenished quota). Clamped to `-(quota_ns as i64)`
/// to bound recovery to at most one period.
pub runtime_remaining: AtomicI64,
/// Whether all tasks in this cgroup are currently throttled (dequeued from
/// runqueues). Set when `runtime_remaining` drops to zero or below (or past
/// the burst allowance, if one is configured). Cleared when the period timer
/// fires and replenishes the quota.
pub throttled: AtomicBool,
/// Burst buffer in microseconds. Allows temporary over-budget execution
/// to absorb scheduling jitter without throttling. Set via `cpu.max.burst`.
/// Default 0. Maximum: `quota_us`. When non-zero, throttling triggers at
/// `runtime_remaining <= -burst_ns` instead of `runtime_remaining <= 0`
/// (the comparison uses the nanosecond value, matching `charge_cpu_max()`).
pub burst_us: u64,
/// Burst buffer in nanoseconds (= `burst_us * 1000`). Used internally
/// by `charge_cpu_max()`. Computed once at configuration time.
pub burst_ns: u64,
/// Number of periods that have elapsed (for `cpu.stat` accounting).
/// Statistics accessed from multiple CPUs (cgroup accounting reads,
/// timer handler writes). Relaxed ordering sufficient for statistics.
pub nr_periods: AtomicU64,
/// Number of periods in which this cgroup was throttled.
pub nr_throttled: AtomicU64,
/// Total time spent throttled (microseconds, for `cpu.stat`).
pub throttled_time_us: AtomicU64,
/// High-resolution timer that fires at each period boundary.
/// On expiry: replenishes `runtime_remaining` by `quota_ns` (less any
/// clamped deficit carried forward), clears
pub timer: HrTimerHandle,
}
Period timer replenishment:
fn cpu_max_replenish(throttle: &CpuBandwidthThrottle):
// Carry forward any overshoot as deficit, clamped to one period.
let deficit = min(0, throttle.runtime_remaining.load(Relaxed))
let clamped = max(deficit, -(throttle.quota_ns as i64))
// Replenish: add quota_ns to current remaining (which may be negative).
throttle.runtime_remaining.store(clamped + throttle.quota_ns as i64, Release)
// Un-throttle: re-enqueue all CbsThrottled/MaxThrottled tasks.
if throttle.throttled.swap(false, Release):
for cpu in online_cpus():
if let Some(server) = cpu.rq.cbs_servers.get(cgroup_id):
server.max_throttled.store(false, Release)
// Re-enqueue all tasks in this cgroup on this CPU.
cpu_max_unthrottle_tasks(cpu, cgroup_id)
// Advance period counter.
throttle.nr_periods.fetch_add(1, Relaxed);
// Re-arm timer for next period.
rearm_hrtimer(throttle.timer, throttle.period_us * 1000)
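The deficit-carry arithmetic is easy to get wrong at the clamp boundary, so it can be checked in isolation. A minimal sketch of the same computation with plain integers (the kernel uses `AtomicI64`; `replenish` here is an illustrative free function, not kernel API):

```rust
/// Compute the replenished `runtime_remaining` at a period boundary,
/// mirroring `cpu_max_replenish`: overshoot (a negative remaining value)
/// carries forward as a deficit, clamped to one period's quota so that
/// recovery takes at most one period. Unused positive budget does not
/// roll over between periods.
fn replenish(runtime_remaining_ns: i64, quota_ns: u64) -> i64 {
    let deficit = runtime_remaining_ns.min(0); // positive leftover discarded
    let clamped = deficit.max(-(quota_ns as i64)); // bound deficit to one period
    clamped + quota_ns as i64
}
```

With a 200 ms quota, a 3 ms overshoot leaves 197 ms for the next period; an arbitrarily large overshoot (e.g. after a missed tick) still replenishes to at least 0, never to a multi-period starvation.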
Runtime charging and throttle detection:
On every scheduler tick (task_tick()) and on voluntary dequeue (put_prev_task()),
the runtime consumed by the current task is charged against the cgroup's global pool:
/// **Locking**: `charge_cpu_max()` is called from `update_curr()` with the
/// local CPU's `RunQueue.lock` held (level 50). The `runtime_remaining`
/// atomic allows concurrent charging from multiple CPUs without a global
/// lock. When throttling is triggered, the iteration uses IPI-based
/// throttle broadcast instead of inline remote lock acquisition:
/// `cpu_max_throttle_tasks(cpu, cgroup_id)` sends an IPI to each remote
/// CPU, which locally throttles tasks under its own `RunQueue.lock`. This
/// avoids holding multiple rq locks simultaneously and prevents ABBA
/// deadlock with load balancing (which acquires rq locks in CPU-ID order).
/// `charge_cpu_max` accepts **nanoseconds** (matching CBS `budget_remaining_ns`
/// and `update_curr()`'s `delta_exec` which is in nanoseconds). This avoids the
/// persistent truncation bias of `delta_exec / 1000` that CBS explicitly chose
/// to avoid (see §CBS Budget Accounting).
fn charge_cpu_max(task: &Task, bw: &CpuBandwidthThrottle, delta_ns: u64):
// Unlimited — no throttling.
if bw.quota_ns == u64::MAX:
return
let prev = bw.runtime_remaining.fetch_sub(delta_ns as i64, AcqRel)
let new = prev - delta_ns as i64
// Check if budget is exhausted (accounting for burst buffer).
// burst_ns = burst_us * 1000 (converted at configuration time).
// swap(true) makes the check-and-set atomic: if two CPUs cross the
// threshold concurrently, only one performs the throttle broadcast.
if new <= -(bw.burst_ns as i64) && !bw.throttled.swap(true, AcqRel):
bw.nr_throttled.fetch_add(1, Relaxed)
let cgroup_id = task.cgroup_id();
// Throttle all tasks in this cgroup across all CPUs.
// Each cpu_max_throttle_tasks() sends an IPI; the remote CPU's
// IPI handler acquires its own rq lock and throttles locally.
for cpu in online_cpus():
if let Some(server) = cpu.rq.cbs_servers.get(cgroup_id):
server.max_throttled.store(true, Release)
cpu_max_throttle_tasks(cpu, cgroup_id) // IPI-based, not inline lock
// Request immediate reschedule for the currently running task.
// Without this, the task continues for up to one full tick past
// budget exhaustion, violating the cpu.max bandwidth ceiling.
resched_curr(rq, ReschedUrgency::Eager)
Throttle/unthrottle task state transitions:
When cpu.max throttles a cgroup:
1. Each task in the cgroup has its OnRqState set to CbsThrottled (reusing the same
throttled state as CBS guarantee exhaustion — the scheduler makes no distinction).
2. Tasks are removed from their EEVDF tasks_timeline tree.
3. pick_next_task() skips max_throttled CBS servers.
When the period timer fires and replenishes quota:
1. throttled is cleared.
2. All throttled tasks have OnRqState restored to Queued and are re-inserted
into tasks_timeline. Eligibility is computed dynamically by pick_eevdf().
3. A reschedule IPI is sent to CPUs that have unthrottled tasks, so pick_next_task()
runs immediately.
Multi-CPU quota distribution: The global runtime_remaining is accessed via
AtomicI64 from all CPUs. To reduce cross-CPU atomic contention on high-core-count
systems, each CPU maintains a local runtime slice:
/// Per-CPU local slice of the cpu.max quota. Reduces atomic contention
/// on the global `CpuBandwidthThrottle.runtime_remaining`.
///
/// **Units: nanoseconds** — consistent with the global pool
/// (`CpuBandwidthThrottle.runtime_remaining`) and the CBS accounting
/// model which chose nanoseconds to "avoid the persistent truncation
/// bias of `delta_exec / 1000`."
pub struct CpuMaxLocalSlice {
/// Local runtime budget (nanoseconds). Drawn from the global pool
/// in chunks of `slice_size_ns`. When exhausted, refills from global.
pub local_remaining_ns: i64,
/// Refill chunk size (nanoseconds). Default: `(period_us * 1000) / nr_cpus`,
/// clamped to minimum 1_000_000 (1 ms) and maximum `quota_ns / 2`.
pub slice_size_ns: u64,
}
The per-CPU slice is refilled from the global pool via fetch_sub on the global
runtime_remaining. When the global pool is also exhausted, throttling triggers.
This reduces global atomic operations from once-per-tick to once-per-slice-refill.
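The charge-and-refill path can be sketched as follows (the helper names and refill logic are illustrative, not the kernel source; the real path runs under the local `RunQueue.lock`):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Per-CPU slice of the cpu.max quota (fields match the struct above).
pub struct CpuMaxLocalSlice {
    pub local_remaining_ns: i64,
    pub slice_size_ns: u64,
}

/// Default slice size: an even split of the period across CPUs, bounded
/// below by 1 ms and above by half the quota (per the field docs above).
fn default_slice_size(period_us: u64, quota_ns: u64, nr_cpus: u64) -> u64 {
    ((period_us * 1000) / nr_cpus).max(1_000_000).min(quota_ns / 2)
}

/// Charge `delta_ns` against the local slice, refilling one chunk from the
/// global pool when exhausted. Returns true when the cgroup must throttle
/// (global pool also exhausted). One atomic RMW per refill, not per tick.
fn charge_local(slice: &mut CpuMaxLocalSlice, global: &AtomicI64, delta_ns: u64) -> bool {
    slice.local_remaining_ns -= delta_ns as i64;
    if slice.local_remaining_ns > 0 {
        return false; // still running on the local slice — no global traffic
    }
    let chunk = slice.slice_size_ns as i64;
    let prev = global.fetch_sub(chunk, Ordering::AcqRel);
    if prev <= 0 {
        global.fetch_add(chunk, Ordering::AcqRel); // undo: pool was empty
        return true; // throttle this cgroup
    }
    slice.local_remaining_ns += chunk;
    false
}
```

On a 64-CPU machine with a 100 ms period, the default slice is ~1.56 ms, so a fully busy CPU touches the global pool roughly once per slice instead of once per 4 ms tick.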
CBS budget cleanup on task exit: When a task exits (do_exit → sched_task_exit),
its CBS server state is cleaned up:
1. Cancel the cgroup's per-CPU CBS replenishment timer if no other throttled tasks
remain on this CPU (the replenish_timer lives in CbsCpuServer, not in the task).
2. If the task was throttled (OnRqState::CbsThrottled), decrement the cgroup's
nr_throttled_tasks counter.
3. Return any remaining per-CPU local slice budget to the global
CpuBandwidthThrottle.runtime_remaining pool (fetch_add of the local remainder).
4. Remove the task from the CBS server's task list.
This ensures no stale timers fire for dead tasks and unused budget is returned to the
cgroup for other tasks to use.
Cross-cgroup migration bandwidth for cpu.max: When a task migrates between cgroups
(via cgroup_migrate or sched_move_task), the CBS budget transfer follows:
0. Before dequeuing, cancel any pending CBS replenishment timer for this task on
the source CPU: if the migrating task is the last throttled task on this CPU
(nr_throttled_tasks == 1), call hrtimer_cancel(&source_server.replenish_timer).
Set task.migration_pending = true to prevent the timer callback from
re-enqueuing the task during migration. This flag is cleared after enqueue in
the destination cgroup (step 6).
1. Dequeue the task from the source cgroup's runqueue.
2. Cancel the source cgroup's CBS timer for this task.
3. Return any unconsumed budget from the source cgroup's per-CPU local slice.
4. Re-initialize the task's CBS parameters from the destination cgroup's cpu.max
(quota, period, burst). Set budget = 0, deadline = now + period (fresh start).
5. Note: The task's own nice-derived weight (sched_prio_to_weight[task.nice + 20])
does NOT change during cgroup migration — nice is a per-task property, not
per-cgroup. The cgroup's cpu.weight affects the GroupEntity weight in the
hierarchical scheduler, not individual task weights. No weight recomputation
is needed here. The dequeue→enqueue sequence in step 6 ensures the task
re-enters the EEVDF tree in the correct GroupEntity for its new cgroup.
6. Enqueue the task in the destination cgroup's runqueue. Clear
task.migration_pending (set in step 0) so replenishment timers on the
destination CPU can enqueue this task normally.
Counter maintenance: Step 1 (dequeue) decrements the source server's
nr_throttled_tasks if the task was in CbsThrottled state. Step 6
(enqueue) does NOT increment the destination's nr_throttled_tasks because
the task starts with a fresh budget (budget = 0, deadline = now + period
from step 4) and is enqueued as Queued, not CbsThrottled. The
destination's nr_throttled_tasks increments only when the task's fresh
budget is exhausted in a future tick. This ensures nr_throttled_tasks
remains accurate across migration boundaries.
The fresh-start policy prevents a task from carrying a large unconsumed budget from a
generous cgroup into a restrictive one (bandwidth amplification). It also prevents a
nearly-exhausted budget from causing immediate throttling in the destination cgroup.
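Step 4's fresh-start reinitialization can be sketched as follows (struct and function names are illustrative; the real code lives in the `sched_move_task` path):

```rust
/// Fresh-start CBS parameters on cross-cgroup migration (step 4 above):
/// budget = 0, deadline = now + destination period. Illustrative types —
/// the kernel's task struct carries these fields directly.
pub struct CbsParams {
    pub quota_ns: u64,
    pub period_ns: u64,
    pub budget_ns: i64,
    pub deadline_ns: u64,
}

fn reinit_for_destination(dst_quota_us: u64, dst_period_us: u64, now_ns: u64) -> CbsParams {
    let period_ns = dst_period_us * 1000;
    CbsParams {
        quota_ns: dst_quota_us * 1000,   // cgroup config is in microseconds
        period_ns,
        budget_ns: 0,                    // no budget carried across cgroups
        deadline_ns: now_ns + period_ns, // fresh deadline, one period out
    }
}
```

Zeroing the budget rather than transferring it is what prevents both bandwidth amplification and spurious throttling on arrival.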
cpu.guarantee (CBS minimum bandwidth) transfer: When a task with a cpu.guarantee
reservation migrates between cgroups, the guaranteed bandwidth (expressed as
microseconds per period) is transferred proportionally. The per-task guarantee
(task_guarantee_us) is derived at runtime as cgroup.cpu.guarantee / nr_tasks_in_cgroup
(proportional share of the cgroup's total guarantee divided evenly among its member
tasks). It is not a stored field; it is computed during cgroup_migrate().
The source cgroup's committed guarantee decreases by task_guarantee_us and the destination cgroup's committed
guarantee increases by the same amount. If the destination cgroup's total committed
guarantees would exceed its cpu.guarantee limit, the migration is rejected with
ENOSPC at cgroup_migrate() time (admission control). This ensures that guaranteed
bandwidth is never overcommitted and the CBS admission invariant
(sum(task_guarantees) <= cgroup.cpu.guarantee) is maintained across migrations.
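The admission check can be sketched as follows (struct and field names are illustrative; `ENOSPC` matches the errno value returned to the `cgroup.procs` writer):

```rust
/// Sketch of the cpu.guarantee admission check at cgroup_migrate() time.
const ENOSPC: i32 = 28;

pub struct GuaranteeState {
    pub guarantee_us: u64, // configured cpu.guarantee (per period)
    pub committed_us: u64, // sum of admitted per-task guarantees
}

/// Derive the migrating task's share (an even split of the source cgroup's
/// guarantee, computed at migration time — not a stored field) and admit it
/// against the destination's headroom. On rejection, neither cgroup's
/// committed total changes.
fn admit_migration(
    src: &mut GuaranteeState,
    dst: &mut GuaranteeState,
    src_nr_tasks: u64,
) -> Result<(), i32> {
    let task_guarantee_us = src.guarantee_us / src_nr_tasks;
    if dst.committed_us + task_guarantee_us > dst.guarantee_us {
        return Err(ENOSPC); // would overcommit the destination's guarantee
    }
    src.committed_us -= task_guarantee_us;
    dst.committed_us += task_guarantee_us;
    Ok(())
}
```

The rejection path leaves both cgroups untouched, which is what preserves the admission invariant across failed migrations.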
UmkaOS behavioral difference from Linux: echo $PID > cgroup.procs may return
ENOSPC if the destination cgroup's CBS bandwidth guarantee would be overcommitted.
Linux does not have this failure mode because it lacks cpu.guarantee. Container
runtimes (Docker, containerd, systemd) that write to cgroup.procs already handle
errors (EBUSY, ENOMEM, ESRCH); they must additionally handle ENOSPC when UmkaOS
CBS guarantees are in use.
Cgroup freezer interaction: When a cgroup enters the Frozen state, all CBS
replenishment timers for tasks in that cgroup are cancelled (hrtimer_cancel). On
thaw, CBS timers are re-armed with fresh budgets (budget = full period, deadline =
now + period, throttled flags cleared). This prevents both free bandwidth accrual
during freeze and spurious immediate throttling on resume. See
Section 17.2 for the full freeze/thaw protocol.
7.6.12 ML Policy Integration¶
The CBS guarantee subsystem exposes tunable parameters and observations to the ML policy framework (Section 23.1).
Tunable parameters (registered via register_param!, SubsystemId::Scheduler):
| ParamId | Name | Default | Min | Max | Effect |
|---|---|---|---|---|---|
| 0x0010 | cbs_steal_fraction_pct | 50 | 10 | 90 | Fraction of donor's budget to steal (percent). Default: steal half. Lower values reduce steal impact on donor; higher values reduce throttle latency. |
| 0x0011 | cbs_steal_numa_only | 0 | 0 | 1 | If 1, only steal from same-NUMA-node siblings (skip remote). Reduces cross-node CAS traffic at cost of higher throttle latency when local budgets are exhausted. |
Observation points (emitted via observe_kernel!, gated by static_key):
| Event | obs_type | features[0..5] | Frequency |
|---|---|---|---|
| CBS throttle | 0x10 | cgroup_id, cpu_id, budget_deficit_ns, period_us, nr_throttled_tasks, steal_attempts | On throttle |
| CBS replenish | 0x11 | cgroup_id, cpu_id, share_ns, prev_budget_ns, new_budget_ns, consumed_ns | On replenish |
| CBS steal success | 0x12 | cgroup_id, src_cpu, dst_cpu, stolen_ns, donor_remaining_ns, _ | On successful steal |
| CBS steal fail | 0x13 | cgroup_id, cpu_id, nr_siblings_scanned, _, _, _ | On failed steal (all siblings exhausted) |
These observations enable the ML framework to learn:
- Whether steal fraction should be adjusted per workload (batch vs interactive)
- Whether NUMA-only steal is beneficial for a given topology
- Correlation between guarantee utilization and application performance
Parameter consumption: cbs_steal_fraction_pct is read in cbs_try_steal()
(Evolvable hot path — replaces hardcoded avail / 2):
let fraction = PARAM_STORE.get(ParamId::CbsStealFractionPct)
.map_or(50, |p| p.current.load(Relaxed));
let steal = avail * fraction as i64 / 100;
7.6.13 cpu.stat CBS Guarantee Statistics¶
The cpu.stat file for a cgroup with an active cpu.guarantee includes
additional CBS-specific counters alongside the standard Linux counters:
# Standard Linux counters (unchanged):
usage_usec 123456789
user_usec 100000000
system_usec 23456789
nr_periods 1234
nr_throttled 56
throttled_usec 789000
# UmkaOS CBS guarantee counters (new):
guarantee_nr_periods 1234 # Periods elapsed for CBS guarantee
guarantee_nr_throttled 12 # Periods in which CBS guarantee budget exhausted
guarantee_throttled_usec 45000 # Total time CBS-throttled (microseconds)
guarantee_consumed_usec 456000 # Total CBS budget consumed (microseconds)
guarantee_steal_count 89 # Number of successful budget steals from siblings
guarantee_quota_usec 400000 # Configured guarantee quota (for reference)
guarantee_period_usec 1000000 # Configured guarantee period (for reference)
These counters are per-cgroup (aggregated across all CPUs). They are computed from
CbsCpuServer.consumed_ns (divided by 1000 for usec output) and
CbsCpuServer.nr_throttled_tasks, aggregated via a per-CPU scan when cpu.stat is read.
The scan is O(nr_cpus) and takes
the cgroup's config lock (not any runqueue lock) to snapshot the per-CPU servers.
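The aggregation scan can be sketched as follows (the snapshot struct is illustrative; in the kernel these fields live in `CbsCpuServer` and are read under the cgroup's config lock):

```rust
/// Per-CPU CBS server snapshot feeding the guarantee_* lines of cpu.stat.
pub struct CbsCpuServerStats {
    pub consumed_ns: u64,  // CBS budget consumed on this CPU
    pub throttled_ns: u64, // time spent CBS-throttled on this CPU
    pub nr_throttled: u64, // periods in which the CPU's share was exhausted
}

/// O(nr_cpus) aggregation done when cpu.stat is read. Counters are kept in
/// nanoseconds internally and converted to microseconds only at the
/// reporting boundary (matching the internal-ns convention above).
fn aggregate_guarantee_stats(servers: &[CbsCpuServerStats]) -> (u64, u64, u64) {
    let consumed_usec = servers.iter().map(|s| s.consumed_ns).sum::<u64>() / 1000;
    let throttled_usec = servers.iter().map(|s| s.throttled_ns).sum::<u64>() / 1000;
    let nr_throttled = servers.iter().map(|s| s.nr_throttled).sum::<u64>();
    (consumed_usec, throttled_usec, nr_throttled)
}
```

Summing in nanoseconds before the final division avoids accumulating a per-CPU truncation error in the reported totals.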
7.7 Power Budgeting¶
7.7.1 Problem¶
Datacenters in 2026 are power-wall limited. A rack has a fixed power budget (typically 20-40 kW). Power, not compute, is the scarce resource.
Linux has power management (cpufreq, DVFS, C-states, RAPL readout) but no power budgeting. There is no way to say "this container gets at most 150W total across CPU, GPU, memory, and NIC." There is no way for the scheduler to make holistic power-performance tradeoffs.
Relationship to Section 7.2 (Heterogeneous CPU / EAS): Section 7.2 covers Energy-Aware Scheduling at the per-task level — selecting the most energy-efficient core type (P-core vs E-core) for each task using OPP tables and PELT utilization. This section covers a complementary concern: per-cgroup power budgeting — enforcing total watt caps across all power domains (CPU + GPU + DRAM + NIC). The two mechanisms interact: EAS picks the optimal core, power budgeting enforces the envelope.
7.7.2 Design: Power as a Schedulable Resource¶
Power joins CPU time, memory, and accelerator time as a kernel-managed resource with cgroup integration.
// umka-core/src/power/budget.rs
/// Maximum number of power domains tracked by the power budgeting subsystem.
/// A typical datacenter server has:
/// - 1-2 CPU packages (CpuPackage)
/// - 8-128 CPU cores (CpuCore, if per-core RAPL is available)
/// - 1-2 DRAM controllers (Dram)
/// - 0-8 GPUs/accelerators (Accelerator)
/// - 1-4 NICs (Nic, if power-metered)
/// - 1-8 NVMe SSDs (Storage, if power-metered)
/// - 1 platform-level domain (Platform)
/// Setting 256 covers high-end servers with per-core monitoring enabled.
/// The ArrayVec avoids heap allocation on the tick hot path.
pub const MAX_POWER_DOMAINS: usize = 256;
/// Maximum number of cgroups tracked by the power budgeting subsystem.
/// Cgroups are typically hierarchical; a large server may have:
/// - 1 root cgroup
/// - 10-100 system.slice cgroups (systemd services)
/// - 10-1000 user.slice cgroups (user sessions, containers)
/// Setting 4096 covers large container hosts without excessive memory.
pub const MAX_POWER_CGROUPS: usize = 4096;
/// Platform-agnostic power domain.
///
/// This is the generic cross-architecture power domain object used by the
/// power-budget enforcer (Section 7.4.4) and cgroup power accounting
/// (Section 7.2.5). It identifies a device by `DeviceNodeId` and tracks
/// current and maximum power draw regardless of the underlying measurement
/// mechanism (RAPL, SCMI, ACPI, or estimation).
///
/// Contrast with `RaplDomain` (Section 7.2.2.5), which is x86/RAPL-specific
/// and carries a `RaplInterface` hardware handle. `GenericPowerDomain` is the
/// unified abstraction that upper layers use after the architecture-specific
/// boot driver has populated the `PowerDomainRegistry`.
pub struct GenericPowerDomain {
/// Domain identifier (matches device registry node).
pub device_id: DeviceNodeId,
/// Domain type.
pub domain_type: PowerDomainType,
/// Current power draw (milliwatts, updated every tick).
pub current_mw: AtomicU32,
/// Maximum power this domain can draw (TDP or configured limit).
/// Initialized from ACPI PPCC (Participant Power Control Capabilities)
/// tables where available; these define hardware power limits that
/// the OS must respect. Falls back to TDP from CPUID/ACPI otherwise.
pub max_mw: u32,
/// Current performance level (0 = lowest power, 100 = maximum).
pub perf_level: AtomicU32,
/// Power measurement source.
pub measurement: PowerMeasurement,
}
/// CPU packages are represented as `GenericPowerDomain` with
/// `device_id = INVALID_DEVICE_NODE_ID` (sentinel). The power domain identifies
/// the package by its ACPI/DT topology path, not as a `DeviceNode`. `PowerState`
/// transitions (D0→D3) are initiated via ACPI/SCMI methods, not device driver
/// callbacks.
// PowerDomainType: see canonical definition in
// [Section 7.4](#platform-power-management--kernel-abstraction).
// 7 variants: CpuPackage(0), CpuCore(1), Dram(2), Accelerator(3),
// Nic(4), Storage(5), Platform(6).
// Not duplicated here — the single canonical definition in
// platform-power-management.md is authoritative.
#[repr(u32)]
pub enum PowerMeasurement {
/// Intel RAPL (Running Average Power Limit) via MSR.
IntelRapl = 0,
/// AMD RAPL equivalent.
AmdRapl = 1,
/// ARM SCMI (System Control and Management Interface).
ArmScmi = 2,
/// ACPI Power Meter device.
AcpiPowerMeter = 3,
/// BMC/IPMI (out-of-band, lower frequency).
BmcIpmi = 4,
/// Estimated from utilization (no hardware meter).
Estimated = 5,
}
Per-architecture power measurement details:
Intel/AMD RAPL (x86):
- Read via MSR: IA32_PKG_ENERGY_STATUS (package), IA32_PP0_ENERGY_STATUS (cores),
IA32_DRAM_ENERGY_STATUS (DRAM), IA32_PP1_ENERGY_STATUS (GPU/uncore).
- Resolution: ~15.3 μJ per LSB (Intel), ~15.6 μJ (AMD).
- Read cost: ~100ns per MSR read. 6 domains × 100ns = 600ns per tick.
- Per-core RAPL (AMD Zen 2+ via `MSR_CORE_ENERGY_STAT`, MSR `0xC001_029A`):
per-CPU energy attribution. Enables precise per-cgroup power accounting without
proportional estimation. **Intel does not provide per-core energy counters** —
Intel RAPL PP0 is an all-cores aggregate for the entire package. On Intel
platforms, per-CPU energy is estimated proportionally from PP0 using utilization
weights (less precise than AMD's direct per-core counters).
- Overflow: 32-bit energy counters. Overflow interval depends on the CPU model's
energy unit and current power draw — it MUST be computed at runtime. At boot,
the kernel reads MSR_RAPL_POWER_UNIT to extract the energy unit (bits 12:8),
giving energy_unit_joules = 2^(-ESU) (e.g., ESU=14 → ~61 μJ on Haswell
and later (including Skylake), ESU=16 → ~15.3 μJ on Sandy Bridge / Ivy
Bridge (the architecture default)). The overflow interval is then:
overflow_seconds = 2^32 * energy_unit_joules / current_power_watts
For a 200W package with ESU=16: 2^32 * 15.3e-6 / 200 ≈ 329 seconds.
For a 500W package with ESU=14: 2^32 * 61e-6 / 500 ≈ 524 seconds.
The kernel sets the RAPL polling interval to min(overflow_seconds / 2, tick)
to guarantee no counter wraparound is missed. With a 4ms tick and typical
overflow intervals of 329-524 seconds, the formula always evaluates to `tick`
(4ms) on current hardware — the overflow margin is enormous. The calculation
is still performed at runtime (not assumed) to handle hypothetical hardware
where very high power draw or very coarse energy units could produce an
overflow interval shorter than twice the tick period.
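The runtime calculation above can be sketched directly (the MSR read itself is not modeled; `esu` stands in for bits 12:8 of MSR_RAPL_POWER_UNIT):

```rust
/// Compute the RAPL polling interval from the runtime-read energy unit:
/// overflow_seconds = 2^32 * 2^(-ESU) / current_power_watts,
/// poll = min(overflow_seconds / 2, tick). Mirrors the formula above.
fn rapl_poll_interval_s(esu: u32, current_power_watts: f64, tick_s: f64) -> f64 {
    let energy_unit_joules = 2f64.powi(-(esu as i32)); // 2^(-ESU)
    let overflow_seconds = 4294967296.0 * energy_unit_joules / current_power_watts;
    (overflow_seconds / 2.0).min(tick_s)
}
```

On current hardware (ESU=14 or 16, hundreds of watts) the overflow interval is hundreds of seconds and the `min` always selects the tick; a hypothetical very coarse energy unit at high power draw would instead shorten the poll interval below the tick.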
ARM SCMI (AArch64/ARMv7):
- SCMI (System Control and Management Interface, ARM DEN 0056) is a standardized
protocol for communication between the OS and a System Control Processor (SCP).
- Power domains are discovered via SCMI_POWER_DOMAIN_ATTRIBUTES (protocol 0x11, message 0x03).
- Power measurement: SCMI_SENSOR_READING_GET (protocol 0x15, message 0x06) reads sensor values
from the SCP. Sensor types include POWER (watts), ENERGY (joules), CURRENT (amps).
- Read cost: ~1-5 μs per SCMI message (shared memory + doorbell interrupt to SCP).
Higher than RAPL (~100ns) but still within the 4ms tick budget.
- Available on: ARM SBSA servers (AWS Graviton, Ampere), Cortex-M SCP-based
platforms, and any SoC implementing SCMI power management.
- Fallback: If SCMI is not available (e.g., simple embedded boards without SCP),
fall back to Estimated mode.
Power domain mapping:
SCMI domain ID → UmkaOS GenericPowerDomain:
- SCMI_POWER_DOMAIN type "CPU" → PowerDomainType::CpuPackage or CpuCore
- SCMI_POWER_DOMAIN type "GPU" → PowerDomainType::Accelerator
- SCMI_POWER_DOMAIN type "MEM" → PowerDomainType::Dram
- Platform-level SCMI sensor → PowerDomainType::Platform
RISC-V SBI PMU:
- RISC-V has no standard power measurement interface. The SBI PMU extension
(ratified) provides performance counters but not power counters.
- On platforms with BMC/IPMI (e.g., datacenter RISC-V): use BmcIpmi source.
- On platforms without any power measurement: use Estimated mode.
- Future: the RISC-V power management task group is defining power management
extensions. UmkaOS will adopt these when ratified.
Estimated (fallback for all architectures):
- When no hardware power meter is available, UmkaOS estimates power from:
a. CPU utilization × TDP per core (linear model, ~10% accuracy).
b. Frequency scaling: power ∝ V² × f. Frequency from cpufreq.
c. C-state residency: idle cores at deep C-states draw ~0.5-2W.
- Estimation runs in the scheduler tick handler (zero additional overhead).
- Accuracy: ±20-30% vs actual hardware measurement. Sufficient for coarse
power budgeting (e.g., "keep this rack under 30kW") but not for fine-grained
per-cgroup accounting.
- The estimated source is logged at boot:
umka: power: No hardware power meter detected, using estimated power model
umka: power: Estimated power accuracy: ±25%. Consider RAPL/SCMI hardware.
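The linear model (a) with the V²·f scaling from (b) can be sketched per core as follows (all parameter names are illustrative; ratios come from cpufreq/OPP data in the real path):

```rust
/// Estimated-power fallback for a single core: active power scales with
/// utilization × per-core TDP × (V/V_max)² × (f/f_max); an idle core
/// contributes only its C-state floor draw. ±20-30% accuracy per the text.
fn estimate_core_power_mw(
    util_pct: u32,      // 0..=100 from scheduler utilization
    core_tdp_mw: u32,   // per-core share of package TDP
    freq_ratio: f64,    // f / f_max from cpufreq
    volt_ratio: f64,    // V / V_max from the OPP table
    idle_floor_mw: u32, // deep C-state draw (~0.5-2 W per the text)
) -> u32 {
    if util_pct == 0 {
        return idle_floor_mw; // core parked in a deep C-state
    }
    let dynamic = core_tdp_mw as f64
        * (util_pct as f64 / 100.0)
        * volt_ratio * volt_ratio // P ∝ V²
        * freq_ratio;             //   × f
    dynamic as u32 + idle_floor_mw
}
```

For example, a core at 50% utilization, half frequency, and 0.8× voltage contributes roughly 16% of its TDP plus the idle floor.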
7.7.3 Cgroup Integration¶
/sys/fs/cgroup/<group>/power.max
#   Maximum total power budget for this cgroup (milliwatts).
#   Enforced across ALL power domains (CPU + GPU + memory + NIC).
#   Format: "150000" (150W)
#   "max" = no limit (default)
/sys/fs/cgroup/<group>/power.current
#   Current power draw by this cgroup (milliwatts, read-only).
#   Sum of all power domains attributed to this cgroup's processes.
/sys/fs/cgroup/<group>/power.stat
#   Power statistics:
#     energy_uj <total energy consumed in microjoules>
#     throttle_count <times power budget was exceeded and throttled>
#     throttle_us <total microseconds spent throttled>
#     avg_power_mw <average power over last 10 seconds>
/sys/fs/cgroup/<group>/power.weight
#   Relative share of excess power budget (like cpu.weight).
#   Default: 100. Higher = more power when contended.
/sys/fs/cgroup/<group>/power.domains
#   Per-domain power limits (optional, for fine-grained control).
#   Format: "cpu 80000 gpu 60000 dram 10000"
#   If not set, the global power.max is split by the kernel
#   based on workload demand.
7.7.4 Power-Aware Scheduler¶
// umka-core/src/sched/power.rs
const MAX_POWER_CGROUPS: usize = 4096; // Must be power of two
/// A fixed-capacity open-addressing hash map with compile-time maximum size.
/// Uses power-of-two capacity with linear probing. Never allocates — backed
/// by a static or slab-allocated array. Insertion returns Err if at capacity.
/// N must be a power of two; load factor is capped at 75% (0.75 * N entries).
///
/// Hash function: FxHash (Firefox's fast integer hash — multiply by a golden
/// ratio constant, shift right). FxHash is ideal for small integer keys
/// (CgroupId, DeviceId) and has no allocation or state. FxHash is not
/// DoS-resistant, but this is safe because the keys (cgroup IDs) are
/// kernel-assigned integers, not user-controlled. For string keys or
/// user-controlled inputs, SipHash-1-3 is used (DoS-resistant).
///
/// Collision resolution: linear probing with step size 1. At 75% load factor
/// and power-of-two sizing, expected probe length is ~2 (Birthday paradox
/// bound). At capacity (75% of N), insert returns `Err(MapFull)`.
///
/// The map is pre-allocated at CBS initialization with capacity
/// `max_concurrent_cbs_tasks × 2` (load factor 0.5), where
/// `max_concurrent_cbs_tasks` is derived from the system's admission-control
/// limit ([Section 7.6](#cpu-bandwidth-guarantees)). Pre-allocation guarantees that no insertion fails
/// during the tick hot path as long as the admitted task count does not exceed
/// the limit enforced at `cpu.guarantee` write time.
///
/// On the rare case of map overflow (should not occur with correct
/// pre-allocation; indicates a kernel bug or admission-control bypass):
/// charge `budget_remaining_ns = 0` for the current tick as a conservative
/// fallback — the task is considered to have exhausted its budget for this
/// tick. This is safe: it errs on the side of throttling the task rather than
/// allowing unaccounted CPU consumption, preserving CBS bandwidth isolation
/// guarantees. A kernel warning is emitted unconditionally on overflow
/// (not gated on debug_assertions) since this path should never be reached.
/// **Removed**: `FixedHashMap` replaced by XArray below. CgroupId is an integer
/// key — per collection policy, integer-keyed mappings must use XArray.
/// Power budget enforcer.
/// Runs at scheduler tick frequency (~4ms) on a dedicated kthread —
/// NOT per-scheduling-decision, and not inline on the tick path.
pub struct PowerBudgetEnforcer {
/// Power domains on this machine.
/// Populated at boot from ACPI/DT hardware discovery. The number of power
/// domains is small and bounded (typically <=16: package + cores + DRAM +
/// accelerators). Uses a fixed-capacity array sized to MAX_POWER_DOMAINS.
domains: ArrayVec<GenericPowerDomain, MAX_POWER_DOMAINS>,
/// Per-cgroup power accounting.
/// XArray keyed by CgroupId (u64, integer key — per collection policy).
/// O(1) lookup at tick frequency (~4ms). Maximum cgroup count bounded by
/// the cgroup hierarchy (typically <1024 active cgroups, max MAX_POWER_CGROUPS).
/// Entries inserted on cgroup creation (warm path), removed on cgroup deletion.
cgroup_power: XArray<CgroupPowerState>,
/// Global power budget (rack-level, from BMC or admin-configured).
global_budget_mw: Option<u32>,
}
pub struct CgroupPowerState {
/// Budget for this cgroup (from power.max).
budget_mw: u32,
/// Current attributed power draw.
current_mw: u32,
/// Running energy counter (microjoules).
energy_uj: u64,
/// Is this cgroup currently throttled?
throttled: bool,
/// Throttle mechanism:
/// 1. Reduce CPU frequency (cpufreq) for this cgroup's cores.
/// 2. Reduce accelerator clock (AccelBase set_performance_level).
/// 3. As last resort: CPU throttling (delay scheduling).
/// At most one action per PowerDomainType variant (7 variants) plus the
/// CpuThrottle fallback, so bounded to 8.
/// Using 8 instead of MAX_POWER_DOMAINS (256) saves ~3 KB per cgroup entry.
throttle_actions: ArrayVec<ThrottleAction, 8>,
}
pub enum ThrottleAction {
/// Reduce CPU frequency to this level (MHz).
CpuFrequency(u32),
/// Reduce accelerator performance level.
AccelPerformance { device_id: DeviceNodeId, level: u32 },
/// Throttle CPU time (insert idle cycles).
CpuThrottle { duty_cycle_percent: u32 },
}
Enforcement flow:
Every scheduler tick (~4ms):
1. Read power counters from all domains (1 MSR read per domain, ~100ns each).
2. Attribute power to cgroups based on CPU time + accelerator time share.
Note: power attribution is an APPROXIMATION. RAPL gives package-level
power, not per-process. Attribution model:
Per-core RAPL (AMD Zen 2+): precise per-CPU attribution. Intel: proportional.
Package-level RAPL: proportional to (cgroup CPU time / total CPU time)
weighted by frequency at time of execution.
Accelerator: proportional to AccelBase get_utilization() per cgroup.
This is the same limitation as Linux (perf energy-cores event).
Exact per-process power metering requires hardware not yet available.
3. For each cgroup:
a. Is current_mw > budget_mw?
b. Yes → rebuild throttle_actions array (see selection algorithm below).
c. No → clear throttle_actions and release any active throttles.
4. Total overhead analysis:
- RAPL domain reads: up to 32 domains × ~100ns MSR read = ~3.2 μs.
- Cgroup budget checks: up to MAX_POWER_CGROUPS (4096) × ~20ns = ~82 μs.
- Worst case: ~85 μs per tick = 2.1% of a 4ms tick.
- Typical case (8 domains, 64 cgroups): ~2 μs = 0.05% of a 4ms tick.
The worst case is acceptable: power budgeting runs on a dedicated
kthread (not on scheduler tick path), and 4096 power-budgeted cgroups
is an extreme configuration. Systems with >256 power-budgeted cgroups
should increase the poll interval to 8ms via
`umka.power_budget_interval_ms`.
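The proportional attribution in step 2 can be sketched as follows (names are illustrative; the frequency weighting is assumed to be folded into the caller-supplied weighted times):

```rust
/// Proportional attribution of package-level power: a cgroup's share of
/// the package RAPL reading is its frequency-weighted CPU time divided by
/// the total weighted time in the same measurement window. This is the
/// approximation discussed above — package RAPL is not per-process.
fn attribute_package_power_mw(
    package_mw: u32,
    cgroup_weighted_ns: u64,
    total_weighted_ns: u64,
) -> u32 {
    if total_weighted_ns == 0 {
        return 0; // idle window: nothing to attribute
    }
    ((package_mw as u64 * cgroup_weighted_ns) / total_weighted_ns) as u32
}
```

Because each cgroup's share is a fraction of the same totals, the attributed values sum to at most the package reading, so per-cgroup budgets never account for more power than was measured.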
Throttle action selection algorithm (step 3b):
Input: excess_mw = current_mw - budget_mw for the cgroup.
Output: throttle_actions filled in priority order.
Step 1 — CPU frequency reduction:
For each CPU frequency domain that contains CPUs running this cgroup's tasks:
current_pstate = cpufreq_get_current_pstate(domain)
If current_pstate > PSTATE_MIN:
Add ThrottleAction::CpuFrequency(pstate_to_mhz(current_pstate - 1))
Estimated power reduction: (current_mw * pstate_freq_ratio_drop) mw
If estimated reduction ≥ excess_mw: stop here (frequency alone is enough).
Step 2 — Accelerator clock reduction:
Only if remaining_excess_mw > 0 after Step 1.
For each accelerator context used by this cgroup:
current_level = accel_vtable.get_performance_level(device_id)
If current_level > 0:
Add ThrottleAction::AccelPerformance { device_id, level: current_level - 1 }
If estimated reduction ≥ remaining_excess_mw: stop here.
Step 3 — CPU duty-cycle throttle (last resort):
Only if remaining_excess_mw > 0 after Steps 1 and 2.
duty_cycle = max(50, 100 - (remaining_excess_mw * 100 / current_mw))
Add ThrottleAction::CpuThrottle { duty_cycle_percent: duty_cycle }
Invariants:
- throttle_actions is rebuilt from scratch every tick (no incremental state).
- At most one CpuFrequency entry per frequency domain.
- At most one AccelPerformance entry per device_id.
- At most one CpuThrottle entry (covers all CPUs for this cgroup).
- Array capacity 8 is sufficient: at most one entry per PowerDomainType (7 types)
plus the CpuThrottle fallback = 8 maximum.
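The three-step selection can be sketched with scalar stand-ins for the cpufreq and accelerator queries (`est_freq_drop_mw` / `est_accel_drop_mw` model the estimated reduction of one step; the `0` placeholders stand in for the real pstate MHz and device id, which are queried from hardware in the kernel path):

```rust
#[derive(Debug, PartialEq)]
enum ThrottleAction {
    CpuFrequency(u32),
    AccelPerformance { device_id: u64, level: u32 },
    CpuThrottle { duty_cycle_percent: u32 },
}

/// Rebuild throttle_actions from scratch for one cgroup (step 3b above):
/// frequency first, then accelerator clocks, then duty-cycle as last resort.
fn select_throttle_actions(current_mw: u32, budget_mw: u32,
                           est_freq_drop_mw: u32, can_drop_pstate: bool,
                           est_accel_drop_mw: u32, accel_level: u32)
                           -> Vec<ThrottleAction> {
    let mut actions = Vec::new();
    let mut excess = current_mw.saturating_sub(budget_mw);
    if excess == 0 { return actions; }
    // Step 1: CPU frequency reduction.
    if can_drop_pstate {
        actions.push(ThrottleAction::CpuFrequency(0 /* next-lower pstate MHz */));
        excess = excess.saturating_sub(est_freq_drop_mw);
    }
    // Step 2: accelerator clock reduction, only if still over budget.
    if excess > 0 && accel_level > 0 {
        actions.push(ThrottleAction::AccelPerformance { device_id: 0, level: accel_level - 1 });
        excess = excess.saturating_sub(est_accel_drop_mw);
    }
    // Step 3: duty-cycle throttle as last resort, floored at 50%.
    if excess > 0 {
        let duty = (100 - (excess as u64 * 100 / current_mw as u64) as u32).max(50);
        actions.push(ThrottleAction::CpuThrottle { duty_cycle_percent: duty });
    }
    actions
}
```

Early exit after each step keeps the action list minimal: the duty-cycle fallback only appears when frequency and accelerator reductions cannot cover the excess.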
7.7.5 System-Level Power Accounting¶
/sys/kernel/umka/power/
energy_total_uj # Total system energy since boot (microjoules)
budget_mw # System-wide power budget (admin-set)
Carbon policy is NOT the kernel's job. The kernel measures watts and enforces watt budgets. Carbon intensity depends on grid mix, geography, time of day, renewable contracts — all external to the machine. Orchestrators (Kubernetes, Nomad, custom fleet managers) can read
energy_total_uj and compute carbon externally. This is the correct separation of concerns: the kernel provides accurate power telemetry, userspace applies policy.
EAS feedback: When PowerBudgetEnforcer throttles a power domain (CPU package, DRAM), it updates the EAS capacity table: throttled cores have reduced capacity_dmips. The scheduler's EAS path reads this capacity on every task placement decision (Section 7.2). Feedback latency: ~1ms (enforcer runs at tick frequency, capacity update is an atomic store).
Thermal Throttling Coordination:
When the hardware thermal throttle engages, the power budget enforcer backs off to
prevent double-throttling. Detection is architecture-specific:
- x86: MSR IA32_THERM_STATUS (PROCHOT assertion) or ACPI thermal zone events.
- AArch64/ARMv7: SCMI thermal notifications from SCP, or ACPI thermal zones on
SBSA-compliant servers. On DT-based platforms: thermal zone DT nodes with trip points.
- RISC-V: Platform-specific (BMC/IPMI thermal events, or DT thermal zones).
If hardware is already throttling a domain, the kernel does not apply additional software throttling to that domain — doing so would reduce performance below what the thermal situation requires. The kernel logs the thermal event and adjusts its power model to account for reduced headroom.
Hardware/software throttle coordination: To prevent double-throttling during the
detection window, the power enforcer reads the hardware throttle status BEFORE
applying its own throttle. On x86, IA32_THERM_STATUS bit 0 (PROCHOT active) and
IA32_PACKAGE_THERM_STATUS indicate active hardware throttling. On ARM, SCMI
notifications deliver thermal events asynchronously. The coordination protocol:
1. Before each enforcement tick, read hardware throttle status.
2. If hardware throttling is active, skip software throttling for this tick
(hardware is already reducing power draw).
3. If hardware throttling was active on the previous tick but is now inactive,
re-evaluate software throttle based on current power measurement (the RAPL
or SCMI reading now reflects the hardware-throttled power level).
4. Race window: between hardware engaging thermal throttle and the next
enforcement tick (~4 ms worst case), both throttles may be active
simultaneously. This is safe — double-throttling reduces performance
temporarily but does not cause correctness issues. The next tick detects
the hardware throttle and removes the software throttle.
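The four-step protocol reduces to a small per-tick state machine. The sketch below is illustrative: the boolean inputs stand in for the arch-specific hardware status reads (IA32_THERM_STATUS, SCMI notifications) and the RAPL/SCMI budget comparison.

```rust
/// Tracks hardware throttle state across enforcement ticks so the
/// software throttle can back off (step 2) and re-evaluate (step 3).
struct ThrottleCoordinator {
    hw_throttle_was_active: bool,
    sw_throttle_active: bool,
}

impl ThrottleCoordinator {
    /// One enforcement tick. Returns true if software throttling was
    /// (re)evaluated this tick, false if it was skipped because the
    /// hardware is already throttling.
    fn enforcement_tick(&mut self, hw_throttle_active: bool, over_budget: bool) -> bool {
        let evaluated = if hw_throttle_active {
            // Step 2: hardware already reducing power draw -- skip
            // software throttling for this tick.
            self.sw_throttle_active = false;
            false
        } else if self.hw_throttle_was_active {
            // Step 3: hardware throttle just released -- re-evaluate
            // from the current (now unthrottled) power measurement.
            self.sw_throttle_active = over_budget;
            true
        } else {
            // Normal path: throttle iff over budget.
            self.sw_throttle_active = over_budget;
            true
        };
        self.hw_throttle_was_active = hw_throttle_active;
        evaluated
    }
}
```

The step-4 race window needs no code: if both throttles briefly overlap, the next call observes `hw_throttle_active == true` and clears the software throttle.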
7.7.5.1 Thermal Passive Cooling — EAS Capacity Update¶
When thermal throttling reduces a CPU's operating frequency (passive cooling), the
scheduler's energy-aware placement decisions become incorrect unless the per-CPU
CpuCapacity.capacity value is updated to reflect the reduced throughput. Without
this feedback, EAS may place tasks on a thermally throttled core believing it has
full capacity, leading to both missed performance targets and further thermal
escalation.
The thermal framework notifies the scheduler via thermal_update_capacity() whenever
a thermal zone's passive cooling governor reduces a CPU's maximum frequency:
/// Called by the thermal framework when passive cooling reduces or restores
/// a CPU's maximum operating frequency. Updates the EAS capacity model so
/// the scheduler accounts for the reduced throughput.
///
/// # Arguments
///
/// * `cpu` — The CPU whose capacity changed.
/// * `throttled_freq_khz` — The new maximum frequency allowed by the thermal
/// governor (in kHz). If the thermal constraint is lifted, this equals the
/// CPU's original `max_freq`.
///
/// # Effect
///
/// Recomputes `CpuCapacity.capacity` for the target CPU:
/// new_capacity = capacity_max × (throttled_freq_khz / max_freq_khz)
///
/// The `capacity_max` field (boot-time maximum at the CPU's highest OPP)
/// is unchanged. Only the `capacity` field (current maximum available to the
/// scheduler) is reduced. `capacity_curr` continues to track the actual
/// instantaneous frequency set by the cpufreq governor (which is now clamped
/// to at most `throttled_freq_khz`).
///
/// This function also updates the cpufreq policy's `max` frequency, preventing
/// the governor from requesting a frequency above the thermal limit.
///
/// # Integration with cpufreq
///
/// The thermal governor calls `thermal_update_capacity()` BEFORE calling
/// `cpufreq_update_policy()`. This ordering ensures that:
/// 1. The scheduler sees the reduced capacity before the next `pick_next_task`.
/// 2. The cpufreq governor sees the clamped `policy.max` before the next
/// frequency decision.
///
/// # Performance
///
/// Called only when thermal trip points are crossed (rare — typically once
/// per thermal event, not per tick). The capacity recalculation is O(1)
/// per affected CPU.
pub fn thermal_update_capacity(cpu: CpuId, throttled_freq_khz: u32) {
    let cap = per_cpu!(cpu_capacity, cpu);
    // new_capacity = capacity_max * (throttled_freq / max_freq),
    // clamped to capacity_max so a restore to max_freq lands exactly
    // on the boot-time maximum. (Scaling by capacity_max, not a fixed
    // 1024, keeps the formula correct on heterogeneous cores whose
    // capacity_max is below 1024.)
    let scaled = cap.capacity_max as u64 * throttled_freq_khz as u64
        / cap.max_freq_khz as u64;
    let new_cap = core::cmp::min(scaled as u32, cap.capacity_max);
    cap.capacity.store(new_cap, Ordering::Release);
    // Clamp cpufreq policy max to prevent the governor from exceeding
    // the thermal limit.
    if let Some(policy) = cpufreq_get_policy(cpu) {
        policy.max_freq_khz.store(throttled_freq_khz, Ordering::Release);
    }
}
Thermal → EAS feedback path:
Thermal zone poll (every thermal_polling_delay_ms):
→ zone temperature exceeds passive trip point
→ thermal governor reduces CPU freq: cpufreq_cooling_set_max(freq_khz)
→ cpufreq_cooling_set_max calls thermal_update_capacity(cpu, freq_khz)
→ CpuCapacity.capacity reduced proportionally
→ EAS next wakeup sees reduced capacity → avoids placing tasks on throttled core
→ Schedutil sees clamped policy.max → does not request frequency above thermal limit
When the thermal zone cools below the trip point, the governor restores the original
frequency, and thermal_update_capacity() is called with the CPU's original
max_freq_khz, restoring CpuCapacity.capacity to capacity_max.
ML Policy → EAS Closed-Loop Feedback Protocol:
The ML policy framework (Section 23.1) provides predictive power management by feeding power telemetry into the EAS (Energy-Aware Scheduling) task placement engine. The closed-loop protocol:
1. Power telemetry collection (every scheduler tick, ~4ms):
→ PowerDomain.current_watts read from RAPL/SCMI/OCC
→ Per-CPU utilization from CFS load tracking
→ Thermal zone temperature from thermal polling
2. ML policy inference (every policy_inference_interval_ms, default 100ms):
→ Input: [power_watts, utilization, temperature, freq_khz] per domain
→ Output: PowerPolicyAction { target_freq_khz, capacity_headroom_pct }
→ Inference runs in the ML policy kthread (SCHED_NORMAL, nice 5)
3. EAS feedback application (immediate, on ML policy output):
→ If target_freq_khz < current_freq_khz:
cpufreq_cooling_set_max(target_freq_khz) // same path as thermal
thermal_update_capacity(cpu, target_freq_khz)
→ capacity_headroom_pct adjusts EAS's energy_threshold:
energy_threshold = base_threshold * (100 + capacity_headroom_pct) / 100
(Higher headroom → EAS more willing to use higher-power cores)
4. Observation (next telemetry collection):
→ ML policy observes the effect of its previous action
→ Adjusts next inference based on actual power/thermal response
→ Convergence: within 3-5 inference cycles (~300-500ms)
Safety bound: The ML policy cannot set target_freq_khz below
policy.min_freq_khz (hardware minimum) or above the thermal governor's
current limit. It operates strictly within the envelope defined by the
thermal governor and cpufreq driver constraints.
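The headroom-to-threshold mapping in step 3 is a one-liner. The sketch below treats `capacity_headroom_pct` as a signed percentage (an assumption not stated above: positive headroom relaxes the threshold, negative tightens it) and clamps the result at zero.

```rust
/// energy_threshold = base_threshold * (100 + capacity_headroom_pct) / 100
///
/// Higher headroom makes EAS more willing to place tasks on
/// higher-power cores; the zero clamp guards against a pathological
/// headroom below -100%.
fn adjusted_energy_threshold(base_threshold: u64, capacity_headroom_pct: i64) -> u64 {
    ((base_threshold as i64 * (100 + capacity_headroom_pct)) / 100).max(0) as u64
}
```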
Battery Systems:
Power budgeting for battery-powered systems (laptops, edge devices) is out of scope for v1. Battery charge level, discharge rate, and remaining runtime are platform-management concerns handled by ACPI/UPower in userspace. The power budgeting system provides the watt-level telemetry that battery management software can consume, but does not implement battery-specific policies.
7.7.6 Performance Impact¶
Per-architecture overhead per scheduler tick (~4ms):
| Architecture | Read mechanism | Cost per domain | 6-domain system | Overhead |
|---|---|---|---|---|
| x86 (RAPL) | MSR read | ~100ns | 600ns | 0.015% |
| AArch64 (SCMI) | SCP mailbox | ~1-5 μs | 6-30 μs | 0.15-0.75% |
| ARMv7 (SCMI) | SCP mailbox | ~1-5 μs | 6-30 μs | 0.15-0.75% |
| RISC-V (Estimated) | Calculation | ~50ns | 300ns | 0.008% |
| PPC32 (Estimated) | Calculation | ~50ns | 300ns | 0.008% |
| PPC64LE (OCC) | OPAL sensor read | ~1-5 μs | 6-30 μs | 0.15-0.75% |
| Any (BMC/IPMI) | OOB polling | ~10-50 μs | 60-300 μs | 0.006-0.03% (rate-limited to 1/s) |
SCMI overhead is higher than RAPL but still well within budget. For BMC/IPMI sources, the kernel rate-limits reads to 1 per second (not per tick) to avoid I2C/IPMI bus saturation, using the last-read value for inter-read ticks. The overhead percentage for BMC/IPMI reflects amortization over the 1-second read interval (60-300 μs / 1s), not per-tick cost.
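The amortization figures can be checked directly: overhead is read cost divided by read interval.

```rust
/// Overhead percentage of a periodic sensor read amortized over its
/// read interval: (cost / interval) * 100.
fn amortized_overhead_pct(read_cost_us: f64, interval_s: f64) -> f64 {
    read_cost_us / (interval_s * 1_000_000.0) * 100.0
}
```

A 60-300 μs BMC/IPMI read once per second amortizes to 0.006-0.03%, matching the table; the same read taken every 4 ms tick would cost 1.5-7.5%, which is why the rate limit exists.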
When power throttling is active: performance reduction is intentional and configured. It replaces uncontrolled thermal throttling (which is worse — it's sudden and undifferentiated).
When power throttling is NOT active: zero overhead beyond the power reads.
7.8 Timekeeping and Clock Management¶
Accurate, low-latency timekeeping is foundational: the scheduler needs monotonic
timestamps for CBS deadlines (Section 7.6), real-time tasks need bounded timer
latency (Section 8.4), and userspace applications call clock_gettime() millions
of times per second. This section describes how UmkaOS reads hardware clocks,
maintains system time, exposes fast timestamps to userspace, and manages timer
events.
7.8.1 Clock Source Hierarchy¶
Each architecture provides one or more hardware cycle counters. UmkaOS selects the best available source at boot and can switch at runtime if a source proves unstable (Section 7.8).
| Architecture | Primary Source | Secondary | Resolution | Access |
|---|---|---|---|---|
| x86-64 | TSC (Time Stamp Counter) | HPET, ACPI PM Timer | sub-ns | `rdtsc` (user/kernel) |
| AArch64 | Generic Timer (`CNTPCT_EL0`) | — | typically 1-10 ns | `mrs` (EL0 if enabled) |
| ARMv7 | Generic Timer (CNTPCT via cp15) | — | typically 1-10 ns | `mrc` (PL0 if enabled) |
| RISC-V | `mtime` (MMIO) | `rdtime` CSR | implementation-defined | `rdtime` (U-mode) |
| PPC32 | Timebase (TBL/TBU) | Decrementer (DEC) | typically 1-10 ns | `mftb` / `mfspr` |
| PPC64LE | Timebase (TB) | Decrementer (DEC) | sub-ns (POWER9: 512 MHz) | `mftb` (user/kernel) |
| s390x | TOD (Time-of-Day) clock | — | sub-ns (1.024 GHz native) | STCK / STCKE (all privilege levels) |
| LoongArch64 | Stable Counter | — | implementation-defined (freq from CPUCFG) | RDTIME (user/kernel) |
x86-64 notes: Modern processors (Intel Nehalem+, AMD Zen+) provide an
invariant TSC that runs at a constant rate regardless of frequency scaling or
C-state transitions. CPUID leaf 0x8000_0007 EDX bit 8 advertises this. When
invariant TSC is available, it is the preferred source: zero-cost reads
(rdtsc is unprivileged), sub-nanosecond resolution, and monotonicity
guaranteed across cores. When invariant TSC is absent, UmkaOS falls back to HPET
(~100 ns read latency, MMIO) or the ACPI PM Timer (~800 ns read latency,
port I/O).
AArch64 / ARMv7 notes: The ARM Generic Timer is architecturally defined and
always present. The kernel configures CNTKCTL_EL1 to allow EL0 (userspace)
reads of CNTPCT_EL0, enabling a vDSO fast path identical in spirit to x86
rdtsc.
RISC-V notes: The rdtime pseudo-instruction reads the platform-provided
real-time counter. Frequency is discoverable from the device tree
(timebase-frequency property). Resolution varies by implementation.
s390x notes: The TOD (Time-of-Day) clock is a 104-bit architecturally
defined clock running at 1.024 GHz (bit 51 = 1 microsecond). STCK reads the
upper 64 bits (sufficient for sub-nanosecond resolution); STCKE reads the
full 128-bit extended format. The TOD clock is continuous across all CPU states
and synchronized across all CPUs in a configuration via the STP (Server Time
Protocol) facility. No secondary source is needed — the TOD clock is the sole
timekeeping mechanism on s390x.
LoongArch64 notes: The Stable Counter is a fixed-frequency counter
accessible via the RDTIME instruction from any privilege level. The counter
frequency is discoverable at boot from the CPUCFG instruction (register
0x4, CC_FREQ field) or from the device tree. Like ARM's Generic Timer,
it provides a uniform timekeeping interface independent of CPU frequency
scaling.
All clock sources implement a common abstraction:
// umka-core/src/time/clocksource.rs
/// Hardware clock source abstraction.
/// Implementations are per-architecture; the best source is selected at boot.
pub trait ClockSource: Send + Sync {
/// Read the current cycle count from hardware.
fn read_cycles(&self) -> u64;
/// Nominal frequency of this clock source in Hz.
fn frequency_hz(&self) -> u64;
/// Quality rating: higher values are preferred when multiple sources exist.
/// TSC invariant = 350, HPET = 250, ACPI PM Timer = 100.
fn rating(&self) -> u32;
/// Whether this source continues counting through CPU sleep states.
fn is_continuous(&self) -> bool;
/// Upper bound on single-read uncertainty in nanoseconds.
/// Accounts for read latency and synchronization jitter.
fn uncertainty_ns(&self) -> u32;
}
At boot, umka-core enumerates available sources, sorts by rating(), and
activates the highest-rated continuous source. The secondary source (if any) is
retained for watchdog cross-validation (Section 7.8).
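The boot-time selection can be sketched as follows. A simplified struct stands in for `dyn ClockSource` trait objects; the ratings match the examples in the trait doc comment.

```rust
/// Simplified stand-in for a ClockSource trait object: just the fields
/// the boot-time selection logic needs.
struct Source {
    name: &'static str,
    rating: u32,     // ClockSource::rating()
    continuous: bool, // ClockSource::is_continuous()
}

/// Sort by rating (descending) and activate the highest-rated
/// continuous source; the next-best continuous source is retained as
/// the watchdog secondary.
fn select_clocksource(
    mut sources: Vec<Source>,
) -> (Option<&'static str>, Option<&'static str>) {
    sources.sort_by(|a, b| b.rating.cmp(&a.rating));
    let mut candidates = sources.into_iter().filter(|s| s.continuous);
    (
        candidates.next().map(|s| s.name), // primary
        candidates.next().map(|s| s.name), // watchdog secondary
    )
}
```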
7.8.2 Timekeeping Subsystem¶
UmkaOS maintains four clocks, matching POSIX semantics:
| Clock | Semantics | Adjustable? |
|---|---|---|
| `CLOCK_MONOTONIC` | Time since boot, NTP-adjusted rate | No (monotonic) |
| `CLOCK_MONOTONIC_RAW` | Time since boot, raw hardware rate | No |
| `CLOCK_REALTIME` | Wall clock (UTC), NTP-adjusted | Yes (`clock_settime`, NTP) |
| `CLOCK_BOOTTIME` | Like `CLOCK_MONOTONIC` but includes suspend time | No |
Timestamp representation: All internal timestamps use a (seconds: u64,
nanoseconds: u64) tuple. Both fields are 64-bit to avoid overflow in
intermediate arithmetic (nanoseconds may temporarily exceed 10^9 during
computation and are normalized before storage).
Global timekeeper state is protected by a seqlock — the same pattern used in
Linux timekeeping.c. Readers (including the vDSO) retry if they observe a torn
update. Writers (the timer interrupt handler) are serialized by holding the
seqlock write side.
// umka-core/src/time/timekeeper.rs
/// Global timekeeping state, updated on every tick or clocksource event.
pub struct Timekeeper {
pub seq: SeqLock, // seqlock ([Section 3.6](03-concurrency.md#lock-free-data-structures--seqlockt-sequence-lock)) protecting all fields
pub clock: &'static dyn ClockSource, // active clock source
pub cycle_last: u64, // last cycle count at update
pub mask: u64, // counter wrap bitmask
pub mult: u32, // ns = (cycles * mult) >> shift
pub shift: u32,
pub wall_sec: u64, // CLOCK_REALTIME
pub wall_nsec: u64,
pub mono_sec: u64, // CLOCK_MONOTONIC
pub mono_nsec: u64,
pub boot_offset_sec: u64, // CLOCK_BOOTTIME delta
pub boot_offset_nsec: u64,
pub freq_adj: i64, // NTP/PTP frequency correction (scaled ppm)
pub phase_adj: i64, // NTP/PTP phase correction (ns)
}
NTP/PTP discipline: An adjtimex()-compatible interface accepts frequency
and phase corrections from userspace NTP or PTP daemons. Frequency adjustment
modifies mult slightly so that cycles-to-nanoseconds conversion drifts at the
requested rate. Phase adjustment is applied as a slew (at most 500 ppm rate
adjustment) to avoid wall clock jumps. CLOCK_MONOTONIC_RAW is immune to both
adjustments — it reflects raw hardware cycles converted at the nominal rate.
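The `mult`/`shift` conversion that NTP discipline perturbs works as follows. This is a sketch with hypothetical helper names; the real registration code also chooses `shift` to balance precision against intermediate overflow for the expected maximum delta.

```rust
/// ns = (cycles * mult) >> shift, computed in 128-bit so the
/// intermediate product cannot overflow for large cycle deltas.
fn cycles_to_ns(cycles: u64, mult: u32, shift: u32) -> u64 {
    ((cycles as u128 * mult as u128) >> shift) as u64
}

/// Derive mult for a source frequency at a chosen shift, as done when
/// a clocksource is registered: mult = (1e9 << shift) / freq_hz.
/// NTP frequency discipline then nudges this value by scaled ppm.
fn compute_mult(freq_hz: u64, shift: u32) -> u32 {
    ((1_000_000_000u128 << shift) / freq_hz as u128) as u32
}
```

For a 1 GHz counter with `shift = 10`, `mult` comes out to exactly 1024 and one cycle converts to exactly one nanosecond; for non-power-of-two frequencies the conversion carries a small truncation error that the periodic `cycle_last` rebase keeps bounded.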
7.8.3 vDSO Fast Path¶
Linux problem: clock_gettime() is the most frequently invoked syscall in
many workloads (databases, trading systems, telemetry). A kernel entry costs
~100-200 ns due to mode switch, KPTI page table reload, and speculative
execution mitigations. At millions of calls per second, this adds up.
UmkaOS design: Like Linux, UmkaOS maps a vDSO (virtual Dynamic Shared Object)
into every process's address space. The vDSO contains userspace implementations
of clock_gettime(), gettimeofday(), and time() that read the hardware
clocksource directly and apply precomputed conversion parameters — no syscall
needed.
The kernel maintains a read-only shared page (VvarPage,
Section 2.22) that it
updates on every timer tick and on NTP adjustments. Userspace vDSO code reads
this page under seqlock (Section 3.6) protection.
The canonical VvarPage struct definition is in
Section 2.22. Key fields
for timekeeping:
| Field | Type | Description |
|---|---|---|
| `seq` | `u32` | Seqlock counter (odd = update in progress, even = stable). ABI-constrained u32; wraps in ~24.8 days at 1000 Hz but seqlock parity check makes wrap safe. |
| `clock_mode` | `u32` | Active clocksource (TSC, HPET, Generic Timer, ...) |
| `cycle_last` | `u64` | Cycle count at last kernel update |
| `mask` / `mult` / `shift` | `u64` / `u32` / `u32` | NTP-adjusted conversion parameters |
| `clock_realtime_sec` / `_nsec` | `u64` | CLOCK_REALTIME base |
| `clock_monotonic_ns` | `u64` | CLOCK_MONOTONIC base (nanoseconds) |
| `clock_tai_offset_sec` | `i64` | TAI - UTC offset (typically 37 s as of 2024). CLOCK_TAI = CLOCK_REALTIME + this offset. Updated on leap second events. |
| `clock_boottime_sec` / `_nsec` | `u64` | CLOCK_BOOTTIME base (includes suspend time) |
| `monotonic_raw_sec` / `_nsec` | `u64` | CLOCK_MONOTONIC_RAW (immune to NTP adjustments) |
| `raw_mult` / `raw_shift` | `u32` | Nominal (non-NTP-adjusted) conversion parameters |
vDSO read path (userspace, per-architecture):
1. Read `seq`. If odd, spin (kernel is mid-update). Increment `retry_count`.
2. Read `cycle_last`, `mult`, `shift`, `mask`, and the relevant base time.
3. Read the hardware counter (`rdtsc` / `mrs CNTPCT_EL0` / `rdtime`).
4. Compute `delta = (now - cycle_last) & mask`.
5. Compute `ns = base_nsec + (delta * mult) >> shift`. Normalize into seconds.
6. Re-read `seq`. If it changed:
   - If `retry_count < 100`: go to step 1.
   - If `retry_count >= 100`: fall back to the `clock_gettime()` syscall. This matches Linux vDSO behavior and prevents indefinite spinning when the kernel performs sustained timekeeper updates (e.g., NTP slew adjustment, clocksource switch, or a pathological interrupt storm holding the timekeeper seqlock for extended periods). The 100-iteration threshold is conservative: normal contention resolves within 1-3 retries; reaching 100 indicates a systemic issue rather than transient contention.
Cost: ~5-20 ns depending on architecture (dominated by the clocksource read instruction itself). This is 10-40x faster than a syscall path. The syscall fallback path costs ~100-200 ns but is taken only under extreme contention.
Fallback: If clock_mode indicates no userspace-readable source is
available (e.g., HPET on x86, which requires MMIO the kernel has not mapped
into user address space), the vDSO falls back to a real syscall instruction.
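The read loop can be sketched with std atomics standing in for the shared VvarPage mapping. This is a userspace model, not the real vDSO code: the hardware counter read is passed as a closure, and only the fields the loop touches are modeled.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Retry budget before falling back to the syscall path.
const MAX_RETRIES: u32 = 100;

/// Minimal model of the timekeeping fields in the shared vDSO page.
struct Vvar {
    seq: AtomicU32,
    cycle_last: AtomicU64,
    base_nsec: AtomicU64,
    mult: u32,
    shift: u32,
    mask: u64,
}

/// Returns Some(ns) on success; None when the retry budget is
/// exhausted, in which case the caller would issue a real
/// clock_gettime() syscall.
fn vdso_clock_read(v: &Vvar, read_counter: impl Fn() -> u64) -> Option<u64> {
    for _ in 0..MAX_RETRIES {
        let s1 = v.seq.load(Ordering::Acquire);
        if s1 & 1 != 0 {
            continue; // odd: kernel update in progress
        }
        let cycle_last = v.cycle_last.load(Ordering::Relaxed);
        let base = v.base_nsec.load(Ordering::Relaxed);
        let now = read_counter();
        let delta = now.wrapping_sub(cycle_last) & v.mask;
        let ns = base + ((delta as u128 * v.mult as u128) >> v.shift) as u64;
        // Re-read seq: unchanged means the snapshot was consistent.
        if v.seq.load(Ordering::Acquire) == s1 {
            return Some(ns);
        }
    }
    None // fall back to the syscall path
}
```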
7.8.4 Timer Infrastructure¶
UmkaOS provides two timer mechanisms, matching the Linux split between coarse and high-resolution timers.
Timer wheel (coarse-grained, jiffies resolution):
Used for network retransmission timeouts, poll/epoll timeouts, and other events where millisecond precision is sufficient. Implemented as a hierarchical timer wheel with O(1) insertion and O(1) per-tick processing (cascading is amortized). The wheel uses 8 levels with 256 slots each, covering timeouts from 1 tick to ~50 days at HZ=250.
High-resolution timers (hrtimers, nanosecond precision):
Used for timer_create() (POSIX per-process timers), nanosleep() /
clock_nanosleep(), timerfd_create(), and scheduler deadline enforcement.
Implemented as a per-CPU red-black tree keyed by absolute expiry time. The
nearest expiry programs the hardware timer (local APIC on x86, Generic Timer on
ARM, mtimecmp on RISC-V) to fire at the exact time.
// umka-core/src/time/hrtimer.rs
/// A high-resolution timer.
pub struct HrTimer {
/// Absolute expiry time (CLOCK_MONOTONIC nanoseconds).
pub expires_ns: u64,
/// Callback invoked on expiry. Runs in hard-IRQ context.
pub callback: fn(&mut HrTimer),
/// Opaque context value passed to the callback. Typically a pointer to
/// the enclosing structure (cast via `as usize`), allowing the callback
/// to recover its context via `unsafe { &*(context as *const T) }`.
/// This is the Rust equivalent of Linux's `container_of` pattern for
/// timer callbacks.
///
/// # Safety
///
/// Using `context` as a pointer requires the following invariants:
///
/// 1. **Pinning**: The `HrTimer` must be embedded in a `Pin`-ned
/// allocation. The enclosing structure must not move while the timer
/// is armed, since `context` stores a raw pointer to it.
/// 2. **Drop ordering**: The enclosing structure's `Drop` implementation
/// must cancel the timer (`hrtimer_cancel()`) before deallocation,
/// ensuring the callback never fires with a dangling `context`.
/// 3. **Type agreement**: The callback is responsible for casting
/// `context` back to the correct type via
/// `unsafe { &*(self.context as *const T) }`. The caller that sets
/// `context` and the callback must agree on the type `T`.
/// 4. These invariants match Linux's `container_of` + `hrtimer` pattern,
/// adapted for Rust's ownership model. The timer subsystem enforces
/// invariant (2) by requiring `Pin<&mut HrTimer>` for
/// `hrtimer_start()`.
pub context: usize,
/// Timer state.
pub state: HrTimerState,
/// Owning CPU (timers are per-CPU to avoid cross-CPU synchronization).
pub cpu: u32,
}
/// Convenience alias used by subsystem specs (watchdog, IPVS, timerfd).
/// `KernelTimer` is the same `HrTimer` defined above.
pub type KernelTimer = HrTimer;
7.8.4.1 Cross-Domain Timer Registration¶
Tier 1 modules (e.g., umka-net TCP, IPVS) run in isolated domains and cannot
receive direct timer callbacks from the Tier 0 timer wheel. Instead, they register
timers with a domain_id parameter. On expiry, the timer subsystem routes the
event to the target domain's MPSC IRQ ring as a TimerExpiry notification
(Section 12.8), rather than invoking a callback directly.
This separation is required by the Unified Domain Model: the timer wheel runs in Tier 0 softirq context, while the timer handler runs in the Tier 1 driver's domain. Direct cross-domain function calls violate the isolation boundary.
/// Cross-domain timer registration for Tier 1 modules.
///
/// When `domain_id != CORE_DOMAIN_ID`, the timer expiry path calls
/// `timer_fire_to_domain()` ([Section 12.8](12-kabi.md#kabi-domain-runtime)) instead of
/// invoking `callback` directly. The callback field is ignored for
/// cross-domain timers (the driver's IRQ consumer loop handles dispatch
/// via `DriverIrqHandler::handle_timer_expiry()`).
///
/// When `domain_id == CORE_DOMAIN_ID` (Tier 0), the timer fires normally
/// via the `callback` field in `HrTimer` -- no ring dispatch.
///
/// # Arguments
///
/// - `timer_id`: Opaque identifier. For TCP: packed
/// `(sock_handle: u48, timer_type: u16)`. For other modules: module-defined.
/// The timer subsystem does not interpret this value -- it is echoed verbatim
/// in the `TimerExpiryPayload.timer_id` field on expiry.
/// - `domain_id`: Target domain for expiry delivery. Validated at registration
/// time: the domain must exist and be in `Active` state. If the domain crashes
/// between registration and expiry, the event is silently dropped.
/// - `expiry_ns`: Absolute expiry time (CLOCK_MONOTONIC nanoseconds).
/// - `timer_type`: `Wheel` for coarse-grained (jiffies resolution, network
/// retransmit timeouts) or `HrTimer` for high-resolution (nanosecond precision,
/// deadline enforcement).
///
/// # Returns
///
/// `CrossDomainTimerHandle` on success. The handle can be used to cancel or
/// rearm the timer. Cancellation is synchronous: after `cancel()` returns, no
/// expiry event for this timer will be delivered (in-flight events may still
/// be in the IRQ ring but the consumer detects staleness via `expiry_ns`
/// comparison).
///
/// # Errors
///
/// - `EINVAL`: `domain_id` does not exist or is not `Active`.
/// - `ENOMEM`: Timer wheel or hrtimer tree is at capacity (should not
/// happen in practice -- both are dynamically sized).
pub fn timer_register_cross_domain(
timer_id: u64,
domain_id: DomainId,
expiry_ns: u64,
timer_type: TimerType,
) -> Result<CrossDomainTimerHandle, Error> {
// Validate domain.
let domain = DOMAIN_REGISTRY.get(domain_id)
.ok_or(Error::INVAL)?;
if domain.domain_crashed() {
return Err(Error::INVAL);
}
let handle = match timer_type {
TimerType::Wheel => {
// Insert into per-CPU timer wheel. The wheel entry stores
// (domain_id, timer_id) instead of a callback pointer.
let wheel_handle = timer_wheel_insert_cross_domain(
expiry_ns, domain_id, timer_id,
)?;
CrossDomainTimerHandle::Wheel(wheel_handle)
}
TimerType::HrTimer => {
// Insert into per-CPU hrtimer tree. The hrtimer entry stores
// (domain_id, timer_id). On expiry, the hrtimer callback calls
// timer_fire_to_domain() instead of the driver's callback.
let hr_handle = hrtimer_insert_cross_domain(
expiry_ns, domain_id, timer_id,
)?;
CrossDomainTimerHandle::HrTimer(hr_handle)
}
};
Ok(handle)
}
/// Timer type selector for cross-domain registration.
pub enum TimerType {
/// Coarse-grained timer (jiffies resolution). Used for network retransmit
/// timeouts, keepalive, TIME_WAIT, and other events where millisecond
/// precision is sufficient.
Wheel,
/// High-resolution timer (nanosecond precision). Used for deadline
/// enforcement, CBS replenishment, and latency-sensitive timers.
HrTimer,
}
/// Handle returned by `timer_register_cross_domain()`. Supports cancel
/// and rearm operations.
pub enum CrossDomainTimerHandle {
Wheel(WheelTimerHandle),
HrTimer(HrTimerHandle),
}
impl CrossDomainTimerHandle {
/// Cancel the timer. After this returns, no new expiry events will be
/// enqueued for this timer. Events already in the IRQ ring are detected
/// as stale by the consumer (expiry_ns mismatch).
pub fn cancel(&mut self) {
match self {
Self::Wheel(h) => h.cancel(),
Self::HrTimer(h) => h.cancel(),
}
}
/// Rearm the timer with a new expiry time. The old expiry is cancelled
/// and a new one scheduled. In-flight events for the old expiry are
/// detected as stale by the consumer.
pub fn rearm(&mut self, new_expiry_ns: u64) {
match self {
Self::Wheel(h) => h.rearm(new_expiry_ns),
Self::HrTimer(h) => h.rearm(new_expiry_ns),
}
}
}
Tier 0 expiry path change: When a timer with domain_id != CORE_DOMAIN_ID
expires, the timer wheel (or hrtimer) expiry handler calls
timer_fire_to_domain(domain_id, timer_id, expiry_ns) instead of invoking
the normal callback. This function is defined in Section 12.8
and enqueues a TimerExpiry event on the target domain's MPSC IRQ ring.
The cost is one CAS enqueue (~3-5 cycles) plus a conditional IPI (~0-3 cycles)
-- comparable to a normal timer callback invocation.
Per-CPU timer queues: Each CPU maintains its own timer wheel and hrtimer tree. Timer insertion targets the local CPU by default. Expiry processing happens in the local timer interrupt — no cross-CPU IPI is needed. This eliminates contention and provides deterministic latency on isolated CPUs (Section 8.4).
Timer coalescing: When a timer is inserted with a slack tolerance (e.g.,
a 100 ms timeout with 10 ms acceptable slack), the kernel may delay it to
coalesce with nearby timers. This reduces wakeups on idle CPUs, improving power
efficiency. Coalescing is disabled for hrtimers with zero slack (RT workloads).
The timer_slack_ns per-process tunable controls default slack, identical to
the Linux interface.
7.8.5 Time Namespace Offsets¶
Containers and checkpoint/restore (CRIU) require the ability to present shifted
monotonic and boottime clocks to isolated processes. Linux added time namespaces
in kernel 5.6 (via unshare(CLONE_NEWTIME)). UmkaOS provides the same
capability.
/// Per-time-namespace offsets applied to monotonic and boottime clocks.
/// Created when a process calls unshare(CLONE_NEWTIME).
pub struct TimeNamespace {
/// Offset added to CLOCK_MONOTONIC readings within this namespace.
pub monotonic_offset_ns: i64,
/// Offset added to CLOCK_BOOTTIME readings within this namespace.
pub boottime_offset_ns: i64,
/// Frozen flag: when set, all time reads return the value at freeze time.
/// Used for container checkpoint/restore (CRIU).
pub frozen: bool,
}
clock_gettime() path with namespace offsets:
| Clock | Formula | Namespace-affected? |
|---|---|---|
| `CLOCK_MONOTONIC` | `raw_monotonic_ns + current_task().nsproxy.time_ns.monotonic_offset_ns` | Yes |
| `CLOCK_BOOTTIME` | `raw_boottime_ns + current_task().nsproxy.time_ns.boottime_offset_ns` | Yes |
| `CLOCK_REALTIME` | `wall_time` (real wall clock, matches Linux) | No |
| `CLOCK_MONOTONIC_RAW` | Raw hardware counter, no namespace adjustment | No |
CLOCK_REALTIME is intentionally unaffected — it represents actual wall clock
time and shifting it per-namespace would break distributed protocols (TLS
certificate validation, Kerberos ticket lifetimes, NFS lease timers).
CLOCK_MONOTONIC_RAW is the raw hardware counter exposed for benchmarking; it
bypasses both NTP discipline and namespace offsets.
Frozen time (CRIU): When frozen == true, all four clocks return the
timestamp captured at freeze time. This is used during container
checkpoint/restore: the container process tree is frozen, checkpointed, migrated
to another host, and restored. The restored processes see time continuing from
the checkpoint instant (with the offset set to target_monotonic - source_monotonic
at restore time), avoiding spurious timer expirations and timeout-driven errors.
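The offset and frozen-time paths can be sketched as follows. The `frozen_ns` field is hypothetical: the `TimeNamespace` struct above does not name the captured timestamp, but some stored freeze-time value is needed for the frozen path to return anything.

```rust
/// Model of per-namespace clock adjustment. Mirrors the TimeNamespace
/// struct above, plus a hypothetical `frozen_ns` capture field.
struct TimeNamespace {
    monotonic_offset_ns: i64,
    boottime_offset_ns: i64,
    frozen: bool,
    frozen_ns: u64, // hypothetical: timestamp captured at freeze time
}

impl TimeNamespace {
    /// CLOCK_MONOTONIC as seen inside this namespace.
    fn clock_monotonic(&self, raw_monotonic_ns: u64) -> u64 {
        if self.frozen {
            return self.frozen_ns; // CRIU freeze: time stands still
        }
        (raw_monotonic_ns as i64 + self.monotonic_offset_ns) as u64
    }

    /// CLOCK_BOOTTIME as seen inside this namespace (same shape).
    fn clock_boottime(&self, raw_boottime_ns: u64) -> u64 {
        if self.frozen {
            return self.frozen_ns;
        }
        (raw_boottime_ns as i64 + self.boottime_offset_ns) as u64
    }
}
```

CLOCK_REALTIME and CLOCK_MONOTONIC_RAW have no corresponding method here: as the table below notes, they bypass namespace adjustment entirely.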
vDSO fast path: Each time namespace has a dedicated vDSO data page mapped
into user processes belonging to that namespace. The page contains pre-applied
offsets (the VvarPage.clock_monotonic_ns and clock_boottime_sec/nsec fields
already include the namespace offset) so userspace clock_gettime() never
crosses into the kernel for namespace-aware time reads. When a process calls
unshare(CLONE_NEWTIME), the kernel allocates a new VvarPage and remaps
it into the process's vDSO mapping via mremap() of the vDSO data region. The
kernel's timer tick handler updates all active vDSO data pages (one per distinct
time namespace with at least one live process).
Offset configuration: Offsets are set by writing to
/proc/[pid]/timens_offsets before the process enters the new time namespace
(i.e., after unshare(CLONE_NEWTIME) but before the first exec() or
clone() that would use the new namespace). The format matches Linux: one line
per clock, `<clock-id> <offset-secs> <offset-nanosecs>` (for example,
`monotonic 86400 0`). Once a process has entered the namespace (the first
exec() after unshare(CLONE_NEWTIME)), the offsets are immutable — they cannot
be changed for the lifetime of the namespace.
7.8.6 Clocksource Watchdog¶
A clocksource that reports incorrect time is worse than a slow one — it causes silent data corruption in timestamps, incorrect scheduler decisions, and broken network protocols.
Cross-validation: Every 500 ms (configurable), the kernel reads both the primary and secondary clocksource and compares the elapsed interval. If the primary's elapsed time deviates from the secondary's by more than a threshold (default: 100 ppm sustained over 5 consecutive checks), the primary is marked unstable.
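The drift check reduces to a ppm comparison with a consecutive-failure counter; a sketch using the default thresholds (100 ppm, 5 consecutive checks):

```rust
/// Tracks consecutive out-of-tolerance watchdog checks for the
/// primary clocksource.
struct Watchdog {
    consecutive_bad: u32,
}

impl Watchdog {
    /// One cross-validation check (run every 500 ms). Compares the
    /// elapsed interval reported by the primary against the secondary.
    /// Returns true when the primary should be marked unstable.
    fn check(&mut self, primary_elapsed_ns: u64, secondary_elapsed_ns: u64) -> bool {
        let diff = primary_elapsed_ns.abs_diff(secondary_elapsed_ns);
        // Deviation in parts per million, relative to the secondary
        // (reference) interval.
        let ppm = diff * 1_000_000 / secondary_elapsed_ns;
        if ppm > 100 {
            self.consecutive_bad += 1;
        } else {
            self.consecutive_bad = 0; // threshold requires sustained drift
        }
        self.consecutive_bad >= 5
    }
}
```

A single glitch (e.g., an SMI landing between the two reads) resets nothing permanent: the counter clears on the next in-tolerance check, so only sustained drift triggers the unstable transition.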
TSC instability detection: On x86, the TSC can be unreliable in several scenarios:
- Non-invariant TSC (pre-Nehalem Intel, pre-Zen AMD): frequency changes with P-state transitions.
- TSC halts during deep C-states on some older processors.
- TSC desynchronization across sockets on early multi-socket systems.
The watchdog detects all three cases. When the TSC is marked unstable:
- The kernel logs a warning: `clocksource: TSC marked unstable (drift >100ppm vs HPET)`.
- The active clocksource switches to HPET (or ACPI PM Timer if HPET is absent).
- The vDSO `clock_mode` is updated so userspace falls back to the syscall path (HPET is not readable from userspace without kernel MMIO mapping).
- The switch is atomic from the perspective of seqlock readers — one consistent snapshot uses TSC parameters, the next uses HPET parameters.
Capability-gated calibration: TSC frequency calibration (reading MSRs like
MSR_PLATFORM_INFO or calibrating against PIT/HPET) requires privileged
operations. Only umka-core holds the capability to read/write MSRs. Tier 1
drivers cannot influence clocksource selection — a compromised driver cannot
subvert system timekeeping.
7.8.7 Interaction with RT and Power Management¶
RT timer latency: Real-time tasks (Section 8.4) depend on bounded timer
expiry. On CPUs designated for RT workloads (isolcpus, nohz_full), hrtimer
expiry is serviced directly in hard-IRQ context with a preemption-disabled path
of bounded length. The worst-case path from hardware interrupt to hrtimer
callback execution is: interrupt entry (~200 cycles) + hrtimer tree lookup (O(1)
for the nearest timer) + callback invocation. On x86 with a local APIC timer
and an isolated CPU (no frequency scaling, shallow C-states, nohz_full),
the software path completes in under 1 μs. However, hardware-level
non-determinism (DRAM refresh cycles ~350ns worst-case, cache miss penalties,
memory controller contention) means the end-to-end observed latency on
real hardware is typically 1-5 μs under favorable conditions and up to 10 μs
under worst-case memory pressure. These figures match measured PREEMPT_RT Linux
performance on isolated cores. Section 8.4 details the hardware resource
partitioning (CAT, MBA, RDT) that UmkaOS uses to minimize hardware-level jitter.
When PreemptionModel::Realtime is active (Section 8.4), softirq-context
timers are promoted to hard-IRQ context for RT-priority hrtimers, ensuring they
cannot be delayed by threaded interrupt processing.
C-state interaction with clocksources: CPU power states affect timer behavior:
| C-state | Invariant TSC | Non-Invariant TSC | Generic Timer (ARM) | mtime (RISC-V) | Timebase (PPC) |
|---|---|---|---|---|---|
| C1 (halt) | Continues | May stop | Continues | Continues | Continues |
| C3+ (deep sleep) | Continues | Stops | Continues | Continues | Continues |
When a non-invariant TSC is detected and the system supports deep C-states, the kernel forces HPET as the clocksource and disables the vDSO fast path for timestamp reads. This is a correctness requirement, not a performance choice.
Tickless (nohz) mode: When a CPU has no pending timers and is running a single task (or is idle), the periodic tick is stopped entirely. The kernel reprograms the hardware timer to fire at the next actual event (nearest hrtimer expiry, or infinity if none). This eliminates unnecessary wakeups on isolated RT CPUs and idle CPUs.
Resuming the tick happens when: (a) a new timer is inserted on the CPU, (b) a
second task becomes runnable (the scheduler needs periodic load balancing), or
(c) an interrupt wakes the CPU from idle. The nohz implementation reuses Linux's
nohz_full semantics: user code on an isolated CPU can run for arbitrarily long
periods without a single kernel interrupt.
Power-aware timer placement: When timer coalescing (Section 7.8) groups timers, the kernel prefers placing them on CPUs that are already awake. Waking a CPU from C3+ costs ~100 μs and defeats the purpose of coalescing. The timer subsystem queries the per-CPU idle state before choosing a coalescing target.
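A minimal sketch of the placement preference, assuming a hypothetical per-CPU idle snapshot (shallower C-state wins, with a running CPU counted as C-state 0):

```rust
/// Hypothetical snapshot of one CPU's idle state: 0 = running/awake,
/// 1 = C1 (halt), 3 = C3+ (deep sleep). Lower is cheaper to target.
#[derive(Clone, Copy)]
pub struct CpuIdleSnapshot {
    pub cpu: u32,
    pub c_state: u8,
}

/// Pick a coalescing target: prefer an already-awake CPU, otherwise the
/// shallowest sleeper (waking from C3+ costs ~100 μs, which defeats the
/// point of coalescing in the first place).
pub fn pick_coalescing_target(cpus: &[CpuIdleSnapshot]) -> Option<u32> {
    cpus.iter().min_by_key(|s| s.c_state).map(|s| s.cpu)
}
```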
7.9 System Event Bus¶
The event bus is a core kernel facility that enables kernel subsystems and drivers to notify userspace of hardware and system state changes via a capability-gated, lock-free ring buffer mechanism. Netlink compatibility (for udev/systemd) is implemented in umka-sysapi (Section 19.5).
7.9.1 Event Subscription Model¶
// umka-core/src/event/mod.rs
/// System event types.
#[repr(u32)]
#[derive(Clone, Copy)]
pub enum EventType {
/// Battery level changed.
BatteryLevelChanged = 0,
/// AC adapter state changed (plugged/unplugged).
AcStateChanged = 1,
/// WiFi connection state changed.
WifiStateChanged = 2,
/// Bluetooth device paired/unpaired.
BluetoothDeviceChanged = 3,
/// USB device inserted/removed.
UsbDeviceChanged = 4,
/// Display hotplug (connected/disconnected).
DisplayHotplug = 5,
/// Thermal event (warning, critical).
ThermalEvent = 6,
/// Power profile changed.
PowerProfileChanged = 7,
/// Block device added/removed (for storage hotplug).
BlockDeviceChanged = 8,
/// Memory pressure event (for OOM-aware daemons).
MemoryPressure = 9,
/// Driver crash/recovery event (for monitoring daemons).
DriverRecovery = 10,
}
/// Event payload (exactly 256 bytes, cache-line friendly).
///
/// Layout verification with `#[repr(C)]`:
/// Offset 0: event_type (EventType = u32) = 4 bytes
/// Offset 4: _pad0 ([u8; 4]) = 4 bytes (explicit alignment padding)
/// Offset 8: timestamp_ns (u64) = 8 bytes
/// Offset 16: data (EventData, 240 bytes) = 240 bytes
/// Total: 4 + 4 + 8 + 240 = 256 bytes.
///
/// **Compile-time assertion**: `const_assert!(size_of::<Event>() == 256);`
// kernel-internal, not KABI
#[repr(C)]
pub struct Event {
/// Event type.
pub event_type: EventType,
/// Explicit padding for u64 alignment of timestamp_ns.
pub _pad0: [u8; 4],
/// Timestamp (monotonic ns).
pub timestamp_ns: u64,
/// Event-specific data.
pub data: EventData,
}
/// Event-specific data (union of all possible payloads).
///
/// Size assertion: `const_assert!(core::mem::size_of::<EventData>() == 240)`.
/// This ensures that adding a new variant cannot silently grow the union
/// (and therefore the `Event` struct) beyond 256 bytes.
///
/// **Info leak prevention**: Event structs are delivered to userspace via `read()`
/// on `/dev/event_bus`. Each variant is smaller than 240 bytes, leaving tail bytes.
/// The `Event::new()` constructor MUST zero-initialize the entire 256-byte struct
/// before writing the variant payload: `let mut e: Event = unsafe { core::mem::zeroed() };`
/// This ensures no kernel stack/heap data leaks through uninitialized tail bytes.
#[repr(C)]
pub union EventData {
pub battery: BatteryEvent,
pub ac: AcEvent,
pub wifi: WifiEvent,
pub bluetooth: BluetoothEvent,
pub usb: UsbEvent,
pub display: DisplayEvent,
pub thermal: ThermalEvent,
pub power_profile: PowerProfileEvent,
pub block_device: BlockDeviceEvent,
pub memory_pressure: MemoryPressureEvent,
pub driver_recovery: DriverRecoveryEvent,
_pad: [u8; 240], // EventData union size = 240 bytes; Event total = 4 + 4 + 8 + 240 = 256.
}
impl Event {
/// Zero-initialize an Event, then populate the header and variant payload.
/// The entire 256-byte struct is zeroed BEFORE writing any fields, ensuring
/// no kernel data leaks through uninitialized tail bytes when delivered to
/// userspace via `read()` on `/dev/event_bus`.
///
/// # Safety
/// `core::mem::zeroed()` is sound for `Event` because all-zero bytes are a
/// valid bit pattern for every field: `EventType`'s zero discriminant is
/// `BatteryLevelChanged`, and no field has a non-zero invariant.
pub fn new(event_type: EventType, timestamp_ns: u64) -> Self {
// SAFETY: Event is repr(C), all-zeroes is valid for every field.
let mut e: Self = unsafe { core::mem::zeroed() };
e.event_type = event_type;
e.timestamp_ns = timestamp_ns;
e
}
}
// Event variant structs below are kernel-internal payloads embedded in the
// EventData union (max 240 bytes per variant). The 256-byte Event struct
// has its own const_assert. Individual variants do not need separate asserts.
/// Battery event data.
#[repr(C)]
#[derive(Clone, Copy)] // union fields must be Copy
pub struct BatteryEvent {
/// Battery percentage (0-100).
pub percent: u8,
/// Charging state (0=discharging, 1=charging, 2=full).
pub charging: u8,
/// Time remaining in minutes (0xFFFF = unknown).
pub time_remaining_min: u16,
}
/// Display hotplug event data.
#[repr(C)]
#[derive(Clone, Copy)] // union fields must be Copy
pub struct DisplayEvent {
/// Connector ID.
pub connector_id: u32,
/// Event subtype: 0 = disconnected, 1 = connected.
pub connected: u8,
}
const_assert!(core::mem::size_of::<DisplayEvent>() == 8);
/// Block device event data.
#[repr(C)]
#[derive(Clone, Copy)] // union fields must be Copy
pub struct BlockDeviceEvent {
/// Major:minor encoded as `(major << 20) | minor`.
pub dev_id: u32,
/// Event subtype (0=removed, 1=added, 2=changed).
pub action: u8,
pub _pad: [u8; 3],
/// Device name (e.g., "sda", "nvme0n1"). Null-terminated.
pub name: [u8; 32],
}
const_assert!(core::mem::size_of::<BlockDeviceEvent>() == 40);
/// Memory pressure event data.
///
/// Layout (16 bytes, no implicit padding):
/// offset 0: available_pages (u64, 8 bytes)
/// offset 8: numa_node (i32, 4 bytes)
/// offset 12: level (u8, 1 byte)
/// offset 13: _pad ([u8; 3], 3 bytes — tail padding to 8-byte alignment)
/// Total: 16 bytes.
#[repr(C)]
#[derive(Clone, Copy)] // union fields must be Copy
pub struct MemoryPressureEvent {
/// Available memory in pages at the time of the event.
pub available_pages: u64,
/// NUMA node that triggered the event (-1 = system-wide).
pub numa_node: i32,
/// Pressure level (0=low, 1=medium, 2=critical).
pub level: u8,
/// Explicit tail padding to 8-byte struct alignment.
pub _pad: [u8; 3],
}
const_assert!(core::mem::size_of::<MemoryPressureEvent>() == 16);
/// Driver crash/recovery event data.
#[repr(C)]
#[derive(Clone, Copy)] // union fields must be Copy
pub struct DriverRecoveryEvent {
/// Device handle of the affected device.
pub device_id: u64,
/// Event subtype (0=crashed, 1=recovering, 2=recovered, 3=quarantined).
pub action: u8,
pub _pad: [u8; 7],
/// Driver name. Null-terminated, max 63 bytes.
pub driver_name: [u8; 64],
}
const_assert!(core::mem::size_of::<DriverRecoveryEvent>() == 80);
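The zero-init-then-populate pattern from `Event::new()` can be exercised standalone with a reduced event/union pair (sizes shrunk for the sketch; the `Mini*` names are illustrative, not the real types):

```rust
#[repr(C)]
#[derive(Clone, Copy)]
struct MiniBatteryEvent {
    percent: u8,
    charging: u8,
}

#[repr(C)]
union MiniEventData {
    battery: MiniBatteryEvent,
    _pad: [u8; 8],
}

#[repr(C)]
struct MiniEvent {
    event_type: u32,
    _pad0: [u8; 4],
    data: MiniEventData,
}

fn new_battery_event(percent: u8) -> MiniEvent {
    // Zero the ENTIRE struct first — tail bytes past the variant payload
    // must not carry stale kernel data when the event is copied to userspace.
    // SAFETY: all-zero bytes are valid for every field of MiniEvent.
    let mut e: MiniEvent = unsafe { core::mem::zeroed() };
    e.event_type = 0; // BatteryLevelChanged in the real enum
    e.data.battery = MiniBatteryEvent { percent, charging: 1 };
    e
}
```

Reading back the union's `_pad` view shows the bytes past the 2-byte payload are provably zero, which is exactly the info-leak property the text requires.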
7.9.2 Subscription via Capability¶
Processes subscribe to events via the capability system (Section 9.1):
/// Global event subscription manager. Singleton, initialized during boot.
///
/// **Collection policy**: `EventType` is `#[repr(u32)]` with 11 variants (0-10).
/// A fixed-size array indexed by `event_type as usize` provides O(1) lookup
/// with zero overhead (0-indexed: variant 0 = index 0, variant 10 = index 10).
/// Each entry holds a bounded list of subscribers.
/// The post_event path is warm (device state changes, not per-syscall).
pub struct EventManager {
/// Per-event-type subscription lists. Indexed by `EventType as usize`.
/// Fixed-size array (EVENT_TYPE_COUNT entries, one per EventType variant).
/// Each entry is a bounded list of subscribers, enforced at subscribe time.
subscriptions: [SpinLock<ArrayVec<SubscriptionInfo, MAX_SUBSCRIBERS_PER_EVENT>>; EVENT_TYPE_COUNT],
capability_manager: &'static CapabilityManager,
}
/// Maximum subscribers per event type. Enforced at subscribe() time;
/// exceeding this returns `EventError::TooManySubscribers`.
const MAX_SUBSCRIBERS_PER_EVENT: usize = 64;
/// Number of EventType enum variants.
const EVENT_TYPE_COUNT: usize = 11;
/// Per-subscriber state. Stored in the per-event-type subscription array.
struct SubscriptionInfo {
process_id: Pid,
ring: Weak<EventRing>,
dropped_events: AtomicU64,
}
impl EventManager {
/// Subscribe to a class of events. Returns an EventSubscription capability.
///
/// # Security
/// Requires `CAP_SYS_ADMIN` for system-wide events (thermal, power profile).
/// Requires `CAP_NET_ADMIN` for network events (WiFi, Bluetooth).
/// Battery, AC, USB, display events are unrestricted (visible to all processes).
pub fn subscribe(&self, event_type: EventType, process: &Process) -> Result<CapabilityToken, EventError> {
// Check capability grants.
match event_type {
EventType::ThermalEvent | EventType::PowerProfileChanged => {
if !process.has_capability(Capability::SysAdmin) {
return Err(EventError::PermissionDenied);
}
}
EventType::WifiStateChanged | EventType::BluetoothDeviceChanged => {
if !process.has_capability(Capability::NetAdmin) {
return Err(EventError::PermissionDenied);
}
}
EventType::MemoryPressure | EventType::DriverRecovery => {
if !process.has_capability(Capability::SysAdmin) {
return Err(EventError::PermissionDenied);
}
}
// Unrestricted — any process can subscribe to these event types.
// New variants must be explicitly listed here — deny by default.
// This match is intentionally exhaustive with no wildcard: adding a
// new EventType variant triggers a compile error, forcing the developer
// to assign it to an explicit capability tier above or list it here.
EventType::BatteryLevelChanged
| EventType::AcStateChanged
| EventType::UsbDeviceChanged
| EventType::DisplayHotplug
| EventType::BlockDeviceChanged => {}
}
// Allocate event ring buffer (per-process, 4 KB = ~16 events).
// The process holds the Arc (strong ref); subscription holds Weak.
// When the process exits, Arc is dropped, Weak::upgrade() returns None,
// and post_event() silently skips the dead subscriber.
let ring = Arc::new(EventRing::allocate(process)?);
// Mint capability token.
let cap_token = self.capability_manager.mint(
CapabilityType::EventSubscription,
CapabilityRights::READ,
ring.ring_id(),
)?;
// Register subscription with weak reference (AI-065: Arc/Weak lifecycle).
let weak = Arc::downgrade(&ring);
let idx = event_type as usize;
let mut guard = self.subscriptions[idx].lock();
if guard.is_full() {
return Err(EventError::TooManySubscribers);
}
guard.push(SubscriptionInfo {
process_id: process.pid(),
ring: weak,
dropped_events: AtomicU64::new(0),
});
// Install the strong Arc in the process's file descriptor table so
// userspace can read() events from it — the FD table holds the only
// strong reference, so the Weak registered above goes dead when the
// process exits. (cap_token is the capability granting access to the
// ring FD; `install_event_ring` is an illustrative name.)
process.install_event_ring(ring)?;
Ok(cap_token)
}
/// Post an event to all subscribers.
///
/// Takes `&Event` to avoid copying the 256-byte struct per call.
/// The ring buffer copies the event internally on each push.
pub fn post_event(&self, event: &Event) {
let idx = event.event_type as usize;
let guard = self.subscriptions[idx].lock();
for sub in guard.iter() {
if let Some(ring) = sub.ring.upgrade() {
// Write event to subscriber's ring buffer (lock-free push).
if ring.push(event).is_err() {
// Ring full: drop event (subscriber is too slow).
sub.dropped_events.fetch_add(1, Ordering::Relaxed);
}
}
}
}
}
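`EventRing::push()` itself is not defined in this section; its drop-on-full semantics can be modeled with a simple bounded ring (a single-threaded sketch under assumed names — the real ring is lock-free):

```rust
/// Single-threaded model of the subscriber ring: bounded, drop-on-full.
/// Indices only grow; `head - tail` is the current fill level.
pub struct RingModel<T: Copy> {
    buf: Vec<Option<T>>,
    head: usize, // total events pushed
    tail: usize, // total events consumed
}

impl<T: Copy> RingModel<T> {
    pub fn new(capacity: usize) -> Self {
        RingModel { buf: vec![None; capacity], head: 0, tail: 0 }
    }

    /// Push one event. Err(()) = ring full; post_event() then bumps the
    /// subscriber's dropped_events counter instead of ever blocking.
    pub fn push(&mut self, ev: T) -> Result<(), ()> {
        if self.head - self.tail == self.buf.len() {
            return Err(()); // subscriber too slow: drop, never block
        }
        let cap = self.buf.len();
        self.buf[self.head % cap] = Some(ev);
        self.head += 1;
        Ok(())
    }

    /// Consume the oldest event (what read() on /dev/event_bus would do).
    pub fn pop(&mut self) -> Option<T> {
        if self.head == self.tail {
            return None;
        }
        let cap = self.buf.len();
        let ev = self.buf[self.tail % cap].take();
        self.tail += 1;
        ev
    }
}
```

The key design point mirrored here is that the poster never waits on a slow subscriber: a full ring costs the subscriber an event, not the kernel a stall.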
For Netlink compatibility (udev, systemd integration), see Section 19.5.
7.9.3 Integration Points¶
| Subsystem | Events posted |
|---|---|
| Battery driver (Section 7.4) | BatteryLevelChanged, AcStateChanged |
| WiFi driver (Section 13.2, 10-drivers.md) | WifiStateChanged |
| Bluetooth (Section 13.14, 10-drivers.md) | BluetoothDeviceChanged |
| USB bus (Section 13.12) | UsbDeviceChanged |
| Display driver (Section 21.5) | DisplayHotplug |
| Thermal framework (Section 7.4) | ThermalEvent |
| Power profiles (Section 7.4) | PowerProfileChanged |
| Block layer (Section 15.2) | BlockDeviceChanged |
| OOM killer (Section 4.5) | MemoryPressure |
| Crash recovery (Section 11.9) | DriverRecovery |
7.10 Intent-Based Resource Management¶
7.10.1 The Abstraction Gap¶
UmkaOS has all the mechanisms for smart resource management:
- In-kernel inference engine (Section 22.6) for learned decisions
- Per-device utilization tracking (Section 22.1)
- Topology awareness (device registry, Section 11.4)
- Power metering (Section 7.7)
- Memory tier tracking (PageLocationTracker, Section 22.4)
- Network fabric topology (Section 5.2)
What's missing is the abstraction that ties these together. Currently, resources are managed imperatively: "give me 4 cores and 16GB RAM." The alternative: declare goals, let the kernel optimize.
7.10.2 Design: Resource Intents¶
// umka-core/src/intent/mod.rs
/// A resource intent declares WHAT the workload needs,
/// not HOW to allocate resources.
#[repr(C)]
pub struct ResourceIntent {
/// Target P99 SCHEDULING latency (nanoseconds).
/// This is the time from task becoming runnable to task getting CPU.
/// The kernel cannot measure application-level latency (it doesn't know
/// what an "operation" is). This metric is scheduling + I/O completion
/// latency — both kernel-observable.
/// Kernel adjusts CPU priority, memory placement, I/O scheduling.
/// 0 = no latency target (best-effort).
pub target_latency_ns: u64,
/// Target throughput (operations per second).
/// Kernel adjusts CPU allocation, I/O queue depth, batch sizes.
/// 0 = no throughput target (best-effort).
pub target_ops_per_sec: u64,
/// Availability requirement (basis points: 9999 = 99.99%).
/// Kernel adjusts redundancy, crash recovery priority.
/// 0 = no availability target.
/// Used by: cgroup knob `intent.availability` (Section 7.7.3),
/// crash recovery priority in Section 20.1 (higher availability_bp
/// = faster restart, more aggressive health monitoring).
pub availability_bp: u32,
/// Power efficiency preference (0 = max performance, 100 = max efficiency).
/// Kernel adjusts DVFS, core parking, accelerator clock.
/// 50 = balanced (default).
pub efficiency_preference: u32,
/// Data locality hint: where does this workload's data live?
/// Kernel uses this for NUMA placement and distributed scheduling.
/// Used by: cgroup knob `intent.data_affinity` (Section 7.7.3),
/// NUMA placement optimizer (Section 7.7.5 step 2b), and distributed
/// scheduling in Section 5.1.
pub data_affinity: DataAffinityHint,
/// Struct layout version. Enables future extension without breaking binary
/// compatibility: the kernel checks this field and interprets fields beyond
/// the base layout only if version >= the version that introduced them.
/// v1 = initial layout (this definition). Future versions extend into _reserved.
pub version: u32,
/// Reserved for future extension fields. New versions of ResourceIntent
/// consume bytes from this region. Zero-initialized by callers; the kernel
/// ignores non-zero bytes in positions it does not recognize for the given
/// `version` field value. When `version` is incremented, newly-defined
/// fields are parsed from specific offsets within `_reserved`. Sized to
/// make the struct exactly 64 bytes (u64-aligned) with no implicit
/// tail padding: 8+8+4+4+4+4+32 = 64.
pub _reserved: [u8; 32],
}
// Layout: 8+8+4+4+4+4+32 = 64 bytes.
const_assert!(size_of::<ResourceIntent>() == 64);
#[repr(u32)]
pub enum DataAffinityHint {
/// No preference. Kernel decides based on observation.
Auto = 0,
/// Data is primarily local (disk-bound workload).
Local = 1,
/// Data is distributed across nodes (distributed workload).
Distributed = 2,
/// Data is on accelerators (GPU-bound workload).
Accelerator = 3,
}
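As a concrete check of the layout, the struct compiles standalone and its 64-byte size can be asserted (definitions copied from above, doc comments elided; the constructor helper is illustrative):

```rust
#[repr(u32)]
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum DataAffinityHint {
    Auto = 0,
    Local = 1,
    Distributed = 2,
    Accelerator = 3,
}

#[repr(C)]
pub struct ResourceIntent {
    pub target_latency_ns: u64,
    pub target_ops_per_sec: u64,
    pub availability_bp: u32,
    pub efficiency_preference: u32,
    pub data_affinity: DataAffinityHint,
    pub version: u32,
    pub _reserved: [u8; 32],
}

/// Illustrative helper: a v1 intent for a latency-sensitive service
/// (5 ms P99 scheduling latency, balanced efficiency, local data).
pub fn latency_intent_5ms() -> ResourceIntent {
    ResourceIntent {
        target_latency_ns: 5_000_000,
        target_ops_per_sec: 0,     // best-effort throughput
        availability_bp: 0,        // no availability target
        efficiency_preference: 50, // balanced (default)
        data_affinity: DataAffinityHint::Local,
        version: 1,
        _reserved: [0; 32],        // zero-initialized per the contract
    }
}
```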
7.10.3 Cgroup Integration¶
/sys/fs/cgroup/<group>/intent.latency_ns
#   Target P99 latency in nanoseconds.
#   "0" = no target (default, pure imperative mode).
#   "5000000" = target 5ms P99 latency.
/sys/fs/cgroup/<group>/intent.throughput
#   Target operations per second.
#   "0" = no target.
/sys/fs/cgroup/<group>/intent.efficiency
#   0 = max performance, 100 = max efficiency, 50 = balanced.
#   Default: 50.
/sys/fs/cgroup/<group>/intent.availability
#   Availability target in basis points (0 = no target, 9999 = 99.99%).
#   Kernel adjusts crash recovery priority and health monitoring
#   frequency for drivers serving this cgroup. Higher values trigger
#   faster driver restart (Section 20.1) and redundant I/O path selection.
#   Default: 0 (no availability target).
#   Maps to: ResourceIntent.availability_bp
/sys/fs/cgroup/<group>/intent.data_affinity
#   Data locality hint for NUMA placement and distributed scheduling.
#   Values: "auto" (default), "local", "distributed", "accelerator"
#   "auto" = kernel observes memory access patterns and decides.
#   "local" = data is primarily on local storage (optimize for disk I/O).
#   "distributed" = data spans cluster nodes (optimize for network).
#   "accelerator" = data lives on accelerator memory (minimize transfers).
#   Maps to: ResourceIntent.data_affinity (DataAffinityHint enum)
/sys/fs/cgroup/<group>/intent.status
#   Read-only. Current intent satisfaction:
#   latency_met: true|false
#   latency_p99_actual_ns: <value>
#   throughput_met: true|false
#   throughput_actual: <value>
#   power_actual_mw: <value>
#   optimizer_action: <last action taken, e.g., "raised cpu.weight to 200">
#   adjustments_last_hour: <count>
#   contradiction: <none|description>
Multi-tenant access control: In K8s multi-tenant clusters, `intent.status` exposes internal workload metrics (actual latency, throughput, power draw) for each cgroup. A process in one container reading `/sys/fs/cgroup/other-tenant/intent.status` would expose the other tenant's workload profile — an information disclosure risk. Access control: `intent.status` is readable only by processes with `CAP_SYS_ADMIN` in the cgroup's user namespace, or by the cgroup owner (matching the cgroup's `uid`). This matches the access model for `/proc/PID/status` — visible to owner and root only. Non-owner reads return `EACCES`. The same access policy applies to `intent.explain` and `intent.adjustment_history`, which also contain tenant-specific operational data.
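The stated policy reduces to a small predicate (a sketch; the function name and the uid-based ownership check are assumptions layered on the text):

```rust
/// Who may read intent.status / intent.explain / intent.adjustment_history:
/// the cgroup owner, or a holder of CAP_SYS_ADMIN in the cgroup's user
/// namespace. Everyone else gets EACCES.
pub fn may_read_intent_status(
    reader_uid: u32,
    cgroup_owner_uid: u32,
    has_cap_sys_admin_in_userns: bool,
) -> Result<(), i32> {
    const EACCES: i32 = 13;
    if has_cap_sys_admin_in_userns || reader_uid == cgroup_owner_uid {
        Ok(())
    } else {
        Err(EACCES)
    }
}
```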
7.10.4 Objective Function and Conflict Resolution¶
Clarification: SCHED_INTENT is NOT a scheduling class.
`SCHED_INTENT` is an annotation layered on top of existing scheduling classes (EEVDF, RT, Deadline). A task's `SchedClass` is unchanged by intent assignment — an EEVDF task with `intent::LATENCY_SENSITIVE` remains EEVDF class.
The annotation modifies the task's effective eligibility calculation: a latency-sensitive EEVDF task receives a forward-eligible offset (effectively a negative lag boost via `place_entity()`) that prioritizes it WITHIN EEVDF without promoting it to RT class. EEVDF computes eligibility dynamically from `avg_vruntime()`, not from a stored field.
The "priority level 4" in the multi-class selection table below refers to the selection order among scheduler classes: a LATENCY_SENSITIVE EEVDF task is considered before standard EEVDF tasks but after all RT tasks. This is implemented by adjusting the task's vruntime on wakeup — there is no separate SCHED_INTENT class in the runqueue.
Intent Scheduling: Objective Function and Conflict Resolution
Objective function: minimize total latency for latency-sensitive tasks subject to meeting all SCHED_DEADLINE deadlines.
minimize: Σ latency(task_i) for all SCHED_INTENT/SCHED_NORMAL tasks
subject to: deadline_j met for all SCHED_DEADLINE tasks j
Scheduling class priority (highest to lowest):
| Priority | Class | Condition |
|---|---|---|
| 1 (highest) | SCHED_DEADLINE | CBS task with remaining budget and active deadline |
| 2 | SCHED_RT FIFO | Real-time, FIFO policy, priority 1-99 |
| 3 | SCHED_RT RR | Real-time, round-robin, priority 1-99 |
| 4 | SCHED_NORMAL (EEVDF) | Standard tasks (Section 7.1) |
| 5 | SCHED_BATCH | CPU-bound batch jobs (EEVDF with longer time slices) |
| 6 (lowest) | SCHED_IDLE | Run only when nothing else is runnable |
Note: Tasks with `intent::LATENCY_SENSITIVE` receive a forward-eligible offset within EEVDF (row 4 / SCHED_NORMAL). They do NOT form a separate scheduling class. The intent annotation adjusts the task's vruntime positioning via `place_entity()` on wakeup, giving latency-sensitive tasks priority within EEVDF without promoting them above standard EEVDF tasks in the class hierarchy.
Intent conflict resolution rules:
- Two SCHED_INTENT tasks competing for the same CPU: Schedule by EEVDF virtual time (same as SCHED_NORMAL). Annotations affect eligibility but not within-class ordering.
- `LATENCY_SENSITIVE` vs `THROUGHPUT` on same CPU: `LATENCY_SENSITIVE` runs first in the current scheduling quantum. `THROUGHPUT` tasks use the remaining time in the quantum.
- `POWER_EFFICIENT` hint: Migrate to an efficiency core (EAS decision, Section 7.2) if: (a) the latency budget allows the migration cost (~10-50 μs), AND (b) the efficiency core is not already at capacity. `POWER_EFFICIENT` is never honored if it would cause a `LATENCY_SENSITIVE` task to miss its target latency.
- `EXCLUSIVE_CPU` hint (two tasks competing): Round-robin at equal EEVDF priority. The hint is advisory — UmkaOS does not dedicate a CPU to a single SCHED_INTENT task unless it has an explicit CPU affinity set via `sched_setaffinity()`.
- `SCHED_INTENT` degradation: If the requested intent is unachievable (e.g., all CPUs committed to RT tasks), the task degrades to SCHED_NORMAL for that scheduling quantum. Degradation is logged to the observability layer (Section 20.1) and exposed via `/proc/PID/sched_intent_stats`.
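The class hierarchy from the selection table above can be expressed as a fixed pick order (a sketch with a simplified runnable-class query; the real scheduler polls per-class runqueues):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum SchedClass {
    Deadline, // CBS task with remaining budget and an active deadline
    RtFifo,
    RtRr,
    Normal,   // EEVDF; intent annotations adjust eligibility inside this class
    Batch,
    Idle,
}

/// Highest-priority class with at least one runnable task wins.
pub fn pick_next_class(runnable: &[SchedClass]) -> Option<SchedClass> {
    const ORDER: [SchedClass; 6] = [
        SchedClass::Deadline,
        SchedClass::RtFifo,
        SchedClass::RtRr,
        SchedClass::Normal,
        SchedClass::Batch,
        SchedClass::Idle,
    ];
    ORDER.iter().copied().find(|c| runnable.contains(c))
}
```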
7.10.5 The Optimization Loop¶
Every ~1 second (configurable):
1. IntentOptimizer collects metrics:
- Per-cgroup: actual latency (P50, P99), throughput, power
- Per-device: utilization, temperature, power
- Cluster-wide: node loads, memory pressure, network utilization
2. For each cgroup with intents:
a. Is the intent being met?
- latency_p99_actual <= target_latency_ns?
- throughput_actual >= target_ops_per_sec?
b. If not met → need more resources:
- Increase CPU allocation (raise cpu.weight or cpu.guarantee)
- Improve NUMA placement (migrate pages closer to running CPUs)
- Increase accelerator allocation (raise accel.compute.guarantee)
- Increase I/O priority
c. If met with headroom → can release resources:
- Reduce CPU allocation (lower cpu.weight)
- Lower frequency (save power)
- Free accelerator time for other workloads
3. Apply adjustments via existing cgroup knobs.
Intent layer is an OPTIMIZER that writes to existing imperative knobs.
It does NOT replace the imperative interface — it sits above it.
4. **Optimization algorithm** (gradient-descent-inspired, bounded):
- For each unmet intent, compute the **deficit**: e.g.,
`latency_deficit = latency_p99_actual - target_latency_ns`.
- Map deficit to resource adjustment via a **PD controller**:
`delta_weight = K_p × (deficit / target) + K_d × d(deficit / target) / dt`,
where `K_p` is a per-resource-type proportional gain constant (default:
K_p=0.5 for CPU weight, K_p=0.3 for frequency) and `K_d` is the
derivative gain (default: K_d=1.0 for CPU weight, K_d=0.6 for frequency).
See [Section 7.10](#intent-based-resource-management--stability-analysis) for the complete control law
and gain derivation.
- **Clamp** adjustments to avoid oscillation: each iteration adjusts
by at most ±20% of the current allocation (`MAX_INTENT_ADJUSTMENT = 0.20`).
Multiple iterations converge geometrically.
- **Convergence criterion**: intent is "met" when the metric is within
10% of the target for 3 consecutive measurement windows. Once met,
the optimizer enters a **hold** state for that cgroup (no further
adjustments until the metric drifts outside the 10% band).
- **`compute.weight` decomposition**: When the intent specifies
`accel.compute.weight`, the optimizer distributes across CPU and
accelerator proportionally to their current utilization ratio:
`cpu_share = cpu_util / (cpu_util + accel_util)`.
- **Safety bound**: The optimizer never reduces allocation below the
cgroup's `*.min` guarantee or above the `*.max` ceiling.
Stability controls (prevent oscillation/hunting):
- Hysteresis: don't adjust unless delta exceeds 10% of current value.
- Minimum hold time: no changes within 5 seconds of last adjustment.
- Damping: exponential backoff if last 3 adjustments didn't converge.
- Max adjustment rate: at most ±20% change per optimization cycle.
5. Conflicting intents: when multiple cgroups declare intents that cannot
all be satisfied simultaneously (insufficient resources):
- Intents are BEST-EFFORT, not guarantees.
- Priority follows existing cpu.weight / accel.compute.weight hierarchy.
- Higher-weight cgroups get intent satisfaction first.
- Unsatisfied intents are reported in intent.status (latency_met: false).
- The optimizer does NOT starve low-priority cgroups — it respects
existing cgroup min guarantees (cpu.min, memory.min).
6. Policy priority ordering (prevents conflicts between subsystems):
Priority (highest to lowest):
a. Hardware limits (thermal throttle, voltage limits) — immutable
b. Admin-configured cgroup limits (cpu.max, power.max) — hard ceiling
c. Power budget enforcement (Section 7.4) — watt cap
d. Intent optimization (this section) — soft optimization
e. EAS energy optimization (Section 7.1.5) — per-task core selection
Each layer can only adjust WITHIN the ceiling set by the layer above.
Power budget is a hard constraint; intents work within it. No oscillation.
When power budgeting and EAS conflict (e.g., power budgeting throttles a
CPU domain that EAS prefers), power budgeting takes precedence — the EAS
migration is deferred until the power budget is satisfied. This may
temporarily route tasks to less energy-efficient cores, but prevents
thermal throttling and power supply overload, which are correctness
constraints rather than optimization goals.
7. Log adjustments to /sys/kernel/umka/intent/adjustment_log
for observability and debugging.
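One optimizer tick for a single dimension can be sketched as a PD step with the ±20% clamp (a standalone model using the gains quoted above, not the kthread code):

```rust
const MAX_INTENT_ADJUSTMENT: f64 = 0.20; // per-cycle clamp from the text

pub struct PdStep {
    k_p: f64,        // proportional gain (0.5 for CPU weight)
    k_d: f64,        // derivative gain (1.0 for CPU weight)
    prev_error: f64, // normalized error from the previous tick (0.0 initially)
}

impl PdStep {
    pub fn new(k_p: f64, k_d: f64) -> Self {
        PdStep { k_p, k_d, prev_error: 0.0 }
    }

    /// Returns the adjusted allocation for one tick. `actual`/`target` are
    /// in the metric's own units (e.g., ns of P99 latency); `dt_s` is the
    /// tick period in seconds.
    pub fn step(&mut self, actual: f64, target: f64, current_alloc: f64, dt_s: f64) -> f64 {
        let error = (actual - target) / target; // normalized deficit
        let d_error = (error - self.prev_error) / dt_s;
        self.prev_error = error;
        // delta_weight = K_p * (deficit/target) + K_d * d(deficit/target)/dt
        let delta = self.k_p * error + self.k_d * d_error;
        // Clamp to ±20% of the current allocation to prevent oscillation.
        let clamped = delta.clamp(-MAX_INTENT_ADJUSTMENT, MAX_INTENT_ADJUSTMENT);
        current_alloc * (1.0 + clamped)
    }
}
```

With a 20% latency deficit and K_p = 0.5, one tick raises the weight by 10%; an arbitrarily large deficit still moves the allocation by at most 20% per cycle, so convergence is geometric rather than oscillatory.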
The in-kernel inference engine (Section 22.6) powers the optimization. The "Intent I/O Scheduler" and "Intent Page Prefetch" models (Section 22.6) are use cases of intent-based management.
IntentOptimizer data structures.
The PD controller operates independently per resource dimension. Each dimension maps to one kernel control variable that the optimizer adjusts:
/// Resource dimension: identifies which resource knob the PD controller
/// tunes for a given cgroup. Each dimension has independent gain constants
/// (`k_p`, `k_d`) and error history, allowing different convergence rates
/// for CPU vs. I/O vs. accelerator workloads.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ResourceDim {
/// CPU weight (cgroup `cpu.weight`). The optimizer adjusts the
/// EEVDF weight to converge on `target_latency_ns`.
Cpu = 0,
/// Memory placement tier (DRAM vs. CXL vs. compressed). The optimizer
/// adjusts page promotion/demotion thresholds to meet the latency target.
Memory = 1,
/// Block I/O priority (`io.weight`, I/O scheduler class). Adjusts
/// I/O scheduling parameters to converge on `target_ops_per_sec`.
Io = 2,
/// Network bandwidth priority (TC qdisc class, BPF cgroup egress
/// rate). Adjusts to converge on network throughput or latency targets.
Net = 3,
/// Accelerator allocation (GPU/inference engine time-slice fraction).
/// Adjusts `AccelWeight` to converge on accelerator utilization targets.
Accel = 4,
}
impl ResourceDim {
/// Total number of resource dimensions. Used to size fixed-length arrays
/// (`[PdControllerState; ResourceDim::COUNT]`, `[f64; ResourceDim::COUNT]`).
pub const COUNT: usize = 5;
}
The IntentOptimizer is a singleton kernel subsystem that owns the optimization loop
state. It runs as a dedicated kernel thread (kthread/intent_optimizer) and is never
instantiated more than once.
/// Top-level state for the intent optimization subsystem.
///
/// One instance exists system-wide, owned by the `intent_optimizer` kernel
/// thread. All fields are accessed only from that thread except where noted.
pub struct IntentOptimizerState {
/// PD controller state per resource dimension.
/// Each dimension (CPU weight, frequency, accelerator allocation, I/O
/// priority) has independent gain constants and error history.
pub controller: [PdControllerState; ResourceDim::COUNT],
/// Per-cgroup intent control state. Keyed by cgroup ID (u64).
/// XArray: O(1) lookup by integer cgroup ID, RCU-compatible reads.
/// Allocated on first intent assignment, freed when all intents are cleared.
pub cgroup_states: XArray<IntentControlState>,
/// Monotonic nanosecond timestamp of the last optimizer tick.
/// Used to detect missed ticks and adjust derivative computation.
pub last_tick_ns: u64,
/// Whether the optimizer is in the "hold" state (all intents converged).
/// When true, the optimizer still runs on its timer but skips computation
/// unless a metric drifts outside the convergence band.
pub all_converged: bool,
/// Configuration: optimizer tick period in nanoseconds. Default: 1_000_000_000 (1s).
/// Adjustable at runtime via `/sys/kernel/umka/intent/tick_period_ns`.
pub tick_period_ns: u64,
}
/// PD controller state for one resource dimension (CPU weight, frequency, etc.).
///
/// # Floating-Point Usage
///
/// This struct contains `f64` fields. Floating-point arithmetic in kernel context
/// requires FPU state to be saved/restored across preemption points and context
/// switches. To avoid this overhead on the common path, this struct is only ever
/// accessed from the `IntentOptimizer` kthread — a dedicated kernel thread that
/// runs with FPU context enabled in task (non-interrupt) context.
///
/// **Forbidden contexts**: interrupt handlers, RCU callbacks, softirq handlers,
/// spinlock critical sections, or any preemption-disabled section. Violations cause
/// FPU state corruption on preemptible kernels.
///
/// Compile-time enforcement: `PdControllerState` is `!Send` (via `PhantomData<*mut ()>`)
/// — only the single `IntentOptimizer` kthread may hold a reference to it.
pub struct PdControllerState {
/// Proportional gain constant. Default depends on dimension
/// (0.5 for CPU weight, 0.3 for frequency). May be halved by the
/// exponential backoff mechanism after 3 consecutive non-convergent ticks.
pub k_p: f64,
    /// Derivative gain constant. Set to `k_p * tau_d` where `tau_d` is the
    /// dominant feedback delay for this dimension (see Section 7.10.5.1).
pub k_d: f64,
/// Setpoint: the target value that the controller drives toward.
/// For latency dimensions, this is `target_latency_ns`. For throughput,
/// `target_ops_per_sec`. Updated when the cgroup's intent changes.
pub setpoint: f64,
/// Normalized error from the previous optimizer tick.
/// Used to compute the discrete derivative `d_error[k]`.
/// Initialized to 0.0 on first tick after intent assignment.
pub prev_error: f64,
/// Accumulated integral term (reserved for future PID extension).
/// Currently unused — the optimizer runs a PD controller only.
/// Retained in the struct to avoid a layout change if PID is needed.
pub integral: f64,
/// Current effective gain multiplier, reduced by exponential backoff.
/// Starts at 1.0, halved after 3 consecutive non-convergent adjustments,
/// reset to 1.0 on convergence.
pub gain_multiplier: f64,
/// `!Send` marker: prevents this struct from being moved across threads.
/// `PdControllerState` must only be accessed from the `IntentOptimizer`
/// kthread, which holds FPU context. Transferring it to another thread
/// would bypass this invariant and risk FPU state corruption.
_not_send: core::marker::PhantomData<*mut ()>,
}
Observation flow: observe_scheduler_metric().
Scheduler observations flow into the optimizer through per-CPU observation rings. Each CPU writes metrics locally without contention; the optimizer thread reads them in bulk during its tick.
/// Write a scheduler observation to the per-CPU observation ring.
///
/// Called from the scheduler hot path (task tick, context switch, wakeup)
/// with preemption disabled. The write is O(1) — a single slot in a
/// pre-allocated per-CPU ring buffer. If the ring is full, the oldest
/// unread observation is silently overwritten (lossy under extreme load,
/// but the optimizer is statistical and tolerates dropped samples).
///
/// # Arguments
///
/// * `cpu` — The CPU producing the observation. Must be the current CPU
/// (enforced by the `PreemptGuard` the caller holds).
/// * `metric_type` — The type of metric being reported.
/// * `value` — The metric value (nanoseconds for latency, count for
/// throughput, 0-1024 for utilization).
///
/// # Performance
///
/// ~5-10 ns per call (single cache-line write to per-CPU ring). Zero
/// allocation. No locks. The per-CPU ring is sized to hold 256 entries
/// (one cache line per entry), sufficient for ~250ms of observations at
/// 1000 observations/second before wraparound.
pub fn observe_scheduler_metric(cpu: CpuId, metric_type: SchedMetricType, value: u64) {
let ring = per_cpu!(sched_observation_ring, cpu);
ring.push(SchedObservation {
timestamp_ns: arch::current::cpu::read_timestamp(),
metric_type,
value,
});
}
/// Scheduler metric types reported to the intent optimizer.
#[derive(Copy, Clone, Debug)]
pub enum SchedMetricType {
/// Task wakeup-to-run latency in nanoseconds.
WakeupLatency,
/// Run queue depth (number of runnable tasks) at observation time.
RunQueueDepth,
/// CPU utilization (PELT-smoothed, 0-1024 scale).
CpuUtilization,
/// Context switch count since last observation.
ContextSwitchCount,
/// EAS migration decision (1 = migrated to efficiency core, 0 = stayed).
EasMigration,
/// Accelerator compute utilization (0-1000 scale, per-device).
/// Reported by the accelerator scheduler ([Section 22.1](22-accelerators.md#unified-accelerator-framework)).
AccelUtilization,
/// Accelerator command queue depth (per-device).
AccelQueueDepth,
/// Accelerator power draw in milliwatts (per-device).
AccelPowerDraw,
/// Accelerator temperature in millidegrees Celsius (per-device).
AccelTemperature,
}
/// A single scheduler observation written to the per-CPU ring.
///
/// Aligned to cache line (64 bytes) to prevent false sharing between per-CPU
/// observation slots. The padding beyond the ~24 bytes of data fields is the
/// cost of contention-free concurrent writes from independent CPUs.
#[repr(C, align(64))]
pub struct SchedObservation {
/// Monotonic timestamp in nanoseconds.
pub timestamp_ns: u64,
/// The metric being reported.
pub metric_type: SchedMetricType,
/// The metric value.
pub value: u64,
}
// kernel-internal ML observation. align(64) pads to 64 bytes (one cache line).
const_assert!(size_of::<SchedObservation>() == 64);
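The lossy ring semantics described in `observe_scheduler_metric()` (O(1) push, oldest unread entry silently overwritten when full) can be sketched in userspace. The `ObservationRing` type below is our simplified stand-in for the kernel's per-CPU ring, with a toy capacity of 4 instead of 256:

```rust
// Minimal sketch of a lossy observation ring: a fixed-size buffer where a
// full ring overwrites the oldest unread entry, so the writer (the
// scheduler hot path) never blocks and never allocates.
struct ObservationRing {
    slots: [u64; 4], // toy capacity; the real per-CPU ring holds 256 entries
    head: usize,     // next slot to write (and overwrite, when full)
    len: usize,      // unread entries, capped at capacity
}

impl ObservationRing {
    fn new() -> Self {
        ObservationRing { slots: [0; 4], head: 0, len: 0 }
    }

    // O(1) push; silently drops the oldest unread entry when full.
    fn push(&mut self, value: u64) {
        self.slots[self.head] = value;
        self.head = (self.head + 1) % self.slots.len();
        if self.len < self.slots.len() {
            self.len += 1;
        }
    }

    // Pop the oldest unread entry, if any (consumer side: optimizer drain).
    fn pop(&mut self) -> Option<u64> {
        if self.len == 0 {
            return None;
        }
        let cap = self.slots.len();
        let tail = (self.head + cap - self.len) % cap;
        self.len -= 1;
        Some(self.slots[tail])
    }
}

fn main() {
    let mut ring = ObservationRing::new();
    for v in 1..=6 {
        ring.push(v); // capacity 4: observations 1 and 2 are overwritten
    }
    assert_eq!(ring.pop(), Some(3));
    assert_eq!(ring.pop(), Some(4));
    assert_eq!(ring.pop(), Some(5));
    assert_eq!(ring.pop(), Some(6));
    assert_eq!(ring.pop(), None);
}
```

Dropping the oldest (rather than rejecting the newest) keeps the ring biased toward recent samples, which is what a statistical consumer wants.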
Pipeline to KernelObservation bus:
observe_scheduler_metric() is a typed convenience wrapper around the
observe_kernel! macro (Section 23.1).
It writes directly to the per-CPU ObservationRing as a KernelObservation with
subsystem = SubsystemId::Scheduler and the metric type mapped to obs_type.
There is no separate SchedObservation ring — the SchedObservation struct is
the internal format that observe_scheduler_metric() converts to
KernelObservation::features[] before pushing to the shared ring:
- `features[0]` = `metric_type` as `i32`
- `features[1]` = `value` low 32 bits as `i32`
- `features[2]` = `value` high 32 bits as `i32` (for values > 2^31)
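The lane split round-trips losslessly. A sketch of the conversion (helper names `pack_value`/`unpack_value` are ours, not kernel API):

```rust
// Round-trip of the features[] packing: a u64 metric value split into two
// i32 lanes (low / high 32 bits) and reconstructed on the consumer side.
fn pack_value(value: u64) -> (i32, i32) {
    let lo = (value & 0xFFFF_FFFF) as u32 as i32; // features[1]
    let hi = (value >> 32) as u32 as i32;         // features[2]
    (lo, hi)
}

fn unpack_value(lo: i32, hi: i32) -> u64 {
    // Cast through u32 so sign extension cannot corrupt the lanes.
    ((hi as u32 as u64) << 32) | (lo as u32 as u64)
}

fn main() {
    // A latency sample larger than 2^31 ns survives the i32 lanes intact.
    let value: u64 = 5_000_000_000;
    let (lo, hi) = pack_value(value);
    assert_eq!(unpack_value(lo, hi), value);

    // A value with the low lane's sign bit set also round-trips.
    let value: u64 = 0xFFFF_FFFF;
    let (lo, hi) = pack_value(value);
    assert_eq!(unpack_value(lo, hi), value);
}
```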
The Tier 2 ML policy service reads KernelObservation entries from the
shared ObservationRing (4096 entries, mmap'd read-only). This single-ring
design avoids a second ring and an intermediate aggregation thread.
The observe_kernel! macro (zero-cost when no consumer is attached) is the generic
entry point; observe_scheduler_metric() is the scheduler-specific typed wrapper.
Parameter update propagation: apply_policy_update().
When the optimizer (or an external Tier 2 ML policy service) computes a new parameter value, it flows through a validated, bounded update path:
/// Apply a policy-driven parameter update to a kernel tunable.
///
/// This is the sole entry point for runtime parameter changes from both
/// the in-kernel intent optimizer and external Tier 2 policy services.
/// All updates are validated and bounded before application.
///
/// # Validation
///
/// 1. The parameter must be registered in the Kernel Tunable Parameter
/// Store ([Section 23.1](23-ml-policy.md#aiml-policy-framework-closed-loop-kernel-intelligence)).
/// 2. The new value must be within the parameter's declared `[min, max]`
/// bounds. Out-of-bounds values are clamped (not rejected) and the
/// clamping is logged.
/// 3. The rate of change must not exceed `MAX_INTENT_ADJUSTMENT` (20%)
/// per optimizer tick. If the requested change exceeds this, it is
/// clamped to the maximum rate.
/// 4. The caller must hold `CAP_ML_TUNE` (for Tier 2 services) or be
/// the intent optimizer kernel thread (implicitly authorized).
///
/// # Propagation
///
/// After validation, the update is converted to fixed-point integer
/// representation and written to the parameter's `AtomicI64` backing
/// store in the tunable registry. The conversion uses a fixed-point
/// scale factor:
///
/// ```text
/// PARAM_SCALE = 1_000_000 (microsecond / permille precision)
/// stored = AtomicI64::store((value * PARAM_SCALE).round() as i64, Relaxed)
/// read = AtomicI64::load(Relaxed) as f64 / PARAM_SCALE
/// ```
///
/// This avoids `AtomicF64` (which does not exist in the Rust atomic
/// model) while preserving sub-ppm precision for all tunable parameters.
/// The scale factor is a global constant shared by all `KernelTunableParam`
/// consumers — there is no per-parameter scale. `PARAM_SCALE = 1_000_000`
/// gives 6 decimal digits of precision, sufficient for all scheduler
/// and policy parameters (whose bounds are expressed in permille or
/// microseconds).
///
/// The affected subsystem reads the new value on its next access — there
/// is no explicit notification. This is safe because all tunable parameters
/// are designed to be read speculatively (the scheduler reads
/// `eevdf_weight_scale` on every `pick_next_task`, not cached).
///
/// # Auto-decay
///
/// Every parameter update carries an expiry timestamp. If no new update
/// arrives within the expiry window (default: 60 seconds), the parameter
/// reverts to its compiled-in default. This prevents a crashed ML service
/// from leaving the kernel in a mis-tuned state indefinitely.
/// Fixed-point scale factor for f64 ↔ AtomicI64 conversion.
///
/// All `KernelTunableParam` values are stored as `(value * PARAM_SCALE).round() as i64`
/// in the `AtomicI64` backing store. Consumers reconstruct f64 via
/// `param.current.load(Relaxed) as f64 / PARAM_SCALE as f64`.
///
/// 1_000_000 gives 6 decimal digits of precision — sufficient for all
/// scheduler parameters (permille gains, microsecond latencies).
pub const PARAM_SCALE: i64 = 1_000_000;
/// # FPU Safety
///
/// This function uses `f64` arithmetic. It MUST be called ONLY from the
/// `IntentOptimizer` kthread or a Tier 2 policy service's kernel entry
/// point (which runs in task context with FPU enabled). It MUST NOT be
/// called from interrupt context, softirq, RCU callbacks, or any
/// preemption-disabled section.
///
/// The `IntentOptimizer` kthread calls `kernel_fpu_begin()` before
/// entering its policy evaluation loop and `kernel_fpu_end()` on exit.
/// This saves/restores the user-mode FPU state once per evaluation
/// cycle (not per parameter update), amortizing the cost across all
/// parameter updates in the cycle.
///
/// Debug builds assert `current_thread_is_intent_optimizer()` at entry.
pub fn apply_policy_update(
param: &KernelTunableParam,
value: f64,
expiry_ns: u64,
) -> Result<(), PolicyUpdateError> {
// 1. Bounds check and clamp (in f64 domain, against scaled-back bounds).
let min_f = param.min_value as f64 / PARAM_SCALE as f64;
let max_f = param.max_value as f64 / PARAM_SCALE as f64;
let clamped = value.clamp(min_f, max_f);
if (clamped - value).abs() > f64::EPSILON {
log_clamped_update(param, value, clamped);
}
// 2. Rate-of-change check and clamp.
let current_i = param.current.load(core::sync::atomic::Ordering::Relaxed);
let current = current_i as f64 / PARAM_SCALE as f64;
let max_delta = current.abs() * MAX_INTENT_ADJUSTMENT;
let delta = (clamped - current).clamp(-max_delta, max_delta);
let final_value = current + delta;
// 3. Convert to fixed-point and write to AtomicI64 backing store.
let stored = (final_value * PARAM_SCALE as f64).round() as i64;
param.set_with_expiry(stored, expiry_ns);
Ok(())
}
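The validation pipeline (bounds clamp, rate-of-change clamp, fixed-point store) can be exercised in isolation. The sketch below is a userspace simulation under the constants from the text; a plain `i64` stands in for the `AtomicI64` registry slot, and `apply_update` is our illustrative helper, not the kernel function:

```rust
// Userspace sketch of the apply_policy_update() path: bounds clamp,
// ±20% rate-of-change limit, fixed-point conversion to the i64 store.
const PARAM_SCALE: i64 = 1_000_000;
const MAX_INTENT_ADJUSTMENT: f64 = 0.20;

fn apply_update(current_stored: i64, min: f64, max: f64, requested: f64) -> i64 {
    // 1. Bounds clamp in the f64 domain.
    let clamped = requested.clamp(min, max);
    // 2. Rate-of-change clamp: at most ±20% of the current value per tick.
    let current = current_stored as f64 / PARAM_SCALE as f64;
    let max_delta = current.abs() * MAX_INTENT_ADJUSTMENT;
    let delta = (clamped - current).clamp(-max_delta, max_delta);
    // 3. Fixed-point conversion back to the i64 backing store.
    ((current + delta) * PARAM_SCALE as f64).round() as i64
}

fn main() {
    // Current value 100.0; a request for 500.0 is rate-limited to 120.0.
    let stored = 100 * PARAM_SCALE;
    assert_eq!(apply_update(stored, 0.0, 1000.0, 500.0), 120 * PARAM_SCALE);
    // A request below the declared minimum is first clamped to the bound.
    assert_eq!(apply_update(stored, 90.0, 1000.0, 10.0), 90 * PARAM_SCALE);
}
```

Note the ordering: bounds are applied before the rate limit, so an out-of-bounds request can never "borrow" extra rate headroom from the clamp.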
Wake mechanism: the optimizer kernel thread.
The intent optimizer runs as a dedicated kernel thread (kthread/intent_optimizer)
created during boot. It does not busy-poll — it sleeps and is woken by one of two
mechanisms:
- **Periodic timer.** A high-resolution timer fires every `tick_period_ns` (default: 1 second). This is the primary wake source during normal operation. The timer is set with `HRTIMER_MODE_REL` and re-armed at the end of each optimizer tick to avoid drift accumulation.
- **Threshold-crossing event.** When a per-CPU observation ring detects that a metric has crossed a critical threshold (e.g., wakeup latency exceeds 2x the target for any cgroup with an active latency intent), it sets an atomic flag (`intent_optimizer_wake_pending`) and sends an IPI to the CPU running the optimizer thread. This wakes the optimizer ahead of the next timer tick, enabling faster response to sudden load changes. The threshold check is a single comparison in the `observe_scheduler_metric()` hot path and is skipped entirely when no cgroup has active intents (checked via a global atomic `active_intent_count`).
Threshold Crossing Detection and Wake Protocol:
The intent optimizer is a background kernel thread (umka_intent_optimizer, SCHED_IDLE priority) that re-evaluates scheduling policy hints based on observed workload behavior.
Threshold monitoring: Each schedulable entity tracks a set of behavioral metrics updated by the scheduler hot path:
- runtime_last_window_us: CPU time consumed in the last 100ms window.
- iowait_fraction: fraction of runnable time spent in I/O wait.
- cache_miss_rate: L3 miss rate sampled via per-CPU PMU (updated every 10ms tick).
- wakeup_rate_hz: wakeups per second (exponential moving average).
Threshold crossing detection (lock-free, hot path): On each scheduler tick, a fast check compares each metric against its registered threshold. The comparison uses a hysteresis band (5% above the "enter" threshold, 5% below the "exit" threshold) to prevent flapping. The check is implemented as:
if (abs(metric - threshold) > hysteresis && !already_pending):
per_cpu(intent_sample_pending) = true
// No IPI yet — piggyback on the next scheduler yield point.
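One way to realize the 5% enter/exit band is a two-threshold monitor: the pending flag arms only above the enter threshold and disarms only below the exit threshold, so samples oscillating inside the band cannot flap. A minimal sketch (the `ThresholdMonitor` type is our illustration, not the kernel's data structure):

```rust
// Sketch of the 5% hysteresis band: arm 5% above the registered
// threshold, disarm 5% below it. Values inside the band keep the
// current state, preventing flapping on noisy metrics.
struct ThresholdMonitor {
    threshold: f64,
    pending: bool,
}

impl ThresholdMonitor {
    fn observe(&mut self, metric: f64) -> bool {
        let enter = self.threshold * 1.05; // 5% above: arm
        let exit = self.threshold * 0.95;  // 5% below: disarm
        if !self.pending && metric > enter {
            self.pending = true; // wake deferred to next scheduler yield point
        } else if self.pending && metric < exit {
            self.pending = false;
        }
        self.pending
    }
}

fn main() {
    let mut mon = ThresholdMonitor { threshold: 100.0, pending: false };
    assert!(!mon.observe(103.0)); // inside the band: no trigger
    assert!(mon.observe(106.0));  // crossed enter threshold: pending set
    assert!(mon.observe(98.0));   // inside the band: still pending
    assert!(!mon.observe(94.0));  // below exit threshold: cleared
}
```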
IPI targeting (deferred, not on hot path):
The intent optimizer thread sleeps on intent_wait_queue. It is woken by:
1. Scheduler yield points: schedule() checks per_cpu(intent_sample_pending) for the current CPU and calls intent_wakeup_if_pending() — a non-IPI local wakeup (enqueues the optimizer thread on the current CPU's runqueue if it's not already runnable). O(1), no cross-CPU traffic.
2. Cross-CPU threshold: If a threshold was crossed on CPU B but CPU B's scheduler is idle (no local yield point imminent), the timer tick on CPU B sends a SCHEDULER_KICK IPI to the optimizer thread's home CPU (the optimizer is pinned to CPU 0 by default, or the least-loaded CPU if configured). IPIs are rate-limited to 1 per 100ms per crossing to prevent IPI storms.
Batch processing: The optimizer wakes at most 100 times per second (configurable). Each wakeup processes up to 64 pending threshold crossings from all CPUs before returning to sleep.
/// The intent optimizer kernel thread entry point.
///
/// This function runs in an infinite loop, sleeping between optimizer
/// ticks. It is created once during boot and runs for the lifetime of
/// the kernel.
fn intent_optimizer_thread() -> ! {
    let mut state = IntentOptimizerState::new();
let timer = HrTimer::new(state.tick_period_ns, HrTimerMode::Rel);
loop {
// Sleep until timer fires or threshold-crossing wakes us early.
timer.wait_or_wake(&INTENT_OPTIMIZER_WAKE_PENDING);
// Drain all per-CPU observation rings into aggregated metrics.
for cpu in 0..num_online_cpus() {
let ring = per_cpu!(sched_observation_ring, cpu);
while let Some(obs) = ring.pop() {
state.aggregate_observation(cpu, &obs);
}
}
        // Run the optimization loop (steps 1-6 from the pseudocode above).
        // Snapshot the cgroup IDs first so the loop body can borrow
        // `state` mutably without aliasing the iterator.
        let cgroup_ids: Vec<u64> = state.cgroup_states.keys().collect();
        for cgroup_id in cgroup_ids {
            let metrics = state.collect_cgroup_metrics(cgroup_id);
            let adjustments = state.compute_adjustments(cgroup_id, &metrics);
            state.apply_cgroup_adjustments(cgroup_id, &adjustments);
        }
// Re-arm the periodic timer.
timer.rearm(state.tick_period_ns);
}
}
The optimizer thread runs at SCHED_IDLE priority (consistent with its declaration above). It is
not a real-time thread — it must not interfere with RT, deadline, or normal workloads. If the
optimizer tick takes longer than expected (e.g., due to a large number of intent cgroups),
the next tick is simply delayed — there is no attempt to "catch up" missed ticks, as the
PD controller is designed for variable tick rates.
7.10.5.1 Stability Analysis¶
Control law.
The proportional control law from step 4 above is augmented with a derivative term to
provide active damping. The complete PD control law for each resource dimension r is:
error[k] = deficit[k] / target[r] // normalized error at tick k
d_error[k] = (error[k] - error[k-1]) / T // discrete derivative (T = tick period)
raw_adjustment = K_p[r] × error[k] + K_d[r] × d_error[k]
adjustment[r] = clamp(raw_adjustment, -MAX_INTENT_ADJUSTMENT, +MAX_INTENT_ADJUSTMENT)
where:
| Parameter | Value | Meaning |
|---|---|---|
| `T` | 1 s | Optimizer tick period (see Section 7.7.6) |
| `K_p` (CPU weight) | 0.5 | Proportional gain, CPU weight dimension |
| `K_p` (frequency) | 0.3 | Proportional gain, frequency dimension |
| `K_d` (CPU weight) | 1.0 | Derivative gain = `K_p × τ_d` where `τ_d` = 2 s (dominant delay, see below) |
| `K_d` (frequency) | 0.6 | Derivative gain for frequency dimension |
| `MAX_INTENT_ADJUSTMENT` | 0.20 | Maximum fractional change per tick (20% of current allocation) |
MAX_INTENT_ADJUSTMENT = 0.20 means the optimizer will never increase or decrease a
cgroup's allocation by more than 20% of its current value in a single tick, regardless
of the magnitude of the computed adjustment. For example, a cgroup at cpu.weight = 100
can move at most to [80, 120] in one optimizer cycle.
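A worked instance of the control law, using the documented gains for the CPU-weight dimension (`K_p` = 0.5, `K_d` = 1.0, `T` = 1 s); the observed latency values are illustrative:

```rust
// One tick of the PD control law: normalized error, discrete derivative,
// and the ±MAX_INTENT_ADJUSTMENT clamp.
const MAX_INTENT_ADJUSTMENT: f64 = 0.20;

fn pd_adjustment(error: f64, prev_error: f64, k_p: f64, k_d: f64, t: f64) -> f64 {
    let d_error = (error - prev_error) / t;
    let raw = k_p * error + k_d * d_error;
    raw.clamp(-MAX_INTENT_ADJUSTMENT, MAX_INTENT_ADJUSTMENT)
}

fn main() {
    // Target P99 latency 5 ms, observed 8 ms: normalized error 0.6.
    let error = (8.0 - 5.0) / 5.0;
    // First tick after intent assignment: prev_error starts at 0.0, so the
    // derivative term adds another 0.6 and the raw adjustment (0.9) is
    // clamped to +0.20.
    let adj = pd_adjustment(error, 0.0, 0.5, 1.0, 1.0);
    assert_eq!(adj, MAX_INTENT_ADJUSTMENT);
    // A cgroup at cpu.weight = 100 therefore moves to at most 120.
    let new_weight = 100.0 * (1.0 + adj);
    assert!((new_weight - 120.0).abs() < 1e-9);
}
```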
Dominant delay estimate.
The feedback signals observed by the optimizer are:
| Signal | Measurement latency | Settling time |
|---|---|---|
| CPU utilization (PELT) | ~40 ms (PELT decay constant) | ≤ 200 ms |
| P99 latency (sliding window) | Configurable; default 1 s window | ≤ 1 s |
| Memory pressure (PSI) | 10 s exponential window | ≤ 10 s |
| Temperature (RAPL / ACPI thermal) | 1 s read interval (Section 7.4.1) | ≤ 2 s |
| I/O utilization (iostat-style) | 250 ms sampling | ≤ 500 ms |
The dominant (largest) delay in the loop is the PSI memory pressure signal with a
settling time of up to 10 seconds. However, the intent optimizer only acts on PSI as a
secondary, advisory input — it does not directly adjust memory allocation in response to
PSI (that is the responsibility of the memory reclaim path, Section 4). For the primary
control dimensions (CPU weight and frequency), the dominant delay is the P99 latency
window of at most 1 second, and the RAPL/thermal signal at ≤ 2 seconds. Setting
τ_d = 2 s for the derivative gain is therefore conservative (larger than needed for
CPU control, but correct for the temperature feedback path).
Nyquist stability condition.
For a proportional-only controller in a sampled-data loop with tick period T and
signal delay τ_d, the Nyquist stability criterion requires T ≥ 2 × τ_d: the tick
period must be at least twice the dominant feedback delay, so that the effect of
one adjustment is observed before the next adjustment is applied.
For the primary CPU-weight control path: T = 1 s, τ_d_primary ≤ 1 s (P99 window).
The Nyquist condition 1 s ≥ 2 × 1 s is not satisfied by the P99 window alone,
which is why the derivative term is required.
The PD controller transforms the open-loop transfer function. With K_d = K_p × τ_d,
and the illustrative values K_p = 0.5, τ_d = 2, T = 1, the discrete-time analysis is:
Stability analysis for discrete-time PD controller:
Closed-loop characteristic equation: 1 + G_PD(z) × z⁻¹ = 0
where G_PD(z) = k_p + k_d × (1 - z⁻¹) = 1.5 - 1.0z⁻¹ [example gains: k_p=0.5, τ_d=2, T=1]
Expanding: 1 + (1.5 - z⁻¹) × z⁻¹ = 0
→ 1 + 1.5z⁻¹ - z⁻² = 0
→ multiply by z²: z² + 1.5z - 1.0 = 0 [characteristic polynomial, degree 2]
Jury stability criterion for z² + bz + c (b = 1.5, c = -1.0):
Condition 1: |c| < 1 → |-1.0| = 1.0 — on the stability boundary, not strictly inside it [marginally stable at these example gains]
Note: The gains above (k_p = 0.5, τ_d = 2) are illustrative only. Production gain
selection is performed numerically at deployment time via the umka-intent-tuner tool,
which sweeps the gain space and verifies Jury stability for the measured plant delay on
each hardware configuration. The architecture document shows the stability analysis
methodology, not fixed production gains.
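The sweep such a tuner performs can be sketched directly from the characteristic polynomial above. Reading the coefficients off the expansion (for T = 1, `b = k_p + k_d` and `c = -k_d`), the degree-2 Jury criterion reduces to three strict inequalities; the "stable" gain pair below is our illustration, not a production value:

```rust
// Numeric Jury check for the degree-2 characteristic polynomial
// z² + bz + c, with b = k_p + k_d and c = -k_d (T = 1). A sketch of the
// per-gain-pair test a deployment-time sweep could run.
fn jury_stable(k_p: f64, k_d: f64) -> bool {
    let b = k_p + k_d;
    let c = -k_d;
    // All roots strictly inside the unit circle iff:
    //   |c| < 1, P(1) = 1 + b + c > 0, and P(-1) = 1 - b + c > 0.
    c.abs() < 1.0 && (1.0 + b + c) > 0.0 && (1.0 - b + c) > 0.0
}

fn main() {
    // The illustrative gains from the text (k_p = 0.5, k_d = 1.0) sit on
    // the stability boundary (|c| = 1), so the strict test rejects them.
    assert!(!jury_stable(0.5, 1.0));
    // A more conservative pair satisfies all three conditions.
    assert!(jury_stable(0.3, 0.3));
}
```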
Gain margin and stability boundary.
The gain margin is the factor by which K_p can increase before the closed loop
becomes unstable. For the CPU-weight PD controller (K_d = 2 K_p), numerical analysis
gives a gain margin of approximately 3× (i.e., K_p up to ~1.5 before instability). The
chosen K_p = 0.5 is well within this margin.
If feedback pathology causes error[k] to grow without bound — for example, a bug in
the P99 latency measurement that returns extreme values — the MAX_INTENT_ADJUSTMENT
clamp prevents runaway:
// In every optimizer tick, regardless of computed adjustment magnitude:
let adjustment = raw_adjustment.clamp(-MAX_INTENT_ADJUSTMENT, MAX_INTENT_ADJUSTMENT);
Concretely: even if the proportional and derivative terms compute a raw adjustment of
+5.0 (500% increase), the clamped adjustment is +0.20 (20% increase). The system
cannot diverge faster than geometric growth at rate 1.20 per second, and the existing
5-second minimum hold time (stability controls in step 4) further limits the maximum
achievable divergence rate to 1.20^(1/5) ≈ 1.038 per second — less than 4% per second
even in a fully pathological feedback scenario.
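The divergence-rate arithmetic is easy to check numerically (the bound is 1.20^(1/5), i.e., below 4% per second):

```rust
// Verify the worst-case divergence rate with the 5-second minimum hold:
// at most one clamped +20% adjustment per 5 seconds amortizes to
// 1.20^(1/5) per second.
fn main() {
    let per_second_rate = 1.20_f64.powf(1.0 / 5.0);
    assert!(per_second_rate < 1.04); // less than 4% per second
    assert!(per_second_rate > 1.03);
}
```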
State carried across ticks.
Each controlled cgroup retains:
struct IntentControlState {
/// Normalized error from the previous optimizer tick (for derivative computation).
prev_error: [f64; ResourceDim::COUNT],
/// Number of consecutive ticks within the 10% convergence band.
converged_ticks: u32,
/// Monotonic nanosecond timestamp of the last applied adjustment.
last_adjustment_ns: u64,
/// Number of consecutive adjustments that failed to improve the metric (for backoff).
backoff_count: u32,
}
prev_error is initialized to 0.0 on the first tick after an intent is set, so the
derivative term contributes zero on the first control step. This avoids a derivative
kick from the initial large error value.
Interaction with existing stability controls.
The stability controls in step 4 complement the PD controller:
- **Hysteresis (10% band)**: prevents the derivative term from amplifying measurement noise. When `|error[k]| < 0.10`, both `K_p × error` and `K_d × d_error` are zeroed (no adjustment applied). This is equivalent to a deadband in the control law.
- **5-second minimum hold time**: provides a floor on the effective tick rate, giving the plant time to respond before the next adjustment. This is equivalent to adding latency to the feedback path, making the system behave as if `T_eff = max(T, 5 s)` for purposes of the Nyquist criterion.
- **Exponential backoff (3-miss rule)**: after 3 consecutive adjustments without convergence, `K_p` is halved for subsequent ticks until convergence is achieved. This adaptive gain reduction handles unmodelled plant dynamics (e.g., a workload whose latency is bottlenecked on I/O, not CPU — increasing CPU weight does nothing).
- **±20% max rate (`MAX_INTENT_ADJUSTMENT`)**: as analyzed above, provides the runaway-prevention bound.
Together, these mechanisms give the intent optimizer the stability properties of an overdamped second-order system: it converges monotonically (no overshoot in the normal case) with a time constant of approximately 3–5 optimizer ticks (3–5 seconds). This is appropriate for a slow-path resource allocation controller — fast enough to respond to workload shifts within 10–15 seconds, slow enough to avoid the thrashing that a tighter controller would produce.
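The monotone, few-tick convergence can be demonstrated on a toy plant. The sketch below runs the proportional term with the 10% deadband and ±20% clamp against a plant whose P99 latency scales inversely with `cpu.weight`; the plant model and its constant are illustrative, not from the kernel:

```rust
// Toy closed-loop run: proportional control with the 10% deadband and
// ±20% per-tick clamp, driving cpu.weight until latency is in band.
fn main() {
    let target_ms = 5.0;
    let plant = |weight: f64| 800.0 / weight; // latency_ms = k / weight (toy)
    let (k_p, deadband, max_adj) = (0.5, 0.10, 0.20);

    let mut weight = 100.0; // starting cpu.weight → 8 ms observed latency
    let mut ticks = 0;
    loop {
        let error = (plant(weight) - target_ms) / target_ms;
        if error.abs() < deadband {
            break; // inside the convergence band: no adjustment applied
        }
        let adj = (k_p * error).clamp(-max_adj, max_adj);
        weight *= 1.0 + adj;
        ticks += 1;
        assert!(ticks < 20, "failed to converge");
    }
    // Converges within a few ticks, final latency inside the ±10% band.
    assert!(ticks <= 5);
    assert!((plant(weight) - target_ms).abs() / target_ms < deadband);
}
```

In this run the loop settles in three adjustments (weight 100 → 120 → 140 → 150), with each step monotonically shrinking the error, matching the overdamped behavior described above.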
Intent Admission Control:
Intents are advisory, not guaranteed. When a cgroup sets intent.latency_ns = 5000000
(5ms), the kernel attempts to meet it but does not reject the intent if resources are
insufficient. Instead:
- If the intent cannot be met: intent.status reports latency_met: false with the
actual observed P99 latency.
- Clamping: intent values are clamped to physically achievable bounds. An
intent.latency_ns = 1 (1 nanosecond) is silently clamped to the system's minimum
achievable scheduling latency (~10μs on a typical x86 system).
- Contradictions (e.g., intent.latency_ns = 1000 with intent.efficiency = 100)
are logged as warnings in intent.status with contradiction: latency_vs_efficiency.
The optimizer prioritizes the latency target.
Intent Feedback:
The intent.status file (defined above) provides real-time feedback on intent
satisfaction. See the intent.status definition in the cgroup interface listing above
for the full field set.
Multi-Tenant Isolation:
The cgroup hierarchy IS the authority for resource isolation. Intents operate within
existing cgroup limits:
- A child cgroup's intent cannot cause resource consumption exceeding the parent's
cpu.max, memory.max, or power.max limits.
- Parent limits are a hard ceiling. Intents are soft optimization within that ceiling.
- Cross-tenant interference is prevented by existing cgroup isolation — the intent
optimizer adjusts knobs for one cgroup without affecting other cgroups' guarantees
(cpu.min, memory.min are respected).
7.10.6 Performance Impact¶
The optimization loop runs once per second as a background kernel thread. Each iteration reads per-cgroup metrics, runs the inference engine, and writes adjusted parameters. The total cost scales linearly with the number of cgroups that have active intents:
| Active intent cgroups | Typical iteration cost | Notes |
|---|---|---|
| ≤16 | ~100-500 μs | Sub-millisecond; inference engine inline |
| 17-100 | ~1-5 ms | Still negligible as a fraction of the 1-second period |
| >100 | ~10-50 ms | Should use `intent_optimizer_batch_size` to split |
For the default case (≤50 cgroups), the amortised overhead is well under 0.5% CPU.
The actual scheduling/allocation decisions use the same fast paths as before. Only the cgroup parameters change. Hot-path performance: identical to Linux.
7.10.7 Explainability Interface¶
Intent optimization (Section 7.10) reports whether intents are met and what adjustments were made, but administrators also need to understand why a specific performance target is not being met, what the system tried and rejected, and what action they could take to help. The explainability interface provides this deep diagnostic view.
sysfs interface (per-cgroup, read-only):
/sys/fs/cgroup/<group>/intent.explain
bottleneck: cpu|memory|io|accelerator|power|network
bottleneck_detail: "CPU saturated: 4/4 cores at 100%, cpu.max reached"
adjustments_attempted: 5
adjustments_rejected: 2
rejected_reasons: ["cpu.max ceiling reached", "power budget Section 7.4 constraint"]
recommendation: "Increase cpu.max from 400000 to 600000"
conflicting_intents: ["cgroup:/prod/db has higher cpu.weight, consuming 3/4 cores"]
Each field is populated by the optimization loop (Section 7.10) at the end of each cycle. The
bottleneck field identifies the single most constrained resource. The recommendation
field suggests the smallest configuration change that would allow the intent to be met.
The conflicting_intents field lists other cgroups whose intents are competing for the
same resource.
Structured tracepoint: umka_tp_stable_intent_explain is emitted every optimization
cycle for cgroups with unmet intents. Fields: cgroup path, intent type, target value,
actual value, bottleneck type, attempted adjustments (count), rejection reasons (array).
This enables perf / BPF-based monitoring of intent optimization across the system.
Adjustment history log: Per-cgroup ring buffer of the last 64 adjustments, exposed via:
/sys/fs/cgroup/<group>/intent.adjustment_history
# Each entry:
#   timestamp: 1708012345.123456
#   parameter: cpu.max
#   old_value: 400000
#   new_value: 500000
#   reason: "latency_p99 target 5ms, actual 8ms, cpu was bottleneck"
#   effect: "latency_p99 dropped from 8ms to 4.2ms at next cycle"
The ring buffer is fixed-size (64 entries, ~8 KiB per cgroup) and wraps around. It provides a complete causal trail: what changed, why, and what happened as a result.
Integration with umkactl: umkactl intent explain <cgroup> provides a
human-readable summary combining intent.status + intent.explain +
intent.adjustment_history into a single diagnostic view. Example output:
$ umkactl intent explain /prod/web
Intent: latency_p99 ≤ 5ms
Status: NOT MET (actual: 8.1ms)
Bottleneck: CPU (4/4 cores at 100%, cpu.max = 400000)
Recommendation: Increase cpu.max to 600000
Conflicting: /prod/db (cpu.weight=200, consuming 3/4 cores)
Last 3 adjustments:
[12:01:05] cpu.weight 100→150 — effect: p99 9.3ms→8.5ms
[12:01:06] io.weight 100→200 — effect: p99 8.5ms→8.3ms (not bottleneck)
[12:01:07] cpu.weight 150→200 — rejected: power budget Section 7.4 constraint
7.10.8 Integration with ML Policy Framework¶
The Intent Optimizer (Section 7.10.5) is a natural consumer of the ML Policy
framework (Section 23.1). The PD
controller's gain constants and stability parameters are tunable via
PolicyUpdateMsg, giving the ML framework control over how aggressively the
intent optimizer responds to workload changes — without replacing the optimizer
itself.
Tunable parameters exposed to the ML Policy framework:
| `param_id` | Field | Default | Bounds | Effect |
|---|---|---|---|---|
| `intent_kp_cpu_weight` | `PdControllerState.k_p` (CPU weight dim) | 500 | [50, 2000] | Proportional gain (permille) for CPU weight adjustments |
| `intent_kp_frequency` | `PdControllerState.k_p` (frequency dim) | 300 | [50, 2000] | Proportional gain (permille) for frequency adjustments |
| `intent_kd_cpu_weight` | `PdControllerState.k_d` (CPU weight dim) | 1000 | [0, 5000] | Derivative gain (permille) for CPU weight damping |
| `intent_kd_frequency` | `PdControllerState.k_d` (frequency dim) | 600 | [0, 5000] | Derivative gain (permille) for frequency damping |
| `intent_max_adjustment` | `MAX_INTENT_ADJUSTMENT` | 200 | [50, 500] | Maximum per-tick adjustment (permille of current allocation) |
| `intent_hold_time_ms` | Minimum hold time | 5000 | [1000, 30000] | Minimum milliseconds between adjustments |
| `intent_convergence_band` | Convergence criterion | 100 | [10, 300] | Permille of target within which intent is "met" |
All values are integer (permille for fractional quantities). The PD controller's
f64 fields are derived: k_p = param_value as f64 / 1000.0. This keeps
PolicyUpdateMsg integer-only while preserving sub-1% controller precision.
Why this is better than a standalone PD controller:
- **Temporal decay**: If the ML policy service crashes, the PD controller gains revert to defaults over `decay_period_ms`. A standalone controller either keeps running with stale gains or stops entirely.
- **Per-cgroup tuning**: Different workloads can receive different gain profiles via the cgroup override mechanism (Section 23.1). A latency-sensitive database gets aggressive gains; a batch job gets conservative ones.
- **Observability**: All gain changes flow through the `PolicyUpdateMsg` audit log (Section 23.1), making controller behavior fully traceable.
- **Stability guarantee**: The ML framework's bounded parameter ranges (enforced by `KernelParamStore` clamping) prevent a misbehaving policy service from setting gains that cause oscillation. The bounds in the table above are chosen so that even the extreme corners (`k_p` = 2.0, `k_d` = 0.0, `max_adjustment` = 50%) produce a stable system (damping ratio > 0.3 for all resource dimensions).
The PD controller remains the default. When no ML policy service is running, the intent optimizer uses the default gains from the table above. The ML framework is an enhancement layer — it makes the optimizer adaptive, not dependent.
7.11 Core Provisioning and Workload Partitioning¶
Inspired by: Akaros many-core OS (Berkeley, 2009–2018) — the MCP model with provisioning/allocation split. Adapted for production use on 8 architectures with Linux cgroup compatibility. IP status: Akaros was BSD-licensed academic research. The provisioning/allocation separation is a general scheduling concept; this section specifies UmkaOS's original design.
7.11.1 Problem¶
Modern workloads span a wide spectrum of CPU isolation needs. A latency-sensitive database engine needs dedicated cores with no OS noise — timer ticks, RCU callbacks, and workqueue items cause measurable tail-latency spikes. A parallel HPC job needs all N cores to start simultaneously (gang scheduling) or the entire wavefront stalls. A batch analytics pipeline needs cores only when available and should yield them instantly when the latency-sensitive workload reclaims.
Linux provides isolcpus (boot-time, static), cpuset.cpus (cgroup pinning but no
noise elimination), and nohz_full (per-core tickless, boot-time). None of these
supports dynamic provisioning, backfill, or gang allocation. Combining them requires
error-prone manual coordination across boot parameters, cgroup configuration, and
IRQ affinity masks.
UmkaOS unifies these capabilities into a single cgroup-aware provisioning model with three core classes, dynamic reconfiguration, and a backfill protocol that eliminates wasted cycles on dedicated cores.
Cross-references: EEVDF scheduler (Section 7.1), CBS bandwidth guarantees (Section 7.6), intent-based management (Section 7.10), ML policy integration (Section 23.1), cgroup namespacing (Section 17.2).
7.11.2 Core Classes¶
Every online CPU core is classified into exactly one of three classes at any given time. The classification is dynamic — a core's class can change at runtime via the cgroup provisioning interface.
// umka-core/src/sched/provision.rs
/// Classification of a CPU core's scheduling mode.
///
/// Each online core belongs to exactly one class. The class determines what
/// kernel services run on that core and how the scheduler treats it.
///
/// Transitions are: LL → CG (on provision), CG → Backfill (on provisionee
/// idle), Backfill → CG (on provisionee reclaim), CG → LL (on deprovision).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum CoreClass {
/// Low-Latency: default class. Time-shared via EEVDF (Section 7.1.2).
/// Full kernel services: timer tick, RCU callbacks, workqueue items,
/// softirqs, hardware interrupts. Standard Linux scheduling behavior.
LL = 0,
/// Coarse-Grained: dedicated to a single cgroup. No timer tick
/// (tickless, equivalent to `nohz_full`). RCU callbacks offloaded to
/// LL cores (equivalent to `rcu_nocbs`). No workqueue items. No softirqs.
/// Only IPIs (for cross-core coordination and backfill reclaim) and page
/// faults (cannot be disabled — application needs memory) are handled.
/// OS noise target: <1μs per second.
CG = 1,
/// Backfill: transient state for a CG core whose provisionee is idle.
/// Available for batch work from the backfill queue, but reclaimable
/// within 10μs via IPI preemption. Transitions back to CG when the
/// provisionee reclaims.
Backfill = 2,
}
OS noise elimination on CG cores — kernel services disabled vs. retained:
| Component | CG Core Behavior | Rationale |
|---|---|---|
| Timer tick | Disabled (tickless, `nohz_full` equivalent) | Eliminates periodic interrupts |
| RCU callbacks | Offloaded to LL cores (`rcu_nocbs` equivalent) | No RCU processing on CG |
| Workqueue items | Not scheduled on CG cores | Dispatched to LL cores only |
| Softirqs | Deferred to LL core via IPI bounce | No network/timer softirqs on CG |
| Page faults | Handled locally | Cannot disable — application needs memory |
| IPIs | Received | Cross-core coordination, backfill reclaim |
| Hardware interrupts | Affinity-masked away from CG cores | Only IPIs remain |
Per-architecture implementation of CG noise elimination:
| Arch | Timer Disable | RCU Offload | IRQ Affinity |
|---|---|---|---|
| x86-64 | APIC timer one-shot, no periodic tick | Per-CPU `rcu_nocbs` flag | IOAPIC destination mask |
| AArch64 | `CNTP_CTL_EL0.IMASK` bit set | Same flag | GIC SPI affinity registers |
| ARMv7 | Cortex-A timer IMASK | Same flag | GIC-400 ITARGETSR |
| RISC-V | SBI timer not re-armed | Same flag | PLIC enable bitmap |
| PPC32 | Decrementer not re-armed | Same flag | OpenPIC destination mask |
| PPC64LE | Decrementer not re-armed | Same flag | XIVE EQ target |
| s390x | Clock comparator not re-armed | Same flag | SIGP cpu mask (directed interrupts) |
| LoongArch64 | `CSR.TCFG` timer disabled | Same flag | EIOINTC routing vector |
The mechanism differs per architecture, but the guarantee is identical: no involuntary kernel entry on CG cores except page faults and IPIs.
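The disabled-vs-retained split in the table above can be modeled as a pure function of core class. The following is a minimal, self-contained sketch — the `CoreServices` struct and `services_for` function are illustrative stand-ins, not kernel API; the Backfill row follows the "temporarily behaves like an LL core" rule from the backfill protocol (Section 7.11.5), and hardware-IRQ routing is assumed to stay masked during backfill since re-routing affinity is costly:

```rust
/// Simplified stand-in for the kernel's CoreClass (Section 7.11.2).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CoreClass { LL, CG, Backfill }

/// Which kernel services are active on a core of a given class.
/// Hypothetical helper mirroring the CG-noise table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CoreServices {
    pub timer_tick: bool, // periodic tick enabled
    pub rcu_local: bool,  // RCU callbacks processed locally
    pub workqueue: bool,  // workqueue items dispatched here
    pub softirq: bool,    // softirqs processed locally
    pub hw_irq: bool,     // non-IPI hardware interrupts routed here
}

pub fn services_for(class: CoreClass) -> CoreServices {
    match class {
        // LL: full kernel services (standard Linux-like behavior).
        CoreClass::LL => CoreServices {
            timer_tick: true, rcu_local: true, workqueue: true,
            softirq: true, hw_irq: true,
        },
        // CG: everything off; only page faults and IPIs (not modeled here)
        // cause kernel entry.
        CoreClass::CG => CoreServices {
            timer_tick: false, rcu_local: false, workqueue: false,
            softirq: false, hw_irq: false,
        },
        // Backfill: behaves like LL for the backfill thread, but (assumed)
        // hardware IRQ affinity remains masked away from the core.
        CoreClass::Backfill => CoreServices {
            timer_tick: true, rcu_local: true, workqueue: true,
            softirq: true, hw_irq: false,
        },
    }
}

fn main() {
    // A CG core takes no tick, no local RCU, no workqueue, no softirq.
    assert!(!services_for(CoreClass::CG).timer_tick);
    println!("CG services: {:?}", services_for(CoreClass::CG));
}
```

The point of the table-as-function view: class transitions (CG → Backfill → CG) reduce to applying the service delta between two rows, which is what the backfill dispatch and CG-restoration steps in Section 7.11.5 do.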
7.11.3 Provisioning and Allocation¶
Provisioning and allocation are distinct operations, inspired by Akaros's separation of promise (provisioning) from grant (allocation):
- Provisioning: The kernel records that a cgroup is entitled to N cores of a given class. The cores are not yet assigned — this is a capacity reservation.
- Allocation: The kernel selects specific cores from the provisioned pool and assigns them to the cgroup. The cgroup's threads begin running on these cores.
This separation allows the kernel to promise capacity before threads are ready to use it, enables gang allocation (atomically allocating all N cores), and supports backfill of idle provisioned cores.
// umka-core/src/sched/provision.rs
/// A provisioning record: the kernel's promise that a cgroup can obtain
/// up to `core_count` cores of the specified class.
///
/// Provisioning is requested via the `cpu.provision_count` cgroup knob.
/// The kernel validates that the system has enough free cores to honor the
/// provision before accepting it.
// Kernel-internal struct. Not KABI. Not wire.
#[repr(C)]
pub struct CoreProvision {
/// Owning cgroup's unique identifier.
pub cgroup_id: u64,
/// Number of cores provisioned for this cgroup.
pub core_count: u16,
/// Core class: LL or CG. LL provisioning reserves cores in the EEVDF
/// pool (guarantees capacity). CG provisioning dedicates cores exclusively.
pub core_class: CoreClass,
/// If true, cores are exclusively reserved — no backfill when idle.
/// Only allowed when the system has at least 2× the provisioned count
/// in free LL cores (prevents over-reservation that starves the system).
pub provision_hard: bool,
/// If true, all provisioned cores must be allocated atomically
/// (all-or-none). Used for parallel workloads where partial allocation
/// is worse than no allocation (e.g., MPI jobs, GPU compute pipelines).
pub gang_mode: bool,
/// Explicit padding: 3 bytes between gang_mode (offset 12) and
/// gang_timeout_ms (offset 16, u32 requires 4-byte alignment).
pub _pad1: [u8; 3],
/// Timeout for gang allocation in milliseconds. If the scheduler cannot
/// allocate all N cores within this duration, the behavior depends on
/// gang_partial_ok (future extension). Default: 100ms.
pub gang_timeout_ms: u32,
/// Priority for backfill work on idle CG cores belonging to this
/// provision. Lower values are evicted first when the provisionee
/// reclaims its cores. Default: 0. Range: -32768 to 32767.
pub backfill_priority: i16,
/// Padding for alignment. Zero-initialized, must be zero.
pub _pad: [u8; 2],
}
const_assert!(size_of::<CoreProvision>() == 24);
/// The runtime state of an allocation derived from a CoreProvision.
pub struct CoreAllocation {
/// Back-reference to the provisioning record.
///
/// # Safety
///
/// The `CoreProvision` is owned by the cgroup's `CpuController`
/// extension. It outlives all `CoreAllocation`s that reference it,
/// because the cgroup teardown path releases all allocations
/// (returning cores to the global pool) before freeing the
/// provision. The pointer is valid for the entire `CoreAllocation`
/// lifecycle (Provisioned -> Active -> Released).
pub provision: *const CoreProvision,
/// Currently granted core IDs. The array is populated from index 0.
/// `allocated_count` indicates how many entries are valid.
pub allocated_cores: [CpuId; MAX_GANG_STACK_SIZE],
/// Number of valid entries in `allocated_cores`.
pub allocated_count: u16,
/// Current allocation state.
pub state: CoreAllocState,
}
/// Maximum number of cores in a single gang allocation's stack-allocated array.
/// The effective limit is `min(online_cpus(), 256)` — discovered at boot. The
/// 256 upper bound sizes the per-request stack allocation (~2 KB for `[CpuId; 256]`),
/// fitting comfortably within the 16 KB kernel stack. Systems with >256 cores
/// use heap-allocated gang arrays via `Box<[CpuId]>`. At boot,
/// `init_gang_allocator()` reads the online CPU count and selects stack or heap
/// backing accordingly. The constant is compile-time-fixed (not runtime tunable)
/// because it determines the stack frame layout.
pub const MAX_GANG_STACK_SIZE: usize = 256;
/// State machine for a core allocation.
///
/// Transitions:
/// Idle → Provisioned (provision accepted)
/// Provisioned → Allocated (cores granted)
/// Allocated → Backfilling (provisionee idle, backfill work assigned)
/// Backfilling → Reclaiming (provisionee wakes, reschedule IPI sent)
/// Reclaiming → Allocated (IPI acknowledged, provisionee running)
/// Allocated → Provisioned (cores released, provision retained)
/// Provisioned → Idle (provision revoked)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum CoreAllocState {
/// No provisioning active for this cgroup.
Idle = 0,
/// Cores are provisioned (reserved) but not yet allocated.
Provisioned = 1,
/// Cores are allocated and running the provisionee's threads.
Allocated = 2,
/// Provisionee is idle; cores are running backfill work.
Backfilling = 3,
/// Provisionee is reclaiming cores from backfill state.
/// Transient: lasts only until reschedule IPI is acknowledged (~2-10μs).
Reclaiming = 4,
}
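The allocation state machine can be checked with a small table-driven predicate. This is a hedged sketch — `is_valid_transition` is illustrative, not a kernel function — covering the transitions listed above plus the transient Reclaiming hop used by the backfill protocol (Section 7.11.5):

```rust
/// Simplified stand-in for the kernel's CoreAllocState.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CoreAllocState { Idle, Provisioned, Allocated, Backfilling, Reclaiming }

/// Returns true iff `from → to` is a legal edge in the allocation
/// state machine. Illustrative helper, not kernel API.
pub fn is_valid_transition(from: CoreAllocState, to: CoreAllocState) -> bool {
    use CoreAllocState::*;
    matches!(
        (from, to),
        (Idle, Provisioned)          // provision accepted
        | (Provisioned, Allocated)   // cores granted
        | (Allocated, Backfilling)   // provisionee idle, backfill assigned
        | (Backfilling, Reclaiming)  // provisionee wakes, resched IPI sent
        | (Reclaiming, Allocated)    // IPI acknowledged, provisionee running
        | (Allocated, Provisioned)   // cores released, provision retained
        | (Provisioned, Idle)        // provision revoked
    )
}

fn main() {
    // Walk the full backfill round-trip: every hop must be legal.
    use CoreAllocState::*;
    let path = [Idle, Provisioned, Allocated, Backfilling, Reclaiming, Allocated];
    assert!(path.windows(2).all(|w| is_valid_transition(w[0], w[1])));
    // Skipping Provisioned is not allowed.
    assert!(!is_valid_transition(Idle, Allocated));
}
```

A predicate like this is the kind of invariant the kernel can assert in debug builds at every state store, catching protocol bugs (e.g., a reclaim path that forgets the Reclaiming hop) immediately.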
7.11.4 Cgroup Interface¶
New control files under the cpu controller, additive to the existing interface
(Section 7.6):
/sys/fs/cgroup/<group>/cpu.provision_count
# Number of cores to provision for this cgroup.
# 0 = no provisioning (default). The cgroup uses standard EEVDF scheduling.
# N > 0 = request N cores. The kernel validates that sufficient free cores
# exist before accepting the write. Returns -ENOSPC if the system cannot
# honor the request.
/sys/fs/cgroup/<group>/cpu.core_class
# Core class for provisioned cores.
# "ll" (default) = Low-Latency. Provisioned LL cores remain in the EEVDF
# pool but are capacity-reserved for this cgroup.
# "cg" = Coarse-Grained. Provisioned cores are dedicated exclusively.
# OS noise is eliminated (see the Section 7.11.2 table).
/sys/fs/cgroup/<group>/cpu.provision_hard
# "0" (default) = soft provisioning. Idle CG cores accept backfill work.
# "1" = hard provisioning. Idle CG cores remain idle (no backfill).
# Only accepted when the system has >= 2x the provisioned count in free
# LL cores. Returns -ENOSPC otherwise.
/sys/fs/cgroup/<group>/cpu.gang_mode
# "0" (default) = cores allocated individually as they become available.
# "1" = gang mode. All provisioned cores are allocated atomically.
# Allocation blocks up to cpu.gang_timeout_ms. If the gang cannot be
# formed within the timeout, the write to cpu.provision_count returns
# -EAGAIN.
/sys/fs/cgroup/<group>/cpu.gang_timeout_ms
# Timeout for gang allocation in milliseconds. Default: 100.
# Only meaningful when cpu.gang_mode = 1.
/sys/fs/cgroup/<group>/cpu.allocated_cores
# (read-only) Space-separated list of currently allocated core IDs.
# Empty if no cores are allocated.
# Example: "4 5 12 13"
/sys/fs/cgroup/<group>/cpu.provision_status
# (read-only) Current provisioning state.
# Values: "none", "provisioned", "allocated", "backfilling"
Capability requirements:
- Writing cpu.provision_count requires CAP_SYS_NICE (same capability used
for scheduler policy changes). Without it, writes return -EPERM.
- Writing cpu.core_class or cpu.provision_hard requires CAP_SYS_ADMIN
(these affect system-wide core allocation policy). Without it, writes return -EPERM.
- cpu.gang_mode and cpu.gang_timeout_ms require CAP_SYS_NICE.
- Inside a user namespace, these capabilities apply to the owning user namespace
(matching cgroup delegation rules in Section 17.2.6).
Validation rules:
- cpu.provision_count must not exceed the number of online cores minus a minimum
LL reserve (default: 2 cores, configurable via sysctl kernel.min_ll_cores).
- The sum of all cpu.provision_count values across all cgroups must not exceed
online_cores - min_ll_cores. Returns -ENOSPC on overcommit.
- cpu.core_class = cg requires cpu.provision_count > 0. Setting core_class
to cg without provisioning returns -EINVAL.
- Nested cgroups: a child's provision comes out of its parent's provision budget,
analogous to cpu.guarantee nesting (Section 7.6).
7.11.5 Backfill Protocol¶
When a provisionee's threads are all idle on a CG core, the core is wasted unless repurposed. The backfill protocol allows batch work to use idle CG cores while guaranteeing that the provisionee can reclaim them within 10μs.
Protocol steps:
1. Idle detection: The last runnable thread of the provisionee on a CG core enters sleep (blocks on I/O, futex, etc.). The per-CPU idle handler detects that no provisionee threads remain runnable. State transition: Allocated → Backfilling.
2. Backfill dispatch: The scheduler checks the global backfill queue (a priority queue sorted by `CoreProvision.backfill_priority`, lower = evicted first). If batch work is available, the scheduler performs a full context switch to the highest-priority backfill thread. The CG core temporarily behaves like an LL core for the backfill thread (timer tick re-enabled, RCU callbacks local).
3. Reclaim trigger: The provisionee's thread becomes runnable (woken by I/O completion, futex wake, signal). The `try_to_wake_up` path detects the target core is backfilling (via `CpuLocal::core_class == Backfill`) and sends a standard reschedule IPI — the same IPI used by EEVDF load balancing (Section 7.1). State transition: Backfilling → Reclaiming.
4. Reclaim execution: The backfill core receives the reschedule IPI. The handler sets `need_resched` on the backfill CPU's `CpuLocalBlock` (standard mechanism, Section 3.2). On interrupt return, `schedule()` runs `pick_next_task()`, which re-applies the cgroup filter and selects the provisionee's thread. State transition: Reclaiming → Allocated.
5. Backfill thread disposition: The preempted backfill thread is migrated to an LL core by the EEVDF load balancer (standard migration path). If it was in a syscall, the syscall either completes (if near completion) or is interrupted with `EINTR` — standard preemption semantics, no special handling.
6. CG restoration: Once the provisionee's thread is running, the core reverts to CG mode: timer tick disabled, RCU callbacks offloaded, workqueue items removed from the core's queue.
Provisionee CG Core State Backfill Thread
─────────── ────────────── ───────────────
Running ──→ Allocated (not present)
All idle ──→ Backfilling ←── Scheduled
Wakes up ──→ Reclaiming ──→ Preempted (resched IPI)
Running ──→ Allocated Migrated to LL
Timing guarantees:
- Reschedule IPI delivery: <1μs on all architectures (IPI is the highest-priority interrupt).
- Context switch on IPI return: 1-4μs (architecture-dependent, see table below).
- Total reclaim latency (provisionee wake → provisionee running): ≤10μs.
| Arch | IPI Delivery | Context Switch | Total Reclaim |
|---|---|---|---|
| x86-64 | ~0.5μs | ~1.5-3μs | ~2-5μs |
| AArch64 | ~1μs | ~3-6μs | ~5-8μs |
| ARMv7 | ~1μs | ~3-5μs | ~4-7μs |
| RISC-V | ~2μs | ~3-6μs | ~5-9μs |
| PPC32 | ~1μs | ~3-5μs | ~4-7μs |
| PPC64LE | ~1μs | ~2-4μs | ~3-6μs |
7.11.6 Gang Scheduling Protocol¶
Gang scheduling ensures that a parallel workload's threads all start simultaneously. Without gang scheduling, threads that start at different times waste cycles spinning on barriers or synchronization primitives, waiting for the last thread to be scheduled.
Protocol:
1. A cgroup with `cpu.gang_mode = 1` writes a value N to `cpu.provision_count`.
2. The provisioning subsystem reserves N cores (validating availability).
3. When the cgroup's threads become runnable, the scheduler attempts to allocate all N cores from the provisioned set simultaneously.
4. If N cores are available: all are allocated atomically. Each core receives an IPI directing it to schedule the cgroup's thread. All threads begin executing within one IPI round-trip (~1-5μs skew within a single NUMA domain; cross-NUMA adds 5-50μs depending on topology).
5. If fewer than N cores are available: the scheduler waits up to `gang_timeout_ms` (default 100ms, tunable via `/sys/fs/cgroup/<group>/cpu.gang_timeout_ms`). 100ms allows time for a tier switch or driver reload to complete (~50-150ms typical) while avoiding indefinite blocking of system suspend or workload rescheduling. During the wait, the cgroup's threads remain in `TASK_UNINTERRUPTIBLE` state (not consuming CPU).
6. On timeout: allocation fails. `cpu.provision_status` reads `provisioned` (not `allocated`). The kernel logs `KERN_INFO` with the shortfall count. User space can retry, reduce `cpu.provision_count`, or fall back to non-gang scheduling.
7. Allocated gang cores are guaranteed until explicitly released (the cgroup writes `cpu.provision_count = 0` or is destroyed). The scheduler does not preempt gang cores — they run until the provisionee voluntarily yields or exits.
NUMA-aware gang allocation: The scheduler prefers allocating all gang cores
from the same NUMA node. If the preferred node has insufficient free cores, the
scheduler spills to the nearest NUMA node (lowest NumaTopology::distance()). A
cross-NUMA gang emits tracepoint umka_tp_stable_sched_gang_cross_numa to alert
operators. For strict NUMA locality, set ProvisionNumaHint.allow_cross_numa = false
— gang allocation will fail with -ENOSPC rather than cross NUMA boundaries.
Gang teardown: When a gang-allocated cgroup writes cpu.provision_count = 0 or
is destroyed, all allocated cores are simultaneously released. Each core receives a
standard reschedule IPI, transitions to LL class, and re-enables full kernel services
(timer tick, RCU, workqueue). Teardown completes within 1ms (IPI fan-out +
per-core re-initialization).
7.11.7 Integration with Existing Scheduler¶
The provisioning subsystem is an extension of EEVDF, not a parallel path. CG cores use the same scheduler infrastructure with modified parameters — they don't bypass the scheduler.
CG cores as an EEVDF scheduling class:
CG cores run EEVDF like LL cores, but with a restricted candidate set:
- A CG core's runqueue accepts only tasks from the owning cgroup (and backfill tasks, at lowest priority). This is enforced by the runqueue's `cgroup_filter: Option<CgroupId>` field — when set, `pick_next_task()` skips tasks not matching the filter.
- Within the owning cgroup, EEVDF operates normally: virtual deadline ordering, fair share, preemption on eligibility. If the cgroup has 8 threads on 4 CG cores, EEVDF schedules them fairly across those 4 cores.
- When all cgroup tasks are blocked, the cgroup filter temporarily opens to admit backfill tasks (state → Backfilling). On provisionee wake, the filter re-closes and backfill tasks are preempted.
This means: no separate scheduling path for CG cores. The difference is configuration (restricted runqueue, disabled tick, offloaded RCU), not a different algorithm. The existing EEVDF load balancer, priority handling, and preemption logic all work unchanged on CG cores.
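The cgroup-filter restriction can be sketched as a filtered pick. The types below are simplified stand-ins — real EEVDF orders eligible tasks by virtual deadline in a red-black tree, and the filter field belongs to the per-CPU runqueue:

```rust
/// Simplified task record: just what the filter and ordering need.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Task {
    pub cgroup: u64,    // owning cgroup id
    pub vdeadline: u64, // EEVDF virtual deadline (lower = sooner)
    pub backfill: bool, // true if this is batch backfill work
}

/// Pick the next task, honoring the cgroup filter. When the filter is
/// closed (Some), only the owner's non-backfill tasks are candidates;
/// when open (None, Backfilling state), backfill tasks are admitted.
pub fn pick_next(runqueue: &[Task], cgroup_filter: Option<u64>) -> Option<Task> {
    runqueue
        .iter()
        .filter(|t| match cgroup_filter {
            Some(cg) => t.cgroup == cg && !t.backfill, // owner only
            None => true,                              // backfill admitted
        })
        // EEVDF ordering simplified to "earliest virtual deadline".
        .min_by_key(|t| t.vdeadline)
        .copied()
}

fn main() {
    let rq = [
        Task { cgroup: 7, vdeadline: 100, backfill: false },
        Task { cgroup: 9, vdeadline: 50, backfill: true },
    ];
    // Filter closed on cgroup 7: the earlier-deadline backfill task is skipped.
    assert_eq!(pick_next(&rq, Some(7)).unwrap().cgroup, 7);
    // Filter open: the backfill task wins on deadline.
    assert!(pick_next(&rq, None).unwrap().backfill);
}
```

Note how reclaim falls out for free: re-closing the filter and re-running the pick is all that `schedule()` has to do after the reschedule IPI.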
Backfill preemption via standard mechanism:
Backfill reclaim uses the existing CpuLocalBlock.need_resched flag
(Section 3.2), not a custom IPI protocol.
When a provisionee task wakes on a backfilling core:
1. The waking path (`try_to_wake_up`) detects that the target core is backfilling (core class = Backfill in `CpuLocal` state).
2. It sends a standard reschedule IPI to the core (the same IPI used by EEVDF load balancing for migration, Section 7.1).
3. The IPI handler stores `true` to `need_resched` (`.store(true, Relaxed)`).
4. On interrupt return, `schedule()` runs the EEVDF `pick_next_task()`, which selects the provisionee task (cgroup filter re-applied) over the backfill task.
5. The backfill task is migrated to an LL core by the load balancer.

No new preemption mechanism is introduced. The only new element is the `core_class == Backfill` check in `try_to_wake_up` to prioritize the IPI.
NUMA-aware provisioning:
Provisioned cores are NUMA-aware. The provisioning subsystem:
- Prefers cores on the same NUMA node as the cgroup's memory allocations (determined by the cgroup's `cpuset.mems` or auto-detected from the first allocation's NUMA node).
- Gang allocations are NUMA-local when possible: all N cores come from the same NUMA node. If one node has insufficient free cores, allocation spills to the nearest NUMA node (lowest distance in `NumaTopology::distance()`).
- Cross-NUMA gang allocation is logged as a tracepoint (`umka_tp_stable_sched_gang_cross_numa`) since it may incur NUMA penalties. The operator can increase `cpu.provision_count` or reduce the gang size.
/// NUMA preference for core provisioning.
pub struct ProvisionNumaHint {
/// Preferred NUMA node. Auto-detected from cgroup memory or explicit.
pub preferred_node: NumaNodeId,
/// Allow cross-NUMA allocation if preferred node is full.
/// Default: true. Set to false for strict NUMA locality.
pub allow_cross_numa: bool,
}
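Node selection with this hint might look like the following sketch. The free-core map and the `distance` closure are illustrative stand-ins for `NumaTopology`; the function names are hypothetical:

```rust
/// Mirror of the hint struct above, with NumaNodeId simplified to usize.
pub struct ProvisionNumaHint {
    pub preferred_node: usize,
    pub allow_cross_numa: bool,
}

/// Choose a NUMA node for an allocation of `needed` cores:
/// preferred node first, else the nearest node with capacity,
/// unless strict locality forbids crossing NUMA boundaries.
pub fn choose_node(
    hint: &ProvisionNumaHint,
    free_per_node: &[u16],                  // free core count per node
    distance: &dyn Fn(usize, usize) -> u32, // NumaTopology::distance() stand-in
    needed: u16,
) -> Option<usize> {
    if free_per_node[hint.preferred_node] >= needed {
        return Some(hint.preferred_node);
    }
    if !hint.allow_cross_numa {
        // Strict locality: fail (-ENOSPC) rather than cross NUMA.
        return None;
    }
    // Spill to the nearest other node with capacity.
    (0..free_per_node.len())
        .filter(|&n| n != hint.preferred_node && free_per_node[n] >= needed)
        .min_by_key(|&n| distance(hint.preferred_node, n))
}

fn main() {
    // Two nodes: node 0 (preferred) has 2 free cores, node 1 has 8.
    let dist = |a: usize, b: usize| -> u32 { if a == b { 10 } else { 20 } };
    let hint = ProvisionNumaHint { preferred_node: 0, allow_cross_numa: true };
    assert_eq!(choose_node(&hint, &[2, 8], &dist, 4), Some(1)); // spills
    let strict = ProvisionNumaHint { preferred_node: 0, allow_cross_numa: false };
    assert_eq!(choose_node(&strict, &[2, 8], &dist, 4), None);  // -ENOSPC
}
```

A caller that gets `None` here would surface `-ENOSPC` to the `cpu.provision_count` write, matching the strict-locality behavior described above.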
CBS bandwidth (Section 7.6): Does not apply to
CG cores. A CG core has 100% bandwidth by definition — it runs only one cgroup's
threads. The CBS server for the cgroup, if configured, applies only to LL cores
where the cgroup's threads may also run (if cpu.provision_count < the cgroup's
thread count, excess threads run on LL cores under EEVDF+CBS).
Power management (Section 7.7): CG cores participate in power budgeting. If the system hits a RAPL or thermal limit, CG cores are frequency-throttled (not descheduled). The provisionee retains its cores but at reduced frequency. This is preferable to losing cores entirely, which would break gang scheduling invariants.
Intent-based management (Section 7.10):
The intent.latency_ns target maps to core class recommendations:
- latency_ns < 100_000 (< 100μs): recommend CG provisioning.
- 100_000 ≤ latency_ns < 10_000_000 (100μs–10ms): recommend LL with CBS guarantee.
- latency_ns ≥ 10_000_000 or latency_ns = 0: no provisioning recommendation.
The intent optimizer (Section 7.10) can automatically
write cpu.provision_count and cpu.core_class based on observed latency, subject
to the same validation rules as manual writes.
ML policy (Section 23.1): The ML policy
framework receives provisioning state as an observation channel: current core class
per CPU, allocation states, backfill utilization, reclaim latency histograms. The
policy can recommend provisioning changes (e.g., "workload X's backfill utilization
is 0% — switch from soft to hard provisioning to avoid backfill overhead") via
PolicyUpdateMsg. These recommendations flow through the intent optimizer, not
directly to the provisioning subsystem.
7.11.8 Relationship to Linux cpuset/isolcpus¶
UmkaOS provisioning is a strict superset of Linux's core isolation mechanisms. The following table maps Linux concepts to UmkaOS equivalents:
| Linux Mechanism | UmkaOS Equivalent | Difference |
|---|---|---|
| `isolcpus=2,3` (boot param) | `cpu.core_class = cg` + `cpu.provision_count = 2` | Dynamic, not boot-time |
| `nohz_full=2,3` (boot param) | Automatic on CG cores | Per-core, dynamic |
| `rcu_nocbs=2,3` (boot param) | Automatic on CG cores | Per-core, dynamic |
| `cpuset.cpus = 2,3` | `cpu.allocated_cores` (read-only output) | Provisioning is separate from pinning |
| `irqaffinity=0,1` (boot param) | Automatic IRQ masking on CG cores | Per-core, dynamic |
| No equivalent | Backfill protocol | Idle CG cores are not wasted |
| No equivalent | Gang scheduling | Atomic multi-core allocation |
| No equivalent | Intent-driven provisioning | Automatic core class selection |
Linux compatibility: cpuset.cpus continues to work as in Linux. Writing to
cpuset.cpus on a cgroup that also has cpu.provision_count > 0 is rejected with
-EBUSY — the two mechanisms are mutually exclusive. cpu.provision_count is an
UmkaOS extension; Linux applications that do not use it experience standard EEVDF
scheduling on LL cores, identical to Linux behavior.
7.11.9 Performance Bounds¶
All performance targets are measured on the specified reference hardware. Actual values may differ on other platforms but must remain within the stated bounds for certified hardware.
| Operation | Bound | Typical (x86-64) | Typical (AArch64) | Notes / Reference Hardware |
|---|---|---|---|---|
| Backfill reclaim (provisionee wake → running) | ≤ 10μs | 2-5μs | 5-8μs | AMD EPYC 9004 / Ampere Altra |
| Gang allocation (N ≤ 16 cores) | ≤ 100ms timeout | < 1ms | < 2ms | Lightly loaded system |
| Provisioning change (add/remove cores) | ≤ 1ms | ~200μs | ~400μs | Bitmap update + IPI fan-out |
| CG→LL transition (deprovision) | ≤ 1ms | ~300μs | ~500μs | Re-enable tick + RCU + workqueue |
| OS noise on CG core | < 1μs/sec | ~0.3μs/sec | ~0.5μs/sec | FTQ benchmark (Akaros methodology) |
OS noise measurement methodology: OS noise is measured using the Fixed Time Quantum (FTQ) benchmark, which measures the number of iterations of a tight loop per fixed time quantum. Deviations from the expected count indicate kernel interference. The target of <1μs per second means that over a 1-second measurement window, the total time lost to kernel interference (page faults excluded, as they are application-triggered) is less than 1 microsecond. This matches or exceeds the best published results from Akaros FTQ benchmarks on comparable hardware.