Chapter 7: Scheduling and Power Management¶
EEVDF, RT, deadline scheduling, per-CPU runqueues, EAS, power budgeting, CPU bandwidth, timekeeping
EEVDF (Earliest Eligible Virtual Deadline First) is the default scheduler class. Real-time (FIFO/RR) and deadline classes are fully supported. Per-CPU runqueues eliminate global contention. Energy-Aware Scheduling (EAS) drives power management on heterogeneous platforms. All scheduling policy is replaceable via live kernel evolution.
7.1 Scheduler¶
7.1.1 Multi-Policy Design¶
The scheduler supports three scheduling policies simultaneously:
| Policy | Algorithm | Use case | Priority range |
|---|---|---|---|
| Normal | EEVDF | General-purpose workloads (Section 7.1) | Nice -20 to 19 |
| Real-Time | FIFO / RR | Latency-sensitive applications | RT 1-99 |
| Deadline | EDF (CBS) | Guaranteed CPU time (audio, etc.) | Runtime/Period |
Live evolution: Scheduler policy is replaceable via the SchedPolicy trait as an
EvolvableComponent (Section 13.18). Runqueue data structures (EEVDF
red-black tree, DL intrusive RB tree, RT bitmap) are non-replaceable verified data — only the
pick/enqueue/dequeue policy logic is swappable. This means a live kernel evolution cycle
can replace the scheduling algorithm (e.g., swap EEVDF for a future policy) without
draining or rebuilding the runqueue. The replacement module imports the existing runqueue
state via EvolvableComponent::import_state() and resumes scheduling immediately. See
also the SchedPolicy trait definition in Section 19.9 and the
SchedClassOps trait documentation in Section 7.1 below.
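The swap flow can be sketched with plain trait objects. This is a toy illustration only — the real SchedPolicy and EvolvableComponent traits are defined in Sections 19.9 and 13.18; every name and signature below is an assumption for illustration, not the kernel API:

```rust
/// Toy stand-in for the SchedPolicy trait: picks the next entity index from
/// runqueue state that it does NOT own. (Illustrative only — not the real
/// UmkaOS trait.)
trait SchedPolicy {
    fn pick_next(&self, keys: &[u64]) -> Option<usize>;
}

/// EEVDF-flavoured stand-in: pick the smallest key (earliest deadline).
struct EarliestDeadline;
impl SchedPolicy for EarliestDeadline {
    fn pick_next(&self, keys: &[u64]) -> Option<usize> {
        keys.iter().enumerate().min_by_key(|&(_, k)| *k).map(|(i, _)| i)
    }
}

/// A hypothetical replacement policy installed by a live evolution cycle.
struct LatestDeadline;
impl SchedPolicy for LatestDeadline {
    fn pick_next(&self, keys: &[u64]) -> Option<usize> {
        keys.iter().enumerate().max_by_key(|&(_, k)| *k).map(|(i, _)| i)
    }
}

/// Swap the policy object in place. The runqueue state (`keys`) is untouched:
/// no drain, no rebuild — mirroring the import_state() handover in the text.
fn swap_policy(slot: &mut Box<dyn SchedPolicy>, new: Box<dyn SchedPolicy>) {
    *slot = new;
}
```

The point of the sketch is the separation of concerns: the data the policy operates on outlives the policy object, so replacing the policy is a pointer swap rather than a queue rebuild.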
7.1.2 Architecture¶
```
            Global Load Balancer
             (runs every ~4ms)
                    |
      +-------------+-------------+
      |             |             |
    CPU 0         CPU 1         CPU N
 +----------+  +----------+  +----------+
 | RT Queue |  | RT Queue |  | RT Queue |  <- Highest priority
 +----------+  +----------+  +----------+
 | DL Queue |  | DL Queue |  | DL Queue |  <- Deadline tasks
 +----------+  +----------+  +----------+
 |EEVDF Tree|  |EEVDF Tree|  |EEVDF Tree|  <- Normal tasks (red-black tree, EEVDF)
 +----------+  +----------+  +----------+
```
7.1.2.1 EEVDF Algorithm Specification¶
Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules.
`&self` methods use interior mutability for mutation. Atomic fields use `.store()`/`.load()`. See CLAUDE.md Spec Pseudocode Quality Gates.
The EEVDF (Earliest Eligible Virtual Deadline First) scheduler is the primary
scheduling algorithm for normal (non-RT, non-deadline) tasks. This section
specifies the complete algorithm, matching Linux 6.12+ EEVDF mathematical
semantics exactly for the five core functions (avg_vruntime, entity_eligible,
pick_eevdf, update_entity_lag, place_entity), while using UmkaOS-native
integration (CpuLocal, Evolvable, ML policy).
Reference: Stoica & Abdel-Wahab, "Earliest Eligible Virtual Deadline First:
A Flexible and Accurate Mechanism for Proportional Share Resource Allocation",
TR-95-22, Old Dominion University, 1995. Linux implementation:
kernel/sched/fair.c (v6.12+, Peter Zijlstra).
Evolvable component classification. The EEVDF scheduler is split into non-replaceable (Nucleus) and replaceable (Evolvable) components per Section 13.18:
| Component | Classification | Rationale |
|---|---|---|
| `VruntimeTree` data structure | Nucleus (data) | Shared tree + accumulator layout; embedded by `EevdfRunQueue`, `CbsCpuServer`, `GroupEntity` |
| `EevdfRunQueue` data structure | Nucleus (data) | Wraps `VruntimeTree` + root-only fields (`curr`, `next`, `bandwidth_timer`); `import_state` operates on these fields |
| `EevdfTask` scheduling fields | Nucleus (data) | Task state layout must survive policy replacement |
| RB-tree node layout | Nucleus (data) | Data structure layout integrity; tree operations are Evolvable code |
| `VruntimeTree` accumulators (`sum_w_vruntime`, `sum_weight`, `zero_vruntime`) | Nucleus (data + invariant) | Accumulator integrity invariants (division-free eligibility, overflow bounds) are checked by `EevdfInvariantChecker`. The accumulators themselves are Nucleus data (part of `VruntimeTree`); the code that updates them is Evolvable. |
| `avg_vruntime()` computation | Evolvable | Formula code, not data. Correctness ensured by Nucleus invariant checker (`EevdfInvariantChecker`) before swap is committed. Making formulas Nucleus prevents fixing a math bug without reboot — worse for 50-year correctness. |
| `entity_eligible()` check | Evolvable | Formula code operating on Nucleus data. Invariant checker validates division-free semantics consistency with `avg_vruntime`. |
| `update_entity_lag()` | Evolvable | Vlag clamping formula. Invariant checker validates clamping bounds and monotonicity. |
| `update_curr()` vruntime advance | Evolvable | Accounting formula. Invariant checker validates weight-proportional vruntime advance and monotonic vruntime for running entity. |
| `eevdf_task_tick()` | Evolvable | Per-tick logic: delegates to `update_curr()`, check_preempt_tick, cgroup reweight. All policy code. |
| `set_protect_slice()` | Evolvable | Protection fraction is policy (ML-tunable via `ParamId::SchedProtectFraction`). |
| PELT (`update_load_avg()`) | Evolvable | Exponential decay formula. Invariant checker validates decay constant and signal bounds. |
| CBS charge path (`cbs_charge()`) | Evolvable | Budget accounting formula. Invariant checker validates deficit bounds. |
| RB-tree insert/delete/augment ops | Evolvable | Code operating on Nucleus RB-tree node layout. O(log N) invariant checked. |
| `pick_eevdf()` tree walk | Evolvable | Tie-breaking, PICK_BUDDY, protect_slice heuristics swappable |
| `place_entity()` wake policy | Evolvable | Lag inflation, initial placement, wake bonus swappable |
| Slice computation | Evolvable | ML-tunable via `ParamId::SchedEevdfWeightScale` |
| Preemption threshold | Evolvable | ML-tunable via `ParamId::SchedPreemptionLatencyBudget` |
| Load balancer heuristics | Evolvable | Migration benefit threshold ML-tunable |
Phase assignment: EEVDF Nucleus data structures and all Evolvable scheduler code (formulas + policy) are Phase 2 (required for basic scheduling). ML integration of tunable parameters is Phase 4 (Section 23.1).
Virtual runtime and weights. Each task accumulates virtual runtime proportional to its CPU consumption, inversely scaled by its weight:

delta_vruntime = delta_exec_ns * NICE_0_WEIGHT / weight
where NICE_0_WEIGHT = 1024 (the weight of a nice-0 task) and delta_exec_ns
is the wall-clock nanoseconds the task ran since the last accounting update.
Higher-weight tasks accumulate vruntime more slowly (they are entitled to more
CPU), and lower-weight tasks accumulate vruntime more quickly.
The sched_prio_to_weight table maps nice values to weights (identical to
Linux CFS/EEVDF). The 40 entries (nice -20 to +19):
| Nice | Weight | Nice | Weight | Nice | Weight | Nice | Weight |
|---|---|---|---|---|---|---|---|
| -20 | 88761 | -10 | 9548 | 0 | 1024 | 10 | 110 |
| -19 | 71755 | -9 | 7620 | 1 | 820 | 11 | 87 |
| -18 | 56483 | -8 | 6100 | 2 | 655 | 12 | 70 |
| -17 | 46273 | -7 | 4904 | 3 | 526 | 13 | 56 |
| -16 | 36291 | -6 | 3906 | 4 | 423 | 14 | 45 |
| -15 | 29154 | -5 | 3121 | 5 | 335 | 15 | 36 |
| -14 | 23254 | -4 | 2501 | 6 | 272 | 16 | 29 |
| -13 | 18705 | -3 | 1991 | 7 | 215 | 17 | 23 |
| -12 | 14949 | -2 | 1586 | 8 | 172 | 18 | 18 |
| -11 | 11916 | -1 | 1277 | 9 | 137 | 19 | 15 |
Each step of +1 nice reduces CPU share by approximately 10% (weight ratio
~1.25 between adjacent nice levels). The inverse weight table
(sched_prio_to_wmult) is precomputed for fixed-point division:
wmult[i] = 2^32 / weight[i].
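As a sanity check on the fixed-point identity, here is a minimal standalone sketch — not the kernel code; `wmult` is recomputed here rather than read from the precomputed table, and the truncated `wmult` can make the result differ from exact division by a unit or so for slice-scale deltas:

```rust
const NICE_0_WEIGHT: u64 = 1024;

/// sched_prio_to_wmult entry: 2^32 / weight, computed once at table-build time.
fn wmult(weight: u64) -> u64 {
    (1u64 << 32) / weight
}

/// Hot-path conversion: delta * NICE_0_WEIGHT / weight, with the division
/// replaced by a multiply and a 32-bit right shift.
fn delta_fair_fixed(delta_ns: u64, weight: u64) -> u64 {
    ((delta_ns as u128 * NICE_0_WEIGHT as u128 * wmult(weight) as u128) >> 32) as u64
}
```

For nice-0 (weight 1024) the shift cancels exactly and the delta passes through unchanged, which is why the real `calc_delta_fair()` short-circuits that case.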
calc_delta_fair: Converts wall-clock nanoseconds to virtual-time
nanoseconds for a given entity. Used by update_curr(), update_entity_lag(),
and place_entity():
```rust
/// Convert wall-clock delta to virtual-time delta for an entity.
/// For nice-0 (weight 1024): returns delta unchanged.
/// For nice +19 (weight 15): returns delta * 1024 / 15 ≈ 68× delta.
/// For nice -20 (weight 88761): returns delta * 1024 / 88761 ≈ 0.012× delta.
///
/// Linux equivalent: `calc_delta_fair()` + `__calc_delta()`.
#[inline]
fn calc_delta_fair(delta_ns: u64, weight: u32) -> u64 {
    if weight == NICE_0_WEIGHT {
        delta_ns
    } else {
        // Fixed-point: delta * NICE_0_WEIGHT * 2^32 / weight / 2^32
        // Using precomputed wmult = 2^32 / weight:
        //   result = (delta * NICE_0_WEIGHT * wmult) >> 32
        //
        // For weights within the sched_prio_to_weight table range (15..=88761),
        // use the precomputed table for speed. For group entities with weights
        // outside the table range (e.g., cpu.weight=10000 maps to group_weight
        // = 102400), compute wmult dynamically. This matches Linux's
        // `__update_inv_weight()` which computes inv_weight on-the-fly for
        // group entities.
        let wmult = if weight <= MAX_TABLE_WEIGHT {
            SCHED_PRIO_TO_WMULT[weight_to_idx(weight)]
        } else {
            // Dynamic computation for out-of-range weights (group entities).
            (u32::MAX as u64 / weight as u64) as u32
        };
        ((delta_ns as u128 * NICE_0_WEIGHT as u128 * wmult as u128) >> 32) as u64
    }
}
```
cgroup cpu.weight and task weight: Task weight is always derived from the task's
nice value via sched_prio_to_weight[nice + 20]. The cgroup's cpu.weight affects
the GroupEntity's weight in the hierarchical scheduler tree, not individual task
weights. The cpu.weight value [1, 10000] is converted to an EEVDF group weight via
group_weight = (cpu.weight * NICE_0_WEIGHT) / 100 and applied to the GroupEntity
that represents the cgroup in its parent's run queue. This hierarchical model means
that tasks in a cgroup with cpu.weight = 200 get twice the CPU share of tasks in a
sibling cgroup with cpu.weight = 100, but each task's individual vruntime
accumulation rate is still governed by its nice-derived weight within the group.
See Section 17.2 for the full hierarchical weight model.
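The conversion above can be sketched in one line (constants from this section; illustrative, not the kernel code):

```rust
const NICE_0_WEIGHT: u64 = 1024;

/// cgroup cpu.weight in [1, 10000] → EEVDF group weight, per the formula in
/// the text: group_weight = (cpu.weight * NICE_0_WEIGHT) / 100.
fn group_weight(cpu_weight: u64) -> u64 {
    (cpu_weight * NICE_0_WEIGHT) / 100
}
```

The default cpu.weight = 100 maps to NICE_0_WEIGHT (1024), and a sibling cgroup with cpu.weight = 200 gets exactly twice the group weight, giving its tasks twice the aggregate CPU share.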
Weight change propagation: Weight changes propagate lazily. When a cgroup's
cpu.weight is modified, the new value is stored atomically in
CpuController.weight (Relaxed store — no ordering fence needed). The scheduler
recomputes the GroupEntity weight on the next enqueue, dequeue, or scheduler
tick that touches a task in the affected cgroup. No reschedule IPI is sent for
weight changes on ticked cores — the weight update is picked up lazily at the
next tick. Tickless cores with running tasks from the affected cgroup receive a
reschedule IPI to ensure timely weight application (see
Section 17.2). Tasks that are sleeping when the weight
changes pick up the new weight on their next wakeup (enqueue path). Tasks that
are currently running on ticked cores pick it up on the next task_tick()
(1-4 ms granularity). This matches Linux CFS behavior, where
reweight_entity() is called lazily for ticked cores.
Eligibility.
A task is eligible to run when it has not received more than its fair share of
CPU time. The eligibility test compares the task's virtual runtime against
the run queue's weighted average virtual runtime (avg_vruntime):

eligible(se) ⟺ se.vruntime <= avg_vruntime, where avg_vruntime = zero_vruntime + sum_w_vruntime / sum_weight
A task with vruntime < avg_vruntime has received less CPU than its fair share
(positive virtual lag) and is eligible. A task with vruntime > avg_vruntime has
received more than its fair share (negative virtual lag) and must wait until
avg_vruntime advances past it. This ensures fairness: tasks that have been
underserved are prioritized, while tasks that have been overserved must wait.
Division-free eligibility test. To avoid division on the hot path, the
eligibility check is algebraically transformed (see entity_eligible() below)
into a pure integer comparison against the two accumulators (sum_w_vruntime,
sum_weight):

(se.vruntime - zero_vruntime) * sum_weight <= sum_w_vruntime

This uses only subtraction and multiplication — no division.
Tree pruning bound. The augmented RB-tree stores a per-subtree
min_vruntime field (the minimum se.vruntime of any entity in the subtree).
pick_eevdf() prunes subtrees using the same division-free eligibility check:
vruntime_eligible(rq, subtree.min_vruntime). If the subtree's minimum vruntime
is not eligible (i.e., greater than avg_vruntime), then no entity in that
subtree can be eligible either. This achieves O(log n) selection.
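The division-free test can be sketched as follows, assuming the accumulator layout described above (sum_w_vruntime kept relative to zero_vruntime; the struct and field names are illustrative):

```rust
/// Per-runqueue accumulators (illustrative layout).
struct VtAccum {
    zero_vruntime: i64,  // reference offset that keeps sums small
    sum_w_vruntime: i64, // Σ w_i · (v_i − zero_vruntime) over queued entities
    sum_weight: i64,     // Σ w_i
}

/// eligible ⟺ vruntime ≤ avg_vruntime, where
/// avg_vruntime = zero_vruntime + sum_w_vruntime / sum_weight.
/// Multiplying both sides by sum_weight removes the division:
fn vruntime_eligible(a: &VtAccum, vruntime: i64) -> bool {
    (vruntime - a.zero_vruntime) * a.sum_weight <= a.sum_w_vruntime
}
```

pick_eevdf() applies the same predicate to a subtree's min_vruntime: if the subtree minimum fails the test, every entity in the subtree fails it, so the whole subtree is pruned.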
Virtual deadline. Each task's virtual deadline determines its scheduling priority among eligible tasks:

vdeadline = vruntime + calc_delta_fair(slice_ns, weight)

This is set by place_entity() on wakeup and by update_deadline() on slice
expiry. Linux equivalent: se->deadline = se->vruntime + calc_delta_fair(se->slice, se).
The default slice is 750 us (slice_ns = 750_000), configurable via
sched_base_slice_ns. Deliberate divergence from Linux: Linux mainline uses
700 us (sysctl_sched_base_slice = 700000 in kernel/sched/fair.c). UmkaOS
uses 750 us based on internal analysis of container workload scheduling latency
— the extra 50 us reduces context-switch rate by ~7% on database/web-server
mixes without measurably increasing tail latency. This is an intentional UmkaOS
design choice, not a stale value from an earlier Linux version.
ML-tunable slice: The ML policy framework can adjust the effective slice via
ParamId::SchedEevdfWeightScale (default 100, range [50, 200]):
```rust
let weight_scale = PARAM_STORE.get(ParamId::SchedEevdfWeightScale)
    .map_or(100, |p| p.current.load(Relaxed));
let effective_slice = base_slice_ns * weight_scale as u64 / 100;
```

The scaled slice is consumed by place_entity() (Evolvable) when computing the
virtual slice for deadline assignment. A weight_scale of 150 increases slices
by 50%, reducing context-switch overhead at the cost of tail latency.
Lower-weight tasks get longer virtual deadlines (lower scheduling urgency);
higher-weight tasks get shorter virtual deadlines (higher urgency). The task
with the minimum vdeadline among all eligible tasks is selected by
pick_next_task.
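Putting the slice and weight together, the deadline assignment can be sketched as follows (plain division in place of the fixed-point path; illustrative only):

```rust
const NICE_0_WEIGHT: u64 = 1024;

/// Slice in virtual time for this entity (simplified calc_delta_fair).
fn vslice(slice_ns: u64, weight: u64) -> u64 {
    slice_ns * NICE_0_WEIGHT / weight
}

/// vdeadline = vruntime + vslice: higher weight → shorter virtual deadline
/// offset → higher urgency among eligible tasks.
fn virtual_deadline(vruntime: u64, slice_ns: u64, weight: u64) -> u64 {
    vruntime + vslice(slice_ns, weight)
}
```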
Red-black tree organization.
Each per-CPU run queue maintains a single intrusive red-black tree keyed by
vdeadline (virtual deadline), matching the current Linux tasks_timeline design:
- Tree key: `deadline` (virtual deadline). Ties broken by `TaskId` (lower ID sorts first). Linux equivalent: `entity_before()` compares `a->deadline < b->deadline`.
- Augmented field: `min_vruntime` — per-subtree minimum of `se.vruntime`. Linux equivalent: `sched_entity.min_vruntime` augmented field.
- Additional augmented field: `max_slice` — per-subtree maximum of `se.slice`. Used by `cfs_rq_max_slice()` for lag clamping.

This tree organization is identical to current Linux EEVDF: Linux also keys by deadline and augments with min_vruntime. (The initial 6.6 EEVDF merge keyed the tree by vruntime with a min_deadline augmentation; that layout was later replaced by the deadline-keyed design. The pre-EEVDF CFS keyed by vruntime with no deadline augmentation.)

tasks_timeline: An intrusive augmented RB-tree containing ALL runnable tasks (both eligible and ineligible). Each node embeds an `EevdfRbLink` with the augmented `min_vruntime` field. `pick_eevdf()` prunes ineligible subtrees using the division-free `vruntime_eligible()` check on the subtree's `min_vruntime`.
Eligibility is computed dynamically during the pick_eevdf() walk, not stored
as a tree membership property. The deadline-keyed single-tree design means
pick_eevdf() performs a left-descent walk that naturally visits earlier-deadline
tasks first.
Intrusive link: Each EevdfTask embeds an EevdfRbLink directly. Enqueue and
dequeue are O(log n) intrusive insert/remove operations with zero heap allocation.
This is critical for the scheduler hot path (per-syscall, per-tick).
Eligibility filter in pick_eevdf().
The EEVDF pick_eevdf() must find the eligible task with the earliest virtual
deadline. A task is eligible when:

se.vruntime <= avg_vruntime

(evaluated via the division-free vruntime_eligible() form above).
pick_eevdf() walks the deadline-keyed tree using vruntime_eligible() on
subtree min_vruntime fields to prune entire subtrees that cannot contain
an eligible task. The currently-running entity (curr) is also considered as a
candidate: if eligible and has an earlier deadline than the tree-walk result,
curr wins.
If no eligible task exists (a transient condition due to numerical precision),
pick_eevdf() returns curr if eligible, otherwise None (the caller falls
through to the idle task).
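The pruned walk can be illustrated with a greatly simplified toy tree. The real kernel structure is an intrusive augmented red-black tree and also considers `curr` as a candidate; this sketch only shows the pruning rule (all names illustrative):

```rust
/// Toy deadline-keyed binary tree node with a per-subtree min_vruntime
/// augmentation. Left children hold earlier deadlines.
struct Node {
    deadline: u64,
    vruntime: u64,
    min_vruntime: u64,       // minimum vruntime anywhere in this subtree
    left: Option<Box<Node>>, // earlier deadlines
    right: Option<Box<Node>>,// later deadlines
}

/// Return (deadline, vruntime) of the earliest-deadline eligible entity
/// (vruntime ≤ avg). Subtrees whose min_vruntime is ineligible are skipped.
fn pick(node: Option<&Node>, avg: u64) -> Option<(u64, u64)> {
    let n = node?;
    if n.min_vruntime > avg {
        return None; // nothing in this subtree can be eligible — prune
    }
    // Left subtree holds earlier deadlines; prefer its candidate if any.
    if let Some(best) = pick(n.left.as_deref(), avg) {
        return Some(best);
    }
    if n.vruntime <= avg {
        return Some((n.deadline, n.vruntime));
    }
    pick(n.right.as_deref(), avg)
}
```

Because the left subtree always holds earlier deadlines, the first eligible hit on the left-descent is the answer, which is what makes the single deadline-keyed tree sufficient.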
Lag tracking (vlag).
Virtual lag measures how far a task's vruntime deviates from the run queue's
weighted average. Linux stores this as se->vlag — an unweighted,
signed virtual-time quantity:

vlag = avg_vruntime - se.vruntime
Positive vlag means se.vruntime < avg_vruntime (task is owed CPU — eligible).
Negative vlag means se.vruntime > avg_vruntime (task over-served — ineligible).
Vlag is updated on every dequeue (when a task blocks, yields, or is preempted)
by update_entity_lag(). It is clamped to prevent starvation or unbounded credit:

limit = calc_delta_fair(cfs_rq_max_slice(rq) + TICK_NSEC, se)
se.vlag = clamp(vlag, -limit, limit)
The clamp limit is per-entity (depends on the entity's weight) and per-rq
(depends on the maximum slice of any entity on the rq, plus one tick for timing
granularity). For a nice-0 task with default slice: limit = 750_000 + TICK_NSEC.
For a nice +19 task (weight 15): limit = calc_delta_fair(750_000 + 4_000_000, 15) =
(750_000 + 4_000_000) * 1024 / 15 = 324_266_666 — much larger, allowing low-weight
tasks to accumulate proportionally more virtual credit.
Linux equivalent: update_entity_lag() in fair.c.
On enqueue (when a task wakes up), the saved vlag from the previous dequeue is
used by place_entity() to position the task in virtual time, ensuring that a
task owed CPU time before sleeping is prioritized when it wakes.
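The update-and-clamp step can be sketched as follows, with plain division standing in for the fixed-point calc_delta_fair and a 4 ms TICK_NSEC as in the example above:

```rust
const NICE_0_WEIGHT: i64 = 1024;
const TICK_NSEC: i64 = 4_000_000; // 4 ms tick, as in the worked example above

/// Simplified calc_delta_fair (plain division instead of fixed point).
fn calc_delta_fair(delta_ns: i64, weight: i64) -> i64 {
    delta_ns * NICE_0_WEIGHT / weight
}

/// vlag = avg_vruntime − vruntime, clamped to the per-entity/per-rq limit
/// limit = calc_delta_fair(max_slice + TICK_NSEC, weight).
fn update_entity_lag(avg_vruntime: i64, vruntime: i64, max_slice_ns: i64, weight: i64) -> i64 {
    let limit = calc_delta_fair(max_slice_ns + TICK_NSEC, weight);
    (avg_vruntime - vruntime).clamp(-limit, limit)
}
```

The low-weight case reproduces the 324_266_666 figure from the text: the same wall-clock bound converts to far more virtual time at weight 15.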
Deferred dequeue (over-served sleep path).
When a task sleeps while carrying negative vlag — meaning it has consumed more
CPU than its fair share — a naive immediate removal from the run queue would let
it re-enter with a head-start on avg_vruntime computation. UmkaOS uses a
deferred dequeue mechanism, matching the Linux 6.12 sched_delayed design:
- On sleep, if `vlag < 0` (equivalently, `!entity_eligible(rq, se)` — the two conditions are algebraically identical since `vlag = avg_vruntime - se.vruntime`), the task is NOT immediately removed from the `tasks_timeline` tree. The field `sched_delayed` is set to `true` and `on_rq` remains `Queued`. This deferral is gated on the `DELAY_DEQUEUE` scheduling feature flag (default: enabled). Special dequeue paths bypass the delay: `DEQUEUE_SPECIAL` (signal-induced dequeue, task exit) and `DEQUEUE_THROTTLE` (CBS bandwidth exhaustion) always remove immediately regardless of vlag.
- While deferred, a task remains in the tree and may be selected if eligible. When picked, `requeue_delayed_entity()` clears the delay and re-integrates it. The tree walk does NOT filter `sched_delayed` — this matches Linux `kernel/sched/fair.c pick_eevdf()` which has no `sched_delayed` guard.
- While deferred, the task still contributes its weight to the two `avg_vruntime` accumulators (`sum_w_vruntime`, `sum_weight`). This ensures that over-served sleeping tasks continue to push `avg_vruntime` upward, naturally decaying their negative vlag without requiring any explicit timer.
- The vlag clamp is the standard `update_entity_lag()` clamp: `limit = calc_delta_fair(cfs_rq_max_slice(rq) + TICK_NSEC, se)`. No separate widened clamp exists — the single clamping formula handles all cases. Linux does not have a separate widened clamp for deferred entities.
- Deferred removal is triggered by two events:
  - Vlag reaches zero at pick time: `pick_eevdf()` scans the tree and, for each deferred candidate, checks whether `entity_eligible(rq, se)` (i.e., `avg_vruntime` has advanced past the task's `vruntime`). When true, the task is removed from the tree (transitioned to `OnRqState::Off`) without any wake-up.
  - Task wakes up before vlag decays: On the wake-up path, if `sched_delayed == true`, the task is first removed from its deferred tree position, then re-enqueued via `place_entity()` which uses the saved vlag to position it correctly.
- On wake-up from a deferred state, `place_entity()` sets:

  se.vruntime = avg_vruntime - inflated_vlag

  where `inflated_vlag = se.vlag * (W + w_i) / W` compensates for the effect of adding this entity on `avg_vruntime` (see `place_entity()` below).

If `vlag < 0` (still over-served at wake time), `se.vruntime > avg_vruntime`,
so the task remains ineligible and waits until `avg_vruntime` catches up.
If `vlag >= 0` (fully decayed by the time it wakes), `se.vruntime <= avg_vruntime`,
so the task is immediately eligible for selection.
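The wake-time placement reduces to a few integer operations. A minimal sketch (W = sum of queued weights, w = the waking entity's weight; names illustrative):

```rust
/// place_entity() placement for a waking (possibly deferred) entity:
/// se.vruntime = avg_vruntime − inflated_vlag, with
/// inflated_vlag = vlag · (W + w) / W compensating for the shift the newly
/// added entity itself causes in avg_vruntime.
fn place_vruntime(avg_vruntime: i64, vlag: i64, sum_weight: i64, w: i64) -> i64 {
    let inflated_vlag = vlag * (sum_weight + w) / sum_weight;
    avg_vruntime - inflated_vlag
}
```

Negative vlag lands the entity above avg_vruntime (still ineligible); positive vlag lands it below (immediately eligible), exactly as described above.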
requeue_delayed_entity() — called from enqueue_task_fair() when re-enqueueing
a task with ENQUEUE_DELAYED flag, or when walking up the hierarchy and finding a
delayed ancestor. NOT called from pick_eevdf() — the transition happens on the
enqueue side, not the pick side. Matches Linux kernel/sched/fair.c
requeue_delayed_entity() (line ~6909):
```rust
fn requeue_delayed_entity(rq: &mut RunQueue, se: &mut EevdfTask) {
    debug_assert!(se.sched_delayed && se.on_rq == OnRqState::Deferred);
    // If DELAY_ZERO feature is enabled, zero positive vlag on requeue.
    // This prevents over-served tasks from accumulating unlimited vlag
    // credit during deferred sleep.
    if sched_feat(DELAY_ZERO) {
        update_entity_lag(rq, se);
        if se.vlag > 0 {
            // Dequeue from current position, place with vlag = 0, re-enqueue.
            __dequeue_entity(rq, se);
            se.vlag = 0;
            place_entity(rq, se, PlaceFlag::Requeue);
            __enqueue_entity(rq, se);
        }
    }
    update_load_avg(rq, se);
    se.sched_delayed = false;
    se.on_rq = OnRqState::Queued;
}
```
The sched_delayed field is added to EevdfTask and OnRqState gains a
Deferred variant to unambiguously represent the deferred state (distinct from
CbsThrottled which represents CBS bandwidth exhaustion — see below):
```rust
/// Whether and how a task appears on the run queue — extended for deferred
/// dequeue and CBS throttling.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum OnRqState {
    /// Task is sleeping and not physically present in any run queue tree.
    Off,
    /// Task is runnable and present in the `tasks_timeline` tree. Eligibility
    /// (whether `se.vruntime <= avg_vruntime`) is computed dynamically by
    /// `pick_eevdf()` via the division-free `entity_eligible()` check — it is
    /// NOT a stored state. This single variant replaces the former
    /// `Eligible`/`Ineligible` split, matching the Linux 6.6+ design where
    /// all runnable tasks reside in one `tasks_timeline` tree.
    Queued,
    /// **EEVDF deferred dequeue**: task is sleeping but still physically
    /// present in `tasks_timeline` because it had negative vlag at sleep time
    /// (`sched_delayed == true`). Remains in `tasks_timeline` and may be
    /// selected by `pick_eevdf()` if eligible; still contributes weight to
    /// `sum_w_vruntime` and `sum_weight`. Transitions to `Off` when vlag
    /// decays to zero or the task wakes and is re-enqueued.
    ///
    /// **Not the same as CBS throttling.** A `Deferred` task is an
    /// EEVDF-internal optimization: the task voluntarily slept while
    /// over-served (negative vlag), so it remains on the tree to let its vlag
    /// decay passively. It is NOT waiting for any timer or budget
    /// replenishment — it is waiting for a future wakeup event (I/O, signal,
    /// timer) at which point `place_entity()` repositions it.
    Deferred,
    /// **CBS bandwidth throttled**: task's cgroup CPU bandwidth budget is
    /// exhausted. The task is NOT in `tasks_timeline` — it has been fully
    /// dequeued with its `vruntime` and `vlag` preserved in the task struct.
    /// The task waits for the CBS period replenishment timer to fire, at
    /// which point it is re-enqueued with `OnRqState::Queued`.
    ///
    /// Unlike `Deferred`, a `CbsThrottled` task:
    /// - Is NOT physically present in any run queue tree.
    /// - Does NOT contribute weight to `avg_vruntime` accumulators.
    /// - Is INTERRUPTIBLE by signals: SIGKILL immediately re-enqueues the
    ///   task (bypassing the throttle with bandwidth debt). Other signals set
    ///   `TIF_SIGPENDING` and the task processes them on replenishment.
    /// - Transitions to `Queued` on budget replenishment, not on vlag decay.
    ///
    /// See `CbsCpuServer.throttled` and `CbsCpuServer.max_throttled` in
    /// [Section 7.6](#cpu-bandwidth-guarantees) for the per-CPU server state
    /// that controls when tasks enter/exit this state.
    CbsThrottled,
}
```
PELT interaction with deferred dequeue.
When a task is deferred (sched_delayed = true, still in the tree but sleeping), its
PELT accounting pauses: last_update_time is NOT advanced
during the deferred period. The task is sleeping — it is not consuming CPU — so neither
util_sum nor runnable_sum should accumulate.
When the task is eventually picked (because its vlag decayed to zero) or explicitly
dequeued (wake-up from deferred state), PELT decay catches up with a single call to
update_load_avg() using the true elapsed time since the last update. The geometric
decay naturally handles the gap: if the deferred period was D nanoseconds, the call
applies decay_load(sum, D / PELT_PERIOD_NS) to all three accumulators, correctly
reducing the stale contribution.
If the deferred period exceeds 32 ms (one PELT half-life), the stale contribution decays by more than 50%, which is correct behavior: the task was not actually consuming CPU during that time. A task deferred for 100 ms (~97 periods, about three half-lives) has its PELT signal decayed to ~12% of its pre-sleep value, accurately reflecting its recent CPU demand.
This design avoids two failure modes:
1. No load spike on resume: without the catch-up decay, a task resuming from
deferred state would appear to have full utilization (stale util_avg), causing
EAS to over-provision and the load balancer to make unnecessary migrations.
2. No double-counting: period_contrib is preserved during the deferred period
(not reset to zero), so the Phase 1 head completion in update_load_avg() correctly
finishes the partial period that was in progress when the task entered deferred state.
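The catch-up decay can be illustrated numerically. This f64 sketch is not the kernel code — real PELT uses fixed-point arithmetic with a per-period factor y where y^32 = 1/2, and the constants here are the ones quoted in this section:

```rust
const PELT_PERIOD_NS: u64 = 1_024_000; // 1024 µs accounting period
const HALF_LIFE_PERIODS: f64 = 32.0;   // signal halves every 32 periods (~32 ms)

/// Decay an accumulator across `elapsed_ns` of sleep in a single catch-up
/// call, as update_load_avg() does when a deferred task is finally picked.
fn decay_catch_up(sum: f64, elapsed_ns: u64) -> f64 {
    let periods = elapsed_ns as f64 / PELT_PERIOD_NS as f64;
    sum * 0.5f64.powf(periods / HALF_LIFE_PERIODS)
}
```

A 100 ms deferred period spans roughly three half-lives, leaving about 12% of the pre-sleep signal — the figure used above.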
Bandwidth throttling interaction.
When a cgroup's CPU bandwidth is exhausted (CBS budget depleted per
Section 7.6), all tasks in the cgroup are
dequeued from their respective per-CPU run queues (removed from
tasks_timeline) and their on_rq is set to
OnRqState::CbsThrottled. Each task's vruntime and lag are
preserved in the task struct.
On budget replenishment (when the CBS period timer fires and the cgroup receives
a new quota), all CbsThrottled tasks are re-enqueued with their saved
vruntime and vlag intact, transitioning to Queued.
This ensures that bandwidth throttling does not cause fairness distortion —
a task that was owed CPU time before throttling remains owed after throttling
ends.
Signal delivery during CBS throttling:
When a cgroup's CPU bandwidth is exhausted and tasks are dequeued:
- SIGKILL / SIGSTOP (uncatchable): Delivered immediately. The task is re-enqueued with its `vruntime` intact, bypassing bandwidth throttling. This ensures `kill -9` always works regardless of bandwidth state. The bandwidth consumed is recorded as debt, repaid on next quota replenishment.
- Other signals (SIGTERM, SIGHUP, user signals): Queued in the task's pending signal set (`TIF_SIGPENDING` is set). The task is NOT re-enqueued. Signal delivery completes when the task is re-enqueued on quota replenishment and returns to userspace through the signal check path.
- KILLABLE tasks (UNINTERRUPTIBLE | WAKEKILL) that are CBS-throttled: SIGKILL wakes immediately with bandwidth debt. Non-fatal signals remain pending.
This design preserves bandwidth isolation (a cgroup cannot escape its quota by receiving signals) while ensuring liveness (fatal signals always terminate within one scheduling tick + IPI latency).
Latency-nice.
Tasks with latency_nice < 0 (latency-sensitive, e.g., interactive or audio)
have their virtual deadline shortened, making them scheduled sooner among
eligible tasks. Latency-nice affects the virtual deadline offset, not the
slice duration: a more latency-sensitive task gets a shorter virtual deadline
offset. The effective slice used for deadline calculation:

effective_slice = base_slice_ns * LATENCY_NICE_0_WEIGHT / latency_weight
vslice = calc_delta_fair(effective_slice, weight)

where latency_weight = LATENCY_NICE_TO_WEIGHT[(latency_nice + 20)]. A task
with latency_nice = -20 (latency_weight = 88818) gets
effective_slice = 750_000 * 1024 / 88818 ≈ 8_650 ns, giving it a very short
virtual deadline offset — it is picked first among equal-vruntime peers. A task
with latency_nice = +19 (latency_weight = 15) gets
effective_slice = 750_000 * 1024 / 15 = 51_200_000 ns, pushing its deadline
far out — it is picked last.
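The effective-slice arithmetic is small enough to sketch directly (constants from this section; function name illustrative):

```rust
const LATENCY_NICE_0_WEIGHT: u64 = 1024;
const BASE_SLICE_NS: u64 = 750_000; // sched_base_slice_ns default

/// effective_slice = base_slice_ns * LATENCY_NICE_0_WEIGHT / latency_weight:
/// higher latency_weight → shorter virtual deadline offset.
fn effective_slice(latency_weight: u64) -> u64 {
    BASE_SLICE_NS * LATENCY_NICE_0_WEIGHT / latency_weight
}
```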
NOT a Linux feature. latency_nice is an UmkaOS-original design inspired by
the proposed (but never merged) Linux latency_nice patchset (Vincent Guittot / Parth
Shah, LKML 2022-2024). The patchset was discussed on LKML multiple times but never
accepted into torvalds/linux mainline. As of Linux 6.17+, there is no
latency_nice field in struct task_struct, no sched_latency_nice field in
struct sched_attr, and no SCHED_FLAG_LATENCY_NICE bit.
The LATENCY_NICE_TO_WEIGHT table, the effective_slice formula, and the
sched_latency_nice extension to sched_attr are all UmkaOS-specific.
UmkaOS defines a new flag SCHED_FLAG_LATENCY_NICE = 0x80 for sched_setattr(2)
and extends struct sched_attr with a sched_latency_nice: i32 field (see
Section 19.1 for the extended sched_attr layout). Applications that
use latency_nice are UmkaOS-only and will not work on upstream Linux kernels.
```rust
/// Latency-nice to latency_weight mapping. Higher weight = shorter virtual
/// deadline offset = more latency-sensitive. Geometric ratio ~1.25x per step.
/// latency_nice 0 maps to LATENCY_NICE_0_WEIGHT (1024).
///
/// **UmkaOS-original.** This table is part of the UmkaOS latency_nice
/// extension (NOT a Linux feature — see the latency-nice note above).
/// UmkaOS maintains a separate latency weight table (distinct from
/// `sched_prio_to_weight`) to enable per-cgroup ML policy overrides via
/// `SubsystemId::Scheduler` parameters. See
/// [Section 23.1](23-ml-policy.md#aiml-policy-framework-closed-loop-kernel-intelligence).
///
/// **Intentionally different from `sched_prio_to_weight`**: The nice-to-weight
/// table (`sched_prio_to_weight`) controls CPU bandwidth allocation (time
/// slices). This table controls scheduling latency (virtual deadline offset).
/// They use the same 1024 base and the same ~1.25x geometric ratio per step,
/// but the two dimensions are independent: a thread can have nice=0 (normal
/// bandwidth) but latency_nice=-20 (maximum latency sensitivity).
///
/// **Formula**: `entry[i] = round(1024 * 1.25^(20 - i))` for i = 0..39.
/// The base value 1024 at index 20 (latency_nice 0) matches `NICE_0_WEIGHT`.
/// Each step toward latency_nice -20 multiplies by 1.25 (more
/// latency-sensitive); each step toward +19 divides by 1.25 (less
/// latency-sensitive).
///
/// Index 0 = latency_nice -20 (most latency-sensitive),
/// index 20 = latency_nice 0 (baseline),
/// index 39 = latency_nice +19 (least latency-sensitive).
///
/// Lookup: latency_weight = LATENCY_NICE_TO_WEIGHT[(latency_nice + 20) as usize]
///
/// Note: LATENCY_NICE_TO_WEIGHT[-20] = 88818 differs from
/// sched_prio_to_weight[-20] = 88761. Both use ~1.25x geometric ratios but
/// different rounding. The latency table uses exact `round(1024 * 1.25^20)` =
/// 88818; the weight table uses the Linux-compatible rounded approximation
/// 88761. Intentionally different, as noted above.
pub const LATENCY_NICE_TO_WEIGHT: [u32; 40] = [
    // latency_nice -20    -19    -18    -17    -16
    88818, 71054, 56843, 45475, 36380,
    // latency_nice -15    -14    -13    -12    -11
    29104, 23283, 18626, 14901, 11921,
    // latency_nice -10     -9     -8     -7     -6
    9537, 7629, 6104, 4883, 3906,
    // latency_nice  -5     -4     -3     -2     -1
    3125, 2500, 2000, 1600, 1280,
    // latency_nice  +0     +1     +2     +3     +4
    1024, 819, 655, 524, 419,
    // latency_nice  +5     +6     +7     +8     +9
    336, 268, 215, 172, 137,
    // latency_nice +10    +11    +12    +13    +14
    110, 88, 70, 56, 45,
    // latency_nice +15    +16    +17    +18    +19
    36, 29, 23, 18, 15,
];

/// Base latency weight corresponding to latency_nice 0.
pub const LATENCY_NICE_0_WEIGHT: u32 = 1024;
```
Latency-nice does NOT affect CPU bandwidth — only scheduling latency. A latency-nice -20 task at nice 0 receives the same total CPU share as a latency-nice 0 task at nice 0; it simply gets scheduled sooner when it wakes up.
Data structures.
/// Lock level constant for per-CPU run queue locks.
///
/// Placed above the task lock level (20) and below the priority-inheritance
/// lock level (60) in the system-wide lock hierarchy defined in
/// [Section 3.4](03-concurrency.md#cumulative-performance-budget). The lock hierarchy uses x10 spacing
/// (0, 10, 20, ..., 260). Level 50 is shared by all run queue locks regardless
/// of which CPU they protect.
/// (Level 30 is `SIGHAND_LOCK`, level 40 is `SIGLOCK`/`FDTABLE_LOCK`.)
///
/// Runqueue lock is acquired via `rq.lock()` which returns
/// `SpinGuard<RunQueueData, RQ_LOCK_LEVEL>`. The lock level prevents holding a
/// higher-level lock (e.g., CAP_TABLE_LOCK at level 70) while holding the runqueue lock
/// — compile-time enforced. PI_LOCK (level 45) is BELOW RQ_LOCK (level 50).
///
/// # Same-level ordering
/// Because all `RunQueue` locks share level 50, the type system alone cannot prevent
/// an ABBA deadlock between two run queues. The `lock_two_runqueues()` function
/// closes this gap: it is the **only** function permitted to hold two run queue locks
/// simultaneously, and it always acquires them in CPU-ID order.
pub const RQ_LOCK_LEVEL: u32 = 50;
/// Logical CPU identifier. Monotonic index assigned during topology enumeration
/// (BSP = 0, APs numbered in ACPI MADT / DT order). Used as a key for per-CPU
/// data structures and as a lock-ordering tiebreaker.
pub type CpuId = u32;
/// Per-CPU run queue — the top-level schedulable entity for one logical CPU.
///
/// `RunQueue` is the owner of the `SpinLock` that protects `EevdfRunQueue`
/// and the associated RT/deadline queues for one CPU. Callers that need to lock
/// two run queues at once **must** use `lock_two_runqueues()` — direct chained
/// calls to `lock()` on two different `RunQueue`s is a compile-time error when
/// the lock-level type system detects two concurrent level-50 acquisitions.
pub struct RunQueue {
/// Identity of the CPU that owns this run queue.
/// Used by `lock_two_runqueues()` to establish a canonical acquisition order.
pub cpu_id: CpuId,
/// Typed spinlock carrying the EEVDF and RT/deadline state.
/// The level-50 type parameter participates in the compile-time lock hierarchy
/// ([Section 3.4](03-concurrency.md#cumulative-performance-budget)); it prevents acquiring this lock while
/// already holding a level-50 lock except through `lock_two_runqueues()`.
lock: SpinLock<RunQueueData, RQ_LOCK_LEVEL>,
}
/// RT priority valid range for SCHED_FIFO and SCHED_RR: 1–99.
/// Priority 0 is EINVAL for real-time policies (reserved for non-RT policies
/// per POSIX and Linux ABI). `sched_get_priority_min(SCHED_FIFO) == 1`.
/// Slot 0 in the priority bitmap is allocated but always empty.
pub const RT_PRIORITY_MIN: u8 = 1;
pub const RT_PRIORITY_MAX: u8 = 99;
pub const RT_PRIORITY_LEVELS: usize = 100; // Indexed 0–99; slot 0 is unused (priority 0 = EINVAL).
// Parameter validation for sched_setscheduler / sched_setattr:
//
// if policy == SCHED_FIFO || policy == SCHED_RR {
// if param.sched_priority == 0 { return EINVAL; }
// if param.sched_priority > RT_PRIORITY_MAX { return EINVAL; }
// }
//
// Equivalently: valid range is RT_PRIORITY_MIN..=RT_PRIORITY_MAX (1–99).
// sched_get_priority_min(SCHED_FIFO) and sched_get_priority_min(SCHED_RR)
// both return 1. This matches the Linux ABI and POSIX SCHED_FIFO/SCHED_RR
// semantics: priority 0 is defined only for non-RT policies (SCHED_OTHER,
// SCHED_BATCH, SCHED_IDLE) where it is the only legal value.
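The validation rule above can be made runnable; a minimal sketch (the `SCHED_FIFO`/`SCHED_RR`/`EINVAL` constants mirror the Linux ABI values, and `validate_rt_priority` is an illustrative name, not the in-tree function):

```rust
// Runnable sketch of the sched_setscheduler parameter validation above.
const SCHED_FIFO: u32 = 1;
const SCHED_RR: u32 = 2;
const EINVAL: i32 = 22;
const RT_PRIORITY_MIN: u8 = 1;
const RT_PRIORITY_MAX: u8 = 99;

fn validate_rt_priority(policy: u32, sched_priority: u8) -> Result<(), i32> {
    if policy == SCHED_FIFO || policy == SCHED_RR {
        // Priority 0 is reserved for non-RT policies; 1..=99 is the valid range.
        if sched_priority < RT_PRIORITY_MIN || sched_priority > RT_PRIORITY_MAX {
            return Err(EINVAL);
        }
    }
    Ok(())
}
```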
/// Absolute deadline timestamp in nanoseconds (monotonic clock).
///
/// Used as the ordering key for the `DlRunQueue` intrusive red-black tree.
/// Two tasks with the same absolute deadline share equal priority under EDF;
/// tie-breaking uses `TaskId` (lower ID wins). In practice, two tasks with
/// identical absolute deadlines are exceedingly rare.
pub type AbsDeadlineNs = u64;
/// Fixed-point scale for deadline bandwidth accounting.
///
/// `DL_BW_SCALE = 1 << 20` (1,048,576 exactly). Deadline bandwidth
/// fractions are stored as `runtime_ns * DL_BW_SCALE / period_ns`, giving
/// 20 bits of sub-unit precision. A task consuming 100% of the CPU has
/// `bw = DL_BW_SCALE`. The sum of all per-task bandwidths must not exceed
/// the queue's capacity fraction (`capacity_ns` scaled to `DL_BW_SCALE` over
/// the 1-second accounting window; see `DlRunQueue::capacity_ns`).
///
/// This matches Linux's `BW_SHIFT = 20` / `BW_UNIT = 1 << BW_SHIFT`
/// convention so that bandwidth values computed from `SCHED_DEADLINE`
/// parameters are directly comparable. CBS admission
/// ([Section 7.6](#cpu-bandwidth-guarantees)) uses the same `BW_SCALE = 1 << 20`
/// constant, ensuring DL and CBS bandwidths can be summed directly in
/// the system-wide overcommit check.
pub const DL_BW_SCALE: u64 = 1 << 20;
/// Per-CPU real-time (SCHED_FIFO / SCHED_RR) run queue.
///
/// UmkaOS improves on Linux's `struct rt_rq` in two ways:
///
/// 1. **Per-queue CBS bandwidth accounting** instead of a single global
/// `sched_rt_runtime_us` knob. Each `RtRunQueue` tracks its own consumed
/// runtime and replenishment period, so individual CPUs can be throttled
/// independently without a global lock. This is particularly valuable on
/// heterogeneous platforms where P-core and E-core CPUs have different
/// RT capacity budgets.
///
/// 2. **Typed priority bitmap** — the 100-bit occupancy map uses two `u64`
/// words (128 bits allocated, top 28 unused), with the highest set bit
/// indicating the next priority level to schedule. A leading-zeros scan
/// (`CLZ` / `BSR`) on the bitmap locates the highest occupied priority in
/// a single instruction on all supported architectures. Scan word 1 first
/// (priorities 64-99); if zero, scan word 0 (priorities 0-63).
///
/// Linux reference: `struct rt_rq` in `kernel/sched/sched.h` (Linux 6.x).
/// Key differences: no `rt_nr_boosted` (PI boost is tracked per-task in
/// UmkaOS, not per-queue), no `pushable_tasks` plist (UmkaOS uses a separate
/// per-CPU migration candidate set managed by the load balancer), and no
/// embedded group-scheduling pointers (`tg`, `rq`).
pub struct RtRunQueue {
/// Two-word occupancy bitmap for priority levels 0–99.
///
/// Bit `p` is set when priority queue `p` is non-empty.
/// Bit 0 is never set (priority 0 is invalid for RT policies;
/// `RT_PRIORITY_MIN == 1`). Word 0 covers priorities 0–63; word 1
/// covers priorities 64–99 (bits 36–63 of word 1 are always zero).
/// The highest-priority non-empty queue is found by scanning for the
/// most-significant set bit across both words (word 1 first).
pub bitmap: [u64; 2],
/// Per-priority intrusive task lists.
///
/// `queues[p]` holds all runnable tasks at RT priority `p` in
/// FIFO order. SCHED_RR tasks are rotated to the tail of their
/// queue on time-slice expiry. SCHED_FIFO tasks are never rotated.
/// Indexed by user-visible `sched_priority` (0-99), where 99 is the
/// highest RT priority. This differs from Linux's internal kernel
/// priority numbering (where 0 is the highest internal priority);
/// UmkaOS uses the user-facing numbering directly to avoid the
/// `99 - sched_priority` translation.
/// Note: `queues[0]` is always empty (`RT_PRIORITY_MIN == 1`) and
/// `bitmap` bit 0 is never set; the slot exists only for O(1) direct indexing.
pub queues: [IntrusiveList<Task>; RT_PRIORITY_LEVELS],
/// Number of runnable RT tasks on this CPU.
///
/// Incremented on enqueue, decremented on dequeue. Does not include
/// tasks that are throttled (removed from all queues). The value
/// equals the sum of task counts across all priority level queues:
/// `nr_running = sum(queues[p].len() for all p)`.
pub nr_running: u32,
/// Accumulated runtime consumed during the current throttle period, in
/// nanoseconds.
///
/// Increased by the scheduler tick handler on every tick that an RT task
/// is running. When `rt_time_ns >= rt_runtime_ns` the queue is throttled:
/// all RT tasks are dequeued and `throttled` is set. This is a per-CPU
/// improvement over Linux's `sched_rt_runtime_us` global knob.
pub rt_time_ns: u64,
/// Maximum RT runtime allowed per period, in nanoseconds.
///
/// Defaults to `950_000_000` (950 ms per second — reserving 5% of the CPU
/// for non-RT work, matching Linux's default `sched_rt_runtime_us`).
/// Configurable at runtime via `sysctl umka.sched.rt_runtime_ns`.
/// Set to `u64::MAX` to disable throttling (equivalent to
/// `sched_rt_runtime_us = -1` in Linux).
pub rt_runtime_ns: u64,
/// `true` when this queue is currently throttled due to bandwidth
/// exhaustion (`rt_time_ns >= rt_runtime_ns`).
///
/// While throttled, no RT task from this queue may be selected by
/// `pick_next_task`. The period replenishment timer resets this flag
/// and re-enqueues all previously throttled tasks.
pub throttled: bool,
/// Monotonic timestamp (nanoseconds) of the start of the current
/// throttle accounting period.
///
/// The period length is 1 second (`1_000_000_000 ns`), matching Linux's
/// `sched_rt_period_us` default of 1 s. At the end of each period,
/// `rt_time_ns` is reset to zero, `throttled` is cleared, and
/// `period_start_ns` is advanced by the period length.
pub period_start_ns: u64,
}
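The two-word MSB scan described in the `bitmap` docs can be sketched as a standalone helper (illustrative, not the in-tree function; `u64::leading_zeros` compiles to `CLZ`/`LZCNT` on the supported architectures):

```rust
// Standalone sketch of the priority-bitmap scan: word 1 (priorities 64-99)
// first, then word 0 (priorities 0-63). Returns the highest occupied RT
// priority, or None when no RT task is runnable.
fn highest_rt_priority(bitmap: &[u64; 2]) -> Option<u8> {
    if bitmap[1] != 0 {
        // Bit b of word 1 corresponds to priority 64 + b.
        Some(64 + (63 - bitmap[1].leading_zeros()) as u8)
    } else if bitmap[0] != 0 {
        // Bit b of word 0 corresponds to priority b.
        Some((63 - bitmap[0].leading_zeros()) as u8)
    } else {
        None
    }
}
```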
/// Per-CPU deadline (SCHED_DEADLINE) run queue.
///
/// UmkaOS improves on Linux's `struct dl_rq` in two ways:
///
/// 1. **Intrusive red-black tree with cached leftmost** — each `Task` embeds
/// a `DlRbLink` node (see `EevdfTask.dl_rb_link`), so enqueue and dequeue
/// manipulate only the embedded pointers — **zero heap allocation**. The
/// tree is ordered by `(AbsDeadlineNs, TaskId)` giving O(log n) insert /
/// remove and O(1) earliest-deadline pickup via the cached `leftmost`
/// pointer (updated on every structural change). This matches Linux's
/// `rb_root_cached` + embedded `rb_node` pattern used in `struct dl_rq`.
///
/// Unlike `BTreeMap`, which allocates B-tree nodes on the heap during
/// `insert()`, the intrusive tree never allocates — making it safe to call
/// under the per-CPU runqueue spinlock on the scheduler hot path.
///
/// 2. **Explicit bandwidth tracking** — `total_bw` and `capacity_ns` are
/// first-class fields rather than derived from per-task `dl_bw` entries.
/// This makes admission control O(1): check `total_bw + new_task_bw`
/// against the queue's capacity fraction (which equals `DL_BW_SCALE` on a
/// fully available CPU) before accepting a new SCHED_DEADLINE task.
///
/// Linux reference: `struct dl_rq` in `kernel/sched/sched.h` (Linux 6.x).
/// Key differences: explicit `capacity_ns` replaces per-rq `dl_bw` struct,
/// and `earliest_deadline_ns` is a cached O(1) copy of the earliest deadline
/// (equivalent to `rb_first_cached(&dl_rq->root)` in Linux).
pub struct DlRunQueue {
/// Intrusive red-black tree of runnable deadline tasks, ordered by
/// `(AbsDeadlineNs, TaskId)`.
///
/// Each task embeds a `DlRbLink` in `EevdfTask.dl_rb_link`. The tree
/// owns no heap-allocated nodes — all storage lives inside the `Task`
/// structs themselves. Insert and remove are O(log n) with no allocator
/// calls, safe to execute under the per-CPU runqueue spinlock.
///
/// Invariant: every task linked into this tree is runnable (not sleeping)
/// and has been admitted through the bandwidth test. The ordering key
/// combines the task's absolute deadline with its `TaskId` to prevent
/// key collisions when two tasks share the same absolute deadline.
/// If a task's deadline is updated (e.g., on new job arrival), it is
/// unlinked and re-inserted with the new key.
pub root: IntrusiveRbRoot<DlRbLink>,
/// Cached pointer to the leftmost (earliest-deadline) node, or `None`
/// when the tree is empty. Maintained on every insert / remove — provides
/// O(1) earliest-deadline pickup without a tree traversal.
pub leftmost: Option<NonNull<DlRbLink>>,
/// Sum of fixed-point bandwidth fractions for all admitted tasks.
///
/// Each task contributes `runtime_ns * DL_BW_SCALE / period_ns` to
/// `total_bw`. Admission control accepts a new task only if
/// `total_bw + new_bw <= capacity_ns * DL_BW_SCALE / period_ns`.
/// Maintained incrementally: increased on task admission, decreased
/// on task departure (sleep, termination, or policy change away from
/// SCHED_DEADLINE).
pub total_bw: u64,
/// CPU capacity in nanoseconds per period (default: 1_000_000_000 for a
/// fully available CPU over a 1-second window).
///
/// On heterogeneous CPUs, `capacity_ns` is scaled by the CPU's capacity
/// factor (from `CpuCapacity.capacity / 1024`) so that a 512-capacity
/// efficiency core exposes only 500 ms of deadline capacity per second.
/// This prevents over-admission on low-capacity cores.
pub capacity_ns: u64,
/// Number of runnable deadline tasks on this CPU.
///
/// Maintained as a separate `u32` on every enqueue/dequeue. Keeping
/// `nr_running` in the hot struct avoids pointer chasing to count tree
/// nodes.
pub nr_running: u32,
/// Cached absolute deadline of the earliest-deadline task, or
/// `u64::MAX` when the queue is empty.
///
/// Mirrors the deadline component of `leftmost`'s key. Updated on every
/// enqueue and dequeue. Used by `pick_next_task` and the preemption
/// check (`resched_curr(rq, ReschedUrgency::Eager)`) to compare against the currently running
/// task's deadline without a tree traversal.
pub earliest_deadline_ns: u64,
}
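Putting `DL_BW_SCALE`, `total_bw`, and `capacity_ns` together, the O(1) admission test can be sketched as below. This assumes the 1-second accounting window used throughout this section; the helper names are illustrative, not from the kernel source:

```rust
// Illustrative O(1) CBS admission check for a DlRunQueue.
const DL_BW_SCALE: u64 = 1 << 20;
const WINDOW_NS: u64 = 1_000_000_000; // 1-second accounting window

fn dl_task_bw(runtime_ns: u64, period_ns: u64) -> u64 {
    // Fixed-point bandwidth fraction: runtime/period scaled by 2^20.
    runtime_ns * DL_BW_SCALE / period_ns
}

fn dl_admit(total_bw: u64, capacity_ns: u64, runtime_ns: u64, period_ns: u64) -> bool {
    // Accept only if the summed bandwidth stays within this CPU's capacity
    // fraction (DL_BW_SCALE on a fully available CPU, less on E-cores).
    total_bw + dl_task_bw(runtime_ns, period_ns) <= capacity_ns * DL_BW_SCALE / WINDOW_NS
}
```

For example, a 25%-utilization task (250 ms runtime every 1 s period) contributes `DL_BW_SCALE / 4 = 262_144` to `total_bw`.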
/// Intrusive red-black tree link embedded in each deadline task.
///
/// Contains the left/right/parent pointers and color bit needed for the
/// red-black tree, plus the ordering key `(AbsDeadlineNs, TaskId)`.
/// The `container_of!` macro recovers the owning `Task` from a `DlRbLink`
/// pointer. No heap allocation is performed — the link lives inside the
/// `Task` struct (via `EevdfTask.dl_rb_link`).
///
/// A `DlRbLink` is in exactly one of two states:
/// - **Linked**: the task is in a `DlRunQueue` tree. `parent`/`left`/`right`
/// may be non-null.
/// - **Unlinked**: the task is not in any tree. All pointer fields are null
/// and `is_linked()` returns `false`.
pub struct DlRbLink {
/// Ordering key: `(absolute_deadline_ns, task_id)`.
pub key: (AbsDeadlineNs, TaskId),
parent: Option<NonNull<DlRbLink>>,
left: Option<NonNull<DlRbLink>>,
right: Option<NonNull<DlRbLink>>,
color: RBColor,
}
/// Root sentinel for an intrusive red-black tree.
///
/// Contains only the root pointer. The tree does not own any nodes — node
/// lifetime is tied to the containing struct (e.g., `Task`). All operations
/// (insert, remove, find) take `&mut IntrusiveRbRoot` and `&mut DlRbLink`
/// references, never allocate, and are safe to call under spinlock.
pub struct IntrusiveRbRoot<L> {
root: Option<NonNull<L>>,
}
/// Data protected by `RunQueue.lock`.
pub struct RunQueueData {
/// Single EEVDF tree (tasks filtered dynamically by vruntime_eligible() pruning).
pub eevdf: EevdfRunQueue,
/// RT FIFO/RR run queue for this CPU.
pub rt: RtRunQueue,
/// SCHED_DEADLINE (EDF) run queue for this CPU.
pub dl: DlRunQueue,
/// Per-cgroup CBS (Constant Bandwidth Server) instances for this CPU.
///
/// Keyed by `CgroupId` (integer key → XArray, per
/// [Section 3.13](03-concurrency.md#collection-usage-policy)). Each entry tracks the per-CPU budget
/// remaining, deadline, and throttle state for one cgroup's cpu.max
/// enforcement on this CPU.
///
/// Each CBS server embeds its own local EEVDF tree for scheduling
/// tasks assigned to that server. Tasks in a CBS-guaranteed cgroup
/// are enqueued in the server's EEVDF tree, not the main per-CPU
/// `eevdf` tree. `pick_next_task()` checks CBS servers (in EDF order
/// by deadline) between the DL and plain-EEVDF steps. When a CBS
/// server's budget is exhausted and steal fails, the server is
/// throttled and skipped (OnRqState → CbsThrottled). Replenishment
/// timer un-throttles and re-enables the server.
///
/// See [Section 7.6](#cpu-bandwidth-guarantees) for `CbsCpuServer` struct,
/// replenishment/steal protocol, and cpu.max/cpu.guarantee semantics.
///
/// `XArray<CbsCpuServer>` keyed by `CgroupId` — per-CPU mapping from cgroup IDs to
/// their CBS servers on this CPU. One entry per cgroup with a `cpu.guarantee`
/// on this CPU. Allocated lazily on first task enqueue.
pub cbs_servers: XArray<CbsCpuServer>,
/// Per-cgroup group scheduling entities for this CPU.
///
/// Keyed by `CgroupId` (integer key → XArray). Each entry is the per-CPU
/// `GroupEntity` ([Section 7.2](#heterogeneous-cpu-scheduling--hierarchical-eevdf-via-group-scheduling-entities))
/// for one cgroup on this CPU. Created lazily on first task enqueue into
/// a cgroup on this CPU; removed when the last task from that cgroup
/// leaves this CPU.
///
/// `dequeue_task` and `enqueue_task` in `SchedClassOps` update the
/// appropriate `GroupEntity` (decrement/increment `nr_running`, update
/// `load_avg`). Cgroup migration (step 15) uses this to move a task
/// between GroupEntities on the same CPU.
pub group_entities: XArray<GroupEntity>,
/// Cached per-CPU task clock (nanoseconds, monotonic). Updated once
/// per scheduler entry by `update_rq_clock()`. Excludes IRQ time.
/// Used by `update_curr()` as the canonical scheduler time source
/// (`rq_clock_task(rq)` returns this value). Avoids per-entity
/// sched_clock() calls — the clock is read once and shared.
/// Linux equivalent: `rq->clock_task` in `kernel/sched/sched.h`.
pub clock_task: u64,
/// Accumulated IRQ time on this CPU (nanoseconds, monotonic).
/// Updated by `irq_time_start()`/`irq_time_end()` at IRQ entry/exit.
/// Subtracted from the raw clock in `update_rq_clock()` to produce
/// the task-only clock (`clock_task`). Without this, IRQ processing
/// time is charged to the running task's vruntime.
/// Linux equivalent: `rq->prev_irq_time` + irq accounting in
/// `irqtime_account_irq()` / `account_irq_enter_time()`.
pub irq_time_ns: u64,
/// Timestamp of the most recent IRQ entry (nanoseconds, monotonic).
/// Set by `irq_time_start()`, consumed by `irq_time_end()` to
/// compute the delta for `irq_time_ns` accumulation.
pub irq_entry_timestamp: u64,
/// Countdown timer for load-balance trigger. Decremented every tick
/// by `scheduler_tick()`. When it reaches zero, `trigger_load_balance()`
/// raises `SCHED_SOFTIRQ`. Reset to `load_balance_interval(rq)` after
/// triggering. Initialized to `load_balance_interval(rq)` at boot.
/// Typical values: 4-32 ticks depending on sched_domain depth.
pub next_balance_tick: u32,
/// The per-CPU idle task. Always runnable; never enqueued in any
/// scheduling class. Returned by `pick_next_task()` when all three
/// class queues are empty. Statically allocated at boot — one per CPU.
pub idle_task: Arc<Task>,
}
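The fields above imply the class ordering that `pick_next_task()` walks. The stub below makes that order explicit; all names are illustrative, and throttling, deferred-entity retry, and CBS EDF ordering are omitted:

```rust
// Hedged stub of the scheduling-class walk implied by RunQueueData:
// RT first, then SCHED_DEADLINE, then CBS servers, then plain EEVDF, else idle.
#[derive(Debug, PartialEq)]
enum PickedClass { Rt, Dl, Cbs, Eevdf, Idle }

fn pick_class(rt_runnable: bool, dl_runnable: bool,
              cbs_runnable: bool, eevdf_runnable: bool) -> PickedClass {
    if rt_runnable { PickedClass::Rt }            // rt: highest-priority class
    else if dl_runnable { PickedClass::Dl }       // dl: EDF, earliest deadline
    else if cbs_runnable { PickedClass::Cbs }     // cbs_servers: checked between DL and EEVDF
    else if eevdf_runnable { PickedClass::Eevdf } // eevdf: main per-CPU tree
    else { PickedClass::Idle }                    // idle_task: always runnable
}
```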
/// Lock two run queues in a deadlock-free order (lower CPU ID first).
///
/// This is the **only** function that may hold two run queue locks simultaneously.
/// The load balancer (work stealing) must call this function instead of acquiring
/// `RunQueue.lock` directly while already holding another run queue's lock.
///
/// # Deadlock prevention — compile-time enforced
///
/// `RunQueue.lock` is a `SpinLock<RunQueueData, LEVEL=50>`. The `SpinLock` type
/// parameter prevents a caller that already holds a level-50 lock from acquiring
/// a second one without going through this function. Calling `lock_two_runqueues`
/// is therefore the only legal path to holding two run queue locks simultaneously,
/// and it always acquires them in CPU-ID order — eliminating ABBA deadlock at
/// compile time rather than relying on code review to uphold a naming convention.
///
/// The old Linux approach (documented rule "always lock min CPU first") has
/// produced real deadlocks in distribution kernels. UmkaOS's type-level enforcement
/// eliminates this class of bug entirely.
pub fn lock_two_runqueues<'a>(
rq_a: &'a RunQueue,
rq_b: &'a RunQueue,
) -> (SpinGuard<'a, RunQueueData, RQ_LOCK_LEVEL>, SpinGuard<'a, RunQueueData, RQ_LOCK_LEVEL>) {
// Guard: same-CPU case would deadlock (locking the same spinlock twice).
// Callers must check before calling. This is a debug_assert rather than
// a runtime branch because all call sites are in the load balancer which
// never steals from itself.
debug_assert!(rq_a.cpu_id != rq_b.cpu_id,
"lock_two_runqueues called with same CPU {} — would deadlock", rq_a.cpu_id);
// Always acquire the run queue whose CPU has the lower numeric ID first.
// Every pair of CPUs has a unique total order under this relation, so no
// two threads can form an acquisition cycle.
if rq_a.cpu_id < rq_b.cpu_id {
let g_a = rq_a.lock.lock();
let g_b = rq_b.lock.lock();
(g_a, g_b)
} else {
let g_b = rq_b.lock.lock();
let g_a = rq_a.lock.lock();
(g_a, g_b)
}
}
/// A reference to a runnable task held by the EEVDF scheduler.
/// Ownership: the scheduler holds one `Arc<Task>` per runnable task.
/// Using this alias makes ownership semantics explicit in data structure definitions.
///
/// **Type safety invariant**: Because `Arc<Task>` implies shared ownership,
/// scheduler methods receive `&Task` (shared reference), not `&mut Task`.
/// All scheduler-mutable fields in `Task` — `vruntime`, `vdeadline`,
/// `on_rq_state`, `cpu`, `sched_entity.*` — use interior mutability
/// (atomic types or fields protected by the per-CPU run queue lock).
/// The run queue lock provides mutual exclusion for non-atomic mutable
/// fields; the `&Task` reference type makes this explicit in the type system.
pub type TaskHandle = Arc<Task>;
/// A generic red-black tree with optional per-subtree augmented fields.
///
/// The EEVDF hot path uses intrusive `EevdfRbLink` (below) with `min_vruntime`
/// and `max_slice` augmentation, and the deadline run queue uses the intrusive
/// `DlRbLink` tree; this generic `RBNode`/`RBTree` type is reserved for
/// cold-path ordered containers.
///
/// # Path Classification: COLD ONLY
///
/// This type heap-allocates nodes via `Box`. It MUST NOT be used on hot
/// paths (scheduler tick, context switch, enqueue/dequeue). Hot-path
/// ordered containers use intrusive `EevdfRbLink` / `IntrusiveRbRoot`.
pub struct RBNode<K, V> {
pub key: K,
pub value: V,
color: RBColor,
left: Option<Box<RBNode<K, V>>>,
right: Option<Box<RBNode<K, V>>>,
/// Augmented field: minimum key in this subtree.
pub min_subtree_key: K,
}
/// Non-augmented red-black tree (standard ordered map).
///
/// # Path Classification: COLD ONLY
///
/// See `RBNode` for allocation constraints. Hot-path trees use intrusive links.
pub struct RBTree<K: Ord, V> {
root: Option<Box<RBNode<K, V>>>,
pub len: usize,
}
#[derive(Clone, Copy, PartialEq)]
enum RBColor { Red, Black }
/// Intrusive red-black tree link embedded in `EevdfTask`. Zero-allocation
/// insert/remove — the link is part of the task struct, not heap-allocated.
/// Keyed by `vdeadline` (virtual deadline) so that a left-descent walk in
/// `pick_eevdf()` naturally visits earlier-deadline tasks first. Augmented
/// with `min_vruntime` for eligibility pruning during `pick_eevdf()`.
pub struct EevdfRbLink {
/// RB-tree key: task's vdeadline (virtual deadline).
///
/// The tree is ordered by vdeadline so that `pick_eevdf()` finds the
/// eligible task with the earliest deadline via a left-descent walk.
/// Tie-breaking: when two tasks have identical vdeadline values, the
/// task with the smaller `TaskId` sorts first (ensures deterministic
/// ordering — `entity_before(a, b) = a.vdeadline < b.vdeadline ||
/// (a.vdeadline == b.vdeadline && a.task_id < b.task_id)`).
pub key: u64,
/// Augmented field: minimum `vruntime` in this subtree. Used by
/// `pick_eevdf()` to prune ineligible subtrees via the division-free
/// `vruntime_eligible(rq, node.min_vruntime)` check. If the subtree's
/// minimum vruntime is not eligible (greater than avg_vruntime), then
/// no entity in the subtree can be eligible.
///
/// Note: this is `vruntime` (not `vdeadline`). The tree is keyed by
/// `vdeadline` for deadline-first selection, but the augmented field
/// tracks `vruntime` for eligibility pruning. Linux equivalent:
/// `sched_entity.min_vruntime` augmented field.
pub min_vruntime: u64,
/// Augmented field: maximum `slice` in this subtree. Used by
/// `cfs_rq_max_slice()` for lag clamp computation. Linux equivalent:
/// `sched_entity.max_slice` augmented field.
pub max_slice: u64,
/// Intrusive RB-tree link pointers (parent, left, right, color).
/// Managed by the intrusive RB-tree implementation; not directly
/// accessed by scheduler logic.
///
/// Uses `Option<NonNull<EevdfRbLink>>` (matching `DlRbLink` pattern)
/// for niche optimization and null-safety. `None` = no child/parent.
///
/// # Safety
///
/// - **Ownership**: The `EevdfTask` struct owns the `EevdfRbLink` via
/// embedding. The link's lifetime is tied to the task.
/// - **Aliasing**: Multiple nodes' `parent`/`left`/`right` point to the
/// same node. No mutable aliasing — tree mutations happen exclusively
/// under `rq.lock` (lock level 50).
/// - **Thread safety**: All pointer dereferences require `rq.lock` held.
parent: Option<NonNull<EevdfRbLink>>,
left: Option<NonNull<EevdfRbLink>>,
right: Option<NonNull<EevdfRbLink>>,
color: RBColor,
}
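The `min_vruntime` augmentation is what lets `pick_eevdf()` skip whole subtrees. The toy model below uses an owned `Box` tree and a plain `vruntime <= avg_vruntime` eligibility test instead of the intrusive, weighted kernel version, but it demonstrates the same pruned left-descent:

```rust
// Toy model (heap-allocated, NOT the intrusive kernel tree): nodes keyed by
// vdeadline, augmented with the subtree-minimum vruntime.
struct Node {
    vdeadline: u64,
    vruntime: u64,
    min_vruntime: u64, // min vruntime over this whole subtree (augmented field)
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

// Earliest-vdeadline node whose vruntime is eligible (<= avg_vruntime).
// In-order position equals vdeadline order, so the first eligible node found
// by this left-first walk is the pick.
fn pick(node: &Option<Box<Node>>, avg_vruntime: u64) -> Option<&Node> {
    let n = node.as_ref()?;
    if n.min_vruntime > avg_vruntime {
        return None; // nothing in this subtree can be eligible — prune it
    }
    if let Some(found) = pick(&n.left, avg_vruntime) {
        return Some(found);
    }
    if n.vruntime <= avg_vruntime {
        return Some(n);
    }
    pick(&n.right, avg_vruntime)
}
```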
/// Root of the intrusive augmented RB-tree containing all runnable tasks.
pub struct IntrusiveAugmentedRbRoot<L> {
/// Root of the augmented red-black tree, or `None` if empty.
/// Consistent with `IntrusiveRbRoot<L>` which also uses
/// `Option<NonNull<L>>` for the nullable tree root.
root: Option<NonNull<L>>,
/// Total node count, maintained on insert/remove for O(1) size queries.
pub len: usize,
}
/// Number of tasks queued on this runqueue, **NOT** including `curr`.
///
/// This is a UmkaOS design choice: Linux's `cfs_rq->nr_queued` INCLUDES `curr`,
/// so Linux threshold `nr_queued == 1` means "only curr, no waiters." UmkaOS
/// `WaiterCount` counts only tasks in the tree (excluding `curr`), so the
/// equivalent check is `!has_waiters()` (count == 0).
///
/// The newtype prevents agents from accidentally using Linux threshold values.
/// Use the semantic methods (`has_waiters()`, `single_waiter()`) instead of
/// raw numeric comparisons.
///
/// **Threshold translation guide** (Linux → UmkaOS):
/// - Linux `nr_queued == 1` (only curr) → `!waiter_count.has_waiters()`
/// - Linux `nr_queued > 1` (peers exist) → `waiter_count.has_waiters()`
/// - Linux `nr_queued <= 1` (no peers) → `!waiter_count.has_waiters()`
pub struct WaiterCount(u32);
impl WaiterCount {
/// True if any task is waiting (curr has peers in the tree).
pub fn has_waiters(&self) -> bool { self.0 > 0 }
/// True if exactly one task is waiting.
pub fn single_waiter(&self) -> bool { self.0 == 1 }
/// Raw count (use sparingly — prefer semantic methods).
pub fn count(&self) -> u32 { self.0 }
/// Increment when a task is enqueued.
pub fn inc(&mut self) { self.0 += 1; }
/// Decrement when a task is dequeued. Debug-asserts non-zero.
pub fn dec(&mut self) { debug_assert!(self.0 > 0); self.0 -= 1; }
}
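A standalone demo of the threshold translation guide (the `WaiterCount` subset is re-declared here so the snippet compiles on its own; `should_check_preempt` is an illustrative name):

```rust
// Minimal re-declaration of WaiterCount for a self-contained example.
struct WaiterCount(u32);
impl WaiterCount {
    fn has_waiters(&self) -> bool { self.0 > 0 }
}

// Linux: `if (cfs_rq->nr_queued > 1) ...` ("curr has peers") translates to:
fn should_check_preempt(wc: &WaiterCount) -> bool {
    wc.has_waiters() // curr has at least one peer in the tree
}
```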
/// Shared vruntime-ordered augmented RB tree with EEVDF accumulators.
///
/// `VruntimeTree` is the base type shared by the root per-CPU EEVDF run queue
/// (`EevdfRunQueue`), CBS bandwidth servers (`CbsCpuServer`), and hierarchical
/// group scheduling entities (`GroupEntity`). It contains the augmented RB tree
/// and the two-accumulator state needed for division-free eligibility checks.
///
/// All EEVDF helper functions that operate only on the tree and accumulators
/// (`avg_vruntime_update`, `entity_key`, `update_zero_vruntime`,
/// `__enqueue_entity`, `__dequeue_entity`) take `&VruntimeTree` or
/// `&mut VruntimeTree` directly. Functions that additionally need the
/// currently-running entity (`avg_vruntime`, `vruntime_eligible`,
/// `pick_eevdf`, `update_curr`, `place_entity`) take `&EevdfRunQueue`,
/// which embeds `VruntimeTree` as its `base` field.
///
/// **Classification**: Nucleus (data). Struct layout is non-replaceable; all
/// Evolvable code (formulas, policy) operates on these fields.
pub struct VruntimeTree {
/// Single intrusive augmented red-black tree of ALL runnable tasks,
/// keyed by `vdeadline` (virtual deadline). Matches the Linux 6.6+
/// `tasks_timeline` design. Each task embeds an `EevdfRbLink` with
/// augmented `min_vruntime` and `max_slice` fields so that
/// `pick_eevdf()` can prune ineligible subtrees in O(log n).
/// Eligibility is computed dynamically during the walk, not stored as
/// tree membership. Zero heap allocation on enqueue/dequeue.
///
/// EEVDF-deferred tasks (`OnRqState::Deferred`) reside in this tree;
/// they are NOT filtered by `pick_eevdf()` during the tree walk.
/// If a deferred entity is picked, the caller (`pick_next_task()`)
/// handles it: dequeue without wake-up, then retry the pick.
/// See `pick_next_task()` deferred-entity handling below.
/// CBS-throttled tasks (`OnRqState::CbsThrottled`) are NOT in this tree.
pub tasks_timeline: IntrusiveAugmentedRbRoot<EevdfRbLink>,
/// Tracking reference point for the `avg_vruntime` two-accumulator
/// computation. Updated on every call to `avg_vruntime()` to track
/// close to the true weighted average. This is NOT the minimum vruntime
/// — it is approximately equal to avg_vruntime after each update.
///
/// Linux equivalent: `cfs_rq->zero_vruntime`. Note: Linux's old CFS
/// `cfs_rq.min_vruntime` was REMOVED when EEVDF replaced CFS. The field
/// `min_vruntime` exists only as a per-node augmented field (subtree
/// minimum), not on the runqueue struct.
///
/// `zero_vruntime` keeps accumulator deltas `(v_i - zero_vruntime)` small,
/// preventing overflow in the `key * weight` products. After
/// `avg_vruntime()` returns, `sum_w_vruntime` is approximately zero
/// (the residual from integer division).
pub zero_vruntime: u64,
/// First avg_vruntime accumulator: Σ(w_i × (v_i − zero_vruntime)) for all
/// enqueued entities (NOT including `curr` — it is added transiently).
/// Updated on every enqueue and dequeue (including deferred). Never
/// divided on the hot path — the division-free eligibility test uses this
/// value directly.
///
/// Linux equivalent: `cfs_rq->sum_w_vruntime` (type `s64`).
///
/// **Overflow analysis**: `zero_vruntime` tracks close to avg_vruntime,
/// so each `(v_i - zero_vruntime)` offset is bounded by approximately
/// `±2 * lag_limit`. The lag limit for a nice -20 task is
/// `calc_delta_fair(max_slice + TICK_NSEC, 88761) ≈ 750_000 + 4_000_000
/// ≈ 4_750_000 ns` (virtual time). So each term is bounded by
/// `88761 * 4_750_000 ≈ 4.2 × 10^11`. With 1000 tasks at maximum weight:
/// `4.2 × 10^14`, well within i64 range `±9.22 × 10^18`.
pub sum_w_vruntime: i64,
/// Second avg_vruntime accumulator: Σ(w_i) for all enqueued entities
/// (NOT including `curr`). Updated in lockstep with `sum_w_vruntime`.
/// Deferred tasks contribute their weight until removed from the tree.
///
/// Linux equivalent: `cfs_rq->sum_weight` (type `unsigned long`).
/// UmkaOS uses `i64` to keep the same signedness as `sum_w_vruntime`
/// for the division-free eligibility multiplication.
///
/// **Overflow analysis**: Maximum per-task weight is 88761. With 1000
/// tasks: `W = 88_761_000`, negligible relative to i64 range.
pub sum_weight: i64,
/// Number of entities on this tree (including deferred, excluding
/// `curr` when used in EevdfRunQueue context). See `WaiterCount` docs
/// for threshold semantics.
/// Linux equivalent: `cfs_rq->nr_queued` (but Linux includes `curr`).
pub nr_queued: WaiterCount,
/// Number of runnable (non-deferred) tasks on this tree.
/// `AtomicU32` so that work-stealing CPUs may read this field lock-free
/// using `Relaxed` ordering — an approximate count is sufficient for
/// steal candidate selection. Updated with `Relaxed` on enqueue and dequeue.
///
/// **Design tradeoff**: AtomicU32 is required for the root per-CPU
/// `EevdfRunQueue` (lock-free work-stealing reads). CBS server
/// (`CbsCpuServer.tree`) and `GroupEntity` sub-trees pay the atomic cost
/// unnecessarily (~5-20 cycles per enqueue/dequeue on x86-64 for
/// `lock xadd` vs plain `add`). This is accepted as the cost of a unified
/// `VruntimeTree` base type — duplicating into atomic and non-atomic
/// variants would increase maintenance burden for marginal benefit. If
/// profiling shows this is a bottleneck, split `VruntimeTree` into
/// `VruntimeTreeAtomic` (root per-CPU) and `VruntimeTreeLocal` (CBS,
/// GroupEntity) variants.
pub task_count: AtomicU32,
}
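The two accumulators exist so eligibility can be tested without a division on the hot path. A standalone sketch (simplified: `curr`'s transient contribution is omitted, and the free function mirrors the field names above):

```rust
// Division-free eligibility sketch: v is eligible iff v <= avg_vruntime,
// where avg = zero_vruntime + sum_w_vruntime / sum_weight. Multiplying both
// sides by sum_weight avoids the division:
//   (v - zero_vruntime) * sum_weight <= sum_w_vruntime
fn vruntime_eligible(sum_w_vruntime: i64, sum_weight: i64, zero_vruntime: u64, v: u64) -> bool {
    let key = v.wrapping_sub(zero_vruntime) as i64; // signed offset from reference point
    key * sum_weight <= sum_w_vruntime
}
```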
/// Per-CPU EEVDF run queue state.
///
/// Embeds a `VruntimeTree` (the shared augmented RB tree and accumulators)
/// plus EEVDF-specific fields that are only meaningful for the root per-CPU
/// run queue: the currently-running entity, the wakeup buddy, and the CBS
/// bandwidth timer.
///
/// **CpuLocal access**: Accessed via `CpuLocal::get::<EevdfRunQueue>()` on the
/// hot path. The per-CPU runqueue is protected by `RunQueue.lock` at level 50
/// in the lock hierarchy ([Section 3.4](03-concurrency.md#cumulative-performance-budget)).
pub struct EevdfRunQueue {
/// Shared vruntime-ordered tree and accumulators. CBS servers and group
/// entities embed `VruntimeTree` directly; the root per-CPU run queue
/// wraps it here with additional EEVDF-specific state.
pub base: VruntimeTree,
/// Currently-running entity on this CPU, or None if the CPU is idle.
/// `curr` is NOT included in `base.sum_w_vruntime`/`base.sum_weight` —
/// it is added transiently by `avg_vruntime()` and `vruntime_eligible()`.
/// This matches Linux's design where `curr` is dequeued from the tree
/// while running.
///
/// **Root-only field**: In CBS sub-trees (`CbsCpuServer.tree`) and group
/// entity sub-trees (`GroupEntity.child_rq`), there is no `curr` — the
/// currently running task is tracked solely by the root per-CPU
/// `EevdfRunQueue.curr`. Those sub-trees embed `VruntimeTree` directly,
/// avoiding this unused field.
///
/// Updated by `pick_next_task()` (set to picked entity) and
/// `put_prev_task()` (set to None before the next pick).
pub curr: Option<NonNull<EevdfTask>>,
/// Cache-affinity wakeup buddy. When a task is woken by `try_to_wake_up()`
/// with `WakeFlags::SYNC`, `next` is set to the wakee. `pick_eevdf()` checks
/// `next` before the tree walk — if `next` is eligible, it is picked
/// immediately (PICK_BUDDY optimization). This improves cache locality for
/// producer-consumer patterns (pipe, futex, unix socket).
///
/// **Root-only field**: CBS and group entity sub-trees do not use wakeup
/// buddies — PICK_BUDDY is applied only at the root per-CPU level.
///
/// Linux equivalent: `cfs_rq->next`.
pub next: Option<NonNull<EevdfTask>>,
/// Timer for CBS bandwidth replenishment checks.
pub bandwidth_timer: HrTimer,
// NOTE: Section 7.8 (Core Provisioning) adds `cgroup_filter: Option<CgroupId>`
// to this struct for CG-core task filtering in `pick_next_task()`.
}
/// Augmented RB-tree invariants for `EevdfRbLink`:
///
/// Note: the tree is keyed by `vdeadline`, but the augmented fields track
/// `vruntime` (subtree minimum) and `slice` (subtree maximum).
///
/// **min_vruntime invariant** (the entity's vruntime stored alongside the
/// key, NOT `node.key` which is vdeadline):
///
/// ```text
/// node.min_vruntime = min(node.vruntime,
/// left.min_vruntime if left exists,
/// right.min_vruntime if right exists)
/// ```
///
/// **max_slice invariant**:
///
/// ```text
/// node.max_slice = max(node.slice,
/// left.max_slice if left exists,
/// right.max_slice if right exists)
/// ```
///
/// Both fields are maintained by the RB-tree rebalancing hooks:
/// every rotation or color change that restructures the tree must call
/// `recompute_augmented()` on each affected node, bottom-up. This is
/// the standard augmented-RB-tree update protocol.
///
/// `pick_eevdf()` exploits `min_vruntime` to prune ineligible subtrees
/// via `vruntime_eligible(rq, subtree.min_vruntime)`. If the subtree's
/// minimum vruntime is not eligible (greater than avg_vruntime), then
/// no entity in the subtree can be eligible — achieving O(log n)
/// selection even when many tasks are ineligible.
///
/// `cfs_rq_max_slice()` uses `max_slice` from the tree root to compute
/// the lag clamp limit without a full tree scan.
/// EEVDF scheduling state embedded in each Task struct.
///
/// This struct is embedded directly in `Task` (via `pub type SchedEntity = EevdfTask`)
/// so that the scheduler hot path can access all scheduling-relevant fields without
/// pointer indirection through `Task`. Fields like `weight`, `sched_class`, and `nice`
/// are here (not in `Task`) because the scheduler dispatch and accounting code reads
/// them on every tick, pick, and enqueue/dequeue — keeping them cache-local with
/// `vruntime` and `vdeadline` avoids a pointer chase on every scheduling decision.
pub struct EevdfTask {
/// Virtual runtime: accumulated CPU consumption in virtual time units.
/// Scales inversely with task weight (higher weight = slower accumulation).
/// Updated by `update_curr()` on every tick and scheduling event.
/// Linux equivalent: `sched_entity.vruntime`.
///
/// **Overflow invariant**: The difference between any entity's `vruntime`
/// and `VruntimeTree.zero_vruntime` must fit in an `i64` (i.e.,
/// `|se.vruntime - tree.zero_vruntime| < i64::MAX`). This is maintained by
/// the `zero_vruntime` normalization which periodically rebases all
/// vruntimes when `zero_vruntime` drifts. The `as i64` casts in
/// `entity_eligible()` and `avg_vruntime()` rely on this invariant.
/// Since `zero_vruntime` tracks the run queue's minimum vruntime and
/// all active entities have bounded lag (clamped by `update_entity_lag()`),
/// the invariant is maintained for any realistic workload.
vruntime: u64,
/// Virtual deadline: `vruntime + calc_delta_fair(slice, weight)`. The task
/// with the minimum vdeadline among eligible tasks is scheduled next.
/// Set by `place_entity()` on wakeup and `update_deadline()` on slice expiry.
/// Linux equivalent: `sched_entity.deadline`.
vdeadline: u64,
/// Virtual lag: `avg_vruntime(cfs_rq) - se.vruntime` at last dequeue.
/// **Unweighted, signed, in virtual-time units** — NOT multiplied by weight.
/// Positive = task has been underserved (owed CPU, eligible).
/// Negative = task over-served (ineligible).
/// Clamped by `update_entity_lag()` to
/// `±calc_delta_fair(cfs_rq_max_slice + TICK_NSEC, se)`.
/// Used by `place_entity()` on wakeup to position the task in virtual time.
/// Preserved across sleep/wake cycles.
/// Linux equivalent: `sched_entity.vlag` (type `s64`).
vlag: i64,
/// Time slice in nanoseconds. Default `sysctl_sched_base_slice` = 750_000
/// (750 us; UmkaOS diverges from Linux's 700 us — see design note above).
/// Configurable per task via `sched_setattr()` by setting `sched_runtime`
/// to a non-zero value in `struct sched_attr`. When `sched_runtime != 0`,
/// `custom_slice = true` and `slice_ns = sched_runtime`. When
/// `custom_slice` is false, `place_entity()` resets this to
/// `sysctl_sched_base_slice` on each wakeup.
/// Linux equivalent: `sched_entity.slice`.
slice_ns: u64,
/// Whether this entity has a custom slice set via `sched_setattr()`.
/// If false, `place_entity()` resets `slice_ns` to the sysctl default.
/// Linux equivalent: `sched_entity.custom_slice`.
custom_slice: bool,
/// Relative deadline flag. Set by `reweight_entity()` when weight changes
/// require deadline recalculation. `place_entity()` converts the relative
/// deadline to absolute: `vdeadline += vruntime`. Cleared after conversion.
/// Linux equivalent: `sched_entity.rel_deadline`.
rel_deadline: bool,
/// Protected slice threshold. `pick_eevdf()` retains `curr` if
/// `curr.vruntime < curr.vprot`, preventing excessive preemption when the
/// current task has not yet consumed a minimum quantum.
/// Set by `set_protect_slice()` on each scheduling event.
/// Linux equivalent: `sched_entity.vprot`.
vprot: u64,
/// Run queue membership state. Uses the four-variant enum to correctly
/// capture both the EEVDF deferred-dequeue state (sleeping but still in
/// the tree) and CBS throttled state (dequeued, awaiting replenishment).
on_rq: OnRqState,
/// True when the task has gone to sleep with negative vlag and is still
/// physically resident in `tasks_timeline` pending vlag decay. While
/// `sched_delayed` is set, the task remains eligible for selection by
/// `pick_eevdf()` (matching Linux, which has no `sched_delayed` filter
/// in the tree walk). The task still contributes its weight to the
/// `avg_vruntime` accumulators.
/// Cleared on wake-up (before re-enqueue) or when vlag decays to zero
/// at pick time and the task is physically removed.
sched_delayed: bool,
/// Wall-clock execution time accumulated by this entity (nanoseconds).
/// Incremented by `delta_exec` in `update_curr()` on every tick and
/// scheduling event. Used by `check_preempt_tick` in `eevdf_task_tick()`
/// to compare against `prev_sum_exec_runtime` for ideal-runtime checks,
/// and by `getrusage(2)` / `/proc/[pid]/sched` for user-visible accounting.
/// Linux equivalent: `sched_entity.sum_exec_runtime`.
sum_exec_runtime: u64,
/// Value of `sum_exec_runtime` at the time the entity was last enqueued.
/// The difference `sum_exec_runtime - prev_sum_exec_runtime` gives the
/// wall-clock time consumed in the current scheduling quantum.
/// **Initialization**: Set to `sum_exec_runtime` on each enqueue
/// (`enqueue_task()` / `set_next_task()` assign
/// `prev_sum_exec_runtime = sum_exec_runtime`). Not updated by
/// `eevdf_task_tick()` or `update_curr()` -- once set at enqueue, it
/// is fixed for the duration of the task's on-CPU residence. The
/// `check_preempt_tick` heuristic in `eevdf_task_tick()` step 3 reads
/// this field to compute `delta_exec`.
/// Linux equivalent: `sched_entity.prev_sum_exec_runtime`.
prev_sum_exec_runtime: u64,
/// Intrusive red-black tree link for the EEVDF `tasks_timeline` tree.
/// Keyed by `vdeadline`, augmented with `min_vruntime` for eligibility
/// pruning in `pick_eevdf()`. Embedded directly in the task struct for
/// zero-allocation enqueue/dequeue on the scheduler hot path.
eevdf_rb_link: EevdfRbLink,
/// Intrusive red-black tree link for the `DlRunQueue` tree. When this
/// task has `sched_class == SchedClass::Deadline`, `dl_rb_link` is
/// inserted into the per-CPU `DlRunQueue.root` tree on enqueue and
/// removed on dequeue. For non-deadline tasks this link remains in the
/// unlinked state (`dl_rb_link.is_linked() == false`) and is never
/// touched by the scheduler. Embedding the link here avoids any heap
/// allocation when inserting into / removing from the DL run queue.
dl_rb_link: DlRbLink,
/// EEVDF weight derived from the nice value via `sched_prio_to_weight[]`.
/// Determines the rate of vruntime accumulation: higher weight → slower
/// accumulation → more CPU share. Updated when nice is changed via
/// `setpriority(2)` or `sched_setattr(2)`. Must be kept in sync with
/// `nice` — the canonical mapping is `weight = sched_prio_to_weight[nice + 20]`.
/// Stored here (not derived on-the-fly) because the scheduler reads it
/// on every `vruntime += delta * NICE_0_WEIGHT / weight` computation.
weight: u32,
/// Per-Entity Load Tracking state. Maintains exponentially-decaying averages
/// of CPU utilisation (`util_avg`), runnability (`runnable_avg`), and weighted
/// load (`load_avg`). Updated on every scheduling state transition (run→sleep,
/// sleep→run, tick). Consumed by EAS for task placement, by cpufreq for
/// frequency selection, and by the load balancer for migration decisions.
/// See the `PeltState` definition above for field-level details.
///
    /// **Long-term precision**: PELT uses fixed-point (u32, 20-bit fractional)
    /// geometric decay with a 32 ms half-life. Inactive entities decay to a
    /// negligible contribution within ~192 ms (6 half-lives, ~1.5% remaining).
    /// Accumulated rounding errors do not drift over time because the decay is
    /// self-correcting — any rounding error in an active entity's load average
    /// is dominated by the next period's fresh contribution.
pelt: PeltState,
/// Latency-nice value (-20 to 19). Controls the EEVDF virtual deadline slack:
/// lower latency_nice → shorter virtual deadline → task is picked sooner among
/// eligible entities at the cost of reduced throughput for co-scheduled tasks.
///
/// **UmkaOS-original extension** — NOT present in Linux mainline. Set via
/// `sched_setattr(2)` with `sched_flags |= SCHED_FLAG_LATENCY_NICE` (0x80)
/// and `sched_latency_nice` field in the extended `sched_attr`. Default 0
/// (no adjustment).
///
/// The effective slice used for deadline calculation:
/// `effective_slice = base_slice_ns * LATENCY_NICE_0_WEIGHT / latency_weight`
/// where `latency_weight = LATENCY_NICE_TO_WEIGHT[(latency_nice + 20)]`.
/// A task with `latency_nice = -20` gets a ~88× shorter effective slice,
/// giving it the earliest deadline among peers.
latency_nice: i32,
/// Scheduling class. Determines which per-CPU queue this task is managed by
/// and which class-specific operations (enqueue/dequeue/pick/tick) apply.
/// The `pick_next_task()` dispatch uses the class priority ordering:
/// Deadline > RtFifo/RtRr > Eevdf > Idle. Changed via `sched_setscheduler(2)`
/// or `sched_setattr(2)`. Stored in `EevdfTask` (not `Task`) because the
/// scheduler dispatch reads it on every pick decision — keeping it cache-local
/// with `vruntime` avoids an extra cache line fetch.
sched_class: SchedClass,
/// Scheduling policy. Encodes the user-visible POSIX policy that the task
/// was configured with. Maps to `SchedClass` but preserves the distinction
/// between policies within the same class (e.g., `UserSchedPolicy::Batch` and
/// `UserSchedPolicy::Normal` both map to `SchedClass::Eevdf` but differ in
/// preemption behavior — Batch tasks are never preempted by newly woken
/// Eevdf peers, only by RT/DL tasks).
sched_policy: UserSchedPolicy,
/// Nice value (-20 to 19). The POSIX nice value set by `setpriority(2)` or
/// `nice(2)`. Determines `weight` via `sched_prio_to_weight[nice + 20]`.
/// Stored alongside `weight` because `getpriority(2)` must return it and
/// `/proc/[pid]/stat` field 19 exposes it. Changing nice updates both `nice`
/// and `weight` atomically under the runqueue lock.
nice: i8,
/// Accumulated RT CPU time in microseconds for RLIMIT_RTTIME enforcement.
/// Incremented on every scheduler tick while the task is running under an
/// RT scheduling class (SCHED_FIFO or SCHED_RR). When this value reaches
/// the task's `rlimit(RLIMIT_RTTIME)`, the kernel sends SIGXCPU. If the
/// task continues running for one additional second, SIGKILL is sent.
/// Reset to zero each time the task voluntarily relinquishes the CPU
/// (blocks, yields, or sleeps). AtomicU64 because it is read by the
/// signal delivery path without holding the runqueue lock.
rt_runtime_us: AtomicU64,
/// Wall-clock timestamp (nanoseconds, monotonic) of the last scheduling
/// accounting update. Set to `rq_clock_task(rq)` by `update_curr()` on
/// every tick/scheduling event, and on enqueue. The delta
/// `rq_clock_task(rq) - exec_start` gives task-only time since the last
/// accounting update (excludes IRQ time). Used by `update_curr()`,
/// `rt_bandwidth_tick()`, and `check_preempt_tick()` for time accounting.
/// Linux equivalent: `sched_entity.exec_start` (uses `rq_clock_task()`).
exec_start: u64,
/// Dirty flag set by cgroup cpu.weight write path
/// (`sched_group_set_weight()`) to signal that this task's GroupEntity
/// weight needs recalculation on the next tick. AtomicBool because the
/// cgroup migration path (cross-CPU) may set this flag while the task
/// is running on a different CPU — a cross-context access that is a
/// data race under the Rust abstract machine with `Cell`.
/// Checked and cleared by `eevdf_task_tick()` step 4.
cgroup_weight_dirty: AtomicBool,
    /// True while the task is marked idle by `sched_idle_enter()`. While set,
    /// `update_curr()` skips vruntime and PELT accumulation for this task
    /// (wall-clock `sum_exec_runtime` still advances). AtomicBool because it
    /// is read by `update_curr()` on the owning CPU while it may be set or
    /// cleared from another context.
    sched_idle_marked: AtomicBool,
    /// Saved `entity_key()` value at `sched_idle_enter()` time. Used by
    /// `sched_idle_exit()` to restore the exact accumulator contribution,
    /// preventing drift from `zero_vruntime` shifts during the idle interval.
    /// Only valid when `sched_idle_marked` is true. AtomicI64 for cross-context
    /// visibility (same rationale as `sched_idle_marked`).
    saved_idle_key: AtomicI64,
}
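For concreteness, the endpoints of the nice-to-weight mapping used by `weight` and `nice` above, taken from the well-known `sched_prio_to_weight[]` values (only three of the 40 entries are shown; `weight_for_nice` is a sketch helper, not a kernel function):

```rust
/// Selected sched_prio_to_weight entries; index is nice + 20.
fn weight_for_nice(nice: i8) -> u32 {
    match nice {
        -20 => 88_761,
        0 => 1_024, // NICE_0_WEIGHT
        19 => 15,
        _ => unimplemented!("only the table endpoints are shown in this sketch"),
    }
}

fn main() {
    assert_eq!(weight_for_nice(0), 1_024);
    // Each nice step changes CPU share by roughly 25%, so the ratio
    // between the extremes and nice 0 is large:
    assert!(weight_for_nice(-20) / weight_for_nice(0) > 80);  // ~87x
    assert!(weight_for_nice(0) / weight_for_nice(19) > 60);   // ~68x
}
```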
// ---------------------------------------------------------------------------
// avg_vruntime — Weighted Average Virtual Runtime
// ---------------------------------------------------------------------------
/// Compute and return the weighted average virtual runtime of all entities
/// on this run queue. This function has a SIDE EFFECT: it updates
/// `zero_vruntime` and adjusts `sum_w_vruntime` so that `zero_vruntime`
/// tracks close to the true weighted average. After this call,
/// `sum_w_vruntime` is approximately zero (the residual from integer
/// division).
///
/// **Classification**: Evolvable (replaceable). Correctness ensured by
/// `EevdfInvariantChecker` before swap is committed. See classification
/// table at top of this section.
///
/// Linux equivalent: `avg_vruntime()` in `fair.c`.
///
/// # Algorithm
///
/// ```text
/// fn avg_vruntime(rq: &mut EevdfRunQueue) -> u64 {
/// let curr = rq.curr;
/// let mut weight: i64 = rq.base.sum_weight;
/// let mut delta: i64 = 0;
///
/// // Only include curr if it is on the runqueue.
/// let curr = match curr {
/// Some(c) if unsafe { c.as_ref() }.on_rq != OnRqState::Off => Some(c),
/// _ => None,
/// };
///
/// if weight > 0 {
/// let mut runtime: i64 = rq.base.sum_w_vruntime;
///
/// // Transiently add curr's contribution (curr is not in the
/// // accumulators while running).
/// if let Some(c) = curr {
/// let se = unsafe { c.as_ref() };
/// let w = se.weight as i64;
/// runtime += entity_key(&rq.base, se) * w;
/// weight += w;
/// }
///
/// // Floor-division bias: round toward negative infinity.
/// // This ensures that avg_vruntime() + 0 always yields
/// // entity_eligible() == true (a task placed exactly at the
/// // average must be eligible). Without this bias, truncation
/// // toward zero could make the average appear slightly too high,
/// // causing a correctly-placed task to be deemed ineligible.
/// if runtime < 0 {
/// runtime -= weight - 1;
/// }
///
/// delta = runtime / weight; // integer division
/// } else if let Some(c) = curr {
/// // When only curr exists (no tree entities), it IS the average.
/// let se = unsafe { c.as_ref() };
/// delta = se.vruntime as i64 - rq.base.zero_vruntime as i64;
/// }
///
/// update_zero_vruntime(&mut rq.base, delta);
/// rq.base.zero_vruntime
/// }
/// ```
///
/// **`entity_key`**: Returns the signed offset of an entity's vruntime from
/// `zero_vruntime`. Takes `&VruntimeTree` — operates only on accumulator state.
/// Linux equivalent: `entity_key()`.
/// ```text
/// fn entity_key(tree: &VruntimeTree, se: &EevdfTask) -> i64 {
/// se.vruntime as i64 - tree.zero_vruntime as i64
/// }
/// ```
///
/// **`update_zero_vruntime`**: Shifts the tracking reference point by `delta`.
/// Takes `&mut VruntimeTree` — operates only on accumulator state.
/// Linux equivalent: `update_zero_vruntime()`.
/// ```text
/// fn update_zero_vruntime(tree: &mut VruntimeTree, delta: i64) {
/// // v' = v + d ==> sum_w_vruntime' = sum_w_vruntime - d * sum_weight
/// tree.sum_w_vruntime -= tree.sum_weight * delta;
/// tree.zero_vruntime = (tree.zero_vruntime as i64 + delta) as u64;
/// }
/// ```
pub fn avg_vruntime(rq: &mut EevdfRunQueue) -> u64 { /* in sched/eevdf.rs */ }
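The `update_zero_vruntime` identity can be exercised standalone: shifting the reference point must leave the reconstructed average `zero_vruntime + sum_w_vruntime / sum_weight` unchanged. A minimal sketch with toy accumulator values chosen so the division is exact:

```rust
struct Acc { zero_vruntime: u64, sum_w_vruntime: i64, sum_weight: i64 }

fn update_zero_vruntime(t: &mut Acc, delta: i64) {
    // v' = v + d  ==>  sum_w_vruntime' = sum_w_vruntime - d * sum_weight
    t.sum_w_vruntime -= t.sum_weight * delta;
    t.zero_vruntime = (t.zero_vruntime as i64 + delta) as u64;
}

fn avg(t: &Acc) -> i64 {
    t.zero_vruntime as i64 + t.sum_w_vruntime / t.sum_weight
}

fn main() {
    // Two entities: (v=1000, w=1024) and (v=1600, w=2048), zero at 1000.
    let mut t = Acc {
        zero_vruntime: 1000,
        sum_w_vruntime: 0 * 1024 + 600 * 2048, // sum of w_i * (v_i - zero)
        sum_weight: 1024 + 2048,
    };
    let before = avg(&t); // 1000 + 1_228_800 / 3072 = 1400
    update_zero_vruntime(&mut t, 400); // move zero onto the average
    assert_eq!(avg(&t), before);       // reconstructed average unchanged
    assert_eq!(t.zero_vruntime, 1400);
    assert_eq!(t.sum_w_vruntime, 0);   // residual is zero after rebasing
}
```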
// ---------------------------------------------------------------------------
// Accumulator maintenance
// ---------------------------------------------------------------------------
/// Update the avg_vruntime accumulators when a task is enqueued or dequeued.
///
/// `avg_vruntime` is maintained without division using two running sums:
///
/// ```text
/// zero_vruntime = tracking reference (approximately = avg_vruntime)
/// sum_w_vruntime = Σ(w_i × (v_i − zero_vruntime))
/// sum_weight = Σ(w_i)
///
/// avg_vruntime = zero_vruntime + sum_w_vruntime / sum_weight
/// ```
///
/// **On enqueue** (task entering the tree, including deferred re-enqueue):
/// ```text
/// tree.sum_w_vruntime += entity_key(tree, se) * se.weight as i64;
/// tree.sum_weight += se.weight as i64;
/// ```
///
/// **On dequeue** (task leaving the tree, including deferred removal):
/// ```text
/// tree.sum_w_vruntime -= entity_key(tree, se) * se.weight as i64;
/// tree.sum_weight -= se.weight as i64;
/// ```
///
/// Linux equivalent: called from `enqueue_entity()` and `dequeue_entity()`.
///
/// Takes `&mut VruntimeTree` — operates only on accumulator state, no access
/// to `curr` or other root-only fields. This allows CBS servers and group
/// entities to call this function directly on their embedded `VruntimeTree`.
///
/// **ML observation point**: After dequeue, emit runtime metrics:
/// ```text
/// observe_kernel!(SubsystemId::Scheduler, SchedObs::RunqueueStats,
/// tree.nr_queued.count(), /* ... */);
/// ```
pub fn avg_vruntime_update(tree: &mut VruntimeTree, se: &EevdfTask, enqueue: bool) {
let key = entity_key(tree, se);
let w = se.weight as i64;
if enqueue {
tree.sum_w_vruntime += key * w;
tree.sum_weight += w;
} else {
tree.sum_w_vruntime -= key * w;
tree.sum_weight -= w;
}
}
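The enqueue/dequeue bookkeeping above can be checked against a brute-force weighted average; a standalone sketch with toy stand-ins for `VruntimeTree` and `EevdfTask`:

```rust
struct Tree { zero_vruntime: u64, sum_w_vruntime: i64, sum_weight: i64 }
struct Se { vruntime: u64, weight: u32 }

fn entity_key(t: &Tree, se: &Se) -> i64 {
    se.vruntime as i64 - t.zero_vruntime as i64
}

fn avg_vruntime_update(t: &mut Tree, se: &Se, enqueue: bool) {
    let key = entity_key(t, se);
    let w = se.weight as i64;
    if enqueue {
        t.sum_w_vruntime += key * w;
        t.sum_weight += w;
    } else {
        t.sum_w_vruntime -= key * w;
        t.sum_weight -= w;
    }
}

fn main() {
    let mut t = Tree { zero_vruntime: 100, sum_w_vruntime: 0, sum_weight: 0 };
    let a = Se { vruntime: 100, weight: 1024 };
    let b = Se { vruntime: 400, weight: 3072 };
    avg_vruntime_update(&mut t, &a, true);
    avg_vruntime_update(&mut t, &b, true);
    // Brute force: (1024*100 + 3072*400) / 4096 = 325.
    assert_eq!(t.zero_vruntime as i64 + t.sum_w_vruntime / t.sum_weight, 325);
    // Dequeue b: the average collapses back to a's vruntime.
    avg_vruntime_update(&mut t, &b, false);
    assert_eq!(t.zero_vruntime as i64 + t.sum_w_vruntime / t.sum_weight, 100);
}
```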
// ---------------------------------------------------------------------------
// Eligibility — division-free O(1) check
// ---------------------------------------------------------------------------
/// Division-free O(1) eligibility check for a specific vruntime value.
///
/// Returns true if `vruntime <= avg_vruntime(rq)`, computed without division:
///
/// ```text
/// vruntime <= zero_vruntime + sum_w_vruntime / sum_weight
/// ⟺ (vruntime - zero_vruntime) * sum_weight <= sum_w_vruntime
/// ```
///
/// **Curr transient inclusion**: The currently-running entity (`curr`) is NOT
/// in `sum_w_vruntime`/`sum_weight`. This function transiently adds `curr`'s
/// contribution before comparing. This matches Linux's `vruntime_eligible()`.
///
/// **Classification**: Evolvable (replaceable). Correctness ensured by
/// `EevdfInvariantChecker` before swap is committed. See classification
/// table at top of this section.
///
/// Linux equivalent: `vruntime_eligible()` in `fair.c`.
///
/// ```text
/// fn vruntime_eligible(rq: &EevdfRunQueue, vruntime: u64) -> bool {
/// let mut avg: i64 = rq.base.sum_w_vruntime;
/// let mut load: i64 = rq.base.sum_weight;
///
/// // Transiently include curr if on the runqueue.
/// if let Some(c) = rq.curr {
/// let se = unsafe { c.as_ref() };
/// if se.on_rq != OnRqState::Off {
/// let w = se.weight as i64;
/// avg += entity_key(&rq.base, se) * w;
/// load += w;
/// }
/// }
///
/// avg >= (vruntime as i64 - rq.base.zero_vruntime as i64) * load
/// }
/// ```
pub fn vruntime_eligible(rq: &EevdfRunQueue, vruntime: u64) -> bool {
/* in sched/eevdf.rs */
}
/// Check whether a specific entity is eligible.
///
/// Delegates to `vruntime_eligible(rq, se.vruntime)`.
///
/// Linux equivalent: `entity_eligible()` in `fair.c`.
///
/// ```text
/// fn entity_eligible(rq: &EevdfRunQueue, se: &EevdfTask) -> bool {
/// vruntime_eligible(rq, se.vruntime)
/// }
/// ```
pub fn entity_eligible(rq: &EevdfRunQueue, se: &EevdfTask) -> bool {
vruntime_eligible(rq, se.vruntime)
}
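The division-free rewrite can be cross-checked against the divide form across the eligibility boundary; a standalone sketch with toy accumulator values (`curr` omitted for brevity, and the reference form uses f64 division that is exact for these small integers):

```rust
struct Acc { zero_vruntime: u64, sum_w_vruntime: i64, sum_weight: i64 }

/// Division-free form:
///   vruntime <= zero + sum_w/W  <=>  (vruntime - zero) * W <= sum_w
fn eligible_mul(t: &Acc, vruntime: u64) -> bool {
    t.sum_w_vruntime >= (vruntime as i64 - t.zero_vruntime as i64) * t.sum_weight
}

/// Reference form using explicit division.
fn eligible_div(t: &Acc, vruntime: u64) -> bool {
    let avg = t.zero_vruntime as f64 + t.sum_w_vruntime as f64 / t.sum_weight as f64;
    (vruntime as f64) <= avg
}

fn main() {
    // avg = 1000 + 921_600 / 4096 = 1225 exactly.
    let t = Acc { zero_vruntime: 1000, sum_w_vruntime: 921_600, sum_weight: 4_096 };
    for v in 1200..1250 {
        assert_eq!(eligible_mul(&t, v), eligible_div(&t, v), "disagree at v={v}");
    }
    assert!(eligible_mul(&t, 1225));  // exactly at the average: eligible
    assert!(!eligible_mul(&t, 1226)); // one past it: ineligible
}
```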
// ---------------------------------------------------------------------------
// update_entity_lag — Lag Tracking
// ---------------------------------------------------------------------------
/// Update and clamp the entity's virtual lag on dequeue.
///
/// Called on every dequeue (sleep, yield, preemption). Stores the virtual lag
/// (`vlag = V - v_i`) for use by `place_entity()` on the next wakeup.
///
/// **Classification**: Evolvable (replaceable). Correctness ensured by
/// `EevdfInvariantChecker` before swap is committed. See classification
/// table at top of this section.
///
/// Linux equivalent: `update_entity_lag()` in `fair.c`.
///
/// ```text
/// fn update_entity_lag(rq: &mut EevdfRunQueue, se: &mut EevdfTask) {
/// debug_assert!(se.on_rq != OnRqState::Off);
///
/// let vlag: i64 = avg_vruntime(rq) as i64 - se.vruntime as i64;
/// let limit: i64 = calc_delta_fair(
/// cfs_rq_max_slice(rq) + TICK_NSEC, se.weight
/// ) as i64;
///
/// se.vlag = vlag.clamp(-limit, limit);
/// }
/// ```
///
/// **`cfs_rq_max_slice`**: Returns the maximum `slice_ns` across the tree root's
/// `max_slice` augmented field and `curr` (if on_rq). Linux equivalent:
/// `cfs_rq_max_slice()`.
///
/// ```text
/// fn cfs_rq_max_slice(rq: &EevdfRunQueue) -> u64 {
/// let mut max_slice: u64 = 0;
/// if let Some(c) = rq.curr {
/// let se = unsafe { c.as_ref() };
/// if se.on_rq != OnRqState::Off {
/// max_slice = se.slice_ns;
/// }
/// }
/// if let Some(root) = rq.base.tasks_timeline.root() {
/// max_slice = max_slice.max(root.max_slice);
/// }
/// max_slice
/// }
/// ```
pub fn update_entity_lag(rq: &mut EevdfRunQueue, se: &mut EevdfTask) {
/* in sched/eevdf.rs */
}
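The clamp arithmetic can be sketched with `calc_delta_fair` implemented as the inverse-weight scaling used throughout this chapter (`delta * NICE_0_WEIGHT / weight`); the 1 kHz `TICK_NSEC` here is an assumption for illustration:

```rust
const NICE_0_WEIGHT: u64 = 1_024;
const TICK_NSEC: u64 = 1_000_000; // assumed 1 kHz tick for this sketch

/// Wall-clock delta -> virtual-time delta, scaled inversely by weight.
fn calc_delta_fair(delta_ns: u64, weight: u32) -> u64 {
    delta_ns * NICE_0_WEIGHT / weight as u64
}

fn main() {
    // Nice-0 task: virtual time advances at wall-clock rate.
    assert_eq!(calc_delta_fair(750_000, 1_024), 750_000);
    // Double-weight task accumulates vruntime at half the rate.
    assert_eq!(calc_delta_fair(750_000, 2_048), 375_000);

    // update_entity_lag clamp:
    //   |vlag| <= calc_delta_fair(cfs_rq_max_slice + TICK_NSEC, weight)
    let max_slice = 750_000u64;
    let limit = calc_delta_fair(max_slice + TICK_NSEC, 1_024) as i64;
    assert_eq!(limit, 1_750_000);
    let vlag: i64 = 5_000_000; // wildly underserved: clamped to the limit
    assert_eq!(vlag.clamp(-limit, limit), 1_750_000);
}
```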
// ---------------------------------------------------------------------------
// place_entity — Wakeup Placement
// ---------------------------------------------------------------------------
/// Position a waking (or newly forked) entity in virtual time.
///
/// This is the most critical function in EEVDF — it determines where a task
/// appears in the virtual timeline on every wakeup, fork, and reweight. It
/// MUST match Linux's mathematical semantics for fairness correctness.
///
/// **Classification**: Evolvable (replaceable). The lag inflation formula and
/// initial placement policy are swappable via `EvolvableComponent`. The ML
/// framework can tune the effective slice via `ParamId::SchedEevdfWeightScale`.
///
/// Linux equivalent: `place_entity()` in `fair.c`.
///
/// # Algorithm
///
/// ```text
/// fn place_entity(rq: &mut EevdfRunQueue, se: &mut EevdfTask, flags: EnqueueFlags) {
/// // Step 1: Get the current weighted average (also updates zero_vruntime).
/// let vruntime: u64 = avg_vruntime(rq);
/// let mut lag: i64 = 0;
///
/// // Step 2: Reset slice to sysctl default unless custom.
/// if !se.custom_slice {
/// // ML-tunable slice via ParamId::SchedEevdfWeightScale.
/// let weight_scale = PARAM_STORE.get(ParamId::SchedEevdfWeightScale)
/// .map_or(100, |p| p.current.load(Relaxed));
/// se.slice_ns = SYSCTL_SCHED_BASE_SLICE * weight_scale as u64 / 100;
/// }
/// let vslice: u64 = calc_delta_fair(se.slice_ns, se.weight);
///
/// // Step 3: PLACE_LAG — restore saved virtual lag with inflation.
/// // Only applies if the queue has entities and the entity has saved lag.
/// if rq.base.nr_queued.has_waiters() && se.vlag != 0 {
/// lag = se.vlag;
///
/// // Lag inflation: compensate for the effect of adding this entity on V.
/// // Adding a task with positive vlag moves V backwards, reducing the
/// // effective lag. The inflation formula:
/// // inflated_lag = vlag * (W + w_i) / W
/// // prevents lag "evaporation" on sleep/wake cycles. Without this,
/// // frequently-sleeping interactive tasks gradually lose their
/// // accumulated service credits — a systematic fairness bias.
/// //
/// // Linux equivalent: the PLACE_LAG block in place_entity().
/// let mut load: i64 = rq.base.sum_weight;
/// if let Some(c) = rq.curr {
/// let curr_se = unsafe { c.as_ref() };
/// if curr_se.on_rq != OnRqState::Off {
/// load += curr_se.weight as i64;
/// }
/// }
///
/// // lag *= (load + w_i) / load
/// lag = lag * (load + se.weight as i64);
/// debug_assert!(load > 0, "sum_weight == 0 with nr_queued.has_waiters(): corrupted runqueue");
/// if load == 0 { load = 1; } // defensive fallback prevents div-by-zero in release
/// lag = lag / load;
/// }
///
/// // Step 4: Set vruntime. A task with positive lag (owed CPU) gets
/// // vruntime < avg_vruntime (eligible). A task with negative lag
/// // (over-served) gets vruntime > avg_vruntime (ineligible).
/// se.vruntime = (vruntime as i64 - lag) as u64;
///
/// // Step 5: Handle relative deadline (from reweight).
/// if se.rel_deadline {
/// se.vdeadline += se.vruntime;
/// se.rel_deadline = false;
/// return;
/// }
///
/// // Step 6: Initial placement (fork) — start with half a slice to ease
/// // into the competition. Existing tasks are, on average, halfway
/// // through their slice.
/// let effective_vslice = if flags.contains(EnqueueFlags::ENQUEUE_INITIAL) {
/// vslice / 2
/// } else {
/// vslice
/// };
///
/// // Step 7: Set virtual deadline.
/// // EEVDF: vd_i = ve_i + r_i/w_i (in virtual time units).
/// se.vdeadline = se.vruntime + effective_vslice;
///
/// // ML observation: emit placement metrics.
/// // observe_kernel!(SubsystemId::Scheduler, SchedObs::TaskWoke,
/// // se.cgroup_id, latency_ns as i32, rq.base.nr_queued.count(), prev_cpu);
/// }
/// ```
///
/// **Error paths**: `place_entity()` cannot fail — all inputs are bounded by
/// the lag clamp and the `sysctl_sched_base_slice` range. Division by zero
/// is prevented by the `load == 0` guard (defensive; `sum_weight == 0`
/// implies an empty runqueue, and `place_entity` is called when enqueueing
/// into a non-empty queue).
pub fn place_entity(
rq: &mut EevdfRunQueue,
se: &mut EevdfTask,
flags: EnqueueFlags,
) { /* in sched/eevdf.rs */ }
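The claim that the inflation formula prevents lag evaporation can be verified with exact integers: after the task is placed at `V - inflated_lag` and its weight joins the queue, its lag relative to the new average equals the saved lag. A sketch with toy values chosen so every division is exact:

```rust
/// place_entity() step 3: inflated_lag = lag * (W + w) / W.
fn inflate_lag(lag: i64, w_queue: i64, w_task: i64) -> i64 {
    lag * (w_queue + w_task) / w_queue
}

fn main() {
    // Queue before wakeup: total weight W = 100, weighted average V = 1000.
    let (w_queue, v_avg) = (100i64, 1000i64);
    // Waking task: weight w = 50, saved (uninflated) lag = 10.
    let (w_task, lag) = (50i64, 10i64);

    let inflated = inflate_lag(lag, w_queue, w_task);
    assert_eq!(inflated, 15); // 10 * 150 / 100

    // Step 4 placement: vruntime = V - inflated_lag.
    let v_task = v_avg - inflated; // 985

    // New weighted average once the task's weight is counted:
    let v_avg_new = (w_queue * v_avg + w_task * v_task) / (w_queue + w_task);
    assert_eq!(v_avg_new, 995);

    // Effective lag after insertion equals the saved lag: no evaporation.
    assert_eq!(v_avg_new - v_task, lag);
}
```

Without the inflation (placing at `V - lag` directly), the same arithmetic yields an effective lag of `lag * W / (W + w)`, which shrinks on every sleep/wake cycle — the systematic bias the comment above describes.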
// ---------------------------------------------------------------------------
// update_curr — Core Accounting Loop
// ---------------------------------------------------------------------------
/// Update the currently running entity's virtual runtime, deadline, PELT,
/// and CBS budget. This is the most frequently called EEVDF function — it
/// runs on EVERY scheduler tick and EVERY voluntary dequeue.
///
/// **Classification**: Evolvable. The vruntime advance formula, PELT update,
/// CBS charge path, and preemption check are all policy code, hot-swappable
/// via `EvolvableComponent`. The `EevdfRunQueue` and `EevdfTask` struct
/// layouts are Nucleus (data). The invariant checker validates that any
/// replacement `update_curr()` preserves weight-proportional vruntime
/// advance and that `vruntime` is monotonically non-decreasing for the
/// currently running entity.
///
/// Linux equivalent: `update_curr()` in `fair.c`.
///
/// # Algorithm
///
/// ```text
/// fn update_curr(rq: &mut EevdfRunQueue) {
/// let curr = match rq.curr {
/// Some(c) => unsafe { c.as_mut() },
/// None => return, // No current entity (idle CPU).
/// };
///
/// // Step 1: Compute execution delta using the runqueue task clock.
/// // rq_clock_task(rq) is the runqueue-cached clock that excludes IRQ
/// // time, updated once per scheduler entry by update_rq_clock().
/// // This matches Linux's update_curr() which uses rq_clock_task(rq),
/// // NOT sched_clock() or ktime_get_ns(). Using the task clock avoids
/// // charging IRQ time to the running task's vruntime.
/// let now = rq_clock_task(rq);
/// let delta_exec = now as i64 - curr.exec_start as i64;
/// if delta_exec <= 0 { return; } // Clock went backwards / no time elapsed.
/// curr.exec_start = now;
///
/// // Step 1a: Accumulate wall-clock execution time (used by
/// // check_preempt_tick, getrusage, /proc/[pid]/sched).
/// curr.sum_exec_runtime += delta_exec as u64;
///
/// // Step 2: Skip vruntime accumulation for idle-marked tasks.
/// // See sched_idle_enter() / sched_idle_exit().
/// // NOTE: sum_exec_runtime continues accumulating during idle-spin
/// // (step 1a runs before this check) because the task IS consuming
/// // CPU time — getrusage() and /proc/[pid]/sched should reflect
/// // wall-clock CPU usage. Only vruntime and PELT are frozen.
/// if curr.sched_idle_marked.load(Relaxed) { return; }
///
/// // Step 3: Advance vruntime by wall-clock delta scaled by weight.
/// // NOTE: The avg_vruntime accumulators (`sum_w_vruntime`,
/// // `sum_weight`) are NOT updated here. They use curr's stale
/// // vruntime from the last enqueue/dequeue, not the updated
/// // value. This is correct (matches Linux): accumulator updates
/// // are performed on enqueue/dequeue via avg_vruntime_add/sub.
/// // Updating them on every tick would break the O(1) tick path
/// // (accumulator updates involve subtraction and re-addition of
/// // key * weight). The stale contribution is corrected when the
/// // entity is dequeued, and `avg_vruntime()` compensates for
/// // curr by subtracting the stale and adding the current value.
/// curr.vruntime += calc_delta_fair(delta_exec as u64, curr.weight);
///
/// // Step 3b: Propagate vruntime to ancestor GroupEntities (hierarchical
/// // EEVDF). Walk from the task's innermost cgroup up to the root cgroup,
/// // advancing each GroupEntity's vruntime in its parent's EEVDF tree by
/// // delta_exec_ns * NICE_0_WEIGHT / group_weight. All entities are on
/// // this CPU's runqueue — no cross-CPU locking needed.
/// // See [Section 7.2](#heterogeneous-cpu-scheduling--virtual-runtime-propagation).
/// // `group_entity_for(curr)` is an XArray lookup:
/// // `rq.group_entities.get(curr.cgroup_id)` -> Option<&GroupEntity>
/// // `propagate_group_vruntime()` is fully specified in
/// // [Section 7.2](#heterogeneous-cpu-scheduling--virtual-runtime-propagation):
/// // walks from innermost cgroup to root, advancing each GroupEntity's
/// // vruntime by calc_delta_fair(delta_ns, group_weight).
/// if let Some(ge) = group_entity_for(curr) {
/// propagate_group_vruntime(ge, delta_exec as u64);
/// }
///
/// // Step 4: Update deadline if slice expired. Returns true when the
/// // entity's deadline was renewed (meaning a reschedule check is needed).
/// let deadline_renewed = update_deadline(rq, curr);
///
/// // Step 5: Update PELT for the current entity (running=true, runnable=true).
/// curr.pelt.update(/* running */ true, /* runnable */ true, now, curr.weight as u64);
///
/// // Step 6: CBS budget charge (if task is in a CBS-guaranteed cgroup).
/// // See [Section 7.6](#cpu-bandwidth-guarantees--cbs-charge).
/// if let Some(server) = cbs_server_for(curr) {
/// cbs_charge(server, &cbs_config_for(curr), delta_exec as u64);
/// }
///
/// // Step 6b: cpu.max ceiling charge (if task's cgroup has cpu.max set).
/// // See [Section 7.6](#cpu-bandwidth-guarantees--cpumax-ceiling-enforcement-bandwidth-throttling).
/// // `charge_cpu_max()` accepts nanoseconds (matching CBS budget_remaining_ns
/// // to avoid the persistent truncation bias of `delta_exec / 1000`).
/// if let Some(bw) = cpu_bandwidth_for(curr) {
/// if bw.quota_ns != u64::MAX {
/// charge_cpu_max(curr, bw, delta_exec as u64);
/// }
/// }
///
/// // Step 7: Fast path — only curr on the rq, no preemption needed.
/// // UmkaOS nr_queued excludes curr, so "no waiters" == "only curr running".
/// if !rq.base.nr_queued.has_waiters() { return; }
///
/// // Step 8: Preemption check. Request reschedule if the current entity's
/// // deadline expired or it has exhausted its protected slice.
/// if deadline_renewed || !protect_slice(curr) {
/// resched_curr(rq, ReschedUrgency::Lazy);
/// }
///
/// // ML observation: emit per-tick scheduling metrics (gated by static key).
/// // observe_kernel!(SubsystemId::Scheduler, SchedObs::UpdateCurr,
/// // delta_exec, curr.vruntime, rq.base.nr_queued.count());
/// }
/// ```
///
/// **Error paths**: `update_curr()` cannot fail. All arithmetic is bounded:
/// `delta_exec` is clamped to non-negative by the guard in Step 1,
/// `calc_delta_fair` uses u128 intermediate to prevent overflow, and PELT
/// sums are bounded by `LOAD_AVG_MAX`. CBS charge uses saturating arithmetic.
///
/// **Locking**: Called with the local CPU's `rq.lock` held. All fields
/// accessed are either local to the current CPU's runqueue (no contention)
/// or use interior mutability (AtomicXX fields in `CbsCpuServer`).
pub fn update_curr(rq: &mut EevdfRunQueue) { /* in sched/eevdf.rs */ }
// ---------------------------------------------------------------------------
// update_deadline — Slice Expiry
// ---------------------------------------------------------------------------
/// Update the entity's virtual deadline when its slice expires.
///
/// **Precondition**: Called only from `update_curr()` after `se.vruntime`
/// has been advanced by `calc_delta_fair(delta_exec, se)`. Must not be
/// called independently — the `vruntime >= vdeadline` check assumes
/// vruntime reflects the entity's current accumulated execution.
///
/// Called by `update_curr()` when `se.vruntime >= se.vdeadline`. Assigns a
/// new slice and recomputes the deadline from the current vruntime.
///
/// Linux equivalent: `update_deadline()` in `fair.c`.
///
/// Returns `true` if a new deadline was assigned (slice expired), `false` if
/// the entity's current deadline has not yet been reached.
///
/// ```text
/// fn update_deadline(rq: &EevdfRunQueue, se: &mut EevdfTask) -> bool {
/// if se.vruntime < se.vdeadline { return false; }
///
/// if !se.custom_slice {
/// let weight_scale = PARAM_STORE.get(ParamId::SchedEevdfWeightScale)
/// .map_or(100, |p| p.current.load(Relaxed));
/// se.slice_ns = SYSCTL_SCHED_BASE_SLICE * weight_scale as u64 / 100;
/// }
/// se.vdeadline = se.vruntime + calc_delta_fair(se.slice_ns, se.weight);
///
/// // Update weighted average virtual runtime after deadline renewal.
/// // Linux `update_deadline()` calls `avg_vruntime(cfs_rq)` here to
/// // rebase zero_vruntime after the entity's vruntime has advanced past
/// // the old deadline. Without this call, entity_key() deltas grow,
/// // reducing precision of the division-free eligibility check.
/// avg_vruntime(rq);
///
/// // Update the protect slice threshold.
/// // NOTE: UmkaOS deliberately refreshes vprot on every slice expiry
/// // (via update_deadline()), NOT only at context-switch time as Linux
/// // does (Linux calls set_protect_slice() from set_next_entity()).
/// // Rationale: per-slice-expiry refresh provides tighter protection
/// // tracking for long-running tasks that span many slices without
/// // being context-switched out. The formula also diverges from Linux
/// // — see set_protect_slice() documentation below.
/// set_protect_slice(se);
/// true
/// }
/// ```
pub fn update_deadline(rq: &EevdfRunQueue, se: &mut EevdfTask) -> bool {
/* in sched/eevdf.rs */
}
/// Set the protected slice threshold for an entity. `pick_eevdf()` retains
/// `curr` if `curr.vruntime < curr.vprot`, preventing excessive preemption.
///
/// **UmkaOS-original improvement over Linux.**
///
/// **Linux** (`kernel/sched/fair.c set_protect_slice()`, called from
/// `set_next_entity()`): with `RUN_TO_PARITY` enabled (default-true per
/// `features.h`: `SCHED_FEAT(RUN_TO_PARITY, true)`):
/// 1. `slice = cfs_rq_min_slice(cfs_rq)` -- the minimum slice of all
/// enqueued entities.
/// 2. `slice = min(slice, se->slice)`.
/// 3. If `slice != se->slice` (i.e., a shorter-slice entity exists on the
/// runqueue), compute `vprot = min(se->deadline, se->vruntime +
/// calc_delta_fair(slice, se))`.
/// 4. If `slice == se->slice` (the common case when all tasks use the
/// default base slice), `vprot = se->deadline` -- the FULL virtual
/// slice is protected.
/// Without `RUN_TO_PARITY`: `slice = min(base_slice_ns, se->slice)`, which
/// similarly protects the full slice for default-slice tasks.
///
/// **UmkaOS**: configurable fraction (default 50%) of each task's virtual
/// slice. This makes UmkaOS **more preemptive** than Linux under default
/// settings for same-slice tasks. The tradeoff is deliberate: server
/// throughput workloads benefit from shorter protection windows, and the
/// fraction is ML-tunable per-cgroup.
///
/// Tradeoff: Linux's full-deadline protection for same-slice tasks provides
/// complete run-to-completion guarantees within a slice. UmkaOS's fractional
/// protection enables more frequent preemption opportunities, reducing
/// tail latency at the cost of throughput. The fraction is tunable:
/// `pct=20` for latency-sensitive workloads, `pct=70` for throughput.
///
/// **Tunability (two levels)**:
/// 1. **Boot parameter**: `umka.sched.protect_pct=50` (system-wide default,
/// range 10-90). Sysadmin sets once: 70 for database clusters, 20 for
/// latency-sensitive trading systems.
/// 2. **ML-tunable per-cgroup**: `ParamId::SchedProtectFraction` (default
/// from boot param, range [10, 90]). ML framework adapts per-cgroup at
/// runtime: batch cgroups get 70% (more throughput), latency-sensitive
/// cgroups get 20% (more responsive).
///
/// **Classification**: Evolvable. The fraction is policy, hot-swappable.
///
/// ```text
/// fn set_protect_slice(se: &mut EevdfTask) {
/// let pct = PARAM_STORE.get(ParamId::SchedProtectFraction)
/// .map_or(BOOT_PROTECT_PCT.load(Relaxed), |p| p.current.load(Relaxed));
/// // Clamp to [10, 90] to prevent degenerate behavior.
/// let pct = pct.clamp(10, 90);
/// let vslice = calc_delta_fair(se.slice_ns, se.weight);
/// let protect_vtime = vslice * pct as u64 / 100;
/// se.vprot = se.vdeadline - protect_vtime;
/// }
/// ```
///
/// Default: `BOOT_PROTECT_PCT = 50` (protect half the virtual slice).
/// At `pct=50`, a nice-0 task with 750μs slice has ~375μs of protection
/// (virtual time). A nice -20 task's proportionally smaller vslice gives
/// proportionally smaller but still meaningful protection. Setting `pct=20`
/// approaches Linux-like behavior (short protection, fast preemption).
fn set_protect_slice(se: &mut EevdfTask) { /* in sched/eevdf.rs */ }
// ---------------------------------------------------------------------------
// pick_eevdf — Eligible Earliest Virtual Deadline First Selection
// ---------------------------------------------------------------------------
/// Select the eligible entity with the earliest virtual deadline in O(log n).
///
/// **Classification**: Evolvable (replaceable). The tie-breaking logic,
/// PICK_BUDDY optimization, and protect_slice check are policy decisions
/// that can be hot-swapped. The tree walk structure and eligibility check
/// are Nucleus-correct.
///
/// `tasks_timeline` is a single intrusive augmented RB-tree keyed by
/// `vdeadline`. Each `EevdfRbLink` caches `min_vruntime` — the minimum
/// `vruntime` in that subtree. The walk uses `vruntime_eligible()` on
/// subtree `min_vruntime` fields to prune ineligible subtrees.
///
/// Linux equivalent: `__pick_eevdf()` in `fair.c`.
///
/// # Algorithm
///
/// ```text
/// fn pick_eevdf(rq: &mut EevdfRunQueue) -> Option<&EevdfTask> {
/// let mut node = rq.base.tasks_timeline.root();
/// let se = pick_first_entity(rq); // leftmost (earliest deadline)
/// let mut curr = rq.curr;
/// let mut best: Option<&EevdfTask> = None;
///
/// // Fast path: single entity (only curr, no waiters in tree).
/// if !rq.base.nr_queued.has_waiters() {
/// return match curr {
/// Some(c) if unsafe { c.as_ref() }.on_rq != OnRqState::Off =>
/// Some(unsafe { c.as_ref() }),
/// _ => se,
/// };
/// }
///
/// // PICK_BUDDY: prefer cache-affinity wakeup buddy if eligible.
/// // This is an Evolvable heuristic — can be disabled or replaced.
/// // Linux WARN_ON_ONCE(cfs_rq->next->sched_delayed): the buddy is set
/// // during wakeup, so it must not be delayed. UmkaOS uses debug_assert.
/// if let Some(next) = rq.next {
/// let next_se = unsafe { next.as_ref() };
/// debug_assert!(!next_se.sched_delayed, "buddy must not be sched_delayed");
/// if entity_eligible(rq, next_se) {
/// return Some(next_se);
/// }
/// }
///
/// // Filter curr: only consider if on_rq and eligible.
/// let curr_ref = curr.and_then(|c| {
/// let se = unsafe { c.as_ref() };
/// if se.on_rq != OnRqState::Off && entity_eligible(rq, se) {
/// Some(se)
/// } else {
/// None
/// }
/// });
///
/// // protect_slice: if curr has not exhausted its minimum quantum, retain it.
/// // This prevents excessive preemption with very short slices.
/// // Evolvable — the threshold can be tuned by ML policy.
/// if let Some(c) = curr_ref {
/// if c.vruntime < c.vprot {
/// return Some(c);
/// }
/// }
///
/// // Leftmost shortcut: if the earliest-deadline entity is eligible, pick it.
/// if let Some(leftmost) = se {
/// if entity_eligible(rq, leftmost) {
/// best = Some(leftmost);
/// }
/// }
///
/// // Tree walk: find the earliest-deadline eligible entity.
/// // The walk is ITERATIVE (not recursive) — UmkaOS improvement over
/// // potential stack depth issues.
/// if best.is_none() {
/// while let Some(n) = node {
/// let left = n.left;
///
/// // If left subtree contains eligible entities, descend left
/// // (earlier deadlines are always better).
/// if let Some(l) = left {
/// if vruntime_eligible(rq, l.min_vruntime) {
/// node = Some(l);
/// continue;
/// }
/// }
///
/// // Left subtree empty or no eligible entities. Check current node.
/// let node_se = task_from_link(n);
/// if entity_eligible(rq, node_se) {
/// best = Some(node_se);
/// break;
/// }
///
/// // Current node not eligible. Try right subtree.
/// node = n.right;
/// }
/// }
///
/// // Compare tree result against curr. If curr has an earlier deadline
/// // than best, prefer curr (it is already running — no context switch).
/// match (best, curr_ref) {
/// (None, c) => c,
/// (Some(b), Some(c)) if entity_before(c, b) => Some(c),
/// (b, _) => b,
/// }
/// }
/// ```
///
/// **`entity_before`**: Compares two entities by deadline (lower = earlier).
/// Tie-breaking by TaskId for determinism.
/// ```text
/// fn entity_before(a: &EevdfTask, b: &EevdfTask) -> bool {
/// a.vdeadline < b.vdeadline
/// || (a.vdeadline == b.vdeadline && a.task_id < b.task_id)
/// }
/// ```
///
/// # Invariant maintenance
///
/// Every tree mutation (insert, delete, rotation) must call
/// `recompute_augmented()` on affected nodes bottom-up. Violating this
/// causes `pick_eevdf()` to return a suboptimal or incorrect result.
///
/// **ML observation point**: After pick, emit task selection metrics:
/// ```text
/// observe_kernel!(SubsystemId::Scheduler, SchedObs::PreemptionEvent,
/// picked_task_prio, preempted_task_prio, cgroup_id);
/// ```
pub fn pick_eevdf(rq: &mut EevdfRunQueue) -> Option<&EevdfTask> {
/* in sched/eevdf.rs */
}
7.1.3 Key Properties¶
- Preemptible locks by default: Mutexes and rwlocks are always sleeping locks with
priority inheritance. Under `PreemptionModel::Realtime` (Section 8.4), spinlocks also become sleeping locks (RT-safe). Under `Voluntary` and `Full` preemption modes, `SpinLock` is a true spinlock that disables preemption for its critical section — but all spinlock-protected critical sections are bounded and O(1) in duration. Per-CPU data is protected by short IRQ-disabling guards (`PerCpuMutGuard`) that hold for bounded durations only — never across blocking operations. There are no unbounded preemption-disabled regions.
- NUMA-aware load balancing: The load balancer models migration cost (cache invalidation, memory latency) and only migrates tasks when the imbalance exceeds the migration cost. The topology hierarchy is architecture-dependent:
- x86-64: Core → LLC (L3) → Package → NUMA node (2-3 levels typical)
- AArch64: Core → Cluster (DynamIQ/big.LITTLE) → Package → NUMA node
- PPC64LE: Thread → Core → Chip → Drawer → Node (POWER9/10 can have 4 levels)
- s390x: Thread → Core → Book → Drawer → Node (5 scheduling domain levels — z15 up to 190 PUs, z16 up to 200 PUs; migration cost increases sharply at each boundary due to private L2/L3/L4 caches and interconnect hop penalties)
- RISC-V: Core → Cluster → NUMA node (DT-described, varies per SoC)
- LoongArch64: Core → Package → NUMA node (3A5000/6000 topology)
The scheduler builds a SchedDomain hierarchy at boot from firmware topology data
(ACPI SRAT/SLIT on x86/ARM server, DT on embedded/RISC-V, STSI instruction on s390x).
Each level has a migration cost threshold — work stealing only crosses a boundary when
the imbalance exceeds that level's threshold. On s390x, the 5-level topology means the
scheduler must distinguish intra-book steals (fast, shared L4) from inter-drawer steals
(slow, remote memory), using the topology distance from STSI to set per-level migration
cost thresholds.
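The per-boundary gating described above can be sketched as follows. `DomainLevelView`, the level names, and the threshold values are illustrative assumptions; the real thresholds are derived from firmware topology distances (SRAT/SLIT, DT, STSI) at boot, not hard-coded:

```rust
/// One level of the SchedDomain hierarchy (illustrative sketch).
struct DomainLevelView {
    name: &'static str,
    /// Minimum task imbalance required to migrate across this boundary,
    /// set from firmware topology distance at boot.
    migration_threshold: u32,
}

/// A steal that crosses several boundaries must clear every threshold
/// on the path (equivalently, the largest one).
fn steal_allowed(imbalance: u32, crossed: &[DomainLevelView]) -> bool {
    crossed.iter().all(|l| imbalance > l.migration_threshold)
}

fn main() {
    let path = [
        DomainLevelView { name: "LLC", migration_threshold: 0 },
        DomainLevelView { name: "NUMA node", migration_threshold: 2 },
    ];
    // A 2-task imbalance does not exceed the NUMA threshold (default 2),
    // so the cross-node steal is rejected; a 3-task imbalance clears it.
    assert!(!steal_allowed(2, &path));
    assert!(steal_allowed(3, &path));
    println!("ok");
}
```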
- Per-CPU run queues: No global run queue lock. Each CPU manages its own queues
independently. Each per-CPU run queue is protected by a per-CPU spinlock (`rq.lock`).
- Work stealing: Idle CPUs steal tasks from busy CPUs at low frequency (~4ms interval)
to avoid thundering-herd effects. The work stealing algorithm is specified below.
Target CPU selection. When a CPU goes idle and its local run queue is empty, it initiates a steal attempt. The target CPU is selected as follows:
- Same-NUMA-node preference: Scan CPUs within the same NUMA node first. Cross-node steals incur higher migration cost (remote memory latency, cache invalidation of NUMA-local pages) and are only attempted if no same-node candidate has stealable work.
- Highest load first: Among candidate CPUs, prefer the one with the highest run queue load (measured as `rq.eevdf.base.task_count`). This maximizes the probability that the target can spare a task without becoming underloaded itself.
- Cache topology tiebreak: When multiple candidates have equal load, prefer the CPU that shares the closest cache level (L2 > L3 > cross-package). Tasks migrated within a shared cache domain retain warm cache lines, reducing post-migration stall cycles.
- Cross-node fallback: If no same-node CPU has more than one runnable task, scan remote NUMA nodes in distance order (nearest first, using SLIT/SRAT distances from firmware). The migration cost threshold is higher for cross-node steals — the load imbalance must exceed `NUMA_MIGRATION_THRESHOLD` (default: 2 tasks) to justify the cross-node penalty.
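The candidate ordering above can be sketched as below; the `Candidate` struct and its field names are illustrative assumptions, not the actual runqueue types:

```rust
/// Illustrative steal-candidate view (not the real runqueue type).
struct Candidate {
    cpu: u32,
    same_node: bool,
    /// Approximate runnable count, read lock-free from the target.
    task_count: u32,
    /// Cache closeness tiebreak: 0 = shared L2, 1 = shared L3,
    /// 2 = cross-package.
    cache_distance: u8,
}

/// Order candidates: same-NUMA-node first, then highest load, then
/// closest shared cache. CPUs that cannot spare a task are filtered.
fn pick_steal_target(mut cands: Vec<Candidate>) -> Option<u32> {
    cands.retain(|c| c.task_count > 1);
    cands.sort_by(|a, b| {
        b.same_node
            .cmp(&a.same_node)
            .then(b.task_count.cmp(&a.task_count))
            .then(a.cache_distance.cmp(&b.cache_distance))
    });
    cands.first().map(|c| c.cpu)
}

fn main() {
    let cands = vec![
        // Only one runnable task: nothing to spare, filtered out.
        Candidate { cpu: 1, same_node: true, task_count: 1, cache_distance: 0 },
        Candidate { cpu: 2, same_node: true, task_count: 3, cache_distance: 1 },
        // Remote node: loses despite higher load.
        Candidate { cpu: 7, same_node: false, task_count: 5, cache_distance: 2 },
    ];
    assert_eq!(pick_steal_target(cands), Some(2));
    println!("ok");
}
```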
Lock-free load observation: Each VruntimeTree (embedded in EevdfRunQueue) exposes an
atomic `task_count` counter whose loads and stores use `Relaxed` ordering — an exact
count is not required; approximate load is sufficient for steal decisions. The stealing
CPU reads `task_count` atomically without acquiring the target runqueue's lock. This
gives a snapshot that may be 1-2 operations stale, which is acceptable: the goal is
finding a CPU with available work, not exact balance.
False positive handling: If the stealing CPU reads a non-zero `task_count` but finds
no stealable task after acquiring the target lock (due to an intervening dequeue), the
probe still counts as one steal attempt.
Steal attempt limit: After `STEAL_ATTEMPT_LIMIT = 4` failed candidates (same-NUMA
probes first, cross-NUMA second), the idle CPU enters the halted state via
`cpu::halt()` and awaits the next IPI or timer tick.
Task selection. From the target CPU's EEVDF `tasks_timeline` tree, steal the eligible task with the
largest `vdeadline` (rightmost node in the tree). Rationale: the task with the largest
`vdeadline` is the one furthest from being scheduled next on the source CPU — it has the
most remaining virtual runtime before its next turn. Stealing it causes the least
disruption to the source CPU's fairness invariants and avoids stealing a task that was
about to run (which would waste the source CPU's scheduling decision). The stolen task's
`vlag` is preserved; its `vruntime` is adjusted relative to the
destination run queue's `zero_vruntime` to maintain fairness on the new CPU.
RT and deadline tasks are not stolen by the normal work-stealing path. RT task migration uses a separate push/pull mechanism triggered by RT priority changes (see Section 8.4).
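A minimal sketch of the `zero_vruntime` re-basing on migration; the struct shapes are illustrative assumptions (the real task and runqueue types are defined elsewhere in this chapter):

```rust
/// Illustrative view of a stolen task's fairness state.
struct StolenTask {
    vruntime: u64,
    /// Virtual lag: carried over unchanged across migration.
    vlag: i64,
}

/// Re-express the stolen task's vruntime relative to the destination
/// runqueue's zero point. Wrapping arithmetic mirrors kernel vruntime
/// math, which treats vruntime deltas as mod-2^64 quantities.
fn rebase_on_steal(t: &mut StolenTask, src_zero: u64, dst_zero: u64) {
    let offset = t.vruntime.wrapping_sub(src_zero); // progress past src zero
    t.vruntime = dst_zero.wrapping_add(offset);
    // t.vlag is deliberately untouched: the lag is preserved.
}

fn main() {
    let mut t = StolenTask { vruntime: 1_500, vlag: -42 };
    // 500 ns of progress past the source zero point (1_000) becomes
    // 500 ns past the destination zero point (5_000).
    rebase_on_steal(&mut t, 1_000, 5_000);
    assert_eq!(t.vruntime, 5_500);
    assert_eq!(t.vlag, -42);
    println!("ok");
}
```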
Lock ordering. The work stealer must hold two run queue locks simultaneously
(source and destination). Deadlock prevention is enforced at compile time via
`lock_two_runqueues()` (see below). However, the idle CPU also uses a trylock with
exponential backoff strategy to avoid priority inversion:
- The idle CPU acquires its own run queue lock first (guaranteed success — it is local).
- It calls `trylock()` on the target CPU's run queue lock. If the lock is contended (the target CPU is in a scheduling critical section), the steal attempt is abandoned for this cycle rather than spinning.
- On `trylock()` failure, the idle CPU backs off: it doubles the steal retry interval (from the base 4ms up to a cap of 32ms) and re-enters the idle loop. The backoff resets to 4ms on a successful steal or when a local wake-up occurs.
This trylock approach ensures the work stealer never blocks a busy CPU's scheduler
path. In practice, trylock succeeds on the first attempt >95% of the time because
run queue critical sections are bounded and short (O(1), typically < 1 microsecond).
Maximum steal count. Each steal attempt moves exactly 1 task. Stealing multiple tasks per attempt risks over-correcting the load imbalance and causing ping-pong migration between CPUs. The periodic 4ms steal interval provides natural convergence: a 4-task imbalance resolves in ~16ms (4 steal cycles), which is well within acceptable load-balancing latency for non-RT workloads.
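The retry policy (4ms base interval, doubling to a 32ms cap, reset on success) can be sketched as follows; the type and method names are illustrative, not the kernel's actual API:

```rust
const BASE_STEAL_INTERVAL_MS: u64 = 4;
const MAX_STEAL_INTERVAL_MS: u64 = 32;

/// Per-CPU steal retry interval with exponential backoff (sketch).
struct StealBackoff {
    interval_ms: u64,
}

impl StealBackoff {
    fn new() -> Self {
        Self { interval_ms: BASE_STEAL_INTERVAL_MS }
    }
    /// trylock on the target runqueue failed: double the interval,
    /// capped at 32 ms.
    fn on_trylock_failure(&mut self) {
        self.interval_ms = (self.interval_ms * 2).min(MAX_STEAL_INTERVAL_MS);
    }
    /// Successful steal or local wake-up: reset to the base interval.
    fn reset(&mut self) {
        self.interval_ms = BASE_STEAL_INTERVAL_MS;
    }
}

fn main() {
    let mut b = StealBackoff::new();
    b.on_trylock_failure(); // 8 ms
    b.on_trylock_failure(); // 16 ms
    b.on_trylock_failure(); // 32 ms (cap reached)
    b.on_trylock_failure(); // stays at 32 ms
    assert_eq!(b.interval_ms, 32);
    b.reset();
    assert_eq!(b.interval_ms, 4);
    println!("ok");
}
```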
- Run queue lock ordering — compile-time enforced: The load balancer (work stealing)
acquires remote run queue locks. ABBA deadlock prevention is type-enforced, not a
runtime convention. All run queue locks share `RQ_LOCK_LEVEL = 50` in the compile-time lock hierarchy (Section 3.4). The `SpinLock<_, LEVEL=50>` type prevents a caller holding one level-50 lock from acquiring a second one directly. The only legal path to holding two run queue locks simultaneously is `lock_two_runqueues(rq_a, rq_b)`, which always acquires the lock for the lower CPU ID first — making CPU-ID-ordered acquisition the sole valid code path rather than a convention that code review must uphold. The load balancer never holds more than two run queue locks simultaneously. Cross-subsystem lock ordering (innermost last): `TASK_LOCK (level 20) < PI_LOCK (level 45) < RQ_LOCK (level 50)`. This means a caller holding RQ_LOCK must NOT acquire TASK_LOCK or PI_LOCK.
- Real-time guarantees: Dedicated RT cores can be reserved (isolcpus equivalent). Threaded interrupts ensure deterministic scheduling latency.
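A sketch of CPU-ID-ordered double acquisition, using `std::sync::Mutex` as a stand-in for the level-50 runqueue `SpinLock` (the real function returns typed guards from the lock-hierarchy machinery):

```rust
use std::sync::{Mutex, MutexGuard};

/// Stand-in for a per-CPU runqueue; the Mutex models the level-50 lock.
struct Rq {
    cpu_id: u32,
    lock: Mutex<u64>,
}

/// Acquire two runqueue locks in global CPU-ID order. Because every
/// caller locks the lower CPU ID first, two CPUs contending for the
/// same pair can never hold the locks in opposite orders (no ABBA
/// deadlock). Guards are returned in argument order.
fn lock_two_runqueues<'a>(
    a: &'a Rq,
    b: &'a Rq,
) -> (MutexGuard<'a, u64>, MutexGuard<'a, u64>) {
    assert_ne!(a.cpu_id, b.cpu_id, "a runqueue cannot be locked twice");
    if a.cpu_id < b.cpu_id {
        let ga = a.lock.lock().unwrap();
        let gb = b.lock.lock().unwrap();
        (ga, gb)
    } else {
        let gb = b.lock.lock().unwrap();
        let ga = a.lock.lock().unwrap();
        (ga, gb)
    }
}

fn main() {
    let rq0 = Rq { cpu_id: 0, lock: Mutex::new(0) };
    let rq3 = Rq { cpu_id: 3, lock: Mutex::new(3) };
    // Either argument order acquires rq0's lock first internally;
    // guards still come back in argument order.
    let (g_a, g_b) = lock_two_runqueues(&rq3, &rq0);
    assert_eq!((*g_a, *g_b), (3, 0));
    drop((g_a, g_b));
    let (g_a, g_b) = lock_two_runqueues(&rq0, &rq3);
    assert_eq!((*g_a, *g_b), (0, 3));
    println!("ok");
}
```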
- CPU frequency/power: Integration with cpufreq governors for power management.
- Steal time accounting (paravirtualized guests): Each VCPU has a `steal_time` field in a shared memory page mapped by the hypervisor. On x86, KVM writes cumulative stolen nanoseconds to the page registered via `MSR_KVM_STEAL_TIME` (0x4b564d03). On AArch64, the PV steal time structure is registered via the `SMCCC_HV_PV_SCHED_FEATURES` hypercall (SMCCC v1.1+). The guest scheduler reads this on each timer tick:
/// Per-CPU steal time tracking. Updated on each scheduler tick.
pub struct StealTimeAccounting {
/// Last observed cumulative steal value (nanoseconds) from the
/// hypervisor-mapped shared page.
pub last_steal_ns: u64,
/// Running total of stolen time for /proc/stat reporting.
pub total_steal_ns: u64,
}
/// Called from scheduler_tick() on paravirtualized guests.
/// Returns the stolen nanoseconds since the last call.
fn update_steal_time(cpu: &mut CpuLocalBlock) -> u64 {
    let current_steal = read_pv_steal_time(); // arch-specific read of the shared page
    // saturating_sub guards against a counter reset (e.g. after VM migration).
    let delta = current_steal.saturating_sub(cpu.steal_acct.last_steal_ns);
    cpu.steal_acct.last_steal_ns = current_steal;
    cpu.steal_acct.total_steal_ns += delta;
    delta
}
Stolen time is accumulated in `total_steal_ns` for /proc/stat reporting
but is NOT subtracted from `delta_exec` for vruntime advancement. The
`rq_clock_task(rq)` time base used by `update_curr()` already excludes IRQ
time, and steal time is accounted separately through `account_steal_time()`
called from `timer_tick_handler()` (upstream of `scheduler_tick()`). The
vruntime advance formula (`curr.vruntime += calc_delta_fair(delta_exec, weight)`)
uses the task clock delta directly — no steal adjustment is needed or applied.
This matches Linux's behavior, where steal time affects /proc/stat but not
the CFS vruntime calculation.
Reported via /proc/stat as the steal column (field 9, in USER_HZ ticks).
Per-task steal attribution is not tracked (steal is a per-CPU phenomenon, not
a per-task one); the /proc/stat value is the sum across all CPUs.
On bare metal (no hypervisor), `read_pv_steal_time()` always returns 0 (the MSR
is not registered, so no shared page exists). The check is a single branch on a
per-CPU boolean (`pv_steal_enabled`), which is false on bare metal — zero overhead.
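For illustration, the nanosecond-to-tick conversion behind the /proc/stat steal column, assuming the common `USER_HZ = 100` (the actual value is a build-time constant):

```rust
const USER_HZ: u64 = 100; // assumption: the common configuration
const NSEC_PER_SEC: u64 = 1_000_000_000;

/// Convert accumulated steal nanoseconds to USER_HZ ticks for the
/// /proc/stat steal column (field 9). At USER_HZ = 100 one tick is
/// 10 ms, so this is a division by 10_000_000.
fn steal_ns_to_user_hz_ticks(total_steal_ns: u64) -> u64 {
    total_steal_ns / (NSEC_PER_SEC / USER_HZ)
}

fn main() {
    // 250 ms of accumulated steal time reports as 25 ticks.
    assert_eq!(steal_ns_to_user_hz_ticks(250_000_000), 25);
    assert_eq!(steal_ns_to_user_hz_ticks(0), 0);
    println!("ok");
}
```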
7.1.4 Scheduler Classes¶
The scheduler is modular. Each scheduling class implements a standard interface:
/// Documentation-only trait defining the per-class method signatures.
///
/// UmkaOS uses enum-based class dispatch (`match` on `SchedPolicy`) rather than
/// trait objects. The `SchedClassOps` trait exists only as a documentation aid
/// for the per-class method signatures; it is not used for dynamic dispatch.
/// See the `SchedClass` enum and dispatch mechanism below.
pub trait SchedClassOps: Send + Sync {
/// Enqueue a task onto this class's run queue. Task-mutable fields
/// (vruntime, deadline, on_rq_state) use interior mutability; the
/// run queue lock provides mutual exclusion.
fn enqueue_task(&mut self, task: &Task, flags: EnqueueFlags);
fn dequeue_task(&mut self, task: &Task, flags: DequeueFlags);
fn pick_next_task(&mut self, cpu: CpuId) -> Option<&Task>;
fn check_preempt(&self, current: &Task, incoming: &Task) -> bool;
fn task_tick(&mut self, task: &Task, cpu: CpuId, queued: bool);
fn balance(&mut self, cpu: CpuId, flags: BalanceFlags) -> BalanceResult;
}
Classes are checked in priority order: Deadline > RT > EEVDF. The highest-priority class with a runnable task wins.
pick_next_task() dispatch algorithm. The per-CPU scheduler entry point traverses
scheduling classes in strict priority order. Each class's pick_next_task is called at
most once; the first class that returns a runnable task wins. This is O(1) in the number
of classes (three fixed classes, not a dynamic list).
/// Select the highest-priority runnable task on this CPU's run queue.
///
/// Called from the scheduler core on every context switch, timer tick
/// preemption, and explicit `schedule()` invocation. The caller holds
/// `rq.lock` for the local CPU.
///
/// # Priority order
///
/// 1. **Deadline (CBS)** — tasks with active bandwidth reservations and
/// unexpired deadlines. Scheduled earliest-deadline-first within the
/// CBS server.
/// 2. **RT (FIFO / RR)** — real-time tasks. FIFO tasks run until they
/// yield or block; RR tasks rotate within their priority level on
/// each time slice expiry.
/// 3. **EEVDF (normal)** — the eligible task with the smallest virtual
/// deadline (leftmost eligible node in `tasks_timeline`).
///
/// If all three classes are empty, the CPU enters the idle task — a
/// per-CPU kernel thread that executes the architecture's halt/wait
/// instruction (`hlt` on x86, `wfi` on ARM, `wfi` on RISC-V) until
/// the next interrupt.
fn pick_next_task(rq: &mut RunQueueData) -> &Task {
// 1. Deadline class: highest priority. CBS tasks with active
// reservations whose deadline has not yet expired are checked
// first. `dl.pick_next_task()` returns the task with the
// earliest absolute deadline (EDF within the CBS server).
if let Some(task) = rq.dl.pick_next_task() {
return task;
}
// 2. RT class: FIFO and RR tasks. The highest-priority RT task
// is returned. Within a priority level, FIFO tasks are ordered
// by arrival time; RR tasks rotate on slice expiry.
if let Some(task) = rq.rt.pick_next_task() {
return task;
}
// 3. CBS group servers: cgroups with cpu.guarantee
// ([Section 7.6](#cpu-bandwidth-guarantees)). Iterate CBS servers on this
// CPU in EDF order (earliest deadline first). For each server
// with remaining budget and runnable tasks, pick its next EEVDF
// task. This ensures guaranteed groups receive their reserved
// bandwidth ahead of non-guaranteed EEVDF tasks.
//
// cpu.max interaction (GAP-22): if the cgroup is also
// max-throttled (cpu.max quota exhausted), skip this CBS server
// even if guarantee budget remains. cpu.max is always the hard
// ceiling; cpu.guarantee cannot override it.
//
// RT/DL tasks in CBS-guaranteed cgroups bypass the CBS server
// entirely — they are picked in steps 1-2 above and do not
// consume CBS budget.
// Linear scan of CBS servers on this CPU to find the one with the
// earliest deadline. XArray provides O(1) lookup by cgroup ID but
// does NOT support deadline-ordered iteration natively. The linear
// scan is O(N) where N = number of CBS-guaranteed cgroups on this
// CPU (typically 1-5, max ~20). See [Section 7.6](#cpu-bandwidth-guarantees)
// for the authoritative specification and optimization threshold.
for server in rq.cbs_servers.values() {
if server.throttled.load(Acquire) { continue; }
if server.max_throttled.load(Acquire) { continue; }
if let Some(task) = server.pick_next_eevdf_task() {
return task;
}
}
// 4. EEVDF class: normal (SCHED_NORMAL / SCHED_BATCH) tasks
// WITHOUT cpu.guarantee. Returns the eligible task with the
// smallest vdeadline from `tasks_timeline` via the augmented
// `pick_eevdf()` walk that prunes ineligible subtrees.
//
// Deferred-entity handling: `pick_eevdf()` does NOT filter
// `sched_delayed` entities (matching Linux `fair.c pick_eevdf()`).
// If a deferred entity is picked, dequeue it without wake-up and
// retry. This matches Linux `pick_task_fair()` behavior.
loop {
match rq.eevdf.pick_next_task() {
Some(task) if task.eevdf.sched_delayed => {
// Deferred entity picked: vlag decayed to zero.
// Remove from tree without wake-up, then retry.
dequeue_entity(rq, task, DEQUEUE_SLEEP);
continue;
}
Some(task) => return task,
None => break,
}
}
// 5. All classes empty: return the per-CPU idle task.
// The idle task is always runnable and never enqueued in any
// scheduling class. It is a sentinel — the run queue is never
// truly "empty" because the idle task is always available.
&*rq.idle_task
}
Per-CPU run queue interaction. Each CPU calls pick_next_task() independently on
its own RunQueueData while holding the local run queue lock. There is no cross-CPU
coordination in the pick path — load balancing and work stealing (Section 7.1.3) are
separate, asynchronous operations that move tasks between run queues. This ensures
the scheduling hot path is lock-local and O(1) in the number of CPUs.
Idle task behavior. The idle task is a statically allocated per-CPU kernel thread
that does not participate in any scheduling class. When selected, it:
1. Checks for pending softirqs (Section 3.8)
and processes them before halting.
2. Invokes the cpuidle governor to select the deepest safe C-state
(Section 7.4) based on expected idle duration and latency constraints.
3. Executes the architecture halt instruction. The CPU remains halted until the next
interrupt (timer tick, IPI from work stealing, device interrupt).
4. On wake, the idle task immediately calls pick_next_task() again — it never
"runs" application logic.
7.1.4.1 eevdf_task_tick — Per-Tick Accounting for EEVDF Tasks¶
The EEVDF implementation of SchedClassOps::task_tick(). Called on every timer
interrupt for the currently running CFS/EEVDF task. This is the single most
important function in the scheduler — it runs on every tick on every CPU that
has a running EEVDF task.
Classification: Evolvable. All runtime code (vruntime advance, PELT update,
CBS charge, preemption check) is policy, hot-swappable via EvolvableComponent.
The VruntimeTree, EevdfRunQueue, and EevdfTask struct layouts are Nucleus (data).
/// EEVDF implementation of `task_tick`. Called from the timer interrupt
/// handler on every scheduling tick (1-4 ms depending on `CONFIG_HZ`).
///
/// # Arguments
/// - `rq`: the local CPU's run queue data (caller holds `rq.lock`)
/// - `curr`: the currently running EEVDF task
/// - `queued`: derived from `rq.base.nr_queued.count()` inside the function body
///
/// # Locking
/// Called with the local CPU's `rq.lock` held (IRQs disabled). All
/// field accesses are to the local CPU's runqueue — no cross-CPU locking.
///
/// # Algorithm
///
/// ```text
/// fn eevdf_task_tick(rq: &mut EevdfRunQueue, curr: &EevdfTask) {
/// // Step 1: Core accounting — advance vruntime, check deadline,
/// // update PELT, charge CBS budget, check preemption.
/// update_curr(rq);
///
/// // Step 2: Update exec_start for the next delta_exec computation.
/// // (Already done inside update_curr — listed here for completeness.)
///
/// // Step 3: Check preemption against tree candidates.
/// // If the current task has exhausted its protected slice AND a peer
/// // with an earlier virtual deadline is eligible, request reschedule.
/// // This is the `check_preempt_tick` logic.
/// // UmkaOS WaiterCount excludes curr, so has_waiters() == true
/// // means "at least one peer exists" (equivalent to Linux's
/// // nr_queued > 1). See WaiterCount threshold translation guide.
/// if rq.base.nr_queued.has_waiters() {
/// if curr.vruntime >= curr.vprot {
/// // Protected slice consumed — check if a peer should preempt.
/// // Use the raw slice as the ideal runtime (wall-clock nanoseconds).
/// // The primary EEVDF preemption mechanism is `update_deadline()`
/// // (virtual-time comparison: vruntime >= vdeadline), which handles
/// // weight-proportional fairness. This secondary check just ensures
/// // no task runs beyond its raw slice in wall-clock time.
/// // NOTE: Do NOT use `calc_delta_fair(slice, weight)` here — that
/// // converts to virtual time, which would invert the weight
/// // relationship (high-weight tasks preempted sooner, not later).
/// let ideal_runtime = curr.slice_ns;
/// let delta_exec = curr.sum_exec_runtime - curr.prev_sum_exec_runtime;
/// // If we have run for at least one ideal runtime quantum, request
/// // reschedule. Lazy urgency: the tick fires every 1 ms; using
/// // Eager would cause immediate preemption on the next interrupt
/// // return, potentially interrupting kernel-mode work. Lazy lets
/// // the task finish its current kernel-mode operation and yields at
/// // the next voluntary preemption point (return-to-user,
/// // cond_resched()). This matches Linux 6.12+ scheduler_tick()
/// // behavior which uses resched_curr_lazy().
/// //
///         // Note: update_curr() (step 1) also uses Lazy for its own
/// // preemption check (deadline/protected-slice expiry). Both
/// // paths use Lazy consistently — the tick path does not
/// // override update_curr()'s urgency decision.
/// // Note: Once delta_exec >= ideal_runtime on one tick,
/// // this condition remains true on EVERY subsequent tick
/// // (prev_sum_exec_runtime is set on enqueue, not updated
/// // by the tick). This is intentional: Lazy reschedule is
/// // idempotent (setting TIF_NEED_RESCHED_LAZY again is a
/// // no-op). This matches Linux behavior.
/// if delta_exec >= ideal_runtime {
/// resched_curr(rq, ReschedUrgency::Lazy);
/// }
/// }
/// }
///
/// // Step 4: Propagate cgroup weight changes (lazy reweight).
/// // If the task's cgroup cpu.weight was modified since the last tick,
/// // recompute the GroupEntity weight. No reschedule IPI needed for
/// // ticked cores — the weight update is picked up here.
/// // See [Section 17.2](17-containers.md#control-groups--cpu-controller-state).
/// // AtomicBool because cgroup migration (cross-CPU) may set this
/// // flag while the task is running on a different CPU.
/// if curr.cgroup_weight_dirty.load(Acquire) {
/// reweight_entity(rq, curr);
/// curr.cgroup_weight_dirty.store(false, Release);
/// }
///
/// // ML observation: emit per-tick scheduling metrics (gated by static key).
/// // observe_kernel!(SubsystemId::Scheduler, SchedObs::TaskTick,
/// // curr.vruntime, rq.base.nr_queued.count(), delta_exec);
/// }
/// ```
pub fn eevdf_task_tick(
rq: &mut EevdfRunQueue,
curr: &EevdfTask,
) { /* in sched/eevdf.rs — uses rq.base.nr_queued directly.
* curr is &EevdfTask (shared reference) because it comes from
* Arc<Task> via TaskHandle. All scheduler-mutable fields in EevdfTask
* (vruntime, vdeadline, on_rq, sum_exec_runtime, etc.) are either
* atomics or protected by the rq lock, which this function holds.
* See the interior mutability convention at line ~1037. */ }
Relationship to update_curr(): eevdf_task_tick() delegates the core
accounting to update_curr() which handles vruntime advance, deadline renewal,
PELT update, CBS charge, cpu.max charge, and basic preemption check. The tick
function adds the check_preempt_tick heuristic (step 3) which checks whether
the current task has exceeded its ideal runtime — a coarser preemption trigger
than the deadline-based check inside update_curr().
Why both preemption checks exist: update_curr() step 8 triggers on vprot
expiry alone (curr.vruntime >= curr.vprot). The tick's step 3 adds a
wall-clock runtime check (delta_exec >= ideal_runtime) which is relevant
when vprot has not yet expired but the task has run for more than one full
ideal runtime quantum — possible when a large delta_exec spans multiple
ticks (e.g., tick interrupt delayed by IRQ storm). If update_curr() already
set TIF_NEED_RESCHED_LAZY, the tick check is redundant (harmless — setting
the flag again is a no-op). Both use Lazy urgency consistently.
Error paths: eevdf_task_tick() cannot fail. All arithmetic is bounded.
Idle CPU note: rq.curr is NEVER None -- idle CPUs have rq.curr
pointing to the statically-allocated rq.idle_task (boot-allocated per-CPU,
see idle_task: Arc<Task> in RunQueueData). The idle task has
sched_class = SchedClass::Idle, which dispatches to this function. The
update_curr() early return guard (step 2, idle-marked check) ensures the
idle task's vruntime does not advance. rq.curr is NonNull<EevdfTask> in
practice; the Option wrapper exists for the boot-time initialization window
before the idle task is assigned.
7.1.4.2 sched_idle_enter / sched_idle_exit — Idle Accounting Hooks¶
Non-idle threads that spin-wait (e.g., KVM halt-poll, busy-wait synchronisation primitives) can bracket their idle-wait windows so the scheduler does not penalise them with excess vruntime. These hooks are distinct from the per-CPU idle task — they apply to any runnable thread that is temporarily doing non-productive work.
/// Mark the current task as idle-spinning. While in this state:
///
/// - **vruntime does not advance**: `update_curr()` skips the
/// `vruntime += delta * (NICE_0_WEIGHT / weight)` accumulation for the
/// idle-marked task. The task's `vruntime` is frozen at its value when
/// `sched_idle_enter()` was called.
/// - **Not counted in runqueue load**: The task's weight is subtracted from
/// the two `avg_vruntime` accumulators (`sum_w_vruntime`, `sum_weight`).
/// This prevents an idle-spinning
/// task from inflating the runqueue's apparent load, which would distort
/// load balancing decisions and EAS frequency selection.
/// - **PELT accounting pauses**: `last_update_time` is not advanced for the
/// idle-marked task. Neither `util_sum` nor `runnable_sum` accumulate.
/// When `sched_idle_exit()` is called, a single `update_load_avg()` call
/// applies geometric decay over the idle interval, correctly reducing the
/// stale PELT signal.
/// - **Preemption remains enabled**: The task can still be preempted by
/// higher-priority tasks (RT, DL, or EEVDF peers with earlier deadlines).
/// If preempted, the idle marking persists across the context switch —
/// it is cleared only by an explicit `sched_idle_exit()` call.
///
/// # Panics
/// Calling `sched_idle_enter()` when the task is already idle-marked is a
/// logic error and triggers a `WARN_ON_ONCE` diagnostic (debug builds panic).
///
/// # Usage
/// - **KVM halt-poll** ([Section 18.3](18-virtualization.md#kvm-operational--vcpu-scheduling-integration)):
/// vCPU threads call `sched_idle_enter()` before the halt-poll spin loop
/// and `sched_idle_exit()` when an interrupt arrives or the poll window
/// expires. This prevents latency-sensitive guests from accumulating
/// unfair vruntime during idle polling.
/// - **Busy-wait synchronisation**: Any kernel thread that spin-waits for a
/// bounded duration (≤1 ms) on a condition that does not justify sleeping
/// may use these hooks to avoid scheduler penalties.
/// **Locking precondition**: Must be called with the local rq lock held
/// (i.e., within a `rq_lock_irqsave()` / `rq_unlock_irqrestore()` section).
/// Preemption disabled alone is NOT sufficient — the timer tick IRQ handler
/// calls `update_curr()` which reads `sum_w_vruntime`/`sum_weight`, and
/// these are non-atomic `i64` fields. The rq lock disables IRQs, preventing
/// concurrent modification from the timer tick handler.
pub fn sched_idle_enter(rq: &mut RunQueueGuard) {
let task = current();
debug_assert!(!task.sched_idle_marked.load(Relaxed), "double sched_idle_enter");
task.sched_idle_marked.store(true, Relaxed);
// Save the entity_key at enter time. The saved value is used at exit
// to restore the exact accumulator contribution, preventing drift from
// zero_vruntime shifts during the idle interval. Without this, a
// zero_vruntime update (from update_curr() called via timer tick on
// another task) would cause the exit restoration to add back a
// different key*weight product than was subtracted.
let key = entity_key(&rq.eevdf.base, task);
task.saved_idle_key.store(key, Relaxed);
// Subtract weight from rq accumulators.
// Access the runqueue data through the guard (no separate this_rq() call
// — that would alias the guard's &mut reference, causing UB).
rq.eevdf.base.sum_w_vruntime -= key * task.weight as i64;
rq.eevdf.base.sum_weight -= task.weight as i64;
}
/// Exit idle-spinning state. Reverses `sched_idle_enter()`:
///
/// 1. Re-adds the task's weight to the two `avg_vruntime` accumulators
/// (`sum_w_vruntime`, `sum_weight`).
/// 2. Performs a single `update_load_avg()` call to apply PELT decay for
/// the elapsed idle interval.
/// 3. Clears `task.sched_idle_marked`.
///
/// After this call, the task resumes normal vruntime accumulation and PELT
/// accounting. The task's vruntime is unchanged — it resumes from the frozen
/// value, which means it has not consumed any of its fair share during the
/// idle window.
///
/// # Panics
/// Calling `sched_idle_exit()` without a prior `sched_idle_enter()` triggers
/// `WARN_ON_ONCE` (debug builds panic).
pub fn sched_idle_exit(rq: &mut RunQueueGuard) {
let task = current();
debug_assert!(task.sched_idle_marked.load(Relaxed), "sched_idle_exit without enter");
task.sched_idle_marked.store(false, Relaxed);
// Catch up PELT decay for the idle interval.
// Access through the RunQueueGuard (no separate this_rq() — see sched_idle_enter).
update_load_avg(task, &*rq);
// Restore weight contribution to rq accumulators using the saved key
// from sched_idle_enter(). This prevents accumulator drift from
// zero_vruntime shifts during the idle interval.
let saved_key = task.saved_idle_key.load(Relaxed);
rq.eevdf.base.sum_w_vruntime += saved_key * task.weight as i64;
rq.eevdf.base.sum_weight += task.weight as i64;
}
Invariant: The interval between sched_idle_enter() and sched_idle_exit() must
be bounded. KVM enforces this via halt_poll_ns (default 200 us, max 10 ms). Unbounded
idle marking would distort the runqueue's avg_vruntime: on exit, the task
re-enters the accumulators with a vruntime frozen far behind its peers, dragging
avg_vruntime backward, which could delay ineligible tasks from becoming
eligible. The halt_poll_ns sysctl provides the administrative bound;
WARN_ON_ONCE fires if the idle interval exceeds 10 ms (configurable via
sched_idle_max_ns, default 10_000_000).
Task struct field: sched_idle_marked: AtomicBool is added to the per-task scheduling
state (adjacent to sched_delayed in EevdfTask). AtomicBool with Relaxed ordering
is used instead of Cell<bool> because update_curr() reads this flag from IRQ context
(timer tick handler) while sched_idle_enter() writes it from process context — a
cross-context access that is a data race under the Rust abstract machine with Cell.
On all 8 architectures, Relaxed load/store of a bool compiles to the same instruction
as a non-atomic load/store (naturally aligned single-byte access), so there is zero
performance cost. update_curr() checks this flag before accumulating vruntime — the
check is a single branch that is almost always not-taken (the likely() hint ensures
the branch predictor handles the common case with zero overhead).
7.1.4.3 Timer IRQ → scheduler_tick() Entry Path¶
The hardware timer interrupt (LAPIC timer on x86-64, ARM Generic Timer on
AArch64/ARMv7, SBI timer on RISC-V, decrementer on PPC, CPU timer on s390x,
stable counter on LoongArch64) fires at the configured tick rate (HZ=1000 → 1 ms,
HZ=250 → 4 ms). The per-architecture timer IRQ handler calls
timer_tick_handler() which is the single entry point connecting hardware
timer interrupts to scheduler and RCU tick processing.
/// Timer tick handler. Called from the architecture-specific timer IRQ
/// handler with IRQs disabled on the local CPU.
///
/// This function bridges the hardware timer interrupt to the scheduler
/// and RCU subsystems. It runs on EVERY tick on EVERY online CPU.
fn timer_tick_handler() {
// 1. Advance jiffies and update timekeeping.
update_wall_time();
// 2. Process expired hrtimers and timer wheel entries.
run_local_timers();
// 3. RCU quiescent state reporting.
// Single-task CPUs that never context-switch must report quiescent
// states here to avoid stalling RCU grace periods. Without this,
// a CPU running a single long-running task would never call
// rcu_note_context_switch() and the grace period would stall.
rcu_sched_clock_irq();
// 4. Scheduler tick.
scheduler_tick();
// 5. Subsystem tick hooks.
perf_event_task_tick(); // PMU event multiplexing (rotate groups)
calc_global_load_tick(); // /proc/loadavg update
psi_task_tick(); // Pressure Stall Information accounting
}
7.1.4.4 Top-Level scheduler_tick() Dispatch¶
The timer interrupt handler calls scheduler_tick() on every tick (1 ms HZ=1000,
or 4 ms HZ=250 depending on configuration). This is the top-level dispatch
function that coordinates all per-tick scheduling work across all scheduling
classes.
/// Top-level scheduler tick handler. Called from the timer interrupt
/// handler with IRQs disabled. Dispatches to per-class tick functions
/// and performs period maintenance (RT bandwidth, load balance interval).
///
/// Linux equivalent: `scheduler_tick()` in `kernel/sched/core.c`.
///
/// # Algorithm
///
/// ```text
/// fn scheduler_tick() {
/// let cpu = smp_processor_id();
/// let rq = &mut per_cpu_rq(cpu);
/// let _guard = rq.lock.lock(); // level 50
///
/// // Step 1: Update steal time (paravirtualized guests).
/// // See [Section 18.3](18-virtualization.md#kvm-operational--paravirtual-steal-time).
/// update_steal_time(rq);
///
/// // Step 1a: Update the runqueue task clock (cached, excludes IRQ time).
/// // All subsequent clock reads in this tick use rq.clock_task.
/// update_rq_clock(rq);
///
/// // Step 2: Dispatch to per-class task_tick().
/// let curr = rq.curr;
/// match curr.sched_class {
/// SchedClass::RtFifo | SchedClass::RtRr => {
/// rt_task_tick(rq, curr);
/// }
/// SchedClass::Deadline => {
/// dl_task_tick(rq, curr);
/// }
/// SchedClass::Eevdf | SchedClass::Idle => {
/// eevdf_task_tick(&mut rq.eevdf, curr);
/// }
/// }
///
/// // Step 3: RT bandwidth period accounting.
/// // If an RT task is running, charge delta_exec against the RT
/// // bandwidth budget. Check for period rollover and throttle.
/// if rq.rt_rq.rt_runtime_ns != u64::MAX {
/// rt_bandwidth_tick(rq);
/// }
///
/// // Step 4: Load balance interval check.
/// // Decrement the per-CPU load balance interval counter. When it
/// // reaches zero, trigger_load_balance() schedules the softirq
/// // that runs the load balancer. This amortizes the balancer cost
/// // across many ticks (interval = 4-32 ticks depending on topology).
/// rq.next_balance_tick -= 1;
/// if rq.next_balance_tick == 0 {
/// trigger_load_balance(rq);
/// rq.next_balance_tick = load_balance_interval(rq);
/// }
///
/// // Step 5: Update thermal/frequency pressure.
/// // See [Section 7.7](#power-budgeting--thermal-pressure-propagation).
/// update_thermal_pressure(rq);
///
/// // Step 6: nohz_full re-entry check.
/// // If this CPU was in nohz_full mode (tickless) and a second task
/// // became runnable, the tick was re-enabled. Check if we can return
/// // to tickless mode (only one runnable task again).
/// if is_nohz_full(cpu) && rq.nr_running == 1 {
/// nohz_full_kick_stop(cpu);
/// }
/// }
/// ```
pub fn scheduler_tick() { /* in sched/core.rs */ }
7.1.4.5 RT Bandwidth Period Accounting¶
/// Per-tick RT bandwidth accounting. Called from `scheduler_tick()` when
/// the RT bandwidth limiter is active (`rt_runtime_ns != u64::MAX`).
///
/// The RT bandwidth limiter prevents RT tasks from starving non-RT tasks
/// by limiting total RT execution time per period (default: 950 ms per
/// 1000 ms period, matching Linux's `sched_rt_runtime_us = 950000`).
///
/// # Algorithm
///
/// ```text
/// fn rt_bandwidth_tick(rq: &mut RunQueue) {
/// let rt_rq = &mut rq.rt_rq;
/// let now = rq_clock_task(rq);
///
/// // Step 1: Period rollover check.
/// // Advance period_start_ns by the period length (not set to `now`)
/// // to prevent drift when ticks are delayed. Use a while loop to
/// // handle missed periods (e.g., IRQ storm causing 2+ second gap).
/// // Bound: cap iterations to RT_MAX_ROLLOVER_ITERS (10) to prevent
/// // unbounded latency under the rq lock if the clock jumps (e.g.,
/// // VM pause/resume). If more than 10 periods are missed, the
/// // remaining catch-up happens over subsequent ticks. This matches
/// // Linux's RT_MAX_PERIODS bound in do_sched_rt_period_timer().
/// let mut rollover_iters = 0u32;
/// const RT_MAX_ROLLOVER_ITERS: u32 = 10;
///     while now - rt_rq.period_start_ns >= rt_rq.rt_period_ns
///         && rollover_iters < RT_MAX_ROLLOVER_ITERS
///     {
///         rollover_iters += 1;
///         rt_rq.period_start_ns += rt_rq.rt_period_ns;
/// rt_rq.rt_time_ns = 0;
/// // Un-throttle if previously throttled.
/// if rt_rq.throttled {
/// rt_rq.throttled = false;
/// // Re-enqueue RT tasks that were dequeued during throttle.
/// rt_unthrottle(rq);
/// resched_curr(rq, ReschedUrgency::Eager);
/// }
/// }
///
/// // Step 2: Charge runtime against the RT bandwidth budget.
/// // rt_bandwidth_tick() owns the exec_start update for RT tasks.
/// // rt_task_tick() does NOT update exec_start — it handles per-task
/// // slice management (RR timeslice, RTTIME limit) using the delta
/// // computed here. This single-owner design prevents double-charging.
/// if rq.curr.sched_class == SchedClass::RtFifo
/// || rq.curr.sched_class == SchedClass::RtRr
/// {
/// let delta = now as i64 - rq.curr.exec_start as i64;
/// if delta > 0 {
/// rq.curr.exec_start = now;
/// rt_rq.rt_time_ns += delta as u64;
/// }
/// }
///
/// // Step 3: Throttle check.
/// if !rt_rq.throttled
/// && rt_rq.rt_time_ns >= rt_rq.rt_runtime_ns
/// {
/// // RT budget exhausted for this period. Throttle all RT tasks
/// // on this CPU: dequeue them from the RT runqueue and set the
/// // throttled flag. Non-RT tasks will run until the period resets.
/// rt_rq.throttled = true;
/// rt_throttle_all(rq);
/// resched_curr(rq, ReschedUrgency::Eager);
/// }
/// }
/// ```
///
/// **Locking**: Called with `rq.lock` held (level 50), IRQs disabled.
///
/// **Default values**: `rt_runtime_ns = 950_000_000` (950 ms),
/// period = 1_000_000_000 ns (1 second). Configurable via
/// `/proc/sys/kernel/sched_rt_runtime_us` and `sched_rt_period_us`
/// (Linux compatibility interface in [Section 20.9](20-observability.md#kernel-parameter-store)).
pub fn rt_bandwidth_tick(rq: &mut RunQueue) { /* in sched/rt.rs */ }
7.1.4.6 Per-Class Tick Functions¶
/// RT scheduling class tick handler. Called from `scheduler_tick()` when
/// the current task is SCHED_FIFO or SCHED_RR.
///
/// **exec_start ownership**: This function does NOT update `exec_start`.
/// `rt_bandwidth_tick()` (called after this) owns the `exec_start` update
/// and delta computation for all RT tasks. This prevents double-charging.
///
/// # Algorithm
/// ```text
/// fn rt_task_tick(rq: &mut RunQueueData, curr: &EevdfTask) {
///     // RLIMIT_RTTIME check: accumulated total RT CPU time. Performed
///     // first so that it also applies to SCHED_FIFO; the FIFO early
///     // return below would otherwise skip it.
///     // curr.rt_runtime_us reflects the value accumulated as of the
///     // PREVIOUS tick's rt_bandwidth_tick() (which runs AFTER
///     // rt_task_tick in scheduler_tick). The check is therefore one tick
///     // behind, which is within RLIMIT_RTTIME tolerance (same as Linux).
///     // If it exceeds the task's rlimit, send SIGXCPU; if it exceeds it
///     // by a full second, escalate to SIGKILL.
///     let limit_us = task_rlimit(curr, RLIMIT_RTTIME);
///     if limit_us != u64::MAX {
///         let runtime_us = curr.rt_runtime_us.load(Relaxed);
///         if runtime_us >= limit_us {
///             send_sig(SIGXCPU, curr);
///         }
///         if runtime_us >= limit_us + 1_000_000 {
///             send_sig(SIGKILL, curr);
///         }
///     }
///
///     // SCHED_FIFO: no timeslice — run until preempted or blocked.
///     // No further action needed on tick (FIFO tasks are preempted only
///     // by higher-priority RT tasks, DL tasks, or explicit yield).
///     if curr.sched_policy == UserSchedPolicy::Fifo {
///         return;
///     }
///
///     // SCHED_RR: decrement timeslice and rotate if expired.
///     // The RR timeslice is stored in curr.slice_ns (default:
///     // DEF_TIMESLICE = 100ms, same as Linux).
///     curr.rr_time_remaining -= TICK_NS;
///     if curr.rr_time_remaining > 0 {
///         return;
///     }
///     // Reset timeslice for next quantum.
///     curr.rr_time_remaining = curr.slice_ns;
///     // If other RR/FIFO tasks at the same priority exist, rotate:
///     // move curr to the tail of its priority queue.
///     if rq.rt.has_peers_at_priority(curr.rt_priority) {
///         dequeue_rt_task(rq, curr);
///         enqueue_rt_task_tail(rq, curr);
///         resched_curr(rq, ReschedUrgency::Eager);
///     }
/// }
/// ```
pub fn rt_task_tick(rq: &mut RunQueueData, curr: &EevdfTask) { /* in sched/rt.rs —
* curr is &EevdfTask (shared reference): same rationale as eevdf_task_tick.
* Scheduler-mutable fields use interior mutability (atomics / rq lock). */ }
/// Deadline (EDF/CBS) scheduling class tick handler.
///
/// # Algorithm
/// ```text
/// fn dl_task_tick(rq: &mut RunQueueData, curr: &EevdfTask) {
/// // Step 1: CBS runtime accounting.
/// // The delta_exec for DL tasks is computed from rq.clock_task.
/// let now = rq.clock_task;
/// let delta = now.saturating_sub(curr.exec_start);
/// curr.exec_start = now;
/// curr.dl_runtime_remaining = curr.dl_runtime_remaining.saturating_sub(delta);
///
/// // Step 2: Deadline expiry check.
/// // If the absolute deadline has passed, the task missed its deadline.
/// // Reclaim: reset runtime and advance deadline by one period.
/// // Unit convention: dl_runtime_us and dl_period_us are stored in
/// // microseconds (matching the sched_attr ABI), converted to nanoseconds
/// // by multiplying by 1000 when computing absolute runtime/deadline values.
/// // This diverges from the CBS subsystem which stores budget_remaining_ns
/// // natively in nanoseconds. The divergence is intentional: the DL fields
/// // match the Linux sched_attr struct's dl_runtime/dl_period (microseconds),
/// // while CBS budget_remaining_ns is an internal accounting field not
/// // exposed to userspace.
/// if now >= curr.dl_deadline {
/// curr.dl_runtime_remaining = curr.dl_runtime_us * 1000;
/// curr.dl_deadline += curr.dl_period_us * 1000;
/// }
///
/// // Step 3: Budget exhaustion.
/// // If runtime is exhausted before the deadline, throttle the task
/// // until the next period starts (CBS replenishment).
/// if curr.dl_runtime_remaining == 0 {
/// curr.on_rq = OnRqState::CbsThrottled;
/// dequeue_dl_task(rq, curr);
/// start_dl_replenishment_timer(rq, curr);
/// resched_curr(rq, ReschedUrgency::Eager);
/// }
/// }
/// ```
pub fn dl_task_tick(rq: &mut RunQueueData, curr: &EevdfTask) { /* in sched/dl.rs —
* curr is &EevdfTask (shared reference): same rationale as eevdf_task_tick.
* Scheduler-mutable fields use interior mutability (atomics / rq lock). */ }
7.1.4.7 Scheduler Utility Functions¶
/// Raise SCHED_SOFTIRQ to trigger the load balancer on the next softirq
/// processing point. The load balancer runs in softirq context to avoid
/// blocking the scheduler tick path.
fn trigger_load_balance(rq: &RunQueueData) {
raise_softirq(SoftirqVec::Sched);
}
/// Return the load balance interval in ticks for this CPU based on
/// the sched_domain topology depth. Deeper topologies (more NUMA hops)
/// use longer intervals to reduce cross-node balancing overhead.
/// Range: 4 ticks (single-socket) to 32 ticks (4+ socket NUMA).
fn load_balance_interval(rq: &RunQueueData) -> u32 {
// Base interval scaled by sched_domain depth.
let depth = rq.sched_domain_depth;
core::cmp::min(4 * (1 << depth), 32)
}
/// Read the current thermal pressure from the architecture-specific
/// thermal monitoring interface and update the runqueue's capacity
/// reduction factor. Used by EAS for frequency/capacity decisions.
fn update_thermal_pressure(rq: &mut RunQueueData) {
let pressure = arch::current::thermal::read_pressure(rq.cpu_id);
rq.thermal_pressure = pressure;
}
/// Stop the tick timer for tickless (nohz_full) operation. Called when
/// a CPU returns to single-runnable-task state and can re-enter tickless
/// mode. The next scheduling event (wakeup, migration) will restart the tick.
fn nohz_full_kick_stop(cpu: u32) {
arch::current::timer::stop_tick(cpu);
}
/// Update the per-runqueue task clock. Called once at the start of each
/// scheduler entry (timer_tick_handler, schedule(), try_to_wake_up).
/// Reads the raw monotonic clock and subtracts accumulated IRQ time
/// to produce the task-only clock value stored in rq.clock_task.
fn update_rq_clock(rq: &mut RunQueueData) {
let raw_now = sched_clock_nanos();
rq.clock_task = raw_now - rq.irq_time_ns;
}
/// Called at IRQ entry. Records the timestamp for IRQ time accounting.
/// Must be called from the architecture-specific IRQ entry trampoline,
/// BEFORE the IRQ handler dispatches to device-specific code.
/// Paired with `irq_time_end()` at IRQ exit. Together these functions
/// maintain `rq.irq_time_ns`, which is subtracted from the raw clock
/// in `update_rq_clock()` to produce the task-only clock. Without this
/// accounting, IRQ processing time would be charged to the running
/// task's vruntime, inflating vruntime for tasks that happen to be
/// running during IRQ storms.
/// Linux equivalent: `irqtime_account_irq()` called from IRQ entry/exit.
fn irq_time_start(rq: &mut RunQueueData) {
rq.irq_entry_timestamp = sched_clock_nanos();
}
/// Called at IRQ exit. Accumulates the IRQ duration into `rq.irq_time_ns`.
/// Must be called from the architecture-specific IRQ exit trampoline,
/// AFTER all IRQ handlers have completed and BEFORE returning to the
/// interrupted context (task or idle).
fn irq_time_end(rq: &mut RunQueueData) {
let delta = sched_clock_nanos() - rq.irq_entry_timestamp;
rq.irq_time_ns += delta;
}
/// Return the cached task clock for this runqueue. This is the canonical
/// clock source for all scheduler accounting (update_curr, CBS charge,
/// PELT update). Excludes IRQ time.
/// Linux equivalent: rq_clock_task(rq) in kernel/sched/sched.h.
fn rq_clock_task(rq: &RunQueueData) -> u64 {
rq.clock_task
}
7.1.4.8 reweight_entity — Lazy Weight Update¶
/// Recompute an entity's scheduling weight after a nice/cpu.weight change.
/// Called from `eevdf_task_tick()` step 4 when `cgroup_weight_dirty` is set,
/// and from `sched_setattr()`/`setpriority()` when the nice value changes.
///
/// The entity must be dequeued from the EEVDF accumulators, have its weight
/// updated, and re-enqueued with the new weight to maintain invariants.
///
/// Linux equivalent: `reweight_entity()` in `kernel/sched/fair.c`.
///
/// # Algorithm
/// ```text
/// fn reweight_entity(rq: &mut EevdfRunQueue, se: &mut EevdfTask) {
/// let old_weight = se.weight;
/// let new_weight = sched_prio_to_weight[(se.nice + 20) as usize];
/// if old_weight == new_weight { return; }
///
/// // Step 1: Remove entity's contribution from accumulators.
/// // Uses avg_vruntime_update() to ensure the same entity_key() formula
/// // (se.vruntime - tree.zero_vruntime) is used for both removal and
/// // re-addition. Directly using raw se.vruntime would corrupt the
/// // accumulator by omitting the zero_vruntime offset.
/// let was_on_tree = se.on_rq == OnRqState::Queued;
/// if was_on_tree {
/// __dequeue_entity(&mut rq.base, se);
/// }
/// avg_vruntime_update(&mut rq.base, se, false); // dequeue: subtracts key*weight
///
/// // Step 2: Scale vlag by weight ratio to preserve relative position.
/// // vlag_new = vlag_old * old_weight / new_weight
/// se.vlag = se.vlag * old_weight as i64 / new_weight as i64;
///
/// // Step 3: Scale deadline by weight ratio.
/// // The remaining virtual time until deadline should be preserved
/// // proportionally: vd_remaining_new = vd_remaining_old * old/new.
/// let vd_remaining = se.vdeadline as i64 - se.vruntime as i64;
/// let scaled_remaining = vd_remaining * old_weight as i64 / new_weight as i64;
/// se.vdeadline = (se.vruntime as i64 + scaled_remaining) as u64;
/// se.rel_deadline = true;
///
/// // Step 4: Update weight.
/// se.weight = new_weight;
///
/// // Step 5: Re-add entity's contribution with new weight.
/// // Uses avg_vruntime_update() — same entity_key() path as enqueue/dequeue,
/// // ensuring the accumulator invariant is maintained:
/// // sum_w_vruntime = SUM_i( (v_i - zero_vruntime) * w_i )
/// avg_vruntime_update(&mut rq.base, se, true); // enqueue: adds key*new_weight
/// if was_on_tree {
/// __enqueue_entity(&mut rq.base, se);
/// }
///
/// // Step 6: Update PELT load average for the new weight.
/// se.pelt.update_weight(new_weight as u64);
/// }
/// ```
pub fn reweight_entity(rq: &mut EevdfRunQueue, se: &mut EevdfTask) { /* in sched/eevdf.rs */ }
SchedClass Dispatch Mechanism:
UmkaOS uses static enum dispatch (not a vtable or dyn SchedClassOps). Each task stores a SchedClass enum field. The scheduler's hot path uses a match statement on this enum, which the compiler can optimize to a direct jump table. The SchedClassOps trait (defined above) exists only as a documentation aid for the per-class method signatures; no dyn SchedClassOps or vtable pointer appears at runtime.
Rationale for static enum dispatch over vtable:
- Zero indirection: enum match compiles to a jump table (O(1) branch predictor-friendly); vtable dispatch requires a pointer dereference before the call. On x86-64, this eliminates 1 cache miss per scheduling decision.
- LTO-friendly: The compiler can inline small per-class operations (e.g., SCHED_IDLE.pick_next_task() always returns None if the idle task is the only runnable task). Vtable calls prevent inlining across compilation units.
- No runtime registration: SchedClass is fixed at compile time. New scheduling classes require a kernel rebuild, not a runtime module. This avoids the race conditions and validation overhead of dynamically registered scheduling classes.
/// Scheduling class. Stored in Task; determines all scheduling decisions.
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum SchedClass {
/// EEVDF (Eligible Earliest Virtual Deadline First).
/// For all normal (CFS) tasks. Provides fair-share CPU time with
/// latency-nice configurable slice sizes.
Eevdf = 0,
/// POSIX SCHED_FIFO. Run until preempted by higher-priority RT task,
/// blocked, or explicitly yields. Static priority 1-99.
RtFifo = 1,
/// POSIX SCHED_RR. Like SCHED_FIFO but with timeslices.
RtRr = 2,
/// POSIX SCHED_DEADLINE. CBS (Constant Bandwidth Server). Specified
/// by (runtime_us, deadline_us, period_us) at sched_setattr() time.
Deadline = 3,
/// SCHED_IDLE. Lower priority than any Eevdf task. Used for background
/// maintenance tasks (garbage collection, defragmentation, telemetry).
Idle = 4,
}
/// User-visible scheduling policy. Stored in `EevdfTask.sched_policy` and
/// returned by `sched_getscheduler(2)` / `sched_getattr(2)`. Multiple
/// policies may map to the same `SchedClass` but differ in behavioral details.
///
/// Matches Linux `SCHED_*` constants for ABI compatibility.
#[repr(u32)]
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum UserSchedPolicy {
/// SCHED_NORMAL (0). Standard time-sharing (EEVDF). Preemptible by newly
/// woken peers if they have earlier virtual deadlines.
Normal = 0,
/// SCHED_FIFO (1). Real-time FIFO — runs until blocked or preempted by
/// higher-priority RT task.
Fifo = 1,
/// SCHED_RR (2). Real-time round-robin — FIFO with per-priority timeslices.
Rr = 2,
/// SCHED_BATCH (3). CPU-intensive batch processing. Uses EEVDF but with
/// a behavioral difference: batch tasks are never preempted by newly woken
/// EEVDF peers (only by RT/DL tasks). This avoids unnecessary context
/// switches for throughput-oriented workloads.
Batch = 3,
/// SCHED_IDLE (5). Extremely low priority. Only runs when no other
/// non-idle task is runnable. Linux ABI value 5 (value 4 is unused).
Idle = 5,
/// SCHED_DEADLINE (6). Earliest Deadline First with CBS bandwidth
/// reservation. Linux ABI value 6.
Deadline = 6,
}
// In the scheduler hot path, the canonical `pick_next_task()` defined above
// (§ EEVDF pick_next_task) is called. The enum dispatch `match` statement in
// the scheduler tick and context switch paths calls into that function, which
// already implements the full priority order:
// Deadline > RT > CBS-guaranteed > EEVDF > Idle
// and returns `&Task` (never None -- the idle task is always available).
// See `pick_next_task(rq: &mut RunQueueData) -> &Task` for the full logic.
Per-class operations (called via match in the scheduler):
- enqueue(task): Add to the class-specific queue.
- dequeue(task): Remove from the class-specific queue.
- pick_next(): Select the next task to run.
- put_prev(task): Task is being descheduled; update per-class bookkeeping (e.g., EEVDF virtual time advance).
- check_preempt(task, new_task): Can new_task preempt task? Called when a new task becomes runnable.
- task_tick(task): Called on each scheduler tick for the running task.
Relationship between SchedClass (enum dispatch) and SchedPolicy (replaceable vtable):
The enum-dispatched pick_next_task() above is the fixed scheduling skeleton
that determines class priority ordering (DL > RT > CBS > EEVDF > Idle). This
skeleton is compiled into the kernel and is not replaceable at runtime.
The SchedPolicy trait (Section 19.9)
is a replaceable policy module that affects only the EEVDF scheduling class.
It is called from step 4 of pick_next_task(): the EEVDF branch delegates to the
active SchedPolicy module to select among eligible EEVDF tasks.
| Component | Dispatch | Replaceable? | Scope |
|---|---|---|---|
| DL class | Fixed enum match | No | Step 1 of pick_next_task |
| RT class | Fixed enum match | No | Step 2 |
| CBS servers | Fixed enum match | No | Step 3 |
| EEVDF class | SchedPolicy vtable | Yes (live evolution) | Step 4 |
| Idle | Fixed | No | Step 5 |
CBS servers use their own local EEVDF trees, which also delegate to the active
SchedPolicy for task selection within each server. The SchedPolicy module
receives a read-only SchedPolicyContext snapshot — it never accesses the
runqueue directly. See Section 19.9 for the full policy
module interface and live replacement protocol.
7.1.4.9 Task-to-Runqueue Lookup Protocol (Lockfree cpu_id + Retry)¶
Operations that need to lock a task's runqueue (try_to_wake_up, cgroup migration,
sched_setaffinity) use a lockfree protocol based on task.cpu_id: AtomicU32.
No pi_lock is needed for task pinning — pi_lock stays at level 45 exclusively
for its actual purpose (protecting HeldMutexes and priority inheritance chain
walking).
/// Lock the runqueue that `task` is currently enqueued on. Uses lockfree
/// optimistic lookup with bounded retry. The retry is bounded because
/// migration requires the same rq.lock we are acquiring — once we hold
/// the correct rq.lock, the task cannot escape.
///
/// # Algorithm
///
/// 1. Read `task.cpu_id.load(Acquire)` → `cpu`
/// 2. Acquire `rq_array[cpu].lock`
/// 3. Verify `task.cpu_id.load(Acquire) == cpu`
/// 4. If mismatch: release lock, goto 1
/// 5. If match: proceed with operation under rq.lock
///
/// The Acquire ordering on cpu_id ensures that if we observe cpu=X, all
/// memory writes that were visible to the CPU that last stored X are
/// visible to us. This prevents reading stale scheduling state after the
/// task was migrated.
///
/// # Why pi_lock is not needed
///
/// In Linux, `task_rq_lock()` acquires `pi_lock` then `rq->lock` to
/// prevent the task from being migrated between the CPU lookup and the
/// rq.lock acquisition. UmkaOS's `cpu_id: AtomicU32` + retry achieves
/// the same guarantee without pi_lock:
///
/// - Migration writes `task.cpu_id` under the source `rq.lock` (the
/// migration path acquires source rq.lock, dequeues, updates cpu_id,
/// releases source rq.lock, acquires dest rq.lock, enqueues).
/// - Our retry loop detects the stale cpu by re-reading cpu_id after
/// acquiring the (possibly wrong) rq.lock.
/// - The worst case is one retry per concurrent migration (rare).
fn lock_task_rq(task: &Task) -> RqLockGuard {
loop {
let cpu = task.cpu_id.load(Acquire);
let guard = rq_array[cpu as usize].lock();
// Re-check: did the task migrate while we were acquiring?
if task.cpu_id.load(Acquire) == cpu {
return guard;
}
// Task migrated — release wrong lock, retry.
drop(guard);
}
}
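The retry protocol can be exercised in a userspace model. The sketch below mirrors `lock_task_rq()` with `std` atomics and mutexes standing in for `rq_array` and `rq.lock`; all names and the two-CPU setup are illustrative:

```rust
use std::sync::atomic::{AtomicU32, Ordering::{Acquire, Release}};
use std::sync::{Mutex, MutexGuard};

// Userspace stand-in for rq_array: one lock per "runqueue".
static RQS: [Mutex<()>; 2] = [Mutex::new(()), Mutex::new(())];

struct Task {
    cpu_id: AtomicU32,
}

/// Mirror of lock_task_rq(): optimistic cpu_id read, lock, re-check, retry.
fn lock_task_rq(task: &Task) -> (u32, MutexGuard<'static, ()>) {
    loop {
        let cpu = task.cpu_id.load(Acquire); // optimistic lookup
        let guard = RQS[cpu as usize].lock().unwrap(); // acquire rq.lock
        if task.cpu_id.load(Acquire) == cpu {
            return (cpu, guard); // task cannot migrate while we hold its rq.lock
        }
        drop(guard); // stale cpu_id: task migrated, retry with the new value
    }
}

fn main() {
    let task = Task { cpu_id: AtomicU32::new(0) };
    // Simulate a migration that completed before our lookup: the migrator
    // stored the new cpu_id with Release under the source rq.lock.
    task.cpu_id.store(1, Release);
    let (cpu, _guard) = lock_task_rq(&task);
    assert_eq!(cpu, 1); // we hold the lock of the rq the task is actually on
}
```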
7.1.4.10 wake_up_new_task (Forked Task Activation)¶
After do_fork()/do_clone() completes task creation and cgroup attachment
(Section 8.1),
wake_up_new_task(child) introduces the child to the scheduler. This is the
single entry point that transitions a newly-forked task from "exists but not
runnable" to "eligible for scheduling."
/// Activate a newly-forked task. Called exactly once per fork, after
/// cgroup_post_fork() and before returning to the parent.
///
/// # Steps
///
/// 1. **Set initial vruntime** (EEVDF tasks only):
/// The child is placed via `place_entity(rq, child, ENQUEUE_INITIAL)`.
/// This sets `child.vruntime = avg_vruntime(rq)` (not parent's
/// vruntime — the child starts at the current weighted average).
/// The deadline is set to `vruntime + vslice / 2` (half-slice for
/// initial placement, matching Linux's `PLACE_DEADLINE_INITIAL`).
/// The half-slice gives the child a slight advantage — existing tasks
/// are on average halfway through their slice, so starting the child
/// at half-slice eases it into the competition without starvation.
/// Child inherits parent's nice value. Vlag is initialized to 0
/// (no accumulated credit or debt).
///
/// 2. **Select target CPU**: Place the child on the parent's current
/// CPU runqueue. This maximizes cache locality: the child's COW
/// page tables and TLB entries overlap with the parent's, and
/// running on the same CPU avoids cold-cache penalties. If the
/// parent's CPU is overloaded (runqueue length > 1.5× average),
/// the load balancer may select a nearby idle CPU in the same LLC
/// domain instead.
///
/// 3. **Insert into tree**: `place_entity()` has already computed the
/// child's vruntime and vdeadline. Insert into the `tasks_timeline`
/// tree (keyed by vdeadline). Eligibility is determined dynamically
/// by `pick_eevdf()` using `entity_eligible()`.
///
/// 4. **Enqueue**: Call the scheduling class's enqueue(child) to
/// insert the task into the appropriate per-CPU runqueue data
/// structure. Update rq.nr_running and load averages.
///
/// 5. **Check preemption**: If the child's vdeadline < current
/// task's vdeadline (i.e., the child is more urgent), set
/// `need_resched` on the current CPU's `CpuLocalBlock`
/// ([Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path)). The rescheduling
/// occurs at the next preemption point (syscall return or
/// interrupt exit).
///
/// 6. **Resched IPI** (cross-CPU case only): If step 2 placed the
/// child on a different CPU than the current CPU, send a resched
/// IPI to that CPU so it re-evaluates pick_next_task promptly.
///
/// # RT and Deadline tasks
///
/// RT tasks (SCHED_FIFO/SCHED_RR) inherit the parent's static
/// priority. The child is placed at the tail of the priority's
/// runqueue (FIFO) or gets a fresh timeslice (RR).
///
/// Deadline tasks (SCHED_DEADLINE) are not inherited via fork —
/// the child is demoted to SCHED_NORMAL (EEVDF) unless the parent
/// explicitly sets deadline parameters via sched_setattr() after
/// fork. This prevents accidental bandwidth over-commitment.
pub fn wake_up_new_task(child: &Task) {
let rq = select_task_rq(child);
init_new_task_vruntime(child, rq);
activate_task(rq, child, EnqueueFlags::ENQUEUE_INITIAL);
check_preempt_curr(rq, child);
}
/// Initialize a newly forked task's virtual runtime for EEVDF scheduling.
///
/// Delegates to `place_entity()` with `ENQUEUE_INITIAL`. This sets:
/// - `child.vruntime = avg_vruntime(rq)` — the child starts at the current
/// weighted average of the runqueue, not the parent's vruntime. Starting at
/// the average prevents a fork bomb from obtaining unfairly low vruntimes.
/// - `child.vdeadline = vruntime + vslice / 2` — half-slice initial placement
/// (matching Linux's `PLACE_DEADLINE_INITIAL`). Existing tasks are on
/// average halfway through their slice; half-slice eases the child into
/// competition without starving existing tasks.
/// - `child.vlag = 0` — no accumulated credit or debt.
///
/// For non-EEVDF classes (RT, DL), this function is a no-op — RT tasks inherit
/// the parent's static priority directly, and DL tasks are demoted to
/// SCHED_NORMAL at fork (DL parameters are not inheritable).
///
/// **Classification**: Evolvable (the placement formula is policy).
///
/// Linux equivalent: `task_fork_fair()` in `kernel/sched/fair.c`.
fn init_new_task_vruntime(child: &Task, rq: &mut RunQueue) {
match child.sched_policy {
SchedPolicy::Normal | SchedPolicy::Batch | SchedPolicy::Idle => {
let se = &mut child.sched_entity;
place_entity(&mut rq.cfs, se, EnqueueFlags::ENQUEUE_INITIAL);
}
// RT and DL tasks: no vruntime initialization needed.
// RT inherits parent's static priority (set in do_fork step 7a).
// DL was demoted to SCHED_NORMAL at fork (do_fork step 7a).
_ => {}
}
}
// ---------------------------------------------------------------------------
// resched_curr — Rescheduling Request with Urgency
// ---------------------------------------------------------------------------
/// Urgency level for rescheduling requests.
///
/// **UmkaOS-original abstraction** over Linux's two-function model
/// (`resched_curr()` + `resched_curr_lazy()`). Single function with enum
/// parameter provides the same zero-cost dispatch (branch on immediate)
/// while enabling ML policy and Evolvable scheduler modules to return
/// `ReschedUrgency` values — decoupling the decision (what urgency?)
/// from the mechanism (which flag to set). Extensible to future urgency
/// levels without adding new functions. Exhaustive `match` on the enum
/// ensures all callers handle new variants.
///
/// **Classification**: Evolvable. ML policy can tune per-cgroup urgency
/// via `ParamId::SchedReschedUrgency`.
#[repr(u8)]
pub enum ReschedUrgency {
/// Must reschedule soon. Sets `TIF_NEED_RESCHED` on the target CPU's
/// thread_info, which is checked at ALL preemption points (interrupt
/// return, syscall return, cond_resched(), preempt_enable()).
///
/// Use for: wakeup of higher-priority task, RT task ready, slice expiry,
/// cross-class preemption (RT preempts EEVDF).
Eager = 0,
/// Should reschedule when convenient. Sets `TIF_NEED_RESCHED_LAZY` on
/// the target CPU's thread_info, which is checked ONLY at voluntary
/// preemption points (return-to-user, cond_resched()) but NOT at
/// involuntary preemption points (interrupt return with preemption
/// enabled).
///
/// Use for: EEVDF eligibility change, load balancing suggestion, nice
/// change, cgroup weight update. These events benefit from rescheduling
/// but do not require immediate preemption — allowing the current task
/// to complete its kernel-mode work reduces unnecessary context switches
/// on throughput-oriented workloads.
Lazy = 1,
}
/// Request rescheduling on the CPU that owns `rq`.
///
/// Sets the appropriate thread-info flag on `rq.curr` based on `urgency`:
/// - `Eager`: `TIF_NEED_RESCHED` — checked at all preemption points.
/// Sends IPI to remote CPUs and to nohz_full local CPU.
/// - `Lazy`: `TIF_NEED_RESCHED_LAZY` — checked at voluntary preemption
/// + return-to-user only. Does NOT send IPI to remote CPUs (the target
/// CPU will notice the flag at its next voluntary preemption point).
/// Sends IPI only if the target CPU is nohz_full (no tick to notice the flag).
///
/// **Locking**: Caller must hold `rq.lock`.
///
/// **Classification**: Evolvable. The urgency assignment at each call site
/// is policy that can be tuned by ML. The mechanism (flag set + IPI) is
/// the same for both urgency levels and is unlikely to change.
fn resched_curr(rq: &RunQueue, urgency: ReschedUrgency) {
match urgency {
ReschedUrgency::Eager => {
set_tif_need_resched(rq.curr);
// Eager: IPI needed for remote CPUs (to preempt immediately) and
// nohz_full local CPU (no tick to notice the flag).
if rq.cpu != current_cpu() || rq.is_nohz_full() {
send_resched_ipi(rq.cpu);
}
}
ReschedUrgency::Lazy => {
set_tif_need_resched_lazy(rq.curr);
// Lazy: no IPI for remote CPUs. The target CPU will notice
// TIF_NEED_RESCHED_LAZY at its next voluntary preemption point
// (return-to-user, cond_resched()). Sending an IPI would defeat
// the purpose of lazy rescheduling — the whole point is "reschedule
// when convenient," not "interrupt what you're doing right now."
//
// Exception: nohz_full CPUs have no periodic tick, so they would
// never notice the lazy flag without an IPI. This matches Linux
// 6.12+ resched_curr_lazy() behavior (kernel/sched/core.c).
if rq.is_nohz_full() {
send_resched_ipi(rq.cpu);
}
}
}
}
// ---------------------------------------------------------------------------
// check_preempt_curr — Wakeup Preemption Check
// ---------------------------------------------------------------------------
/// Check whether a newly activated task should preempt the current task.
///
/// Dispatches to the appropriate scheduling class comparison:
/// - If `task`'s class has higher priority than `rq.curr`'s class
/// (RT > DL > Normal), set `need_resched` unconditionally.
/// - If both are in the same class, delegate to the class-specific
/// check (EEVDF: compare virtual deadline; RT: compare static priority;
/// DL: compare absolute deadline).
/// - If the current task's class is higher priority, no preemption.
///
/// Called from `wake_up_new_task()`, `try_to_wake_up()`, and
/// `sched_setscheduler()` (after changing a task's scheduling class).
///
/// Uses `resched_curr(rq, ReschedUrgency::Eager)` for cross-class preemption
/// (higher-priority class waking) and delegates to the class-specific check
/// for intra-class preemption (which may use either `Eager` or `Lazy`
/// depending on the scheduling class's policy).
fn check_preempt_curr(rq: &RunQueue, task: &Task) {
let task_prio = sched_class_priority(task.eevdf.sched_class);
let curr_prio = sched_class_priority(rq.curr.eevdf.sched_class);
if task_prio > curr_prio {
resched_curr(rq, ReschedUrgency::Eager);
return;
}
if task_prio == curr_prio {
// Delegate to class-specific preemption check via match dispatch
// (not a vtable call — consistent with enum-based class dispatch).
match task.eevdf.sched_class {
SchedClass::Eevdf => check_preempt_eevdf(rq, task),
SchedClass::RtFifo | SchedClass::RtRr => check_preempt_rt(rq, task),
SchedClass::Deadline => check_preempt_dl(rq, task),
SchedClass::Idle => { /* idle never preempts idle */ }
}
}
// Lower-priority class: no preemption.
}
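The comparison above assumes `sched_class_priority()` imposes a total order over classes. A minimal sketch consistent with the Deadline > RT > EEVDF > Idle ordering used by `check_preempt_curr()` (the numeric ranks are illustrative; only the ordering matters):

```rust
// Illustrative rank function: higher value = higher-priority class.
// FIFO and RR share a rank; their relative order is an intra-class
// question delegated to check_preempt_rt().
#[derive(Clone, Copy)]
enum SchedClass {
    Eevdf,
    RtFifo,
    RtRr,
    Deadline,
    Idle,
}

fn sched_class_priority(class: SchedClass) -> u8 {
    match class {
        SchedClass::Deadline => 3,
        SchedClass::RtFifo | SchedClass::RtRr => 2,
        SchedClass::Eevdf => 1,
        SchedClass::Idle => 0,
    }
}

fn main() {
    // Cross-class: an RT wakeup preempts a running EEVDF task (Eager resched).
    assert!(sched_class_priority(SchedClass::RtFifo)
        > sched_class_priority(SchedClass::Eevdf));
    // DL outranks RT; Idle never preempts anyone.
    assert!(sched_class_priority(SchedClass::Deadline)
        > sched_class_priority(SchedClass::RtRr));
    assert_eq!(sched_class_priority(SchedClass::Idle), 0);
    // FIFO vs RR is intra-class: equal rank, delegate to the class check.
    assert_eq!(sched_class_priority(SchedClass::RtFifo),
               sched_class_priority(SchedClass::RtRr));
}
```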
resched_curr call site urgency classification (exhaustive list across all
scheduling classes):
| Call site | Class | Urgency | Rationale |
|---|---|---|---|
| `update_curr()` deadline/slice expiry | EEVDF | Lazy | Task exceeded its protected slice; reschedule at next voluntary preemption point. Not urgent — the task is still running fairly. |
| `scheduler_tick()` ideal_runtime check | EEVDF | Lazy | Periodic fairness check. Lazy avoids interrupting kernel-mode work; the 1 ms tick already bounds latency. Matches Linux 6.12+ resched_curr_lazy(). |
| `check_preempt_curr()` cross-class | All | Eager | Higher-priority class task woke up (e.g., RT preempts EEVDF). Must preempt immediately. |
| `check_preempt_curr()` intra-class EEVDF | EEVDF | Lazy | Newly woken task has earlier virtual deadline than current. Preempt at convenience. |
| DL earliest_deadline preemption | DL | Eager | A deadline task with an earlier absolute deadline than the running DL task is ready. Deadline ordering is hard — must preempt now to meet the earlier deadline. |
| RT wakeup preemption | RT | Eager | A higher-static-priority RT task woke up. RT scheduling requires immediate preemption to maintain priority guarantees. |
| RT RR time-slice expiry | RT | Eager | Round-robin RT task exhausted its time quantum. Must yield immediately to the next RR task at the same priority (Linux task_tick_rt() calls resched_curr()). |
| CBS budget exhaustion (throttle) | CBS | Eager | CBS server budget depleted. Running task must yield to prevent bandwidth overrun beyond one tick period. Without Eager reschedule, the currently running task continues for up to one tick (1-4 ms) past budget exhaustion. |
| CBS server replenishment un-throttle | CBS | Eager | Throttled CBS server received budget replenishment. Tasks waiting for bandwidth should run promptly. |
| `sched_setscheduler()` class change | All | Eager | Task moved to a higher-priority class (e.g., SCHED_OTHER → SCHED_FIFO). Check preemption immediately. |
| Load balancer migration | EEVDF | Lazy | Load balance suggests migrating a task. Not urgent — the source CPU's work is not affected. |
| Nice/weight change | EEVDF | Lazy | Task's weight changed (renice, cgroup cpu.weight update). Reschedule when convenient. |
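The Eager/Lazy distinction in the table reduces to which preemption points honor which flag. A userspace model of that check (the struct and field names stand in for the thread_info bits; not kernel code):

```rust
// Model of the two resched flags. TIF_NEED_RESCHED is honored at every
// preemption point; TIF_NEED_RESCHED_LAZY only at voluntary ones
// (return-to-user, cond_resched()).
#[derive(Default)]
struct ThreadInfo {
    need_resched: bool,      // stands in for TIF_NEED_RESCHED
    need_resched_lazy: bool, // stands in for TIF_NEED_RESCHED_LAZY
}

fn should_resched(ti: &ThreadInfo, voluntary_point: bool) -> bool {
    ti.need_resched || (voluntary_point && ti.need_resched_lazy)
}

fn main() {
    // Lazy request (e.g., nice change): survives an interrupt return,
    // honored at the next cond_resched().
    let mut ti = ThreadInfo::default();
    ti.need_resched_lazy = true;
    assert!(!should_resched(&ti, false)); // interrupt return: keep running
    assert!(should_resched(&ti, true));   // cond_resched(): switch now

    // Eager request (e.g., RT wakeup): honored everywhere.
    ti = ThreadInfo::default();
    ti.need_resched = true;
    assert!(should_resched(&ti, false));
}
```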
Fork placement via place_entity(): The child's vruntime is set to
avg_vruntime(rq) (the current weighted average, not the parent's vruntime).
The virtual deadline uses vslice / 2 for initial placement
(PLACE_DEADLINE_INITIAL). The child's vlag is initialized to 0 (no saved
credit/debt). This matches Linux's place_entity() with ENQUEUE_INITIAL.
The ML scheduler policy (Section 23.1)
can tune the effective slice via ParamId::SchedEevdfWeightScale.
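The ENQUEUE_INITIAL arithmetic described above reduces to two assignments. A toy numeric check (the nanosecond values are illustrative, and the real `avg_vruntime()` is load-weighted, not a plain average):

```rust
// Toy model of ENQUEUE_INITIAL placement:
//   vruntime  = avg_vruntime(rq)        (child joins at the average)
//   vdeadline = vruntime + vslice / 2   (PLACE_DEADLINE_INITIAL half-slice)
//   vlag      = 0                       (no inherited credit or debt)
fn place_initial(avg_vruntime_ns: u64, vslice_ns: u64) -> (u64, u64, i64) {
    let vruntime = avg_vruntime_ns;
    let vdeadline = vruntime + vslice_ns / 2;
    let vlag = 0i64;
    (vruntime, vdeadline, vlag)
}

fn main() {
    // Runqueue average at 10 ms of virtual time, 3 ms virtual slice:
    let (vr, vd, vlag) = place_initial(10_000_000, 3_000_000);
    assert_eq!(vr, 10_000_000);
    assert_eq!(vd, 11_500_000); // vruntime + vslice/2
    assert_eq!(vlag, 0);
}
```

Starting at the average (rather than inheriting the parent's vruntime) is what defeats fork bombs: no amount of forking can manufacture an unfairly low vruntime.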
7.1.4.11 select_task_rq (CPU Selection for Task Placement)¶
select_task_rq() selects the optimal CPU runqueue for placing a task. It is
called on three paths: new task activation (wake_up_new_task), task wakeup
(try_to_wake_up), and explicit migration (sched_setaffinity). The function
delegates to the task's scheduling class for class-specific CPU selection logic,
then applies cross-class constraints (cpuset, affinity mask).
/// Select the target CPU runqueue for a task.
///
/// # Algorithm (EEVDF / Normal class)
///
/// 1. **Affinity mask filter**: Restrict candidates to CPUs in the task's
/// `cpus_allowed` mask (set by `sched_setaffinity` or inherited from
/// the cpuset cgroup). If the mask has exactly one CPU, return it
/// immediately (pinned task).
///
/// 2. **Idle CPU preference**: Scan for idle CPUs in the task's last-run
/// LLC (Last-Level Cache) domain first. An idle CPU avoids runqueue
/// contention and provides immediate execution. Idle scan uses the
/// per-LLC `idle_cpumask` bitmap (updated atomically by the idle loop
/// and `pick_next_task`).
///
/// 3. **Cache warmth**: If no idle CPU is found in the LLC domain, prefer
/// the CPU where the task last ran (`task.last_cpu`). Warm cache lines
/// (TLB entries, L1/L2 data) avoid cold-start penalties of ~10-50 us
/// on modern hardware. The benefit is estimated as:
/// cache_benefit_ns = sched_cache_hot_ns * (1 - time_since_last_run / decay_ns)
/// where `sched_cache_hot_ns` defaults to 2,500,000 ns (2.5 ms).
///
/// 4. **NUMA distance**: For NUMA systems, penalize CPUs on remote NUMA
/// nodes proportional to the NUMA distance (from the SLIT table).
/// A remote node adds `remote_penalty_ns` (default: proportional to
/// NUMA distance) to the placement cost. This ensures tasks stay
/// NUMA-local unless a remote CPU is idle and the local node is
/// overloaded.
///
/// 5. **Energy-Aware Scheduling (EAS)**: On heterogeneous platforms
/// (big.LITTLE, hybrid P/E cores), EAS evaluates the energy cost of
/// placing the task on each candidate CPU using the Energy Model
/// ([Section 7.1](#scheduler--energy-aware-scheduling-eas)). The CPU with the
/// lowest incremental energy cost is preferred, subject to a latency
/// constraint: if the energy-optimal CPU would delay the task's
/// wakeup by more than `eas_latency_budget_ns`, the faster CPU wins.
///
/// 6. **Load balance tiebreaker**: Among equally-scored candidates,
/// prefer the CPU with the shortest runqueue (fewest `nr_running`).
///
/// # RT class
/// Scans for the lowest-numbered CPU in the affinity mask that is not
/// currently running a higher-priority RT task (`rt_rq.highest_prio`).
///
/// # Deadline class
/// Uses `dl_bw` (deadline bandwidth) to find a CPU with sufficient
/// remaining bandwidth in the task's root domain. Falls back to the
/// task's affinity mask if all CPUs are over-committed.
pub fn select_task_rq(task: &Task) -> &mut RunQueue;
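The step-3 cache warmth estimate can be computed in integer arithmetic to avoid floating point in the kernel. A sketch (the `decay_ns` window value used here is an assumed tunable, not defined above):

```rust
// Integer form of: cache_benefit_ns = hot * (1 - t/decay)
//                                   = hot * (decay - t) / decay
// Saturates to 0 once the task has been off-CPU longer than the decay window.
fn cache_benefit_ns(sched_cache_hot_ns: u64, since_last_run_ns: u64, decay_ns: u64) -> u64 {
    if since_last_run_ns >= decay_ns {
        return 0; // cache assumed fully cold past the decay window
    }
    sched_cache_hot_ns * (decay_ns - since_last_run_ns) / decay_ns
}

fn main() {
    let hot = 2_500_000; // default sched_cache_hot_ns (2.5 ms)
    let decay = 10_000_000; // assumed 10 ms decay window
    assert_eq!(cache_benefit_ns(hot, 0, decay), hot); // just ran: full benefit
    assert_eq!(cache_benefit_ns(hot, 5_000_000, decay), hot / 2); // half decayed
    assert_eq!(cache_benefit_ns(hot, 20_000_000, decay), 0); // fully cold
}
```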
7.1.4.12 WakeFlags¶
/// Bitflags controlling wakeup behavior in `try_to_wake_up()`.
/// Passed by callers (signal delivery, futex wake, I/O completion) to
/// influence CPU selection and throttle bypass decisions.
bitflags! {
pub struct WakeFlags: u32 {
/// Synchronous wakeup hint: the waker will soon block, so placing
/// the wakee on the waker's CPU improves cache reuse. Used by
/// pipe write, futex wake, and unix socket send. The scheduler
/// treats this as a placement hint, not a guarantee — it may be
/// overridden by load balancing or NUMA distance penalties.
const SYNC = 1 << 0;
        /// Bypass CBS (Constant Bandwidth Server) throttle. Used by
        /// SIGKILL delivery to ensure the target task is scheduled
/// immediately regardless of bandwidth limits. Without this
/// flag, a bandwidth-throttled cgroup could delay SIGKILL
/// processing indefinitely.
const BYPASS_CBS = 1 << 1;
}
}
7.1.4.13 activate_task (Sleeping-to-Runnable Transition)¶
activate_task() transitions a task from sleeping (or newly created) to runnable
by inserting it into the target runqueue. It is the counterpart of deactivate_task()
which removes a task from its runqueue when it blocks.
/// Move a task from sleeping/new state to runnable on the specified runqueue.
///
/// # Steps
///
/// 1. **Set task state**: Transition `task.state` from `TASK_SLEEPING` (or
/// `TASK_NEW` for freshly forked tasks) to `TASK_RUNNING`. The state
/// transition uses `Ordering::Release` to ensure all task initialization
/// writes are visible to the runqueue's CPU before the task becomes
/// schedulable.
///
/// 2. **Enqueue into scheduling class**: Call the task's scheduling class
/// enqueue operation:
/// - **EEVDF**: Call `place_entity(rq, task, flags)` to compute
/// `vruntime` and `vdeadline` from saved vlag. Insert into
/// `tasks_timeline` (keyed by vdeadline). Eligibility is determined
/// dynamically by `pick_eevdf()`. Update `rq.nr_running` and `rq.load_weight`.
/// - **RT (FIFO)**: Insert at the tail of the priority's linked list
/// in the `rt_rq` bitmap-indexed array.
/// - **RT (RR)**: Same as FIFO, plus initialize the RR timeslice to
/// `sched_rr_timeslice_ms` (default: 100 ms).
/// - **Deadline**: Insert into the `dl_rq` red-black tree ordered by
/// absolute deadline. Update `dl_bw` accounting.
///
/// 3. **Update runqueue statistics**: Increment `rq.nr_running`.
/// Update `rq.load_weight` (sum of task weights for EEVDF load
/// balancing). Update `rq.avg_load` PELT (Per-Entity Load Tracking)
/// contribution.
///
/// 4. **NUMA statistics**: If the task's preferred NUMA node differs from
/// the runqueue's CPU NUMA node, increment `rq.nr_numa_foreign` (used
/// by the NUMA balancer to trigger migration).
///
/// # Flags
///
/// `flags: EnqueueFlags` controls placement behavior:
/// - `ENQUEUE_WAKEUP`: Task is waking from sleep. `place_entity()` uses
/// the saved `vlag` with PLACE_LAG inflation to position the task in
/// virtual time. No CFS-era "sleep bonus" — EEVDF uses lag-based
/// placement exclusively.
/// - `ENQUEUE_INITIAL`: New task (fork). `place_entity()` sets vruntime
/// to avg_vruntime and halves the virtual slice for the deadline.
/// - `ENQUEUE_MIGRATED`: Task was migrated from another CPU (skip cache
/// warmth assumptions).
/// - `ENQUEUE_RESTORE`: Re-enqueue after priority change (preserve
/// existing vruntime, do not recompute from vlag).
pub fn activate_task(rq: &mut RunQueue, task: &Task, flags: EnqueueFlags);
/// The inverse of activate_task: remove a task from its runqueue when
/// it blocks (sleep, wait, I/O). Updates rq.nr_running, load_weight,
/// and PELT statistics. The task's state is set to TASK_SLEEPING by
/// the caller before calling deactivate_task.
pub fn deactivate_task(rq: &mut RunQueue, task: &Task);
See also: Section 8.4 (Real-Time Guarantees) extends deadline scheduling with bounded-latency paths, threaded interrupts, and PREEMPT_RT-style priority inheritance for hard real-time workloads.
ML tuning: Key EEVDF parameters (`eevdf_weight_scale`, `migration_benefit_threshold`, `eas_energy_bias`, `preemption_latency_budget`) are registered in the Kernel Tunable Parameter Store and may be adjusted at runtime by Tier 2 AI/ML policy services via the closed-loop framework defined in Section 23.1. The scheduler emits `SchedObs` observations (task wakeup latency, EAS decisions, runqueue stats) that feed the `umka-ml-sched` Tier 2 service. All parameters revert to defaults within 60 seconds if the ML service stops sending updates.
7.2 Heterogeneous CPU Support (big.LITTLE / Intel Hybrid / RISC-V)¶
Modern SoCs are no longer symmetric. ARM big.LITTLE (2011+), Intel Alder Lake P-core/E-core (2021+), and RISC-V platforms with mixed hart types all present the scheduler with CPUs of different performance, power, and ISA capabilities. A scheduler that treats all CPUs as identical will either waste power (running background tasks on performance cores) or starve throughput (placing compute-heavy tasks on efficiency cores).
This section extends the scheduler with Energy-Aware Scheduling (EAS), per-CPU capacity tracking, and heterogeneous topology awareness.
7.2.1 CPU Capacity Model¶
Every CPU has a capacity value normalized to a 0–1024 scale, where the fastest core at its highest frequency = 1024. This is the fundamental abstraction that makes the scheduler heterogeneity-aware.
// umka-core/src/sched/capacity.rs
/// Per-CPU capacity descriptor.
/// Populated at boot from firmware tables (ACPI PPTT, devicetree, CPPC).
/// Updated at runtime when frequency changes.
pub struct CpuCapacity {
/// Maximum capacity of this CPU at its highest OPP (Operating Performance Point).
/// Normalized: fastest core in the system = 1024.
/// An efficiency core might be 512 (half the throughput of a performance core).
pub capacity: u32,
/// Original (boot-time) maximum capacity. Does not change.
pub capacity_max: u32,
/// Current capacity, adjusted for current frequency.
/// If a 1024-capacity core is running at 50% frequency, capacity_curr = 512.
/// Updated by cpufreq governor on frequency change.
///
/// **Memory ordering**: Cpufreq governor writes with `Release` after
/// updating the frequency hardware registers. Scheduler reads with
/// `Relaxed` in the misfit check (stale value for one tick is
/// acceptable; re-evaluated every load balance interval ~4ms).
/// EAS energy computation reads with `Acquire` to pair with the
/// governor's `Release`, ensuring placement decisions reflect the
/// actual frequency state.
pub capacity_curr: AtomicU32,
/// Core type classification.
pub core_type: CoreType,
/// Frequency domain this CPU belongs to.
/// All CPUs in a frequency domain share the same clock.
pub freq_domain: FreqDomainId,
/// ISA capabilities of this CPU.
/// On heterogeneous ISA systems (RISC-V), different cores may support
/// different extensions.
pub isa_caps: IsaCapabilities,
/// Microarchitecture ID (for Intel Thread Director).
/// Different core types have different uarch IDs.
pub uarch_id: u32,
}
/// Core type classification.
#[repr(u32)]
pub enum CoreType {
/// ARM Cortex-X/A7x, Intel P-core.
/// High single-thread performance, high power.
Performance = 0,
/// ARM Cortex-A5x, Intel E-core.
/// Lower performance, significantly lower power.
Efficiency = 1,
/// ARM Cortex-A7x mid-tier (e.g., Cortex-A78 in a system with X3 and A510).
Mid = 2,
/// Traditional SMP — all cores identical.
/// When all cores are Symmetric, EAS is disabled (unnecessary).
Symmetric = 3,
}
/// ISA capability flags.
/// On heterogeneous ISA systems, the scheduler must ensure a task only runs
/// on a CPU that supports the ISA features the task uses.
bitflags! {
pub struct IsaCapabilities: u64 {
// ARM
const ARM_SVE = 1 << 0; // Scalable Vector Extension
const ARM_SVE2 = 1 << 1; // SVE2
const ARM_SME = 1 << 2; // Scalable Matrix Extension
const ARM_MTE = 1 << 3; // Memory Tagging Extension
// x86
const X86_AVX512 = 1 << 16; // AVX-512 (P-cores only on some Intel)
const X86_AMX = 1 << 17; // Advanced Matrix Extensions (P-cores only)
const X86_AVX10 = 1 << 18; // AVX10 (unified AVX across core types)
// RISC-V
const RV_V = 1 << 32; // Vector extension
const RV_B = 1 << 33; // Bit manipulation
const RV_H = 1 << 34; // Hypervisor extension
const RV_CRYPTO = 1 << 35; // Cryptography extensions
}
}
/// Vector length metadata — companion to IsaCapabilities for variable-length
/// vector ISAs (ARM SVE/SVE2, RISC-V V). The bitflags above indicate *presence*
/// of the extension; this struct encodes the *vector register width* that the
/// thread actually uses, which determines migration constraints and XSAVE area size.
#[repr(C)]
pub struct VectorLengthInfo {
/// ARM SVE/SVE2 vector length in bits (128-2048, must be power of 2).
/// 0 = thread does not use SVE. Discovered per-core via `rdvl` at boot.
pub sve_vl_bits: u16,
/// RISC-V VLEN in bits (32-65536). 0 = thread does not use RVV.
/// Discovered per-hart via `vlenb` CSR at boot.
/// Uses u32 because the RISC-V V spec allows VLEN up to 65536 bits,
/// which equals u16::MAX + 1 and would overflow a u16.
pub rvv_vlen_bits: u32,
}
// kernel-internal, not KABI. Layout: 2 + 2(pad) + 4 = 8 bytes.
const_assert!(size_of::<VectorLengthInfo>() == 8);
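The layout comment can be verified standalone. A mirrored definition with the assertion spelled out (with `#[repr(C)]`, Rust inserts two bytes of padding after `sve_vl_bits` so that `rvv_vlen_bits` is 4-byte aligned):

```rust
use core::mem::{align_of, size_of};

/// Mirror of VectorLengthInfo above, reproduced so the layout claim
/// ("2 + 2(pad) + 4 = 8 bytes") can be checked in isolation.
#[repr(C)]
pub struct VectorLengthInfo {
    pub sve_vl_bits: u16,   // 2 bytes
    // 2 bytes of implicit padding here: rvv_vlen_bits needs 4-byte alignment
    pub rvv_vlen_bits: u32, // 4 bytes
}

fn main() {
    assert_eq!(size_of::<VectorLengthInfo>(), 8);
    assert_eq!(align_of::<VectorLengthInfo>(), 4); // driven by the u32 field
}
```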
Key design property: On a fully symmetric system (all cores CoreType::Symmetric),
the capacity model is a no-op. All CPUs have capacity 1024, all have the same ISA
capabilities. The scheduler fast path sees capacity_curr == 1024 on every CPU and
skips all heterogeneous logic. Zero overhead on symmetric systems.
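The `capacity_curr` update on frequency change is a proportional rescale of `capacity_max`, since capacity scales linearly with frequency within a core type. A sketch of the governor-side computation (the function name is illustrative):

```rust
// capacity_curr = capacity_max * freq / max_freq, computed in u64 to
// avoid overflow before narrowing back to the 0-1024 u32 scale.
fn capacity_at_freq(capacity_max: u32, freq_khz: u32, max_freq_khz: u32) -> u32 {
    ((capacity_max as u64 * freq_khz as u64) / max_freq_khz as u64) as u32
}

fn main() {
    // A 1024-capacity core running at 50% of its max frequency:
    assert_eq!(capacity_at_freq(1024, 1_200_000, 2_400_000), 512);
    // A 512-capacity efficiency core at full frequency keeps capacity 512:
    assert_eq!(capacity_at_freq(512, 1_600_000, 1_600_000), 512);
}
```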
7.2.2 Energy Model¶
The energy model describes the power cost of running a workload at each performance level on each core type. It is the foundation of Energy-Aware Scheduling.
// umka-core/src/sched/energy.rs
/// Maximum number of Operating Performance Points per frequency domain.
/// 32 entries exceeds all known hardware OPP tables (typical: 8-20).
/// No downsampling is needed for current hardware.
pub const MAX_OPP_ENTRIES: usize = 32;
/// Energy model for one frequency domain.
/// A frequency domain is a group of CPUs that share the same clock.
/// All CPUs in a domain have the same core type and OPP table.
pub struct EnergyModel {
/// Which frequency domain this model covers.
pub freq_domain: FreqDomainId,
/// Core type of CPUs in this domain.
pub core_type: CoreType,
/// Number of CPUs in this domain.
pub cpu_count: u32,
/// Operating Performance Points, sorted by frequency (ascending).
/// Each OPP maps a frequency to a capacity and power cost.
    /// Fixed-capacity inline array avoids heap allocation and keeps OPP
    /// data cache-local. MAX_OPP_ENTRIES (32) accommodates all known
    /// hardware OPP tables.
pub opps: ArrayVec<OppEntry, MAX_OPP_ENTRIES>, // MAX_OPP_ENTRIES = 32
}
/// One Operating Performance Point.
pub struct OppEntry {
/// Frequency in kHz.
pub freq_khz: u32,
/// Capacity at this frequency (0–1024 scale).
/// Capacity scales linearly with frequency within a core type.
pub capacity: u32,
/// Power consumption at this frequency (milliwatts).
/// This is the DYNAMIC power for one CPU running at 100% utilization.
/// Power scales roughly as V²×f (voltage² × frequency).
pub power_mw: u32,
}
OPP table population at boot: OPP tables are populated during boot Phase 2.3
(post-ACPI/DT parse). Sources: ACPI PPTT (x86), device tree opp-table node
(ARM/RISC-V), CPPC (ACPI 6.0+). For each frequency domain, the boot code
enumerates available OPPs and fills the opps: ArrayVec<OppEntry, MAX_OPP_ENTRIES>
in ascending frequency order. Fallback: if no firmware power data is available
for a frequency domain, power_mw is set to 0 for all OPPs in that domain
and EAS (Energy-Aware Scheduling) is disabled — the scheduler falls back to pure
EEVDF without energy-aware placement. An FMA warning is logged:
"EAS disabled: no power data for CPU cluster {cluster_id}". This ensures EAS
never makes placement decisions based on fabricated power data.
Note: RAPL (Running Average Power Limit) on x86 provides real-time power monitoring for capping/budgeting but does NOT feed into EAS OPP `power_mw` values. EAS uses firmware-provided static power estimates (ACPI PPTT / CPPC / device tree `opp-table`).
Example: ARM big.LITTLE system (Cortex-X3 + Cortex-A510)
Performance cores (Cortex-X3), freq_domain 0:
OPP 0: 600 MHz, capacity 256, power 80 mW
OPP 1: 1200 MHz, capacity 512, power 280 mW
OPP 2: 1800 MHz, capacity 768, power 650 mW
OPP 3: 2400 MHz, capacity 1024, power 1200 mW
Efficiency cores (Cortex-A510), freq_domain 1:
OPP 0: 400 MHz, capacity 100, power 15 mW
OPP 1: 800 MHz, capacity 200, power 50 mW
OPP 2: 1200 MHz, capacity 300, power 110 mW
OPP 3: 1600 MHz, capacity 400, power 200 mW
Observation: for a task with utilization 200 (out of 1024), EAS evaluates the lowest OPP that fits on each core type:
On a performance core, the lowest fitting OPP is OPP 0 (capacity 256): 80 mW.
On an efficiency core, the lowest fitting OPP is OPP 1 (capacity 200): 50 mW.
→ Efficiency core wins (50 mW < 80 mW). EAS places the task on the efficiency core.
But a task with utilization 500:
→ Doesn't fit on any efficiency core OPP (max capacity 400).
→ Must go to performance core. EAS picks lowest OPP that fits: OPP 1 (512), 280 mW.
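The fitting rule in these examples can be sketched as a standalone function. The types and helper below are illustrative (not the kernel's `OppEntry`); OPPs are assumed sorted by ascending capacity, as in the tables above:

```rust
#[derive(Clone, Copy)]
struct Opp {
    capacity: u32,
    power_mw: u32,
}

/// Lowest OPP (ascending capacity order) whose capacity covers `util`, if any.
fn lowest_fitting_opp(opps: &[Opp], util: u32) -> Option<Opp> {
    opps.iter().copied().find(|o| o.capacity >= util)
}

fn main() {
    // Cortex-X3 (performance) and Cortex-A510 (efficiency) tables from the text.
    let perf = [
        Opp { capacity: 256, power_mw: 80 },
        Opp { capacity: 512, power_mw: 280 },
        Opp { capacity: 768, power_mw: 650 },
        Opp { capacity: 1024, power_mw: 1200 },
    ];
    let eff = [
        Opp { capacity: 100, power_mw: 15 },
        Opp { capacity: 200, power_mw: 50 },
        Opp { capacity: 300, power_mw: 110 },
        Opp { capacity: 400, power_mw: 200 },
    ];
    // util = 200: efficiency OPP 1 (50 mW) beats performance OPP 0 (80 mW).
    assert_eq!(lowest_fitting_opp(&perf, 200).unwrap().power_mw, 80);
    assert_eq!(lowest_fitting_opp(&eff, 200).unwrap().power_mw, 50);
    // util = 500: no efficiency OPP fits; performance OPP 1 (280 mW) is chosen.
    assert!(lowest_fitting_opp(&eff, 500).is_none());
    assert_eq!(lowest_fitting_opp(&perf, 500).unwrap().power_mw, 280);
}
```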
7.2.3 Energy-Aware Scheduling Algorithm¶
EAS runs at task wakeup time (the most impactful scheduling decision). It answers: "Which CPU should this task run on to minimize total system energy while meeting performance requirements?"
// umka-core/src/sched/eas.rs
pub struct EnergyAwareScheduler {
/// Energy models for all frequency domains.
/// Boot-allocated contiguous array (one per frequency domain), not heap Vec.
/// Indexed by frequency domain ID. Length = number of frequency domains
/// discovered at boot. Stored as a `BootVec<EnergyModel>` (boot-time-allocated,
/// fixed-size-after-init, no heap pointer indirection on the wakeup fast path).
energy_models: BootVec<EnergyModel>,
/// Per-CPU utilization (PELT, see Section 7.1.5.4).
/// Boot-allocated contiguous array (one per CPU), not heap Vec.
/// Indexed by CPU ID. Stored as `BootVec<CacheAligned<AtomicU32>>` —
/// NOT `PerCpu<T>` because `find_energy_efficient_cpu()` reads remote
/// CPUs' utilization (`PerCpu<T>::get()` returns only the current CPU's
/// copy). `CacheAligned` ensures each entry sits on its own cache line
/// to prevent false sharing between CPUs updating their own utilization.
cpu_util: BootVec<CacheAligned<AtomicU32>>,
/// Threshold: a task is "misfit" if its utilization exceeds
/// the capacity of the CPU it's running on.
/// Misfit tasks are migrated to higher-capacity CPUs.
misfit_threshold: u32,
/// EAS is disabled on fully symmetric systems (no benefit).
enabled: bool,
}
impl EnergyAwareScheduler {
/// Find the most energy-efficient CPU for a waking task.
/// Called from EEVDF enqueue path when EAS is enabled.
///
/// Algorithm:
/// 1. For each frequency domain:
/// a. Compute the new utilization if this task were placed here.
/// b. Find the lowest OPP that can handle the new utilization.
/// c. Compute energy cost = OPP power × (new_util / capacity).
/// 2. Pick the frequency domain with the lowest energy cost.
/// 3. Within that domain, pick the CPU with the most spare capacity
/// (to avoid unnecessary frequency increases).
///
/// Complexity: O(domains × OPPs). Typically 2-3 domains × 4-6 OPPs = 8-18 iterations.
/// Time: ~200-500ns. Acceptable for task wakeup path (~2000ns total).
pub fn find_energy_efficient_cpu(&self, task_util: u32) -> CpuId {
let mut best_energy = u64::MAX;
let mut best_cpu = CpuId(0);
for model in &self.energy_models {
// Can this domain handle the task at all?
let max_capacity = model.opps.last().map(|o| o.capacity).unwrap_or(0);
if task_util > max_capacity {
// Task doesn't fit on this core type. Because `util_avg` is clamped
// to 1024, the highest-capacity domain always fits, so `best_cpu`
// is never left at its `CpuId(0)` default.
continue;
}
// Compute energy cost for placing task in this domain.
let energy = self.compute_energy(model, task_util);
if energy < best_energy {
best_energy = energy;
best_cpu = self.find_idlest_cpu_in_domain(model.freq_domain);
}
}
best_cpu
}
/// Estimate energy cost of adding `task_util` to a frequency domain.
///
/// OPP selection uses the maximum per-CPU utilization in the domain (not
/// the aggregate), because frequency is shared across all CPUs in a DVFS
/// domain — the OPP must be high enough for the most loaded CPU.
fn compute_energy(&self, model: &EnergyModel, task_util: u32) -> u64 {
// Find max per-CPU utilization in this domain. Assumes the task
// will be placed on the idlest CPU (same heuristic as
// find_idlest_cpu_in_domain), so task_util is added to that
// CPU's utilization when computing the domain's max.
let max_cpu_util = self.max_cpu_utilization(model.freq_domain, task_util);
// Find lowest OPP whose capacity can handle the busiest CPU.
// OPPs are sorted by ascending capacity; use binary search (O(log N)).
let idx = model.opps.partition_point(|o| o.capacity < max_cpu_util);
let opp = model.opps.get(idx).unwrap_or(model.opps.last().unwrap());
// Energy = power × (sum of all CPU utilizations) / capacity.
// Power is determined by the OPP (selected by max CPU), but energy
// is proportional to total work done across all CPUs in the domain.
// `domain_utilization()` returns the sum of PELT utilization_avg
// across all CPUs in the frequency domain (a u32 clamped to 1024
// per CPU × number of CPUs in the domain).
let domain_util = self.domain_utilization(model.freq_domain) + task_util;
(opp.power_mw as u64) * (domain_util as u64) / (opp.capacity as u64)
}
}
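A standalone numeric instance of the energy estimate above, using OPP values from the big.LITTLE example tables earlier in this section. The domain utilizations assumed here are illustrative:

```rust
/// energy = OPP power × total domain utilization / OPP capacity,
/// as in compute_energy() above. The OPP is selected by the busiest CPU;
/// energy scales with the total work done across the domain.
fn energy(opp_power_mw: u64, domain_util: u64, opp_capacity: u64) -> u64 {
    opp_power_mw * domain_util / opp_capacity
}

fn main() {
    // Busiest efficiency CPU would reach util 180, so OPP 1 (capacity 200,
    // 50 mW) suffices. Assumed total domain utilization incl. the task = 300.
    assert_eq!(energy(50, 300, 200), 75);
    // The same load on the performance domain needs only OPP 0 (capacity 256,
    // 80 mW), yet still costs more: 80 × 300 / 256 = 93 (integer division).
    assert_eq!(energy(80, 300, 256), 93);
    // EAS picks the cheaper domain: 75 < 93 → efficiency cores.
    assert!(energy(50, 300, 200) < energy(80, 300, 256));
}
```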
When EAS is NOT used (symmetric systems, or when all cores are of the same type),
the standard EEVDF load balancer runs instead. In that case EAS adds negligible
overhead: enabled == false, so the cost is a single well-predicted branch at the
top of the wakeup path.
7.2.4 Per-Entity Load Tracking (PELT)¶
EAS needs accurate, up-to-date utilization data for each task and each CPU. PELT provides this with an exponentially-decaying average that balances responsiveness with stability.
// umka-core/src/sched/pelt.rs
// ---------------------------------------------------------------------------
// PELT constants and decay lookup table (Gap 2.13)
// ---------------------------------------------------------------------------
/// One PELT period in nanoseconds.
///
/// Chosen as 1024 × 1000 = 1,024,000 ns ≈ 1.024 ms, so that 32 periods give
/// the canonical ~32.768 ms half-life window. The constant is not a power of
/// two, so `delta_ns / PERIOD_NS` and `delta_ns % PERIOD_NS` are not pure
/// shift/mask operations; the compiler strength-reduces division by this
/// constant to a multiply-high and shift, which is cheap on all supported
/// architectures.
pub const PELT_PERIOD_NS: u64 = 1_024_000;
/// Converged maximum load average.
///
/// A task that has been 100% runnable for effectively infinite time converges
/// to `LOAD_AVG_MAX`. This is the closed-form sum of the geometric series:
///
/// ```text
/// LOAD_AVG_MAX = 1024 × Σ_{n=0}^{∞} y^n = 1024 / (1 − y) ≈ 47742
/// ```
///
/// where `y = 0.5^(1/32) ≈ 0.97857` is the per-period decay factor.
/// Used to normalise the internal `*_sum` accumulators to the `*_avg` fields
/// (0–1024 scale): `util_avg = util_sum × 1024 / LOAD_AVG_MAX`.
pub const LOAD_AVG_MAX: u64 = 47742;
/// Number of periods for the geometric series to converge.
///
/// After `LOAD_AVG_MAX_N` periods at 100% utilisation the internal sum
/// reaches `LOAD_AVG_MAX` to within 1 ULP. Any periods beyond this index
/// need not be tracked — `decay_load()` returns 0 for `n ≥ LOAD_AVG_MAX_N`.
pub const LOAD_AVG_MAX_N: u64 = 345;
/// Sub-period fractional decay coefficients for PELT.
///
/// `RUNNABLE_AVG_YN_INV[i]` is the fixed-point (Q32) representation of `y^i`
/// where `y = 0.5^(1/32) ≈ 0.97857` and `i ∈ [0, 31]`:
///
/// ```text
/// RUNNABLE_AVG_YN_INV[i] = round(y^i × 2^32)
/// ```
///
/// Entry 0 = `2^32 - 1` (full weight, zero elapsed sub-periods).
/// Entry 31 = `round(y^31 × 2^32)` (nearly one full period of decay).
///
/// Used by `decay_load()` for the fractional-period component of decay:
///
/// ```text
/// val = (val * RUNNABLE_AVG_YN_INV[n % 32]) >> 32
/// ```
///
/// This avoids floating-point arithmetic at runtime; the table is computed
/// once at compile time from the analytic formula.
pub const RUNNABLE_AVG_YN_INV: [u32; 32] = [
0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a,
0xeac0c6e6, 0xe5b906e6, 0xe0ccdeeb, 0xdbfbb796,
0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46,
0xb504f333, 0xb123f581, 0xad583ee9, 0xa9a15ab4,
0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a,
0x8b95c1e3, 0x88980e80, 0x85aac367, 0x82cd8698,
];
/// Decay a PELT accumulator value by `n` elapsed periods.
///
/// Applies the compound decay factor `y^n` using integer arithmetic:
///
/// 1. If `n > LOAD_AVG_MAX_N` (345), return 0 — the value is fully decayed.
/// 2. Halve `val` for each complete group of 32 periods: `val >>= n / 32`.
/// Each group of 32 periods reduces the value by exactly 50% (`y^32 = 0.5`).
/// 3. Apply the remaining sub-period fractional decay using the lookup table:
/// `val = (val * RUNNABLE_AVG_YN_INV[n % 32]) >> 32`.
/// 4. Return the decayed value.
///
/// # Precision
///
/// The fixed-point multiply in step 3 is a Q32 multiply: the result is the
/// upper 32 bits of the 64-bit product. On 64-bit platforms this is a single
/// `mulhi` or equivalent instruction. On 32-bit platforms it requires a
/// 32×32→64 widening multiply.
///
/// # Usage
///
/// Called for each PELT accumulator (`load_sum`, `runnable_sum`, `util_sum`)
/// when a state transition spans `n ≥ 1` complete periods.
pub fn decay_load(val: u64, n: u64) -> u64 {
if n > LOAD_AVG_MAX_N {
return 0;
}
// Halve for each complete group of 32 periods.
let val = val >> (n / 32);
// Fractional sub-period decay via Q32 multiply with lookup table.
(val * RUNNABLE_AVG_YN_INV[(n % 32) as usize] as u64) >> 32
}
// ---------------------------------------------------------------------------
// PeltState — per-entity load tracking state
// ---------------------------------------------------------------------------
/// Per-Entity Load Tracking state.
///
/// Attached to every schedulable entity (task) and every CPU run queue.
/// Maintains exponentially-decaying averages of CPU utilisation, runnability,
/// and weighted load over a ~32 ms half-life window.
///
/// ## Internal representation
///
/// Three raw accumulators (`load_sum`, `runnable_sum`, `util_sum`) hold the
/// un-normalised geometric sums. Three derived averages (`load_avg`,
/// `runnable_avg`, `util_avg`) are the normalised 0–1024 values consumed by
/// EAS, load balancing, and cpufreq. The averages are recomputed from the
/// sums whenever a state transition occurs:
///
/// ```text
/// util_avg = util_sum * NICE_0_WEIGHT / LOAD_AVG_MAX (clamped to 1024)
/// runnable_avg = runnable_sum * NICE_0_WEIGHT / LOAD_AVG_MAX (clamped to 1024)
/// load_avg = load_sum * task.weight / LOAD_AVG_MAX
/// ```
///
/// `NICE_0_WEIGHT = 1024` and `LOAD_AVG_MAX = 47742`.
///
/// ## Half-life
///
/// The decay factor `y = 0.5^(1/32) ≈ 0.97857` per 1.024 ms period gives a
/// half-life of 32 periods ≈ 32.768 ms. A task that stops running drops to
/// 50% utilisation after ~32 ms and is effectively zero after ~345 periods
/// (~353 ms).
pub struct PeltState {
/// Raw load accumulator: `Σ(task.weight × runnable_time_in_period × y^n)`.
/// Not normalised. Divide by `LOAD_AVG_MAX` to obtain `load_avg`.
pub load_sum: u64,
/// Raw runnable accumulator: `Σ(runnable_time_in_period × y^n)`.
/// Counts time the entity was either running or waiting in the run queue.
/// Not normalised. Divide by `LOAD_AVG_MAX` to obtain `runnable_avg`.
pub runnable_sum: u64,
/// Raw utilisation accumulator: `Σ(running_time_in_period × y^n)`.
/// Counts only time the entity was executing on a CPU (not queued).
/// Not normalised. Divide by `LOAD_AVG_MAX` to obtain `util_avg`.
pub util_sum: u64,
/// Sub-period carry-forward in nanoseconds.
///
/// Nanoseconds elapsed since the start of the current (incomplete) period.
/// Preserved across state transitions so that sub-period time accumulates
/// correctly. Range: `[0, PELT_PERIOD_NS)`.
pub period_contrib: u32,
/// Normalised weighted load average (0 = idle, `task.weight` = fully loaded).
/// `load_avg = load_sum * task.weight / LOAD_AVG_MAX`.
/// Used by the load balancer and NUMA placement.
pub load_avg: u64,
/// Normalised runnable average (0–1024).
/// Includes both running and queued (waiting) time.
/// `runnable_avg = runnable_sum * NICE_0_WEIGHT / LOAD_AVG_MAX`.
pub runnable_avg: u64,
/// Normalised utilisation average (0–1024).
/// Pure execution time only, excluding queued time.
/// `util_avg = util_sum * NICE_0_WEIGHT / LOAD_AVG_MAX`.
/// This is the primary signal consumed by EAS and cpufreq.
pub util_avg: u64,
/// Monotonic time (nanoseconds since boot) up to which this entity's
/// activity has been accounted. Advanced by `update()` and used to
/// compute the elapsed delta on the next call.
pub last_update_time: u64,
}
impl PeltState {
/// Update PELT state with a new time sample using the canonical 3-phase
/// algorithm. This is the ONLY PELT update implementation — there is no
/// simplified version. The 3-phase decomposition correctly handles the
/// head partial period contribution at the old decay level, which is
/// critical for Linux accounting compatibility (`perf`, cgroup `cpu.stat`,
/// container runtimes reading `/proc/schedstat`).
///
/// Must be called at every scheduling event that changes entity state:
/// task switch (running→queued, queued→running), sleep (queued→off),
/// wake-up (off→queued), and scheduler tick. The caller must ensure
/// `running`, `runnable`, and `now` accurately reflect the entity's
/// state for the entire interval since the last call.
///
/// **State-transition contract**: Between consecutive calls, the entity's
/// state must be constant — exactly one of {running, runnable-but-not-running,
/// sleeping}. Calling `update()` mid-interval and then again with a different
/// state for the remainder is the correct pattern; calling `update()` with a
/// blended state is incorrect and will misattribute the `period_contrib`
/// carry-forward.
///
/// `running`: was this entity executing on a CPU for the entire interval?
/// `runnable`: was this entity on the run queue (running or waiting)?
/// Invariant: `running → runnable` (every running task is runnable).
/// `now`: current timestamp in nanoseconds (from `ktime_get_ns()`).
/// `task_weight`: the task's `sched_prio_to_weight` value (for `load_avg`).
///
/// **Classification**: Evolvable. The decay formula and accumulation logic
/// are policy code, hot-swappable via `EvolvableComponent`. The `PeltState`
/// struct layout is Nucleus (data). The invariant checker validates that any
/// replacement `update()` preserves the 3-phase decomposition and produces
/// PELT values within `[0, LOAD_AVG_MAX]`.
pub fn update(
&mut self,
running: bool,
runnable: bool,
now: u64,
task_weight: u64,
) {
debug_assert!(!running || runnable, "running implies runnable");
let delta_ns = now - self.last_update_time;
// Work in PELT fixed-point units (UNIT_NS nanoseconds each) so that one
// full period of 100% activity contributes exactly 1024 and the sums
// converge to LOAD_AVG_MAX (keeping the invariant `*_sum ∈ [0, LOAD_AVG_MAX]`
// and the normalisation formulas above exact). Only whole units are
// consumed; the sub-unit remainder is carried to the next call by advancing
// `last_update_time` by the consumed time rather than to `now`, mirroring
// Linux's `delta >>= 10` conversion.
const UNIT_NS: u64 = PELT_PERIOD_NS / 1024; // 1000 ns per unit
let delta_units = delta_ns / UNIT_NS;
if delta_units == 0 { return; }
self.last_update_time += delta_units * UNIT_NS;
// Period boundaries, accounting for the carry from the previous call.
// `period_contrib` holds the nanoseconds already accumulated in the
// current (incomplete) period; it is always a whole number of units.
let total_ns = self.period_contrib as u64 + delta_units * UNIT_NS;
let periods_crossed = total_ns / PELT_PERIOD_NS;
let tail_ns = total_ns % PELT_PERIOD_NS;
let contrib = if periods_crossed == 0 {
// No period boundary crossed: everything lands in the current
// period at weight y^0 = 1.
self.period_contrib = total_ns as u32;
delta_units
} else {
// Phase 1: the head completes the old partial period. That period
// (including the `period_contrib` portion already inside the sums)
// is now `periods_crossed` periods old, so both the existing sums
// and the head contribution are decayed by the full count. This
// matches Linux's `c1 = decay_load(d1, periods)`.
let head_units = (PELT_PERIOD_NS - self.period_contrib as u64) / UNIT_NS;
self.load_sum = decay_load(self.load_sum, periods_crossed);
self.runnable_sum = decay_load(self.runnable_sum, periods_crossed);
self.util_sum = decay_load(self.util_sum, periods_crossed);
let mut c = decay_load(head_units, periods_crossed);
// Phase 2: intermediate complete periods at weights y^1 .. y^(p-1):
// the geometric series over all crossed periods minus the undecayed
// first term. Matches Linux's
// `c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024`.
if periods_crossed > 1 {
c += accumulate_sum(periods_crossed) - 1024;
}
// Phase 3: the new partial period (tail), undecayed at y^0.
// Matches Linux's `c3 = d3`.
self.period_contrib = tail_ns as u32;
c + tail_ns / UNIT_NS
};
// load_sum accumulates UNWEIGHTED runnable time; the task weight is
// applied once during normalisation (load_avg). This matches Linux's
// `___update_load_sum()` where `load` is a 0/1 boolean.
if runnable {
self.load_sum += contrib;
self.runnable_sum += contrib;
}
if running {
self.util_sum += contrib;
}
// Recompute the normalised averages from the updated sums.
self.load_avg = self.load_sum * task_weight / LOAD_AVG_MAX;
self.runnable_avg = (self.runnable_sum * NICE_0_WEIGHT / LOAD_AVG_MAX).min(1024);
self.util_avg = (self.util_sum * NICE_0_WEIGHT / LOAD_AVG_MAX).min(1024);
}
}
/// Accumulate the decayed geometric series for `n` complete periods.
///
/// Returns the sum `Σ_{k=0}^{n-1} (1024 × y^k)`, which is the total
/// contribution of `n` complete periods of 100% activity to a PELT sum.
/// Implemented using the same two-step lookup as `decay_load()`:
///
/// ```text
/// // Full 32-period groups each contribute LOAD_AVG_MAX × (1 - y^32) = LOAD_AVG_MAX × 0.5
/// // but the incremental sum is easier to compute as LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, n).
/// accumulate_sum(n) = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, n)
/// ```
///
/// This identity holds because the converged sum minus the decayed tail is
/// exactly the contribution of `n` periods from a starting value of 0.
///
/// **Bit-identical note**: This identity may differ from Linux's
/// `runnable_avg_yN_sum` lookup table by +/- 1 due to fixed-point
/// rounding in the intermediate `decay_load()` computation. This is
/// acceptable: PELT values are exponentially-weighted averages consumed
/// by `perf`, `/proc/schedstat`, and `cpu.stat` — all of which display
/// them as percentages or averages where +/- 1 in the underlying
/// fixed-point sum is invisible. If exact Linux bit-compatibility is
/// required for a specific regression test, the `runnable_avg_yN_sum`
/// table can be added as a Phase 2 optimization (drop-in replacement,
/// same function signature).
pub fn accumulate_sum(n: u64) -> u64 {
LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, n)
}
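The half-life and convergence properties claimed above can be checked directly. This standalone sketch duplicates the table and arithmetic of `decay_load()` and `accumulate_sum()` so it runs on its own:

```rust
const LOAD_AVG_MAX: u64 = 47742;
const LOAD_AVG_MAX_N: u64 = 345;

// Q32 fixed-point y^i for i in 0..32, y = 0.5^(1/32) (same table as above).
const RUNNABLE_AVG_YN_INV: [u32; 32] = [
    0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a,
    0xeac0c6e6, 0xe5b906e6, 0xe0ccdeeb, 0xdbfbb796,
    0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
    0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46,
    0xb504f333, 0xb123f581, 0xad583ee9, 0xa9a15ab4,
    0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
    0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a,
    0x8b95c1e3, 0x88980e80, 0x85aac367, 0x82cd8698,
];

fn decay_load(val: u64, n: u64) -> u64 {
    if n > LOAD_AVG_MAX_N {
        return 0;
    }
    let val = val >> (n / 32); // halve per complete 32-period group
    (val * RUNNABLE_AVG_YN_INV[(n % 32) as usize] as u64) >> 32
}

fn accumulate_sum(n: u64) -> u64 {
    LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, n)
}

fn main() {
    // 32 periods is one half-life: the value halves (to within 1 ULP).
    assert_eq!(decay_load(1024, 32), 511);
    assert_eq!(decay_load(LOAD_AVG_MAX, 32), 23870);
    // Beyond 345 periods the value is fully decayed.
    assert_eq!(decay_load(LOAD_AVG_MAX, 400), 0);
    // One full period of 100% activity contributes 1024.
    assert_eq!(accumulate_sum(1), 1024);
    // accumulate_sum converges to LOAD_AVG_MAX for large n.
    assert_eq!(accumulate_sum(400), LOAD_AVG_MAX);
}
```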
The decay factor y ≈ 0.97857 gives a half-life of 32 periods (~32.768 ms). decay_load()
uses a precomputed table of y^n values for n = 0..31 (RUNNABLE_AVG_YN_INV) and
halves for each group of 32 periods. accumulate_sum(n) = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, n)
computes the geometric partial sum without floating-point. This matches Linux's PELT
semantics (to within the ±1 fixed-point rounding noted above) for tool compatibility.
Relationship to EAS: When a task wakes up, the scheduler reads task.pelt.util_avg
to know the task's CPU demand. EAS uses this to find the core type where the task fits
most efficiently. Without PELT, EAS would have no utilization data to work with.
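As a concrete instance of the normalization that feeds EAS (a standalone sketch; the kernel reads `task.pelt.util_avg` directly rather than recomputing it):

```rust
const LOAD_AVG_MAX: u64 = 47742;
const NICE_0_WEIGHT: u64 = 1024;

/// util_avg = util_sum × NICE_0_WEIGHT / LOAD_AVG_MAX, clamped to 1024,
/// mapping the raw geometric sum onto the 0-1024 capacity scale.
fn util_avg(util_sum: u64) -> u64 {
    (util_sum * NICE_0_WEIGHT / LOAD_AVG_MAX).min(1024)
}

fn main() {
    assert_eq!(util_avg(LOAD_AVG_MAX), 1024); // 100% runnable forever
    assert_eq!(util_avg(23871), 512);         // converged 50% duty cycle
    assert_eq!(util_avg(0), 0);               // long idle
}
```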
7.2.5 Frequency Domain Awareness and Cpufreq Integration¶
CPUs within a frequency domain share a clock — changing one CPU's frequency changes all of them. The scheduler must be aware of this grouping.
// umka-core/src/sched/cpufreq.rs
/// Frequency domain: a group of CPUs sharing a clock source.
pub struct FreqDomain {
/// Domain identifier.
pub id: FreqDomainId,
/// CPUs in this domain.
pub cpus: CpuMask,
/// Core type of all CPUs in this domain (always uniform within a domain).
pub core_type: CoreType,
/// Available OPPs for this domain. Fixed-capacity inline array avoids
/// heap allocation and keeps OPP data cache-local. 32 entries
/// exceeds all known hardware OPP tables (typical: 8-20).
/// No downsampling is needed for current hardware.
pub opps: ArrayVec<OppEntry, MAX_OPP_ENTRIES>, // MAX_OPP_ENTRIES = 32
/// Current OPP index (into `opps`).
pub current_opp: AtomicU32,
/// Aggregate utilization of all CPUs in this domain (sum of PELT util_avg).
/// Updated at scheduler tick.
pub domain_util: AtomicU32,
/// Cpufreq governor for this domain.
pub governor: CpufreqGovernor,
}
/// Cpufreq governor — decides when to change frequency.
pub enum CpufreqGovernor {
/// Schedutil: frequency tracks utilization (default for EAS).
/// New frequency = (util / capacity) × max_freq.
/// Tight integration with scheduler — runs from scheduler context.
Schedutil,
/// Performance: always run at max frequency.
Performance,
/// Powersave: always run at min frequency.
Powersave,
/// Ondemand: legacy userspace sampling (Linux compat).
Ondemand,
/// Conservative: like ondemand but ramps gradually.
Conservative,
}
Schedutil integration: On every scheduler tick, the schedutil governor reads the domain's aggregate utilization and adjusts frequency:
new_freq = (domain_util / domain_capacity) × max_freq
If domain has 4 CPUs at capacity 1024 each:
domain_capacity = 4096
If domain_util = 2048 (50% utilized):
new_freq = (2048 / 4096) × max_freq = 50% of max_freq
Frequency change latency: ~10-50 μs (hardware-dependent).
The governor rate-limits changes to avoid oscillation (~4ms minimum interval).
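The schedutil formula above, as a standalone sketch. A clamp for saturated domains is added here for illustration; utilization and capacity are on the PELT 0-1024-per-CPU scale, frequencies in kHz:

```rust
/// new_freq = (domain_util / domain_capacity) × max_freq, clamped to max.
fn schedutil_target_khz(domain_util: u64, domain_capacity: u64, max_freq_khz: u64) -> u64 {
    (domain_util * max_freq_khz / domain_capacity).min(max_freq_khz)
}

fn main() {
    // 4 CPUs × capacity 1024 = 4096; 50% utilized → 50% of max frequency.
    assert_eq!(schedutil_target_khz(2048, 4096, 2_400_000), 1_200_000);
    // Oversubscribed domain is clamped to the maximum frequency.
    assert_eq!(schedutil_target_khz(5000, 4096, 2_400_000), 2_400_000);
}
```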
7.2.6 Intel Thread Director (ITD) Integration¶
Intel Thread Director (Hardware Feedback Interface):
The HFI table is memory-mapped:
1. UmkaOS Core allocates a 4KB-aligned physical buffer at boot.
2. Writes the physical address to IA32_HW_FEEDBACK_PTR MSR (0x17D0).
3. Hardware fills the table with per-class performance/efficiency scores
using normal memory stores (not MSR writes).
4. When HFI data is updated, hardware fires a Thermal Interrupt
(bit 26 set in IA32_PACKAGE_THERM_STATUS).
5. The interrupt handler reads the updated table via normal memory loads.
Per-thread class ID: read via RDMSR from IA32_THREAD_FEEDBACK_CHAR
(per-logical-processor MSR, address 0x17D2). Each thread has a
hardware-assigned classification (e.g., integer-heavy, floating-point-heavy,
memory-bound) that informs EAS core assignment decisions.
Intel Thread Director is a hardware feature on Alder Lake+ that classifies running workloads and provides hints about which core type is optimal. The hardware monitors instruction mix in real-time and populates the HFI table in memory.
// umka-core/src/sched/itd.rs
/// Intel Thread Director hint (decoded from the memory-mapped HFI table).
/// Hardware populates the table via normal memory stores; UmkaOS reads it
/// on Thermal Interrupt (bit 26 of IA32_PACKAGE_THERM_STATUS).
/// Per-thread class ID is obtained via RDMSR IA32_THREAD_FEEDBACK_CHAR (0x17D2).
pub struct ItdHint {
/// Hardware's assessment: how much this task benefits from a P-core.
/// 0 = no benefit (pure memory-bound), 255 = maximum benefit (compute-bound).
pub perf_capability: u8,
/// Hardware's assessment: energy efficiency on an E-core.
/// 0 = poor efficiency on E-core, 255 = excellent efficiency on E-core.
pub energy_efficiency: u8,
/// Workload classification.
pub workload_class: ItdWorkloadClass,
}
#[repr(u8)]
pub enum ItdWorkloadClass {
/// Scalar integer code — runs well on E-cores.
ScalarInt = 0,
/// Scalar floating-point — runs well on E-cores.
ScalarFp = 1,
/// Vectorized (SSE/AVX) — may benefit from P-cores (wider execution units).
Vector = 2,
/// AVX-512 / AMX — P-core only (E-cores may lack these).
HeavyVector = 3,
/// Branch-heavy — benefits from P-core branch predictor.
BranchHeavy = 4,
/// Memory-bound — core type doesn't matter, memory is the bottleneck.
MemoryBound = 5,
}
Integration with EAS: ITD hints are a refinement. EAS uses PELT utilization to pick the energy-optimal core. ITD overrides this when hardware detects a mismatch:
EAS decision: task has low utilization (200/1024) → place on E-core.
ITD override: task is HeavyVector (AVX-512) → E-core lacks AVX-512 → force P-core.
EAS decision: task has high utilization (800/1024) → place on P-core.
ITD override: task is MemoryBound → P-core is wasted → suggest E-core.
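These override rules can be sketched as a small decision function. The enum subset and the `refine()` helper below are illustrative, not the actual UmkaOS API:

```rust
#[derive(Clone, Copy)]
enum WorkloadClass {
    ScalarInt,
    HeavyVector, // AVX-512 / AMX
    MemoryBound,
}

#[derive(Clone, Copy, PartialEq, Debug)]
enum Core {
    P,
    E,
}

/// Refine an EAS placement with an ITD hint: ISA constraints are hard
/// overrides; memory-bound work is steered toward E-cores.
fn refine(eas_choice: Core, class: WorkloadClass) -> Core {
    match class {
        WorkloadClass::HeavyVector => Core::P, // E-cores may lack AVX-512/AMX
        WorkloadClass::MemoryBound => Core::E, // a P-core is wasted on memory stalls
        _ => eas_choice,                       // keep the energy-optimal choice
    }
}

fn main() {
    // Low-utilization AVX-512 task: EAS says E-core, ITD forces P-core.
    assert_eq!(refine(Core::E, WorkloadClass::HeavyVector), Core::P);
    // High-utilization memory-bound task: EAS says P-core, ITD suggests E-core.
    assert_eq!(refine(Core::P, WorkloadClass::MemoryBound), Core::E);
    // Everything else keeps the EAS decision.
    assert_eq!(refine(Core::E, WorkloadClass::ScalarInt), Core::E);
}
```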
ITD hints are read from the memory-mapped HFI table on Thermal Interrupt (not per
scheduler tick). On interrupt: one memory load per table row (~4ns), no RDMSR needed
for the table itself. Per-thread class ID is fetched via RDMSR IA32_THREAD_FEEDBACK_CHAR
(0x17D2) on context switch: cost ~30ns per switch, only on Intel Alder Lake+.
On non-Intel or pre-Alder Lake: ITD is disabled, zero overhead.
Other architectures: ARM and RISC-V do not have an ITD equivalent. On ARM
big.LITTLE/DynamIQ, the scheduler relies on the capacity-dmips-mhz device tree
property and runtime PELT utilization to make core placement decisions (Section 7.2). On RISC-V heterogeneous harts, the riscv,isa device tree property and
per-hart ISA capability flags drive placement (Section 7.2). Neither architecture
provides hardware-level workload classification hints — the scheduler's software
heuristics (PELT + EAS) perform this role. This is architecturally acceptable: ITD
is an optimization (~15-25% better placement for mixed workloads on Intel hybrid),
not a correctness requirement.
7.2.7 Asymmetric Packing¶
On heterogeneous systems, idle CPU selection must be topology-aware:
Symmetric (traditional):
Spread tasks across all CPUs evenly for maximum parallelism.
Asymmetric (big.LITTLE):
Pack tasks onto efficiency cores first.
Only spill onto performance cores when efficiency cores are full
or when a task is too large (misfit).
Why: an idle performance core at its lowest OPP still draws more power
than a busy efficiency core. Packing onto efficiency cores first
minimizes total system power.
Misfit migration: A task is "misfit" if its utilization (pelt.util_avg) exceeds
the capacity of the CPU it's currently running on. Misfit tasks are migrated to a
higher-capacity CPU at the next load balance opportunity.
/// Check if a task is misfit on its current CPU.
pub fn is_misfit(task: &Task, cpu: &CpuCapacity) -> bool {
task.pelt.util_avg > cpu.capacity_curr.load(Ordering::Relaxed)
}
/// Misfit migration is checked at every load balance interval (~4ms).
/// If a task is misfit:
/// 1. Find the closest (topology-wise) CPU with enough capacity.
/// 2. Migrate the task.
/// 3. Mark the source CPU's rq->misfit_task flag for faster detection.
7.2.7.1 Hybrid Core Isolation Domain Asymmetry (x86-64, Intel Hybrid)¶
On Intel hybrid CPUs (Alder Lake, Raptor Lake, Meteor Lake), WRPKRU execution time differs significantly between P-cores and E-cores:
| Core Type | WRPKRU Cost (per switch) | Consumer Batch (N=12, amortized) | Impact on Tier 1 I/O |
|---|---|---|---|
| P-core (Golden Cove / Raptor Cove) | ~35 cycles | ~6 cycles/op | ~0.06% overhead on NVMe 4KB read (amortized) |
| E-core (Gracemont) | ~89 cycles | ~15 cycles/op | ~0.15% overhead on NVMe 4KB read (amortized) |
Note: The consumer loop performs one domain switch per batch (N operations dequeued from the ring). The amortized per-operation cost is WRPKRU_cost / N. At N≥12 (typical under NVMe load), the overhead is negligible on both core types.
The scheduler accounts for this asymmetry when placing isolation-heavy workloads
(tasks that frequently cross Tier 1 domain boundaries). Tasks with high domain-switch
rates are preferentially placed on P-cores where WRPKRU is ~2.5x faster. The
X86Errata::HYBRID_ASYMMETRIC flag (set at boot for all hybrid CPUs) enables this
placement heuristic. The ITD (Intel Thread Director) hardware hint
(Section 7.2) classifies
such workloads with a high HFI_CLASS that naturally maps to P-core preference.
7.2.8 Hierarchical Group Scheduling (cpu.weight Backing Mechanism)¶
The EEVDF scheduler as described above is flat: all tasks compete directly in a single
per-CPU EEVDF tree. Cgroups v2 cpu.weight requires hierarchical fair sharing —
tasks within a cgroup share CPU proportional to the cgroup's weight relative to sibling
cgroups, and within a cgroup, tasks share proportional to their individual nice weights.
UmkaOS implements hierarchical EEVDF using group scheduling entities (GroupEntity),
matching Linux's struct task_group / struct sched_entity hierarchy. Each cgroup
with cpu.weight configured has one GroupEntity per CPU in the parent's EEVDF tree.
/// Per-CPU scheduling entity for a cgroup in the parent's EEVDF tree.
///
/// Each cgroup with a cpu.weight has one `GroupEntity` per CPU. The entity
/// participates in the parent cgroup's (or root's) EEVDF tree as if it were
/// a single task, representing all runnable tasks in the child cgroup on that CPU.
///
/// The entity's scheduling parameters (vruntime, vdeadline, lag) are maintained
/// using the same EEVDF algorithm as individual tasks — only the weight comes
/// from `cpu.weight` instead of the nice-to-weight table.
pub struct GroupEntity {
/// Cgroup ID that this entity represents.
pub cgroup_id: u64,
/// Weight from `cpu.weight` (1–10000, default 100). Scales the entity's
/// virtual runtime accumulation in the parent's EEVDF tree, exactly as
/// a task's nice weight scales its vruntime in a flat tree.
pub weight: u32,
/// Virtual runtime in the parent's EEVDF tree. Accumulates inversely
/// proportional to `weight`: `vruntime += delta_exec_ns * NICE_0_WEIGHT / weight`
/// (multiply before divide, to avoid truncating the integer ratio).
pub vruntime: u64,
/// Virtual deadline in the parent's EEVDF tree.
/// `vdeadline = vruntime + calc_delta_fair(slice_ns, weight)`.
/// EEVDF computes eligibility dynamically from `avg_vruntime()`,
/// not from a stored field.
pub vdeadline: u64,
/// Accumulated lag for EEVDF eligibility in the parent's tree.
pub lag: i64,
/// Number of runnable tasks in this cgroup on this CPU.
/// AtomicU32 for lock-free reads by work stealing and load estimation.
/// When `nr_running` drops to zero, the entity is dequeued from the
/// parent's EEVDF tree (no empty groups occupy tree space).
pub nr_running: AtomicU32,
/// Back-pointer to the parent cgroup's `GroupEntity` (or `None` for root).
/// Forms the hierarchy chain for weight propagation.
pub parent: Option<&'static GroupEntity>,
/// The child EEVDF tree: tasks (and nested child GroupEntities) that
/// belong to this cgroup on this CPU. Uses `VruntimeTree` directly
/// (Section 7.1) — the shared base type containing the augmented
/// RB tree and two-accumulator state.
///
/// All accumulator-only EEVDF helpers (`avg_vruntime_update`,
/// `entity_key`, `__enqueue_entity`, `__dequeue_entity`) operate
/// on `&VruntimeTree` and work identically on this group sub-tree.
/// Root-only fields (`curr`, `next`, `bandwidth_timer`) are not
/// present — the currently running task is tracked solely by the
/// root per-CPU `EevdfRunQueue.curr`. Hierarchical pick descends
/// into `child_rq` and picks from the tree without a local `curr`.
pub child_rq: VruntimeTree,
/// PELT state for this group entity. Aggregates the utilization of all
/// tasks in the cgroup on this CPU. Used by EAS and load balancing.
pub pelt: PeltState,
}
Two-level pick algorithm. Task selection in a hierarchical EEVDF tree is recursive:
1. At the root per-CPU EEVDF tree, pick_eevdf() selects the eligible entity with the earliest virtual deadline. This entity may be either a bare EevdfTask (a task not in any cgroup with cpu.weight) or a GroupEntity.
2. If the selected entity is a GroupEntity, descend into its child_rq and repeat: pick the eligible entity with the earliest virtual deadline in the child tree. This entity may again be a GroupEntity (nested cgroups) or a leaf EevdfTask.
3. Repeat until a leaf EevdfTask is reached. This is the task to schedule.
The recursion depth equals the cgroup nesting depth (typically 2–4 levels: root → system.slice → service → container). Each level is an O(log n) tree walk, making the total pick cost O(D × log n) where D is the cgroup depth and n is the maximum tasks per level.
// `rq` is a VruntimeTree (or EevdfRunQueue.base at the root level).
pick_next_eevdf(rq):
entity = rq.tasks_timeline.pick_eevdf() // O(log n) augmented walk
if entity.is_group():
return pick_next_eevdf(entity.child_rq) // recurse into child VruntimeTree
else:
return entity.task // leaf: schedule this task
7.2.8.1 Virtual Runtime Propagation¶
When a task runs for `delta_exec_ns`:

1. The task's own `EevdfTask.vruntime` advances by
   `delta_exec_ns * NICE_0_WEIGHT / task_weight` (within the innermost group's EEVDF tree).
2. Each ancestor `GroupEntity` up to the root also advances its `vruntime` in its
   parent's tree by `delta_exec_ns * NICE_0_WEIGHT / group_weight`. This ensures the
   group's virtual time reflects actual CPU consumption, maintaining fairness among
   sibling groups.
3. The propagation is bottom-up and happens in the `task_tick()` and `put_prev_task()`
   paths, both of which run under the per-CPU runqueue lock. No additional locking is
   needed because all entities in the hierarchy chain reside on the same CPU.
/// Propagate vruntime from a task to all ancestor GroupEntities.
/// Called from update_curr() step 3b after advancing the task's vruntime.
///
/// Walks from the task's innermost cgroup up to the root, advancing each
/// GroupEntity's vruntime in its parent's EEVDF tree. All entities are on
/// this CPU's runqueue — no cross-CPU locking needed.
fn propagate_group_vruntime(ge: &mut GroupEntity, delta_ns: u64) {
let mut current = ge;
loop {
// Advance this GroupEntity's vruntime in its parent's tree.
let vdelta = calc_delta_fair(delta_ns, current.weight);
current.vruntime += vdelta;
// Update avg_vruntime accumulators in the parent tree.
if let Some(parent_rq) = current.parent_eevdf_rq() {
parent_rq.sum_w_vruntime +=
(vdelta as i64) * (current.weight as i64);
}
        // Walk up to the parent GroupEntity, stop at root. `parent_mut()` is
        // the mutable counterpart of the `parent` back-pointer (illustrative:
        // mutation is safe because the whole chain lives on this CPU and the
        // caller holds the runqueue lock).
        match current.parent_mut() {
            Some(p) => current = p,
            None => break,
        }
}
}
Weight mapping. The cpu.weight range (1–10000, default 100) is converted to EEVDF
weights using the same formula as Section 7.1:
group_weight = (cpu_weight * 1024) / 100. At the default cpu.weight = 100, a group
entity has EEVDF weight 1024 (= NICE_0_WEIGHT, equivalent to nice 0). Two sibling
cgroups with cpu.weight 100 and 200 get CPU in 1:2 ratio, regardless of how many
tasks each contains.
Enqueue / dequeue lifecycle:

- When the first task in a cgroup becomes runnable on a CPU: create (or reuse) the
  `GroupEntity` for that cgroup on that CPU, set `nr_running = 1`, and enqueue the
  entity into the parent's EEVDF tree.
- When a task in the cgroup wakes or forks on that CPU: increment `nr_running` and
  enqueue the task into the `GroupEntity.child_rq`. The group entity stays in the
  parent tree.
- When a task sleeps or exits: decrement `nr_running` and dequeue from `child_rq`.
  If `nr_running` reaches zero: dequeue the `GroupEntity` from the parent tree
  (preserving `lag` for re-enqueue fairness, same as the deferred dequeue mechanism
  for individual tasks).
Per-CPU storage. GroupEntity instances are stored in per-CPU XArrays keyed by
cgroup ID: RunQueueData.group_entities: XArray<GroupEntity>. This is O(1) lookup by
cgroup ID when a task is enqueued and the scheduler needs to find or create the group
entity for the task's cgroup on the local CPU.
// (Pseudo-code: this field is part of the RunQueueData struct definition, shown here for clarity)
/// Extension to RunQueueData for group scheduling.
impl RunQueueData {
/// Per-cgroup group scheduling entities on this CPU.
/// Keyed by `CgroupId` (integer key → XArray per collection policy).
/// Lazily populated: an entry exists only when at least one task in
/// the cgroup is runnable on this CPU.
pub group_entities: XArray<GroupEntity>,
}
Runtime cpu.weight propagation. When userspace writes a new value to a cgroup's
cpu.weight file, the kernel must propagate the weight change to every per-CPU
GroupEntity instance for that cgroup. Without explicit propagation, the per-CPU
entities retain the stale weight and the scheduler produces incorrect proportional
sharing. On tickless (nohz_full) cores, a stale weight persists indefinitely because
there is no periodic tick to trigger parameter re-evaluation.
/// Propagate a cpu.weight change to all per-CPU GroupEntity instances.
///
/// Called from the cgroupfs cpu.weight write handler (Section 17.2.3)
/// after the CpuController.weight AtomicU32 is updated.
///
/// The write handler holds the cgroup's `css_set_lock` (read side),
/// preventing concurrent task migration from racing with the weight update.
/// Per-CPU runqueue locks are acquired individually in ascending CPU order
/// to avoid ABBA deadlock with the scheduler's per-CPU lock ordering.
pub fn sched_group_set_weight(cgroup_id: u64, new_weight: u32) {
for cpu_id in 0..nr_cpus_online() {
let rq = per_cpu_runqueue(cpu_id);
let _guard = rq.lock();
if let Some(ge) = rq.group_entities.get_mut(cgroup_id) {
let old_weight = ge.weight;
ge.weight = new_weight;
// Recompute vdeadline from the new weight. The entity's current
// vruntime is preserved — only the rate of future vruntime
// accumulation changes. This matches Linux's reweight_entity()
// semantics: the group retains its accumulated lag (fairness debt)
// and only future scheduling quanta are scaled by the new weight.
//
// vdeadline = vruntime + calc_delta_fair(EEVDF_SLICE_NS, new_weight)
// EEVDF computes eligibility dynamically from avg_vruntime(), not
// from a stored field.
ge.vdeadline = ge.vruntime + calc_delta_fair(EEVDF_SLICE_NS, new_weight);
// If the entity is currently enqueued in the parent's EEVDF tree,
// update the tree's augmented min_vdeadline metadata (the weight
// change may have altered this entity's position in the tree).
if ge.nr_running.load(Ordering::Relaxed) > 0 {
rq.eevdf_tree_update_key(ge);
}
// If this CPU is in nohz_full (tickless) mode and the entity is
// currently running, send a reschedule IPI so the scheduler
// re-evaluates with the new weight. Without this, the running
// task continues at the old weight indefinitely.
if rq.is_nohz_full() && rq.curr_group_id() == Some(cgroup_id) {
rq.resched_ipi();
}
}
// If no GroupEntity exists for this cgroup on this CPU, no action
// is needed — the entity will be created with the new weight when
// the first task in this cgroup becomes runnable on this CPU.
}
}
Interaction with CBS bandwidth servers (Section 7.6): CBS
(cpu.guarantee) and group scheduling (cpu.weight) are orthogonal. A cgroup can
have both: cpu.weight determines its share of available CPU relative to siblings,
while cpu.guarantee sets a minimum floor. The CBS server operates at the cgroup
level — when CBS throttles a cgroup, the GroupEntity is removed from the parent
EEVDF tree (same as OnRqState::CbsThrottled). When CBS un-throttles, the entity is
re-enqueued with its preserved lag.
7.2.9 Cgroup Integration¶
Cgroups can constrain which core types a group of tasks may use:
/sys/fs/cgroup/<group>/cpu.core_type
# Allowed core types for this cgroup.
# "all" — any core type (default)
# "performance" — only P-cores (latency-critical workloads)
# "efficiency" — only E-cores (background/batch workloads)
# Multiple: "performance mid" — P-cores and mid-tier cores

/sys/fs/cgroup/<group>/cpu.capacity_min
# Minimum per-CPU capacity for tasks in this cgroup.
# Tasks will not be placed on CPUs with capacity below this value.
# Default: 0 (no minimum).
# Example: "512" — only run on CPUs with at least half maximum capacity.

/sys/fs/cgroup/<group>/cpu.capacity_max
# Maximum per-CPU capacity for tasks in this cgroup.
# Tasks will not be placed on CPUs with capacity above this value.
# Default: 1024 (no maximum).
# Example: "400" — only run on efficiency cores.
Use case examples:
# Kubernetes: latency-critical pod on P-cores only
echo "performance" > /sys/fs/cgroup/k8s-pod-frontend/cpu.core_type

# Background log processing: E-cores only (save P-cores for real work)
echo "efficiency" > /sys/fs/cgroup/k8s-pod-logshipper/cpu.core_type

# ML training: needs AVX-512 (P-cores on Intel, ISA-gated)
echo "512" > /sys/fs/cgroup/k8s-pod-training/cpu.capacity_min
7.2.10 RISC-V Heterogeneous Hart Support¶
RISC-V takes heterogeneity further: different harts (hardware threads) may have different ISA extensions. One hart may have the Vector extension (RVV), another may not. One hart may support the Hypervisor extension (H), another may not.
// umka-core/src/sched/riscv.rs
/// RISC-V ISA extension discovery per hart.
/// Read from the devicetree `riscv,isa` property for each hart.
///
/// Example devicetree:
/// cpu@0 { riscv,isa = "rv64imafdc_zba_zbb_v"; }; // Vector-capable
/// cpu@1 { riscv,isa = "rv64imafd"; }; // No vector
pub fn discover_hart_capabilities(dt: &DeviceTree) -> BootVec<IsaCapabilities> {
    // Parse each hart's `riscv,isa` string and set the corresponding
    // IsaCapabilities flags. The scheduler uses these to ensure tasks that
    // use Vector instructions only run on Vector-capable harts.
    // (Accessor names below are illustrative.)
    let mut caps = BootVec::new();
    for hart in dt.harts() {
        caps.push(IsaCapabilities::from_isa_string(hart.isa_string()));
    }
    caps
}
ISA gating in the scheduler:
Task affinity includes ISA requirements:
struct TaskAffinityHint {
/// ISA extensions this task requires (detected from ELF header
/// or set by userspace via prctl).
pub isa_required: IsaCapabilities,
/// Core type preference (from cgroup or auto-detected).
pub core_preference: CorePreference,
/// PELT utilization (for EAS).
pub util_avg: u32,
}
Scheduler check:
if !cpu.isa_caps.contains(task.affinity.isa_required) {
// This CPU lacks ISA extensions the task needs.
// Skip this CPU. Do NOT schedule here.
continue;
}
This prevents the illegal-instruction faults that result from scheduling a Vector task on a non-Vector hart — a silent correctness bug on Linux today. Linux began adding per-hart ISA detection in 6.2 and capability tracking in 6.4, but does not yet feed per-hart ISA awareness into the scheduler's task placement decisions: a Vector-tagged task can still be scheduled onto a non-Vector hart.
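The scheduler check above is a plain bitmask containment test; a minimal sketch:

```rust
/// A CPU is eligible only if it provides every ISA extension the task
/// requires (capability bitmasks; bit assignments illustrative).
fn cpu_satisfies(cpu_caps: u64, task_required: u64) -> bool {
    cpu_caps & task_required == task_required
}
```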
7.2.11 Topology Discovery¶
The scheduler builds its heterogeneous topology model from firmware:
Sources (checked in order):
1. ACPI PPTT (Processor Properties Topology Table):
— Provides core type, cache hierarchy, frequency domain.
— Available on ARM SBSA servers and Intel Alder Lake+.
2. ACPI CPPC (Collaborative Processor Performance Control):
— Provides per-CPU performance range (highest/lowest/nominal).
— The ratio highest_perf / nominal_perf indicates core type:
P-cores have higher highest_perf than E-cores.
— Used on Intel hybrid platforms.
3. Devicetree:
— ARM and RISC-V embedded systems.
— `capacity-dmips-mhz` property gives relative core performance.
— RISC-V `riscv,isa` property gives per-hart ISA extensions.
4. Intel CPUID leaf 0x1A (Hybrid Information):
— Reports core type (Atom = E-core, Core = P-core) for the running CPU.
— Each CPU reads its own CPUID at boot.
5. Fallback: runtime measurement
— If no firmware data: run a calibration loop on each CPU at boot.
— Measure instructions-per-second to derive relative capacity.
— Last resort. ~100ms at boot.
7.2.12 Linux Compatibility¶
All Linux interfaces for heterogeneous CPU systems are supported:
/sys/devices/system/cpu/cpuN/cpu_capacity
# Read-only. Capacity of CPU N (0–1024).
# Written by the kernel at boot. Used by userspace tools.
# e.g., "1024" for P-core, "512" for E-core.

/sys/devices/system/cpu/cpuN/topology/core_type
# "performance", "efficiency", or "unknown".
# New in Linux 6.x, used by systemd and schedulers.

/sys/devices/system/cpu/cpufreq/policyN/
# Standard cpufreq interface (per frequency domain):
# scaling_governor, scaling_cur_freq, scaling_max_freq, etc.

sched_setattr(pid, &attr):
# SCHED_FLAG_UTIL_CLAMP: set min/max utilization clamp.
# util_min/util_max affect EAS placement decisions.
# Fully supported with same semantics as Linux.

prctl(PR_SCHED_CORE, ...):
# Core scheduling (co-scheduling related tasks on the same core).
# Supported.

Kernel command line:
# isolcpus=2-3 (reserve CPUs, same as Linux)
# nohz_full=2-3 (tickless for RT, same as Linux)
# nosmt (disable SMT, same as Linux)
sched_ext compatibility: Linux 6.12+ allows BPF scheduling policies via sched_ext. UmkaOS provides the foundation for sched_ext through the eBPF subsystem (Section 19.2) and the sched_setattr interface (Section 19.1). Full sched_ext support requires the BPF struct_ops framework and sched_ext-specific kfuncs (scx_bpf_dispatch, scx_bpf_consume, etc.), which are part of the eBPF subsystem implementation. BPF schedulers that use sysfs topology files and sched_setattr for configuration will work without modification once the struct_ops infrastructure is in place.
7.2.13 Performance Impact¶
Symmetric systems (all cores identical):
EAS: disabled (enabled == false). One branch check at task wakeup: ~1 cycle.
Capacity model: all CPUs = 1024. No capacity checks affect scheduling decisions.
PELT: runs regardless (already exists in the standard EEVDF scheduler). Zero additional cost.
Total overhead vs Linux on symmetric: ZERO.
Heterogeneous systems (big.LITTLE, Intel hybrid):
EAS: ~200-500ns per task wakeup (iterate 2-3 domains × 4-6 OPPs).
Task wakeup total (without EAS): ~1500-2000ns.
Task wakeup total (with EAS): ~1700-2500ns.
Overhead: ~15-25% of wakeup path. Same as Linux EAS.
Large topology scaling: On 128-core NUMA systems with 4 NUMA nodes and 3
frequency domains per node, wakeup scanning covers up to 12 frequency
domains × 128 CPUs = up to 1536 capacity lookups in the worst case. With
cache misses on remote NUMA nodes, this can reach 4-40 μs — exceeding the
200-500 ns estimate for small systems. UmkaOS mitigates this with:
(1) per-domain capacity caches refreshed on topology change, not per-wakeup;
(2) early termination when a suitable idle CPU is found;
(3) the eas_max_domains sysctl (default 8) caps the search depth.
On systems where EAS latency exceeds 1 μs P99, set eas_max_domains=4 or
disable EAS entirely (kernel.sched_energy_aware=0). The 200-500 ns estimate
applies to systems with ≤64 cores and ≤4 frequency domains.
ITD class ID fetch (RDMSR IA32_THREAD_FEEDBACK_CHAR): ~30ns per context switch.
HFI table read: on Thermal Interrupt only (infrequent, hardware-triggered).
Combined overhead: negligible (context-switch-bound, not tick-bound).
Misfit check: one comparison per load balance (~4ms). Negligible.
Benefit: 20-40% power reduction for mixed workloads (measured on
Linux EAS vs non-EAS on ARM big.LITTLE). Same benefit expected.
This is not overhead — it's a power optimization. The CPU time spent
on EAS decisions is recovered many times over in power savings.
Summary: UmkaOS's heterogeneous scheduling has identical performance
to Linux EAS on the same hardware. The algorithms are the same (PELT,
EAS energy computation, schedutil). The implementation is clean-sheet
Rust but the scheduling mathematics are equivalent.
See also: Section 7.7 (Power Budgeting) extends EAS with system-level power caps and per-domain throttling. Section 7.7 specifies `thermal_update_capacity()` — the callback that updates `CpuCapacity.capacity` when thermal throttling reduces a CPU's maximum frequency, ensuring EAS placement decisions account for reduced throughput. Section 22.8 (Unified Compute Model) generalizes the CpuCapacity scalar into a multi-dimensional capacity vector spanning CPUs, GPUs, and accelerators.
7.3 Context Switch and Register State¶
7.3.1 Context Switch Procedure¶
The full context switch sequence when the scheduler selects a new task (next) to
replace the currently running task (prev). This is the hot path executed on every
involuntary preemption, voluntary yield, and explicit schedule() call. The caller
holds the local CPU's run queue lock (rq.lock).
context_switch(prev, next):
1. Update prev's scheduling class state (put_prev_task):
- EEVDF: advance prev.vruntime, update lag, re-insert into eligible/ineligible tree
- RT/DL: update runtime accounting, check timeslice expiry
2. perf_schedule_out(prev_ctx: &PerfEventContext)
— Stop and read all active PMU counters for prev's task context.
`prev_ctx` is the task's `PerfEventContext` (see [Section 20.8](20-observability.md#performance-monitoring-unit)).
Lock-free: iterates active[0..active_count] via Acquire load.
See [Section 20.8](20-observability.md#performance-monitoring-unit--context-switch-fast-path).
3. Save prev's general-purpose registers and stack pointer
— Architecture-specific: pushes callee-saved registers onto prev's kernel stack,
stores prev's stack pointer into prev.thread_struct.
4. Switch address space (switch_mm):
- x86-64: write next's PGD to CR3 (with PCID if available to avoid full TLB flush)
- AArch64: write TTBR0_EL1, issue TLBI if ASID differs
- Other arches: architecture-specific page table base register update
5. Save/restore extended register state (lazy XSAVE/XRSTOR):
— See Extended Register State Management below. Only dirty components are saved.
6. Restore next's general-purpose registers and stack pointer
— Pop callee-saved registers from next's kernel stack.
7. perf_schedule_in(next_ctx: &PerfEventContext)
— Program next's active PMU counters into hardware.
`next_ctx` is the incoming task's `PerfEventContext` (see [Section 20.8](20-observability.md#performance-monitoring-unit)).
Lock-free: same active[] iteration pattern as step 2.
See [Section 20.8](20-observability.md#performance-monitoring-unit--context-switch-fast-path).
8. Update per-CPU current task pointer
— Write next's Task pointer to the per-CPU CpuLocal block.
Steps 2 and 7 invoke the PMU context switch fast path defined in
Section 20.8. These calls are unconditional — they execute on
every context switch regardless of whether perf events are active. When no perf events
are open on the CPU (active_count == 0), both functions reduce to a single atomic
load (the Acquire load of active_count) and an immediate return — no PMU register
access, no loop iteration. The cost of this check is ~1ns per context switch, well
within the performance budget (Section 1.3).
Software event accounting. The context switch path also increments the
PERF_COUNT_SW_CONTEXT_SWITCHES software counter via a per-CPU atomic increment
between steps 1 and 2. This counter is always active (not gated by perf event
attachment) and feeds perf stat -e context-switches. Cost: one AtomicU64::fetch_add
with Relaxed ordering (~1ns).
7.3.1.1 PKRU Management During Context Switch (x86-64)¶
PKRU is excluded from XSAVE/XRSTOR and managed manually. This is a correctness requirement, not an optimization. On all x86 CPUs with PKU — and particularly on AMD Zen processors — the CPU can aggressively clear the XSTATE_BV bit for PKRU when the register value matches the init state (all zeros). A subsequent XRSTOR would then reset PKRU to 0 (init state), silently disabling all protection key enforcement and destroying Tier 1 isolation. Linux fixed this in v5.14 (Thomas Gleixner's 66-patch "Spring Cleaning" series) by completely decoupling PKRU from XSAVE.
UmkaOS stores PKRU in prev.saved_pkru (a u32 field) and switches it
manually during step 5 of the context switch procedure:
PKRU context switch (x86-64, between steps 4 and 5):
a. rdpkru → read current PKRU value
b. If PKRU != prev.saved_pkru: store current value to prev (prev may have
modified PKRU via pkey_mprotect or via Tier 1 domain entry/exit)
c. If next.saved_pkru != current PKRU: wrpkru(next.saved_pkru)
d. Update per-CPU CpuLocalBlock.pkru_shadow = next.saved_pkru
(keeps the isolation shadow in sync with the hardware register — required
for switch_domain() elision correctness; see
[Section 11.2](11-drivers.md#isolation-mechanisms-and-performance-modes--pkru-write-elision-mandatory))
**Memory ordering**: The shadow write uses Release ordering so that any
subsequent switch_domain() on this CPU (which reads the shadow with
Acquire) observes the updated value. The hardware register write
(WRPKRU) acts as a permission barrier — subsequent memory accesses
affected by PKRU will not execute (even speculatively) until WRPKRU
completes (Intel SDM). It is NOT a full serializing instruction
(unlike CPUID or WRMSR); it provides ordering only for PKU-protected
memory accesses. The Release ordering on the shadow store matches
the WRPKRU permission-barrier semantics for the elision comparison
to be correct.
e. The conditional write in step (c) avoids the ~23-89 cycle WRPKRU cost when
adjacent tasks share the same PKRU value (common for non-isolated workloads).
Critical: when reading the target task's PKRU value, always use the explicit next
pointer passed to the switch function — never a cached current pointer. Between Linux
5.2 and 5.13, using this_cpu_read_stable() to access current->flags produced stale
values when switching from a kernel thread to a user thread, leaving PKRU unrestored.
XSAVE exclusion: during XSAVE (step 5), the PKRU component (bit 9 in XCR0/XSTATE_BV) is masked from both save and restore operations. The kernel XSAVE mask explicitly clears bit 9 so that XSAVE/XRSTOR never touch PKRU. This mask is set once during boot and never modified.
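The boot-time mask computation is a single bit clear (constant name illustrative):

```rust
/// PKRU is XSAVE component 9. The kernel's XSAVE/XRSTOR mask is XCR0 with
/// that bit permanently cleared, computed once at boot and never modified.
const XFEATURE_MASK_PKRU: u64 = 1 << 9;

fn kernel_xsave_mask(xcr0: u64) -> u64 {
    xcr0 & !XFEATURE_MASK_PKRU
}
```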
7.3.1.2 Isolation Register Save/Restore (All Architectures)¶
The PKRU management above is x86-64-specific. Every architecture with Tier 1 hardware isolation has an equivalent isolation register that must be saved/restored during context switch (between steps 4 and 5, same position as PKRU). On architectures without Tier 1 isolation (RISC-V, s390x, LoongArch64), this step is a no-op.
AArch64 — POR_EL0 (Permission Overlay Register, ARMv8.9-A / ARMv9.4-A+ with FEAT_S1POE):
POR_EL0 context switch (AArch64 + POE, between steps 4 and 5):
a. MRS x0, POR_EL0 → read current permission overlay value
b. If x0 != prev.saved_por: store current value to prev (prev may have
changed POR_EL0 via Tier 1 domain entry/exit)
c. If next.saved_por != x0: MSR POR_EL0, next.saved_por; ISB
(ISB is required after POR_EL0 writes to ensure the permission overlay
takes effect before any subsequent memory access)
d. Update per-CPU CpuLocalBlock.por_shadow = next.saved_por
e. On AArch64 without POE hardware: this step is skipped entirely.
Tier 1 drivers on non-POE AArch64 use page-table + ASID isolation
which is handled in step 4 (switch_mm).
ARMv7 — DACR (Domain Access Control Register):
DACR context switch (ARMv7, between steps 4 and 5):
a. MRC p15, 0, r0, c3, c0, 0 → read current DACR value
b. If r0 != prev.saved_dacr: store current value to prev
c. If next.saved_dacr != r0: MCR p15, 0, next.saved_dacr, c3, c0, 0
d. Update per-CPU CpuLocalBlock.dacr_shadow = next.saved_dacr
e. ISB barrier after DACR write — ARM Architecture Reference Manual
(B3.7.2) requires ISB after DACR modification to ensure the new
domain permissions are visible to subsequent instructions. Without
ISB, speculative access could use the old DACR value. This matches
Linux kernel behavior (`isb()` after every `set_domain()` call).
PPC32 — Segment Registers:
Segment register context switch (PPC32, between steps 4 and 5):
a. For each active segment register (sr0–sr15):
mfsr rN, srX → read current segment register value
b. If any sr differs from prev.saved_sr[X]: store current values to prev
c. For each segment register that differs between prev and next:
mtsr srX, next.saved_sr[X]
d. Update per-CPU CpuLocalBlock.sr_shadow[0..16] = next.saved_sr[0..16]
e. isync after the last mtsr to ensure new segment translations are visible.
PPC64LE — Radix PID (POWER9+ Radix mode):
Radix PID context switch (PPC64LE, between steps 4 and 5):
a. mfspr r0, SPRN_PID → read current Radix PID value
b. If r0 != prev.saved_rpid: store current value to prev
c. If next.saved_rpid != r0: mtspr SPRN_PID, next.saved_rpid
d. Update per-CPU CpuLocalBlock.rpid_shadow = next.saved_rpid
e. isync after mtspr PID to synchronize the translation context.
RISC-V, s390x, LoongArch64: These architectures have no Tier 1 hardware isolation
register. Tier 1 is unavailable on these platforms; drivers use Tier 0 or Tier 2.
No isolation register save/restore is performed during context switch. The context
switch code gates the entire isolation register save/restore block on
arch::current::isolation::supports_fast_isolation() — a compile-time constant that
evaluates to false on RISC-V, s390x, and LoongArch64, ensuring the dead code is
eliminated by the compiler with zero runtime cost.
7.3.1.3 Return Stack Buffer (RSB) Fill¶
On every context switch, the kernel fills 32 RSB entries with safe return addresses
(pointing to a speculative capture gadget that executes LFENCE; JMP back to itself).
32 entries matches the RSB depth on all current Intel and AMD microarchitectures
(Skylake through Sapphire Rapids, Zen 1 through Zen 5). Future CPUs with deeper
RSBs would require updating this constant; however, on CPUs with eIBRS (see below),
RSB fill is skipped entirely, making the depth moot.
This prevents speculative execution from following stale RSB entries left by the previous
task, which could leak data via cache side channels.
RSB fill (x86-64, after step 6 — register restore):
— Execute 32 CALL instructions (each pushes a return address onto the RSB).
— Adjust RSP to discard the 32 pushed return addresses.
— Total cost: ~20-40 cycles (32 predicted-taken near calls + stack adjustment).
RSB fill is also required on every VM exit (KVM VMEXIT handler) — the guest may have
polluted the RSB. On CPUs with IBRS_ALL (eIBRS), RSB fill after context switch can
be skipped (eIBRS provides RSB protection), but RSB fill after VM exit remains mandatory
because eIBRS does not cover guest→host RSB pollution.
AArch64, ARMv7, and RISC-V do not have an RSB equivalent that requires filling. PPC64 uses a link stack that is flushed via the count cache flush sequence (Section 2.18).
7.3.1.4 LL/SC Reservation Clearing (ARM, RISC-V, PowerPC)¶
On architectures that implement CAS via load-linked/store-conditional (LL/SC) pairs, a thread preempted between the load-linked and the store-conditional retains a hardware reservation. When a different thread resumes on that CPU, its first store-conditional may spuriously succeed on a completely unrelated address if it happens to fall within the stale reservation granule. This is a correctness requirement, not an optimization.
The context switch path must clear any dangling reservation between steps 6 and 7:
| Architecture | Clear instruction | Mechanism |
|---|---|---|
| AArch64 | CLREX |
Explicit reservation clear instruction. Zero cost, no memory access. |
| ARMv7 | CLREX |
Same as AArch64. Present from ARMv6K onward. |
| RISC-V | SC to a dummy word-aligned location |
RISC-V has no CLREX equivalent. Execute SC.W zero, zero, (dummy_addr) where dummy_addr is a per-CPU scratch word. The SC unconditionally clears the reservation regardless of success or failure. |
| PPC32 | stwcx. to a per-CPU dummy word |
Same principle as RISC-V: stwcx. to a word-aligned scratch location clears the reservation. Must be in cacheable memory with M=1 (coherence required). |
| PPC64LE | stdcx. to a per-CPU dummy doubleword |
64-bit equivalent. Same requirement: cacheable, coherent memory. |
| x86-64 | Not needed | x86 uses CMPXCHG (single instruction, no reservation). |
| s390x | Not needed | s390x uses CS/CSG (compare-and-swap, single instruction). |
| LoongArch64 | LL.W/SC.W to per-CPU scratch |
LoongArch LL/SC semantics require a trailing SC to clear the LLbit. |
Dummy word placement: On RISC-V, PPC32, PPC64LE, and LoongArch64, the per-CPU dummy
word used for reservation clearing must be placed on its own cache line (64-byte
aligned) to avoid false sharing. If the dummy word shares a cache line with a hot
variable, the SC/stwcx./stdcx. will write to that cache line on every context
switch (even though the store value is discarded), causing unnecessary cache invalidation
traffic on other CPUs that share the cache line. The dummy word is declared as:
/// Per-CPU scratch word for LL/SC reservation clearing on context switch.
/// Cache-line aligned to prevent false sharing with adjacent per-CPU data.
#[repr(C, align(64))]
struct LlscDummy {
word: u64,
_pad: [u8; 56],
}
// kernel-internal, not KABI. Size = 64 bytes (one cache line).
const_assert!(size_of::<LlscDummy>() == 64);
Placed in CpuLocalBlock (each block is already cache-line aligned).
PowerPC-specific concern: The reservation granule size is implementation-dependent
(minimum 16 bytes, typically 128 bytes = one cache line on POWER8/9). A stale reservation
on one lock can cause a spurious stwcx. success on a different lock in the same granule.
The dummy stwcx. must target a scratch word that is guaranteed to be in a different
cache line from any real lock or atomic variable.
7.3.2 Extended Register State Management¶
Modern x86 CPUs carry large amounts of extended register state beyond the basic GPRs and x87 FPU. Blindly saving and restoring all of this on every context switch is wasteful — most threads never touch AVX-512 or AMX.
The cost problem:
| State component | Size | Save/restore cost |
|---|---|---|
| x87 + SSE (XMM) | 576 bytes | ~20 ns |
| AVX (YMM) | 256 bytes | ~10 ns |
| AVX-512 (ZMM) | 2048 bytes | ~80 ns |
| Intel AMX (tiles) | 8192 bytes | ~300 ns |
| ARM SVE (Z regs) | 256–8192 bytes (VL-dependent) | ~100–500 ns |
On a server running thousands of threads with microsecond-scale scheduling, 300ns of AMX save/restore overhead per switch is significant.
Lazy XSAVE policy:
UmkaOS tracks per-thread which extended state components have actually been used via
an xstate_used bitmap that mirrors the hardware XSTATE_BV field:
/// Per-thread extended state tracking.
struct ThreadXState {
/// Bitmap of XSAVE components this thread has used since creation.
/// Mirrors hardware XSTATE_BV layout (bit 0 = x87, bit 1 = SSE, bit 2 = AVX, etc.)
///
/// **Synchronization**: Plain `u64`, not atomic. This is safe because
/// `xstate_used` is per-thread and only accessed on the thread's current
/// CPU: the #NM handler sets bits (same CPU, IRQ context), and the
/// context switch reads them (same CPU, preemption disabled). No
/// cross-CPU visibility is needed.
xstate_used: u64,
/// Dynamically-allocated XSAVE area. Starts as None; allocated on first use.
/// Size depends on which components are enabled (CPUID leaf 0xD).
xsave_area: Option<XSaveArea>,
}
Context switch optimization:
On context_switch(prev, next):
1. Determine prev's dirty components: xstate_dirty = prev.xstate_used & XSTATE_MODIFIED_BITS
2. XSAVE only the dirty components (XSAVES with prev.xstate_used as the mask)
— If prev never used AVX-512, the ZMM state is NOT saved (zero cost)
3. XRSTOR next's components (XRSTORS with next.xstate_used as the mask)
— Components not in next's mask are initialized to their reset state by hardware
Modified Optimization:
- If a thread hasn't executed any AVX-512/AMX instruction since the last context switch,
the corresponding XSTATE_BV bits are clear — XSAVES skips those components automatically.
- The kernel does NOT need to track this manually; it falls out of the hardware XSAVE
optimized mode (XSAVES/XRSTORS with INIT optimization).
Init optimization (demand allocation):
Threads that never use extended SIMD pay zero XSAVE cost:
1. New thread starts with xstate_used = 0, xsave_area = None
2. CR0.TS bit is set (or XCR0 is restricted) — any use of SIMD triggers #NM
(Device Not Available exception)
3. #NM handler:
a. Allocate xsave_area (sized per CPUID leaf 0xD for the used components)
b. Set appropriate bits in xstate_used
c. Clear CR0.TS (or extend XCR0)
d. Return — the faulting SIMD instruction re-executes successfully
4. Subsequent SIMD use proceeds without trapping
This means a thread that only does integer arithmetic and memory copies never allocates an XSAVE area and never incurs XSAVE/XRSTOR cost during context switch.
AMX special case:
Intel AMX tile registers (8KB) are especially expensive. Additional optimization:
- AMX has a TILERELEASE instruction that explicitly marks tile state as unused.
- UmkaOS's kernel scheduler can hint userspace (via prctl or arch_prctl) to call
TILERELEASE when exiting a compute-intensive section, so the next context switch
doesn't save 8KB of dead tile state.
- If a thread hasn't used AMX tiles in the last N context switches (configurable,
default N=8), the kernel deallocates the AMX portion of the XSAVE area to reclaim
memory.
ARM SVE/SME (AArch64):
ARM's Scalable Vector Extension has a variable vector length (128–2048 bits). The same lazy-allocation strategy applies, with ARM-specific mechanisms:
SVE state components and sizes (VL-dependent):
| Component | Size at VL=128 | Size at VL=512 | Size at VL=2048 |
|------------------|----------------|----------------|-----------------|
| Z registers (Z0-Z31) | 512 bytes | 2048 bytes | 8192 bytes |
| P predicates (P0-P15) | 32 bytes | 128 bytes | 512 bytes |
| FFR (first-fault) | 2 bytes | 8 bytes | 32 bytes |
| Total | 546 bytes | 2184 bytes | 8736 bytes |
Lazy SVE allocation policy:
1. New thread starts with SVE disabled (CPACR_EL1.ZEN = 0b00).
2. First SVE instruction triggers #UND (EL1 undefined instruction trap).
3. #UND handler:
a. Read current VL from ZCR_EL1 (or inherit from parent thread).
b. Allocate SVE save area sized for current VL.
  c. Set CPACR_EL1.ZEN = 0b11 (no trapping; SVE enabled at EL0 and EL1).
d. Return — the faulting SVE instruction re-executes.
4. Context switch saves/restores only if thread has SVE enabled:
a. Check CPACR_EL1.ZEN — if SVE was never used, skip (zero cost).
b. If used: SVE_ST (store Z/P/FFR) and SVE_LD (load) to save area.
c. Cost: proportional to VL, not fixed. VL=128: ~50ns. VL=512: ~200ns.
VL management:
- Per-thread VL via prctl(PR_SVE_SET_VL, new_vl).
- VL change takes effect on next context restore (no immediate effect).
- If new_vl > old_vl, the save area is reallocated (grown, not shrunk).
- System default VL set via sysctl: kernel.sve_default_vl = 256.
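The VL-dependent sizing and the grow-only reallocation rule fall directly out of the table above. A sketch, with hypothetical function names:

```rust
/// SVE save-area size for a vector length given in bits: 32 Z registers
/// of VL bits each, 16 predicates of VL/8 bits each, one FFR of VL/8 bits.
fn sve_area_bytes(vl_bits: usize) -> usize {
    32 * (vl_bits / 8) + 16 * (vl_bits / 64) + vl_bits / 64
}

/// PR_SVE_SET_VL handling: the save area is grown when the new VL needs
/// more space, but never shrunk, matching the policy above.
fn sve_capacity_after_set_vl(current_capacity: usize, new_vl_bits: usize) -> usize {
    current_capacity.max(sve_area_bytes(new_vl_bits))
}
```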
SME (Scalable Matrix Extension, ARMv9.2) provides matrix tiles analogous to AMX:
SME state:
- ZA tile register: (SVL/8) x (SVL/8) bytes, where SVL is the Streaming
Vector Length in BITS. At SVL=512 bits: 64x64 = 4KB. At the maximum
SVL=2048 bits: 256x256 = 64KB.
- Streaming SVE mode (SSVE): uses SVE registers at streaming VL.
Lazy SME allocation:
1. SMSTART (enter streaming mode) traps if SMCR_EL1.ENA = 0.
2. Handler allocates ZA storage and enables SME.
3. SMSTOP (exit streaming mode) marks ZA as inactive.
If ZA is inactive for N switches (default 4), deallocate ZA storage.
This matters for memory pressure: ZA at SVL=2048 is 64KB per thread.
Context switch for SME:
- Check PSTATE.SM (streaming mode active) and PSTATE.ZA (ZA active).
- If neither: zero cost.
- If ZA active: save ZA tile (up to 64KB at max SVL=2048). This is expensive.
- The scheduler deprioritizes SME-heavy threads from migration to minimize
ZA save/restore on context switch (locality preference).
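The ZA sizing and the idle-deallocation rule (threshold default 4) can be sketched as follows; the helper names are hypothetical:

```rust
/// ZA tile storage: (SVL/8) x (SVL/8) bytes, with SVL in bits.
fn za_tile_bytes(svl_bits: usize) -> usize {
    (svl_bits / 8) * (svl_bits / 8)
}

/// Deallocate ZA storage once it has sat inactive (post-SMSTOP) for
/// `threshold` context switches, per the policy above.
fn should_free_za(inactive_switches: u32, threshold: u32) -> bool {
    inactive_switches >= threshold
}
```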
ARMv7 VFP/NEON:
ARMv7 extended register state is simpler than x86 or AArch64 — only VFP (Vector Floating Point) and NEON (SIMD) use extended registers:
ARMv7 extended state:
| Component | Size | Save/restore cost |
|------------------|------------|-------------------|
| VFP/NEON (D0-D31)| 256 bytes | ~15-30 ns |
| FPSCR | 4 bytes | ~5 ns |
Total: 260 bytes per thread. Always the same size (no variable-length
extensions like SVE or AVX-512).
Lazy allocation policy for ARMv7:
1. New thread starts with VFP/NEON disabled (FPEXC.EN = 0).
2. First VFP/NEON instruction triggers #UND trap.
3. Handler:
a. Allocate 260-byte VFP save area.
b. Set FPEXC.EN = 1 (enable VFP/NEON).
c. Return — faulting instruction re-executes.
4. Context switch: VSTM/VLDM to save/restore D0-D31 + FPSCR.
Cost is fixed (~30ns) regardless of which registers were used.
ARMv7 has no equivalent of XSAVE's selective save — all 32 double-word
registers are saved/restored as a unit. The fixed 260-byte size means no
dynamic allocation complexity. Threads that never use floating-point or
NEON pay zero VFP save/restore cost (same lazy trap approach as x86 #NM).
RISC-V Vector Extension (RVV):
RISC-V Vector extension (RVV 1.0, ratified 2021) has variable vector length (VLEN: 128–65536 bits), making it the most flexible — and most complex to manage — of any supported architecture:
RVV state components:
| Component | Size at VLEN=128 | Size at VLEN=256 | Size at VLEN=1024 |
|---------------------|------------------|------------------|-------------------|
| V registers (v0-v31)| 512 bytes | 1024 bytes | 4096 bytes |
| vtype, vl, vstart | 24 bytes | 24 bytes | 24 bytes |
| vcsr (vxrm, vxsat) | 8 bytes | 8 bytes | 8 bytes |
| Total | 544 bytes | 1056 bytes | 4128 bytes |
Lazy RVV allocation policy:
1. New thread starts with Vector disabled (mstatus.VS = Off).
2. First vector instruction triggers illegal-instruction trap.
3. Handler:
a. Read VLEN from hart capabilities (discovered at boot from DT or CSR).
b. Allocate vector save area: 32 × (VLEN/8) + overhead bytes.
c. Set mstatus.VS = Initial (vector state clean).
d. Return — faulting instruction re-executes.
4. Context switch:
a. Check mstatus.VS — if Off, skip entirely (zero cost).
b. If Dirty: save all 32 V registers using 4× vs8r.v (v0-v7, v8-v15,
v16-v23, v24-v31). The RVV spec maximum whole-register store is 8
registers per instruction; there is no vs32r.v.
c. If Clean: skip save (state hasn't changed since last restore).
d. Restore: 4× vl8re8.v to load all 32 registers from next thread's area.
Total: 8 whole-register instructions (4 stores + 4 loads).
Set mstatus.VS = Clean after restore.
Per-hart VLEN variation:
On heterogeneous RISC-V systems, different harts may have different VLEN.
The scheduler tracks per-hart VLEN (discovered at boot). A thread that
uses vector instructions with VLEN=256 can only run on harts with
VLEN >= 256. This is integrated with the ISA-gating mechanism
(Section 7.1.5.9): VectorLengthInfo (companion to IsaCapabilities) encodes
per-thread VLEN. The scheduler uses rvv_vlen_bits to constrain hart placement.
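The sizing and the hart-placement constraint can be sketched as plain Rust (hypothetical helper names; the `>=` rule follows the description above):

```rust
/// RVV save-area size: 32 V registers of VLEN/8 bytes each, plus
/// vtype/vl/vstart (24 bytes) and vcsr (8 bytes).
fn rvv_area_bytes(vlen_bits: usize) -> usize {
    32 * (vlen_bits / 8) + 24 + 8
}

/// A thread bound to `thread_vlen_bits` (0 = never used vectors) may
/// only be placed on harts providing at least that VLEN.
fn hart_can_run(thread_vlen_bits: usize, hart_vlen_bits: usize) -> bool {
    thread_vlen_bits == 0 || hart_vlen_bits >= thread_vlen_bits
}
```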
Alignment with scheduler hints:
The scheduler already tracks ISA capability usage per-thread (Section 7.2,
IsaCapabilities bitflags including X86_AVX512, X86_AMX, ARM_SME).
The XSAVE policy uses the same flags:
- A thread with X86_AVX512 set in its IsaCapabilities is known to use AVX-512.
The scheduler places it on a P-core (avoiding frequency throttling on E-cores).
- The XSAVE subsystem uses the same flag to pre-allocate the ZMM XSAVE area,
avoiding the #NM trap latency on the first AVX-512 instruction for threads that
are known to use it (e.g., after an execve of a binary with AVX-512 in its ELF
.note.gnu.property).
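One way to express that pre-allocation decision, assuming hypothetical flag constants standing in for the IsaCapabilities bitflags named in the text:

```rust
/// Hypothetical stand-ins for the IsaCapabilities flags mentioned above.
const X86_AVX512: u32 = 1 << 0;
const X86_AMX: u32 = 1 << 1;

/// Components to pre-allocate at execve so threads known to use wide
/// SIMD never take the first-use #NM trap. Bit positions follow the
/// XSAVE layout (x87/SSE/AVX in bits 0-2, AVX-512 in bits 5-7,
/// AMX in bits 17-18).
fn preallocated_xstate(isa_caps: u32) -> u64 {
    let mut mask: u64 = 0;
    if isa_caps & X86_AVX512 != 0 {
        mask |= 0b1110_0111; // x87/SSE/AVX plus the three AVX-512 components
    }
    if isa_caps & X86_AMX != 0 {
        mask |= 0b11 << 17; // TILECFG + tile data
    }
    mask
}
```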
7.3.2.1 Saved State Composition by Architecture¶
This subsection enumerates exactly which registers UmkaOS saves and restores on each supported architecture during a context switch. For every architecture, state is divided into three categories: (1) general-purpose and control registers that are always saved; (2) extended floating-point/SIMD state that is saved lazily (only when the thread has actually used the relevant unit); and (3) debug registers that are saved only when hardware breakpoints are active. The lazy FP/SIMD policy is described in the preceding subsection; the per-architecture enable bits that control laziness are called out explicitly below.
x86-64:
Always saved (integer/control registers):
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | RAX, RBX, RCX, RDX, RSI, RDI, RBP, R8–R15 | 15 explicit GPRs; RSP is implicit in the kernel stack switch |
| Stack pointer | RSP | Saved via kernel stack pointer swap in the switch stub |
| Instruction pointer | RIP | Saved via the call/ret discipline in the switch stub; not written explicitly |
| Flags | RFLAGS | Saved with pushfq/popfq |
| TLS base registers | FS.base, GS.base | Written via WRMSR/RDMSR (IA32_FS_BASE / IA32_GS_BASE); userspace TLS and kernel per-CPU pointer respectively. SWAPGS handles the kernel↔user GS.base exchange on entry/exit |
Extended state (lazy — saved only when xstate_used bitmap indicates use):
The extended state area is allocated as a single XSAVE-formatted buffer, size
determined at boot from CPUID leaf 0xD subleaf 0 (ECX = full XSAVE area size for
all enabled components). UmkaOS uses XSAVES/XRSTORS (compacted format) to save
only the components indicated by the task's xstate_used bitmap.
| Component | Registers | XSAVE component bit | Trigger |
|---|---|---|---|
| x87 FPU | ST0–ST7, FIP, FDP, FOP, FCW, FSW, FTW | Bit 0 | Any x87 instruction |
| SSE | XMM0–XMM15, MXCSR | Bit 1 | Any SSE/SSE2 instruction |
| AVX | YMM0–YMM15 upper 128-bit halves | Bit 2 | Any AVX instruction |
| AVX-512 opmask | K0–K7 | Bit 5 | Any AVX-512 masked operation |
| AVX-512 ZMM hi256 | ZMM0–ZMM15 upper 256-bit halves | Bit 6 | Any AVX-512 ZMM instruction |
| AVX-512 ZMM hi16 | ZMM16–ZMM31 full 512-bit | Bit 7 | Any AVX-512 ZMM16–31 instruction |
| Intel AMX tile config | TILECFG | Bit 17 | LDTILECFG instruction |
| Intel AMX tile data | Up to 8 tiles × 1024 bytes = 8192 bytes | Bit 18 | Any AMX tile compute instruction |
Whether AVX, AVX-512, or AMX are present is determined at boot from CPUID leaf 0x7 and leaf 0xD; UmkaOS enables only the components reported by the hardware.
Debug registers (lazy — saved only when hardware breakpoints are active):
DR0–DR3 (break address), DR6 (status), DR7 (control). Saved on context switch out
and restored on context switch in for any thread that has set hardware breakpoints
(indicated by a per-task debug_active flag). DR4 and DR5 are aliases of DR6 and
DR7; DR4/DR5 are not saved separately.
AArch64:
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | X0–X30 | 31 integer registers; XZR (X31) is hardwired zero and never saved |
| User stack pointer | SP_EL0 | User-mode stack pointer; saved as a plain 64-bit value |
| Instruction pointer | PC | Saved via the ret target in the switch stub (stored in the thread's saved x30/LR slot pointing to the resume label) |
| Process state | SPSR_EL1 | Saved process state register; encodes NZCV flags, DAIF mask, execution state, SP selection |
| User TLS pointer | TPIDR_EL0 | User-readable thread pointer register; holds glibc TLS base |
UmkaOS's per-CPU kernel pointer lives in TPIDR_EL1; it is not a per-thread value
and is not saved/restored on context switch.
Extended state (lazy):
NEON/FP state is controlled by CPACR_EL1.FPEN. If FPEN is set to 0b00
(trapping), any FP/NEON instruction from EL0 or EL1 takes a trap that triggers
allocation and enable. Once enabled, NEON/FP is always saved (there is no
hardware equivalent to XSAVE's component-level selective save on non-SVE
AArch64).
| Component | Registers | Save size | Enable bit | Trigger |
|---|---|---|---|---|
| NEON/FP | V0–V31 (128-bit each), FPSR, FPCR | 528 bytes | CPACR_EL1.FPEN | Any FP or NEON instruction |
| SVE (FEAT_SVE) | Z0–Z31 (variable, up to 2048 bits each), P0–P15 (predicates), FFR | VL-dependent (see Section 7.1.6) | CPACR_EL1.ZEN | Any SVE instruction; first use traps to #UND |
| SME (FEAT_SME) | ZA tile array (SVL-dependent, up to 64 KB at SVL=2048 bits), streaming SVE state | SVL-dependent | SMCR_EL1.ENA | SMSTART instruction; traps if ENA=0 |
SVE and SME presence is determined from CPUID registers ID_AA64PFR0_EL1 and
ID_AA64SMFR0_EL1 at boot. SVE vector length (VL) is read from ZCR_EL1; it may
differ per-CPU cluster on heterogeneous SoCs, which is reflected in
IsaCapabilities. SME streaming vector length (SVL) is read from SMCR_EL1.
Debug registers (lazy):
DBGBVR0–DBGBVR15 (breakpoint value registers) and DBGBCR0–DBGBCR15 (breakpoint
control registers), plus DBGWVR0–DBGWVR15 and DBGWCR0–DBGWCR15 (watchpoints).
The number of implemented breakpoints and watchpoints (up to 16 each) is read
from ID_AA64DFR0_EL1 at boot. Saved only when the per-task debug_active flag
is set.
ARMv7:
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | R0–R14 | 15 integer registers; R15 (PC) is handled by the switch stub's bx lr return |
| Process state | CPSR | Current Program Status Register (condition flags, mode bits, interrupt masks) |
| User TLS pointer | TPIDRURW | User-read/write thread pointer register; holds glibc TLS base |
Extended state (lazy):
VFP/NEON state is controlled by the FPEXC.EN bit. With EN=0, any VFP or NEON
instruction from any privilege level traps to the undefined-instruction handler,
which allocates the save area and sets EN=1.
| Component | Registers | Save size | Enable bit | Trigger |
|---|---|---|---|---|
| VFP/NEON | D0–D31 (64-bit each), FPSCR | 260 bytes | FPEXC.EN | Any VFP or NEON instruction |
ARMv7 has no hardware equivalent to XSAVE's component-level selective save: all 32
doubleword registers are saved and restored as a unit using VSTMIA/VLDMIA.
The fixed 260-byte save area (32 × 8 + 4 FPSCR) is allocated on first VFP/NEON use and is never
resized. Threads that never use floating-point or NEON pay zero VFP save cost.
Presence of the VFP and NEON units is detected from FPSID and MVFR0 at boot. Some ARMv7 implementations (e.g., Cortex-M targets) omit VFP entirely; UmkaOS skips all VFP save/restore logic on those cores.
Debug registers (lazy):
BVR0–BVR15 (Breakpoint Value Registers) and BCR0–BCR15 (Breakpoint Control
Registers), plus WVR/WCR pairs for watchpoints. Count is read from DBGDIDR at
boot. Saved only when debug_active is set.
RISC-V (RV64GC):
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | x1–x31 | 31 integer registers; x0 is hardwired zero and is never saved |
| Instruction pointer | PC | Saved via the ra (x1) convention in the switch stub; the stub stores the resume label in ra before saving and jumps via ret on restore |
| Thread pointer | x4 (tp) | Holds the per-CPU CpuLocalBlock pointer when in kernel mode (swapped with sscratch on trap entry); userspace tp is preserved in the task's saved register frame |
Extended state (lazy):
Floating-point and vector state laziness is implemented using the FS and VS
fields in the sstatus CSR. Hardware sets FS/VS to Dirty whenever the
corresponding register set is written; UmkaOS checks this flag at context switch
time rather than maintaining a separate software bitmap.
| Component | Registers | Save size | sstatus field | Trigger |
|---|---|---|---|---|
| F extension (single) | f0–f31, fcsr | 132 bytes | FS | Any F-extension instruction when FS = Off → traps; no trap when FS = Initial or Dirty |
| D extension (double) | f0–f31 (64-bit view), fcsr | 260 bytes | FS (shared with F) | Any D-extension instruction |
| V extension (vector) | v0–v31 (variable length), vcsr, vl, vtype, vstart | VLEN-dependent (see Section 7.1.6) | VS | Any V-extension instruction when VS = Off → illegal-instruction trap |
The F and D extensions share the same register file and FS field; D is a
superset of F. If both are present, UmkaOS always saves 64-bit doubles. The
presence of F, D, and V extensions is determined from the misa CSR and from
the ISA string in the device tree or SBI firmware at boot.
Context switch policy for float/vector:
- If FS = Off: no FP registers are saved (zero cost).
- If FS = Initial: registers are in reset state; skip save, but restore from
a canonical all-zeros area on next thread's restore (or leave as Initial).
- If FS = Dirty: save all f0–f31 and fcsr. After save, set FS = Clean.
- If VS = Off: no vector registers are saved.
- If VS = Dirty: save all v0–v31 plus vcsr/vl/vtype/vstart. After save,
set VS = Clean.
On heterogeneous RISC-V systems where different harts have different VLEN, a
thread's VLEN is fixed at the VLEN of the hart that first executed a vector
instruction. The scheduler then constrains that thread to harts with matching
VLEN (see VectorLengthInfo in Section 7.1.5).
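The FS half of that policy is a small state machine. A sketch (the enum and helper names are hypothetical, not spec identifiers):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum FsState { Off, Initial, Clean, Dirty }

/// Decide whether the outgoing thread's FP registers must be saved, and
/// what FS value to record afterwards, per the policy above.
fn fp_switch_out(fs: FsState) -> (bool, FsState) {
    match fs {
        FsState::Dirty => (true, FsState::Clean), // save f0-f31 + fcsr, mark Clean
        other => (false, other),                  // Off/Initial/Clean: zero cost
    }
}
```

The VS field follows the same shape with `v0–v31` plus `vcsr`/`vl`/`vtype`/`vstart` in place of the FP registers.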
Debug registers (lazy):
The RISC-V debug trigger module provides tselect, tdata1, tdata2, and
optionally tdata3 CSRs for configuring hardware breakpoints and watchpoints.
The number of implemented triggers is determined at boot by iterating tselect
until it wraps. Saved only when debug_active is set.
PPC32:
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | R0–R31 | 32 integer registers |
| Special-purpose | LR, CTR, XER, CR | Link register, count register, integer exception register, condition register |
| Instruction pointer | SRR0 | Machine state save/restore register 0 holds the saved PC (restored via rfi) |
| Machine state | SRR1 | Machine state save/restore register 1 holds the saved MSR (restored via rfi) |
User TLS is managed by convention: the SVR4 ABI designates R13 as the small-data area pointer and R2 as the thread pointer (TLS base); both are part of the general GPR save above. The kernel per-CPU pointer lives in SPRG3 and is not per-task.
Extended state (lazy):
FPU state is controlled by MSR.FP. With MSR.FP = 0, any floating-point
instruction from any privilege level causes a floating-point unavailable exception.
The handler allocates the save area, sets MSR.FP = 1, and returns.
| Component | Registers | Save size | Enable bit | Trigger |
|---|---|---|---|---|
| FPR | FPR0–FPR31 (64-bit each), FPSCR | 264 bytes | MSR.FP | Any FP instruction when MSR.FP = 0 |
AltiVec/VMX is not universally present on PPC32 targets supported by UmkaOS (primarily embedded e500/e500mc class cores). On embedded PPC32 cores that do implement SPE (Signal Processing Engine) floating-point, the SPE save area (32 × 32-bit upper halves of the 64-bit SPE GPRs, plus SPEFSCR) replaces the classical FPR block. Presence of SPE is detected from the PVR (Processor Version Register) at boot.
Debug registers (lazy):
DBCR0, DBCR1, DAC1, DAC2 (data address compare), IAC1, IAC2 (instruction
address compare). Count and capability are read from DBCR0 at boot. Saved only
when debug_active is set.
PPC64LE:
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | R0–R31 | 32 integer registers |
| Special-purpose | LR, CTR, XER, CR, DSCR, AMR | Link, count, exception, condition, data stream control, authority mask registers |
| Instruction pointer | SRR0 / HSRR0 | SRR0 for normal exceptions; HSRR0 for hypervisor exceptions (used in KVM context) |
| Machine state | SRR1 / HSRR1 | Saved MSR (restored via rfid / hrfid) |
The AMR (Authority Mask Register) implements a hardware equivalent to memory
protection keys on POWER9+ in Radix mode; it is always saved to preserve per-task
memory domain state. DSCR controls the hardware prefetch engine and is saved
to avoid polluting one task's prefetch hints into another.
User TLS follows the ELFv2 ABI: R13 holds the thread pointer in user mode. In kernel mode, R13 holds the PACA pointer (per-CPU area base). The kernel saves/restores the userspace R13 on kernel entry/exit. R13 is part of the general GPR save.
Extended state (lazy):
Three overlapping extended state components, each controlled by a separate
MSR bit:
| Component | Registers | Save size | MSR bit | Trigger |
|---|---|---|---|---|
| FPR | FPR0–FPR31 (64-bit each), FPSCR | 264 bytes | MSR.FP | Any FP instruction when MSR.FP = 0 |
| VMX/AltiVec | VR0–VR31 (128-bit each), VRSAVE, VSCR | 528 bytes | MSR.VEC | Any VMX instruction when MSR.VEC = 0 |
| VSX | VS0–VS63 (the VSX register file overlays FPR0–31 and VR0–31) | Covered by FPR + VMX saves | MSR.VSX | Any VSX instruction when MSR.VSX = 0 |
The VSX register file (VS0–VS63) is not an additional 64 independent registers: VS0–VS31 are the same physical registers as FPR0–FPR31 (double-precision view), and VS32–VS63 are the same physical registers as VR0–VR31. Saving FPR and VMX captures the complete VSX state; there is no additional VSX-specific save area.
All three components are saved independently and lazily: a task that uses FPR but
not VMX pays only the 264-byte FPR save cost. MSR.VSX enables the xvmaddadp
class instructions that cross the FPR/VMX boundary; it requires both MSR.FP and
MSR.VEC to be set first.
Debug registers (lazy):
DAWR0 and DAWRX0 (data address watchpoint register, introduced POWER9) for
hardware memory watchpoints. Hardware instruction breakpoints via CIABR
(Completed Instruction Address Breakpoint Register). Saved only when
debug_active is set.
s390x:
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | R0–R15 | 16 general registers (64-bit each) |
| Program Status Word | PSW (instruction address + condition code + system mask) | Saved/restored via LPSWE/EPSW; encodes PC, addressing mode, condition code, interrupt masks, and DAT mode |
| Control registers | CR0–CR15 | 16 control registers governing interrupts, address-space control, tracing, clock comparator, and PER (Program Event Recording). CR1 holds the primary ASCE (Address Space Control Element, the page table root). CR7 holds the secondary ASCE. CR13 holds the home ASCE |
| Access registers | AR0–AR15 | 16 access registers (32-bit each); used for secondary-space addressing (AR mode). Each AR selects which address space (primary, secondary, or home) a corresponding GPR-based address refers to |
| Thread pointer | AR0 (by convention) | glibc on s390x uses AR0 + ALET for TLS base addressing; the kernel saves AR0–AR15 as part of the access register block |
s390x context switching uses the STMG/LMG instructions to save/restore GPRs
in a single instruction pair (store multiple / load multiple). The PSW is not
directly readable as a single register — the current PSW is captured by taking an
interrupt (supervisor call) or by using EPSW (Extract PSW). On context switch,
the kernel stores the interrupted PSW (from the interrupt-old PSW area in the
lowcore) into the task's saved state.
Extended state (lazy):
FP and vector state is controlled by control register bits. CR0 bits 56–63 control the AFP (Additional Floating Point) register facility. The VX (Vector Extension) facility, when present, extends the FP registers to 128-bit vector registers.
| Component | Registers | Save size | Enable mechanism | Trigger |
|---|---|---|---|---|
| FP (BFP/HFP) | FPR0–FPR15 (64-bit each), FPC (FP control register) | 132 bytes | CR0 AFP bit | Any FP instruction when AFP is disabled causes a data exception |
| Vector (VX facility) | V0–V31 (128-bit each; V0–V15 overlay FPR0–FPR15) | 512 bytes total (V0–V15: 256 bytes, including the FPR values in their low halves, plus V16–V31: 256 bytes) | CR0 VX enable bit | Any vector instruction when VX is disabled causes a vector-processing exception |
When the VX facility is present, V0–V15 are the 128-bit extensions of FPR0–FPR15 (the low 64 bits are the classical FPR values). Saving V0–V31 captures all FP state. When VX is not present, only the 16 classical 64-bit FPRs are saved. VX facility presence is detected from STFLE (Store Facility List Extended) bit 129 at boot.
Context switch policy (save and restore are symmetric — both conditional on prior use):
- If the thread has never used FP: CR0 AFP bit is cleared, no FP state is saved or restored (zero cost).
- If the thread uses FP but not VX: save FPR0–FPR15 + FPC (132 bytes) via STD/STFPC;
restore via LD/LFPC. CR0 AFP bit is set for the incoming thread.
- If the thread uses VX: save V0–V31 via VSTM (vector store multiple), which captures
all FP + vector state in one operation; restore via VLM. CR0 VX enable bit is set.
Cost: ~40–80 ns depending on how many registers are dirty.
- On return to userspace: the kernel restores the incoming thread's FP/VX state only if
that thread previously used FP/VX (lazy trap-on-first-use). There is no unconditional
restore on the user-return path — the CR0 AFP/VX bits remain cleared if the thread
has never used FP, causing a data exception on first FP use which triggers allocation.
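The software-conditional decision reduces to a three-way plan. A sketch, with hypothetical enum and function names:

```rust
#[derive(PartialEq, Debug)]
enum S390FpPlan { None, FprOnly, FullVector }

/// s390x has no trap-on-first-use, so the plan is driven by per-task
/// usage flags, per the policy above. Returns the plan and save size.
fn s390_fp_plan(used_fp: bool, used_vx: bool) -> (S390FpPlan, usize) {
    match (used_fp, used_vx) {
        (false, false) => (S390FpPlan::None, 0),          // CR0 AFP clear: zero cost
        (true, false) => (S390FpPlan::FprOnly, 132),      // FPR0-15 + FPC via STD/STFPC
        (_, true) => (S390FpPlan::FullVector, 512),       // V0-V31 via VSTM
    }
}
```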
Debug registers (lazy):
PER (Program Event Recording) registers: CR9 (PER event mask), CR10 (PER starting address),
CR11 (PER ending address). PER provides instruction-address and storage-alteration tracing.
Saved only when debug_active is set.
LoongArch64:
Always saved:
| Register group | Registers | Notes |
|---|---|---|
| General-purpose | R0–R31 | 32 integer registers; R0 is hardwired zero and is never saved. R1 (RA) holds the return address. R3 (SP) is the stack pointer |
| CSR.PRMD | Previous mode register | Saves the privilege level (PLV), interrupt enable (PIE), and watchpoint enable (PWE) state from before the exception. Restored on ertn (exception return) |
| CSR.ERA | Exception Return Address | Holds the PC to return to after exception handling |
| Thread pointer | R2 (TP) | User-mode thread pointer register; holds glibc TLS base |
LoongArch64 uses CSR.PRMD (not a general SPSR equivalent) to record the pre-exception processor state. On context switch, the kernel saves CSR.PRMD and CSR.ERA into the task's saved state. The kernel per-CPU pointer is stored in CSR.KS0 (scratch register 0) and is not per-task.
Extended state (lazy):
FP/SIMD state is controlled by CSR.EUEN (Extended Unit Enable register). Individual bits in CSR.EUEN control access to the FPU, LSX (Loongson SIMD Extension, 128-bit), and LASX (Loongson Advanced SIMD Extension, 256-bit).
| Component | Registers | Save size | CSR.EUEN bit | Trigger |
|---|---|---|---|---|
| FPU | F0–F31 (64-bit each), FCSR0 (FP control/status) | 260 bytes | Bit 0 (FPE) | Any FP instruction when FPE=0 causes a floating-point disabled exception |
| LSX (128-bit SIMD) | VR0–VR31 (128-bit each; overlay F0–F31) | 512 bytes (VR0–VR31 in full; the low 64 bits are the F registers) | Bit 1 (SXE) | Any LSX instruction when SXE=0 causes an LSX disabled exception |
| LASX (256-bit SIMD) | XR0–XR31 (256-bit each; overlay VR0–VR31) | 1024 bytes (XR0–XR31 in full; the low 128 bits are the VR registers) | Bit 2 (ASXE) | Any LASX instruction when ASXE=0 causes a LASX disabled exception |
The register files are hierarchically overlaid: XR0–XR31 (256-bit) contain VR0–VR31 (128-bit) which contain F0–F31 (64-bit). Saving the widest enabled component captures all narrower state. When LASX is enabled, saving XR0–XR31 captures LSX and FP state.
Context switch policy:
- If CSR.EUEN.FPE = 0: no FP/SIMD state is saved (zero cost).
- If FPE = 1 but SXE = 0: save F0–F31 + FCSR0 (260 bytes) via fst.d.
- If SXE = 1 but ASXE = 0: save VR0–VR31 via vst (512 bytes captures FP state too).
- If ASXE = 1: save XR0–XR31 via xvst (1024 bytes captures all FP + LSX state).
LSX and LASX presence is detected from CPUCFG register 2 (bits 6 and 7) at boot.
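The hierarchical widest-wins rule above can be sketched as:

```rust
/// Save size for the widest enabled LoongArch64 extended unit; wider
/// units subsume the narrower overlaid register files.
fn loongarch_ext_save_bytes(fpe: bool, sxe: bool, asxe: bool) -> usize {
    if asxe {
        1024 // XR0-XR31 via xvst: captures LSX and FP state too
    } else if sxe {
        512  // VR0-VR31 via vst: captures FP state too
    } else if fpe {
        260  // F0-F31 + FCSR0 via fst.d
    } else {
        0    // no FP/SIMD use: zero cost
    }
}
```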
Debug registers (lazy):
LoongArch64 provides hardware breakpoint and watchpoint registers via CSR.DB0ADDR–
CSR.DB7ADDR (data breakpoint addresses) and CSR.IB0ADDR–CSR.IB7ADDR (instruction
breakpoint addresses), with corresponding control registers CSR.DB0CTL–CSR.DB7CTL
and CSR.IB0CTL–CSR.IB7CTL. Up to 8 data breakpoints and 8 instruction breakpoints
are supported. The actual count is read from CPUCFG register 6 at boot. Saved only
when debug_active is set.
Lazy FP/SIMD save — unified policy statement:
UmkaOS uses lazy or conditional FP/SIMD save on all supported architectures. Extended state is saved and restored only for threads that have actually used the corresponding unit. On architectures with hardware trap-on-first-use (x86-64, AArch64, ARMv7, RISC-V, PPC32, PPC64LE, LoongArch64), the mechanism is hardware-lazy (trap on first use). On s390x, the mechanism is software-conditional (per-task usage flags, no trap). The mechanism by which laziness is enforced is architecture-specific:
| Architecture | Lazy FP mechanism | Lazy vector mechanism |
|---|---|---|
| x86-64 | CR0.TS=1 causes #NM on first FP/SSE use; XSAVES saves only dirty XSAVE components | Same XSAVE component mask; AVX/AVX-512/AMX each have independent bits |
| AArch64 | CPACR_EL1.FPEN=0b00 causes trap on first NEON/FP use | CPACR_EL1.ZEN=0b00 causes #UND on first SVE use; SMCR_EL1.ENA=0 traps SMSTART |
| ARMv7 | FPEXC.EN=0 causes undefined-instruction trap on first VFP/NEON use | N/A (no vector extension beyond NEON) |
| RISC-V | sstatus.FS=Off causes illegal-instruction trap; hardware sets FS=Dirty on write | sstatus.VS=Off causes illegal-instruction trap; hardware sets VS=Dirty on write |
| PPC32 | MSR.FP=0 causes floating-point unavailable exception | N/A (SPE if present uses same exception mechanism) |
| PPC64LE | MSR.FP=0 causes FP unavailable; MSR.VEC=0 causes VMX unavailable | MSR.VSX=0 causes VSX unavailable; requires FP+VEC first |
| s390x | Software-tracked conditional. z/Architecture has no trap-on-first-use for FP/VX. UmkaOS tracks per-task FP/VX usage via software flags (set when FP state is first dirtied during context switch inspection). If a task has never used FP: CR0 AFP bit is cleared, no FP state is saved or restored (zero cost). If a task has used FP: save/restore only the register ranges actually used (FP only, or FP+VX). This is not hardware-lazy (no trap instruction), but achieves the same outcome: tasks that never touch FP pay zero context-switch cost. | Same — software-conditional; per-task VX usage flag. |
| LoongArch64 | CSR.EUEN.FPE=0 causes FP disabled exception on first FP use | CSR.EUEN.SXE=0 causes LSX disabled exception; CSR.EUEN.ASXE=0 causes LASX disabled exception |
A task that never touches FP or SIMD registers pays zero extended-state save cost on every context switch across all architectures.
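One possible unification of these mechanisms, as a hypothetical trait the switch path could call (an illustration, not part of the spec):

```rust
/// Hypothetical per-architecture hook pair. Each backend would derive
/// `dirty` from its hardware flag (XSTATE_BV, sstatus.FS/VS, MSR bits,
/// CSR.EUEN) or, on s390x, from software usage flags.
trait LazyExtState {
    /// True if the outgoing task dirtied extended state since last save.
    fn dirty(&self) -> bool;
    /// Save the extended state if dirty; returns bytes written.
    fn save(&mut self) -> usize;
}

/// Software-flag model (the s390x-style conditional fallback).
struct SoftFlagState {
    used: bool,
    area_bytes: usize,
}

impl LazyExtState for SoftFlagState {
    fn dirty(&self) -> bool {
        self.used
    }
    fn save(&mut self) -> usize {
        if self.used { self.area_bytes } else { 0 }
    }
}
```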
7.3.3 Post-Context-Switch Cleanup (finish_task_switch)¶
After the hardware context switch completes and the new task begins executing,
finish_task_switch() performs the handshake that completes the switch: releasing
the runqueue lock, re-enabling preemption, and freeing the previous task's
resources if it was exiting. This is the first code the new task executes after
being switched to.
Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules.
&self methods use interior mutability for mutation. Atomic fields use .store()/.load(). See CLAUDE.md Spec Pseudocode Quality Gates.
Call chain: schedule() acquires the runqueue lock, calls context_switch()
(which performs the hardware switch), and the first code the new task executes is
finish_task_switch(prev). The prev parameter is passed through the hardware
switch (saved in a callee-saved register or on the kernel stack before the switch,
restored after).
/// Post-context-switch cleanup. Called by the scheduler after switching
/// from `prev` to the current task. Runs in the context of the NEW task
/// (the one that was switched TO).
///
/// This function completes the context switch handshake: it releases the
/// runqueue lock that was held during the switch, re-enables preemption,
/// and handles cleanup of the previous task if it was exiting.
///
/// # Arguments
/// - `prev`: The task that was switched away from. Its register state was
/// saved by `context_switch()` before the hardware switch.
///
/// # Preconditions
/// - The caller holds the local CPU's runqueue lock (acquired by `schedule()`
/// before `context_switch()`).
/// - Preemption is disabled (disabled by `schedule()` before the switch).
/// - The current task is `next` from the preceding `context_switch(prev, next)`.
///
/// # Postconditions
/// - The runqueue lock is released.
/// - Preemption is re-enabled.
/// - If `prev` was TASK_DEAD, its kernel stack and task struct are freed.
fn finish_task_switch(prev: &Task) {
let rq = this_rq();
// Step 1: Complete the runqueue lock handshake.
// The rq lock was acquired by schedule() before context_switch().
// It is released here, after the switch, by the NEW task.
// This ensures the rq state is consistent across the switch boundary:
// no other CPU can observe a half-switched state (prev still on rq
// but next already running).
//
// SAFETY: The lock was acquired by schedule() and is guaranteed held
// at this point. The hardware switch preserves the lock state because
// the lock is stored in the per-CPU runqueue struct, not on the stack.
unsafe { rq.lock.unlock() };
// Step 2: Re-enable preemption.
// Preemption was disabled by schedule() before the switch. Re-enabling
// it here allows the new task to be preempted by higher-priority tasks
// or IRQs. This MUST happen after the rq lock is released (Step 1) —
// holding a spinlock with preemption enabled is a deadlock risk.
preempt_enable();
// Step 3: Check if the previous task is dead (TASK_DEAD state).
// TASK_DEAD means the task has been fully reaped (past zombie, parent
// called waitpid, release_task() set TASK_DEAD). The task cannot free
// its own kernel stack because it was still executing on it during the
// context_switch(). The NEXT task (us) frees it here, safely.
//
// Note: ZOMBIE tasks are NOT freed here. A zombie retains its Task
// struct and kernel stack until the parent calls waitpid(), which
// invokes release_task() → transitions ZOMBIE to DEAD → the NEXT
// schedule() on that CPU's finish_task_switch() frees the stack.
let prev_state = prev.state.load(Acquire);
if prev_state == TaskState::DEAD.bits() {
// Free the previous task's kernel stack.
// The stack was allocated from the kernel stack slab
// ([Section 4.3](04-memory.md#slab-allocator)) at fork()/clone() time. The slab
// allocator returns the stack pages to the per-CPU magazine
// (hot path, no global lock).
//
// SAFETY: prev is TASK_DEAD and will never be scheduled again.
// No other CPU can reference prev's stack because:
// (a) prev was removed from the runqueue in do_exit() Step 14,
// (b) prev's state is DEAD (Release store in release_task()
// pairs with our Acquire load above), and
// (c) the rq lock that we just released in Step 1 serialized
// the switch — no CPU can be in the middle of switching TO prev.
unsafe {
free_kernel_stack(prev.stack_base, prev.stack_size);
}
// Drop the task struct reference. If this is the last Arc reference
// (typical for TASK_DEAD — the parent's waitpid already dropped its
// reference in release_task()), the Task struct is freed via
// Arc::drop, returning memory to the task slab cache.
//
// PID slot release: the PID was already returned to the PID
// allocator in release_task() (called by the parent's waitpid).
// UID task count was also decremented there.
unsafe { drop(Arc::from_raw(prev as *const Task)) };
}
// Step 4: Fire scheduler notifiers.
// These are lightweight callbacks registered by subsystems that need
// to act on every context switch IN event (from the new task's
// perspective):
//
// - KVM: If the new task is a vCPU thread, re-enter the guest
// (via kvm_sched_in() → vmresume/vmlaunch).
// - Perf: perf_schedule_in() was already called in context_switch()
// step 7 — no additional action here.
// - Cgroup: update per-CPU cgroup tracking for the new task.
//
// The notifier list is a per-CPU ArrayVec (bounded, no allocation).
// Each callback is expected to complete in < 100ns.
for notifier in CpuLocal::get().sched_in_notifiers.iter() {
notifier.on_sched_in();
}
}
Why the rq lock is released by the NEW task, not the OLD task: The rq lock
must be held across the entire context switch to prevent another CPU from observing
an inconsistent state (e.g., the old task still on the rq while the new task is
already running). The old task cannot release the lock because it stops executing
at the hardware switch point. The new task is the first to execute after the switch
and is responsible for releasing the lock. This is the same design as Linux
(finish_task_switch() in kernel/sched/core.c).
Kernel stack deferred free: A task cannot free its own kernel stack — it is still
executing on that stack at the time of schedule(). Linux solves this with the same
finish_task_switch() pattern: the NEXT task that runs on the CPU frees the dead
task's stack. In UmkaOS, the stack was allocated from the kernel stack slab cache
(Section 4.3), so freeing it returns the pages to the per-CPU magazine with
no global lock (hot path, ~10ns).
Architecture notes: finish_task_switch() is architecture-neutral. The
architecture-specific portion of the context switch ends at the hardware switch
point (step 6 in the context switch procedure above). All post-switch cleanup is
generic code. The prev pointer is passed through the switch via an
architecture-specific mechanism:
| Architecture | prev passing mechanism |
|---|---|
| x86-64 | Callee-saved register (r12 or rbx) preserved across __switch_to() |
| AArch64 | Callee-saved register (x19) preserved across cpu_switch_to() |
| ARMv7 | Callee-saved register (r4) preserved across __switch_to() |
| RISC-V | Callee-saved register (s0) preserved across __switch_to() |
| PPC32 | Callee-saved register (r14) preserved across _switch() |
| PPC64LE | Callee-saved register (r14) preserved across _switch() |
| s390x | Callee-saved register (r6) preserved across __switch_to() |
| LoongArch64 | Callee-saved register ($s0/r23) preserved across __switch_to() |
7.3.4 CPU Hotplug Integration¶
CPU hotplug (CPUs going offline/online at runtime) must be handled by the scheduler to migrate tasks and maintain invariants. UmkaOS supports full CPU hotplug on all architectures.
Offline sequence (CPU N going offline):
1. Mark CPU N as draining: set runqueue[N].state = DRAINING.
New tasks are no longer scheduled onto CPU N (load balancer skips it).
2. Migrate tasks from runqueue[N]:
For each runnable task in runqueue[N] (EEVDF tree + RT queues + DL queues):
a. Select migration target: prefer same-NUMA-node CPU with lowest load
(EAS-aware, Section 7.1.5).
b. Dequeue from runqueue[N], set task.cpu = target_cpu,
enqueue on runqueue[target_cpu].
c. If a task is currently running on CPU N: wait for it to yield
(it will find DRAINING state on next preemption point and yield).
3. Drain RCU quiescent state:
Call rcu_barrier() to process all pending RCU callbacks that reference
CPU N's per-CPU data. CPU N reports a final quiescent state.
4. Drain per-CPU slab magazines:
Flush CPU N's per-CPU slab magazines to their per-NUMA partial lists
(returns cached pages to the system).
5. Drain per-CPU writeback queue:
Flush any pending writeback work on CPU N.
6. Mark CPU N offline:
clear_bit(N, cpu_online_mask).
CPU N executes arch::current::cpu::park() (HLT loop on x86-64,
WFI in low-power state on AArch64, pause loop on RISC-V).
7. Notify subsystems:
Fire cpu_hotplug_notifier(OFFLINE, N) to allow subsystems (networking,
RCU, scheduler) to clean up CPU-N-specific state.
Online sequence (CPU N coming online):
1. Architecture-specific bring-up:
x86-64: INIT-SIPI-SIPI sequence via LAPIC.
AArch64: PSCI CPU_ON call.
RISC-V: SBI HSM hart_start call.
2. Per-CPU data initialization:
CPU N's PerCpu data structures are NOT re-allocated (they were sized
at boot for all possible CPUs — see Section 3.1.3). Only state is reset:
- runqueue[N]: initialize as empty, state = ACTIVE.
- CpuLocal block for CPU N: zero-fill state fields.
- RCU: call `rcu_cpu_online(N)` — updates leaf node `online_mask`,
sets `rcu_percpu[N].gp_seq_local`, clears `CpuLocal::rcu_passed_quiesce`.
3. Mark CPU N online:
set_bit(N, cpu_online_mask).
4. Fire cpu_hotplug_notifier(ONLINE, N).
5. Load balancer picks up CPU N in the next balance interval and starts
migrating tasks to it.
Design note: UmkaOS's per-CPU arrays are sized at boot for num_possible_cpus()
(all CPUs that could ever be brought online, including hotplugged ones).
This means CPU offline/online is a pure state machine transition with no
memory allocation, matching the "no hardcoded MAX_CPUS" principle and
enabling sub-millisecond hotplug transitions.
7.4 Platform Power Management¶
Standards: ACPI 6.5 Section 1.3 (Power Management), Intel SDM Vol 3B Section 18.9 (RAPL MSRs), AMD PPR (Zen2+) Section 2.1.9 (RAPL), ARM Energy Model (Documentation/power/energy-model.rst), IPMI v2.0 Section 11.2 / DCMI v1.5. IP status: All interfaces are open standards or documented hardware interfaces. No proprietary implementations referenced.
7.4.1 Problem and Scope¶
Power management is a kernel responsibility — not because policy belongs in the kernel, but because the mechanisms that enforce policy require ring-0 privileges and sub-millisecond response latency:
Why ring-0 is required for power management mechanisms:
- RAPL MSR writes require ring-0 access. Intel and AMD RAPL power-limit registers (e.g., MSR_PKG_POWER_LIMIT at 0x610) are privileged MSRs. A WRMSR instruction executed from ring-3 causes a #GP(0) fault. There is no userspace API that provides equivalent direct hardware control; powercap sysfs writes go through the kernel driver.
- Thermal trip point response must be sub-millisecond. A thermal Critical trip point (typically 5–10 °C below the hardware PROCHOT shutdown temperature) requires an immediate forced poweroff. The kernel cannot wait for a userspace daemon to wake up, read a netlink event, and issue a shutdown ioctl — that path has unbounded latency. The kernel's thermal interrupt handler must act directly. Note: on SMI-mediated platforms (where IA32_MISC_ENABLE.FORCEOP enables firmware-first thermal handling), the kernel's thermal interrupt fires after the SMI handler completes (50–150 µs typical, up to 100 ms on slow BIOS implementations). The "sub-millisecond" target applies to the kernel's response time after receiving the interrupt, not to the end-to-end latency, which includes firmware processing.
- cgroup power accounting requires kernel-side energy counter integration. Attributing energy consumption to a cgroup requires reading RAPL energy counters at the same scheduler tick that records CPU time — these are indivisible from a correctness standpoint. A userspace poller cannot atomically correlate energy deltas to the task that was running.
- VM power budgets must be enforced even if the VM misbehaves. A guest OS cannot be trusted to self-limit its power consumption. The hypervisor (umka-kvm) must enforce power caps from outside the VM, using kernel-level RAPL and cgroup mechanisms.
Scope: This section covers mechanisms only:
- RAPL hardware interface abstraction (RaplInterface trait, Section 7.4)
- Thermal zone and trip-point framework (ThermalZone, Section 7.4)
- Powercap sysfs hierarchy (Section 7.4)
- cgroup power accounting and per-cgroup power limits (Section 7.4)
- VM watt-budget enforcement (Section 7.4)
- DCMI/IPMI rack-level power management (Section 7.4)
Policy — which power profile a user selects, when to throttle a VM for economic
reasons, how to balance performance against energy cost — is a userspace/orchestrator
concern. The kernel provides the enforcement hooks; daemons (e.g., tuned, power-profiles-daemon,
umka-kvm's scheduler) invoke them.
7.4.2 RAPL — Running Average Power Limit¶
7.4.2.1 Domain Taxonomy¶
RAPL partitions the platform into named power domains. Each domain has independent power-limit registers and energy-status counters:
| Domain | Scope | Availability |
|---|---|---|
| Pkg | Entire CPU socket including uncore (LLC, memory controller, PCIe root complex, integrated graphics on server SKUs) | Intel SNB+, AMD Zen2+ |
| Core (PP0) | CPU cores only (excluding uncore). Useful for isolating compute vs memory-bandwidth workloads. | Intel SNB+ |
| Uncore (PP1) | Integrated GPU / GT on Intel client SKUs. Not present on server SKUs (Xeon). | Intel client only |
| Dram | Memory controller and attached DIMMs. Separate power rail on server platforms. | Intel IVB-EP+, AMD Zen2+ server |
| Platform (PSYS) | Entire platform as measured from the charger/PSU side. Introduced on Intel Skylake+ client. Captures power not visible to PKG (PCH, NVMe, display). | Intel SKL+ client only |
The Core domain is always ≤ Pkg. Platform ≥ Pkg because it includes
peripheral power not counted by the socket energy counter.
7.4.2.2 MSR Interface (x86-64 / x86)¶
Intel RAPL is exposed via Model-Specific Registers readable/writable with
RDMSR/WRMSR from ring-0. The register layout is documented in Intel SDM Vol 3B Section 18.9.
Key registers for the Pkg domain (other domains follow the same pattern at
different base addresses):
| MSR Address | Name | Direction | Purpose |
|---|---|---|---|
| 0x610 | MSR_PKG_POWER_LIMIT | R/W | Set short-window and long-window power limits |
| 0x611 | MSR_PKG_ENERGY_STATUS | R | Read cumulative energy counter (32-bit; wraps at 2^32 energy units, roughly 65–260 kJ for typical units) |
| 0x613 | MSR_PKG_PERF_STATUS | R | Throttle duty cycle (fraction of time spent in power throttle) |
| 0x614 | MSR_PKG_POWER_INFO | R | Thermal Design Power (TDP), minimum, and maximum power |
MSR_PKG_POWER_LIMIT bit layout:
- Bits 14:0 — Long-window power limit (in hardware power units from MSR_RAPL_POWER_UNIT)
- Bit 15 — Enable long-window limit
- Bit 16 — Clamping enable (allow limit to go below TDP; requires CLAMPING_SUPPORT flag)
- Bits 23:17 — Long-window time window (tau_x; encoded per the SDM as 2^Y × (1 + Z/4) × base time unit, typically ≤ 28 s)
- Bits 30:24 — Reserved
- Bit 31 — Reserved
- Bits 46:32 — Short-window power limit
- Bit 47 — Enable short-window limit
- Bit 48 — Short-window clamping enable
- Bits 55:49 — Short-window time window (tau_y, ≤ 10 ms)
- Bits 62:56 — Reserved
- Bit 63 — Lock bit (locks the entire register until next RESET; kernel must not set this)
The short-window limit (tau_y ≤ 10 ms) is the primary mechanism for burst
suppression. The long-window limit (tau_x ≈ 28 s) enforces sustained average power.
Setting both gives a two-tier policy: allow short bursts up to short_limit_W for
up to 10 ms, but enforce long_limit_W on average.
Energy units are encoded in MSR_RAPL_POWER_UNIT (address 0x606). The driver
must read this at boot and convert all values accordingly.
7.4.2.3 AMD Equivalent¶
AMD Zen2 and later processors implement RAPL-compatible MSRs at the same addresses
(0x610, 0x611, 0x614) with the same bit layout. This allows the same MSR
driver to serve both Intel and AMD on Zen2+.
Older AMD processors (pre-Zen2) use the System Management Unit (SMU), a
co-processor accessible via PCI config space (bus 0, device 0, function 0, PCI
vendor/device ID varies by generation). The SMU interface is not publicly
documented; UmkaOS uses the same reverse-engineered interface as the Linux
amd_energy driver (kernel/drivers/hwmon/amd_energy.c).
The RaplInterface abstraction (Section 7.4) hides this difference from upper layers.
7.4.2.4 ARM and RISC-V Equivalents¶
ARM Energy Model (EM): ARM SoCs do not expose hardware energy counters equivalent
to RAPL. Instead, the ARM Energy Model framework provides estimated power consumption
based on empirically measured power coefficients per CPU frequency operating point
(OPP). Each OPP has a power_mW coefficient stored in the device tree
(operating-points-v2 table). The kernel integrates over active OPPs to estimate
energy. This is less accurate than RAPL but enables the same cgroup accounting
interface (Section 7.4).
RISC-V: There is no standardised RAPL equivalent in the RISC-V ISA or the SBI
specification as of SBI v2.0. Platform-specific power management is exposed via
vendor SBI extensions (e.g., T-HEAD/Alibaba extensions for their RISC-V SoCs).
UmkaOS implements a NoopRaplInterface for RISC-V that returns
PowerError::NotSupported for all limit-setting operations and provides zero energy
readings. Cgroup accounting falls back to CPU-time-weighted estimation.
7.4.2.5 Kernel Abstraction¶
All RAPL consumers (cgroup accounting, VM power budgets, thermal passive cooling,
DCMI enforcement) interact with power domains through the RaplInterface trait,
never touching MSRs directly:
/// The type of a RAPL power domain.
/// See also the comprehensive PowerDomainType at [Section 7.7](#power-budgeting--design-power-as-a-schedulable-resource) which extends
/// this to non-CPU domains (accelerators, NICs, storage).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[repr(u32)]
pub enum PowerDomainType {
/// Entire CPU socket including uncore (LLC, memory controller, PCIe root).
CpuPackage = 0,
/// CPU cores only (PP0). Excludes uncore.
CpuCore = 1,
/// Memory controller and attached DIMMs. Server platforms only.
Dram = 2,
/// GPU / accelerator.
Accelerator = 3,
/// NIC (if power-metered).
Nic = 4,
/// NVMe SSD (if power-metered).
Storage = 5,
/// Entire system (platform-level RAPL or BMC).
Platform = 6,
}
/// A single RAPL power domain with its hardware interface and energy accumulator.
///
/// This is the x86/RAPL-specific domain object used for the powercap sysfs hierarchy
/// and energy accounting (Section 7.2.4). The generic cross-architecture abstraction
/// is `GenericPowerDomain` defined in Section 7.4.2.
pub struct RaplDomain {
/// The type of power domain this represents.
pub domain_type: PowerDomainType,
/// The hardware driver implementing this domain's register interface.
pub hw_interface: Arc<dyn RaplInterface>,
/// Cumulative energy consumed by this domain in microjoules.
///
/// This is a **software accumulator** (u64), distinct from the hardware
/// energy counter. The hardware counter (Intel MSR `MSR_RAPL_POWER_UNIT`
/// + `MSR_PKG_ENERGY_STATUS`) is typically 32 bits and wraps roughly every
/// 300-1300 seconds at 200 W, depending on the energy unit
/// (wrap period = 2^32 * energy_unit / power).
/// The kernel power accounting thread (Section 7.2.5) polls the hardware
/// counter at `poll_interval = wrap_time / 2` and accumulates the delta
/// into this u64 field, which effectively never wraps (at 200 W sustained,
/// the u64 overflows after roughly 2,900 years).
///
/// **Implementer note**: The hardware poll loop must use `read_volatile`
/// for the MSR read and compute `delta = (new - old) & hw_mask` to
/// handle the 32-bit hardware wrap correctly.
///
/// Readers must still handle wrap-around by tracking deltas (in case
/// the software counter wraps at u64::MAX, which is astronomically
/// unlikely but must be handled for 50-year correctness).
pub energy_uj: AtomicU64,
/// Socket index (0-based) this domain belongs to.
pub socket_id: u32,
}
/// Hardware interface for reading and controlling a RAPL power domain.
///
/// Implementations exist for: Intel MSR (`IntelRaplMsr`), AMD MSR/SMU
/// (`AmdRaplInterface`), ARM Energy Model (`ArmEmInterface`), and no-op
/// (`NoopRaplInterface` for platforms without hardware support).
///
/// # Safety
///
/// Implementations that write MSRs must only do so from ring-0 kernel context.
/// MSR writes from interrupt context are permitted but must be idempotent and
/// must not acquire locks that could be held by non-interrupt code.
pub trait RaplInterface: Send + Sync {
/// Set a power limit on the given domain.
///
/// `limit_mw` is the power limit in milliwatts.
/// `window_ms` is the averaging window in milliseconds. Hardware may
/// round to the nearest supported window; callers must not assume exact values.
///
/// Returns `PowerError::NotSupported` if the domain or windowed limiting
/// is not available on this platform.
fn set_power_limit(
&self,
domain: PowerDomainType,
limit_mw: u32,
window_ms: u32,
) -> Result<(), PowerError>;
/// Remove a previously set power limit on the given domain, restoring
/// the hardware default (TDP-derived limit).
///
/// Returns `PowerError::NotSupported` if the domain is not available.
fn clear_power_limit(&self, domain: PowerDomainType) -> Result<(), PowerError>;
/// Read the cumulative energy consumed by the given domain in microjoules.
///
/// The counter wraps at `max_energy_range_uj()`. Callers must track
/// previous values and compute deltas to handle wrap-around correctly.
///
/// Returns `PowerError::NotSupported` if the domain is not available.
fn read_energy_uj(&self, domain: PowerDomainType) -> Result<u64, PowerError>;
/// Read the Thermal Design Power (TDP) of the given domain in milliwatts.
///
/// This is the sustained power level the platform is designed to dissipate.
/// It is used as the upper bound for VM admission control (Section 7.2.6).
///
/// Returns `PowerError::NotSupported` if TDP information is not available.
fn read_tdp_mw(&self, domain: PowerDomainType) -> Result<u32, PowerError>;
/// Return the maximum value of the energy counter before it wraps, in microjoules.
///
/// Callers use this to correctly handle wrap-around in `read_energy_uj`.
fn max_energy_range_uj(&self, domain: PowerDomainType) -> Result<u64, PowerError>;
}
/// Errors returned by `RaplInterface` operations.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PowerError {
/// The requested domain or operation is not supported on this platform.
NotSupported,
/// The requested power limit is below the hardware minimum or above the TDP.
OutOfRange { min_mw: u32, max_mw: u32 },
/// MSR or SMU access failed (hardware error or driver not initialised).
HardwareFault,
/// The domain's power limit register is locked until next RESET.
Locked,
}
The platform boot sequence probes for available RAPL domains (by attempting
RDMSR and checking for #GP) and registers each discovered domain with the
global PowerDomainRegistry. Upper layers iterate the registry rather than
hard-coding which domains exist.
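Consumers that poll `read_energy_uj` must apply the wrap-handling contract stated in the trait. A minimal sketch of the delta computation (hypothetical helper name; the real accounting thread is described in Section 7.2.5):

```rust
/// Wrap-safe delta between two successive reads of a domain's energy
/// counter, both in microjoules. `max_range_uj` is the value returned by
/// RaplInterface::max_energy_range_uj() for the domain. The poll interval
/// must be shorter than the wrap period so that at most one wrap can
/// occur between the two reads.
fn energy_delta_uj(prev_uj: u64, curr_uj: u64, max_range_uj: u64) -> u64 {
    if curr_uj >= prev_uj {
        curr_uj - prev_uj
    } else {
        // The counter wrapped exactly once between the reads.
        (max_range_uj - prev_uj) + curr_uj
    }
}
```

The accumulated delta is what gets added to `RaplDomain::energy_uj`, so upper layers only ever see a monotonic software counter.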
7.4.3 Per-Architecture Power Management Interfaces¶
The RaplInterface abstraction in Section 7.2.2 covers the common interface for
energy reading and power-limit setting. The hardware mechanisms that back that
interface differ substantially across architectures. This section specifies what
those mechanisms are so that the platform boot driver for each architecture knows
which registers, protocols, and firmware services to initialise.
All per-architecture power interfaces are accessed through the PlatformPowerOps
trait, which is registered at boot by each architecture's power driver (see
Section 7.2.2.5 for the RaplInterface trait that PlatformPowerOps builds on).
Upper layers — the cgroup power controller, the scheduler's Energy-Aware Scheduling
path (Section 7.1.5), and the FMA health subsystem (Chapter 20) — use this trait
exclusively. They never call architecture-specific MSRs, SCMI mailboxes, or SBI
extensions directly.
7.4.3.1 x86-64 (Intel and AMD)¶
Energy reporting:
Intel and AMD Zen2+ both expose energy via RAPL MSRs. The complete register map
and bit layout is specified in Section 7.2.2.2. Summary of the energy-status
registers used by the IntelRaplMsr and AmdRaplInterface implementations:
| MSR address | Domain | Availability |
|---|---|---|
| MSR_PKG_ENERGY_STATUS (0x611) | CPU socket (cores + uncore) | Intel SNB+; AMD Zen2+ |
| MSR_PP0_ENERGY_STATUS (0x639) | CPU cores only | Intel SNB+ |
| MSR_PP1_ENERGY_STATUS (0x641) | Integrated GPU (client only) | Intel client SKUs |
| MSR_DRAM_ENERGY_STATUS (0x619) | Memory controller + DIMMs | Intel IVB-EP+; AMD Zen2+ server |
| MSR_PLATFORM_ENERGY_STATUS (0x64D) | Entire platform (PSU side) | Intel SKL+ client only |
The energy unit is encoded in MSR_RAPL_POWER_UNIT (0x606); it must be read at
boot before any energy delta computation.
Frequency and voltage control:
- Intel P-states with HWP (Hardware-controlled Performance States): On Broadwell+ processors, HWP is enabled by writing bit 0 of IA32_PM_ENABLE (MSR 0x770). The scheduler then controls per-CPU performance hints via IA32_HWP_REQUEST (MSR 0x774), which encodes minimum performance, maximum performance, desired performance, and energy-performance preference (EPP) in a single 64-bit write. UmkaOS uses HWP when available in preference to legacy ACPI P-state (_PSS) switching.
- AMD P-states (CPPC): On Zen2+, frequency scaling uses the Collaborative Processor Performance Control (CPPC) interface exposed through ACPI CPPC objects or directly via MSR_AMD_CPPC_REQ (0xC00102B3). The desired_perf field in CPPC maps to a CPU frequency in the same role as HWP's desired performance field.
- Legacy ACPI _PSS: On older hardware without HWP or CPPC, UmkaOS falls back to ACPI P-state switching via the _PSS/_PPC/_PCT methods, which the ACPI driver evaluates and translates into MSR writes (e.g., IA32_PERF_CTL at 0x199).
Power caps and TDP:
- MSR_PKG_POWER_LIMIT (0x610): dual-window power limit (see Section 7.2.2.2 for the full bit layout). UmkaOS sets this register to enforce VM power budgets (Section 7.2.6) and rack-level DCMI caps (Section 7.2.7).
- MSR_PKG_POWER_INFO (0x614): read-only; provides TDP, minimum, and maximum power. The TDP value is used as the default admission-control ceiling for VM watt-budget enforcement.
- MSR_PKG_POWER_LIMIT lock bit (bit 63 of 0x610): UmkaOS never sets this bit. Setting it prevents further limit changes until the next platform RESET.
Thermal:
- IA32_THERM_STATUS (MSR 0x19C): per-core thermal status register. Bit 0 is the PROCHOT log (set when the core has throttled due to heat). Bits 22:16 encode the "thermal margin" (degrees Celsius below TjMax). UmkaOS reads this MSR periodically in the thermal polling loop (Section 7.2.3.5) and maps it to a ThermalZone temperature reading.
- IA32_PACKAGE_THERM_STATUS (MSR 0x1B1): package-level equivalent of the per-core thermal status; includes the PROCHOT log and package thermal margin.
- DTS (Digital Thermal Sensor): the thermal margin field combined with TjMax (read from MSR_TEMPERATURE_TARGET, address 0x1A2, bits 23:16) gives the absolute die temperature: T_die = TjMax - thermal_margin.
- ACPI _TSS/_TPC throttling methods remain as a fallback on platforms where DTS MSRs are not accessible from the OS.
Runtime device power management:
Device power state transitions (D0 to D3, and back) on x86-64 are driven by ACPI
_PS0/_PS3 control methods evaluated by the ACPI interpreter in umka-kernel.
The ACPI runtime PM path is common to all ACPI platforms (x86-64 and ACPI-based
AArch64 servers); it is not x86-specific beyond the x86 ACPI initialisation
sequence.
7.4.3.1.1 Idle State Management and C-State Restrictions (x86-64)¶
The cpuidle governor selects idle states from a per-model table. Some CPU models have known C-state errata that require restricting the available idle states:
| Errata Flag | CPUs | Restriction | Rationale |
|---|---|---|---|
| X86Errata::BAYTRAIL_CSTATE | Bay Trail / Cherry Trail (Atom Z3xxx, x5-Z8xxx) | Block C6 and deeper states | C6 freeze: CPU fails to wake from C6 on certain steppings, requiring platform reset |
| X86Errata::TSC_C3STOP | Pre-Nehalem Intel, some AMD | tsc_reliable = false | TSC stops in C3+ states, making it unusable as a clocksource when deep idle is allowed |
| X86Errata::HPET_PC10 | Coffee Lake, Ice Lake, Bay Trail | HPET unreliable in PC10 | HPET counter stops or glitches in package C10 |
| X86Errata::MWAIT_BROKEN | Apollo Lake, Ice Lake-X, Lunar Lake | IPI fallback for idle wakeup | MWAIT fails to wake on interrupt, requiring backup IPI delivery |
AMX tile state and deep C-states (Sapphire Rapids+): Before entering any C-state
deeper than C1, the kernel must execute TILERELEASE to release AMX tile data. The
hardware does not automatically save tile state on C-state entry (unlike FP/SSE/AVX
state, which is preserved across all C-states). Entering C6 with active tiles causes
silent data corruption of the tile registers on wake. The cpuidle enter path checks
CpuFeatureSet.xstate_used & XFEATURE_XTILEDATA and calls TILERELEASE when tiles
are active. On C-state exit, the lazy AMX fault (#NM from XFD) re-initializes tiles
on first use.
Multi-socket TSC desynchronization: On multi-socket systems, the TSCs of different
sockets may drift relative to each other (typically ~1-10 ppm, accumulating to
microseconds over hours). UmkaOS maintains per-socket TSC offset values in a
tsc_socket_offset[MAX_SOCKETS] array, calibrated during SMP bringup by measuring
round-trip IPI latency between sockets. The scheduler's idle duration estimation
and the clocksource watchdog (Section 7.6) use socket-local TSC
values adjusted by these offsets. If drift exceeds a threshold (>10μs divergence from
the reference socket), the kernel switches the system clocksource from TSC to HPET or
ACPI PM timer and logs a warning.
7.4.3.2 AArch64 (ARM Servers: Graviton, Neoverse, Ampere)¶
AArch64 server platforms do not expose RAPL-equivalent MSRs. Energy reporting and frequency control are provided by a combination of hardware activity counters and a firmware-mediated control channel (SCMI).
Energy reporting:
The Activity Monitor Unit (AMU, FEAT_AMU, introduced in Armv8.4) provides a set
of per-core hardware event counters accessible from EL1 via the AMEVCNTR0_EL0
and AMEVTYPER0_EL0 register families. Architecturally defined group-0 counters
include:
| Counter index | Event | Use |
|---|---|---|
| 0 | CPU cycles | Total core cycles consumed |
| 1 | Instructions retired | IPC computation |
| 2 | Memory stall cycles | DRAM latency pressure indicator |
| 3 | L3 cache miss stall cycles | LLC pressure (Neoverse V1/V2, Graviton3+) |
AMU counters provide activity data, not joules. To derive energy, UmkaOS integrates
AMU cycle counts against the per-OPP power coefficients from the ARM Energy Model
(stored in the device tree operating-points-v2 table as opp-microwatt values).
This gives an estimated energy per task, analogous to the delta-integration used on
RAPL platforms, but with lower accuracy (±10–20% typical).
AMU is present on Neoverse V1, Neoverse V2, Neoverse N2, Cortex-A78, Cortex-X1, and later cores. On older cores (Neoverse N1, Graviton2), the ARM Energy Model estimation falls back to CPU-time-weighted power at the current OPP frequency, with no per-core AMU data.
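The delta-integration described above reduces to a small computation per accounting tick. The `Opp` type and `amu_energy_uj` helper below are illustrative, assuming the opp-microwatt coefficient from the device tree's operating-points-v2 table:

```rust
/// One operating point from the operating-points-v2 table. Field names
/// are illustrative; `power_uw` corresponds to the opp-microwatt value.
struct Opp {
    freq_hz: u64,
    power_uw: u64,
}

/// Estimated energy (µJ) for `delta_cycles` AMU cycle-counter ticks spent
/// at `opp`: time = cycles / freq, energy = power * time.
/// The intermediate product is widened to u128 to avoid overflow on
/// large cycle deltas.
fn amu_energy_uj(delta_cycles: u64, opp: &Opp) -> u64 {
    (opp.power_uw as u128 * delta_cycles as u128 / opp.freq_hz as u128) as u64
}
```

Per-task attribution then follows the same pattern as RAPL: read the AMU counter at context switch, compute the delta, and charge the estimated microjoules to the outgoing task's cgroup.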
Frequency and voltage control — SCMI:
On AArch64 server platforms (Graviton, Ampere Altra/Altra Max, Neoverse RD series), CPU frequency and power-domain gating are controlled via the System Control and Management Interface (SCMI, ARM specification DEN0056, currently version 3.2). SCMI runs between the OS and a dedicated System Control Processor (SCP) or equivalent firmware agent (e.g., the Nitro controller on AWS Graviton instances) over a shared-memory mailbox (doorbell register + shared SRAM buffer).
SCMI protocols used by UmkaOS:
| SCMI protocol | Protocol ID | UmkaOS use |
|---|---|---|
| SCMI_PERF | 0x13 | P-state (OPP) transitions per CPU cluster or per-core. PERF_LEVEL_SET maps to the frequency request analogous to IA32_HWP_REQUEST on x86 |
| SCMI_POWER | 0x11 | Power-domain gating (power on/off entire CPU clusters, peripherals). Used during CPU hotplug (Section 7.1.7) and system suspend (Section 7.2.10) |
| SCMI_SENSOR | 0x15 | Read platform sensor values (die temperature, supply voltage); used by the thermal framework (Section 7.2.3) as the sensor backend on SCMI platforms |
| SCMI_PERF_CAP | 0x13 (cap sub-cmd) | Power capping per performance domain, where supported by the SCP firmware |
SCMI message exchange is asynchronous on multi-channel implementations (one
shared-memory channel per CPU cluster); UmkaOS posts a request and either polls or
waits for a doorbell interrupt (platform-dependent). Latency is typically
100–500 µs for a PERF_LEVEL_SET round-trip to the SCP. This is too slow for
per-task frequency switching; SCMI frequency control is therefore applied at the
granularity of runqueue load-balance intervals (typically 4–10 ms), not on every
context switch.
Power caps:
On Graviton2/3 instances, AWS exposes the Nitro hypervisor's power budget to the
guest OS via a platform-specific MMIO register or ACPI DSDT method; there is no
standard SCMI power-capping channel available from within a Graviton VM. On bare
metal (non-VM) AArch64 servers with SCMI power-capping protocol support, UmkaOS uses
SCMI_PERF_CAP to enforce rack-level power budgets.
The PlatformPowerOps::set_power_limit implementation for SCMI platforms translates
the limit_mw parameter into a performance level ceiling via the OPP table and
issues a SCMI_PERF_LEVEL_SET with that ceiling as the maximum. This achieves
power capping by constraining achievable frequency, not by a hardware power clamp
as on x86-64 RAPL.
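The cap-to-ceiling translation can be sketched as a lookup over the OPP power table (hypothetical helper name; the real implementation also maps the OPP index onto SCMI performance-level identifiers):

```rust
/// Pick the highest OPP whose modelled power fits under the cap.
/// `opp_power_mw` holds each OPP's modelled power draw in milliwatts,
/// sorted ascending by frequency (and therefore by power). The returned
/// index becomes the ceiling in the SCMI PERF_LEVEL_SET request; None
/// means even the lowest OPP exceeds the cap, so the limit cannot be
/// honoured by frequency capping alone.
fn perf_ceiling(opp_power_mw: &[u32], limit_mw: u32) -> Option<usize> {
    opp_power_mw.iter().rposition(|&p| p <= limit_mw)
}
```

This is why SCMI capping is best-effort compared to RAPL: the cap is quantised to OPP granularity, and leakage or uncore power below the lowest OPP remains uncontrollable.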
Thermal:
Temperature data on SCMI platforms is obtained via:
1. SCMI_SENSOR protocol: the SCP reads die-temperature sensors and exposes them
to the OS as named sensors. UmkaOS's thermal zone driver registers these as
SensorBackend::Scmi entries (Section 7.2.3.6).
2. Device tree thermal-zones nodes with thermal-sensors references: on embedded
and mobile AArch64 SoCs (Qualcomm, MediaTek, Samsung), temperature sensors are
MMIO-mapped and described in the device tree; UmkaOS's thermal zone driver reads
them directly.
3. ACPI _TSS/_TPC: on ACPI-enumerated AArch64 servers (those following the
ACPI for Arm specification, SBSA/SBBR), the ACPI thermal zone path is the same
as on x86-64.
Thermal trip-point response on SCMI platforms follows the same framework as x86-64
(Section 7.2.3): the thermal interrupt (or polling timer) fires the trip-point
callback, which issues a cooling action via CoolingDevice::set_state. On SCMI
platforms, the FrequencyScalingCooler implementation translates the cooling state
to a SCMI_PERF_LEVEL_SET call.
Runtime device power management:
Device power gating on AArch64 uses one of:
- Device tree power-domains nodes backed by SCMI SCMI_POWER protocol: the
generic power-domain framework calls SCMI_POWER_DOMAIN_STATE_SET to transition
devices between POWER_ON and POWER_OFF.
- PSCI SYSTEM_SUSPEND (function ID 0x8400_000E): used for system-wide suspend to
RAM (Section 7.2.10). Per-CPU idle states also use PSCI CPU_SUSPEND.
- On embedded/mobile: operating-points-v2 device tree nodes with regulator
framework bindings allow the kernel to request voltage changes alongside frequency
changes, forming a complete DVFS (Dynamic Voltage and Frequency Scaling) path.
7.4.3.3 RISC-V¶
The RISC-V ISA specification and the SBI (Supervisor Binary Interface, specification v2.0) do not define a standardised energy reporting or frequency scaling interface equivalent to RAPL or SCMI. Power management on RISC-V platforms is therefore entirely platform-specific.
Energy reporting:
The RISC-V ISA defines hardware performance monitor (HPM) counters: hpmcounter3
through hpmcounter31 (CSRs 0xC03–0xC1F), each counting a platform-defined event
selected by the corresponding mhpmevent machine-mode CSR. Whether any HPM counter
counts an energy-proxy event (e.g., CPU cycles at a known voltage-frequency point)
is platform-defined. On platforms where such a counter exists, UmkaOS's RISC-V energy
driver reads it and converts to milliwatts using a boot-time calibration coefficient
from the device tree or SBI vendor extension.
On platforms with no HPM energy counter, UmkaOS falls back to CPU-time-weighted power
estimation: power (mW) = active_fraction × OPP_power_mW, where OPP_power_mW
comes from the operating-points-v2 device tree node. Cgroup energy accounting
uses this estimated power.
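The fallback estimate is simple enough to state in code. A sketch, using integer arithmetic only (kernel context, no FPU); the function name is illustrative:

```rust
/// Estimate power from CPU-time share, as a fallback when no HPM energy
/// counter exists. `active_us` is the busy time observed within a
/// `window_us` sample window; `opp_power_mw` is the rated power of the
/// current OPP from the operating-points-v2 device tree node.
fn estimated_power_mw(active_us: u64, window_us: u64, opp_power_mw: u64) -> u64 {
    // power (mW) = active_fraction × OPP_power_mW, in integer arithmetic.
    active_us * opp_power_mw / window_us
}
```

For example, a hart that was busy 5 ms out of a 10 ms window at an OPP rated 1800 mW is charged 900 mW.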
Frequency and voltage control:
- SBI HSM (Hart State Management) extension (EID 0x48534D): the HART_SUSPEND function (FID 0x0) requests per-hart low-power clock-gating. This is the only standardised per-hart power state transition in SBI v2.0. It is used for idle (cpu_idle) and for offline CPUs (Section 7.1.7), not for frequency scaling.
- Frequency scaling: there is no standardised RISC-V frequency scaling interface in the SBI specification as of version 2.0. On server-class RISC-V platforms following the RISC-V Server Platform specification (published 2023), ACPI CPPC is required; UmkaOS uses the ACPI CPPC driver path (same as on x86-64 legacy CPPC platforms). On embedded RISC-V SoCs, frequency scaling uses platform-specific MMIO registers described in the device tree, accessed through a platform-specific clock driver.
- SBI vendor extensions: some RISC-V SoC vendors (e.g., T-HEAD/Alibaba for their C906/C910 series cores) define private SBI vendor extensions for DVFS. UmkaOS implements these as optional platform drivers registered at boot if the SBI probe reports the vendor extension ID.
Power caps:
No standardised RISC-V power-capping interface exists in the base SBI specification
or RISC-V platform specifications as of 2025. The PlatformPowerOps::set_power_limit
implementation for RISC-V returns PowerError::NotSupported unless a platform-specific
driver (loaded via device tree compatible string matching) implements a power-capping
MMIO interface.
Thermal:
Temperature sensors on RISC-V platforms are described in the device tree using the
standard thermal-zones binding with thermal-sensors references pointing to
platform-specific thermal sensor nodes (e.g., compatible = "sifive,fu740-temp").
UmkaOS's thermal zone driver reads them via the platform's sensor driver. Trip-point
response uses the same thermal framework as other architectures (Section 7.2.3).
Runtime device power management:
SBI HSM HART_SUSPEND provides per-hart suspend (with and without local context
retention, depending on the suspend_type field). System-wide suspend follows the
platform-specific mechanism (ACPI S3 on RISC-V ACPI platforms; device-tree power
domains on embedded platforms).
7.4.3.4 PPC32 and PPC64LE¶
IBM POWER and PowerPC platforms have two distinct power management environments: bare-metal (directly running on the hardware, including OpenPOWER) and LPAR (Logical Partition, running under the PowerVM or KVM hypervisor). The mechanisms differ between these environments.
Energy reporting:
- LPAR on IBM POWER (PowerVM hypervisor): Energy data is exposed via the PHYP (PowerVM Hypervisor) H-call H_GET_EM_PARMS, which returns the partition's current power consumption as measured by the system's power meters. This is the LPAR equivalent of RAPL: the hypervisor aggregates physical PSU data and attributes a share to each partition.
- Bare metal (OpenPOWER, POWER9/POWER10 with OPAL): The OPAL (OpenPOWER Abstraction Layer) firmware exposes power data via opal_sensor_read (OPAL call 0x30) and opal_sensor_read_u64 (0x52). The ibm,opal-sensors device tree node lists available sensors (die temperature, core power, memory power) by sensor handle. UmkaOS's OPAL sensor driver iterates this list at boot and registers each as a GenericPowerDomain in the PowerDomainRegistry.
- Bare metal without OPAL (classic PPC32 embedded): No hardware power counters are accessible from the OS. CPU-time-weighted OPP estimation is the only option.
- PMU counters: Both PPC32 and PPC64LE have hardware performance monitor facilities (configurable via MMCR0/MMCR1/MMCR2 and PMCx registers). These can count CPU cycles, L2/L3 misses, and memory bandwidth — useful energy proxies — but require platform-specific calibration. UmkaOS optionally uses PMC0 (total cycles) as a proxy if an OPAL/PHYP energy interface is not available.
Frequency and voltage control:
- LPAR (PowerVM): The H_SET_PPP (Processor Folding Priority) H-call allows a partition to request a change in its CPU frequency priority relative to other partitions on the same physical POWER system. This is not a direct frequency knob; the hypervisor honors the request subject to available capacity. UmkaOS issues H_SET_PPP from the scheduler's EAS path when the workload shifts between low and high throughput modes.
- Bare metal OpenPOWER (OPAL): On POWER8 and POWER9 systems with OPAL, CPU frequency is controlled via the opal_set_freq call or by writing to the EPS (Energy Management) registers via OPAL. UmkaOS uses the OPAL cpufreq driver for OpenPOWER platforms.
- ACPI on OpenPOWER: POWER9 and POWER10 systems running the Little-Endian Linux ABI (ppc64le) and ACPI-enumerated (ACPI is supported on OpenPOWER via an SBSA-like profile) can use ACPI CPPC for frequency control, the same as ARM ACPI servers.
- Embedded PPC32 (e500/e500mc): Frequency scaling is platform-specific; most embedded PPC32 SoCs use a simple PLL register write, described in the device tree.
Thermal:
- OPAL platforms: Temperature sensor data is read via opal_sensor_read using the sensor handles discovered from the ibm,opal-sensors node. UmkaOS registers these as SensorBackend::Opal entries in the thermal framework.
- LPAR (PowerVM): Thermal management is entirely hypervisor-controlled; the guest OS has no visibility into die temperature and cannot control throttling. UmkaOS does not register thermal zones in LPAR mode.
- Server platforms with IPMI: Both PPC32 and PPC64LE rack servers typically have a Baseboard Management Controller (BMC) accessible via IPMI. Temperature sensors reported by the BMC are accessed through the IPMI thermal zone backend (Section 7.2.3.6). This is the same DCMI/IPMI path as on x86-64 rack servers (Section 7.2.7).
Runtime device power management:
- LPAR: Device power gating is hypervisor-managed. UmkaOS does not control device power state directly in LPAR mode; the hypervisor handles it transparently.
- OPAL bare metal: OPAL exposes device power domains via opal_pci_set_power_state for PCIe devices and via the ibm,opal device tree node's power-management subnode for on-chip devices.
- Device tree power domains: Embedded PPC32/PPC64 platforms follow the standard device tree power-domains bindings, identical to ARM embedded.
7.4.4 Thermal Framework¶
7.4.4.1 Thermal Zones¶
A thermal zone is a region of the system that has one or more temperature sensors and a set of trip points. Physical examples:
- CPU die (one per socket; typically uses the TCONTROL MSR or PECI for temperature)
- GPU die (integrated or discrete)
- Battery (reported via ACPI _BTP or the Smart Battery System)
- Skin/chassis (NTC thermistor on laptop lid; used to prevent burns)
- NVMe drive (SMART temperature, reported via hwmon Section 13.13)
/// A thermal zone: a named region with a temperature sensor and trip points.
pub struct ThermalZone {
/// Human-readable name (e.g., `"cpu0-die"`, `"battery"`, `"skin"`).
/// Must be unique within the system. Used as the sysfs directory name.
pub name: &'static str,
/// The temperature sensor for this zone.
pub sensor: Arc<dyn TempSensor>,
/// Ordered list of trip points, sorted by `temp_mc` ascending.
///
/// The thermal monitor evaluates all trip points on each poll cycle and
/// fires actions for any whose threshold has been crossed.
///
/// **Boot-time only**: populated by the ACPI/DT thermal zone parser at boot
/// and never resized after the thermal subsystem initializes. `Vec` is used
/// for owned, contiguous storage — not for dynamic growth.
pub trip_points: Vec<TripPoint>,
/// Cooling devices bound to this zone with their maximum cooling state
/// and the trip point(s) that activate them.
///
/// **Boot-time only**: populated at boot alongside `trip_points` and never
/// modified at runtime. Typical zone has 1–4 bindings.
pub cooling_devices: Vec<CoolingBinding>,
/// Current polling interval in milliseconds.
///
/// Starts at 1000 ms (normal), drops to 100 ms when the zone temperature
/// is within 5 °C of any trip point, and drops to 10 ms when within 1 °C
/// of a `Hot` or `Critical` trip point.
pub polling_interval_ms: AtomicU32,
}
7.4.4.2 Trip Points¶
A trip point is a temperature threshold with an associated action type:
/// A temperature threshold that triggers a thermal action when crossed.
pub struct TripPoint {
/// Temperature at which this trip point fires, in millidegrees Celsius.
///
/// For example, 95000 = 95 °C.
pub temp_mc: i32,
/// The action to take when this trip point is crossed.
pub trip_type: TripType,
/// Hysteresis in millidegrees Celsius.
///
/// The trip point is considered cleared only when the temperature drops
/// below `temp_mc - hysteresis_mc`. This prevents oscillation around the
/// threshold. Typical value: 2000 (2 °C).
pub hysteresis_mc: i32,
}
/// The action taken when a thermal trip point threshold is crossed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TripType {
/// Reduce power consumption by notifying the cpufreq governor to lower
/// the maximum CPU frequency. Does not forcefully reduce frequency;
/// relies on the governor to converge. This is the primary mechanism
/// for sustained thermal management.
Passive,
/// Activate a cooling device (e.g., spin up a fan to a higher speed).
/// The bound `CoolingDevice` is set to its next higher state.
Active,
/// The temperature has reached a dangerous level. Post a `ThermalEvent`
/// to userspace monitoring daemons via the thermal netlink socket.
/// Userspace may respond by reducing workload. No kernel-side action.
Hot,
/// Emergency condition. The kernel immediately forces a system poweroff
/// (equivalent to `kernel_power_off()`). This happens synchronously in
/// the thermal interrupt handler or poll loop — userspace is not consulted.
/// Data integrity is not guaranteed; this is a last resort before hardware
/// thermal shutdown.
Critical,
}
The Critical trip point is typically set 5–10 °C below the hardware's own
PROCHOT# shutdown temperature to give the kernel a chance to shut down cleanly
(flushing journal, unmounting filesystems) before the hardware forcibly powers off.
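The hysteresis rule in TripPoint can be captured in a few lines. A sketch (the function is illustrative; the real monitor tracks per-trip state inside the thermal zone):

```rust
/// Evaluate whether a trip point is active, applying the hysteresis rule:
/// a crossed trip stays active until the temperature drops below
/// `trip_mc - hysteresis_mc`, preventing oscillation around the threshold.
fn trip_active(temp_mc: i32, trip_mc: i32, hysteresis_mc: i32, was_active: bool) -> bool {
    if was_active {
        // Already fired: clear only after cooling past the hysteresis band.
        temp_mc >= trip_mc - hysteresis_mc
    } else {
        // Not yet fired: activate at the threshold itself.
        temp_mc >= trip_mc
    }
}
```

With a 95 °C trip and 2 °C hysteresis, the trip fires at 95 000 m°C and does not clear until the zone cools below 93 000 m°C.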
7.4.4.3 Cooling Devices¶
A cooling device is something the kernel can actuate to reduce heat generation or increase heat dissipation:
/// A device that can reduce thermal load on a thermal zone.
///
/// Cooling states are represented as integers from 0 (no cooling) to
/// `max_state()` (maximum cooling). The mapping from state number to physical
/// action is device-specific.
///
/// # Examples
///
/// - `CpufreqCooler`: state 0 = max frequency, state N = minimum frequency.
/// - `FanCooler`: state 0 = fan off, state N = 100% PWM duty cycle.
pub trait CoolingDevice: Send + Sync {
/// Return the maximum cooling state this device supports.
///
/// The device can be set to any state in `[0, max_state()]`.
fn max_state(&self) -> u32;
/// Return the current cooling state.
fn current_state(&self) -> u32;
/// Set the cooling state to `state`.
///
/// Must be idempotent if `state == current_state()`.
/// Returns `ThermalError::OutOfRange` if `state > max_state()`.
fn set_state(&self, state: u32) -> Result<(), ThermalError>;
/// Human-readable name for this cooling device (e.g., `"cpufreq-cpu0"`,
/// `"fan-chassis0"`). Used as the sysfs `type` file content.
fn name(&self) -> &'static str;
}
/// Binding between a thermal zone and a cooling device.
pub struct CoolingBinding {
/// The cooling device to actuate.
pub device: Arc<dyn CoolingDevice>,
/// The trip point index (into `ThermalZone::trip_points`) that activates
/// this binding. The cooling device is stepped up one state each time the
/// thermal zone crosses this trip point.
pub trip_point_index: usize,
/// The cooling state to apply when the trip point is in the active
/// (crossed) state. When the zone cools below `temp_mc - hysteresis_mc`,
/// the device is stepped back down toward 0.
pub target_state: u32,
}
Standard cooling device types provided by UmkaOS:
| Type | Description | State mapping |
|---|---|---|
| CpufreqCooler | Limits max CPU frequency via cpufreq (Section 7.2) | 0 = cpu_max_freq, N = cpu_min_freq, linear interpolation |
| GpufreqCooler | Limits max GPU frequency via drm/gpu driver | Same as above |
| FanCooler | Sets fan PWM duty cycle via hwmon (Section 13.13) | 0 = fan off, max_state() = 100% PWM |
| UsbCurrentCooler | Reduces USB charging current to lower battery heat | 0 = max current, N = 0 mA |
| RaplCooler | Reduces RAPL PKG limit directly | 0 = TDP, N = minimum supported limit |
7.4.4.4 Cooling Map Discovery¶
The binding between thermal zones and cooling devices is discovered at boot from:
- ACPI: _TZD (thermal zone devices), _PSL (passive cooling list), _AL0–_AL9 (active cooling lists). The ACPI thermal driver evaluates these control methods and populates the cooling_devices list in each ThermalZone.
- Device tree: the cooling-maps node under the thermal zone node (binding documented in the Linux kernel's Documentation/devicetree/bindings/thermal/thermal-zones.yaml). UmkaOS parses this during DTB processing (Section 3.14).
- Static board description: For platforms without ACPI or DTB thermal tables, a board-specific Rust module in umka-kernel/src/arch/ can register zones and bindings at compile time.
7.4.4.5 Polling and Interrupt-Driven Monitoring¶
The thermal monitor uses two mechanisms:
Polling (always available): A kernel timer fires periodically to call
TempSensor::read_temp_mc() and evaluate all trip points. The polling interval
is adaptive:
| Temperature distance from nearest trip point | Polling interval |
|---|---|
| > 5 °C below any trip point | 1000 ms |
| 1–5 °C below a Passive or Active trip | 100 ms |
| < 1 °C below a Hot or Critical trip | 10 ms |
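The interval selection from the table can be sketched as a pure function (illustrative; the real monitor also distinguishes which trip type is nearest when both bands apply):

```rust
/// Adaptive polling interval. `dist_mc` is the distance in millidegrees
/// from the current temperature up to the nearest trip point;
/// `near_hot_or_critical` says whether that trip is a Hot or Critical one.
fn polling_interval_ms(dist_mc: i32, near_hot_or_critical: bool) -> u32 {
    if dist_mc < 1_000 && near_hot_or_critical {
        10 // within 1 °C of a Hot/Critical trip
    } else if dist_mc < 5_000 {
        100 // within 5 °C of a trip
    } else {
        1_000 // far from all trips: normal cadence
    }
}
```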
Interrupt-driven (when available): Some platforms provide hardware thermal interrupts that fire when a temperature threshold is crossed:
- Intel PROCHOT interrupt: the CPU asserts PROCHOT# when the die temperature reaches the factory-programmed limit. The kernel registers an interrupt handler on APIC vector 0xFA (the Linux convention for the thermal LVT). This fires before RAPL-based throttling takes effect.
- AMD SB-TSI alert: an SMBus alert from the SB-TSI temperature sensor on AMD platforms. Handled by the amd_sb_tsi I2C driver.
- ACPI _HOT/_CRT notify: the firmware sends an ACPI notify event when a thermal zone crosses its Hot or Critical temperature. The ACPI event handler evaluates the zone immediately rather than waiting for the next poll cycle.
Interrupt-driven monitoring reduces the latency from temperature threshold crossing to kernel response from ≤ 1000 ms (polling) to ≤ 100 µs (interrupt).
7.4.4.6 Temperature Sensor Abstraction¶
/// A hardware temperature sensor.
///
/// Implementations include: x86 PECI (Platform Environment Control Interface),
/// ACPI `_TMP` control method, I2C/SMBus sensors (LM75, TMP102, etc.),
/// and ARM SoC on-die sensors.
pub trait TempSensor: Send + Sync {
/// Read the current temperature in millidegrees Celsius.
///
/// Returns `ThermalError::SensorFault` if the hardware sensor reports
/// an error condition (e.g., I2C NACK, PECI timeout).
fn read_temp_mc(&self) -> Result<i32, ThermalError>;
/// Human-readable name for this sensor (e.g., `"peci-cpu0"`, `"acpi-tz0"`).
fn name(&self) -> &'static str;
}
/// Errors returned by thermal framework operations.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ThermalError {
/// The sensor or cooling device is not available or not initialised.
NotAvailable,
/// The sensor returned an error condition (hardware fault or communication error).
SensorFault,
/// The requested cooling state is outside `[0, max_state()]`.
OutOfRange,
/// The cooling device is currently locked by another subsystem.
DeviceBusy,
}
7.4.4.7 Linux sysfs Compatibility¶
UmkaOS exposes the thermal framework under the same sysfs paths as the Linux kernel thermal framework, enabling unmodified Linux monitoring tools:
/sys/class/thermal/
thermal_zone0/
type # zone name (e.g., "x86_pkg_temp")
temp # current temperature in millidegrees (e.g., "52000")
mode # "enabled" or "disabled"
trip_point_0_temp # first trip point temperature
trip_point_0_type # "passive", "active", "hot", or "critical"
trip_point_0_hyst # hysteresis in millidegrees
policy # cooling policy: "step_wise" or "user_space"
cooling_device0/
type # cooling device name (e.g., "Processor")
max_state # maximum cooling state
cur_state # current cooling state
The type file content for CPU cooling devices uses the string "Processor"
for compatibility with lm_sensors, thermald, and similar tools that match
on this string.
7.4.5 Powercap Interface (sysfs)¶
The powercap sysfs hierarchy provides a unified interface for reading energy
counters and setting power limits. UmkaOS's layout is byte-for-byte compatible with
Linux's intel_rapl_msr driver output, ensuring that existing power monitoring
and management tools work without modification.
7.4.5.1 Directory Structure¶
/sys/devices/virtual/powercap/
intel-rapl/ # Control type: Intel RAPL
intel-rapl:0/ # Socket 0 PKG domain
name # "package-0"
energy_uj # Cumulative energy (µJ, read-only, wraps)
max_energy_range_uj # Counter wrap value in µJ
constraint_0_name # "long_term"
constraint_0_power_limit_uw # Long-window limit in µW (read-write)
constraint_0_time_window_us # Long-window duration in µs (read-write)
constraint_0_max_power_uw # Maximum settable limit (TDP) in µW
constraint_1_name # "short_term"
constraint_1_power_limit_uw # Short-window limit in µW (read-write)
constraint_1_time_window_us # Short-window duration in µs (read-write)
constraint_1_max_power_uw # Maximum settable short-term limit
enabled # "1" to enable limits, "0" to disable
intel-rapl:0:0/ # Socket 0 Core (PP0) sub-domain
name # "core"
energy_uj
max_energy_range_uj
constraint_0_name # "long_term"
constraint_0_power_limit_uw
constraint_0_time_window_us
constraint_0_max_power_uw
enabled
intel-rapl:0:1/ # Socket 0 Uncore (PP1) sub-domain (client only)
name # "uncore"
...
intel-rapl:1/ # Socket 1 PKG domain (dual-socket servers)
...
The DRAM domain appears as a separate top-level entry on server platforms.
On AMD Zen2+ systems, the same layout is used with the control type still named
intel-rapl for compatibility (Linux uses the same driver name). AMD-specific
extensions (if any) appear in an amd-rapl control type directory.
7.4.5.2 Tool Compatibility¶
The following tools work against UmkaOS's powercap hierarchy without modification:
| Tool | Use |
|---|---|
| powerstat | Per-socket power consumption over time |
| turbostat | CPU frequency, power, and temperature combined |
| s-tui | Terminal UI showing frequency and power |
| powertop | Process-level power attribution (uses /proc, not powercap, but reads energy_uj) |
| Prometheus node_exporter | --collector.powersupplyclass and powercap collector |
| rapl-read | Low-level RAPL register dump |
7.4.5.3 Write Semantics¶
Writing constraint_N_power_limit_uw calls RaplInterface::set_power_limit() on the
corresponding RaplDomain. Writes from unprivileged userspace are rejected with
EPERM. Root (or a process with CAP_SYS_ADMIN) may write any domain.
Writing a limit that exceeds the domain's constraint_N_max_power_uw returns EINVAL.
Writing 0 is equivalent to calling RaplInterface::clear_power_limit() (removes the
software limit, restoring hardware default).
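These write semantics amount to a small validation ladder. A sketch (the enum and function are illustrative, standing in for the sysfs store handler):

```rust
/// Outcome of a write to constraint_N_power_limit_uw (sketch).
#[derive(Debug, PartialEq)]
enum WriteResult {
    Eperm,          // unprivileged writer
    Einval,         // limit above constraint_N_max_power_uw
    ClearLimit,     // 0 written: restore hardware default
    SetLimit(u64),  // accepted limit in µW
}

/// Validate a powercap constraint write according to the rules above.
fn validate_limit_write(has_cap_sys_admin: bool, value_uw: u64, max_power_uw: u64) -> WriteResult {
    if !has_cap_sys_admin {
        WriteResult::Eperm
    } else if value_uw == 0 {
        WriteResult::ClearLimit
    } else if value_uw > max_power_uw {
        WriteResult::Einval
    } else {
        WriteResult::SetLimit(value_uw)
    }
}
```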
7.4.6 Cgroup Power Accounting¶
7.4.6.1 Design¶
Energy consumption is attributed to cgroups using a sampling-based model that
parallels CPU time accounting. A dedicated kernel thread (the power accounting
thread) wakes every 10 ms (configurable via /proc/sys/kernel/power_sample_interval_ms,
range 1–1000 ms) and:
- Reads energy_uj from all active RAPL domains (all sockets, all sub-domains).
- Computes the delta from the previous sample, handling counter wrap-around.
- Queries the scheduler to get, for each cgroup, the CPU time consumed in the last 10 ms interval.
- Distributes the energy delta across cgroups proportional to their CPU time share.
- Accumulates the attributed energy into each cgroup's power.energy_uj counter.
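The wrap-around handling mentioned above follows from the fact that energy_uj wraps at max_energy_range_uj. A minimal sketch:

```rust
/// Wrap-aware delta between two successive energy_uj readings.
/// A current reading smaller than the previous one means the counter
/// wrapped exactly once within the (short) sampling interval.
fn energy_delta_uj(prev_uj: u64, cur_uj: u64, max_range_uj: u64) -> u64 {
    if cur_uj >= prev_uj {
        cur_uj - prev_uj
    } else {
        (max_range_uj - prev_uj) + cur_uj
    }
}
```

This assumes at most one wrap per interval, which holds because the counter's wrap period (tens of seconds at typical package power) is far longer than the 10 ms sampling interval.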
Clarification on per-cgroup overhead: The 10 ms interval is the RAPL sampling and reporting period — a single kernel thread reads hardware energy counters and distributes the delta. This is NOT per-cgroup polling. The power accounting thread performs one RAPL read per domain per interval (typically 4-8 RAPL domains total), then a single O(n) pass over active cgroups to distribute the delta. The per-cgroup CPU time data is already maintained by the scheduler's existing accounting (updated on context switch and dequeue events, not by polling). Therefore, even with 4096 active cgroups, the accounting thread's cost is: ~4-8 RAPL reads (~200 ns each) + one linear scan of cgroup time deltas (~4096 × ~20 ns = ~80 μs) = under 100 μs per 10 ms interval, or <0.001% CPU. The naive concern that 4096 cgroups × 10 ms polling would cost ~4% CPU assumes each cgroup requires independent hardware polling; in reality, the hardware counters are per-socket (not per-cgroup) and the per-cgroup attribution is a lightweight arithmetic distribution.
This is the same weighted attribution model used by Linux's cpuacct cgroup
controller and, more recently, by Intel's Energy Aware Scheduling patches.
7.4.6.2 Attribution Model¶
Let E_delta be the total PKG energy delta in the current interval (µJ),
and let T_i be the CPU time consumed by cgroup i in the interval (µs).
The energy attributed to cgroup i is:
E_i = E_delta × T_i / Σ T_j
where the sum is over all cgroups with T_j > 0. Idle time (no cgroup running)
is attributed to a synthetic idle cgroup and not charged to any user cgroup.
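The proportional distribution can be sketched directly from the definitions of E_delta and T_i (illustrative function; integer division leaves a small unattributed remainder, which the real accounting carries over between intervals):

```rust
/// Distribute a PKG energy delta across cgroups proportional to their
/// CPU time in the interval: E_i = E_delta × T_i / Σ T_j, all in integers.
fn attribute_energy_uj(e_delta_uj: u64, times_us: &[u64]) -> Vec<u64> {
    let total: u64 = times_us.iter().sum();
    times_us
        .iter()
        .map(|&t| if total == 0 { 0 } else { e_delta_uj * t / total })
        .collect()
}
```

For a 1000 µJ delta with cgroups that ran 3000 µs and 1000 µs, the split is 750 µJ and 250 µJ.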
Limitation: This model has two known imprecisions:
- It does not account for memory bandwidth differences between cgroups sharing a socket. A cgroup running a memory-bandwidth-intensive workload consumes more power per CPU cycle than one running a compute-bound workload, but they receive the same energy charge per CPU time unit. This is acceptable for accounting and billing; it is not suitable for precise per-process energy metering.
- On a multi-socket server, PKG energy from socket 0 may be attributed to a cgroup whose threads ran on socket 1 if the sampling window captures a migration. The error is bounded by one sampling interval (10 ms default).
7.4.6.3 Cgroup Interface¶
The power cgroup controller provides the following files:
| File | Mode | Description |
|---|---|---|
| power.energy_uj | R | Cumulative energy attributed to this cgroup in µJ. Wraps at u64::MAX. |
| power.stat | R | Per-domain energy breakdown: pkg_energy_uj, core_energy_uj, dram_energy_uj. |
| power.limit_uw | RW | Power limit for this cgroup in µW. 0 = no limit. Setting a non-zero value enables power cap enforcement (Section 7.4). |
| power.limit_window_ms | RW | Averaging window for power.limit_uw enforcement, in ms. Default: 100. |
These files are created under the cgroup hierarchy directory, e.g.:
/sys/fs/cgroup/<cgroup-path>/power.energy_uj
7.4.6.4 Per-Cgroup Power Limit Enforcement¶
When power.limit_uw is non-zero, the power accounting thread checks, at each
sample interval, whether the cgroup's rolling-average power consumption (calculated
from power.energy_uj deltas over power.limit_window_ms) exceeds the limit.
The power accounting thread runs at SCHED_NORMAL priority (nice 0) and is
pinned to NUMA node 0's first online CPU. It does not require real-time priority
because power accounting tolerates jitter: the default sample interval is 10 ms,
and a sample delayed by one scheduler tick (1–4 ms) has negligible impact on
energy attribution accuracy.
If the limit is exceeded:
- The cgroup's effective RAPL PKG short-window limit is reduced proportionally to bring the cgroup's power consumption within budget. This is implemented by adjusting the cpu.max bandwidth (Section 7.6) for the cgroup's tasks — reducing their CPU time allocation reduces their power consumption.
- A PowerLimitEvent is posted to the cgroup's event fd (readable via cgroup.events), allowing userspace monitoring daemons to observe throttling.
Limitation: RAPL enforcement at sub-PKG granularity (per-cgroup, per-core) is
not directly supported by hardware. Per-cgroup limits are enforced indirectly via
CPU time throttling (Section 7.6). True per-cgroup hardware power isolation would require
per-core RAPL (available on some Intel Xeon generations as MSR_PP0_POWER_LIMIT)
combined with strict core pinning — a configuration that umka-kvm uses for VM
power budgets (Section 7.4), but which is not the general case.
7.4.7 VM Power Budget Enforcement¶
7.4.7.1 Motivation¶
Traditional VM resource accounting models (CPU cores, RAM) do not capture actual power consumption. A VM running a STREAM memory-bandwidth benchmark or a dense linear algebra kernel (e.g., BLAS DGEMM with AVX-512) can consume 2–3× the power of a VM running a web server at equivalent CPU utilisation. In a datacenter where the binding constraint is rack PDU amperage, not CPU cores, CPU-count quotas systematically mis-model the actual cost of workloads.
Watt-based quotas reflect actual rack power budget more honestly:
- A 500W rack PDU can host either ten 50W VMs or five 100W VMs regardless of how many vCPUs each is assigned.
- A burst-capable VM (bursty ML inference job) can be allocated 80W sustained with a 150W burst cap for 10 ms — mirroring the RAPL two-tier limit model.
- Overcommit is detectable and rejectable at admission time by comparing
sum(vm_power_limit_mw)against measured or rated socket TDP.
7.4.7.2 Mechanism¶
When umka-kvm creates a VM with a vm_power_limit_mw budget:
- Dedicated cgroup: A cgroup is created at /sys/fs/cgroup/umka-vms/<vm-id>/ for the VM's vCPU threads. power.limit_uw is set to vm_power_limit_mw * 1000.
- Core pinning: The VM's vCPU threads are pinned to a CPU set on a single socket (or across sockets if vm_numa_topology specifies multi-socket). This ensures energy counter attribution is accurate (Section 7.4, limitation 2).
- Socket RAPL coordination: The PKG short-window limit for each socket is set to sum(vm_power_limit_mw for all VMs pinned to that socket) + headroom_mw, where headroom_mw is a configurable per-socket constant (default: 10% of TDP) reserved for host kernel overhead.
- Monitoring: The VmPowerBudget::update() method is called by umka-kvm's power accounting thread every 100 ms. If the VM exceeds its budget, vCPU scheduling quota is reduced.
/// Power budget tracking for a single VM.
pub struct VmPowerBudget {
/// Allocated sustained power budget for this VM in milliwatts.
pub limit_mw: u32,
/// Allocated burst power limit in milliwatts, enforced for windows ≤ 10 ms.
///
/// Maps to the RAPL short-window limit. Set to `limit_mw` if no burst
/// allowance is configured (conservative mode).
pub burst_limit_mw: u32,
/// Measured average power consumption over the last 1-second sliding window,
/// in milliwatts. Updated by `update()` every 100 ms.
pub measured_mw: AtomicU32,
/// Number of times vCPU quota was reduced due to power budget violation.
///
/// Monotonically increasing. Used for throttle-rate monitoring and alerting.
pub throttle_count: AtomicU64,
/// Handle to the cgroup backing this VM's power accounting and enforcement.
cgroup: CgroupHandle,
}
impl VmPowerBudget {
/// Called every 100 ms by umka-kvm's power accounting thread.
///
/// Reads the energy delta from the VM's cgroup, updates `measured_mw`,
/// and reduces vCPU scheduling quota if the budget is exceeded.
///
    /// `sched` is the CPU bandwidth controller for this VM's vCPU threads (Section 7.6).
pub fn update(&self, sched: &CpuBandwidth) {
let delta_uj = self.cgroup.read_energy_delta_uj();
// 100 ms window: delta_uj / 100 ms = µJ/ms = mW.
let measured = (delta_uj / 100) as u32;
self.measured_mw.store(measured, Ordering::Relaxed);
if measured > self.limit_mw {
// Reduce vCPU CBS quota proportional to overage fraction.
// Uses fixed-point arithmetic (percentage × 10) to avoid FPU in kernel.
// Example: measured = 120 mW, limit = 100 mW →
// overage = 20, reduction_permille = 20 * 1000 / 120 = 166 (16.6%).
let overage = measured - self.limit_mw;
let reduction_permille = overage * 1000 / measured;
sched.reduce_quota_permille(reduction_permille);
self.throttle_count.fetch_add(1, Ordering::Relaxed);
}
}
}
7.4.7.3 Admission Control¶
Before umka-kvm creates a new VM with a vm_power_limit_mw budget, it checks
the admission constraint:
sum(vm_power_limit_mw over existing VMs on the socket) + new_vm_power_limit_mw + host_headroom_mw ≤ socket_tdp_mw
where socket_tdp_mw is read from RaplInterface::read_tdp_mw(PowerDomainType::Pkg)
at boot, and host_headroom_mw defaults to 10% of TDP (configurable via
/sys/module/umka_kvm/parameters/power_headroom_pct).
If the constraint is violated, umka-kvm returns ENOSPC to the VM creation ioctl.
The caller (e.g., an orchestrator) must either reduce the requested budget, migrate
an existing VM to another socket/host, or reject the workload.
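The admission check reduces to a single comparison. A sketch (illustrative function; the real gate operates per socket inside umka-kvm and returns ENOSPC on failure):

```rust
/// Admission check: a new VM budget is admitted only if the sum of all
/// budgets on the socket plus host headroom still fits under socket TDP.
fn admit_vm(existing_mw: &[u32], new_mw: u32, socket_tdp_mw: u32, headroom_pct: u32) -> bool {
    let headroom_mw = socket_tdp_mw as u64 * headroom_pct as u64 / 100;
    let total: u64 =
        existing_mw.iter().map(|&m| m as u64).sum::<u64>() + new_mw as u64;
    total + headroom_mw <= socket_tdp_mw as u64
}
```

On a 200 W socket with the default 10% headroom (20 W), two existing VMs at 50 W and 60 W leave room for a 50 W VM but not an 80 W one.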
This is an admission control gate, not a guarantee. Actual power consumption may exceed TDP temporarily due to:
- Turbo Boost / AMD Precision Boost (transient power above TDP for ≤ 10 ms)
- RAPL enforcement latency (the hardware enforces limits over the configured window, typically 1-10 ms; instantaneous power can spike). On ARM platforms using SCMI (System Control and Management Interface), enforcement latency is firmware-dependent and typically higher (~10-100 ms) because power limits are communicated via SCMI mailbox messages to the SCP (System Control Processor), which enforces them asynchronously.
These transient overages are expected and handled by the hardware's own thermal and power-delivery circuitry. UmkaOS's admission control operates at the sustained (long-window) level.
7.4.7.4 Observability¶
Per-VM power accounting is exposed via:
/sys/fs/cgroup/umka-vms/<vm-id>/power.energy_uj # Cumulative energy (µJ)
/sys/fs/cgroup/umka-vms/<vm-id>/power.stat # Per-domain breakdown
/sys/fs/cgroup/umka-vms/<vm-id>/power.limit_uw # Current limit (µW)
umka-kvm also exposes power metrics via the KVM statistics interface
(/sys/bus/event_source/devices/kvm/), enabling Prometheus node_exporter
KVM collector to report per-VM power consumption.
7.4.8 DCMI / IPMI Rack Power Management¶
7.4.8.1 Overview¶
In server deployments managed by a Baseboard Management Controller (BMC), the BMC may impose a platform-level power cap via the Data Center Manageability Interface (DCMI), an extension of IPMI v2.0 (specification: DCMI v1.5, published by Intel/DMTF).
DCMI provides the following power management commands over the IPMI channel:
| DCMI Command | NetFn/Cmd | Description |
|---|---|---|
| Get Power Reading | 2C/02h | Current platform power in watts (instantaneous, min, max, average over a rolling window) |
| Get Power Limit | 2C/03h | Read the currently configured platform power cap |
| Set Power Limit | 2C/04h | Set a platform power cap and exception action (hard power-off or OEM-defined) |
| Activate/Deactivate Power Limit | 2C/05h | Enable or disable the configured power cap |
| Get DCMI Capabilities | 2C/01h | Enumerate which DCMI features the BMC supports |
These commands are sent by the datacenter management infrastructure (e.g., OpenBMC, Redfish, Dell iDRAC, HP iLO) to impose a rack-level power budget on individual servers.
7.4.8.2 UmkaOS Integration¶
The UmkaOS IPMI driver (Tier 1; KCS (Keyboard Controller Style), SMIC, or BT system
interfaces over LPC or I2C, as described in Section 13.13) handles incoming DCMI commands from the BMC.
When the BMC asserts a power cap via Set Power Limit + Activate Power Limit,
the kernel responds as follows:
BMC sets cap C_bmc (watts)
│
▼
umka-ipmi driver receives DCMI Set/Activate Power Limit
│
├─► Reduce aggregate RAPL PKG limits across all sockets (TDP-proportional)
│ for each socket i:
│ new_pkg_limit[i] = C_bmc × (socket_tdp[i] / Σ socket_tdp)
│ TDP values are read from MSR_PKG_POWER_INFO (x86) or ACPI PPTT at boot.
│ On heterogeneous systems (mixed socket SKUs), this gives higher-TDP
│ sockets a proportionally larger share of the cap. On homogeneous systems,
│ this reduces to C_bmc / num_sockets.
│ → RaplInterface::set_power_limit(Pkg, new_pkg_limit[i], long_window_ms)
│
├─► Notify umka-kvm to reduce VM watt budgets proportionally
│ reduction_factor = C_bmc / current_total_vm_budget
│ → for each VM: VmPowerBudget::limit_mw *= reduction_factor
│ → Re-run admission control check (may trigger VM migration signal)
│
└─► Post PowerCapEvent to userspace monitoring channel
→ sysfs: /sys/bus/platform/drivers/dcmi/power_cap_uw (updated)
→ netlink thermal event (for `thermald` compatibility)
→ KVM statistics update (for Prometheus collector)
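The TDP-proportional split in the diagram can be sketched directly. Integer milliwatt arithmetic; the function name is illustrative:

```rust
/// new_pkg_limit[i] = cap × (socket_tdp[i] / Σ socket_tdp), in milliwatts.
/// On homogeneous sockets this reduces to cap / num_sockets.
fn split_cap_mw(cap_mw: u64, socket_tdp_mw: &[u64]) -> Vec<u64> {
    let total_tdp: u64 = socket_tdp_mw.iter().sum();
    socket_tdp_mw
        .iter()
        .map(|&tdp| cap_mw * tdp / total_tdp) // higher-TDP sockets get a larger share
        .collect()
}
```

Each per-socket result is then passed to `RaplInterface::set_power_limit(Pkg, …, long_window_ms)` as shown above.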
7.4.8.3 Escalation Hierarchy¶
Power management operates at three levels, each enforced by a different actor:
| Level | Mechanism | Enforced by | Override possible? |
|---|---|---|---|
| Software power limit | RAPL `MSR_PKG_POWER_LIMIT` | UmkaOS kernel | Yes (root can raise within TDP) |
| BMC power cap | DCMI Set Power Limit | BMC firmware | Only by BMC admin |
| Physical current limit | PSU OCP / PDU circuit breaker | Hardware | No |
The kernel controls only the first level. The BMC cap (second level) is
communicated to the kernel via DCMI but ultimately enforced by the BMC's power
management controller, which can throttle the server via SYS_THROT# or force
a hard power-off regardless of OS state. The kernel's DCMI integration is
cooperative, not authoritative.
7.4.8.4 DcmiPowerCap Interface¶
/// Interface for the DCMI power cap enforcement callback.
///
/// Implemented by the IPMI driver. Called when the BMC asserts or modifies
/// a DCMI power limit.
pub trait DcmiPowerCap: Send + Sync {
/// Called when the BMC sets a new platform power cap.
///
/// `cap_mw` is the new cap in milliwatts. `0` indicates the cap has been
/// deactivated (no limit). Implementors must update RAPL limits and notify
/// umka-kvm within this call or schedule it for immediate async processing.
fn on_cap_set(&self, cap_mw: u32);
/// Return the currently active BMC-imposed cap in milliwatts.
///
/// Returns `None` if no cap is currently active.
fn current_cap_mw(&self) -> Option<u32>;
/// Return the last measured platform power reading from the BMC in milliwatts.
///
/// This is the BMC's own measurement, which may differ from RAPL's
/// (BMC measures at the PSU, RAPL measures at the socket).
fn last_reading_mw(&self) -> u32;
}
7.4.9 Battery and SMBus Monitoring¶
SMBus (System Management Bus) is a subset of I2C used for battery/charger chips. Example: Smart Battery System (SBS) batteries expose registers at I2C address 0x0B:
- 0x08: Temperature (in 0.1K units).
- 0x09: Voltage (in mV).
- 0x0A: Current (in mA, signed).
- 0x0D: Relative State of Charge (0-100%).
- 0x0F: Remaining Capacity (in mAh).
The battery driver (Tier 1, probed via ACPI PNP0C0A device) reads these registers periodically (every 5 seconds when on battery, every 60 seconds when on AC) and exposes them via sysfs/umkafs (see Section 7.4 for the userspace interface).
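Two of the registers above need explicit decoding: Temperature is an unsigned word in 0.1 K units, and Current is a two's-complement signed word. A sketch with illustrative helper names (the periodic read loop lives in the battery driver):

```rust
/// SBS register offsets from the list above (battery at I2C address 0x0B).
const SBS_ADDR: u8 = 0x0B;
const REG_TEMPERATURE: u8 = 0x08; // unsigned, 0.1 K units
const REG_CURRENT: u8 = 0x0A; // signed, mA

/// Convert an SBS Temperature word (0.1 K units) to milli-degrees Celsius.
fn sbs_temp_to_millicelsius(raw: u16) -> i32 {
    raw as i32 * 100 - 273_150
}

/// SBS Current is a two's-complement word: negative means discharging.
fn sbs_current_ma(raw: u16) -> i16 {
    raw as i16
}
```

For example, a raw Temperature reading of 2982 (298.2 K) decodes to 25.05 °C, and a raw Current of 0xFF38 decodes to −200 mA (discharging).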
7.4.10 Consumer Power Profiles¶
This subsection defines the user-facing policy layer; Section 7.4–Section 7.4 define the underlying mechanisms.
7.4.10.1 Power Profile Enumeration¶
// umka-core/src/power/profile.rs
/// User-facing power profile (consumer policy; translates to Section 7.2 mechanisms).
#[repr(u32)]
pub enum PowerProfile {
/// Maximum performance. AC adapter expected.
Performance = 0,
/// Balanced performance and power. Default on AC.
Balanced = 1,
/// Aggressive power saving. Default on battery.
BatterySaver = 2,
/// User-defined constraints loaded from /ukfs/power/custom_profile.
Custom = 3,
}
7.4.10.2 Profile → Mechanism Translation¶
Each profile maps to concrete Section 7.4 parameters:
| Profile | RAPL PKG limit | CPU turbo | GPU freq cap | WiFi PSM | Display brightness |
|---|---|---|---|---|---|
| Performance | None (HW TDP) | Enabled | 100% | Disabled | 100% |
| Balanced | 80% TDP | Enabled | 80% | PSM | 75% |
| BatterySaver | 50% TDP | Disabled | 40% | Aggressive | 40% |
set_profile() calls RaplInterface::set_power_limit() (Section 7.4), the cpufreq
governor, and WirelessDriver::set_power_save() (Section 13.2). No RAPL MSR writes
happen in consumer-layer code — all hardware access goes through Section 7.4.
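The table rows translate mechanically into RAPL parameters. A sketch of the PKG-limit column (the local enum mirrors Section 7.4.10.1; `Custom` is omitted because its limits come from a user-supplied profile):

```rust
/// Local mirror of the PowerProfile enum (Section 7.4.10.1), Custom omitted.
#[derive(Clone, Copy)]
enum PowerProfile {
    Performance,
    Balanced,
    BatterySaver,
}

/// PKG RAPL limit in milliwatts for a profile, per the table above.
/// `None` means no software cap (the hardware TDP applies).
fn pkg_limit_mw(profile: PowerProfile, tdp_mw: u64) -> Option<u64> {
    match profile {
        PowerProfile::Performance => None, // HW TDP
        PowerProfile::Balanced => Some(tdp_mw * 80 / 100), // 80% TDP
        PowerProfile::BatterySaver => Some(tdp_mw * 50 / 100), // 50% TDP
    }
}
```

`set_profile()` would feed a `Some` result into `RaplInterface::set_power_limit()`; the GPU, WiFi, and display columns follow the same pattern against their respective interfaces.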
7.4.10.3 AC/Battery Auto-Switch¶
The PowerManager listens for ACPI AC adapter events (from the battery driver,
Section 7.4) and automatically applies the user's preferred profile for each power
source. On critical battery (≤5%), BatterySaver is forced. A BatteryCritical
event is posted to the Section 7.9 event ring so userspace can display a notification.
7.4.10.4 Per-Process Power Attribution¶
Per-process energy attribution (for desktop power managers like GNOME Settings,
KDE Powerdevil) is provided by the Section 7.4 cgroup power accounting. Per-process
granularity: each process lives in a cgroup; power.energy_uj on that cgroup
gives its energy consumption. Exposed via /proc/<pid>/power_consumed_uj in
umka-sysapi procfs.
7.4.10.5 Userspace Interface¶
Kernel exposes power management state via the following paths:
| Path | Description |
|---|---|
| `/sys/kernel/umka/power/profile` | Read/write power profile selection (performance, balanced, battery-saver). Writable by processes with CAP_SYS_ADMIN. |
| `/proc/<pid>/power_consumed_uj` | Per-process energy consumption in microjoules (RAPL cgroup attribution, Section 7.11). |
| `/sys/class/power_supply/BAT0/capacity` | Battery charge percentage (0–100). |
| `/sys/class/power_supply/BAT0/energy_now` | Remaining energy in µWh. |
| `/sys/class/power_supply/BAT0/current_now` | Discharge/charge current in µA. |
| `/sys/class/power_supply/BAT0/cycle_count` | Charge cycle count. |
| `/sys/class/power_supply/BAT0/status` | Charging, Discharging, Full, Unknown. |
Time-remaining estimation, low-battery notifications, and battery health display are handled by userspace daemons (UPower or equivalent) reading these paths.
Continued in Section 7.5: Suspend/resume protocol (S3, S0ix), driver suspend/resume callbacks, device suspend ordering, resume failure recovery, per-device runtime power management (two-counter state machine), and cpuidle governor.
7.5 Suspend, Resume, and Runtime Power Management¶
7.5.1 Suspend and Resume Protocol¶
Context: The S4Hibernate variant in SleepState (below) covers S4 hibernate — see Section 18.4 for the full hibernate specification. This section specifies S3 (Suspend-to-RAM) and S0ix (Modern Standby), which are the primary suspend mechanisms on consumer laptops.
7.5.1.1 Sleep State Enumeration¶
// umka-core/src/power/suspend.rs
/// ACPI sleep state.
#[repr(u32)]
pub enum SleepState {
/// S3: Suspend-to-RAM. CPU powered off, DRAM refreshing, ~2-5W, wake in <2s.
S3SuspendToRam = 3,
/// S0ix: Modern Standby (S0 Low Power Idle). CPU in deep C-states, OS "running",
/// network alive, <1W, instant wake. Intel 6th gen+, AMD Ryzen 3000+, ARM.
S0ixModernStandby = 0x0F, // Not a standard ACPI state, vendor-specific
/// S4: Hibernate. See [Section 18.4](18-virtualization.md#suspend-and-resume) for the
/// full specification (snapshot creation, dm-crypt signed snapshot, resume protocol).
S4Hibernate = 4,
}
/// Power state machine states.
#[repr(u32)]
pub enum SuspendPhase {
/// System running normally.
Running = 0,
/// Pre-suspend: freeze userspace, sync filesystems.
PreSuspend = 1,
/// Device suspend: call driver suspend callbacks.
DeviceSuspend = 2,
/// CPU suspend: save CPU state, enter ACPI sleep state.
CpuSuspend = 3,
/// (system asleep, this state is never observed by running code)
Asleep = 4,
/// CPU resume: restore CPU state.
CpuResume = 5,
/// Device resume: call driver resume callbacks.
DeviceResume = 6,
/// Post-resume: thaw userspace.
PostResume = 7,
/// Unwind in progress due to a failure during suspend.
/// The system transitions through this state while reversing
/// already-completed suspend steps. Returns to `Running` after
/// cleanup completes.
SuspendAbort = 8,
}
7.5.1.2 Power State Machine¶
impl SuspendManager {
/// Initiate suspend to a given sleep state.
pub fn suspend(&self, state: SleepState) -> Result<(), SuspendError> {
// Phase 1: PreSuspend
self.set_phase(SuspendPhase::PreSuspend);
self.freeze_userspace()?; // Stop all userspace tasks
self.sync_filesystems()?; // Flush all dirty pages, journal commits
// Phase 2: DeviceSuspend
self.set_phase(SuspendPhase::DeviceSuspend);
// Before calling SuspendResume::suspend(), the suspend path calls
// rtpm_get_sync(dev) for any device in D2/D3hot state (i.e., runtime PM
// Suspended). This ensures the device is in D0 (active) before
// receiving the suspend callback — drivers must not receive a
// system suspend while in a runtime-suspended power state, because
// their suspend callback assumes an active device context.
//
// Devices in RtpmState::Switching (mid-tier-change) are handled by
// wait_for_tier_switches() — see below.
self.reconcile_runtime_pm()?;
// Wait for any in-progress tier switches to complete before
// suspending devices. A device in RtpmState::Switching is between
// driver teardown and replacement init — it cannot receive a suspend
// callback in this state.
self.wait_for_tier_switches()?;
if let Err(e) = self.suspend_devices(state) {
// Unwind: resume already-suspended devices in reverse order,
// then thaw userspace. Without this, userspace remains frozen
// and the system is deadlocked.
self.set_phase(SuspendPhase::SuspendAbort);
self.resume_devices_partial(state); // resume [0..N-1] in reverse
self.thaw_userspace().ok(); // best-effort thaw
self.set_phase(SuspendPhase::Running);
return Err(e);
}
// Phase 3: CpuSuspend
self.set_phase(SuspendPhase::CpuSuspend);
if let Err(e) = self.save_cpu_state() {
self.set_phase(SuspendPhase::SuspendAbort);
self.resume_devices(state).ok();
self.thaw_userspace().ok();
self.set_phase(SuspendPhase::Running);
return Err(e);
}
self.enter_acpi_sleep_state(state)?; // Write to ACPI PM1a_CNT, CPU halts
// ... (system asleep, wake event occurs) ...
// Phase 4: CpuResume (code resumes here after wake)
self.set_phase(SuspendPhase::CpuResume);
self.restore_cpu_state()?; // Restore registers, reload CR3, GDTR, IDTR
// Phase 5: DeviceResume
self.set_phase(SuspendPhase::DeviceResume);
self.resume_devices(state)?; // Call driver resume callbacks in probe order
// Phase 6: PostResume
self.set_phase(SuspendPhase::PostResume);
self.thaw_userspace()?; // Unfreeze userspace tasks
self.set_phase(SuspendPhase::Running);
Ok(())
}
/// Wait for all devices in `RtpmState::Switching` to complete their
/// tier switch before proceeding with system suspend. A tier switch
/// involves tearing down the old driver domain and loading the
/// replacement driver — the device cannot receive suspend callbacks
/// during this window.
///
/// # Algorithm
///
/// 1. Enumerate all devices in the device registry.
/// 2. For each device whose `rtpm.state == RtpmState::Switching`:
/// a. Poll `rtpm.state` with 1 ms intervals up to a 100 ms timeout.
/// b. If the state transitions to `Active` (or any non-`Switching`
/// state) within the timeout: continue — the tier switch completed
/// and the device is ready for the normal suspend path.
/// c. If 100 ms expires with the device still in `Switching`:
/// abort the tier switch by signaling the replacement driver's
/// `init()` to cancel, revert the device to its previous tier's
/// driver (which is still loaded until the switch commits), and
/// set `rtpm.state` back to `Active`. Log via klog!(Warning).
/// The device proceeds through suspend with its original driver.
///
/// # Rationale
///
/// 100 ms is chosen because tier switches typically complete in 50–150 ms
/// (Tier 1 reload latency). Waiting indefinitely would block system
/// suspend on a potentially stuck driver. Reverting preserves system
/// suspend reliability: the original driver can still suspend the device.
fn wait_for_tier_switches(&self) -> Result<(), SuspendError> {
const TIER_SWITCH_TIMEOUT_MS: u64 = 100;
const POLL_INTERVAL_MS: u64 = 1;
for dev in self.device_registry.iter() {
if dev.rtpm.state.load(Ordering::Acquire) == RtpmState::Switching as u32 {
let deadline = monotonic_ms() + TIER_SWITCH_TIMEOUT_MS;
loop {
if dev.rtpm.state.load(Ordering::Acquire) != RtpmState::Switching as u32 {
break; // Tier switch completed.
}
if monotonic_ms() >= deadline {
klog!(Warning,
"suspend: device {} still in Switching after {}ms, aborting tier switch",
dev.name(), TIER_SWITCH_TIMEOUT_MS);
self.abort_tier_switch(&dev)?;
break;
}
sleep_ms(POLL_INTERVAL_MS);
}
}
}
Ok(())
}
/// Resume all runtime-suspended devices to D0 before system suspend.
///
/// Drivers' suspend callbacks assume the device is in D0 (active).
/// A device in D2/D3hot (runtime-suspended) cannot receive a system
/// suspend callback because the driver's saved context corresponds to
/// the active state, not the low-power state.
///
/// # Algorithm
///
/// 1. Enumerate all devices in the device registry.
/// 2. For each device whose `rtpm.state == RtpmState::Suspended`:
/// a. Call `rtpm_get_sync(dev)` to resume the device to D0.
/// b. If resume fails, log a warning and mark the device as
/// `SuspendError::DeviceResumeFailed`. The system suspend
/// continues — a single device failure does not abort suspend
/// (unless it is the root storage device).
/// 3. For devices in `RtpmState::Active` or `RtpmState::Disabled`: no action.
/// 4. For devices in `RtpmState::Switching`: handled separately by
/// `wait_for_tier_switches()`.
///
/// # Ordering
///
/// `reconcile_runtime_pm()` runs BEFORE `wait_for_tier_switches()` and
/// BEFORE `suspend_devices()`. The ordering is:
/// `reconcile_runtime_pm` → `wait_for_tier_switches` → `suspend_devices`.
/// This ensures all devices are in D0 before tier-switch completion is
/// checked and before driver suspend callbacks are invoked.
fn reconcile_runtime_pm(&self) -> Result<(), SuspendError> {
for dev in self.device_registry.iter() {
let state = dev.rtpm.state.load(Ordering::Acquire);
if state == RtpmState::Suspended as u32 {
if let Err(e) = rtpm_get_sync(&dev) {
klog!(Warning,
"suspend: failed to resume runtime-suspended device {}: {:?}",
dev.name(), e);
// Non-fatal for most devices. Fatal only for root storage.
if dev.is_root_storage() {
return Err(SuspendError::DeviceResumeFailed(dev.id()));
}
}
}
}
Ok(())
}
}
Device suspend ordering vs topology sort: The "reverse dependency order" for suspend
and "dependency order" for resume are both computed as a topological sort of the device
tree maintained by the device registry (Section 11.4). The
device_order field on each DeviceNode is NOT used for suspend ordering — it records
the probe (enumeration) order for debugging. Suspend order is computed dynamically from
the parent-child edges in the device tree at suspend time, because the topology may have
changed since boot (hotplug, tier switches). The topological sort runs once per suspend
cycle and is cached for the corresponding resume; if a device is hot-plugged during suspend_prepare(), the order is recomputed before the device suspend phase proceeds.
S0ix provider-client power edges: S0ix (Modern Standby) adds platform-level power
domain constraints beyond the device tree parent-child edges. The SoC's power management
controller defines provider-client relationships between power domains (e.g., the PCH
provides power to all PCIe root ports; the display controller's power domain depends on
the GPU domain for scanout). These edges are discovered from ACPI _PR0/_PR3 power
resource lists at boot and added to the device registry's dependency graph. During S0ix
entry, enter_s0ix() uses the same topological sort but includes power-domain edges in
addition to device-tree edges, ensuring that a power provider is suspended only after
all its clients have entered low-power state.
7.5.1.3 Driver Suspend/Resume Callbacks¶
Every driver (Tier 1 and Tier 2) must implement suspend/resume:
// umka-driver-sdk/src/suspend.rs
/// Driver suspend/resume trait.
pub trait SuspendResume {
/// Suspend the device to the given sleep state.
///
/// # Contract
/// - Flush all pending I/O to the device.
/// - Disable interrupts (deregister interrupt handler or mask at device level).
/// - Power down the device (write to PCI PM registers, or device-specific power control).
/// - Save any device state that cannot be reconstructed (e.g., firmware upload not repeatable).
///
/// # Timeout
/// If suspend does not complete within the tier-specific timeout (Tier 0/1: 2s,
/// Tier 2: 5s), the kernel may force-kill the driver (Tier 2) or mark it as
/// failed (Tier 0/1).
fn suspend(&self, target: SleepState) -> Result<(), SuspendError>;
/// Resume the device from the given sleep state.
///
/// # Contract
/// - Restore device state (e.g., re-upload firmware, reconfigure registers).
/// - Re-enable interrupts.
/// - Re-establish any connections (WiFi: reconnect to AP, NVMe: reinitialize controller).
///
/// # Failure handling
/// If resume fails, return Err. The kernel will attempt recovery (Section 7.2.10.6).
fn resume(&self, from: SleepState) -> Result<(), SuspendError>;
}
7.5.1.4 Device Suspend Ordering¶
Devices must suspend in reverse dependency order and resume in dependency order:
Suspend order (leaves first, roots last):
1. pci0000:00:02.0 (display controller — depends on the GPU for framebuffer scanout, so it suspends before the GPU)
2. pci0000:01:00.0 (GPU — power provider for the display controller)
3. vda / sda (block device layer — depends on nvme0n1 for I/O)
4. nvme0n1 (NVMe namespace — suspended only after its block-layer dependents)
5. wlan0 / eth0 (NIC — no dependents)
6. platform:rtc0 (RTC — leaf device, no dependents)
Resume order (roots first, leaves last):
1. platform:rtc0
2. wlan0 / eth0
3. nvme0n1
4. vda / sda
5. pci0000:01:00.0
6. pci0000:00:02.0
The device registry (Section 11.4) tracks dependencies. Before suspend, the registry computes a topological sort of the device tree and calls suspend callbacks in that order.
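The registry's sort can be sketched as Kahn's algorithm emitting dependents before their providers — the suspend order above (resume is the reverse). Here `children[i]` lists the devices that depend on device `i`; all names are illustrative, not the registry's actual API:

```rust
use std::collections::VecDeque;

/// Leaves-first topological sort over provider→dependent edges.
/// A device is emitted only once all of its dependents have been emitted.
fn suspend_order(children: &[Vec<usize>]) -> Vec<usize> {
    let n = children.len();
    // out_deg[i] = number of not-yet-emitted dependents of device i.
    let mut out_deg: Vec<usize> = children.iter().map(|kids| kids.len()).collect();
    // Reverse index: parents[c] = providers that device c depends on.
    let mut parents: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (provider, kids) in children.iter().enumerate() {
        for &c in kids {
            parents[c].push(provider);
        }
    }
    // Start from devices with no remaining dependents (safe to suspend now).
    let mut queue: VecDeque<usize> = (0..n).filter(|&i| out_deg[i] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(dev) = queue.pop_front() {
        order.push(dev); // all of dev's dependents are already in `order`
        for &p in &parents[dev] {
            out_deg[p] -= 1;
            if out_deg[p] == 0 {
                queue.push_back(p);
            }
        }
    }
    order
}
```

For the resume pass, the cached result is simply iterated in reverse.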
7.5.1.5 Tier 2 Driver Suspend¶
Tier 2 drivers are separate processes. Suspending them requires IPC:
impl SuspendManager {
/// Suspend a Tier 2 driver (send message via ring buffer).
fn suspend_tier2_driver(&self, driver_pid: Pid, state: SleepState) -> Result<(), SuspendError> {
// Send DRIVER_SUSPEND message to the driver's control ring (Section 11.6.2).
let msg = DriverControlMessage::Suspend { state };
self.driver_control_ring(driver_pid).push(msg)?; // per-driver control ring lookup
// Wait for response (DRIVER_SUSPEND_ACK) with tier-specific timeout
// (Tier 2: 5s — longer because userspace drivers may need to flush
// async I/O and signal completion over the control ring).
match self.wait_for_response(driver_pid, Duration::from_secs(5)) {
Ok(DriverControlMessage::SuspendAck) => Ok(()),
Err(TimeoutError) => {
// Driver did not respond: force terminate.
process::kill(driver_pid, Signal::SIGKILL)?;
// Mark device as unavailable until resume attempts to restart the driver.
self.device_registry.mark_unavailable(driver_pid)?;
Ok(()) // Continue suspend, device is orphaned but system suspends
}
Err(e) => Err(SuspendError::DriverFailed(e)),
}
}
}
7.5.1.6 Resume Failure Recovery¶
If a driver fails to resume:
Tier 1 driver failure:
1. Attempt Function Level Reset (FLR) via PCI config space (PCI_EXP_DEVCTL_BCR_FLR).
2. If FLR succeeds, reload the driver module (call probe() again).
3. If FLR fails or reload fails, mark the device unavailable and continue the resume. Log the error to the console and /var/log/kernel.log.
Tier 2 driver failure:
1. Kill the driver process.
2. Restart the driver process (spawn a new process, re-initialize ring buffers).
3. If restart succeeds, the device resumes normal operation (~10-50ms total recovery).
4. If restart fails 3 times, mark the device unavailable and continue the resume.
Critical device failure (NVMe root filesystem, display controller on the only display):
- If the root NVMe fails to resume, resume fails and the system must reboot (no recovery is possible without the root filesystem).
- If the display fails to resume, the system continues to run but shows a VT panic message (via the Tier 0 VGA fallback).
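The Tier 2 restart policy (up to three attempts, then give up) is simple enough to sketch in isolation; `restart` is a hypothetical stand-in for the real spawn-and-ring-reinit step:

```rust
/// Retry a fallible Tier 2 driver restart up to 3 times, per the policy above.
/// Returns true if the device resumed normal operation, false if it should be
/// marked unavailable and the resume should continue without it.
fn recover_tier2_driver<F: FnMut() -> Result<(), ()>>(mut restart: F) -> bool {
    for _attempt in 0..3 {
        if restart().is_ok() {
            return true; // device resumes normal operation
        }
    }
    false // mark device unavailable, continue resume
}
```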
7.5.1.7 S0ix Modern Standby¶
S0ix is not a true suspend state — the OS remains "running", but the CPU enters deep C-states (C10 on Intel, CC6 on AMD) where cores are powered off but the SoC stays alive.
Differences from S3:
- CPU does not power off: the scheduler still runs and interrupts still fire, but all tasks are idle (blocked on I/O or sleeping).
- WiFi stays live: the driver keeps the radio in D3hot (low power but still connected to the AP), waking on packet arrival.
- Display off: the panel enters DPMS Off (Section 21.5) with the backlight off, but the display controller stays powered.
- Device D3hot, not D3cold: devices enter D3hot (low power, quick wake) instead of D3cold (unpowered).
Enter S0ix:
impl SuspendManager {
pub fn enter_s0ix(&self) -> Result<(), SuspendError> {
// 1. Suspend devices in leaf-to-root topological order.
//
// The device registry ([Section 11.4](11-drivers.md#device-registry-and-bus-management)) maintains
// the device tree with parent-child relationships. enter_s0ix() walks the
// tree from leaves to root, suspending children before parents. This ensures
// that a child device's DMA and interrupts are quiesced before its parent
// power domain enters D3hot.
//
// The ordering is the same reverse-dependency topological sort used by S3
// suspend (see "Device Suspend Ordering" above). For S0ix, the target power
// state is D3hot (not D3cold) because devices must remain wake-capable.
//
// Each device transitions through runtime PM: rtpm_put_sync() decrements
// the usage count and immediately suspends. Devices that support wake
// (PCI PME, GPIO wake) configure their wake source before entering D3hot.
// Devices that do not support wake from D3hot remain in D0 (idle) — the
// driver's RuntimePmOps.idle callback returns NoAction for such devices.
let topo_order = self.device_registry.topological_sort_leaves_first();
for dev in &topo_order {
if dev.supports_runtime_pm() {
// Suspend via runtime PM path — this calls the driver's
// RuntimePmOps.suspend, which programs PCI PMCSR or ACPI _PS3.
rtpm_put_sync(dev);
} else {
// Legacy devices without runtime PM: direct power state write.
dev.set_power_state(PowerState::D3Hot)?;
}
}
// 2. Set CPU P-state to minimum frequency.
self.cpu_freq_governor.set_min_freq()?;
// 3. Program CPU package C-state limit to C10 (deepest).
self.cpu_cstate_governor.set_max_cstate(CState::C10)?;
// 4. Idle all CPUs (all tasks blocked or sleeping).
// Scheduler tick timer set to 1 Hz (extremely long idle periods).
self.scheduler.set_idle_mode(true)?;
// System now in S0ix. CPUs enter C10, wake on interrupt (timer, GPIO, PCIe PME).
Ok(())
}
pub fn exit_s0ix(&self) -> Result<(), SuspendError> {
// Resume in root-to-leaf order (reverse of enter_s0ix).
// Parent power domains must be active before children resume.
self.scheduler.set_idle_mode(false)?;
self.cpu_cstate_governor.set_max_cstate(CState::default())?;
self.cpu_freq_governor.restore_governor()?;
let topo_order = self.device_registry.topological_sort_roots_first();
for dev in &topo_order {
if dev.supports_runtime_pm() {
rtpm_get(dev)?;
} else {
dev.set_power_state(PowerState::D0Active)?;
}
}
Ok(())
}
}
Exit S0ix: Any interrupt (lid open, network packet, RTC alarm, USB device activity) wakes the CPU from C10 back to C0, resume is instant (~1-5ms).
7.5.2 Integration Points¶
The following table maps each Section 7.4 mechanism to its consumers in other sections:
| Mechanism | Defined in | Consumed by |
|---|---|---|
| `RaplInterface::set_power_limit()` | Section 7.4 | Section 7.4 consumer power profiles, Section 7.4 VM power budgets, Section 7.4 DCMI enforcement, thermal passive cooling (Section 7.4 RaplCooler) |
| `PowerDomainRegistry` | Section 7.4 | powercap sysfs (Section 7.4), cgroup power accounting (Section 7.4), VM admission control (Section 7.4) |
| `ThermalZone` trip points | Section 7.4–3.2 | Scheduler passive cooling (Section 7.1): Passive trips reduce cpufreq max; hwmon fan control (Section 13.13): Active trips actuate FanCooler |
| `TripType::Critical` handler | Section 7.4 | kernel_power_off() — no other dependencies |
| cgroup `power.energy_uj` | Section 7.11 | Billing/monitoring userspace agents, umka-kvm per-VM accounting (Section 7.4), Section 7.5 per-process attribution |
| cgroup `power.limit_uw` | Section 7.11–5.4 | umka-kvm VmPowerBudget (sets this file on VM creation), Section 7.4 power profiles (sets this on cgroup creation) |
| `VmPowerBudget` struct | Section 7.4 | umka-kvm/src/power.rs; interacts with Section 7.6 CpuBandwidth::reduce_quota() |
| `DcmiPowerCap::on_cap_set()` | Section 7.4 | BMC-driven cap propagation to RAPL (Section 7.4) and umka-kvm (Section 7.4) |
| SMBus battery registers | Section 7.4 | Battery driver, sysfs/umkafs (Section 7.4), Section 7.4 AC/battery auto-switch |
| `PowerProfile` enum | Section 7.4 | Section 7.4 profile translation, userspace power managers |
| `SuspendManager::suspend()` | Section 7.5 | System suspend/resume, Section 7.5 Tier 2 driver IPC |
| `SuspendResume` trait | Section 7.5 | Tier 1/Tier 2 driver implementations |
7.5.3 Per-Device Runtime Power Management¶
Sections 7.4 and 7.5.1–7.5.2 specify system-level power management: package-level RAPL limits, thermal trip points, cgroup power accounting, rack-level DCMI control, battery monitoring, consumer power profiles, suspend/resume protocols, and integration points. Per-device runtime PM addresses a different concern: controlling the power state of individual peripheral devices when they are idle.
An NIC that has not transmitted or received a packet in 10 seconds should be able to power-gate its PHY. A USB host controller with no active transfers should enter D3cold. A GPU idle between rendering jobs should clock-gate its shader cores. Without per-device runtime PM, these devices stay fully powered regardless of utilization, wasting energy and creating unnecessary thermal load.
Cross-references: workqueue for async suspend/resume (Section 3.11), clock gating (Section 2.24), regulator framework (Section 13.27), PCIe ASPM (Section 11.5), ACPI device power states (Section 2.4).
7.5.3.1 Design: Two-Counter State Machine¶
Linux pm_runtime uses three separate state variables (rpm_status, rpm_enabled,
power.disable_depth) plus multiple lock types. This has documented races and requires
careful lock ordering across different subsystems.
UmkaOS approach: Two atomic counters + one atomic state enum. All state transitions
are serialized through a single ordered work item dispatched to the pm-async workqueue
(Section 3.1.11). This eliminates lock ordering issues: the state machine has exactly
one concurrent writer at any time.
/// Runtime PM state of a device. Transitions are always serialized through pm-async.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RtpmState {
/// Device is fully powered and ready for I/O.
Active,
/// Suspend callback is in progress. Device is transitioning to low-power state.
Suspending,
/// Device is in low-power state. DMA is stopped; clock gates may be closed;
/// voltage may be reduced (depending on device-specific suspend callback).
Suspended,
/// Resume callback is in progress. Device is returning to Active state.
Resuming,
/// Runtime PM is disabled for this device. Device stays in Active state
/// regardless of usage count. Use during device reset or power-critical operations.
Disabled,
/// Device is undergoing a driver tier switch (live evolution). The old driver's
/// domain is being torn down and the replacement driver is being loaded. In this
/// state: (a) new `rtpm_get()` calls return `EBUSY`, (b) stale autosuspend timers
/// from the old driver are silently discarded (the timer callback checks for
/// `Switching` and no-ops), (c) the replacement driver's `init()` transitions
/// the device back to `Active` via `rtpm_get()` after establishing its domain.
/// Tier switching is an operational change (not a crash). The tier-switch protocol
/// is defined in [Section 11.1](11-drivers.md#three-tier-protection-model--tier-mobility). Crash recovery
/// ([Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)) is a separate mechanism for
/// post-crash driver reload.
Switching,
}
/// Runtime PM state embedded in every DeviceNode.
/// All fields are accessed only through the rtpm_* API; never directly.
pub struct RuntimePm {
/// Current state. Encoded as u32 for atomic access; cast to RtpmState on read.
pub state: AtomicU32,
/// Usage count. Must be > 0 for the device to stay Active.
/// Incremented by rtpm_get(); decremented by rtpm_put().
/// When it reaches 0: the autosuspend timer is started.
/// When it is incremented from 0 while Suspended: resume is triggered.
pub active_count: AtomicI32,
/// Time in nanoseconds to wait after active_count reaches 0 before
/// triggering automatic suspend.
/// 0 = suspend immediately when active_count reaches 0.
/// u64::MAX = never autosuspend (device must be manually suspended).
pub autosuspend_delay_ns: AtomicU64,
/// Monotonic timestamp (from clock_monotonic_ns()) when active_count
/// last transitioned to 0. Used by the autosuspend timer to check if
/// the device has been re-acquired since the timer was set.
pub idle_since_ns: AtomicU64,
/// Parent device for power sequencing.
/// If set: when this device enters Active, parent.rtpm_get() is called.
/// When this device enters Suspended, parent.rtpm_put() is called.
/// This ensures a power domain stays active while any child device is active.
pub parent: Option<Weak<DeviceNode>>,
/// Driver-supplied power state callbacks. If None: device has no runtime PM.
pub ops: Option<&'static RuntimePmOps>,
}
/// Driver-supplied callbacks for runtime PM transitions.
/// All callbacks run in the `pm-async` workqueue context:
/// preemptible, may sleep, may allocate with GFP_KERNEL.
pub struct RuntimePmOps {
/// Prepare the device for low-power state.
///
/// The driver must:
/// - Quiesce all outstanding DMA (wait for completion or cancel)
/// - Stop Rx (disable device interrupts, if safe)
/// - Save any state that must be restored on resume
/// - Gate device clocks if managed by the driver (not the clock framework)
///
/// Must complete within RTPM_SUSPEND_TIMEOUT_MS. If it does not: the kernel
/// logs KERN_WARNING and transitions the device back to Active.
pub suspend: fn(dev: &DeviceNode) -> Result<(), KernelError>,
/// Restore the device from low-power state.
///
/// The driver must restore all state saved in `suspend`, re-enable Rx,
/// and confirm the device is ready for I/O before returning.
pub resume: fn(dev: &DeviceNode) -> Result<(), KernelError>,
/// Optional: called when active_count first reaches 0.
///
/// The driver can inspect device state and return:
/// - `Idle`: proceed with the normal autosuspend path.
/// - `NoAction`: driver is not ready to suspend (has pending work, etc.).
/// The autosuspend timer will not be started; the driver calls rtpm_put()
/// again later to re-trigger the idle path.
pub idle: Option<fn(dev: &DeviceNode) -> RtpmIdleAction>,
}
#[repr(u8)]
pub enum RtpmIdleAction {
Idle = 0, // Proceed to autosuspend
NoAction = 1, // Driver not ready; skip autosuspend
}
/// Maximum time the suspend callback may take before the kernel gives up.
pub const RTPM_SUSPEND_TIMEOUT_MS: u64 = 5000;
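The `idle` callback contract above can be illustrated with a minimal, hypothetical driver. This is a sketch only: the `DeviceNode` stand-in and its `pending_work` field are illustrative placeholders, not part of the spec's real `DeviceNode`.

```rust
// Illustrative stand-ins so the sketch compiles outside the kernel.
pub struct DeviceNode {
    /// Hypothetical flag: the driver still has queued work for the device.
    pub pending_work: bool,
}

#[derive(Debug, PartialEq)]
pub enum RtpmIdleAction {
    Idle,     // Proceed to autosuspend
    NoAction, // Driver not ready; skip autosuspend
}

/// Sketch of a driver's idle callback following the contract above:
/// return NoAction while work is pending so the autosuspend timer is
/// not armed; the driver re-triggers the idle path with a later
/// rtpm_put() once the work drains.
pub fn sample_idle(dev: &DeviceNode) -> RtpmIdleAction {
    if dev.pending_work {
        RtpmIdleAction::NoAction
    } else {
        RtpmIdleAction::Idle
    }
}
```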
RtpmState to ACPI device power state mapping (ACPI 6.5 Section 7.2):
The runtime PM subsystem does not directly program ACPI Dx states — the driver's
RuntimePmOps.suspend callback is responsible for writing the appropriate PCI PM
register or ACPI _PSx method. This mapping defines the expected correspondence so
that drivers and the kernel agree on the semantics of each RtpmState:
| RtpmState | ACPI Dx | Description |
|---|---|---|
| Active | D0 | Device fully powered, clocks running, ready for I/O. |
| Suspending | D0 (transitional) | Device still in D0 while the driver quiesces DMA and saves state. No ACPI state change yet. |
| Suspended | D2 or D3hot (device-dependent) | Low-power state. D2: device retains context, reduced power. D3hot: device loses most context but PCI config space is accessible, PME# can wake. The driver's suspend callback selects D2 vs D3hot based on device capabilities (PCI_PM_CAP_PME_D3hot, wake requirement). |
| Resuming | D0 (transitional) | Driver has initiated the transition back to D0. _PS0 method or PCI PMCSR write in progress. |
| Disabled | D0 | Runtime PM is disabled; device stays in D0 regardless of usage count. |
| Switching | D0 | Device is undergoing a driver tier switch. It stays powered (D0), but new runtime PM operations are rejected until the replacement driver completes initialization. |
D3cold: Not reachable via runtime PM alone. D3cold (auxiliary power removed, device
completely off) requires platform-level power rail control (e.g., ACPI _PR3 power
resource, GPIO-controlled regulator). D3cold is used only during system suspend (S3/S4)
or when the platform power manager explicitly removes power. The transition path is:
RtpmState::Suspended (D3hot) then platform removes power rail to reach D3cold.
Re-entering D0 from D3cold requires a full device reset (PCI FLR or bus re-enumeration).
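The D2-vs-D3hot choice described in the mapping table can be sketched as a pure decision function. The function name and parameters below are illustrative (stand-ins for the PCI PM capability bits such as PCI_PM_CAP_PME_D3hot), not a spec API:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum TargetDxState {
    D2,
    D3Hot,
}

/// Sketch: pick the deepest runtime-reachable D-state, falling back to D2
/// when the device must wake the system but cannot signal PME# from D3hot.
/// D3cold is intentionally absent: it is not reachable via runtime PM alone.
pub fn select_low_power_state(
    supports_d2: bool,
    pme_from_d3hot: bool, // PCI_PM_CAP_PME_D3hot capability bit
    wake_required: bool,
) -> TargetDxState {
    if wake_required && !pme_from_d3hot && supports_d2 {
        // Must retain wake capability; D2 keeps context and PME#.
        TargetDxState::D2
    } else {
        // Deepest state per the mapping table above.
        TargetDxState::D3Hot
    }
}
```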
7.5.3.2 API¶
/// Increment usage count. If device is Suspended, triggers resume and blocks
/// until the device is Active (resume callback has completed successfully).
///
/// May sleep; must not be called from atomic context or IRQ handlers.
/// Use rtpm_get_async() from non-sleeping contexts.
pub fn rtpm_get(dev: &DeviceNode) -> Result<(), KernelError>;
/// Async variant: enqueues a resume work item in pm-async and returns immediately.
/// The caller must check rtpm_is_active() (or fall back to the blocking rtpm_get())
/// before performing I/O.
pub fn rtpm_get_async(dev: &DeviceNode) -> Result<WorkHandle, KernelError>;
/// Decrement usage count.
///
/// If count reaches 0:
/// - If ops.idle is set: calls it. If it returns NoAction, stops here.
/// - Else: starts the autosuspend timer (autosuspend_delay_ns from now).
/// Does NOT block; returns immediately.
pub fn rtpm_put(dev: &DeviceNode);
/// Decrement usage count and immediately enqueue a suspend work item,
/// bypassing the autosuspend delay. Blocks until the device is Suspended.
pub fn rtpm_put_sync(dev: &DeviceNode);
/// Disable runtime PM: transitions to Disabled state if currently Active.
/// Device will not be suspended while Disabled.
pub fn rtpm_disable(dev: &DeviceNode);
/// Re-enable runtime PM: transitions from Disabled to Active.
pub fn rtpm_enable(dev: &DeviceNode);
/// Returns true iff the device is currently in Active state.
/// Non-blocking; returns immediately.
pub fn rtpm_is_active(dev: &DeviceNode) -> bool;
/// Set the autosuspend delay.
///
/// - 0: suspend immediately when active_count reaches 0
/// - u64::MAX: never autosuspend (only manual rtpm_put_sync will suspend)
pub fn rtpm_set_autosuspend_delay(dev: &DeviceNode, delay_ns: u64);
/// Mark the device as active WITHOUT incrementing the usage count.
///
/// This resets the autosuspend timer by updating `idle_since_ns` to
/// the current monotonic time. Called by the KABI vtable trampoline
/// ([Section 11.6](11-drivers.md#device-services-and-boot--service-vtable-trampoline-mechanism))
/// on every KABI call that targets this device — it indicates the
/// device is in active use without creating a matching `rtpm_put()`
/// obligation.
///
/// **Cost**: one `Ordering::Relaxed` atomic store to a per-device
/// cache line (`idle_since_ns`). No cross-core invalidation traffic
/// because each device's `RuntimePm` struct is on its own cache line.
///
/// Does NOT trigger resume if the device is Suspended. The KABI
/// trampoline calls `rtpm_get()` separately when dispatching to a
/// suspended device — `rtpm_mark_active()` is an optimization for
/// the common case where the device is already Active and only the
/// autosuspend timer needs resetting.
#[inline]
pub fn rtpm_mark_active(dev: &DeviceNode) {
dev.rtpm.idle_since_ns.store(
clock_monotonic_ns(),
Ordering::Relaxed,
);
}
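The get/put usage-count discipline above can be modeled in miniature. `UsageCount` is an illustrative stand-in for the `active_count` field only; the real `rtpm_get()`/`rtpm_put()` additionally trigger resume, the idle callback, and the autosuspend timer:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Minimal model of the runtime PM usage count (sketch, not the spec type).
pub struct UsageCount(AtomicU32);

impl UsageCount {
    pub fn new() -> Self {
        UsageCount(AtomicU32::new(0))
    }
    /// Models rtpm_get(): increment. A 0 -> 1 transition would trigger resume.
    pub fn get(&self) -> u32 {
        self.0.fetch_add(1, Ordering::AcqRel) + 1
    }
    /// Models rtpm_put(): decrement. A 1 -> 0 transition would start the
    /// autosuspend timer (or call ops.idle if set).
    pub fn put(&self) -> u32 {
        self.0.fetch_sub(1, Ordering::AcqRel) - 1
    }
    pub fn active(&self) -> bool {
        self.0.load(Ordering::Acquire) > 0
    }
}
```

A driver brackets every I/O burst with `get()` before touching the hardware and a matching `put()` afterwards; the device can only autosuspend when every outstanding `get()` has been balanced.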
7.5.3.3 State Machine¶
All state transitions are serialized: only one transition can be in progress at a time,
because they are dispatched as ordered work items to the pm-async workqueue.
State transitions:
Active ──(active_count → 0 AND idle check passes AND delay expires)──► Suspending
Suspending ──(suspend callback returns Ok)──────────────────────────► Suspended
Suspending ──(suspend callback returns Err OR timeout)──────────────► Active
[on error: KERN_WARNING logged; retry after RTPM_RETRY_DELAY_MS = 100 ms]
Suspended ──(rtpm_get() or rtpm_get_async() called)─────────────────► Resuming
Resuming ──(resume callback returns Ok)─────────────────────────────► Active
Resuming ──(resume callback returns Err)────────────────────────────► Active
[on error: KERN_ERR logged; device assumed broken; I/O will fail]
Any state ──(rtpm_disable() called)─────────────────────────────────► Disabled
Disabled ──(rtpm_enable() called)───────────────────────────────────► Active
Autosuspend timer:
Implemented using the UmkaOS timer framework (Section 7.5).
Armed at: idle_since_ns + autosuspend_delay_ns (monotonic time).
On expiry: check if active_count == 0 AND state == Active.
- If yes: enqueue suspend work item in pm-async.
- If no (device re-acquired since timer was armed): no-op; timer is self-canceling.
Timer handler runs in IRQ context and only enqueues work; it does NOT directly
call the suspend callback (which may sleep).
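The expiry check can be expressed as a small predicate. This is a sketch of the logic described above (the function name and flattened parameters are illustrative): it returns whether a suspend work item should be enqueued, and the re-acquisition test relies on `idle_since_ns` having been rewritten since the timer was armed.

```rust
/// Sketch: should the autosuspend timer handler enqueue suspend work?
/// Runs in IRQ context, so it only decides — it never calls the
/// (potentially sleeping) suspend callback itself.
pub fn autosuspend_should_fire(
    active_count: u32,
    state_is_active: bool,
    armed_at_idle_since_ns: u64,   // idle_since_ns captured when armed
    current_idle_since_ns: u64,    // idle_since_ns now
) -> bool {
    // Device re-acquired (or rtpm_mark_active() called) since arming:
    // idle_since_ns moved, so the timer self-cancels as a no-op.
    if current_idle_since_ns != armed_at_idle_since_ns {
        return false;
    }
    active_count == 0 && state_is_active
}
```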
Parent-child power sequencing:
When child device enters Active state:
→ rtpm_get(child.parent) is called automatically by the runtime PM core.
→ This ensures the parent power domain is Active before the child uses it.
When child device enters Suspended state:
→ rtpm_put(child.parent) is called automatically.
→ If parent's active_count reaches 0, parent may also suspend.
Circular dependency detection:
Performed at DeviceNode registration time using a BFS cycle check on the
parent chain. A circular dependency causes a kernel panic with the cycle
description logged. Circular power dependencies cannot be correctly handled
and indicate a driver or DT error.
Maximum parent chain depth: 16 devices. Deeper chains are rejected at registration.
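Because each device has at most one parent, the registration-time check reduces to walking the parent chain with a visited set and a depth bound. The sketch below uses an index-based stand-in for the `DeviceNode` parent links (illustrative, not the spec's data model):

```rust
pub const MAX_PARENT_DEPTH: usize = 16;

/// Sketch of the registration-time parent-chain validation described above.
/// `parent_of[i]` is Some(parent index) or None for a root device.
/// Returns the chain depth on success; rejects cycles and over-deep chains.
pub fn validate_parent_chain(
    parent_of: &[Option<usize>],
    dev: usize,
) -> Result<usize, &'static str> {
    let mut seen = vec![false; parent_of.len()];
    let mut depth = 0;
    let mut cur = dev;
    loop {
        if seen[cur] {
            // Would be a kernel panic in the real registration path.
            return Err("circular power dependency");
        }
        seen[cur] = true;
        match parent_of[cur] {
            None => return Ok(depth),
            Some(p) => {
                depth += 1;
                if depth > MAX_PARENT_DEPTH {
                    return Err("parent chain too deep");
                }
                cur = p;
            }
        }
    }
}
```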
7.5.3.4 Linux External ABI¶
The following sysfs files are required for userspace power management tools (udev, tlp, tuned, gnome-power-manager, systemd):
/sys/bus/<bus>/devices/<device>/power/control
Values: "on" | "auto"
"on" = rtpm_disable() (device stays Active)
"auto" = rtpm_enable() (runtime PM active; device may autosuspend)
/sys/bus/<bus>/devices/<device>/power/runtime_status
Values: "active" | "suspended" | "suspending" | "resuming" | "unsupported"
Read-only. "unsupported" if ops == None.
/sys/bus/<bus>/devices/<device>/power/autosuspend_delay_ms
Read-write integer: autosuspend delay in milliseconds.
Writes map to rtpm_set_autosuspend_delay(dev, value_ms * 1_000_000).
-1 means "never autosuspend" (maps to autosuspend_delay_ns = u64::MAX).
/sys/bus/<bus>/devices/<device>/power/runtime_usage
Read-only integer: current active_count value.
/sys/bus/<bus>/devices/<device>/power/runtime_active_time
Read-only u64: cumulative nanoseconds spent in Active state (for energy accounting).
/sys/bus/<bus>/devices/<device>/power/runtime_suspended_time
Read-only u64: cumulative nanoseconds spent in Suspended state.
These files are identical in layout and semantics to Linux's sysfs PM interface, enabling existing tools to work without modification.
7.5.3.5 Tier 2 Runtime PM API¶
Both Tier 1 and Tier 2 drivers invoke rtpm_get() / rtpm_put() via kabi_call!
which resolves to the appropriate transport: ring buffer commands for cross-domain
dispatch (Section 12.6). Tier 0 modules call the API directly
(same domain, direct vtable call). The driver code is the same regardless of tier —
only the transport differs, selected at bind time.
KABI runtime PM commands:
/// Runtime PM commands sent by Tier 2 drivers over the driver control ring.
/// These are KABI ring buffer message types, not syscalls — Tier 2 drivers
/// use the standard ring buffer push/pop protocol to send them.
#[repr(u32)]
pub enum DriverRtpmCommand {
/// Increment the device's runtime PM usage count. If the device is
/// currently Suspended, the kernel triggers a resume and blocks the
/// driver's ring buffer consumer thread until the device reaches Active
/// state. The response message carries the result.
///
/// Equivalent to the in-kernel rtpm_get().
Get = 0x0010_0001,
/// Decrement the device's runtime PM usage count. If the count reaches
/// zero, the autosuspend timer is started. Non-blocking: the response
/// is sent immediately (suspend happens asynchronously).
///
/// Equivalent to the in-kernel rtpm_put().
Put = 0x0010_0002,
/// Set the autosuspend delay for this device.
/// Payload: u64 delay_ns (0 = immediate, u64::MAX = never).
///
/// Equivalent to the in-kernel rtpm_set_autosuspend_delay().
SetAutosuspendDelay = 0x0010_0003,
/// Query the current runtime PM state. Non-blocking.
/// Response payload: RtpmState as u32.
GetStatus = 0x0010_0004,
}
/// Response to a DriverRtpmCommand, sent back to the Tier 2 driver
/// on its completion ring.
pub struct DriverRtpmResponse {
/// The command this response is for.
pub command: DriverRtpmCommand,
/// 0 = success, negative = error (KernelError encoding).
pub result: i32,
/// For GetStatus: the current RtpmState as u32.
/// For other commands: unused (0).
pub state: u32,
}
Kernel-side dispatch: The Tier 2 driver's KABI control ring consumer (running
as a kernel thread) receives DriverRtpmCommand messages and translates them to
the corresponding in-kernel rtpm_*() calls:
/// Handle a runtime PM command from a Tier 2 driver.
/// Called by the KABI control ring consumer in process context (may sleep).
fn handle_tier2_rtpm(
    dev: &DeviceNode,
    cmd: DriverRtpmCommand,
    payload: &[u8],
) -> DriverRtpmResponse {
    match cmd {
        DriverRtpmCommand::Get => {
            let result = rtpm_get(dev);
            DriverRtpmResponse {
                command: cmd,
                result: result.map(|_| 0i32).unwrap_or_else(|e| e.to_errno()),
                state: 0,
            }
        }
        DriverRtpmCommand::Put => {
            rtpm_put(dev);
            DriverRtpmResponse { command: cmd, result: 0, state: 0 }
        }
        DriverRtpmCommand::SetAutosuspendDelay => {
            // Reject malformed (short) payloads instead of panicking.
            let Some(bytes) = payload.get(..8) else {
                return DriverRtpmResponse { command: cmd, result: -22 /* EINVAL */, state: 0 };
            };
            let delay_ns = u64::from_le_bytes(bytes.try_into().unwrap());
            rtpm_set_autosuspend_delay(dev, delay_ns);
            DriverRtpmResponse { command: cmd, result: 0, state: 0 }
        }
        DriverRtpmCommand::GetStatus => {
            let state = dev.rtpm.state.load(Ordering::Acquire);
            DriverRtpmResponse { command: cmd, result: 0, state }
        }
    }
}
Blocking semantics: DriverRtpmCommand::Get blocks the KABI control ring
consumer thread until the device resume completes (matching the in-kernel
rtpm_get() semantics). During this time, other control ring commands from the
same Tier 2 driver are queued. This is acceptable because runtime PM resume is
bounded (RTPM_SUSPEND_TIMEOUT_MS = 5000 ms) and the data-plane ring (used for
actual I/O) is a separate ring that is not blocked. If the Tier 2 driver needs
non-blocking resume, it can send GetStatus after Get to poll for completion
asynchronously (send Get, continue processing data-plane I/O, periodically send
GetStatus until state == Active).
Security: The kernel validates that the Tier 2 driver process holds a
DeviceHandle for the target device. A Tier 2 driver cannot send runtime PM
commands for devices it does not own. The DeviceHandle is bound to the driver's
KABI session at device probe time and cannot be forged.
7.5.4 Cpuidle Governor¶
When a CPU enters the idle task (no runnable work), the cpuidle governor selects the appropriate processor idle state (C-state). The governor decides how deeply the CPU should sleep based on expected idle duration and wake-up latency constraints.
/// Per-CPU cpuidle governor state.
///
/// Implements a menu-based idle state selection algorithm (matching Linux's
/// menu governor concept but with a simpler, more predictable implementation).
/// Updated on every idle entry and exit.
pub struct CpuidleGovernor {
/// Predicted idle duration in microseconds, based on recent history
/// and the next scheduled timer event.
pub predicted_us: u32,
/// Exponential moving average of actual idle durations (microseconds).
/// Updated on every idle exit: avg = avg * 7/8 + actual * 1/8.
pub avg_idle_us: AtomicU32,
/// History ring of recent idle durations (8 entries, microseconds).
/// Used to detect bimodal idle patterns (e.g., alternating short/long
/// idles from periodic interrupt sources).
pub history: [u32; 8],
/// Current index into the history ring (wraps at 8).
pub history_idx: u8,
/// Correction factor for prediction accuracy (fixed-point Q8, range [25, 256]).
/// 256 = 1.0 (perfect prediction), 25 ≈ 0.1 (worst case).
/// Updated after each idle exit based on prediction error.
pub correction_factor: u32,
}
/// A single processor idle state (C-state), discovered at boot.
///
/// Populated from ACPI `_CST` objects (x86), DT `idle-states` node
/// (ARM/RISC-V), or OPAL idle state table (PPC64).
pub struct CpuidleState {
/// C-state index (0 = poll/mwait, higher = deeper sleep).
pub index: u8,
/// Human-readable name (e.g., "C1", "C6S", "POLL").
pub name: ArrayString<16>,
/// Worst-case exit latency in microseconds. The CPU is unavailable
/// for this duration when woken from this state.
pub exit_latency_us: u32,
/// Target residency in microseconds. The minimum time the CPU must
/// remain in this state for the entry/exit overhead to be worthwhile.
pub target_residency_us: u32,
/// Architecture-specific entry function.
/// x86: MWAIT hint value (Cx sub-state). ARM: WFI or PSCI CPU_SUSPEND.
/// RISC-V: WFI or SBI HSM hart_suspend. PPC: nap/sleep/winkle.
pub enter: fn(state: &CpuidleState),
}
Idle state selection algorithm:

1. Compute expected idle duration: `expected_us = min(next_timer_event_us, avg_idle_us * correction_factor / 256)`. The next timer event is obtained from the per-CPU timer wheel; `avg_idle_us` provides a history-based prediction.
2. Select deepest safe C-state: iterate C-states from deepest to shallowest. Select the deepest state whose `exit_latency_us < expected_us * latency_factor / 256`. The `latency_factor` is 128 (i.e., 0.5), meaning a state is eligible only if its exit latency is at most 50% of the expected idle duration. This prevents selecting states where the wake-up cost dominates.
3. Target residency check: additionally require `target_residency_us <= expected_us`. This ensures the CPU stays in the state long enough for the power savings to offset the entry/exit cost.
4. Update correction factor on wake: after exiting idle, compute `ratio = actual_idle_us * 256 / predicted_us`. Update `correction_factor` as an exponential moving average: `correction_factor = correction_factor * 7/8 + ratio * 1/8`, clamped to `[25, 256]` (i.e., `[0.1, 1.0]`). If the prediction was too optimistic (woken early by an interrupt), the factor decreases, causing shallower states to be selected in the future.
5. History ring update: record `actual_idle_us` in the history ring. The ring is checked for bimodal patterns: if the coefficient of variation exceeds 1.5, the governor uses the shorter mode of the distribution as the predicted value, avoiding deep states that will be interrupted.
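The prediction and selection steps can be sketched as two small functions. This is a simplified illustration of the algorithm above (the flattened `IdleState` and the assumption that states are sorted shallow-to-deep are mine, not the spec's `CpuidleState`):

```rust
#[derive(Clone, Copy)]
pub struct IdleState {
    pub exit_latency_us: u32,
    pub target_residency_us: u32,
}

/// Step 1: expected_us = min(next_timer_event_us, avg * correction / 256).
pub fn predict_idle_us(next_timer_us: u32, avg_idle_us: u32, correction_factor: u32) -> u32 {
    next_timer_us.min(((avg_idle_us as u64 * correction_factor as u64) / 256) as u32)
}

/// Steps 2-3: deepest state whose exit latency is under half the expected
/// idle (latency_factor = 128 in Q8) and whose target residency fits.
/// `states` must be ordered shallowest (index 0, poll) to deepest.
pub fn cpuidle_select(states: &[IdleState], expected_us: u32) -> usize {
    const LATENCY_FACTOR: u64 = 128; // Q8 fixed-point: 0.5
    let latency_budget = (expected_us as u64 * LATENCY_FACTOR) / 256;
    for (i, s) in states.iter().enumerate().rev() {
        if (s.exit_latency_us as u64) < latency_budget
            && s.target_residency_us <= expected_us
        {
            return i;
        }
    }
    0 // Nothing deeper qualifies: fall back to the shallowest (poll) state.
}
```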
Per-architecture C-state table population:

| Architecture | Source | Discovery Time |
|---|---|---|
| x86-64 | ACPI `_CST` method or CPUID MWAIT leaf (0x05) | Boot (ACPI parse) |
| AArch64 | DT `idle-states` node (ARM DEN0022D) | Boot (DTB parse) |
| ARMv7 | DT `idle-states` node | Boot (DTB parse) |
| RISC-V | DT `idle-states` node or SBI HSM extension | Boot (DTB parse) |
| PPC64LE | OPAL idle state table (`ibm,opal/power-mgt`) | Boot (OPAL query) |
| PPC32 | Static table (single WFI state on most e500 cores) | Compile time |
| s390x | STSI facility query + diagnose 0x44 idle | Boot (facility query) |
| LoongArch64 | DT `idle-states` node or CSR-based idle (CPUCFG discovery) | Boot (DTB parse / CPUCFG) |
Integration with scheduler idle path: The idle task (Section 7.1.4) calls
cpuidle_select(governor, states) after processing softirqs. The selected state's
enter function executes the platform halt instruction. On interrupt wake, the idle
task updates the governor with the actual idle duration before calling pick_next_task().
Runtime PM interaction with driver tier transitions: When the FMA engine (Section 20.1) demotes a driver from Tier 1 to Tier 2 (or promotes Tier 2 to Tier 1), the runtime PM state must be coordinated:

- Before tier demotion (Tier 1 → Tier 2): call `rtpm_put_sync(dev)` to suspend the device if it is runtime-active. The device enters a known low-power state before the isolation boundary is reconfigured. After the tier switch completes and the driver is reloaded in the new tier, `rtpm_get(dev)` resumes the device.
- Before tier promotion (Tier 2 → Tier 1): the Tier 2 driver process must call `rtpm_put_sync(dev)` before being terminated. The Tier 1 driver's `init()` calls `rtpm_get(dev)` to resume the device in the new isolation domain.
- Autosuspend timer coordination: the autosuspend timer (`RtpmAutoSuspendTimer`) is cancelled before the tier switch and re-armed by the replacement driver's `init()`. Stale timers from the old driver must not fire against the new driver's domain — the timer callback checks `dev.rtpm_state` and no-ops if the device is in `RtpmState::Switching`.
7.6 CPU Bandwidth Guarantees¶
Inspired by: QNX Adaptive Partitioning (concept only). IP status: Built from academic scheduling theory (CBS/EDF, 1998) and Linux cgroup v2 interface. QNX-specific implementation NOT referenced. Term "adaptive partitioning" NOT used.
7.6.1 Problem¶
Section 7.1 defines three scheduler classes: EEVDF (normal), RT (FIFO/RR), and Deadline (EDF/CBS). Cgroups v2 provides resource limits (`cpu.max` caps the ceiling, `cpu.weight` sets relative priority).
What is missing: guaranteed minimum CPU bandwidth under overload. Current mechanisms:
- `cpu.weight` is proportional sharing — if one cgroup has weight 100 and another has weight 900, the first gets 10% of whatever is available. But if the system is fully loaded, "10% of available" might not meet the minimum requirement.
- `cpu.max` is a ceiling, not a floor. It limits the maximum; it does not guarantee a minimum.
- The Deadline scheduler provides guarantees, but only for individual tasks, not for groups.
Use case: A server runs a database (needs guaranteed 40% CPU), a web frontend (needs guaranteed 20%), and batch jobs (uses the rest). Under overload, the batch jobs must not be able to starve the database below 40%, even if the batch jobs are numerous.
7.6.2 Design: CBS-Based Group Bandwidth Reservation¶
The solution combines the existing Deadline scheduler's Constant Bandwidth Server (CBS) algorithm with cgroup v2's group hierarchy.
CBS (Abeni & Buttazzo, 1998) provides bandwidth isolation: each server (task or group) is assigned a bandwidth Q/P (Q microseconds of CPU time every P microseconds). The server is guaranteed this bandwidth regardless of other servers' behavior. Unused bandwidth is redistributed (work-conserving).
This is a well-established academic algorithm with no IP encumbrance.
7.6.3 Cgroup v2 Interface¶
A new control file, `cpu.guarantee`, is added to the cpu controller, additive to the existing interface.
Format: `$QUOTA $PERIOD` (microseconds), identical in syntax to `cpu.max`. Typical `cpu.guarantee` periods are longer (1 second) than typical `cpu.max` periods (100 ms) because a guarantee is a coarser-grained bandwidth allocation — CBS replenishment overhead is amortized over longer periods.
Example:
# Database cgroup: guaranteed 40% CPU bandwidth
echo "400000 1000000" > /sys/fs/cgroup/database/cpu.guarantee
# = 400ms of CPU time every 1000ms = 40% guaranteed

# Web frontend: guaranteed 20%
echo "200000 1000000" > /sys/fs/cgroup/web/cpu.guarantee

# Batch jobs: no guarantee (uses whatever is left)
# cpu.guarantee defaults to "max" (no guarantee)

# Total guaranteed: 60%. Remaining 40% is shared by weight among all.
Semantics:
- A group with `cpu.guarantee` set is backed by a CBS server at the specified bandwidth.
- The CBS server ensures the group receives at least its guaranteed bandwidth even under full system load.
- When the group is idle, its unused bandwidth is redistributed to other groups (work-conserving). This is inherent to CBS — no special logic needed.
- `cpu.guarantee` cannot exceed `cpu.max` (if both are set, guarantee <= max).
- The sum of all `cpu.guarantee` values across the system must not exceed total CPU bandwidth. Attempting to overcommit returns `-ENOSPC`.
- Nested cgroups: a child's guarantee comes out of its parent's guarantee budget.
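The overcommit rule can be sketched as an admission check. The function below is illustrative (the in-kernel path is `cpu_guarantee_write`; the parts-per-million encoding and the 95% cap default are assumptions drawn from the admission-control description later in this section):

```rust
/// Sketch: admit a new cpu.guarantee only if the system-wide sum of
/// quota/period ratios stays under the utilization cap. Ratios are in
/// parts-per-million to keep the arithmetic in integers.
pub fn admit_guarantee(
    existing_sum_ppm: u64,    // sum of quota/period over all current groups
    new_quota_us: u64,
    new_period_us: u64,
    nr_cpus: u64,
    utilization_cap_ppm: u64, // e.g. 950_000 = 95% per CPU
) -> Result<u64, &'static str> {
    if new_period_us == 0 {
        return Err("invalid period");
    }
    let new_ppm = new_quota_us * 1_000_000 / new_period_us;
    let total = existing_sum_ppm + new_ppm;
    // Guarantees are system-wide, so the budget scales with CPU count.
    if total > nr_cpus * utilization_cap_ppm {
        return Err("ENOSPC: guaranteed bandwidth overcommitted");
    }
    Ok(total)
}
```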
Multi-core accounting: cpu.guarantee specifies system-wide bandwidth, not
per-CPU. The implementation uses a global budget pool with per-CPU runtime slices (same
pattern as Linux EEVDF bandwidth throttling for cpu.max). A CBS server with a 40%
guarantee on a 4-CPU system gets a global budget of 400ms per 1000ms period. This budget
is drawn down as tasks in the group run on any CPU. When exhausted, all tasks in the
group are throttled until the next period. Per-NUMA-node variants are tracked in
BandwidthAccounting.per_node_guaranteed for NUMA-aware scheduling hints, but the
guarantee is enforced globally.
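The per-CPU slice drawn from the global pool follows the proportional-share formula used at replenishment (`local_share = quota * local_weight / total_weight`, per the `CbsGroupConfig` comments below). A minimal sketch, with an illustrative helper name:

```rust
/// Sketch: a CPU's CBS server receives local_weight / total_weight of the
/// cgroup's global budget at each replenishment.
pub fn local_share_ns(quota_ns: u64, local_weight: u64, total_weight: u64) -> u64 {
    if total_weight == 0 {
        return 0; // No runnable tasks anywhere: nothing to replenish.
    }
    // Widen to 128 bits so large quotas * weights cannot overflow.
    ((quota_ns as u128 * local_weight as u128) / total_weight as u128) as u64
}
```

For the 40%-on-4-CPUs example: a CPU holding a quarter of the runnable weight draws a quarter of the 400 ms global budget each 1000 ms period.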
RT and DL task handling: RT and DL tasks within a CBS-guaranteed cgroup bypass the
CBS server's budget accounting. Their guarantees come from the RT/DL schedulers directly.
The CBS guarantee applies only to EEVDF-class (normal) tasks within the group. This matches Linux's
behavior where cpu.max throttling does not apply to RT tasks.
7.6.4 Kernel-Internal Design¶
Evolvable component classification (Section 13.18):

| Component | Classification | Rationale |
|---|---|---|
| `CbsGroupConfig` data structure | Nucleus (data) | Struct layout must survive policy swap; budget fields are data |
| `CbsCpuServer` data structure | Nucleus (data) | Per-CPU server state (budget, deadline) layout must survive replacement |
| `CbsCpuServer.tree` (`VruntimeTree`) | Nucleus (data) | Shared tree + accumulator base type; same Nucleus classification as `VruntimeTree` in Section 7.1 |
| `cbs_replenish()` budget math | Evolvable | Runtime formula code. Correctness ensured by the Nucleus invariant checker: validates that `budget_remaining_ns` post-replenishment is <= `quota_us * 1000` and that `deadline_ns` advances monotonically. Making it Nucleus would prevent fixing a budget formula bug without a reboot. |
| `cbs_charge()` accounting | Evolvable | Runtime formula code. Invariant checker validates that the charge does not exceed `delta_exec` and that the deficit is bounded to `-(quota_us * 1000)`. |
| Admission control (`cpu_guarantee_write`) | Evolvable | Admission formula code. Invariant checker validates the sum invariant (`total_guaranteed <= max_guarantee`). |
| `cbs_try_steal()` heuristics | Evolvable | Steal order (NUMA-first) and steal fraction (1/2) are tunable policy |
| CBS `pick_next_eevdf_task()` | Evolvable | Delegates to the active `SchedPolicy` module (same as main EEVDF) |
| `BandwidthAccounting` data | Nucleus (data) | System-wide admission bookkeeping struct layout; corrupted = overcommit possible |
| `CpuBandwidthThrottle` data | Nucleus (data) | Ceiling enforcement struct layout |
| `CpuMaxLocalSlice` per-CPU cache | Nucleus (data) | Local slice accounting struct layout |
| `cpu.max` burst threshold | Evolvable | Burst tolerance is a tunable heuristic |
Phase assignment: CBS guarantee mechanism is Phase 2 (required for cgroup CPU isolation). ML policy hooks for CBS steal heuristics are Phase 4 (Section 23.1).
// umka-core/src/sched/cbs_group.rs (kernel-internal)
/// Cgroup-wide CBS configuration. Stored in `CpuController.cbs`.
/// Written by the `cpu.guarantee` / `cpu.max` cgroupfs interface.
/// The per-CPU servers read this at replenishment time.
pub struct CbsGroupConfig {
/// Bandwidth: quota microseconds per period microseconds.
pub quota_us: u64,
pub period_us: u64,
/// **Note**: CBS guarantee (floor) does NOT use a burst buffer.
/// Burst semantics apply only to `cpu.max` (ceiling) enforcement via
/// `CpuBandwidthThrottle.burst_us`. The guarantee mechanism provides
/// bandwidth reservation, not burst tolerance — the per-CPU proportional
/// model and atomic steal already absorb scheduling jitter naturally.
/// Total weight of runnable tasks across all CPUs. Updated atomically
/// on enqueue/dequeue. Used by per-CPU servers to compute proportional
/// share: `local_share = quota_us * local_weight / total_weight`.
pub total_weight: AtomicU64,
/// Admission control: sum of guarantee/period ratios across all CBS
/// groups on this system. Must not exceed the configured utilization
/// cap (default 95%, matching Linux SCHED_DEADLINE).
/// Updated atomically when `cpu.guarantee` is written.
pub system_utilization: &'static AtomicU64, // Shared singleton
}
/// Per-CPU CBS server for a single cgroup. Allocated lazily when the
/// cgroup's first task becomes runnable on a CPU. Stored in per-CPU
/// `RunQueueData.cbs_servers` XArray, keyed by cgroup ID (O(1) lookup).
///
/// Each CBS server maintains its own local EEVDF tree for tasks in the
/// guaranteed cgroup. `pick_next_task()` checks CBS servers (in EDF
/// order by `deadline_ns`) between the DL and plain-EEVDF steps. This
/// ensures guaranteed groups receive CPU time ahead of non-guaranteed
/// tasks while budget remains. When budget is exhausted (`throttled`
/// is set), the server is skipped and its tasks wait for replenishment.
///
/// **Enqueue/dequeue**: When a task in a CBS-guaranteed cgroup becomes
/// runnable, it is enqueued in the CBS server's EEVDF tree (not the
/// main EEVDF tree). When it blocks/exits, it is dequeued from the
/// server's tree. If the cgroup's `cpu.guarantee` is removed, tasks
/// are migrated to the main EEVDF tree.
/// **Unit convention**: Budget accounting (`budget_remaining_ns`) uses nanoseconds
/// internally, matching Linux's `cfs_rq->runtime_remaining` precision. This avoids
/// the persistent truncation bias of `delta_exec / 1000` (up to ~0.1% systematic
/// under-charge per second). The user-facing `cpu.guarantee` cgroup knob accepts
/// microseconds; the conversion (`quota_us * 1000 -> quota_ns`) happens once at
/// configuration time. Timestamps and deadlines (`deadline_ns`) also use nanoseconds
/// for compatibility with `ktime_get_ns()` and hrtimer APIs.
///
/// **Kernel-internal struct, not `#[repr(C)]`.** Layout determined by Rust compiler.
/// Never crosses KABI or wire boundaries. No `const_assert!` needed.
pub struct CbsCpuServer {
/// Number of tasks currently CBS-throttled in this cgroup on this CPU.
/// Incremented when a task exhausts its CBS budget and is dequeued from
/// the EEVDF tree (throttle path). Decremented on replenishment wake
/// (timer fires, budget restored) or task exit (`do_exit` →
/// `sched_task_exit`). Used to determine whether the per-CPU
/// replenishment timer can be cancelled: when `nr_throttled_tasks`
/// reaches 0, no tasks on this CPU need replenishment and the timer
/// is disarmed to avoid spurious wakeups.
pub nr_throttled_tasks: AtomicU32,
/// Budget remaining on this CPU for this cgroup (nanoseconds).
/// Signed: CBS allows transient overspend (tick granularity > remaining
/// budget). When negative:
/// 1. Server is throttled, tasks dequeued from EEVDF tree.
/// 2. Scheduler attempts atomic steal from sibling CPUs (see below).
/// 3. If steal fails, tasks remain throttled until replenishment.
/// 4. Deficit carries forward: replenishment adds `share_ns` to current
/// budget, so negative balance reduces next period's effective budget.
///
/// **Deficit cap**: clamped to `-(quota_ns as i64)` to bound recovery
/// time to at most one period.
///
/// **Precision**: All budget arithmetic uses integer nanoseconds (i64).
/// No fixed-point or floating-point — integer subtraction and addition
/// are exact. Nanosecond precision eliminates the truncation bias that
/// microsecond accounting would introduce (up to 999 ns lost per tick).
/// Tick granularity (typically 1-4 ms) bounds the maximum overshoot per
/// period to one tick worth of CPU time.
pub budget_remaining_ns: AtomicI64,
/// CBS deadline for this server (absolute nanoseconds). When budget
/// is exhausted and replenished, deadline is pushed by one period.
/// EDF ordering: the server with the earliest deadline is replenished
/// first when multiple servers compete on the same CPU.
pub deadline_ns: AtomicU64,
/// Weight of runnable tasks on this CPU for this cgroup. Used to
/// compute proportional share at replenishment.
pub local_weight: AtomicU64,
/// Whether this server is currently CBS-throttled (guarantee budget
/// exhausted). When true, the server's tasks are dequeued from the
/// CBS EEVDF tree. Cleared on replenishment or successful steal.
///
/// **Layout note**: `throttled` and `max_throttled` are grouped together
/// at the end of the struct to avoid padding between AtomicBool (align 1)
/// and AtomicU64 (align 8) fields. Both are read together on the CBS
/// pick path, so co-locating them improves cache efficiency.
pub throttled: AtomicBool,
/// Whether this cgroup is cpu.max-throttled (ceiling quota exhausted).
/// cpu.max is always the hard ceiling — when `max_throttled` is true,
/// the CBS server is skipped in `pick_next_task()` even if CBS budget
/// remains. CBS budget consumption is frozen while max-throttled: no
/// replenishment timers fire, no runtime is charged. When cpu.max
/// unthrottles the cgroup (period boundary), CBS resumes with its
/// current budget.
///
/// Set by the cpu.max bandwidth enforcement path (same mechanism as
/// Linux CFS bandwidth throttling). Cleared by the cpu.max period
/// timer. The CBS replenishment timer is armed/re-armed only when
/// `max_throttled` is false.
pub max_throttled: AtomicBool,
/// Cumulative runtime consumed this period (nanoseconds). For
/// `cpu.stat` accounting (converted to microseconds at read time).
pub consumed_ns: AtomicU64,
/// Per-CPU replenishment timer. Fires at `deadline_ns`.
pub replenish_timer: HrTimer,
/// Local EEVDF tree for tasks in this CBS-guaranteed cgroup on this
/// CPU. Tasks are enqueued here (not in the main per-CPU EEVDF tree)
/// while the cgroup has an active `cpu.guarantee`. `pick_next_task()`
/// selects the eligible task from this tree when this server wins
/// EDF arbitration. When the cgroup's guarantee is removed, tasks
/// are migrated back to the main EEVDF tree.
///
/// **Relationship with `GroupEntity`** ([Section 7.2](#heterogeneous-cpu-scheduling--hierarchical-eevdf-via-group-scheduling-entities)):
/// `CbsCpuServer.tree` is a **flat** EEVDF tree containing only
/// leaf tasks (no nested group entities). It is separate from the
/// hierarchical EEVDF tree managed by `GroupEntity`. `GroupEntity`
/// implements proportional weight sharing across cgroups in the main
/// EEVDF tree; `CbsCpuServer` provides CBS minimum-bandwidth
/// guarantees as an orthogonal mechanism. A cgroup may have both
/// a `cpu.weight` (`GroupEntity` in the main tree) and a
/// `cpu.guarantee` (`CbsCpuServer` with its own tree). Tasks with
/// an active CBS guarantee are placed in the CBS tree, not in the
/// `GroupEntity.child_rq`.
///
/// Uses `VruntimeTree` directly ([Section 7.1](#scheduler)) — the shared base
/// type containing the augmented RB tree and two-accumulator state.
/// All accumulator-only EEVDF helpers (`avg_vruntime_update`,
/// `entity_key`, `__enqueue_entity`, `__dequeue_entity`) operate
/// on `&VruntimeTree` and work identically on this CBS sub-tree.
/// Root-only fields (`curr`, `next`, `bandwidth_timer`) are not
/// present — CBS has its own `replenish_timer` and the currently
/// running task is tracked solely by the root per-CPU
/// `EevdfRunQueue.curr`.
pub tree: VruntimeTree,
}
/// **Throttle state summary**: A task is throttled if `throttled || max_throttled`.
/// - `throttled` = CBS guarantee budget exhausted. Cleared by replenishment or steal.
/// - `max_throttled` = cgroup `cpu.max` hard ceiling. Cleared by period boundary timer.
/// - `OnRqState::CbsThrottled` = task-level state reflecting either condition.
/// Unthrottle checks both flags independently: clearing one does not unthrottle
/// if the other is still set.
///
/// **Weight accounting during throttle/unthrottle**: When a task enters
/// `CbsThrottled`, it is dequeued from the CBS server's `VruntimeTree` —
/// the tree's `sum_weight` accumulator is decremented by the task's weight.
/// `CbsGroupConfig.total_weight` is unchanged: the task still belongs to the
/// cgroup, it is just not schedulable. On replenishment, the task is re-enqueued
/// into the `VruntimeTree` (`sum_weight` incremented). This ensures that
/// proportional share calculations in `cbs_replenish()` reflect only runnable
/// (non-throttled) tasks when computing `local_share = quota_us * local_w / total_w`.
///
/// **GroupEntity weight invariant**: CBS throttling does NOT affect the cgroup's
/// `GroupEntity` weight in the hierarchical scheduler. The `GroupEntity` continues
/// contributing its weight to the parent EEVDF tree — only the individual task is
/// dequeued from the `CbsCpuServer`'s `VruntimeTree`. On replenishment, the task
/// re-enters the CBS server's tree with its original weight.
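The two-flag throttle rule above can be sketched with plain `AtomicBool`s. This is an illustrative stand-in — `ThrottleFlags` and its methods are hypothetical, not the actual `CbsCpuServer` API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the two throttle flags on a CBS server.
pub struct ThrottleFlags {
    pub throttled: AtomicBool,     // CBS guarantee budget exhausted
    pub max_throttled: AtomicBool, // cpu.max hard ceiling exhausted
}

impl ThrottleFlags {
    pub fn new() -> Self {
        Self {
            throttled: AtomicBool::new(false),
            max_throttled: AtomicBool::new(false),
        }
    }

    /// A task in this server is throttled if EITHER flag is set.
    pub fn is_throttled(&self) -> bool {
        self.throttled.load(Ordering::Acquire) || self.max_throttled.load(Ordering::Acquire)
    }

    /// Clearing one flag does not unthrottle if the other is still set.
    /// Returns true only if the server is now fully unthrottled.
    pub fn clear_cbs_throttle(&self) -> bool {
        self.throttled.store(false, Ordering::Release);
        !self.is_throttled()
    }
}
```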
7.6.4.1.1 CBS Replenishment (per-CPU, no global pool)¶
Each CbsCpuServer has its own replenishment timer. There is no global pool
and no global lock. Budget is distributed proportionally and reclaimed via
atomic steal.
Period boundary replenishment (per-CPU timer fires at deadline_ns):
fn cbs_replenish(server: &CbsCpuServer, config: &CbsGroupConfig):
// Proportional share for this CPU.
// Memory ordering rationale: local_weight and total_weight are updated
// on the slow path (cgroup migration, weight change) and read here on
// the timer path. Relaxed is safe because stale values cause only a
// transient proportional error (corrected on the next period). No
// invariant depends on these loads being sequentially consistent.
local_w = server.local_weight.load(Relaxed)
total_w = config.total_weight.load(Relaxed)
// Compute this CPU's proportional share in nanoseconds.
// quota_us is in microseconds (cgroup API unit); convert to ns for budget math.
share_ns = if total_w > 0 { config.quota_us * 1000 * local_w / total_w } else { 0 }
// **Transient weight staleness**: Under rapid cgroup migration (many tasks
// moving between CPUs simultaneously), both local_weight and total_weight
// may be transiently stale. Example: a weight-1024 task migrates from CPU A
// to CPU B. If CPU A's replenishment timer fires before local_weight is
// updated, CPU A computes a larger-than-deserved share (includes the departed
// task's weight). Meanwhile CPU B's share does not yet include the arriving
// task. For one period, the total distributed bandwidth may exceed quota_us
// by up to the migrating task's proportional share (e.g., ~1-2% of CPU time
// for a 40% guarantee). This is self-correcting: the next replenishment
// period reads updated weights and normalizes. CBS theory absorbs transient
// bandwidth errors — no correctness invariant is violated.
// Replenish: add share to current budget (may be negative from deficit).
// Relaxed: budget_remaining_ns is only read by this CPU's scheduler tick
// and by cbs_charge() on this CPU (both are serialized by the rq lock).
old = server.budget_remaining_ns.load(Relaxed)
// Deficit cap: clamp to at most one period's deficit (converted to ns).
clamped = max(old, -(config.quota_us as i64 * 1000))
server.budget_remaining_ns.store(clamped + share_ns as i64, Relaxed)
// Advance deadline by one period.
server.deadline_ns.fetch_add(config.period_us * 1000, Relaxed)
// Un-throttle if budget is now positive.
// Release ordering: the store to `throttled` must be visible AFTER the
// budget and deadline updates above. Other CPUs that read `throttled`
// with Acquire will see the updated budget/deadline values.
if clamped + (share_ns as i64) > 0:
server.throttled.store(false, Release)
// Re-enqueue all CbsThrottled tasks for this cgroup on this CPU.
// Dequeue-first semantics: each task is removed from the throttled
// list BEFORE being placed on the runqueue. This prevents double-
// enqueue if a concurrent SIGKILL delivery also tries to enqueue
// the task (signal delivery checks OnRqState before re-enqueue).
//
// migration_pending check: skip tasks with migration_pending == true.
// A task whose migration is in flight will be enqueued on its
// destination CPU by the migration completion path. Enqueuing it
// here would cause a double-enqueue race: the migration path also
// calls enqueue_task() on the destination CPU after the task has
// been dequeued from this CPU. The migration completion path is
// responsible for clearing CbsThrottled state on the destination.
cbs_unthrottle_tasks(this_cpu, cgroup_id)
// cbs_unthrottle_tasks pseudocode:
// for task in throttled_list(this_cpu, cgroup_id):
// if task.migration_pending:
// continue // migration path will handle re-enqueue
// task.on_rq_state.store(Queued, Release)
// throttled_list.remove(task)
// enqueue_task(task)
rearm server.replenish_timer to fire at new deadline_ns
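The budget arithmetic of the replenishment step (proportional share, one-period deficit clamp, deadline advance) can be checked in isolation. A minimal sketch with plain integers standing in for the atomics; `replenish_step` is an illustrative helper, not the kernel function:

```rust
/// One replenishment step: returns (new_budget_ns, new_deadline_ns).
/// Mirrors cbs_replenish(): share = quota * local_w / total_w, the
/// carried deficit is clamped to one period's quota, and the deadline
/// advances by exactly one period.
pub fn replenish_step(
    budget_ns: i64, // may be negative (deficit carried from last period)
    deadline_ns: u64,
    quota_us: u64,
    period_us: u64,
    local_w: u64,
    total_w: u64,
) -> (i64, u64) {
    // Proportional share for this CPU, in nanoseconds.
    let share_ns = if total_w > 0 {
        (quota_us * 1000 * local_w / total_w) as i64
    } else {
        0
    };
    // Deficit cap: at most one period's quota of deficit is carried forward.
    let clamped = budget_ns.max(-((quota_us * 1000) as i64));
    (clamped + share_ns, deadline_ns + period_us * 1000)
}
```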
SIGKILL interaction with CBS throttle: A SIGKILL to a CBS-throttled task must
not cause double-enqueue. The invariant is enforced by OnRqState transitions:
fn signal_wake_up_throttled(task: &Task):
// Atomically transition from CbsThrottled → Queued.
// If the CAS fails, the task was already unthrottled by
// cbs_replenish() — no action needed (the task is already
// on the runqueue or running).
if task.on_rq_state.compare_exchange(
OnRqState::CbsThrottled,
OnRqState::Queued,
AcqRel, Relaxed
).is_ok():
// We won the race: remove from throttled list, enqueue on runqueue.
cbs_throttled_list.remove(task)
enqueue_task(task)
// If CAS failed: cbs_replenish() already handled it. No-op.
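The CbsThrottled → Queued transition can be modelled with a single `AtomicU8` CAS. The numeric encoding of `OnRqState` below is a hypothetical stand-in chosen for illustration:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical numeric encoding of OnRqState for this sketch.
pub const QUEUED: u8 = 0;
pub const CBS_THROTTLED: u8 = 1;

/// Returns true if this caller won the race and must perform the
/// throttled-list removal + runqueue enqueue; false means
/// cbs_replenish() already unthrottled the task (no-op).
pub fn signal_wake_up_throttled(state: &AtomicU8) -> bool {
    state
        .compare_exchange(CBS_THROTTLED, QUEUED, Ordering::AcqRel, Ordering::Relaxed)
        .is_ok()
}
```

Exactly one of the two racing paths (signal delivery vs. replenishment) observes a successful CAS, which is what rules out the double-enqueue.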
Atomic steal (when a CPU exhausts its budget before the period ends):
fn cbs_try_steal(server: &CbsCpuServer, cgroup_id: CgroupId) -> bool:
// Scan sibling CPUs on the same NUMA node first, then remote.
for sibling_cpu in steal_order(this_cpu):
if let Some(donor) = sibling_cpu.cbs_servers.get(cgroup_id):
if donor.throttled.load(Relaxed):
continue // Already exhausted.
// Try to steal half of the donor's remaining budget.
// Bounded retry: 3 CAS attempts per donor. If all fail (high
// contention — many CPUs targeting the same donor), move to the
// next donor in steal_order(). This bounds per-donor spin to
// ~600ns worst-case while preserving steal success rate.
const CBS_STEAL_CAS_RETRIES: usize = 3;
for _ in 0..CBS_STEAL_CAS_RETRIES:
let avail = donor.budget_remaining_ns.load(Acquire)
if avail <= 0:
break // Nothing to steal.
let steal = avail / 2
if steal == 0:
break
if donor.budget_remaining_ns.compare_exchange(
avail, avail - steal, AcqRel, Relaxed).is_ok():
server.budget_remaining_ns.fetch_add(steal, Relaxed)
server.throttled.store(false, Release)
return true
// CAS contention — move to next donor in steal_order().
false // No budget available anywhere — remain throttled.
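The steal-half-with-bounded-retries loop can be extracted as a standalone sketch over a bare `AtomicI64` standing in for the donor's `budget_remaining_ns`:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

const CBS_STEAL_CAS_RETRIES: usize = 3;

/// Try to steal half of the donor's remaining budget with bounded CAS
/// retries. Returns the amount stolen (0 if nothing could be taken).
pub fn try_steal_half(donor: &AtomicI64) -> i64 {
    for _ in 0..CBS_STEAL_CAS_RETRIES {
        let avail = donor.load(Ordering::Acquire);
        if avail <= 0 {
            return 0; // Donor is already exhausted.
        }
        let steal = avail / 2;
        if steal == 0 {
            return 0; // 1 ns left: not worth taking.
        }
        if donor
            .compare_exchange(avail, avail - steal, Ordering::AcqRel, Ordering::Relaxed)
            .is_ok()
        {
            return steal;
        }
        // CAS contention: retry up to the bound, then give up on this donor.
    }
    0
}
```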
CBS budget charging (called from update_curr() on every scheduler tick
and on dequeue when a CBS-guaranteed task has been running):
fn cbs_charge(server: &CbsCpuServer, config: &CbsGroupConfig, delta_ns: u64):
// Decrement this CPU's budget by the consumed runtime (nanoseconds).
// Relaxed ordering rationale: the local rq lock serializes this
// CPU's charge path (scheduler tick) and replenishment timer. However,
// budget_remaining_ns is ALSO modified by cbs_try_steal() from remote
// CPUs via CAS (without holding this CPU's rq lock). The Relaxed
// ordering here is still correct because: (a) fetch_sub is atomic
// regardless of ordering, (b) the rq lock provides happens-before for
// local reads of the updated value, and (c) steal's CAS provides its
// own atomicity guarantee for the remote modification. A steal that
// races with a local charge may observe a slightly stale value, but
// the CAS loop in cbs_try_steal() retries on failure, and any
// resulting budget imprecision is bounded to one tick's worth of
// runtime (~1-4 ms), corrected at the next replenishment.
let prev = server.budget_remaining_ns.fetch_sub(delta_ns as i64, Relaxed)
// Note: `new` is computed locally from `prev`. A concurrent steal may
// add budget between `fetch_sub` and the `if new <= 0` check below.
// In that case, `new` is negative but the actual `budget_remaining_ns`
// is now positive (steal added budget). The code enters the throttle
// path, attempts `cbs_try_steal()` (which succeeds since budget is
// positive), and un-throttles immediately. This is a benign race:
// a spurious throttle-steal-unthrottle cycle with no correctness impact.
let new = prev - delta_ns as i64
// Also charge consumed_ns for accounting (cpu.stat, reported as us to userspace).
// Note: consumed_ns may slightly over-count vs budget_remaining_ns in the
// steal race window. Between the fetch_sub above and this fetch_add, a
// remote steal CAS (cbs_try_steal) could reclaim budget, causing
// consumed_ns to account the full delta_ns even though some budget was
// "reclaimed" by the steal. This is a statistics-only discrepancy (no
// scheduling correctness impact): bounded to one tick's worth (~1-4 ms),
// corrected at the next replenishment. Same approximate accounting as Linux.
server.consumed_ns.fetch_add(delta_ns, Relaxed)
// Check if budget is exhausted.
if new <= 0 && !server.throttled.load(Relaxed):
// Attempt atomic steal from sibling CPUs before throttling.
if !cbs_try_steal(server, cgroup_id):
// Steal failed — throttle this server.
server.throttled.store(true, Release)
// Dequeue all tasks in this cgroup on this CPU from the
// CBS server's EEVDF tree. Set OnRqState to CbsThrottled.
cbs_throttle_tasks(this_cpu, cgroup_id)
// Request immediate reschedule — the currently running task
// must yield. Without this, the running task continues for up
// to one full tick (1-4 ms) past CBS budget exhaustion,
// violating the CBS bandwidth guarantee. The resched_curr
// urgency table entry: "CBS budget exhaustion | Eager".
resched_curr(rq, ReschedUrgency::Eager)
// Arm replenishment timer if not already armed.
if !server.replenish_timer.is_armed():
arm_hrtimer(server.replenish_timer, server.deadline_ns.load(Relaxed))
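The charge-then-check pattern at the top of `cbs_charge()` reduces to a small sketch; `charge_budget` is an illustrative helper. Note that `new` is derived from the `fetch_sub` return value, so a concurrent steal can make the exhaustion check spuriously true, as discussed above:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Charge delta_ns against the budget. Returns true if the budget is now
/// exhausted and the caller should attempt a steal (and throttle on
/// failure). `new` is computed locally from the fetch_sub return value.
pub fn charge_budget(budget: &AtomicI64, delta_ns: u64) -> bool {
    let prev = budget.fetch_sub(delta_ns as i64, Ordering::Relaxed);
    let new = prev - delta_ns as i64;
    new <= 0
}
```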
Relationship with cpu.max charging: CBS guarantee charging and cpu.max
ceiling charging are independent. A single scheduler tick for a CBS-guaranteed
task calls both cbs_charge() (decrements the CBS server's per-CPU budget) and
charge_cpu_max() (decrements the cgroup's global cpu.max budget). If either
budget is exhausted, the task is throttled. The max_throttled and throttled
flags are checked independently in pick_next_task().
Key properties of the per-CPU CBS model:
- No global pool lock: all budget operations are per-CPU atomics or CAS on sibling CPUs.
- No 1ms residual stranding: idle servers are steal targets, not locked pools.
- No thundering herd: each CPU has its own replenishment timer at its own deadline.
- Proportional rebalance: at each replenishment, share is recalculated from live weights.
- NUMA-aware steal: local node first, reducing cross-node atomic CAS traffic.
Lock ordering for CBS operations:

| Operation | Lock held | Reason |
|---|---|---|
| `cbs_charge()` (scheduler tick) | Local `RunQueue.lock` (level 50) | Called from `update_curr()` under rq lock |
| `cbs_replenish()` (hrtimer) | Local `RunQueue.lock` (level 50) | Timer handler acquires rq lock before modifying server state |
| `cbs_try_steal()` (steal from sibling) | Local `RunQueue.lock` only | Reads sibling's `budget_remaining_ns` via atomic CAS — no lock on sibling's rq. The CAS provides atomicity for the budget transfer. |
| `cbs_throttle_tasks()` | Local `RunQueue.lock` (level 50) | Modifies local EEVDF tree and task `OnRqState` |
| `cbs_unthrottle_tasks()` | Local `RunQueue.lock` (level 50) | Re-inserts tasks into local EEVDF tree |
| `cpu_guarantee_write()` | Cgroup config lock + CAS on `system.total_guaranteed` | No rq lock held — only modifies cgroup config. Per-CPU servers read config at next replenishment under their own rq lock. |
| `charge_cpu_max()` (cpu.max tick) | Local `RunQueue.lock` (level 50) | Atomics on global `runtime_remaining`; rq lock for task state changes |
Critical invariant: The RunQueue.lock is NEVER held across CPUs for CBS
operations. cbs_try_steal() accesses sibling CPUs' CbsCpuServer fields via
atomic operations only — it does NOT acquire the sibling's runqueue lock. This
eliminates cross-CPU lock contention on the CBS hot path.
Cgroup task migration (task moves from CPU A → CPU B):
1. Dequeue from CPU A: A.server.local_weight -= task.weight. If last task,
server becomes a steal donor (budget remains, accessible via CAS).
2. Enqueue on CPU B: if no CbsCpuServer for this cgroup on B, create one with
budget = cbs_try_steal(). Set B.server.local_weight += task.weight.
- Zero-budget edge case: If cbs_try_steal() fails (no budget available
on any sibling), the new server is created with budget_remaining_ns = 0
and throttled = true. The task is enqueued as OnRqState::CbsThrottled.
The replenishment timer is armed for the next period boundary
(deadline_ns = ktime_get_ns() + config.period_us * 1000). This means the
migrated task waits at most one period before receiving its first budget
allocation. This is the correct CBS behavior: the server starts fresh with
a new deadline and receives its proportional share at that deadline.
3. Proportional shares are recalculated at next replenishment (no immediate rebalance).
This ensures both exhaustion-triggered and timer-triggered replenishment follow
the same CBS invariant: the server's deadline advances by exactly one period per
replenishment, bounding the guaranteed bandwidth to quota_us / period_us over
any sliding window.
Integration with existing scheduler (Section 7.1):
Per-CPU run queue structure (updated):
+------------------+
| RT Queue | <- Highest priority (unchanged)
+------------------+
| DL Queue | <- Deadline tasks (unchanged)
+------------------+
| CBS Group Servers| <- NEW: CBS servers for guaranteed groups
| +-- db_server | Each server has its own EEVDF tree inside
| +-- web_server |
+------------------+
| EEVDF Tree | <- Normal tasks without guarantee (unchanged)
+------------------+
Scheduling decision:
1. Check RT queue (highest priority) — unchanged.
2. Check DL queue (deadline tasks) — unchanged.
3. Check CBS group servers (ordered by earliest deadline):
- If a server has budget and runnable tasks: pick its next task.
- CBS guarantees each server receives its bandwidth.
- Iteration: CBS servers are stored in RunQueueData.cbs_servers
(XArray keyed by CgroupId). XArray provides O(1) lookup by cgroup ID
but does NOT support deadline-ordered iteration natively. The
pick_next_cbs_task() function performs a linear scan of all
CBS servers on this CPU to find the one with the earliest deadline_ns
that has budget and runnable tasks. For typical server counts (N < 20
CBS-guaranteed cgroups with tasks on a single CPU), this is ~20
comparisons = ~100 cycles, which is acceptable within the scheduler
tick budget. If N exceeds 20 (unlikely in practice — it requires 20+
simultaneously-active guaranteed cgroups on one CPU), a deadline-sorted
intrusive list maintained alongside the XArray should be added as an
optimization. The XArray remains the authoritative store (keyed by
CgroupId for O(1) lookup during enqueue/dequeue); the intrusive list
provides O(1) earliest-deadline access for pick_next_task().
4. Check EEVDF tree (normal tasks without guarantee) — unchanged.
Unguaranteed tasks (step 4) run when all CBS servers are idle or throttled. In addition, CBS servers that under-utilize their budget donate the slack to step 4 (the scheduler is work-conserving).
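The earliest-deadline linear scan in step 3 can be sketched as a filter-then-min over a slice. The `Server` struct here is a simplified stand-in for `CbsCpuServer`:

```rust
/// Minimal model of a CBS server for the pick path (hypothetical fields).
pub struct Server {
    pub deadline_ns: u64,
    pub has_budget: bool,   // !throttled && !max_throttled
    pub has_runnable: bool, // EEVDF sub-tree non-empty
}

/// Linear scan for the eligible server with the earliest deadline,
/// mirroring the O(N) scan pick_next_cbs_task() performs over the XArray.
pub fn pick_earliest(servers: &[Server]) -> Option<usize> {
    servers
        .iter()
        .enumerate()
        .filter(|(_, s)| s.has_budget && s.has_runnable)
        .min_by_key(|(_, s)| s.deadline_ns)
        .map(|(i, _)| i)
}
```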
7.6.5 Overcommit Prevention¶
// umka-core/src/sched/cbs_group.rs
/// System-wide guarantee accounting.
pub struct BandwidthAccounting {
/// Total guaranteed bandwidth across all CBS servers.
/// Stored as fixed-point fraction scaled by `BW_SCALE` (1 << 20 = 1_048_576).
/// Example: 40% bandwidth = `(40 * BW_SCALE) / 100 = 419_430`.
pub total_guaranteed: AtomicU64,
/// Maximum allowable guarantee (default: 95%).
/// Reserves 5% for kernel threads, interrupts, housekeeping.
pub max_guarantee: u64,
/// Per-NUMA-node guaranteed bandwidth (for NUMA-aware scheduling).
/// Allocated at boot via `Box::new_zeroed_slice(numa_node_count)`
/// once NUMA topology is discovered (Section 4.9). Owned by the
/// cgroup subsystem for the lifetime of the cgroup hierarchy.
pub per_node_guaranteed: Box<[AtomicU64]>,
}
Setting cpu.guarantee fails with -ENOSPC if total_guaranteed + new_guarantee would
exceed max_guarantee. This prevents overcommit and guarantees all promises are
simultaneously satisfiable.
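The fixed-point admission arithmetic follows directly from the `BW_SCALE` definition above. A sketch with hypothetical helper names:

```rust
const BW_SCALE: u64 = 1 << 20; // 1_048_576, matching the struct docs

/// Fixed-point bandwidth of a quota/period pair, scaled by BW_SCALE.
pub fn bw(quota_us: u64, period_us: u64) -> u64 {
    quota_us.saturating_mul(BW_SCALE) / period_us
}

/// Admission check: would adding new_bw exceed the max_guarantee cap?
pub fn admit(total_guaranteed: u64, new_bw: u64, max_guarantee: u64) -> bool {
    total_guaranteed + new_bw <= max_guarantee
}
```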
7.6.5.1 CPU Hotplug Interaction¶
When a CPU goes offline, the total system capacity decreases. This interacts with CBS guarantee admission control:
- **CPU offline**: The system recalculates `effective_capacity = online_cpu_count`. If `total_guaranteed > effective_capacity * max_guarantee_ratio`, the system is temporarily over-guaranteed. The guarantees are NOT revoked — CBS servers on the offline CPU are migrated to surviving CPUs (tasks and their `local_weight` are redistributed via the standard task migration path). The proportional share formula (`quota_us * local_w / total_w`) naturally adapts: with fewer CPUs, each surviving CPU's share increases, but the absolute guarantee bandwidth is unchanged. The CBS period timer and budget math are unaffected because they operate on system-wide totals, not per-CPU capacities.

  **Warning condition**: If the sum of guarantees exceeds what the remaining CPUs can physically deliver, guaranteed cgroups will receive less than their promised bandwidth. The kernel emits a `cbs_overcommit_warning` tracepoint and writes a rate-limited `pr_warn` to the kernel log. No guarantees are revoked automatically — the administrator (or Kubernetes) must respond to the warning.

- **CPU online**: New CPU capacity becomes available. `max_guarantee` is NOT automatically increased (it is an absolute ratio, not CPU-count-dependent). The new CPU starts with an empty `cbs_servers` XArray. As tasks are migrated or scheduled on the new CPU, `CbsCpuServer` instances are created lazily. The proportional share formula adapts at the next replenishment period.

- **Admission control during hotplug**: New `cpu.guarantee` writes continue to check against the static `max_guarantee` (default 95%). Admission control does NOT factor in the current online CPU count — it assumes all CPUs are available (matching the steady state). This prevents oscillating admission decisions when CPUs are transiently offline during maintenance.

- **CBS server cleanup on CPU offline**: All `CbsCpuServer` instances on the dying CPU are drained: throttled tasks are unthrottled and migrated, replenishment timers are cancelled, and remaining local budget is returned to sibling CPUs via `cbs_try_steal()` (reverse direction: the dying CPU donates its full remaining budget to the first sibling that accepts it). This matches the Linux pattern in `unthrottle_offline_cfs_rqs()`.
7.6.6 Interaction with Existing Controls¶
| Control | Meaning | Interaction with cpu.guarantee |
|---|---|---|
| `cpu.weight` | Relative share of excess CPU | Distributes CPU beyond guaranteed minimums |
| `cpu.max` | Maximum CPU ceiling | Guarantee cannot exceed max; max still enforced |
| `cpu.guarantee` | Minimum CPU floor | NEW: CBS-backed guaranteed bandwidth |
| `cpu.pressure` | PSI pressure info | Reports pressure relative to guarantee |
Example: a cgroup with cpu.guarantee=40%, cpu.max=60%, cpu.weight=100:
- Always gets at least 40% CPU (even under full system load)
- Never gets more than 60% CPU (even if system is idle — ceiling applies)
- Between 40% and 60%, shares proportionally with other cgroups by weight
7.6.7 Nested Cgroup Hierarchy¶
A child cgroup's cpu.guarantee draws from its parent's guarantee budget. The
kernel enforces this at write time — it is not merely advisory.
Admission control for nested guarantees:
fn cpu_guarantee_write(cgrp: &Cgroup, new_quota_us: u64, period_us: u64) -> Result<()> {
// All bandwidth comparisons use fixed-point arithmetic: bandwidth is
// represented as (quota_us * BW_SCALE) / period_us, yielding a u64
// in the range [0, BW_SCALE]. This avoids FPU save/restore in kernel
// context and eliminates negative-f64-to-u64 truncation bugs.
//
// BW_SCALE = 1 << 20 (1_048_576), matching Linux's BW_UNIT = 1 << BW_SHIFT.
// CBS and DL admission use the SAME scale factor for direct comparison
// in the system-wide overcommit check (total_guaranteed + dl_bandwidth < capacity).
const BW_SCALE: u64 = 1 << 20;
let new_bw = new_quota_us.saturating_mul(BW_SCALE) / period_us;
// 1. Check against cpu.max ceiling (guarantee cannot exceed max).
let max_us = cgrp.cpu.as_ref().map(|c| c.max_us.load(Relaxed)).unwrap_or(u64::MAX);
if max_us != u64::MAX {
let max_period = cgrp.cpu.as_ref().unwrap().period_us.load(Relaxed);
let max_bw = max_us.saturating_mul(BW_SCALE) / max_period;
if new_bw > max_bw { return Err(EINVAL); }
}
// 2. Check against parent's remaining guarantee budget.
if let Some(parent) = cgrp.parent() {
let parent_cfg = &parent.cpu.cbs;
if parent_cfg.is_none() {
// Parent has no guarantee — child cannot have one either.
// Exception: root cgroup implicitly has 100% guarantee.
if !parent.is_root() { return Err(EINVAL); }
} else {
let parent_quota = parent_cfg.unwrap().quota_us;
let parent_period = parent_cfg.unwrap().period_us;
let parent_bw = parent_quota.saturating_mul(BW_SCALE) / parent_period;
// Sum of existing children's guarantees (excluding this cgroup).
// children_guarantee_sum_excluding() returns fixed-point (scaled by BW_SCALE).
let siblings_bw = parent.children_guarantee_sum_excluding(cgrp);
if siblings_bw + new_bw > parent_bw {
return Err(ENOSPC); // Would overcommit parent's budget.
}
}
}
// 3. Check-and-commit system-wide utilization cap (CAS loop).
// total_guaranteed and max_guarantee are both stored in BW_SCALE units.
//
// This MUST be an atomic compare-and-swap loop, not a Relaxed load followed
// by a separate fetch_add. With Relaxed check-then-commit, two concurrent
// writers can both read the same `total`, both pass the overcommit check,
// and both commit — exceeding max_guarantee. The CAS loop ensures exactly
// one writer succeeds per slot of available bandwidth.
let system = &BANDWIDTH_ACCOUNTING;
let current_bw = cgrp.current_guarantee_bw(); // fixed-point, BW_SCALE units
if new_bw >= current_bw {
let increase = new_bw - current_bw;
loop {
let total = system.total_guaranteed.load(Acquire);
if total + increase > system.max_guarantee {
return Err(ENOSPC);
}
if system.total_guaranteed.compare_exchange(
total, total + increase, AcqRel, Acquire
).is_ok() {
break;
}
// CAS failed: another writer committed first. Re-read and retry.
}
} else {
// Shrinking: total decreases — always succeeds, no overcommit risk.
// fetch_sub is safe because we hold the cgroup's config lock, and
// current_bw was committed by a prior successful CAS.
let decrease = current_bw - new_bw;
system.total_guaranteed.fetch_sub(decrease, Release);
}
// 4. Commit: update cgroup config.
    cgrp.cpu.cbs = Some(CbsGroupConfig { quota_us: new_quota_us, period_us, .. });
Ok(())
}
Key invariants:
- sum(children.guarantee) <= parent.guarantee at every level.
- Root cgroup has an implicit 100% guarantee (all CPU bandwidth).
- Removing a child's guarantee immediately frees budget for siblings.
- Guarantee bandwidth is expressed in absolute terms (quota/period), not
relative weights. This makes nesting arithmetic straightforward.
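The per-level nesting invariant reduces to a one-line check over fixed-point bandwidths. A sketch; `children_fit` is an illustrative helper, not the kernel's `children_guarantee_sum_excluding()`:

```rust
/// Nesting invariant check: the sum of the existing children's guarantee
/// bandwidths plus the proposed new child bandwidth must not exceed the
/// parent's guarantee (all values in BW_SCALE fixed-point units).
pub fn children_fit(parent_bw: u64, sibling_bws: &[u64], new_child_bw: u64) -> bool {
    sibling_bws.iter().sum::<u64>() + new_child_bw <= parent_bw
}
```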
7.6.8 cpu.max vs cpu.guarantee Interaction¶
cpu.max (ceiling) and cpu.guarantee (floor) can be set simultaneously.
The interaction is deterministic:
- **cpu.max always wins.** If the cgroup's cpu.max quota is exhausted, all tasks are max-throttled regardless of CBS guarantee budget. The CBS server's `max_throttled` flag is set, and `pick_next_task()` skips the server.

- **CBS budget is frozen during max-throttle.** While `max_throttled` is true:
  - No CBS runtime is charged (tasks aren't running).
  - The CBS replenishment timer is not re-armed.
  - CBS budget remains at its current value.

- **When cpu.max unthrottles** (the cpu.max period boundary timer fires):
  - `max_throttled` is cleared.
  - The CBS replenishment timer is re-armed if not already armed.
  - Tasks become eligible for CBS pick in the next `pick_next_task()`.

- **Validation (guarantee write)**: A `cpu.guarantee` write is rejected with `-EINVAL` if the guarantee bandwidth would exceed `cpu.max` bandwidth. The constraint `guarantee_bw <= max_bw` is checked at write time.

- **Validation (max write — reverse constraint)**: A `cpu.max` write is rejected with `-EINVAL` if the new max bandwidth would fall below the existing `cpu.guarantee` bandwidth. This prevents the system from entering an inconsistent state where `guarantee > max`. The check:

  fn cpu_max_write(cgrp: &Cgroup, new_max_us: u64, period_us: u64) -> Result<()> {
      // ... existing cpu.max validation ...
      if let Some(cbs) = &cgrp.cpu.cbs {
          let guarantee_bw = cbs.quota_us.saturating_mul(BW_SCALE) / cbs.period_us;
          let max_bw = new_max_us.saturating_mul(BW_SCALE) / period_us;
          if max_bw < guarantee_bw {
              return Err(EINVAL); // Cannot lower max below guarantee.
          }
      }
      // ... commit cpu.max change ...
  }

  **Invariant**: At all times, `guarantee_bw <= max_bw` for every cgroup. Both write paths enforce their respective side of this constraint.
State diagram per CBS server:
┌──────────────────┐
│ RUNNING │
│ (budget > 0, │
│ !throttled, │
│ !max_throttled)│
└─┬──────────┬────┘
CBS budget │ │ cpu.max quota
exhausted │ │ exhausted
▼ ▼
┌──────────────┐ ┌──────────────┐
│ CBS_THROTTLED│ │ MAX_THROTTLED│
│ (throttled) │ │ (max_throttled)│
└──────┬───────┘ └──────┬───────┘
replenish │ cpu.max │
or steal │ period │
▼ boundary ▼
┌──────────────┐ ┌──────────────┐
│ RUNNING │ │ RUNNING │
└──────────────┘ │ (or CBS_ │
│ THROTTLED │
│ if budget=0)│
└──────────────┘
7.6.9 Use Case: Driver Tier Isolation¶
CPU guarantees integrate naturally with the driver tier model:
# Ensure Tier 1 drivers always have CPU bandwidth for I/O processing
echo "200000 1000000" > /sys/fs/cgroup/umka-tier1/cpu.guarantee  # 20%
# Ensure Tier 2 drivers have some guaranteed bandwidth
echo "50000 1000000" > /sys/fs/cgroup/umka-tier2/cpu.guarantee   # 5%
A misbehaving Tier 2 driver process spinning in a loop cannot starve Tier 1 NVMe or NIC drivers of CPU time.
7.6.10 CBS Task Migration Between Cores¶
When a task inside a CBS bandwidth server is migrated to another CPU (via load balancing or explicit affinity change), the EEVDF scheduling parameters must be translated to the destination run queue's context:
- **vruntime adjustment**: The migrated task's `vruntime` is rewritten relative to the destination CBS server's `zero_vruntime` (Section 7.1):

  // src_tree / dst_tree are the VruntimeTree in each CPU's CbsCpuServer.
  vruntime_offset = task.vruntime as i64 - src_tree.zero_vruntime as i64
  task.vruntime = (dst_tree.zero_vruntime as i64 + vruntime_offset) as u64

  `zero_vruntime` tracks close to `avg_vruntime` (the weighted average virtual runtime), NOT the minimum vruntime. This preserves the task's relative position in virtual time on the destination CPU. A task that was "behind" (vruntime < avg) on the source remains "behind" on the destination. Note: `vruntime_offset` is distinct from the EEVDF `task.vlag` field (which is the signed, unweighted quantity `avg_vruntime - vruntime`).

- **Eligibility is computed dynamically** — there is no stored `eligible_vtime` field. After adjusting `vruntime`, the task's `vlag` is preserved from the dequeue on the source CPU (set by `update_entity_lag()`). On the destination, `place_entity()` uses the saved `vlag` to position the task correctly in virtual time: `se.vruntime = avg_vruntime(dst_rq) - inflated_vlag`. The task is then inserted into the CBS server's `tree` (keyed by `vdeadline`). Eligibility is determined dynamically by `pick_eevdf()` via `vruntime_eligible()`.

- **CBS server state**: The CBS server's `budget_remaining_ns`, `deadline_ns`, and replenishment timer are not adjusted — they are absolute (nanoseconds since boot) and do not depend on per-CPU vruntime. The migrated task continues consuming from the same CBS budget. If the budget was exhausted on the source CPU, the task remains throttled until the next period boundary (the timer fires on any CPU — hrtimers are per-timer, not per-CPU).

- **avg_vruntime accumulator update**: The source run queue's `sum_w_vruntime` and `sum_weight` accumulators are updated to remove the migrated task's contribution (via `avg_vruntime_update(src_rq, se, false)`); the destination's accumulators add it (via `avg_vruntime_update(dst_rq, se, true)`). This ensures `entity_eligible()` / `vruntime_eligible()` remains correct on both CPUs after migration.
This is the same protocol used for non-CBS EEVDF tasks (Section 7.1), with the CBS-specific invariant that the bandwidth budget is absolute time (not relative to any per-CPU baseline).
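The vruntime translation step reduces to signed-offset arithmetic. A standalone sketch of the rewrite described above (`translate_vruntime` is an illustrative helper):

```rust
/// Translate a migrating task's vruntime from the source tree's
/// zero_vruntime baseline to the destination's, preserving the signed
/// offset — i.e., the task's relative position in virtual time.
pub fn translate_vruntime(task_vruntime: u64, src_zero: u64, dst_zero: u64) -> u64 {
    let offset = task_vruntime as i64 - src_zero as i64; // may be negative ("behind")
    (dst_zero as i64 + offset) as u64
}
```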
7.6.11 cpu.max Ceiling Enforcement (Bandwidth Throttling)¶
cpu.max enforces a hard ceiling on CPU consumption per cgroup. Unlike cpu.guarantee
(which provides a floor via CBS), cpu.max is a ceiling that limits the
maximum CPU time a cgroup can consume in any given period. This matches Linux cgroup v2
cpu.max semantics exactly.
/// Per-cgroup cpu.max bandwidth throttle state.
///
/// Stored in `CpuController` ([Section 17.2](17-containers.md#control-groups--cpu-controller-state)).
/// Tracks the global budget for a cgroup's cpu.max enforcement across all CPUs.
///
/// Design: Unlike CBS guarantee (per-CPU servers with per-CPU budgets), cpu.max
/// uses a **global pool** with per-CPU runtime slices. This matches Linux's CFS
/// bandwidth throttling design. The global pool is necessary because cpu.max is
/// a system-wide ceiling (a cgroup limited to 200ms/1000ms should get at most
/// 200ms total across all CPUs combined).
pub struct CpuBandwidthThrottle {
/// Quota in microseconds per period. From `cpu.max` first field.
/// `u64::MAX` means unlimited (no throttling — the default).
/// Exposed to userspace as microseconds via `cpu.max` cgroup file.
pub quota_us: u64,
/// Quota in nanoseconds (= `quota_us * 1000`). Used internally by
/// `charge_cpu_max()` and replenishment to avoid truncation bias.
/// Computed once at configuration time.
pub quota_ns: u64,
/// Period in microseconds. From `cpu.max` second field (default 100_000 = 100 ms).
pub period_us: u64,
/// Runtime remaining in the current period (**nanoseconds**, signed).
/// Nanoseconds are used internally (matching CBS `budget_remaining_ns`)
/// to avoid the persistent truncation bias of `delta_exec / 1000`.
/// The userspace-visible `cpu.max` interface uses microseconds; the
/// kernel converts at configuration time: `quota_ns = quota_us * 1000`.
///
/// Signed to allow transient overshoot: a task may consume a few
/// nanoseconds beyond zero before the scheduler tick detects exhaustion.
/// The overshoot is carried forward as a deficit into the next period
/// (subtracted from the replenished quota). Clamped to `-(quota_ns as i64)`
/// to bound recovery to at most one period.
pub runtime_remaining: AtomicI64,
/// Whether all tasks in this cgroup are currently throttled (dequeued from
/// runqueues). Set when `runtime_remaining` drops to zero or below (or past
/// the burst allowance, if one is configured). Cleared when the period timer
/// fires and replenishes the quota.
pub throttled: AtomicBool,
/// Burst buffer in microseconds. Allows temporary over-budget execution
/// to absorb scheduling jitter without throttling. Set via `cpu.max.burst`.
/// Default 0. Maximum: `quota_us`. When non-zero, throttling triggers at
/// `runtime_remaining <= -burst_ns` instead of `runtime_remaining <= 0`
/// (the comparison uses the nanosecond value, matching `charge_cpu_max()`).
pub burst_us: u64,
/// Burst buffer in nanoseconds (= `burst_us * 1000`). Used internally
/// by `charge_cpu_max()`. Computed once at configuration time.
pub burst_ns: u64,
/// Number of periods that have elapsed (for `cpu.stat` accounting).
/// Statistics accessed from multiple CPUs (cgroup accounting reads,
/// timer handler writes). Relaxed ordering sufficient for statistics.
pub nr_periods: AtomicU64,
/// Number of periods in which this cgroup was throttled.
pub nr_throttled: AtomicU64,
/// Total time spent throttled (microseconds, for `cpu.stat`).
pub throttled_time_us: AtomicU64,
/// High-resolution timer that fires at each period boundary.
/// On expiry: replenishes `runtime_remaining` by `quota_ns` (less any
/// clamped deficit carried forward), clears
pub timer: HrTimerHandle,
}
Period timer replenishment:
fn cpu_max_replenish(throttle: &CpuBandwidthThrottle):
// Carry forward any overshoot as deficit, clamped to one period.
let deficit = min(0, throttle.runtime_remaining.load(Relaxed))
let clamped = max(deficit, -(throttle.quota_ns as i64))
// Replenish: add quota_ns to current remaining (which may be negative).
throttle.runtime_remaining.store(clamped + throttle.quota_ns as i64, Release)
// Un-throttle: re-enqueue all CbsThrottled/MaxThrottled tasks.
if throttle.throttled.swap(false, Release):
for cpu in online_cpus():
if let Some(server) = cpu.rq.cbs_servers.get(cgroup_id):
server.max_throttled.store(false, Release)
// Re-enqueue all tasks in this cgroup on this CPU.
cpu_max_unthrottle_tasks(cpu, cgroup_id)
// Advance period counter.
throttle.nr_periods.fetch_add(1, Relaxed);
// Re-arm timer for next period.
rearm_hrtimer(throttle.timer, throttle.period_us * 1000)
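The deficit-carry arithmetic is easy to get wrong at the clamp boundary, so it can be checked in isolation. A minimal sketch of the same computation with plain integers (the kernel uses `AtomicI64`; `replenish` here is an illustrative free function, not kernel API):

```rust
/// Compute the replenished `runtime_remaining` at a period boundary,
/// mirroring `cpu_max_replenish`: overshoot (a negative remaining value)
/// carries forward as a deficit, clamped to one period's quota so that
/// recovery takes at most one period. Unused positive budget does not
/// roll over between periods.
fn replenish(runtime_remaining_ns: i64, quota_ns: u64) -> i64 {
    let deficit = runtime_remaining_ns.min(0); // positive leftover discarded
    let clamped = deficit.max(-(quota_ns as i64)); // bound deficit to one period
    clamped + quota_ns as i64
}
```

With a 200 ms quota, a 3 ms overshoot leaves 197 ms for the next period; an arbitrarily large overshoot (e.g. after a missed tick) still replenishes to at least 0, never to a multi-period starvation.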
Runtime charging and throttle detection:
On every scheduler tick (task_tick()) and on voluntary dequeue (put_prev_task()),
the runtime consumed by the current task is charged against the cgroup's global pool:
/// **Locking**: `charge_cpu_max()` is called from `update_curr()` with the
/// local CPU's `RunQueue.lock` held (level 50). The `runtime_remaining`
/// atomic allows concurrent charging from multiple CPUs without a global
/// lock. When throttling is triggered, the iteration uses IPI-based
/// throttle broadcast instead of inline remote lock acquisition:
/// `cpu_max_throttle_tasks(cpu, cgroup_id)` sends an IPI to each remote
/// CPU, which locally throttles tasks under its own `RunQueue.lock`. This
/// avoids holding multiple rq locks simultaneously and prevents ABBA
/// deadlock with load balancing (which acquires rq locks in CPU-ID order).
/// `charge_cpu_max` accepts **nanoseconds** (matching CBS `budget_remaining_ns`
/// and `update_curr()`'s `delta_exec` which is in nanoseconds). This avoids the
/// persistent truncation bias of `delta_exec / 1000` that CBS explicitly chose
/// to avoid (see §CBS Budget Accounting).
fn charge_cpu_max(task: &Task, bw: &CpuBandwidthThrottle, delta_ns: u64):
// Unlimited — no throttling.
if bw.quota_ns == u64::MAX:
return
let prev = bw.runtime_remaining.fetch_sub(delta_ns as i64, AcqRel)
let new = prev - delta_ns as i64
// Check if budget is exhausted (accounting for burst buffer).
// burst_ns = burst_us * 1000 (converted at configuration time).
// swap(true) makes the check-and-set atomic: if two CPUs cross the
// threshold concurrently, only one performs the throttle broadcast.
if new <= -(bw.burst_ns as i64) && !bw.throttled.swap(true, AcqRel):
bw.nr_throttled.fetch_add(1, Relaxed)
let cgroup_id = task.cgroup_id();
// Throttle all tasks in this cgroup across all CPUs.
// Each cpu_max_throttle_tasks() sends an IPI; the remote CPU's
// IPI handler acquires its own rq lock and throttles locally.
for cpu in online_cpus():
if let Some(server) = cpu.rq.cbs_servers.get(cgroup_id):
server.max_throttled.store(true, Release)
cpu_max_throttle_tasks(cpu, cgroup_id) // IPI-based, not inline lock
// Request immediate reschedule for the currently running task.
// Without this, the task continues for up to one full tick past
// budget exhaustion, violating the cpu.max bandwidth ceiling.
resched_curr(rq, ReschedUrgency::Eager)
Throttle/unthrottle task state transitions:
When cpu.max throttles a cgroup:
1. Each task in the cgroup has its OnRqState set to CbsThrottled (reusing the same
throttled state as CBS guarantee exhaustion — the scheduler makes no distinction).
2. Tasks are removed from their EEVDF tasks_timeline tree.
3. pick_next_task() skips max_throttled CBS servers.
When the period timer fires and replenishes quota:
1. throttled is cleared.
2. All throttled tasks have OnRqState restored to Queued and are re-inserted
into tasks_timeline. Eligibility is computed dynamically by pick_eevdf().
3. A reschedule IPI is sent to CPUs that have unthrottled tasks, so pick_next_task()
runs immediately.
Multi-CPU quota distribution: The global runtime_remaining is accessed via
AtomicI64 from all CPUs. To reduce cross-CPU atomic contention on high-core-count
systems, each CPU maintains a local runtime slice:
/// Per-CPU local slice of the cpu.max quota. Reduces atomic contention
/// on the global `CpuBandwidthThrottle.runtime_remaining`.
///
/// **Units: nanoseconds** — consistent with the global pool
/// (`CpuBandwidthThrottle.runtime_remaining`) and the CBS accounting
/// model which chose nanoseconds to "avoid the persistent truncation
/// bias of `delta_exec / 1000`."
pub struct CpuMaxLocalSlice {
/// Local runtime budget (nanoseconds). Drawn from the global pool
/// in chunks of `slice_size_ns`. When exhausted, refills from global.
pub local_remaining_ns: i64,
/// Refill chunk size (nanoseconds). Default: `(period_us * 1000) / nr_cpus`,
/// clamped to minimum 1_000_000 (1 ms) and maximum `quota_ns / 2`.
pub slice_size_ns: u64,
}
The per-CPU slice is refilled from the global pool via fetch_sub on the global
runtime_remaining. When the global pool is also exhausted, throttling triggers.
This reduces global atomic operations from once-per-tick to once-per-slice-refill.
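The charge-and-refill path can be sketched as follows (the helper names and refill logic are illustrative, not the kernel source; the real path runs under the local `RunQueue.lock`):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Per-CPU slice of the cpu.max quota (fields match the struct above).
pub struct CpuMaxLocalSlice {
    pub local_remaining_ns: i64,
    pub slice_size_ns: u64,
}

/// Default slice size: an even split of the period across CPUs, bounded
/// below by 1 ms and above by half the quota (per the field docs above).
fn default_slice_size(period_us: u64, quota_ns: u64, nr_cpus: u64) -> u64 {
    ((period_us * 1000) / nr_cpus).max(1_000_000).min(quota_ns / 2)
}

/// Charge `delta_ns` against the local slice, refilling one chunk from the
/// global pool when exhausted. Returns true when the cgroup must throttle
/// (global pool also exhausted). One atomic RMW per refill, not per tick.
fn charge_local(slice: &mut CpuMaxLocalSlice, global: &AtomicI64, delta_ns: u64) -> bool {
    slice.local_remaining_ns -= delta_ns as i64;
    if slice.local_remaining_ns > 0 {
        return false; // still running on the local slice — no global traffic
    }
    let chunk = slice.slice_size_ns as i64;
    let prev = global.fetch_sub(chunk, Ordering::AcqRel);
    if prev <= 0 {
        global.fetch_add(chunk, Ordering::AcqRel); // undo: pool was empty
        return true; // throttle this cgroup
    }
    slice.local_remaining_ns += chunk;
    false
}
```

On a 64-CPU machine with a 100 ms period, the default slice is ~1.56 ms, so a fully busy CPU touches the global pool roughly once per slice instead of once per 4 ms tick.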
CBS budget cleanup on task exit: When a task exits (do_exit → sched_task_exit),
its CBS server state is cleaned up:
1. Cancel the cgroup's per-CPU CBS replenishment timer if no other throttled tasks
remain on this CPU (the replenish_timer lives in CbsCpuServer, not in the task).
2. If the task was throttled (OnRqState::CbsThrottled), decrement the cgroup's
nr_throttled_tasks counter.
3. Return any remaining per-CPU local slice budget to the global
CpuBandwidthThrottle.runtime_remaining pool (fetch_add of the local remainder).
4. Remove the task from the CBS server's task list.
This ensures no stale timers fire for dead tasks and unused budget is returned to the
cgroup for other tasks to use.
Cross-cgroup migration bandwidth for cpu.max: When a task migrates between cgroups
(via cgroup_migrate or sched_move_task), the CBS budget transfer follows:
0. Before dequeuing, cancel any pending CBS replenishment timer for this task on
the source CPU: if the migrating task is the last throttled task on this CPU
(nr_throttled_tasks == 1), call hrtimer_cancel(&source_server.replenish_timer).
Set task.migration_pending = true to prevent the timer callback from
re-enqueuing the task during migration. This flag is cleared after enqueue in
the destination cgroup (step 6).
1. Dequeue the task from the source cgroup's runqueue.
2. Cancel the source cgroup's CBS timer for this task.
3. Return any unconsumed budget from the source cgroup's per-CPU local slice.
4. Re-initialize the task's CBS parameters from the destination cgroup's cpu.max
(quota, period, burst). Set budget = 0, deadline = now + period (fresh start).
5. Note: The task's own nice-derived weight (sched_prio_to_weight[task.nice + 20])
does NOT change during cgroup migration — nice is a per-task property, not
per-cgroup. The cgroup's cpu.weight affects the GroupEntity weight in the
hierarchical scheduler, not individual task weights. No weight recomputation
is needed here. The dequeue→enqueue sequence in step 6 ensures the task
re-enters the EEVDF tree in the correct GroupEntity for its new cgroup.
6. Enqueue the task in the destination cgroup's runqueue. Clear
task.migration_pending (set in step 0) so replenishment timers on the
destination CPU can enqueue this task normally.
Counter maintenance: Step 1 (dequeue) decrements the source server's
nr_throttled_tasks if the task was in CbsThrottled state. Step 6
(enqueue) does NOT increment the destination's nr_throttled_tasks because
the task starts with a fresh budget (budget = 0, deadline = now + period
from step 4) and is enqueued as Queued, not CbsThrottled. The
destination's nr_throttled_tasks increments only when the task's fresh
budget is exhausted in a future tick. This ensures nr_throttled_tasks
remains accurate across migration boundaries.
The fresh-start policy prevents a task from carrying a large unconsumed budget from a
generous cgroup into a restrictive one (bandwidth amplification). It also prevents a
nearly-exhausted budget from causing immediate throttling in the destination cgroup.
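Step 4's fresh-start reinitialization can be sketched as follows (struct and function names are illustrative; the real code lives in the `sched_move_task` path):

```rust
/// Fresh-start CBS parameters on cross-cgroup migration (step 4 above):
/// budget = 0, deadline = now + destination period. Illustrative types —
/// the kernel's task struct carries these fields directly.
pub struct CbsParams {
    pub quota_ns: u64,
    pub period_ns: u64,
    pub budget_ns: i64,
    pub deadline_ns: u64,
}

fn reinit_for_destination(dst_quota_us: u64, dst_period_us: u64, now_ns: u64) -> CbsParams {
    let period_ns = dst_period_us * 1000;
    CbsParams {
        quota_ns: dst_quota_us * 1000,   // cgroup config is in microseconds
        period_ns,
        budget_ns: 0,                    // no budget carried across cgroups
        deadline_ns: now_ns + period_ns, // fresh deadline, one period out
    }
}
```

Zeroing the budget rather than transferring it is what prevents both bandwidth amplification and spurious throttling on arrival.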
cpu.guarantee (CBS minimum bandwidth) transfer: When a task with a cpu.guarantee
reservation migrates between cgroups, the guaranteed bandwidth (expressed as
microseconds per period) is transferred proportionally. The per-task guarantee
(task_guarantee_us) is derived at runtime as cgroup.cpu.guarantee / nr_tasks_in_cgroup
(proportional share of the cgroup's total guarantee divided evenly among its member
tasks). It is not a stored field; it is computed during cgroup_migrate().
The source cgroup's committed guarantee decreases by task_guarantee_us and the destination cgroup's committed
guarantee increases by the same amount. If the destination cgroup's total committed
guarantees would exceed its cpu.guarantee limit, the migration is rejected with
ENOSPC at cgroup_migrate() time (admission control). This ensures that guaranteed
bandwidth is never overcommitted and the CBS admission invariant
(sum(task_guarantees) <= cgroup.cpu.guarantee) is maintained across migrations.
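The admission check can be sketched as follows (struct and field names are illustrative; `ENOSPC` matches the errno value returned to the `cgroup.procs` writer):

```rust
/// Sketch of the cpu.guarantee admission check at cgroup_migrate() time.
const ENOSPC: i32 = 28;

pub struct GuaranteeState {
    pub guarantee_us: u64, // configured cpu.guarantee (per period)
    pub committed_us: u64, // sum of admitted per-task guarantees
}

/// Derive the migrating task's share (an even split of the source cgroup's
/// guarantee, computed at migration time — not a stored field) and admit it
/// against the destination's headroom. On rejection, neither cgroup's
/// committed total changes.
fn admit_migration(
    src: &mut GuaranteeState,
    dst: &mut GuaranteeState,
    src_nr_tasks: u64,
) -> Result<(), i32> {
    let task_guarantee_us = src.guarantee_us / src_nr_tasks;
    if dst.committed_us + task_guarantee_us > dst.guarantee_us {
        return Err(ENOSPC); // would overcommit the destination's guarantee
    }
    src.committed_us -= task_guarantee_us;
    dst.committed_us += task_guarantee_us;
    Ok(())
}
```

The rejection path leaves both cgroups untouched, which is what preserves the admission invariant across failed migrations.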
UmkaOS behavioral difference from Linux: echo $PID > cgroup.procs may return
ENOSPC if the destination cgroup's CBS bandwidth guarantee would be overcommitted.
Linux does not have this failure mode because it lacks cpu.guarantee. Container
runtimes (Docker, containerd, systemd) that write to cgroup.procs already handle
errors (EBUSY, ENOMEM, ESRCH); they must additionally handle ENOSPC when UmkaOS
CBS guarantees are in use.
Cgroup freezer interaction: When a cgroup enters the Frozen state, all CBS
replenishment timers for tasks in that cgroup are cancelled (hrtimer_cancel). On
thaw, CBS timers are re-armed with fresh budgets (budget = full period, deadline =
now + period, throttled flags cleared). This prevents both free bandwidth accrual
during freeze and spurious immediate throttling on resume. See
Section 17.2 for the full freeze/thaw protocol.
7.6.12 ML Policy Integration¶
The CBS guarantee subsystem exposes tunable parameters and observations to the ML policy framework (Section 23.1).
Tunable parameters (registered via register_param!, SubsystemId::Scheduler):
| ParamId | Name | Default | Min | Max | Effect |
|---|---|---|---|---|---|
| 0x0010 | cbs_steal_fraction_pct | 50 | 10 | 90 | Fraction of donor's budget to steal (percent). Default: steal half. Lower values reduce steal impact on donor; higher values reduce throttle latency. |
| 0x0011 | cbs_steal_numa_only | 0 | 0 | 1 | If 1, only steal from same-NUMA-node siblings (skip remote). Reduces cross-node CAS traffic at cost of higher throttle latency when local budgets are exhausted. |
Observation points (emitted via observe_kernel!, gated by static_key):
| Event | obs_type | features[0..5] | Frequency |
|---|---|---|---|
| CBS throttle | 0x10 | cgroup_id, cpu_id, budget_deficit_ns, period_us, nr_throttled_tasks, steal_attempts | On throttle |
| CBS replenish | 0x11 | cgroup_id, cpu_id, share_ns, prev_budget_ns, new_budget_ns, consumed_ns | On replenish |
| CBS steal success | 0x12 | cgroup_id, src_cpu, dst_cpu, stolen_ns, donor_remaining_ns, _ | On successful steal |
| CBS steal fail | 0x13 | cgroup_id, cpu_id, nr_siblings_scanned, _, _, _ | On failed steal (all siblings exhausted) |
These observations enable the ML framework to learn:
- Whether steal fraction should be adjusted per workload (batch vs interactive)
- Whether NUMA-only steal is beneficial for a given topology
- Correlation between guarantee utilization and application performance
Parameter consumption: cbs_steal_fraction_pct is read in cbs_try_steal()
(Evolvable hot path — replaces hardcoded avail / 2):
let fraction = PARAM_STORE.get(ParamId::CbsStealFractionPct)
.map_or(50, |p| p.current.load(Relaxed));
let steal = avail * fraction as i64 / 100;
7.6.13 cpu.stat CBS Guarantee Statistics¶
The cpu.stat file for a cgroup with an active cpu.guarantee includes
additional CBS-specific counters alongside the standard Linux counters:
# Standard Linux counters (unchanged):
usage_usec 123456789
user_usec 100000000
system_usec 23456789
nr_periods 1234
nr_throttled 56
throttled_usec 789000
# UmkaOS CBS guarantee counters (new):
guarantee_nr_periods 1234 # Periods elapsed for CBS guarantee
guarantee_nr_throttled 12 # Periods in which CBS guarantee budget exhausted
guarantee_throttled_usec 45000 # Total time CBS-throttled (microseconds)
guarantee_consumed_usec 456000 # Total CBS budget consumed (microseconds)
guarantee_steal_count 89 # Number of successful budget steals from siblings
guarantee_quota_usec 400000 # Configured guarantee quota (for reference)
guarantee_period_usec 1000000 # Configured guarantee period (for reference)
These counters are per-cgroup (aggregated across all CPUs). They are computed from
CbsCpuServer.consumed_ns (divided by 1000 for usec output) and
CbsCpuServer.nr_throttled_tasks, aggregated via a per-CPU scan when cpu.stat is read.
The scan is O(nr_cpus) and takes
the cgroup's config lock (not any runqueue lock) to snapshot the per-CPU servers.
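The aggregation scan can be sketched as follows (the snapshot struct is illustrative; in the kernel these fields live in `CbsCpuServer` and are read under the cgroup's config lock):

```rust
/// Per-CPU CBS server snapshot feeding the guarantee_* lines of cpu.stat.
pub struct CbsCpuServerStats {
    pub consumed_ns: u64,  // CBS budget consumed on this CPU
    pub throttled_ns: u64, // time spent CBS-throttled on this CPU
    pub nr_throttled: u64, // periods in which the CPU's share was exhausted
}

/// O(nr_cpus) aggregation done when cpu.stat is read. Counters are kept in
/// nanoseconds internally and converted to microseconds only at the
/// reporting boundary (matching the internal-ns convention above).
fn aggregate_guarantee_stats(servers: &[CbsCpuServerStats]) -> (u64, u64, u64) {
    let consumed_usec = servers.iter().map(|s| s.consumed_ns).sum::<u64>() / 1000;
    let throttled_usec = servers.iter().map(|s| s.throttled_ns).sum::<u64>() / 1000;
    let nr_throttled = servers.iter().map(|s| s.nr_throttled).sum::<u64>();
    (consumed_usec, throttled_usec, nr_throttled)
}
```

Summing in nanoseconds before the final division avoids accumulating a per-CPU truncation error in the reported totals.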
7.7 Power Budgeting¶
7.7.1 Problem¶
Datacenters in 2026 are power-wall limited. A rack has a fixed power budget (typically 20-40 kW). Power, not compute, is the scarce resource.
Linux has power management (cpufreq, DVFS, C-states, RAPL readout) but no power budgeting. There is no way to say "this container gets at most 150W total across CPU, GPU, memory, and NIC." There is no way for the scheduler to make holistic power-performance tradeoffs.
Relationship to Section 7.2 (Heterogeneous CPU / EAS): Section 7.2 covers Energy-Aware Scheduling at the per-task level — selecting the most energy-efficient core type (P-core vs E-core) for each task using OPP tables and PELT utilization. This section covers a complementary concern: per-cgroup power budgeting — enforcing total watt caps across all power domains (CPU + GPU + DRAM + NIC). The two mechanisms interact: EAS picks the optimal core, power budgeting enforces the envelope.
7.7.2 Design: Power as a Schedulable Resource¶
Power joins CPU time, memory, and accelerator time as a kernel-managed resource with cgroup integration.
// umka-core/src/power/budget.rs
/// Maximum number of power domains tracked by the power budgeting subsystem.
/// A typical datacenter server has:
/// - 1-2 CPU packages (CpuPackage)
/// - 8-128 CPU cores (CpuCore, if per-core RAPL is available)
/// - 1-2 DRAM controllers (Dram)
/// - 0-8 GPUs/accelerators (Accelerator)
/// - 1-4 NICs (Nic, if power-metered)
/// - 1-8 NVMe SSDs (Storage, if power-metered)
/// - 1 platform-level domain (Platform)
/// Setting 256 covers high-end servers with per-core monitoring enabled.
/// The ArrayVec avoids heap allocation on the tick hot path.
pub const MAX_POWER_DOMAINS: usize = 256;
/// Maximum number of cgroups tracked by the power budgeting subsystem.
/// Cgroups are typically hierarchical; a large server may have:
/// - 1 root cgroup
/// - 10-100 system.slice cgroups (systemd services)
/// - 10-1000 user.slice cgroups (user sessions, containers)
/// Setting 4096 covers large container hosts without excessive memory.
pub const MAX_POWER_CGROUPS: usize = 4096;
/// Platform-agnostic power domain.
///
/// This is the generic cross-architecture power domain object used by the
/// power-budget enforcer (Section 7.4.4) and cgroup power accounting
/// (Section 7.2.5). It identifies a device by `DeviceNodeId` and tracks
/// current and maximum power draw regardless of the underlying measurement
/// mechanism (RAPL, SCMI, ACPI, or estimation).
///
/// Contrast with `RaplDomain` (Section 7.2.2.5), which is x86/RAPL-specific
/// and carries a `RaplInterface` hardware handle. `GenericPowerDomain` is the
/// unified abstraction that upper layers use after the architecture-specific
/// boot driver has populated the `PowerDomainRegistry`.
pub struct GenericPowerDomain {
/// Domain identifier (matches device registry node).
pub device_id: DeviceNodeId,
/// Domain type.
pub domain_type: PowerDomainType,
/// Current power draw (milliwatts, updated every tick).
pub current_mw: AtomicU32,
/// Maximum power this domain can draw (TDP or configured limit).
/// Initialized from ACPI PPCC (Participant Power Control Capabilities)
/// tables where available; these define hardware power limits that
/// the OS must respect. Falls back to TDP from CPUID/ACPI otherwise.
pub max_mw: u32,
/// Current performance level (0 = lowest power, 100 = maximum).
pub perf_level: AtomicU32,
/// Power measurement source.
pub measurement: PowerMeasurement,
}
/// CPU packages are represented as `GenericPowerDomain` with
/// `device_id = INVALID_DEVICE_NODE_ID` (sentinel). The power domain identifies
/// the package by its ACPI/DT topology path, not as a `DeviceNode`. `PowerState`
/// transitions (D0→D3) are initiated via ACPI/SCMI methods, not device driver
/// callbacks.
// PowerDomainType: see canonical definition in
// [Section 7.4](#platform-power-management--kernel-abstraction).
// 7 variants: CpuPackage(0), CpuCore(1), Dram(2), Accelerator(3),
// Nic(4), Storage(5), Platform(6).
// Not duplicated here — the single canonical definition in
// platform-power-management.md is authoritative.
#[repr(u32)]
pub enum PowerMeasurement {
/// Intel RAPL (Running Average Power Limit) via MSR.
IntelRapl = 0,
/// AMD RAPL equivalent.
AmdRapl = 1,
/// ARM SCMI (System Control and Management Interface).
ArmScmi = 2,
/// ACPI Power Meter device.
AcpiPowerMeter = 3,
/// BMC/IPMI (out-of-band, lower frequency).
BmcIpmi = 4,
/// Estimated from utilization (no hardware meter).
Estimated = 5,
}
Per-architecture power measurement details:
Intel/AMD RAPL (x86):
- Read via MSR: IA32_PKG_ENERGY_STATUS (package), IA32_PP0_ENERGY_STATUS (cores),
IA32_DRAM_ENERGY_STATUS (DRAM), IA32_PP1_ENERGY_STATUS (GPU/uncore).
- Resolution: ~15.3 μJ per LSB (Intel), ~15.6 μJ (AMD).
- Read cost: ~100ns per MSR read. 6 domains × 100ns = 600ns per tick.
- Per-core RAPL (AMD Zen 2+ via `MSR_CORE_ENERGY_STAT`, MSR `0xC001_029A`):
per-CPU energy attribution. Enables precise per-cgroup power accounting without
proportional estimation. **Intel does not provide per-core energy counters** —
Intel RAPL PP0 is an all-cores aggregate for the entire package. On Intel
platforms, per-CPU energy is estimated proportionally from PP0 using utilization
weights (less precise than AMD's direct per-core counters).
- Overflow: 32-bit energy counters. Overflow interval depends on the CPU model's
energy unit and current power draw — it MUST be computed at runtime. At boot,
the kernel reads MSR_RAPL_POWER_UNIT to extract the energy unit (bits 12:8),
giving energy_unit_joules = 2^(-ESU) (e.g., ESU=14 → ~61 μJ on Haswell
and later (including Skylake), ESU=16 → ~15.3 μJ on Sandy Bridge / Ivy
Bridge (the architecture default)). The overflow interval is then:
overflow_seconds = 2^32 * energy_unit_joules / current_power_watts
For a 200W package with ESU=16: 2^32 * 15.3e-6 / 200 ≈ 329 seconds.
For a 500W package with ESU=14: 2^32 * 61e-6 / 500 ≈ 524 seconds.
The kernel sets the RAPL polling interval to min(overflow_seconds / 2, tick)
to guarantee no counter wraparound is missed. With a 4ms tick and typical
overflow intervals of 329-524 seconds, the formula always evaluates to `tick`
(4ms) on current hardware — the overflow margin is enormous. The calculation
is still performed at runtime (not assumed) to handle hypothetical hardware
where very high power draw or very coarse energy units could produce an
overflow interval shorter than twice the tick period.
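The runtime calculation above can be sketched directly (the MSR read itself is not modeled; `esu` stands in for bits 12:8 of MSR_RAPL_POWER_UNIT):

```rust
/// Compute the RAPL polling interval from the runtime-read energy unit:
/// overflow_seconds = 2^32 * 2^(-ESU) / current_power_watts,
/// poll = min(overflow_seconds / 2, tick). Mirrors the formula above.
fn rapl_poll_interval_s(esu: u32, current_power_watts: f64, tick_s: f64) -> f64 {
    let energy_unit_joules = 2f64.powi(-(esu as i32)); // 2^(-ESU)
    let overflow_seconds = 4294967296.0 * energy_unit_joules / current_power_watts;
    (overflow_seconds / 2.0).min(tick_s)
}
```

On current hardware (ESU=14 or 16, hundreds of watts) the overflow interval is hundreds of seconds and the `min` always selects the tick; a hypothetical very coarse energy unit at high power draw would instead shorten the poll interval below the tick.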
ARM SCMI (AArch64/ARMv7):
- SCMI (System Control and Management Interface, ARM DEN 0056) is a standardized
protocol for communication between the OS and a System Control Processor (SCP).
- Power domains are discovered via SCMI_POWER_DOMAIN_ATTRIBUTES (protocol 0x11, message 0x03).
- Power measurement: SCMI_SENSOR_READING_GET (protocol 0x15, message 0x06) reads sensor values
from the SCP. Sensor types include POWER (watts), ENERGY (joules), CURRENT (amps).
- Read cost: ~1-5 μs per SCMI message (shared memory + doorbell interrupt to SCP).
Higher than RAPL (~100ns) but still within the 4ms tick budget.
- Available on: ARM SBSA servers (AWS Graviton, Ampere), Cortex-M SCP-based
platforms, and any SoC implementing SCMI power management.
- Fallback: If SCMI is not available (e.g., simple embedded boards without SCP),
fall back to Estimated mode.
Power domain mapping:
SCMI domain ID → UmkaOS GenericPowerDomain:
- SCMI_POWER_DOMAIN type "CPU" → PowerDomainType::CpuPackage or CpuCore
- SCMI_POWER_DOMAIN type "GPU" → PowerDomainType::Accelerator
- SCMI_POWER_DOMAIN type "MEM" → PowerDomainType::Dram
- Platform-level SCMI sensor → PowerDomainType::Platform
RISC-V SBI PMU:
- RISC-V has no standard power measurement interface. The SBI PMU extension
(ratified) provides performance counters but not power counters.
- On platforms with BMC/IPMI (e.g., datacenter RISC-V): use BmcIpmi source.
- On platforms without any power measurement: use Estimated mode.
- Future: the RISC-V power management task group is defining power management
extensions. UmkaOS will adopt these when ratified.
Estimated (fallback for all architectures):
- When no hardware power meter is available, UmkaOS estimates power from:
a. CPU utilization × TDP per core (linear model, ~10% accuracy).
b. Frequency scaling: power ∝ V² × f. Frequency from cpufreq.
c. C-state residency: idle cores at deep C-states draw ~0.5-2W.
- Estimation runs in the scheduler tick handler (zero additional overhead).
- Accuracy: ±20-30% vs actual hardware measurement. Sufficient for coarse
power budgeting (e.g., "keep this rack under 30kW") but not for fine-grained
per-cgroup accounting.
- The estimated source is logged at boot:
umka: power: No hardware power meter detected, using estimated power model
umka: power: Estimated power accuracy: ±25%. Consider RAPL/SCMI hardware.
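The linear model (a) with the V²·f scaling from (b) can be sketched per core as follows (all parameter names are illustrative; ratios come from cpufreq/OPP data in the real path):

```rust
/// Estimated-power fallback for a single core: active power scales with
/// utilization × per-core TDP × (V/V_max)² × (f/f_max); an idle core
/// contributes only its C-state floor draw. ±20-30% accuracy per the text.
fn estimate_core_power_mw(
    util_pct: u32,      // 0..=100 from scheduler utilization
    core_tdp_mw: u32,   // per-core share of package TDP
    freq_ratio: f64,    // f / f_max from cpufreq
    volt_ratio: f64,    // V / V_max from the OPP table
    idle_floor_mw: u32, // deep C-state draw (~0.5-2 W per the text)
) -> u32 {
    if util_pct == 0 {
        return idle_floor_mw; // core parked in a deep C-state
    }
    let dynamic = core_tdp_mw as f64
        * (util_pct as f64 / 100.0)
        * volt_ratio * volt_ratio // P ∝ V²
        * freq_ratio;             //   × f
    dynamic as u32 + idle_floor_mw
}
```

For example, a core at 50% utilization, half frequency, and 0.8× voltage contributes roughly 16% of its TDP plus the idle floor.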
7.7.3 Cgroup Integration¶
/sys/fs/cgroup/<group>/power.max
#   Maximum total power budget for this cgroup (milliwatts).
#   Enforced across ALL power domains (CPU + GPU + memory + NIC).
#   Format: "150000" (150W)
#   "max" = no limit (default)
/sys/fs/cgroup/<group>/power.current
#   Current power draw by this cgroup (milliwatts, read-only).
#   Sum of all power domains attributed to this cgroup's processes.
/sys/fs/cgroup/<group>/power.stat
#   Power statistics:
#     energy_uj <total energy consumed in microjoules>
#     throttle_count <times power budget was exceeded and throttled>
#     throttle_us <total microseconds spent throttled>
#     avg_power_mw <average power over last 10 seconds>
/sys/fs/cgroup/<group>/power.weight
#   Relative share of excess power budget (like cpu.weight).
#   Default: 100. Higher = more power when contended.
/sys/fs/cgroup/<group>/power.domains
#   Per-domain power limits (optional, for fine-grained control).
#   Format: "cpu 80000 gpu 60000 dram 10000"
#   If not set, the global power.max is split by the kernel
#   based on workload demand.
7.7.4 Power-Aware Scheduler¶
// umka-core/src/sched/power.rs
const MAX_POWER_CGROUPS: usize = 4096; // Must be power of two
/// A fixed-capacity open-addressing hash map with compile-time maximum size.
/// Uses power-of-two capacity with linear probing. Never allocates — backed
/// by a static or slab-allocated array. Insertion returns Err if at capacity.
/// N must be a power of two; load factor is capped at 75% (0.75 * N entries).
///
/// Hash function: FxHash (Firefox's fast integer hash — multiply by a golden
/// ratio constant, shift right). FxHash is ideal for small integer keys
/// (CgroupId, DeviceId) and has no allocation or state. FxHash is not
/// DoS-resistant, but this is safe because the keys (cgroup IDs) are
/// kernel-assigned integers, not user-controlled. For string keys or
/// user-controlled inputs, SipHash-1-3 is used (DoS-resistant).
///
/// Collision resolution: linear probing with step size 1. At 75% load factor
/// and power-of-two sizing, expected probe length is ~2 (Birthday paradox
/// bound). At capacity (75% of N), insert returns `Err(MapFull)`.
///
/// The map is pre-allocated at CBS initialization with capacity
/// `max_concurrent_cbs_tasks × 2` (load factor 0.5), where
/// `max_concurrent_cbs_tasks` is derived from the system's admission-control
/// limit ([Section 7.6](#cpu-bandwidth-guarantees)). Pre-allocation guarantees that no insertion fails
/// during the tick hot path as long as the admitted task count does not exceed
/// the limit enforced at `cpu.guarantee` write time.
///
/// On the rare case of map overflow (should not occur with correct
/// pre-allocation; indicates a kernel bug or admission-control bypass):
/// charge `budget_remaining_ns = 0` for the current tick as a conservative
/// fallback — the task is considered to have exhausted its budget for this
/// tick. This is safe: it errs on the side of throttling the task rather than
/// allowing unaccounted CPU consumption, preserving CBS bandwidth isolation
/// guarantees. A kernel warning is emitted unconditionally on overflow
/// (not gated on debug_assertions) since this path should never be reached.
/// **Removed**: `FixedHashMap` replaced by XArray below. CgroupId is an integer
/// key — per collection policy, integer-keyed mappings must use XArray.
/// Power budget enforcer.
/// Runs at scheduler tick frequency (~4ms) on a dedicated kthread —
/// NOT per-scheduling-decision, and not inline on the tick path.
pub struct PowerBudgetEnforcer {
/// Power domains on this machine.
/// Populated at boot from ACPI/DT hardware discovery. The number of power
/// domains is small and bounded (typically <=16: package + cores + DRAM +
/// accelerators). Uses a fixed-capacity array sized to MAX_POWER_DOMAINS.
domains: ArrayVec<GenericPowerDomain, MAX_POWER_DOMAINS>,
/// Per-cgroup power accounting.
/// XArray keyed by CgroupId (u64, integer key — per collection policy).
/// O(1) lookup at tick frequency (~4ms). Maximum cgroup count bounded by
/// the cgroup hierarchy (typically <1024 active cgroups, max MAX_POWER_CGROUPS).
/// Entries inserted on cgroup creation (warm path), removed on cgroup deletion.
cgroup_power: XArray<CgroupPowerState>,
/// Global power budget (rack-level, from BMC or admin-configured).
global_budget_mw: Option<u32>,
}
pub struct CgroupPowerState {
/// Budget for this cgroup (from power.max).
budget_mw: u32,
/// Current attributed power draw.
current_mw: u32,
/// Running energy counter (microjoules).
energy_uj: u64,
/// Is this cgroup currently throttled?
throttled: bool,
/// Throttle mechanism:
/// 1. Reduce CPU frequency (cpufreq) for this cgroup's cores.
/// 2. Reduce accelerator clock (AccelBase set_performance_level).
/// 3. As last resort: CPU throttling (delay scheduling).
/// At most one action per PowerDomainType variant (7 variants) plus the
/// CpuThrottle fallback, so bounded to 8.
/// Using 8 instead of MAX_POWER_DOMAINS (256) saves ~3 KB per cgroup entry.
throttle_actions: ArrayVec<ThrottleAction, 8>,
}
pub enum ThrottleAction {
/// Reduce CPU frequency to this level (MHz).
CpuFrequency(u32),
/// Reduce accelerator performance level.
AccelPerformance { device_id: DeviceNodeId, level: u32 },
/// Throttle CPU time (insert idle cycles).
CpuThrottle { duty_cycle_percent: u32 },
}
Enforcement flow:
Every scheduler tick (~4ms):
1. Read power counters from all domains (1 MSR read per domain, ~100ns each).
2. Attribute power to cgroups based on CPU time + accelerator time share.
Note: power attribution is an APPROXIMATION. RAPL gives package-level
power, not per-process. Attribution model:
Per-core RAPL (AMD Zen 2+): precise per-CPU attribution. Intel: proportional.
Package-level RAPL: proportional to (cgroup CPU time / total CPU time)
weighted by frequency at time of execution.
Accelerator: proportional to AccelBase get_utilization() per cgroup.
This is the same limitation as Linux (perf energy-cores event).
Exact per-process power metering requires hardware not yet available.
3. For each cgroup:
a. Is current_mw > budget_mw?
b. Yes → rebuild throttle_actions array (see selection algorithm below).
c. No → clear throttle_actions and release any active throttles.
4. Total overhead analysis:
- RAPL domain reads: up to 32 domains × ~100ns MSR read = ~3.2 μs.
- Cgroup budget checks: up to MAX_POWER_CGROUPS (4096) × ~20ns = ~82 μs.
- Worst case: ~85 μs per tick = 2.1% of a 4ms tick.
- Typical case (8 domains, 64 cgroups): ~2 μs = 0.05% of a 4ms tick.
The worst case is acceptable: power budgeting runs on a dedicated
kthread (not on scheduler tick path), and 4096 power-budgeted cgroups
is an extreme configuration. Systems with >256 power-budgeted cgroups
should increase the poll interval to 8ms via
`umka.power_budget_interval_ms`.
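The proportional attribution in step 2 can be sketched as follows (names are illustrative; the frequency weighting is assumed to be folded into the caller-supplied weighted times):

```rust
/// Proportional attribution of package-level power: a cgroup's share of
/// the package RAPL reading is its frequency-weighted CPU time divided by
/// the total weighted time in the same measurement window. This is the
/// approximation discussed above — package RAPL is not per-process.
fn attribute_package_power_mw(
    package_mw: u32,
    cgroup_weighted_ns: u64,
    total_weighted_ns: u64,
) -> u32 {
    if total_weighted_ns == 0 {
        return 0; // idle window: nothing to attribute
    }
    ((package_mw as u64 * cgroup_weighted_ns) / total_weighted_ns) as u32
}
```

Because each cgroup's share is a fraction of the same totals, the attributed values sum to at most the package reading, so per-cgroup budgets never account for more power than was measured.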
Throttle action selection algorithm (step 3b):
Input: excess_mw = current_mw - budget_mw for the cgroup.
Output: throttle_actions filled in priority order.
Step 1 — CPU frequency reduction:
For each CPU frequency domain that contains CPUs running this cgroup's tasks:
current_pstate = cpufreq_get_current_pstate(domain)
If current_pstate > PSTATE_MIN:
Add ThrottleAction::CpuFrequency(pstate_to_mhz(current_pstate - 1))
Estimated power reduction: (current_mw * pstate_freq_ratio_drop) mw
If estimated reduction ≥ excess_mw: stop here (frequency alone is enough).
Step 2 — Accelerator clock reduction:
Only if remaining_excess_mw > 0 after Step 1.
For each accelerator context used by this cgroup:
current_level = accel_vtable.get_performance_level(device_id)
If current_level > 0:
Add ThrottleAction::AccelPerformance { device_id, level: current_level - 1 }
If estimated reduction ≥ remaining_excess_mw: stop here.
Step 3 — CPU duty-cycle throttle (last resort):
Only if remaining_excess_mw > 0 after Steps 1 and 2.
duty_cycle = max(50, 100 - (remaining_excess_mw * 100 / current_mw))
Add ThrottleAction::CpuThrottle { duty_cycle_percent: duty_cycle }
Invariants:
- throttle_actions is rebuilt from scratch every tick (no incremental state).
- At most one CpuFrequency entry per frequency domain.
- At most one AccelPerformance entry per device_id.
- At most one CpuThrottle entry (covers all CPUs for this cgroup).
- Array capacity 8 is sufficient: at most one entry per PowerDomainType (7 types)
plus the CpuThrottle fallback = 8 maximum.
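The three-step selection can be sketched with scalar stand-ins for the cpufreq and accelerator queries (`est_freq_drop_mw` / `est_accel_drop_mw` model the estimated reduction of one step; the `0` placeholders stand in for the real pstate MHz and device id, which are queried from hardware in the kernel path):

```rust
#[derive(Debug, PartialEq)]
enum ThrottleAction {
    CpuFrequency(u32),
    AccelPerformance { device_id: u64, level: u32 },
    CpuThrottle { duty_cycle_percent: u32 },
}

/// Rebuild throttle_actions from scratch for one cgroup (step 3b above):
/// frequency first, then accelerator clocks, then duty-cycle as last resort.
fn select_throttle_actions(current_mw: u32, budget_mw: u32,
                           est_freq_drop_mw: u32, can_drop_pstate: bool,
                           est_accel_drop_mw: u32, accel_level: u32)
                           -> Vec<ThrottleAction> {
    let mut actions = Vec::new();
    let mut excess = current_mw.saturating_sub(budget_mw);
    if excess == 0 { return actions; }
    // Step 1: CPU frequency reduction.
    if can_drop_pstate {
        actions.push(ThrottleAction::CpuFrequency(0 /* next-lower pstate MHz */));
        excess = excess.saturating_sub(est_freq_drop_mw);
    }
    // Step 2: accelerator clock reduction, only if still over budget.
    if excess > 0 && accel_level > 0 {
        actions.push(ThrottleAction::AccelPerformance { device_id: 0, level: accel_level - 1 });
        excess = excess.saturating_sub(est_accel_drop_mw);
    }
    // Step 3: duty-cycle throttle as last resort, floored at 50%.
    if excess > 0 {
        let duty = (100 - (excess as u64 * 100 / current_mw as u64) as u32).max(50);
        actions.push(ThrottleAction::CpuThrottle { duty_cycle_percent: duty });
    }
    actions
}
```

Early exit after each step keeps the action list minimal: the duty-cycle fallback only appears when frequency and accelerator reductions cannot cover the excess.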
7.7.5 System-Level Power Accounting¶
/sys/kernel/umka/power/
energy_total_uj # Total system energy since boot (microjoules)
budget_mw # System-wide power budget (admin-set)
Carbon policy is NOT the kernel's job. The kernel measures watts and enforces watt budgets. Carbon intensity depends on grid mix, geography, time of day, renewable contracts — all external to the machine. Orchestrators (Kubernetes, Nomad, custom fleet managers) can read
energy_total_uj and compute carbon externally. This is the correct separation of concerns: the kernel provides accurate power telemetry, userspace applies policy.
EAS feedback: When PowerBudgetEnforcer throttles a power domain (CPU package, DRAM), it updates the EAS capacity table: throttled cores have reduced capacity_dmips. The scheduler's EAS path reads this capacity on every task placement decision (Section 7.2). Feedback latency: ~1ms (enforcer runs at tick frequency, capacity update is an atomic store).
Thermal Throttling Coordination:
When the hardware thermal throttle engages, the power budget enforcer backs off to
prevent double-throttling. Detection is architecture-specific:
- x86: MSR IA32_THERM_STATUS (PROCHOT assertion) or ACPI thermal zone events.
- AArch64/ARMv7: SCMI thermal notifications from SCP, or ACPI thermal zones on
SBSA-compliant servers. On DT-based platforms: thermal zone DT nodes with trip points.
- RISC-V: Platform-specific (BMC/IPMI thermal events, or DT thermal zones).
If hardware is already throttling a domain, the kernel does not apply additional software throttling to that domain — doing so would reduce performance below what the thermal situation requires. The kernel logs the thermal event and adjusts its power model to account for reduced headroom.
Hardware/software throttle coordination: To prevent double-throttling during the
detection window, the power enforcer reads the hardware throttle status BEFORE
applying its own throttle. On x86, IA32_THERM_STATUS bit 0 (PROCHOT active) and
IA32_PACKAGE_THERM_STATUS indicate active hardware throttling. On ARM, SCMI
notifications deliver thermal events asynchronously. The coordination protocol:
1. Before each enforcement tick, read hardware throttle status.
2. If hardware throttling is active, skip software throttling for this tick
(hardware is already reducing power draw).
3. If hardware throttling was active on the previous tick but is now inactive,
re-evaluate software throttle based on current power measurement (the RAPL
or SCMI reading now reflects the hardware-throttled power level).
4. Race window: between hardware engaging thermal throttle and the next
enforcement tick (~4 ms worst case), both throttles may be active
simultaneously. This is safe — double-throttling reduces performance
temporarily but does not cause correctness issues. The next tick detects
the hardware throttle and removes the software throttle.
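The four-step protocol reduces to a small per-tick state machine. The sketch below is illustrative: the boolean inputs stand in for the arch-specific hardware status reads (IA32_THERM_STATUS, SCMI notifications) and the RAPL/SCMI budget comparison.

```rust
/// Tracks hardware throttle state across enforcement ticks so the
/// software throttle can back off (step 2) and re-evaluate (step 3).
struct ThrottleCoordinator {
    hw_throttle_was_active: bool,
    sw_throttle_active: bool,
}

impl ThrottleCoordinator {
    /// One enforcement tick. Returns true if software throttling was
    /// (re)evaluated this tick, false if it was skipped because the
    /// hardware is already throttling.
    fn enforcement_tick(&mut self, hw_throttle_active: bool, over_budget: bool) -> bool {
        let evaluated = if hw_throttle_active {
            // Step 2: hardware already reducing power draw -- skip
            // software throttling for this tick.
            self.sw_throttle_active = false;
            false
        } else if self.hw_throttle_was_active {
            // Step 3: hardware throttle just released -- re-evaluate
            // from the current (now unthrottled) power measurement.
            self.sw_throttle_active = over_budget;
            true
        } else {
            // Normal path: throttle iff over budget.
            self.sw_throttle_active = over_budget;
            true
        };
        self.hw_throttle_was_active = hw_throttle_active;
        evaluated
    }
}
```

The step-4 race window needs no code: if both throttles briefly overlap, the next call observes `hw_throttle_active == true` and clears the software throttle.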
7.7.5.1 Thermal Passive Cooling — EAS Capacity Update¶
When thermal throttling reduces a CPU's operating frequency (passive cooling), the
scheduler's energy-aware placement decisions become incorrect unless the per-CPU
CpuCapacity.capacity value is updated to reflect the reduced throughput. Without
this feedback, EAS may place tasks on a thermally throttled core believing it has
full capacity, leading to both missed performance targets and further thermal
escalation.
The thermal framework notifies the scheduler via thermal_update_capacity() whenever
a thermal zone's passive cooling governor reduces a CPU's maximum frequency:
/// Called by the thermal framework when passive cooling reduces or restores
/// a CPU's maximum operating frequency. Updates the EAS capacity model so
/// the scheduler accounts for the reduced throughput.
///
/// # Arguments
///
/// * `cpu` — The CPU whose capacity changed.
/// * `throttled_freq_khz` — The new maximum frequency allowed by the thermal
/// governor (in kHz). If the thermal constraint is lifted, this equals the
/// CPU's original `max_freq`.
///
/// # Effect
///
/// Recomputes `CpuCapacity.capacity` for the target CPU:
/// new_capacity = capacity_max × (throttled_freq_khz / max_freq_khz)
///
/// The `capacity_max` field (boot-time maximum at the CPU's highest OPP)
/// is unchanged. Only the `capacity` field (current maximum available to the
/// scheduler) is reduced. `capacity_curr` continues to track the actual
/// instantaneous frequency set by the cpufreq governor (which is now clamped
/// to at most `throttled_freq_khz`).
///
/// This function also updates the cpufreq policy's `max` frequency, preventing
/// the governor from requesting a frequency above the thermal limit.
///
/// # Integration with cpufreq
///
/// The thermal governor calls `thermal_update_capacity()` BEFORE calling
/// `cpufreq_update_policy()`. This ordering ensures that:
/// 1. The scheduler sees the reduced capacity before the next `pick_next_task`.
/// 2. The cpufreq governor sees the clamped `policy.max` before the next
/// frequency decision.
///
/// # Performance
///
/// Called only when thermal trip points are crossed (rare — typically once
/// per thermal event, not per tick). The capacity recalculation is O(1)
/// per affected CPU.
pub fn thermal_update_capacity(cpu: CpuId, throttled_freq_khz: u32) {
    let cap = per_cpu!(cpu_capacity, cpu);
    // new_capacity = capacity_max * (throttled_freq / max_freq),
    // clamped to capacity_max so a restore to max_freq lands exactly
    // on the boot-time maximum. (Scaling by capacity_max, not a fixed
    // 1024, keeps the formula correct on heterogeneous cores whose
    // capacity_max is below 1024.)
    let scaled = cap.capacity_max as u64 * throttled_freq_khz as u64
        / cap.max_freq_khz as u64;
    let new_cap = core::cmp::min(scaled as u32, cap.capacity_max);
    cap.capacity.store(new_cap, Ordering::Release);
    // Clamp cpufreq policy max to prevent the governor from exceeding
    // the thermal limit.
    if let Some(policy) = cpufreq_get_policy(cpu) {
        policy.max_freq_khz.store(throttled_freq_khz, Ordering::Release);
    }
}
Thermal → EAS feedback path:
Thermal zone poll (every thermal_polling_delay_ms):
→ zone temperature exceeds passive trip point
→ thermal governor reduces CPU freq: cpufreq_cooling_set_max(freq_khz)
→ cpufreq_cooling_set_max calls thermal_update_capacity(cpu, freq_khz)
→ CpuCapacity.capacity reduced proportionally
→ EAS next wakeup sees reduced capacity → avoids placing tasks on throttled core
→ Schedutil sees clamped policy.max → does not request frequency above thermal limit
When the thermal zone cools below the trip point, the governor restores the original
frequency, and thermal_update_capacity() is called with the CPU's original
max_freq_khz, restoring CpuCapacity.capacity to capacity_max.
ML Policy → EAS Closed-Loop Feedback Protocol:
The ML policy framework (Section 23.1) provides predictive power management by feeding power telemetry into the EAS (Energy-Aware Scheduling) task placement engine. The closed-loop protocol:
1. Power telemetry collection (every scheduler tick, ~4ms):
→ PowerDomain.current_watts read from RAPL/SCMI/OCC
→ Per-CPU utilization from CFS load tracking
→ Thermal zone temperature from thermal polling
2. ML policy inference (every policy_inference_interval_ms, default 100ms):
→ Input: [power_watts, utilization, temperature, freq_khz] per domain
→ Output: PowerPolicyAction { target_freq_khz, capacity_headroom_pct }
→ Inference runs in the ML policy kthread (SCHED_NORMAL, nice 5)
3. EAS feedback application (immediate, on ML policy output):
→ If target_freq_khz < current_freq_khz:
cpufreq_cooling_set_max(target_freq_khz) // same path as thermal
thermal_update_capacity(cpu, target_freq_khz)
→ capacity_headroom_pct adjusts EAS's energy_threshold:
energy_threshold = base_threshold * (100 + capacity_headroom_pct) / 100
(Higher headroom → EAS more willing to use higher-power cores)
4. Observation (next telemetry collection):
→ ML policy observes the effect of its previous action
→ Adjusts next inference based on actual power/thermal response
→ Convergence: within 3-5 inference cycles (~300-500ms)
Safety bound: The ML policy cannot set target_freq_khz below
policy.min_freq_khz (hardware minimum) or above the thermal governor's
current limit. It operates strictly within the envelope defined by the
thermal governor and cpufreq driver constraints.
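The headroom-to-threshold mapping in step 3 is a one-liner. The sketch below treats `capacity_headroom_pct` as a signed percentage (an assumption not stated above: positive headroom relaxes the threshold, negative tightens it) and clamps the result at zero.

```rust
/// energy_threshold = base_threshold * (100 + capacity_headroom_pct) / 100
///
/// Higher headroom makes EAS more willing to place tasks on
/// higher-power cores; the zero clamp guards against a pathological
/// headroom below -100%.
fn adjusted_energy_threshold(base_threshold: u64, capacity_headroom_pct: i64) -> u64 {
    ((base_threshold as i64 * (100 + capacity_headroom_pct)) / 100).max(0) as u64
}
```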
Battery Systems:
Power budgeting for battery-powered systems (laptops, edge devices) is out of scope for v1. Battery charge level, discharge rate, and remaining runtime are platform-management concerns handled by ACPI/UPower in userspace. The power budgeting system provides the watt-level telemetry that battery management software can consume, but does not implement battery-specific policies.
7.7.6 Performance Impact¶
Per-architecture overhead per scheduler tick (~4ms):
| Architecture | Read mechanism | Cost per domain | 6-domain system | Overhead |
|---|---|---|---|---|
| x86 (RAPL) | MSR read | ~100ns | 600ns | 0.015% |
| AArch64 (SCMI) | SCP mailbox | ~1-5 μs | 6-30 μs | 0.15-0.75% |
| ARMv7 (SCMI) | SCP mailbox | ~1-5 μs | 6-30 μs | 0.15-0.75% |
| RISC-V (Estimated) | Calculation | ~50ns | 300ns | 0.008% |
| PPC32 (Estimated) | Calculation | ~50ns | 300ns | 0.008% |
| PPC64LE (OCC) | OPAL sensor read | ~1-5 μs | 6-30 μs | 0.15-0.75% |
| Any (BMC/IPMI) | OOB polling | ~10-50 μs | 60-300 μs | 0.006-0.03% (rate-limited to 1/s) |
SCMI overhead is higher than RAPL but still well within budget. For BMC/IPMI sources, the kernel rate-limits reads to 1 per second (not per tick) to avoid I2C/IPMI bus saturation, using the last-read value for inter-read ticks. The overhead percentage for BMC/IPMI reflects amortization over the 1-second read interval (60-300 μs / 1s), not per-tick cost.
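The amortization figures can be checked directly: overhead is read cost divided by read interval.

```rust
/// Overhead percentage of a periodic sensor read amortized over its
/// read interval: (cost / interval) * 100.
fn amortized_overhead_pct(read_cost_us: f64, interval_s: f64) -> f64 {
    read_cost_us / (interval_s * 1_000_000.0) * 100.0
}
```

A 60-300 μs BMC/IPMI read once per second amortizes to 0.006-0.03%, matching the table; the same read taken every 4 ms tick would cost 1.5-7.5%, which is why the rate limit exists.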
When power throttling is active: performance reduction is intentional and configured. It replaces uncontrolled thermal throttling (which is worse — it's sudden and undifferentiated).
When power throttling is NOT active: zero overhead beyond the power reads.
7.8 Timekeeping and Clock Management¶
Accurate, low-latency timekeeping is foundational: the scheduler needs monotonic
timestamps for CBS deadlines (Section 7.6), real-time tasks need bounded timer
latency (Section 8.4), and userspace applications call clock_gettime() millions
of times per second. This section describes how UmkaOS reads hardware clocks,
maintains system time, exposes fast timestamps to userspace, and manages timer
events.
7.8.1 Clock Source Hierarchy¶
Each architecture provides one or more hardware cycle counters. UmkaOS selects the best available source at boot and can switch at runtime if a source proves unstable (Section 7.8).
| Architecture | Primary Source | Secondary | Resolution | Access |
|---|---|---|---|---|
| x86-64 | TSC (Time Stamp Counter) | HPET, ACPI PM Timer | sub-ns | `rdtsc` (user/kernel) |
| AArch64 | Generic Timer (`CNTPCT_EL0`) | — | typically 1-10 ns | `mrs` (EL0 if enabled) |
| ARMv7 | Generic Timer (CNTPCT via cp15) | — | typically 1-10 ns | `mrc` (PL0 if enabled) |
| RISC-V | `mtime` (MMIO) | `rdtime` CSR | implementation-defined | `rdtime` (U-mode) |
| PPC32 | Timebase (TBL/TBU) | Decrementer (DEC) | typically 1-10 ns | `mftb` / `mfspr` |
| PPC64LE | Timebase (TB) | Decrementer (DEC) | sub-ns (POWER9: 512 MHz) | `mftb` (user/kernel) |
| s390x | TOD (Time-of-Day) clock | — | sub-ns (1.024 GHz native) | STCK / STCKE (all privilege levels) |
| LoongArch64 | Stable Counter | — | implementation-defined (freq from CPUCFG) | RDTIME (user/kernel) |
x86-64 notes: Modern processors (Intel Nehalem+, AMD Zen+) provide an
invariant TSC that runs at a constant rate regardless of frequency scaling or
C-state transitions. CPUID leaf 0x8000_0007 EDX bit 8 advertises this. When
invariant TSC is available, it is the preferred source: zero-cost reads
(rdtsc is unprivileged), sub-nanosecond resolution, and monotonicity
guaranteed across cores. When invariant TSC is absent, UmkaOS falls back to HPET
(~100 ns read latency, MMIO) or the ACPI PM Timer (~800 ns read latency,
port I/O).
AArch64 / ARMv7 notes: The ARM Generic Timer is architecturally defined and
always present. The kernel configures CNTKCTL_EL1 to allow EL0 (userspace)
reads of CNTPCT_EL0, enabling a vDSO fast path identical in spirit to x86
rdtsc.
RISC-V notes: The rdtime pseudo-instruction reads the platform-provided
real-time counter. Frequency is discoverable from the device tree
(timebase-frequency property). Resolution varies by implementation.
s390x notes: The TOD (Time-of-Day) clock is a 104-bit architecturally
defined clock running at 1.024 GHz (bit 51 = 1 microsecond). STCK reads the
upper 64 bits (sufficient for sub-nanosecond resolution); STCKE reads the
full 128-bit extended format. The TOD clock is continuous across all CPU states
and synchronized across all CPUs in a configuration via the STP (Server Time
Protocol) facility. No secondary source is needed — the TOD clock is the sole
timekeeping mechanism on s390x.
LoongArch64 notes: The Stable Counter is a fixed-frequency counter
accessible via the RDTIME instruction from any privilege level. The counter
frequency is discoverable at boot from the CPUCFG instruction (register
0x4, CC_FREQ field) or from the device tree. Like ARM's Generic Timer,
it provides a uniform timekeeping interface independent of CPU frequency
scaling.
All clock sources implement a common abstraction:
// umka-core/src/time/clocksource.rs
/// Hardware clock source abstraction.
/// Implementations are per-architecture; the best source is selected at boot.
pub trait ClockSource: Send + Sync {
/// Read the current cycle count from hardware.
fn read_cycles(&self) -> u64;
/// Nominal frequency of this clock source in Hz.
fn frequency_hz(&self) -> u64;
/// Quality rating: higher values are preferred when multiple sources exist.
/// TSC invariant = 350, HPET = 250, ACPI PM Timer = 100.
fn rating(&self) -> u32;
/// Whether this source continues counting through CPU sleep states.
fn is_continuous(&self) -> bool;
/// Upper bound on single-read uncertainty in nanoseconds.
/// Accounts for read latency and synchronization jitter.
fn uncertainty_ns(&self) -> u32;
}
At boot, umka-core enumerates available sources, sorts by rating(), and
activates the highest-rated continuous source. The secondary source (if any) is
retained for watchdog cross-validation (Section 7.8).
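The boot-time selection can be sketched as follows. A simplified struct stands in for `dyn ClockSource` trait objects; the ratings match the examples in the trait doc comment.

```rust
/// Simplified stand-in for a ClockSource trait object: just the fields
/// the boot-time selection logic needs.
struct Source {
    name: &'static str,
    rating: u32,     // ClockSource::rating()
    continuous: bool, // ClockSource::is_continuous()
}

/// Sort by rating (descending) and activate the highest-rated
/// continuous source; the next-best continuous source is retained as
/// the watchdog secondary.
fn select_clocksource(
    mut sources: Vec<Source>,
) -> (Option<&'static str>, Option<&'static str>) {
    sources.sort_by(|a, b| b.rating.cmp(&a.rating));
    let mut candidates = sources.into_iter().filter(|s| s.continuous);
    (
        candidates.next().map(|s| s.name), // primary
        candidates.next().map(|s| s.name), // watchdog secondary
    )
}
```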
7.8.2 Timekeeping Subsystem¶
UmkaOS maintains four clocks, matching POSIX semantics:
| Clock | Semantics | Adjustable? |
|---|---|---|
| `CLOCK_MONOTONIC` | Time since boot, NTP-adjusted rate | No (monotonic) |
| `CLOCK_MONOTONIC_RAW` | Time since boot, raw hardware rate | No |
| `CLOCK_REALTIME` | Wall clock (UTC), NTP-adjusted | Yes (`clock_settime`, NTP) |
| `CLOCK_BOOTTIME` | Like `CLOCK_MONOTONIC` but includes suspend time | No |
Timestamp representation: All internal timestamps use a (seconds: u64,
nanoseconds: u64) tuple. Both fields are 64-bit to avoid overflow in
intermediate arithmetic (nanoseconds may temporarily exceed 10^9 during
computation and are normalized before storage).
Global timekeeper state is protected by a seqlock — the same pattern used in
Linux timekeeping.c. Readers (including the vDSO) retry if they observe a torn
update. Writers (the timer interrupt handler) are serialized by holding the
seqlock write side.
// umka-core/src/time/timekeeper.rs
/// Global timekeeping state, updated on every tick or clocksource event.
pub struct Timekeeper {
pub seq: SeqLock, // seqlock ([Section 3.6](03-concurrency.md#lock-free-data-structures--seqlockt-sequence-lock)) protecting all fields
pub clock: &'static dyn ClockSource, // active clock source
pub cycle_last: u64, // last cycle count at update
pub mask: u64, // counter wrap bitmask
pub mult: u32, // ns = (cycles * mult) >> shift
pub shift: u32,
pub wall_sec: u64, // CLOCK_REALTIME
pub wall_nsec: u64,
pub mono_sec: u64, // CLOCK_MONOTONIC
pub mono_nsec: u64,
pub boot_offset_sec: u64, // CLOCK_BOOTTIME delta
pub boot_offset_nsec: u64,
pub freq_adj: i64, // NTP/PTP frequency correction (scaled ppm)
pub phase_adj: i64, // NTP/PTP phase correction (ns)
}
NTP/PTP discipline: An adjtimex()-compatible interface accepts frequency
and phase corrections from userspace NTP or PTP daemons. Frequency adjustment
modifies mult slightly so that cycles-to-nanoseconds conversion drifts at the
requested rate. Phase adjustment is applied as a slew (at most 500 ppm rate
adjustment) to avoid wall clock jumps. CLOCK_MONOTONIC_RAW is immune to both
adjustments — it reflects raw hardware cycles converted at the nominal rate.
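The `mult`/`shift` conversion that NTP discipline perturbs works as follows. This is a sketch with hypothetical helper names; the real registration code also chooses `shift` to balance precision against intermediate overflow for the expected maximum delta.

```rust
/// ns = (cycles * mult) >> shift, computed in 128-bit so the
/// intermediate product cannot overflow for large cycle deltas.
fn cycles_to_ns(cycles: u64, mult: u32, shift: u32) -> u64 {
    ((cycles as u128 * mult as u128) >> shift) as u64
}

/// Derive mult for a source frequency at a chosen shift, as done when
/// a clocksource is registered: mult = (1e9 << shift) / freq_hz.
/// NTP frequency discipline then nudges this value by scaled ppm.
fn compute_mult(freq_hz: u64, shift: u32) -> u32 {
    ((1_000_000_000u128 << shift) / freq_hz as u128) as u32
}
```

For a 1 GHz counter with `shift = 10`, `mult` comes out to exactly 1024 and one cycle converts to exactly one nanosecond; for non-power-of-two frequencies the conversion carries a small truncation error that the periodic `cycle_last` rebase keeps bounded.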
7.8.3 vDSO Fast Path¶
Linux problem: clock_gettime() is the most frequently invoked syscall in
many workloads (databases, trading systems, telemetry). A kernel entry costs
~100-200 ns due to mode switch, KPTI page table reload, and speculative
execution mitigations. At millions of calls per second, this adds up.
UmkaOS design: Like Linux, UmkaOS maps a vDSO (virtual Dynamic Shared Object)
into every process's address space. The vDSO contains userspace implementations
of clock_gettime(), gettimeofday(), and time() that read the hardware
clocksource directly and apply precomputed conversion parameters — no syscall
needed.
The kernel maintains a read-only shared page (VvarPage,
Section 2.22) that it
updates on every timer tick and on NTP adjustments. Userspace vDSO code reads
this page under seqlock (Section 3.6) protection.
The canonical VvarPage struct definition is in
Section 2.22. Key fields
for timekeeping:
| Field | Type | Description |
|---|---|---|
| `seq` | `u32` | Seqlock counter (odd = update in progress, even = stable). ABI-constrained u32; wraps in ~24.8 days at 1000 Hz but seqlock parity check makes wrap safe. |
| `clock_mode` | `u32` | Active clocksource (TSC, HPET, Generic Timer, ...) |
| `cycle_last` | `u64` | Cycle count at last kernel update |
| `mask` / `mult` / `shift` | `u64` / `u32` / `u32` | NTP-adjusted conversion parameters |
| `clock_realtime_sec` / `_nsec` | `u64` | CLOCK_REALTIME base |
| `clock_monotonic_ns` | `u64` | CLOCK_MONOTONIC base (nanoseconds) |
| `clock_tai_offset_sec` | `i64` | TAI - UTC offset (typically 37 s as of 2024). CLOCK_TAI = CLOCK_REALTIME + this offset. Updated on leap second events. |
| `clock_boottime_sec` / `_nsec` | `u64` | CLOCK_BOOTTIME base (includes suspend time) |
| `monotonic_raw_sec` / `_nsec` | `u64` | CLOCK_MONOTONIC_RAW (immune to NTP adjustments) |
| `raw_mult` / `raw_shift` | `u32` | Nominal (non-NTP-adjusted) conversion parameters |
vDSO read path (userspace, per-architecture):
1. Read `seq`. If odd, spin (kernel is mid-update). Increment `retry_count`.
2. Read `cycle_last`, `mult`, `shift`, `mask`, and the relevant base time.
3. Read the hardware counter (`rdtsc` / `mrs CNTPCT_EL0` / `rdtime`).
4. Compute `delta = (now - cycle_last) & mask`.
5. Compute `ns = base_nsec + (delta * mult) >> shift`. Normalize into seconds.
6. Re-read `seq`. If it changed:
   - If `retry_count < 100`: go to step 1.
   - If `retry_count >= 100`: fall back to the `clock_gettime()` syscall. This matches Linux vDSO behavior and prevents indefinite spinning when the kernel performs sustained timekeeper updates (e.g., NTP slew adjustment, clocksource switch, or a pathological interrupt storm holding the timekeeper seqlock for extended periods). The 100-iteration threshold is conservative: normal contention resolves within 1-3 retries; reaching 100 indicates a systemic issue rather than transient contention.
Cost: ~5-20 ns depending on architecture (dominated by the clocksource read instruction itself). This is 10-40x faster than a syscall path. The syscall fallback path costs ~100-200 ns but is taken only under extreme contention.
Fallback: If clock_mode indicates no userspace-readable source is
available (e.g., HPET on x86, which requires MMIO the kernel has not mapped
into user address space), the vDSO falls back to a real syscall instruction.
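The read loop can be sketched with std atomics standing in for the shared VvarPage mapping. This is a userspace model, not the real vDSO code: the hardware counter read is passed as a closure, and only the fields the loop touches are modeled.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Retry budget before falling back to the syscall path.
const MAX_RETRIES: u32 = 100;

/// Minimal model of the timekeeping fields in the shared vDSO page.
struct Vvar {
    seq: AtomicU32,
    cycle_last: AtomicU64,
    base_nsec: AtomicU64,
    mult: u32,
    shift: u32,
    mask: u64,
}

/// Returns Some(ns) on success; None when the retry budget is
/// exhausted, in which case the caller would issue a real
/// clock_gettime() syscall.
fn vdso_clock_read(v: &Vvar, read_counter: impl Fn() -> u64) -> Option<u64> {
    for _ in 0..MAX_RETRIES {
        let s1 = v.seq.load(Ordering::Acquire);
        if s1 & 1 != 0 {
            continue; // odd: kernel update in progress
        }
        let cycle_last = v.cycle_last.load(Ordering::Relaxed);
        let base = v.base_nsec.load(Ordering::Relaxed);
        let now = read_counter();
        let delta = now.wrapping_sub(cycle_last) & v.mask;
        let ns = base + ((delta as u128 * v.mult as u128) >> v.shift) as u64;
        // Re-read seq: unchanged means the snapshot was consistent.
        if v.seq.load(Ordering::Acquire) == s1 {
            return Some(ns);
        }
    }
    None // fall back to the syscall path
}
```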
7.8.4 Timer Infrastructure¶
UmkaOS provides two timer mechanisms, matching the Linux split between coarse and high-resolution timers.
Timer wheel (coarse-grained, jiffies resolution):
Used for network retransmission timeouts, poll/epoll timeouts, and other events where millisecond precision is sufficient. Implemented as a hierarchical timer wheel with O(1) insertion and O(1) per-tick processing (cascading is amortized). The wheel uses 8 levels with 256 slots each, covering timeouts from 1 tick to ~50 days at HZ=250.
High-resolution timers (hrtimers, nanosecond precision):
Used for timer_create() (POSIX per-process timers), nanosleep() /
clock_nanosleep(), timerfd_create(), and scheduler deadline enforcement.
Implemented as a per-CPU red-black tree keyed by absolute expiry time. The
nearest expiry programs the hardware timer (local APIC on x86, Generic Timer on
ARM, mtimecmp on RISC-V) to fire at the exact time.
// umka-core/src/time/hrtimer.rs
/// A high-resolution timer.
pub struct HrTimer {
/// Absolute expiry time (CLOCK_MONOTONIC nanoseconds).
pub expires_ns: u64,
/// Callback invoked on expiry. Runs in hard-IRQ context.
pub callback: fn(&mut HrTimer),
/// Opaque context value passed to the callback. Typically a pointer to
/// the enclosing structure (cast via `as usize`), allowing the callback
/// to recover its context via `unsafe { &*(context as *const T) }`.
/// This is the Rust equivalent of Linux's `container_of` pattern for
/// timer callbacks.
///
/// # Safety
///
/// Using `context` as a pointer requires the following invariants:
///
/// 1. **Pinning**: The `HrTimer` must be embedded in a `Pin`-ned
/// allocation. The enclosing structure must not move while the timer
/// is armed, since `context` stores a raw pointer to it.
/// 2. **Drop ordering**: The enclosing structure's `Drop` implementation
/// must cancel the timer (`hrtimer_cancel()`) before deallocation,
/// ensuring the callback never fires with a dangling `context`.
/// 3. **Type agreement**: The callback is responsible for casting
/// `context` back to the correct type via
/// `unsafe { &*(self.context as *const T) }`. The caller that sets
/// `context` and the callback must agree on the type `T`.
/// 4. These invariants match Linux's `container_of` + `hrtimer` pattern,
/// adapted for Rust's ownership model. The timer subsystem enforces
/// invariant (2) by requiring `Pin<&mut HrTimer>` for
/// `hrtimer_start()`.
pub context: usize,
/// Timer state.
pub state: HrTimerState,
/// Owning CPU (timers are per-CPU to avoid cross-CPU synchronization).
pub cpu: u32,
}
/// Convenience alias used by subsystem specs (watchdog, IPVS, timerfd).
/// `KernelTimer` is the same `HrTimer` defined above.
pub type KernelTimer = HrTimer;
7.8.4.1 Cross-Domain Timer Registration¶
Tier 1 modules (e.g., umka-net TCP, IPVS) run in isolated domains and cannot
receive direct timer callbacks from the Tier 0 timer wheel. Instead, they register
timers with a domain_id parameter. On expiry, the timer subsystem routes the
event to the target domain's MPSC IRQ ring as a TimerExpiry notification
(Section 12.8), rather than invoking a callback directly.
This separation is required by the Unified Domain Model: the timer wheel runs in Tier 0 softirq context, while the timer handler runs in the Tier 1 driver's domain. Direct cross-domain function calls violate the isolation boundary.
/// Cross-domain timer registration for Tier 1 modules.
///
/// When `domain_id != CORE_DOMAIN_ID`, the timer expiry path calls
/// `timer_fire_to_domain()` ([Section 12.8](12-kabi.md#kabi-domain-runtime)) instead of
/// invoking `callback` directly. The callback field is ignored for
/// cross-domain timers (the driver's IRQ consumer loop handles dispatch
/// via `DriverIrqHandler::handle_timer_expiry()`).
///
/// When `domain_id == CORE_DOMAIN_ID` (Tier 0), the timer fires normally
/// via the `callback` field in `HrTimer` -- no ring dispatch.
///
/// # Arguments
///
/// - `timer_id`: Opaque identifier. For TCP: packed
/// `(sock_handle: u48, timer_type: u16)`. For other modules: module-defined.
/// The timer subsystem does not interpret this value -- it is echoed verbatim
/// in the `TimerExpiryPayload.timer_id` field on expiry.
/// - `domain_id`: Target domain for expiry delivery. Validated at registration
/// time: the domain must exist and be in `Active` state. If the domain crashes
/// between registration and expiry, the event is silently dropped.
/// - `expiry_ns`: Absolute expiry time (CLOCK_MONOTONIC nanoseconds).
/// - `timer_type`: `Wheel` for coarse-grained (jiffies resolution, network
/// retransmit timeouts) or `HrTimer` for high-resolution (nanosecond precision,
/// deadline enforcement).
///
/// # Returns
///
/// `CrossDomainTimerHandle` on success. The handle can be used to cancel or
/// rearm the timer. Cancellation is synchronous: after `cancel()` returns, no
/// expiry event for this timer will be delivered (in-flight events may still
/// be in the IRQ ring but the consumer detects staleness via `expiry_ns`
/// comparison).
///
/// # Errors
///
/// - `EINVAL`: `domain_id` does not exist or is not `Active`.
/// - `ENOMEM`: Timer wheel or hrtimer tree is at capacity (should not
/// happen in practice -- both are dynamically sized).
pub fn timer_register_cross_domain(
timer_id: u64,
domain_id: DomainId,
expiry_ns: u64,
timer_type: TimerType,
) -> Result<CrossDomainTimerHandle, Error> {
// Validate domain.
let domain = DOMAIN_REGISTRY.get(domain_id)
.ok_or(Error::INVAL)?;
if domain.domain_crashed() {
return Err(Error::INVAL);
}
let handle = match timer_type {
TimerType::Wheel => {
// Insert into per-CPU timer wheel. The wheel entry stores
// (domain_id, timer_id) instead of a callback pointer.
let wheel_handle = timer_wheel_insert_cross_domain(
expiry_ns, domain_id, timer_id,
)?;
CrossDomainTimerHandle::Wheel(wheel_handle)
}
TimerType::HrTimer => {
// Insert into per-CPU hrtimer tree. The hrtimer entry stores
// (domain_id, timer_id). On expiry, the hrtimer callback calls
// timer_fire_to_domain() instead of the driver's callback.
let hr_handle = hrtimer_insert_cross_domain(
expiry_ns, domain_id, timer_id,
)?;
CrossDomainTimerHandle::HrTimer(hr_handle)
}
};
Ok(handle)
}
/// Timer type selector for cross-domain registration.
pub enum TimerType {
/// Coarse-grained timer (jiffies resolution). Used for network retransmit
/// timeouts, keepalive, TIME_WAIT, and other events where millisecond
/// precision is sufficient.
Wheel,
/// High-resolution timer (nanosecond precision). Used for deadline
/// enforcement, CBS replenishment, and latency-sensitive timers.
HrTimer,
}
/// Handle returned by `timer_register_cross_domain()`. Supports cancel
/// and rearm operations.
pub enum CrossDomainTimerHandle {
Wheel(WheelTimerHandle),
HrTimer(HrTimerHandle),
}
impl CrossDomainTimerHandle {
/// Cancel the timer. After this returns, no new expiry events will be
/// enqueued for this timer. Events already in the IRQ ring are detected
/// as stale by the consumer (expiry_ns mismatch).
pub fn cancel(&mut self) {
match self {
Self::Wheel(h) => h.cancel(),
Self::HrTimer(h) => h.cancel(),
}
}
/// Rearm the timer with a new expiry time. The old expiry is cancelled
/// and a new one scheduled. In-flight events for the old expiry are
/// detected as stale by the consumer.
pub fn rearm(&mut self, new_expiry_ns: u64) {
match self {
Self::Wheel(h) => h.rearm(new_expiry_ns),
Self::HrTimer(h) => h.rearm(new_expiry_ns),
}
}
}
Tier 0 expiry path change: When a timer with domain_id != CORE_DOMAIN_ID
expires, the timer wheel (or hrtimer) expiry handler calls
timer_fire_to_domain(domain_id, timer_id, expiry_ns) instead of invoking
the normal callback. This function is defined in Section 12.8
and enqueues a TimerExpiry event on the target domain's MPSC IRQ ring.
The cost is one CAS enqueue (~3-5 cycles) plus a conditional IPI (~0-3 cycles)
-- comparable to a normal timer callback invocation.
Per-CPU timer queues: Each CPU maintains its own timer wheel and hrtimer tree. Timer insertion targets the local CPU by default. Expiry processing happens in the local timer interrupt — no cross-CPU IPI is needed. This eliminates contention and provides deterministic latency on isolated CPUs (Section 8.4).
Timer coalescing: When a timer is inserted with a slack tolerance (e.g.,
a 100 ms timeout with 10 ms acceptable slack), the kernel may delay it to
coalesce with nearby timers. This reduces wakeups on idle CPUs, improving power
efficiency. Coalescing is disabled for hrtimers with zero slack (RT workloads).
The timer_slack_ns per-process tunable controls default slack, identical to
the Linux interface.
7.8.5 Time Namespace Offsets¶
Containers and checkpoint/restore (CRIU) require the ability to present shifted
monotonic and boottime clocks to isolated processes. Linux added time namespaces
in kernel 5.6 (via unshare(CLONE_NEWTIME)). UmkaOS provides the same
capability.
/// Per-time-namespace offsets applied to monotonic and boottime clocks.
/// Created when a process calls unshare(CLONE_NEWTIME).
pub struct TimeNamespace {
/// Offset added to CLOCK_MONOTONIC readings within this namespace.
pub monotonic_offset_ns: i64,
/// Offset added to CLOCK_BOOTTIME readings within this namespace.
pub boottime_offset_ns: i64,
/// Frozen flag: when set, all time reads return the value at freeze time.
/// Used for container checkpoint/restore (CRIU).
pub frozen: bool,
}
clock_gettime() path with namespace offsets:
| Clock | Formula | Namespace-affected? |
|---|---|---|
| `CLOCK_MONOTONIC` | `raw_monotonic_ns + current_task().nsproxy.time_ns.monotonic_offset_ns` | Yes |
| `CLOCK_BOOTTIME` | `raw_boottime_ns + current_task().nsproxy.time_ns.boottime_offset_ns` | Yes |
| `CLOCK_REALTIME` | `wall_time` (real wall clock, matches Linux) | No |
| `CLOCK_MONOTONIC_RAW` | Raw hardware counter, no namespace adjustment | No |
CLOCK_REALTIME is intentionally unaffected — it represents actual wall clock
time and shifting it per-namespace would break distributed protocols (TLS
certificate validation, Kerberos ticket lifetimes, NFS lease timers).
CLOCK_MONOTONIC_RAW is the raw hardware counter exposed for benchmarking; it
bypasses both NTP discipline and namespace offsets.
Frozen time (CRIU): When frozen == true, all four clocks return the
timestamp captured at freeze time. This is used during container
checkpoint/restore: the container process tree is frozen, checkpointed, migrated
to another host, and restored. The restored processes see time continuing from
the checkpoint instant (with the offset set to target_monotonic - source_monotonic
at restore time), avoiding spurious timer expirations and timeout-driven errors.
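The offset and frozen-time paths can be sketched as follows. The `frozen_ns` field is hypothetical: the `TimeNamespace` struct above does not name the captured timestamp, but some stored freeze-time value is needed for the frozen path to return anything.

```rust
/// Model of per-namespace clock adjustment. Mirrors the TimeNamespace
/// struct above, plus a hypothetical `frozen_ns` capture field.
struct TimeNamespace {
    monotonic_offset_ns: i64,
    boottime_offset_ns: i64,
    frozen: bool,
    frozen_ns: u64, // hypothetical: timestamp captured at freeze time
}

impl TimeNamespace {
    /// CLOCK_MONOTONIC as seen inside this namespace.
    fn clock_monotonic(&self, raw_monotonic_ns: u64) -> u64 {
        if self.frozen {
            return self.frozen_ns; // CRIU freeze: time stands still
        }
        (raw_monotonic_ns as i64 + self.monotonic_offset_ns) as u64
    }

    /// CLOCK_BOOTTIME as seen inside this namespace (same shape).
    fn clock_boottime(&self, raw_boottime_ns: u64) -> u64 {
        if self.frozen {
            return self.frozen_ns;
        }
        (raw_boottime_ns as i64 + self.boottime_offset_ns) as u64
    }
}
```

CLOCK_REALTIME and CLOCK_MONOTONIC_RAW have no corresponding method here: as the table below notes, they bypass namespace adjustment entirely.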
vDSO fast path: Each time namespace has a dedicated vDSO data page mapped
into user processes belonging to that namespace. The page contains pre-applied
offsets (the VvarPage.clock_monotonic_ns and clock_boottime_sec/nsec fields
already include the namespace offset) so userspace clock_gettime() never
crosses into the kernel for namespace-aware time reads. When a process calls
unshare(CLONE_NEWTIME), the kernel allocates a new VvarPage and remaps
it into the process's vDSO mapping via mremap() of the vDSO data region. The
kernel's timer tick handler updates all active vDSO data pages (one per distinct
time namespace with at least one live process).
Offset configuration: Offsets are set by writing to
/proc/[pid]/timens_offsets before the process enters the new time namespace
(i.e., after unshare(CLONE_NEWTIME) but before the first exec() or
clone() that would use the new namespace). The format matches Linux: one line
per clock, `<clock-id> <offset-secs> <offset-nanosecs>` (for example,
`monotonic 86400 0`). Once a process has entered the namespace (the first
exec() after unshare(CLONE_NEWTIME)), the offsets are immutable — they cannot
be changed for the lifetime of the namespace.
7.8.6 Clocksource Watchdog¶
A clocksource that reports incorrect time is worse than a slow one — it causes silent data corruption in timestamps, incorrect scheduler decisions, and broken network protocols.
Cross-validation: Every 500 ms (configurable), the kernel reads both the primary and secondary clocksource and compares the elapsed interval. If the primary's elapsed time deviates from the secondary's by more than a threshold (default: 100 ppm sustained over 5 consecutive checks), the primary is marked unstable.
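The drift check reduces to a ppm comparison with a consecutive-failure counter; a sketch using the default thresholds (100 ppm, 5 consecutive checks):

```rust
/// Tracks consecutive out-of-tolerance watchdog checks for the
/// primary clocksource.
struct Watchdog {
    consecutive_bad: u32,
}

impl Watchdog {
    /// One cross-validation check (run every 500 ms). Compares the
    /// elapsed interval reported by the primary against the secondary.
    /// Returns true when the primary should be marked unstable.
    fn check(&mut self, primary_elapsed_ns: u64, secondary_elapsed_ns: u64) -> bool {
        let diff = primary_elapsed_ns.abs_diff(secondary_elapsed_ns);
        // Deviation in parts per million, relative to the secondary
        // (reference) interval.
        let ppm = diff * 1_000_000 / secondary_elapsed_ns;
        if ppm > 100 {
            self.consecutive_bad += 1;
        } else {
            self.consecutive_bad = 0; // threshold requires sustained drift
        }
        self.consecutive_bad >= 5
    }
}
```

A single glitch (e.g., an SMI landing between the two reads) resets nothing permanent: the counter clears on the next in-tolerance check, so only sustained drift triggers the unstable transition.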
TSC instability detection: On x86, the TSC can be unreliable in several scenarios:
- Non-invariant TSC (pre-Nehalem Intel, pre-Zen AMD): frequency changes with P-state transitions.
- TSC halts during deep C-states on some older processors.
- TSC desynchronization across sockets on early multi-socket systems.
The watchdog detects all three cases. When the TSC is marked unstable:
- The kernel logs a warning: `clocksource: TSC marked unstable (drift >100ppm vs HPET)`.
- The active clocksource switches to HPET (or ACPI PM Timer if HPET is absent).
- The vDSO `clock_mode` is updated so userspace falls back to the syscall path (HPET is not readable from userspace without kernel MMIO mapping).
- The switch is atomic from the perspective of seqlock readers — one consistent snapshot uses TSC parameters, the next uses HPET parameters.
Capability-gated calibration: TSC frequency calibration (reading MSRs like
MSR_PLATFORM_INFO or calibrating against PIT/HPET) requires privileged
operations. Only umka-core holds the capability to read/write MSRs. Tier 1
drivers cannot influence clocksource selection — a compromised driver cannot
subvert system timekeeping.
7.8.7 Interaction with RT and Power Management¶
RT timer latency: Real-time tasks (Section 8.4) depend on bounded timer
expiry. On CPUs designated for RT workloads (isolcpus, nohz_full), hrtimer
expiry is serviced directly in hard-IRQ context with a preemption-disabled path
of bounded length. The worst-case path from hardware interrupt to hrtimer
callback execution is: interrupt entry (~200 cycles) + hrtimer tree lookup (O(1)
for the nearest timer) + callback invocation. On x86 with a local APIC timer
and an isolated CPU (no frequency scaling, shallow C-states, nohz_full),
the software path completes in under 1 μs. However, hardware-level
non-determinism (DRAM refresh cycles ~350ns worst-case, cache miss penalties,
memory controller contention) means the end-to-end observed latency on
real hardware is typically 1-5 μs under favorable conditions and up to 10 μs
under worst-case memory pressure. These figures match measured PREEMPT_RT Linux
performance on isolated cores. Section 8.4 details the hardware resource
partitioning (CAT, MBA, RDT) that UmkaOS uses to minimize hardware-level jitter.
When PreemptionModel::Realtime is active (Section 8.4), softirq-context
timers are promoted to hard-IRQ context for RT-priority hrtimers, ensuring they
cannot be delayed by threaded interrupt processing.
C-state interaction with clocksources: CPU power states affect timer behavior:
| C-state | Invariant TSC | Non-Invariant TSC | Generic Timer (ARM) | mtime (RISC-V) | Timebase (PPC) |
|---|---|---|---|---|---|
| C1 (halt) | Continues | May stop | Continues | Continues | Continues |
| C3+ (deep sleep) | Continues | Stops | Continues | Continues | Continues |
When a non-invariant TSC is detected and the system supports deep C-states, the kernel forces HPET as the clocksource and disables the vDSO fast path for timestamp reads. This is a correctness requirement, not a performance choice.
Tickless (nohz) mode: When a CPU has no pending timers and is running a single task (or is idle), the periodic tick is stopped entirely. The kernel reprograms the hardware timer to fire at the next actual event (nearest hrtimer expiry, or infinity if none). This eliminates unnecessary wakeups on isolated RT CPUs and idle CPUs.
Resuming the tick happens when: (a) a new timer is inserted on the CPU, (b) a
second task becomes runnable (the scheduler needs periodic load balancing), or
(c) an interrupt wakes the CPU from idle. The nohz implementation reuses Linux's
nohz_full semantics: user code on an isolated CPU can run for arbitrarily long
periods without a single kernel interrupt.
Power-aware timer placement: When timer coalescing (Section 7.8) groups timers, the kernel prefers placing them on CPUs that are already awake. Waking a CPU from C3+ costs ~100 μs and defeats the purpose of coalescing. The timer subsystem queries the per-CPU idle state before choosing a coalescing target.
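A minimal sketch of the placement preference, assuming a hypothetical per-CPU idle snapshot (shallower C-state wins, with a running CPU counted as C-state 0):

```rust
/// Hypothetical snapshot of one CPU's idle state: 0 = running/awake,
/// 1 = C1 (halt), 3 = C3+ (deep sleep). Lower is cheaper to target.
#[derive(Clone, Copy)]
pub struct CpuIdleSnapshot {
    pub cpu: u32,
    pub c_state: u8,
}

/// Pick a coalescing target: prefer an already-awake CPU, otherwise the
/// shallowest sleeper (waking from C3+ costs ~100 μs, which defeats the
/// point of coalescing in the first place).
pub fn pick_coalescing_target(cpus: &[CpuIdleSnapshot]) -> Option<u32> {
    cpus.iter().min_by_key(|s| s.c_state).map(|s| s.cpu)
}
```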
7.9 System Event Bus¶
The event bus is a core kernel facility that enables kernel subsystems and drivers to notify userspace of hardware and system state changes via a capability-gated, lock-free ring buffer mechanism. Netlink compatibility (for udev/systemd) is implemented in umka-sysapi (Section 19.5).
7.9.1 Event Subscription Model¶
// umka-core/src/event/mod.rs
/// System event types.
#[repr(u32)]
#[derive(Clone, Copy)]
pub enum EventType {
/// Battery level changed.
BatteryLevelChanged = 0,
/// AC adapter state changed (plugged/unplugged).
AcStateChanged = 1,
/// WiFi connection state changed.
WifiStateChanged = 2,
/// Bluetooth device paired/unpaired.
BluetoothDeviceChanged = 3,
/// USB device inserted/removed.
UsbDeviceChanged = 4,
/// Display hotplug (connected/disconnected).
DisplayHotplug = 5,
/// Thermal event (warning, critical).
ThermalEvent = 6,
/// Power profile changed.
PowerProfileChanged = 7,
/// Block device added/removed (for storage hotplug).
BlockDeviceChanged = 8,
/// Memory pressure event (for OOM-aware daemons).
MemoryPressure = 9,
/// Driver crash/recovery event (for monitoring daemons).
DriverRecovery = 10,
}
/// Event payload (exactly 256 bytes, cache-line friendly).
///
/// Layout verification with `#[repr(C)]`:
/// Offset 0: event_type (EventType = u32) = 4 bytes
/// Offset 4: _pad0 ([u8; 4]) = 4 bytes (explicit alignment padding)
/// Offset 8: timestamp_ns (u64) = 8 bytes
/// Offset 16: data (EventData, 240 bytes) = 240 bytes
/// Total: 4 + 4 + 8 + 240 = 256 bytes.
///
/// **Compile-time assertion**: `const_assert!(size_of::<Event>() == 256);`
// kernel-internal, not KABI
#[repr(C)]
pub struct Event {
/// Event type.
pub event_type: EventType,
/// Explicit padding for u64 alignment of timestamp_ns.
pub _pad0: [u8; 4],
/// Timestamp (monotonic ns).
pub timestamp_ns: u64,
/// Event-specific data.
pub data: EventData,
}
/// Event-specific data (union of all possible payloads).
///
/// Size assertion: `const_assert!(core::mem::size_of::<EventData>() == 240)`.
/// This ensures that adding a new variant cannot silently grow the union
/// (and therefore the `Event` struct) beyond 256 bytes.
///
/// **Info leak prevention**: Event structs are delivered to userspace via `read()`
/// on `/dev/event_bus`. Each variant is smaller than 240 bytes, leaving tail bytes.
/// The `Event::new()` constructor MUST zero-initialize the entire 256-byte struct
/// before writing the variant payload: `let mut e: Event = unsafe { core::mem::zeroed() };`
/// This ensures no kernel stack/heap data leaks through uninitialized tail bytes.
#[repr(C)]
pub union EventData {
pub battery: BatteryEvent,
pub ac: AcEvent,
pub wifi: WifiEvent,
pub bluetooth: BluetoothEvent,
pub usb: UsbEvent,
pub display: DisplayEvent,
pub thermal: ThermalEvent,
pub power_profile: PowerProfileEvent,
pub block_device: BlockDeviceEvent,
pub memory_pressure: MemoryPressureEvent,
pub driver_recovery: DriverRecoveryEvent,
_pad: [u8; 240], // EventData union size = 240 bytes; Event total = 4 + 4 + 8 + 240 = 256.
}
impl Event {
/// Zero-initialize an Event, then populate the header and variant payload.
/// The entire 256-byte struct is zeroed BEFORE writing any fields, ensuring
/// no kernel data leaks through uninitialized tail bytes when delivered to
/// userspace via `read()` on `/dev/event_bus`.
///
/// # Safety
/// `core::mem::zeroed()` is sound for `Event` because all-zero bytes are a
/// valid bit pattern for every field: `EventType`'s zero discriminant is
/// `BatteryLevelChanged`, and no field has a non-zero invariant.
pub fn new(event_type: EventType, timestamp_ns: u64) -> Self {
// SAFETY: Event is repr(C), all-zeroes is valid for every field.
let mut e: Self = unsafe { core::mem::zeroed() };
e.event_type = event_type;
e.timestamp_ns = timestamp_ns;
e
}
}
// Event variant structs below are kernel-internal payloads embedded in the
// EventData union (max 240 bytes per variant). The 256-byte Event struct
// has its own const_assert. Individual variants do not need separate asserts.
/// Battery event data.
#[repr(C)]
#[derive(Clone, Copy)] // union fields must be Copy
pub struct BatteryEvent {
/// Battery percentage (0-100).
pub percent: u8,
/// Charging state (0=discharging, 1=charging, 2=full).
pub charging: u8,
/// Time remaining in minutes (0xFFFF = unknown).
pub time_remaining_min: u16,
}
/// Display hotplug event data.
#[repr(C)]
#[derive(Clone, Copy)] // union fields must be Copy
pub struct DisplayEvent {
/// Connector ID.
pub connector_id: u32,
/// Event subtype: 0 = disconnected, 1 = connected.
pub connected: u8,
}
const_assert!(core::mem::size_of::<DisplayEvent>() == 8);
/// Block device event data.
#[repr(C)]
#[derive(Clone, Copy)] // union fields must be Copy
pub struct BlockDeviceEvent {
/// Major:minor encoded as `(major << 20) | minor`.
pub dev_id: u32,
/// Event subtype (0=removed, 1=added, 2=changed).
pub action: u8,
pub _pad: [u8; 3],
/// Device name (e.g., "sda", "nvme0n1"). Null-terminated.
pub name: [u8; 32],
}
const_assert!(core::mem::size_of::<BlockDeviceEvent>() == 40);
/// Memory pressure event data.
///
/// Layout (16 bytes, no implicit padding):
/// offset 0: available_pages (u64, 8 bytes)
/// offset 8: numa_node (i32, 4 bytes)
/// offset 12: level (u8, 1 byte)
/// offset 13: _pad ([u8; 3], 3 bytes — tail padding to 8-byte alignment)
/// Total: 16 bytes.
#[repr(C)]
#[derive(Clone, Copy)] // union fields must be Copy
pub struct MemoryPressureEvent {
/// Available memory in pages at the time of the event.
pub available_pages: u64,
/// NUMA node that triggered the event (-1 = system-wide).
pub numa_node: i32,
/// Pressure level (0=low, 1=medium, 2=critical).
pub level: u8,
/// Explicit tail padding to 8-byte struct alignment.
pub _pad: [u8; 3],
}
const_assert!(core::mem::size_of::<MemoryPressureEvent>() == 16);
/// Driver crash/recovery event data.
#[repr(C)]
#[derive(Clone, Copy)] // union fields must be Copy
pub struct DriverRecoveryEvent {
/// Device handle of the affected device.
pub device_id: u64,
/// Event subtype (0=crashed, 1=recovering, 2=recovered, 3=quarantined).
pub action: u8,
pub _pad: [u8; 7],
/// Driver name. Null-terminated, max 63 bytes.
pub driver_name: [u8; 64],
}
const_assert!(core::mem::size_of::<DriverRecoveryEvent>() == 80);
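The zero-init-then-populate pattern from `Event::new()` can be exercised standalone with a reduced event/union pair (sizes shrunk for the sketch; the `Mini*` names are illustrative, not the real types):

```rust
#[repr(C)]
#[derive(Clone, Copy)]
struct MiniBatteryEvent {
    percent: u8,
    charging: u8,
}

#[repr(C)]
union MiniEventData {
    battery: MiniBatteryEvent,
    _pad: [u8; 8],
}

#[repr(C)]
struct MiniEvent {
    event_type: u32,
    _pad0: [u8; 4],
    data: MiniEventData,
}

fn new_battery_event(percent: u8) -> MiniEvent {
    // Zero the ENTIRE struct first — tail bytes past the variant payload
    // must not carry stale kernel data when the event is copied to userspace.
    // SAFETY: all-zero bytes are valid for every field of MiniEvent.
    let mut e: MiniEvent = unsafe { core::mem::zeroed() };
    e.event_type = 0; // BatteryLevelChanged in the real enum
    e.data.battery = MiniBatteryEvent { percent, charging: 1 };
    e
}
```

Reading back the union's `_pad` view shows the bytes past the 2-byte payload are provably zero, which is exactly the info-leak property the text requires.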
7.9.2 Subscription via Capability¶
Processes subscribe to events via the capability system (Section 9.1):
/// Global event subscription manager. Singleton, initialized during boot.
///
/// **Collection policy**: `EventType` is `#[repr(u32)]` with 11 variants (0-10).
/// A fixed-size array indexed by `event_type as usize` provides O(1) lookup
/// with zero overhead (0-indexed: variant 0 = index 0, variant 10 = index 10).
/// Each entry holds a bounded list of subscribers.
/// The post_event path is warm (device state changes, not per-syscall).
pub struct EventManager {
/// Per-event-type subscription lists. Indexed by `EventType as usize`.
/// Fixed-size array (EVENT_TYPE_COUNT entries, one per EventType variant).
/// Each entry is a bounded list of subscribers, enforced at subscribe time.
subscriptions: [SpinLock<ArrayVec<SubscriptionInfo, MAX_SUBSCRIBERS_PER_EVENT>>; EVENT_TYPE_COUNT],
capability_manager: &'static CapabilityManager,
}
/// Maximum subscribers per event type. Enforced at subscribe() time;
/// exceeding this returns `EventError::TooManySubscribers`.
const MAX_SUBSCRIBERS_PER_EVENT: usize = 64;
/// Number of EventType enum variants.
const EVENT_TYPE_COUNT: usize = 11;
/// Per-subscriber state. Stored in the per-event-type subscription array.
struct SubscriptionInfo {
process_id: Pid,
ring: Weak<EventRing>,
dropped_events: AtomicU64,
}
impl EventManager {
/// Subscribe to a class of events. Returns an EventSubscription capability.
///
/// # Security
/// Requires `CAP_SYS_ADMIN` for system-wide events (thermal, power profile).
/// Requires `CAP_NET_ADMIN` for network events (WiFi, Bluetooth).
/// Battery, AC, USB, display events are unrestricted (visible to all processes).
pub fn subscribe(&self, event_type: EventType, process: &Process) -> Result<CapabilityToken, EventError> {
// Check capability grants.
match event_type {
EventType::ThermalEvent | EventType::PowerProfileChanged => {
if !process.has_capability(Capability::SysAdmin) {
return Err(EventError::PermissionDenied);
}
}
EventType::WifiStateChanged | EventType::BluetoothDeviceChanged => {
if !process.has_capability(Capability::NetAdmin) {
return Err(EventError::PermissionDenied);
}
}
EventType::MemoryPressure | EventType::DriverRecovery => {
if !process.has_capability(Capability::SysAdmin) {
return Err(EventError::PermissionDenied);
}
}
// Unrestricted — any process can subscribe to these event types.
// New variants must be explicitly listed here — deny by default.
// This match is intentionally exhaustive with no wildcard: adding a
// new EventType variant triggers a compile error, forcing the developer
// to assign it to an explicit capability tier above or list it here.
EventType::BatteryLevelChanged
| EventType::AcStateChanged
| EventType::UsbDeviceChanged
| EventType::DisplayHotplug
| EventType::BlockDeviceChanged => {}
}
// Allocate event ring buffer (per-process, 4 KB = ~16 events).
// The process holds the Arc (strong ref); subscription holds Weak.
// When the process exits, Arc is dropped, Weak::upgrade() returns None,
// and post_event() silently skips the dead subscriber.
let ring = Arc::new(EventRing::allocate(process)?);
// Mint capability token.
let cap_token = self.capability_manager.mint(
CapabilityType::EventSubscription,
CapabilityRights::READ,
ring.ring_id(),
)?;
// Register subscription with weak reference (AI-065: Arc/Weak lifecycle).
let weak = Arc::downgrade(&ring);
let idx = event_type as usize;
let mut guard = self.subscriptions[idx].lock();
if guard.is_full() {
return Err(EventError::TooManySubscribers);
}
guard.push(SubscriptionInfo {
process_id: process.pid(),
ring: weak,
dropped_events: AtomicU64::new(0),
});
// Install the strong Arc in the process's file descriptor table so
// userspace can read() events from it — the FD table holds the only
// strong reference, so the Weak registered above goes dead when the
// process exits. (cap_token is the capability granting access to the
// ring FD; `install_event_ring` is an illustrative name.)
process.install_event_ring(ring)?;
Ok(cap_token)
}
/// Post an event to all subscribers.
///
/// Takes `&Event` to avoid copying the 256-byte struct per call.
/// The ring buffer copies the event internally on each push.
pub fn post_event(&self, event: &Event) {
let idx = event.event_type as usize;
let guard = self.subscriptions[idx].lock();
for sub in guard.iter() {
if let Some(ring) = sub.ring.upgrade() {
// Write event to subscriber's ring buffer (lock-free push).
if ring.push(event).is_err() {
// Ring full: drop event (subscriber is too slow).
sub.dropped_events.fetch_add(1, Ordering::Relaxed);
}
}
}
}
}
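`EventRing::push()` itself is not defined in this section; its drop-on-full semantics can be modeled with a simple bounded ring (a single-threaded sketch under assumed names — the real ring is lock-free):

```rust
/// Single-threaded model of the subscriber ring: bounded, drop-on-full.
/// Indices only grow; `head - tail` is the current fill level.
pub struct RingModel<T: Copy> {
    buf: Vec<Option<T>>,
    head: usize, // total events pushed
    tail: usize, // total events consumed
}

impl<T: Copy> RingModel<T> {
    pub fn new(capacity: usize) -> Self {
        RingModel { buf: vec![None; capacity], head: 0, tail: 0 }
    }

    /// Push one event. Err(()) = ring full; post_event() then bumps the
    /// subscriber's dropped_events counter instead of ever blocking.
    pub fn push(&mut self, ev: T) -> Result<(), ()> {
        if self.head - self.tail == self.buf.len() {
            return Err(()); // subscriber too slow: drop, never block
        }
        let cap = self.buf.len();
        self.buf[self.head % cap] = Some(ev);
        self.head += 1;
        Ok(())
    }

    /// Consume the oldest event (what read() on /dev/event_bus would do).
    pub fn pop(&mut self) -> Option<T> {
        if self.head == self.tail {
            return None;
        }
        let cap = self.buf.len();
        let ev = self.buf[self.tail % cap].take();
        self.tail += 1;
        ev
    }
}
```

The key design point mirrored here is that the poster never waits on a slow subscriber: a full ring costs the subscriber an event, not the kernel a stall.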
For Netlink compatibility (udev, systemd integration), see Section 19.5.
7.9.3 Integration Points¶
| Subsystem | Events posted |
|---|---|
| Battery driver (Section 7.4) | BatteryLevelChanged, AcStateChanged |
| WiFi driver (Section 13.2, 10-drivers.md) | WifiStateChanged |
| Bluetooth (Section 13.14, 10-drivers.md) | BluetoothDeviceChanged |
| USB bus (Section 13.12) | UsbDeviceChanged |
| Display driver (Section 21.5) | DisplayHotplug |
| Thermal framework (Section 7.4) | ThermalEvent |
| Power profiles (Section 7.4) | PowerProfileChanged |
| Block layer (Section 15.2) | BlockDeviceChanged |
| OOM killer (Section 4.5) | MemoryPressure |
| Crash recovery (Section 11.9) | DriverRecovery |
7.10 Intent-Based Resource Management¶
7.10.1 The Abstraction Gap¶
UmkaOS has all the mechanisms for smart resource management:
- In-kernel inference engine (Section 22.6) for learned decisions
- Per-device utilization tracking (Section 22.1)
- Topology awareness (device registry, Section 11.4)
- Power metering (Section 7.7)
- Memory tier tracking (PageLocationTracker, Section 22.4)
- Network fabric topology (Section 5.2)
What's missing is the abstraction that ties these together. Currently, resources are managed imperatively: "give me 4 cores and 16GB RAM." The alternative: declare goals, let the kernel optimize.
7.10.2 Design: Resource Intents¶
// umka-core/src/intent/mod.rs
/// A resource intent declares WHAT the workload needs,
/// not HOW to allocate resources.
#[repr(C)]
pub struct ResourceIntent {
/// Target P99 SCHEDULING latency (nanoseconds).
/// This is the time from task becoming runnable to task getting CPU.
/// The kernel cannot measure application-level latency (it doesn't know
/// what an "operation" is). This metric is scheduling + I/O completion
/// latency — both kernel-observable.
/// Kernel adjusts CPU priority, memory placement, I/O scheduling.
/// 0 = no latency target (best-effort).
pub target_latency_ns: u64,
/// Target throughput (operations per second).
/// Kernel adjusts CPU allocation, I/O queue depth, batch sizes.
/// 0 = no throughput target (best-effort).
pub target_ops_per_sec: u64,
/// Availability requirement (basis points: 9999 = 99.99%).
/// Kernel adjusts redundancy, crash recovery priority.
/// 0 = no availability target.
/// Used by: cgroup knob `intent.availability` (Section 7.7.3),
/// crash recovery priority in Section 20.1 (higher availability_bp
/// = faster restart, more aggressive health monitoring).
pub availability_bp: u32,
/// Power efficiency preference (0 = max performance, 100 = max efficiency).
/// Kernel adjusts DVFS, core parking, accelerator clock.
/// 50 = balanced (default).
pub efficiency_preference: u32,
/// Data locality hint: where does this workload's data live?
/// Kernel uses this for NUMA placement and distributed scheduling.
/// Used by: cgroup knob `intent.data_affinity` (Section 7.7.3),
/// NUMA placement optimizer (Section 7.7.5 step 2b), and distributed
/// scheduling in Section 5.1.
pub data_affinity: DataAffinityHint,
/// Struct layout version. Enables future extension without breaking binary
/// compatibility: the kernel checks this field and interprets fields beyond
/// the base layout only if version >= the version that introduced them.
/// v1 = initial layout (this definition). Future versions extend into _reserved.
pub version: u32,
/// Reserved for future extension fields. New versions of ResourceIntent
/// consume bytes from this region. Zero-initialized by callers; the kernel
/// ignores non-zero bytes in positions it does not recognize for the given
/// `version` field value. When `version` is incremented, newly-defined
/// fields are parsed from specific offsets within `_reserved`. Sized to
/// make the struct exactly 64 bytes (u64-aligned) with no implicit
/// tail padding: 8+8+4+4+4+4+32 = 64.
pub _reserved: [u8; 32],
}
// Layout: 8+8+4+4+4+4+32 = 64 bytes.
const_assert!(size_of::<ResourceIntent>() == 64);
#[repr(u32)]
pub enum DataAffinityHint {
/// No preference. Kernel decides based on observation.
Auto = 0,
/// Data is primarily local (disk-bound workload).
Local = 1,
/// Data is distributed across nodes (distributed workload).
Distributed = 2,
/// Data is on accelerators (GPU-bound workload).
Accelerator = 3,
}
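As a concrete check of the layout, the struct compiles standalone and its 64-byte size can be asserted (definitions copied from above, doc comments elided; the constructor helper is illustrative):

```rust
#[repr(u32)]
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum DataAffinityHint {
    Auto = 0,
    Local = 1,
    Distributed = 2,
    Accelerator = 3,
}

#[repr(C)]
pub struct ResourceIntent {
    pub target_latency_ns: u64,
    pub target_ops_per_sec: u64,
    pub availability_bp: u32,
    pub efficiency_preference: u32,
    pub data_affinity: DataAffinityHint,
    pub version: u32,
    pub _reserved: [u8; 32],
}

/// Illustrative helper: a v1 intent for a latency-sensitive service
/// (5 ms P99 scheduling latency, balanced efficiency, local data).
pub fn latency_intent_5ms() -> ResourceIntent {
    ResourceIntent {
        target_latency_ns: 5_000_000,
        target_ops_per_sec: 0,     // best-effort throughput
        availability_bp: 0,        // no availability target
        efficiency_preference: 50, // balanced (default)
        data_affinity: DataAffinityHint::Local,
        version: 1,
        _reserved: [0; 32],        // zero-initialized per the contract
    }
}
```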
7.10.3 Cgroup Integration¶
/sys/fs/cgroup/<group>/intent.latency_ns
#   Target P99 latency in nanoseconds.
#   "0" = no target (default, pure imperative mode).
#   "5000000" = target 5ms P99 latency.
/sys/fs/cgroup/<group>/intent.throughput
#   Target operations per second.
#   "0" = no target.
/sys/fs/cgroup/<group>/intent.efficiency
#   0 = max performance, 100 = max efficiency, 50 = balanced.
#   Default: 50.
/sys/fs/cgroup/<group>/intent.availability
#   Availability target in basis points (0 = no target, 9999 = 99.99%).
#   Kernel adjusts crash recovery priority and health monitoring
#   frequency for drivers serving this cgroup. Higher values trigger
#   faster driver restart (Section 20.1) and redundant I/O path selection.
#   Default: 0 (no availability target).
#   Maps to: ResourceIntent.availability_bp
/sys/fs/cgroup/<group>/intent.data_affinity
#   Data locality hint for NUMA placement and distributed scheduling.
#   Values: "auto" (default), "local", "distributed", "accelerator"
#   "auto" = kernel observes memory access patterns and decides.
#   "local" = data is primarily on local storage (optimize for disk I/O).
#   "distributed" = data spans cluster nodes (optimize for network).
#   "accelerator" = data lives on accelerator memory (minimize transfers).
#   Maps to: ResourceIntent.data_affinity (DataAffinityHint enum)
/sys/fs/cgroup/<group>/intent.status
#   Read-only. Current intent satisfaction:
#   latency_met: true|false
#   latency_p99_actual_ns: <value>
#   throughput_met: true|false
#   throughput_actual: <value>
#   power_actual_mw: <value>
#   optimizer_action: <last action taken, e.g., "raised cpu.weight to 200">
#   adjustments_last_hour: <count>
#   contradiction: <none|description>
Multi-tenant access control: In K8s multi-tenant clusters, `intent.status` exposes internal workload metrics (actual latency, throughput, power draw) for each cgroup. A process in one container reading `/sys/fs/cgroup/other-tenant/intent.status` would expose the other tenant's workload profile — an information disclosure risk. Access control: `intent.status` is readable only by processes with `CAP_SYS_ADMIN` in the cgroup's user namespace, or by the cgroup owner (matching the cgroup's `uid`). This matches the access model for `/proc/PID/status` — visible to owner and root only. Non-owner reads return `EACCES`. The same access policy applies to `intent.explain` and `intent.adjustment_history`, which also contain tenant-specific operational data.
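The stated policy reduces to a small predicate (a sketch; the function name and the uid-based ownership check are assumptions layered on the text):

```rust
/// Who may read intent.status / intent.explain / intent.adjustment_history:
/// the cgroup owner, or a holder of CAP_SYS_ADMIN in the cgroup's user
/// namespace. Everyone else gets EACCES.
pub fn may_read_intent_status(
    reader_uid: u32,
    cgroup_owner_uid: u32,
    has_cap_sys_admin_in_userns: bool,
) -> Result<(), i32> {
    const EACCES: i32 = 13;
    if has_cap_sys_admin_in_userns || reader_uid == cgroup_owner_uid {
        Ok(())
    } else {
        Err(EACCES)
    }
}
```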
7.10.4 Objective Function and Conflict Resolution¶
Clarification: SCHED_INTENT is NOT a scheduling class.
`SCHED_INTENT` is an annotation layered on top of existing scheduling classes (EEVDF, RT, Deadline). A task's `SchedClass` is unchanged by intent assignment — an EEVDF task with `intent::LATENCY_SENSITIVE` remains EEVDF class.
The annotation modifies the task's effective eligibility calculation: a latency-sensitive EEVDF task receives a forward-eligible offset (effectively a negative lag boost via `place_entity()`) that prioritizes it WITHIN EEVDF without promoting it to RT class. EEVDF computes eligibility dynamically from `avg_vruntime()`, not from a stored field.
The "priority level 4" in the multi-class selection table below refers to the selection order among scheduler classes: a LATENCY_SENSITIVE EEVDF task is considered before standard EEVDF tasks but after all RT tasks. This is implemented by adjusting the task's vruntime on wakeup — there is no separate SCHED_INTENT class in the runqueue.
Intent Scheduling: Objective Function and Conflict Resolution
Objective function: minimize total latency for latency-sensitive tasks subject to meeting all SCHED_DEADLINE deadlines.
minimize: Σ latency(task_i) for all SCHED_INTENT/SCHED_NORMAL tasks
subject to: deadline_j met for all SCHED_DEADLINE tasks j
Scheduling class priority (highest to lowest):
| Priority | Class | Condition |
|---|---|---|
| 1 (highest) | SCHED_DEADLINE | CBS task with remaining budget and active deadline |
| 2 | SCHED_RT FIFO | Real-time, FIFO policy, priority 1-99 |
| 3 | SCHED_RT RR | Real-time, round-robin, priority 1-99 |
| 4 | SCHED_NORMAL (EEVDF) | Standard tasks (Section 7.1) |
| 5 | SCHED_BATCH | CPU-bound batch jobs (EEVDF with longer time slices) |
| 6 (lowest) | SCHED_IDLE | Run only when nothing else is runnable |
Note: Tasks with `intent::LATENCY_SENSITIVE` receive a forward-eligible offset within EEVDF (row 4 / SCHED_NORMAL). They do NOT form a separate scheduling class. The intent annotation adjusts the task's vruntime positioning via `place_entity()` on wakeup, giving latency-sensitive tasks priority within EEVDF without promoting them above standard EEVDF tasks in the class hierarchy.
Intent conflict resolution rules:
- Two SCHED_INTENT tasks competing for the same CPU: Schedule by EEVDF virtual time (same as SCHED_NORMAL). Annotations affect eligibility but not within-class ordering.
- `LATENCY_SENSITIVE` vs `THROUGHPUT` on same CPU: `LATENCY_SENSITIVE` runs first in the current scheduling quantum. `THROUGHPUT` tasks use the remaining time in the quantum.
- `POWER_EFFICIENT` hint: Migrate to an efficiency core (EAS decision, Section 7.2) if: (a) the latency budget allows the migration cost (~10-50 μs), AND (b) the efficiency core is not already at capacity. `POWER_EFFICIENT` is never honored if it would cause a `LATENCY_SENSITIVE` task to miss its target latency.
- `EXCLUSIVE_CPU` hint (two tasks competing): Round-robin at equal EEVDF priority. The hint is advisory — UmkaOS does not dedicate a CPU to a single SCHED_INTENT task unless it has an explicit CPU affinity set via `sched_setaffinity()`.
- `SCHED_INTENT` degradation: If the requested intent is unachievable (e.g., all CPUs committed to RT tasks), the task degrades to SCHED_NORMAL for that scheduling quantum. Degradation is logged to the observability layer (Section 20.1) and exposed via `/proc/PID/sched_intent_stats`.
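The class hierarchy from the selection table above can be expressed as a fixed pick order (a sketch with a simplified runnable-class query; the real scheduler polls per-class runqueues):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum SchedClass {
    Deadline, // CBS task with remaining budget and an active deadline
    RtFifo,
    RtRr,
    Normal,   // EEVDF; intent annotations adjust eligibility inside this class
    Batch,
    Idle,
}

/// Highest-priority class with at least one runnable task wins.
pub fn pick_next_class(runnable: &[SchedClass]) -> Option<SchedClass> {
    const ORDER: [SchedClass; 6] = [
        SchedClass::Deadline,
        SchedClass::RtFifo,
        SchedClass::RtRr,
        SchedClass::Normal,
        SchedClass::Batch,
        SchedClass::Idle,
    ];
    ORDER.iter().copied().find(|c| runnable.contains(c))
}
```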
7.10.5 The Optimization Loop¶
Every ~1 second (configurable):
1. IntentOptimizer collects metrics:
- Per-cgroup: actual latency (P50, P99), throughput, power
- Per-device: utilization, temperature, power
- Cluster-wide: node loads, memory pressure, network utilization
2. For each cgroup with intents:
a. Is the intent being met?
- latency_p99_actual <= target_latency_ns?
- throughput_actual >= target_ops_per_sec?
b. If not met → need more resources:
- Increase CPU allocation (raise cpu.weight or cpu.guarantee)
- Improve NUMA placement (migrate pages closer to running CPUs)
- Increase accelerator allocation (raise accel.compute.guarantee)
- Increase I/O priority
c. If met with headroom → can release resources:
- Reduce CPU allocation (lower cpu.weight)
- Lower frequency (save power)
- Free accelerator time for other workloads
3. Apply adjustments via existing cgroup knobs.
Intent layer is an OPTIMIZER that writes to existing imperative knobs.
It does NOT replace the imperative interface — it sits above it.
4. **Optimization algorithm** (gradient-descent-inspired, bounded):
- For each unmet intent, compute the **deficit**: e.g.,
`latency_deficit = latency_p99_actual - target_latency_ns`.
- Map deficit to resource adjustment via a **PD controller**:
`delta_weight = K_p × (deficit / target) + K_d × d(deficit / target) / dt`,
where `K_p` is a per-resource-type proportional gain constant (default:
K_p=0.5 for CPU weight, K_p=0.3 for frequency) and `K_d` is the
derivative gain (default: K_d=1.0 for CPU weight, K_d=0.6 for frequency).
See [Section 7.10](#intent-based-resource-management--stability-analysis) for the complete control law
and gain derivation.
- **Clamp** adjustments to avoid oscillation: each iteration adjusts
by at most ±20% of the current allocation (`MAX_INTENT_ADJUSTMENT = 0.20`).
Multiple iterations converge geometrically.
- **Convergence criterion**: intent is "met" when the metric is within
10% of the target for 3 consecutive measurement windows. Once met,
the optimizer enters a **hold** state for that cgroup (no further
adjustments until the metric drifts outside the 10% band).
- **`compute.weight` decomposition**: When the intent specifies
`accel.compute.weight`, the optimizer distributes across CPU and
accelerator proportionally to their current utilization ratio:
`cpu_share = cpu_util / (cpu_util + accel_util)`.
- **Safety bound**: The optimizer never reduces allocation below the
cgroup's `*.min` guarantee or above the `*.max` ceiling.
Stability controls (prevent oscillation/hunting):
- Hysteresis: don't adjust unless delta exceeds 10% of current value.
- Minimum hold time: no changes within 5 seconds of last adjustment.
- Damping: exponential backoff if last 3 adjustments didn't converge.
- Max adjustment rate: at most ±20% change per optimization cycle.
5. Conflicting intents: when multiple cgroups declare intents that cannot
all be satisfied simultaneously (insufficient resources):
- Intents are BEST-EFFORT, not guarantees.
- Priority follows existing cpu.weight / accel.compute.weight hierarchy.
- Higher-weight cgroups get intent satisfaction first.
- Unsatisfied intents are reported in intent.status (latency_met: false).
- The optimizer does NOT starve low-priority cgroups — it respects
existing cgroup min guarantees (cpu.min, memory.min).
6. Policy priority ordering (prevents conflicts between subsystems):
Priority (highest to lowest):
a. Hardware limits (thermal throttle, voltage limits) — immutable
b. Admin-configured cgroup limits (cpu.max, power.max) — hard ceiling
c. Power budget enforcement (Section 7.4) — watt cap
d. Intent optimization (this section) — soft optimization
e. EAS energy optimization (Section 7.1.5) — per-task core selection
Each layer can only adjust WITHIN the ceiling set by the layer above.
Power budget is a hard constraint; intents work within it. No oscillation.
When power budgeting and EAS conflict (e.g., power budgeting throttles a
CPU domain that EAS prefers), power budgeting takes precedence — the EAS
migration is deferred until the power budget is satisfied. This may
temporarily route tasks to less energy-efficient cores, but prevents
thermal throttling and power supply overload, which are correctness
constraints rather than optimization goals.
7. Log adjustments to /sys/kernel/umka/intent/adjustment_log
for observability and debugging.
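One optimizer tick for a single dimension can be sketched as a PD step with the ±20% clamp (a standalone model using the gains quoted above, not the kthread code):

```rust
const MAX_INTENT_ADJUSTMENT: f64 = 0.20; // per-cycle clamp from the text

pub struct PdStep {
    k_p: f64,        // proportional gain (0.5 for CPU weight)
    k_d: f64,        // derivative gain (1.0 for CPU weight)
    prev_error: f64, // normalized error from the previous tick (0.0 initially)
}

impl PdStep {
    pub fn new(k_p: f64, k_d: f64) -> Self {
        PdStep { k_p, k_d, prev_error: 0.0 }
    }

    /// Returns the adjusted allocation for one tick. `actual`/`target` are
    /// in the metric's own units (e.g., ns of P99 latency); `dt_s` is the
    /// tick period in seconds.
    pub fn step(&mut self, actual: f64, target: f64, current_alloc: f64, dt_s: f64) -> f64 {
        let error = (actual - target) / target; // normalized deficit
        let d_error = (error - self.prev_error) / dt_s;
        self.prev_error = error;
        // delta_weight = K_p * (deficit/target) + K_d * d(deficit/target)/dt
        let delta = self.k_p * error + self.k_d * d_error;
        // Clamp to ±20% of the current allocation to prevent oscillation.
        let clamped = delta.clamp(-MAX_INTENT_ADJUSTMENT, MAX_INTENT_ADJUSTMENT);
        current_alloc * (1.0 + clamped)
    }
}
```

With a 20% latency deficit and K_p = 0.5, one tick raises the weight by 10%; an arbitrarily large deficit still moves the allocation by at most 20% per cycle, so convergence is geometric rather than oscillatory.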
The in-kernel inference engine (Section 22.6) powers the optimization. The "Intent I/O Scheduler" and "Intent Page Prefetch" models (Section 22.6) are use cases of intent-based management.
IntentOptimizer data structures.
The PD controller operates independently per resource dimension. Each dimension maps to one kernel control variable that the optimizer adjusts:
/// Resource dimension: identifies which resource knob the PD controller
/// tunes for a given cgroup. Each dimension has independent gain constants
/// (`k_p`, `k_d`) and error history, allowing different convergence rates
/// for CPU vs. I/O vs. accelerator workloads.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ResourceDim {
/// CPU weight (cgroup `cpu.weight`). The optimizer adjusts the
/// EEVDF weight to converge on `target_latency_ns`.
Cpu = 0,
/// Memory placement tier (DRAM vs. CXL vs. compressed). The optimizer
/// adjusts page promotion/demotion thresholds to meet the latency target.
Memory = 1,
/// Block I/O priority (`io.weight`, I/O scheduler class). Adjusts
/// I/O scheduling parameters to converge on `target_ops_per_sec`.
Io = 2,
/// Network bandwidth priority (TC qdisc class, BPF cgroup egress
/// rate). Adjusts to converge on network throughput or latency targets.
Net = 3,
/// Accelerator allocation (GPU/inference engine time-slice fraction).
/// Adjusts `AccelWeight` to converge on accelerator utilization targets.
Accel = 4,
}
impl ResourceDim {
/// Total number of resource dimensions. Used to size fixed-length arrays
/// (`[PdControllerState; ResourceDim::COUNT]`, `[f64; ResourceDim::COUNT]`).
pub const COUNT: usize = 5;
}
The IntentOptimizer is a singleton kernel subsystem that owns the optimization loop
state. It runs as a dedicated kernel thread (kthread/intent_optimizer) and is never
instantiated more than once.
/// Top-level state for the intent optimization subsystem.
///
/// One instance exists system-wide, owned by the `intent_optimizer` kernel
/// thread. All fields are accessed only from that thread except where noted.
pub struct IntentOptimizerState {
/// PD controller state per resource dimension.
/// Each dimension (CPU weight, frequency, accelerator allocation, I/O
/// priority) has independent gain constants and error history.
pub controller: [PdControllerState; ResourceDim::COUNT],
/// Per-cgroup intent control state. Keyed by cgroup ID (u64).
/// XArray: O(1) lookup by integer cgroup ID, RCU-compatible reads.
/// Allocated on first intent assignment, freed when all intents are cleared.
pub cgroup_states: XArray<IntentControlState>,
/// Monotonic nanosecond timestamp of the last optimizer tick.
/// Used to detect missed ticks and adjust derivative computation.
pub last_tick_ns: u64,
/// Whether the optimizer is in the "hold" state (all intents converged).
/// When true, the optimizer still runs on its timer but skips computation
/// unless a metric drifts outside the convergence band.
pub all_converged: bool,
/// Configuration: optimizer tick period in nanoseconds. Default: 1_000_000_000 (1s).
/// Adjustable at runtime via `/sys/kernel/umka/intent/tick_period_ns`.
pub tick_period_ns: u64,
}
/// PD controller state for one resource dimension (CPU weight, frequency, etc.).
///
/// # Floating-Point Usage
///
/// This struct contains `f64` fields. Floating-point arithmetic in kernel context
/// requires FPU state to be saved/restored across preemption points and context
/// switches. To avoid this overhead on the common path, this struct is only ever
/// accessed from the `IntentOptimizer` kthread — a dedicated kernel thread that
/// runs with FPU context enabled in task (non-interrupt) context.
///
/// **Forbidden contexts**: interrupt handlers, RCU callbacks, softirq handlers,
/// spinlock critical sections, or any preemption-disabled section. Violations cause
/// FPU state corruption on preemptible kernels.
///
/// Compile-time enforcement: `PdControllerState` is `!Send` (via `PhantomData<*mut ()>`)
/// — only the single `IntentOptimizer` kthread may hold a reference to it.
pub struct PdControllerState {
/// Proportional gain constant. Default depends on dimension
/// (0.5 for CPU weight, 0.3 for frequency). May be halved by the
/// exponential backoff mechanism after 3 consecutive non-convergent ticks.
pub k_p: f64,
    /// Derivative gain constant. Set to `k_p * tau_d` where `tau_d` is the
    /// dominant feedback delay for this dimension (see Section 7.10.5.1).
pub k_d: f64,
/// Setpoint: the target value that the controller drives toward.
/// For latency dimensions, this is `target_latency_ns`. For throughput,
/// `target_ops_per_sec`. Updated when the cgroup's intent changes.
pub setpoint: f64,
/// Normalized error from the previous optimizer tick.
/// Used to compute the discrete derivative `d_error[k]`.
/// Initialized to 0.0 on first tick after intent assignment.
pub prev_error: f64,
/// Accumulated integral term (reserved for future PID extension).
/// Currently unused — the optimizer runs a PD controller only.
/// Retained in the struct to avoid a layout change if PID is needed.
pub integral: f64,
/// Current effective gain multiplier, reduced by exponential backoff.
/// Starts at 1.0, halved after 3 consecutive non-convergent adjustments,
/// reset to 1.0 on convergence.
pub gain_multiplier: f64,
/// `!Send` marker: prevents this struct from being moved across threads.
/// `PdControllerState` must only be accessed from the `IntentOptimizer`
/// kthread, which holds FPU context. Transferring it to another thread
/// would bypass this invariant and risk FPU state corruption.
_not_send: core::marker::PhantomData<*mut ()>,
}
Observation flow: observe_scheduler_metric().
Scheduler observations flow into the optimizer through per-CPU observation rings. Each CPU writes metrics locally without contention; the optimizer thread reads them in bulk during its tick.
/// Write a scheduler observation to the per-CPU observation ring.
///
/// Called from the scheduler hot path (task tick, context switch, wakeup)
/// with preemption disabled. The write is O(1) — a single slot in a
/// pre-allocated per-CPU ring buffer. If the ring is full, the oldest
/// unread observation is silently overwritten (lossy under extreme load,
/// but the optimizer is statistical and tolerates dropped samples).
///
/// # Arguments
///
/// * `cpu` — The CPU producing the observation. Must be the current CPU
/// (enforced by the `PreemptGuard` the caller holds).
/// * `metric_type` — The type of metric being reported.
/// * `value` — The metric value (nanoseconds for latency, count for
/// throughput, 0-1024 for utilization).
///
/// # Performance
///
/// ~5-10 ns per call (single cache-line write to per-CPU ring). Zero
/// allocation. No locks. The per-CPU ring is sized to hold 256 entries
/// (one cache line per entry), sufficient for ~250ms of observations at
/// 1000 observations/second before wraparound.
pub fn observe_scheduler_metric(cpu: CpuId, metric_type: SchedMetricType, value: u64) {
let ring = per_cpu!(sched_observation_ring, cpu);
ring.push(SchedObservation {
timestamp_ns: arch::current::cpu::read_timestamp(),
metric_type,
value,
});
}
/// Scheduler metric types reported to the intent optimizer.
#[derive(Copy, Clone, Debug)]
pub enum SchedMetricType {
/// Task wakeup-to-run latency in nanoseconds.
WakeupLatency,
/// Run queue depth (number of runnable tasks) at observation time.
RunQueueDepth,
/// CPU utilization (PELT-smoothed, 0-1024 scale).
CpuUtilization,
/// Context switch count since last observation.
ContextSwitchCount,
/// EAS migration decision (1 = migrated to efficiency core, 0 = stayed).
EasMigration,
/// Accelerator compute utilization (0-1000 scale, per-device).
/// Reported by the accelerator scheduler ([Section 22.1](22-accelerators.md#unified-accelerator-framework)).
AccelUtilization,
/// Accelerator command queue depth (per-device).
AccelQueueDepth,
/// Accelerator power draw in milliwatts (per-device).
AccelPowerDraw,
/// Accelerator temperature in millidegrees Celsius (per-device).
AccelTemperature,
}
/// A single scheduler observation written to the per-CPU ring.
///
/// Aligned to cache line (64 bytes) to prevent false sharing between per-CPU
/// observation slots. The padding beyond the ~24 bytes of data fields is the
/// cost of contention-free concurrent writes from independent CPUs.
#[repr(C, align(64))]
pub struct SchedObservation {
/// Monotonic timestamp in nanoseconds.
pub timestamp_ns: u64,
/// The metric being reported.
pub metric_type: SchedMetricType,
/// The metric value.
pub value: u64,
}
// kernel-internal ML observation. align(64) pads to 64 bytes (one cache line).
const_assert!(size_of::<SchedObservation>() == 64);
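The lossy ring semantics described in `observe_scheduler_metric()` (O(1) push, oldest unread entry silently overwritten when full) can be sketched in userspace. The `ObservationRing` type below is our simplified stand-in for the kernel's per-CPU ring, with a toy capacity of 4 instead of 256:

```rust
// Minimal sketch of a lossy observation ring: a fixed-size buffer where a
// full ring overwrites the oldest unread entry, so the writer (the
// scheduler hot path) never blocks and never allocates.
struct ObservationRing {
    slots: [u64; 4], // toy capacity; the real per-CPU ring holds 256 entries
    head: usize,     // next slot to write (and overwrite, when full)
    len: usize,      // unread entries, capped at capacity
}

impl ObservationRing {
    fn new() -> Self {
        ObservationRing { slots: [0; 4], head: 0, len: 0 }
    }

    // O(1) push; silently drops the oldest unread entry when full.
    fn push(&mut self, value: u64) {
        self.slots[self.head] = value;
        self.head = (self.head + 1) % self.slots.len();
        if self.len < self.slots.len() {
            self.len += 1;
        }
    }

    // Pop the oldest unread entry, if any (consumer side: optimizer drain).
    fn pop(&mut self) -> Option<u64> {
        if self.len == 0 {
            return None;
        }
        let cap = self.slots.len();
        let tail = (self.head + cap - self.len) % cap;
        self.len -= 1;
        Some(self.slots[tail])
    }
}

fn main() {
    let mut ring = ObservationRing::new();
    for v in 1..=6 {
        ring.push(v); // capacity 4: observations 1 and 2 are overwritten
    }
    assert_eq!(ring.pop(), Some(3));
    assert_eq!(ring.pop(), Some(4));
    assert_eq!(ring.pop(), Some(5));
    assert_eq!(ring.pop(), Some(6));
    assert_eq!(ring.pop(), None);
}
```

Dropping the oldest (rather than rejecting the newest) keeps the ring biased toward recent samples, which is what a statistical consumer wants.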
Pipeline to KernelObservation bus:
observe_scheduler_metric() is a typed convenience wrapper around the
observe_kernel! macro (Section 23.1).
It writes directly to the per-CPU ObservationRing as a KernelObservation with
subsystem = SubsystemId::Scheduler and the metric type mapped to obs_type.
There is no separate SchedObservation ring — the SchedObservation struct is
the internal format that observe_scheduler_metric() converts to
KernelObservation::features[] before pushing to the shared ring:
- `features[0]` = `metric_type` as `i32`
- `features[1]` = `value` low 32 bits as `i32`
- `features[2]` = `value` high 32 bits as `i32` (for values > 2^31)
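The lane split round-trips losslessly. A sketch of the conversion (helper names `pack_value`/`unpack_value` are ours, not kernel API):

```rust
// Round-trip of the features[] packing: a u64 metric value split into two
// i32 lanes (low / high 32 bits) and reconstructed on the consumer side.
fn pack_value(value: u64) -> (i32, i32) {
    let lo = (value & 0xFFFF_FFFF) as u32 as i32; // features[1]
    let hi = (value >> 32) as u32 as i32;         // features[2]
    (lo, hi)
}

fn unpack_value(lo: i32, hi: i32) -> u64 {
    // Cast through u32 so sign extension cannot corrupt the lanes.
    ((hi as u32 as u64) << 32) | (lo as u32 as u64)
}

fn main() {
    // A latency sample larger than 2^31 ns survives the i32 lanes intact.
    let value: u64 = 5_000_000_000;
    let (lo, hi) = pack_value(value);
    assert_eq!(unpack_value(lo, hi), value);

    // A value with the low lane's sign bit set also round-trips.
    let value: u64 = 0xFFFF_FFFF;
    let (lo, hi) = pack_value(value);
    assert_eq!(unpack_value(lo, hi), value);
}
```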
The Tier 2 ML policy service reads KernelObservation entries from the
shared ObservationRing (4096 entries, mmap'd read-only). This single-ring
design avoids a second ring and an intermediate aggregation thread.
The observe_kernel! macro (zero-cost when no consumer is attached) is the generic
entry point; observe_scheduler_metric() is the scheduler-specific typed wrapper.
Parameter update propagation: apply_policy_update().
When the optimizer (or an external Tier 2 ML policy service) computes a new parameter value, it flows through a validated, bounded update path:
/// Apply a policy-driven parameter update to a kernel tunable.
///
/// This is the sole entry point for runtime parameter changes from both
/// the in-kernel intent optimizer and external Tier 2 policy services.
/// All updates are validated and bounded before application.
///
/// # Validation
///
/// 1. The parameter must be registered in the Kernel Tunable Parameter
/// Store ([Section 23.1](23-ml-policy.md#aiml-policy-framework-closed-loop-kernel-intelligence)).
/// 2. The new value must be within the parameter's declared `[min, max]`
/// bounds. Out-of-bounds values are clamped (not rejected) and the
/// clamping is logged.
/// 3. The rate of change must not exceed `MAX_INTENT_ADJUSTMENT` (20%)
/// per optimizer tick. If the requested change exceeds this, it is
/// clamped to the maximum rate.
/// 4. The caller must hold `CAP_ML_TUNE` (for Tier 2 services) or be
/// the intent optimizer kernel thread (implicitly authorized).
///
/// # Propagation
///
/// After validation, the update is converted to fixed-point integer
/// representation and written to the parameter's `AtomicI64` backing
/// store in the tunable registry. The conversion uses a fixed-point
/// scale factor:
///
/// ```text
/// PARAM_SCALE = 1_000_000 (microsecond / permille precision)
/// stored = AtomicI64::store((value * PARAM_SCALE).round() as i64, Relaxed)
/// read = AtomicI64::load(Relaxed) as f64 / PARAM_SCALE
/// ```
///
/// This avoids `AtomicF64` (which does not exist in the Rust atomic
/// model) while preserving sub-ppm precision for all tunable parameters.
/// The scale factor is a global constant shared by all `KernelTunableParam`
/// consumers — there is no per-parameter scale. `PARAM_SCALE = 1_000_000`
/// gives 6 decimal digits of precision, sufficient for all scheduler
/// and policy parameters (whose bounds are expressed in permille or
/// microseconds).
///
/// The affected subsystem reads the new value on its next access — there
/// is no explicit notification. This is safe because all tunable parameters
/// are designed to be read speculatively (the scheduler reads
/// `eevdf_weight_scale` on every `pick_next_task`, not cached).
///
/// # Auto-decay
///
/// Every parameter update carries an expiry timestamp. If no new update
/// arrives within the expiry window (default: 60 seconds), the parameter
/// reverts to its compiled-in default. This prevents a crashed ML service
/// from leaving the kernel in a mis-tuned state indefinitely.
/// Fixed-point scale factor for f64 ↔ AtomicI64 conversion.
///
/// All `KernelTunableParam` values are stored as `(value * PARAM_SCALE).round() as i64`
/// in the `AtomicI64` backing store. Consumers reconstruct f64 via
/// `param.current.load(Relaxed) as f64 / PARAM_SCALE as f64`.
///
/// 1_000_000 gives 6 decimal digits of precision — sufficient for all
/// scheduler parameters (permille gains, microsecond latencies).
pub const PARAM_SCALE: i64 = 1_000_000;
/// # FPU Safety
///
/// This function uses `f64` arithmetic. It MUST be called ONLY from the
/// `IntentOptimizer` kthread or a Tier 2 policy service's kernel entry
/// point (which runs in task context with FPU enabled). It MUST NOT be
/// called from interrupt context, softirq, RCU callbacks, or any
/// preemption-disabled section.
///
/// The `IntentOptimizer` kthread calls `kernel_fpu_begin()` before
/// entering its policy evaluation loop and `kernel_fpu_end()` on exit.
/// This saves/restores the user-mode FPU state once per evaluation
/// cycle (not per parameter update), amortizing the cost across all
/// parameter updates in the cycle.
///
/// Debug builds assert `current_thread_is_intent_optimizer()` at entry.
pub fn apply_policy_update(
param: &KernelTunableParam,
value: f64,
expiry_ns: u64,
) -> Result<(), PolicyUpdateError> {
// 1. Bounds check and clamp (in f64 domain, against scaled-back bounds).
let min_f = param.min_value as f64 / PARAM_SCALE as f64;
let max_f = param.max_value as f64 / PARAM_SCALE as f64;
let clamped = value.clamp(min_f, max_f);
if (clamped - value).abs() > f64::EPSILON {
log_clamped_update(param, value, clamped);
}
// 2. Rate-of-change check and clamp.
let current_i = param.current.load(core::sync::atomic::Ordering::Relaxed);
let current = current_i as f64 / PARAM_SCALE as f64;
let max_delta = current.abs() * MAX_INTENT_ADJUSTMENT;
let delta = (clamped - current).clamp(-max_delta, max_delta);
let final_value = current + delta;
// 3. Convert to fixed-point and write to AtomicI64 backing store.
let stored = (final_value * PARAM_SCALE as f64).round() as i64;
param.set_with_expiry(stored, expiry_ns);
Ok(())
}
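The validation pipeline (bounds clamp, rate-of-change clamp, fixed-point store) can be exercised in isolation. The sketch below is a userspace simulation under the constants from the text; a plain `i64` stands in for the `AtomicI64` registry slot, and `apply_update` is our illustrative helper, not the kernel function:

```rust
// Userspace sketch of the apply_policy_update() path: bounds clamp,
// ±20% rate-of-change limit, fixed-point conversion to the i64 store.
const PARAM_SCALE: i64 = 1_000_000;
const MAX_INTENT_ADJUSTMENT: f64 = 0.20;

fn apply_update(current_stored: i64, min: f64, max: f64, requested: f64) -> i64 {
    // 1. Bounds clamp in the f64 domain.
    let clamped = requested.clamp(min, max);
    // 2. Rate-of-change clamp: at most ±20% of the current value per tick.
    let current = current_stored as f64 / PARAM_SCALE as f64;
    let max_delta = current.abs() * MAX_INTENT_ADJUSTMENT;
    let delta = (clamped - current).clamp(-max_delta, max_delta);
    // 3. Fixed-point conversion back to the i64 backing store.
    ((current + delta) * PARAM_SCALE as f64).round() as i64
}

fn main() {
    // Current value 100.0; a request for 500.0 is rate-limited to 120.0.
    let stored = 100 * PARAM_SCALE;
    assert_eq!(apply_update(stored, 0.0, 1000.0, 500.0), 120 * PARAM_SCALE);
    // A request below the declared minimum is first clamped to the bound.
    assert_eq!(apply_update(stored, 90.0, 1000.0, 10.0), 90 * PARAM_SCALE);
}
```

Note the ordering: bounds are applied before the rate limit, so an out-of-bounds request can never "borrow" extra rate headroom from the clamp.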
Wake mechanism: the optimizer kernel thread.
The intent optimizer runs as a dedicated kernel thread (kthread/intent_optimizer)
created during boot. It does not busy-poll — it sleeps and is woken by one of two
mechanisms:
- **Periodic timer.** A high-resolution timer fires every `tick_period_ns` (default: 1 second). This is the primary wake source during normal operation. The timer is set with `HRTIMER_MODE_REL` and re-armed at the end of each optimizer tick to avoid drift accumulation.
- **Threshold-crossing event.** When a per-CPU observation ring detects that a metric has crossed a critical threshold (e.g., wakeup latency exceeds 2x the target for any cgroup with an active latency intent), it sets an atomic flag (`intent_optimizer_wake_pending`) and sends an IPI to the CPU running the optimizer thread. This wakes the optimizer ahead of the next timer tick, enabling faster response to sudden load changes. The threshold check is a single comparison in the `observe_scheduler_metric()` hot path and is skipped entirely when no cgroup has active intents (checked via a global atomic `active_intent_count`).
Threshold Crossing Detection and Wake Protocol:
The intent optimizer is a background kernel thread (umka_intent_optimizer, SCHED_IDLE priority) that re-evaluates scheduling policy hints based on observed workload behavior.
Threshold monitoring: Each schedulable entity tracks a set of behavioral metrics updated by the scheduler hot path:
- runtime_last_window_us: CPU time consumed in the last 100ms window.
- iowait_fraction: fraction of runnable time spent in I/O wait.
- cache_miss_rate: L3 miss rate sampled via per-CPU PMU (updated every 10ms tick).
- wakeup_rate_hz: wakeups per second (exponential moving average).
Threshold crossing detection (lock-free, hot path): On each scheduler tick, a fast check compares each metric against its registered threshold. The comparison uses a hysteresis band (5% above the "enter" threshold, 5% below the "exit" threshold) to prevent flapping. The check is implemented as:
if (abs(metric - threshold) > hysteresis && !already_pending):
per_cpu(intent_sample_pending) = true
// No IPI yet — piggyback on the next scheduler yield point.
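One way to realize the 5% enter/exit band is a two-threshold monitor: the pending flag arms only above the enter threshold and disarms only below the exit threshold, so samples oscillating inside the band cannot flap. A minimal sketch (the `ThresholdMonitor` type is our illustration, not the kernel's data structure):

```rust
// Sketch of the 5% hysteresis band: arm 5% above the registered
// threshold, disarm 5% below it. Values inside the band keep the
// current state, preventing flapping on noisy metrics.
struct ThresholdMonitor {
    threshold: f64,
    pending: bool,
}

impl ThresholdMonitor {
    fn observe(&mut self, metric: f64) -> bool {
        let enter = self.threshold * 1.05; // 5% above: arm
        let exit = self.threshold * 0.95;  // 5% below: disarm
        if !self.pending && metric > enter {
            self.pending = true; // wake deferred to next scheduler yield point
        } else if self.pending && metric < exit {
            self.pending = false;
        }
        self.pending
    }
}

fn main() {
    let mut mon = ThresholdMonitor { threshold: 100.0, pending: false };
    assert!(!mon.observe(103.0)); // inside the band: no trigger
    assert!(mon.observe(106.0));  // crossed enter threshold: pending set
    assert!(mon.observe(98.0));   // inside the band: still pending
    assert!(!mon.observe(94.0));  // below exit threshold: cleared
}
```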
IPI targeting (deferred, not on hot path):
The intent optimizer thread sleeps on intent_wait_queue. It is woken by:
1. Scheduler yield points: schedule() checks per_cpu(intent_sample_pending) for the current CPU and calls intent_wakeup_if_pending() — a non-IPI local wakeup (enqueues the optimizer thread on the current CPU's runqueue if it's not already runnable). O(1), no cross-CPU traffic.
2. Cross-CPU threshold: If a threshold was crossed on CPU B but CPU B's scheduler is idle (no local yield point imminent), the timer tick on CPU B sends a SCHEDULER_KICK IPI to the optimizer thread's home CPU (the optimizer is pinned to CPU 0 by default, or the least-loaded CPU if configured). IPIs are rate-limited to 1 per 100ms per crossing to prevent IPI storms.
Batch processing: The optimizer wakes at most 100 times per second (configurable). Each wakeup processes up to 64 pending threshold crossings from all CPUs before returning to sleep.
/// The intent optimizer kernel thread entry point.
///
/// This function runs in an infinite loop, sleeping between optimizer
/// ticks. It is created once during boot and runs for the lifetime of
/// the kernel.
fn intent_optimizer_thread() -> ! {
    let mut state = IntentOptimizerState::new();
let timer = HrTimer::new(state.tick_period_ns, HrTimerMode::Rel);
loop {
// Sleep until timer fires or threshold-crossing wakes us early.
timer.wait_or_wake(&INTENT_OPTIMIZER_WAKE_PENDING);
// Drain all per-CPU observation rings into aggregated metrics.
for cpu in 0..num_online_cpus() {
let ring = per_cpu!(sched_observation_ring, cpu);
while let Some(obs) = ring.pop() {
state.aggregate_observation(cpu, &obs);
}
}
        // Run the optimization loop (steps 1-6 from the pseudocode above).
        // Snapshot the cgroup IDs first so the loop body can borrow
        // `state` mutably without aliasing the iterator.
        let cgroup_ids: Vec<u64> = state.cgroup_states.keys().collect();
        for cgroup_id in cgroup_ids {
            let metrics = state.collect_cgroup_metrics(cgroup_id);
            let adjustments = state.compute_adjustments(cgroup_id, &metrics);
            state.apply_cgroup_adjustments(cgroup_id, &adjustments);
        }
// Re-arm the periodic timer.
timer.rearm(state.tick_period_ns);
}
}
The optimizer thread runs at SCHED_IDLE priority (consistent with its declaration above). It is
not a real-time thread — it must not interfere with RT, deadline, or normal workloads. If the
optimizer tick takes longer than expected (e.g., due to a large number of intent cgroups),
the next tick is simply delayed — there is no attempt to "catch up" missed ticks, as the
PD controller is designed for variable tick rates.
7.10.5.1 Stability Analysis¶
Control law.
The proportional control law from step 4 above is augmented with a derivative term to
provide active damping. The complete PD control law for each resource dimension r is:
error[k] = deficit[k] / target[r] // normalized error at tick k
d_error[k] = (error[k] - error[k-1]) / T // discrete derivative (T = tick period)
raw_adjustment = K_p[r] × error[k] + K_d[r] × d_error[k]
adjustment[r] = clamp(raw_adjustment, -MAX_INTENT_ADJUSTMENT, +MAX_INTENT_ADJUSTMENT)
where:
| Parameter | Value | Meaning |
|---|---|---|
| `T` | 1 s | Optimizer tick period (see Section 7.7.6) |
| `K_p` (CPU weight) | 0.5 | Proportional gain, CPU weight dimension |
| `K_p` (frequency) | 0.3 | Proportional gain, frequency dimension |
| `K_d` (CPU weight) | 1.0 | Derivative gain = `K_p × τ_d` where `τ_d` = 2 s (dominant delay, see below) |
| `K_d` (frequency) | 0.6 | Derivative gain for frequency dimension |
| `MAX_INTENT_ADJUSTMENT` | 0.20 | Maximum fractional change per tick (20% of current allocation) |
MAX_INTENT_ADJUSTMENT = 0.20 means the optimizer will never increase or decrease a
cgroup's allocation by more than 20% of its current value in a single tick, regardless
of the magnitude of the computed adjustment. For example, a cgroup at cpu.weight = 100
can move at most to [80, 120] in one optimizer cycle.
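A worked instance of the control law, using the documented gains for the CPU-weight dimension (`K_p` = 0.5, `K_d` = 1.0, `T` = 1 s); the observed latency values are illustrative:

```rust
// One tick of the PD control law: normalized error, discrete derivative,
// and the ±MAX_INTENT_ADJUSTMENT clamp.
const MAX_INTENT_ADJUSTMENT: f64 = 0.20;

fn pd_adjustment(error: f64, prev_error: f64, k_p: f64, k_d: f64, t: f64) -> f64 {
    let d_error = (error - prev_error) / t;
    let raw = k_p * error + k_d * d_error;
    raw.clamp(-MAX_INTENT_ADJUSTMENT, MAX_INTENT_ADJUSTMENT)
}

fn main() {
    // Target P99 latency 5 ms, observed 8 ms: normalized error 0.6.
    let error = (8.0 - 5.0) / 5.0;
    // First tick after intent assignment: prev_error starts at 0.0, so the
    // derivative term adds another 0.6 and the raw adjustment (0.9) is
    // clamped to +0.20.
    let adj = pd_adjustment(error, 0.0, 0.5, 1.0, 1.0);
    assert_eq!(adj, MAX_INTENT_ADJUSTMENT);
    // A cgroup at cpu.weight = 100 therefore moves to at most 120.
    let new_weight = 100.0 * (1.0 + adj);
    assert!((new_weight - 120.0).abs() < 1e-9);
}
```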
Dominant delay estimate.
The feedback signals observed by the optimizer are:
| Signal | Measurement latency | Settling time |
|---|---|---|
| CPU utilization (PELT) | ~40 ms (PELT decay constant) | ≤ 200 ms |
| P99 latency (sliding window) | Configurable; default 1 s window | ≤ 1 s |
| Memory pressure (PSI) | 10 s exponential window | ≤ 10 s |
| Temperature (RAPL / ACPI thermal) | 1 s read interval (Section 7.4.1) | ≤ 2 s |
| I/O utilization (iostat-style) | 250 ms sampling | ≤ 500 ms |
The dominant (largest) delay in the loop is the PSI memory pressure signal with a
settling time of up to 10 seconds. However, the intent optimizer only acts on PSI as a
secondary, advisory input — it does not directly adjust memory allocation in response to
PSI (that is the responsibility of the memory reclaim path, Section 4). For the primary
control dimensions (CPU weight and frequency), the dominant delay is the P99 latency
window of at most 1 second, and the RAPL/thermal signal at ≤ 2 seconds. Setting
τ_d = 2 s for the derivative gain is therefore conservative (larger than needed for
CPU control, but correct for the temperature feedback path).
Nyquist stability condition.
For a proportional-only controller in a sampled-data loop with tick period T and
signal delay τ_d, the Nyquist stability criterion requires T ≥ 2 × τ_d: the tick
period must be at least twice the dominant feedback delay, so that the effect of
one adjustment is observed before the next adjustment is applied.
For the primary CPU-weight control path: T = 1 s, τ_d_primary ≤ 1 s (P99 window).
The Nyquist condition 1 s ≥ 2 × 1 s is not satisfied by the P99 window alone,
which is why the derivative term is required.
The PD controller transforms the open-loop transfer function. With K_d = K_p × τ_d,
and the illustrative values K_p = 0.5, τ_d = 2, T = 1, the discrete-time analysis is:
Stability analysis for discrete-time PD controller:
Closed-loop characteristic equation: 1 + G_PD(z) × z⁻¹ = 0
where G_PD(z) = k_p + k_d × (1 - z⁻¹) = 1.5 - 1.0z⁻¹ [example gains: k_p=0.5, τ_d=2, T=1]
Expanding: 1 + (1.5 - z⁻¹) × z⁻¹ = 0
→ 1 + 1.5z⁻¹ - z⁻² = 0
→ multiply by z²: z² + 1.5z - 1.0 = 0 [characteristic polynomial, degree 2]
Jury stability criterion for z² + bz + c (b = 1.5, c = -1.0):
Condition 1: |c| < 1 → |-1.0| = 1.0 — on the stability boundary, not strictly inside it [marginally stable at these example gains]
Note: The gains above (k_p = 0.5, τ_d = 2) are illustrative only. Production gain
selection is performed numerically at deployment time via the umka-intent-tuner tool,
which sweeps the gain space and verifies Jury stability for the measured plant delay on
each hardware configuration. The architecture document shows the stability analysis
methodology, not fixed production gains.
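The sweep such a tuner performs can be sketched directly from the characteristic polynomial above. Reading the coefficients off the expansion (for T = 1, `b = k_p + k_d` and `c = -k_d`), the degree-2 Jury criterion reduces to three strict inequalities; the "stable" gain pair below is our illustration, not a production value:

```rust
// Numeric Jury check for the degree-2 characteristic polynomial
// z² + bz + c, with b = k_p + k_d and c = -k_d (T = 1). A sketch of the
// per-gain-pair test a deployment-time sweep could run.
fn jury_stable(k_p: f64, k_d: f64) -> bool {
    let b = k_p + k_d;
    let c = -k_d;
    // All roots strictly inside the unit circle iff:
    //   |c| < 1, P(1) = 1 + b + c > 0, and P(-1) = 1 - b + c > 0.
    c.abs() < 1.0 && (1.0 + b + c) > 0.0 && (1.0 - b + c) > 0.0
}

fn main() {
    // The illustrative gains from the text (k_p = 0.5, k_d = 1.0) sit on
    // the stability boundary (|c| = 1), so the strict test rejects them.
    assert!(!jury_stable(0.5, 1.0));
    // A more conservative pair satisfies all three conditions.
    assert!(jury_stable(0.3, 0.3));
}
```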
Gain margin and stability boundary.
The gain margin is the factor by which K_p can increase before the closed loop
becomes unstable. For the CPU-weight PD controller (K_d = 2 K_p), numerical analysis
gives a gain margin of approximately 3× (i.e., K_p up to ~1.5 before instability). The
chosen K_p = 0.5 is well within this margin.
If feedback pathology causes error[k] to grow without bound — for example, a bug in
the P99 latency measurement that returns extreme values — the MAX_INTENT_ADJUSTMENT
clamp prevents runaway:
// In every optimizer tick, regardless of computed adjustment magnitude:
let adjustment = raw_adjustment.clamp(-MAX_INTENT_ADJUSTMENT, MAX_INTENT_ADJUSTMENT);
Concretely: even if the proportional and derivative terms compute a raw adjustment of
+5.0 (500% increase), the clamped adjustment is +0.20 (20% increase). The system
cannot diverge faster than geometric growth at rate 1.20 per second, and the existing
5-second minimum hold time (stability controls in step 4) further limits the maximum
achievable divergence rate to 1.20^(1/5) ≈ 1.038 per second — less than 4% per second
even in a fully pathological feedback scenario.
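The divergence-rate arithmetic is easy to check numerically (the bound is 1.20^(1/5), i.e., below 4% per second):

```rust
// Verify the worst-case divergence rate with the 5-second minimum hold:
// at most one clamped +20% adjustment per 5 seconds amortizes to
// 1.20^(1/5) per second.
fn main() {
    let per_second_rate = 1.20_f64.powf(1.0 / 5.0);
    assert!(per_second_rate < 1.04); // less than 4% per second
    assert!(per_second_rate > 1.03);
}
```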
State carried across ticks.
Each controlled cgroup retains:
struct IntentControlState {
/// Normalized error from the previous optimizer tick (for derivative computation).
prev_error: [f64; ResourceDim::COUNT],
/// Number of consecutive ticks within the 10% convergence band.
converged_ticks: u32,
/// Monotonic nanosecond timestamp of the last applied adjustment.
last_adjustment_ns: u64,
/// Number of consecutive adjustments that failed to improve the metric (for backoff).
backoff_count: u32,
}
prev_error is initialized to 0.0 on the first tick after an intent is set, so the
derivative term contributes zero on the first control step. This avoids a derivative
kick from the initial large error value.
Interaction with existing stability controls.
The stability controls in step 4 complement the PD controller:
- **Hysteresis (10% band)**: prevents the derivative term from amplifying measurement noise. When `|error[k]| < 0.10`, both `K_p × error` and `K_d × d_error` are zeroed (no adjustment applied). This is equivalent to a deadband in the control law.
- **5-second minimum hold time**: provides a floor on the effective tick rate, giving the plant time to respond before the next adjustment. This is equivalent to adding latency to the feedback path, making the system behave as if `T_eff = max(T, 5 s)` for purposes of the Nyquist criterion.
- **Exponential backoff (3-miss rule)**: after 3 consecutive adjustments without convergence, `K_p` is halved for subsequent ticks until convergence is achieved. This adaptive gain reduction handles unmodelled plant dynamics (e.g., a workload whose latency is bottlenecked on I/O, not CPU — increasing CPU weight does nothing).
- **±20% max rate (`MAX_INTENT_ADJUSTMENT`)**: as analyzed above, provides the runaway-prevention bound.
Together, these mechanisms give the intent optimizer the stability properties of an overdamped second-order system: it converges monotonically (no overshoot in the normal case) with a time constant of approximately 3–5 optimizer ticks (3–5 seconds). This is appropriate for a slow-path resource allocation controller — fast enough to respond to workload shifts within 10–15 seconds, slow enough to avoid the thrashing that a tighter controller would produce.
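The monotone, few-tick convergence can be demonstrated on a toy plant. The sketch below runs the proportional term with the 10% deadband and ±20% clamp against a plant whose P99 latency scales inversely with `cpu.weight`; the plant model and its constant are illustrative, not from the kernel:

```rust
// Toy closed-loop run: proportional control with the 10% deadband and
// ±20% per-tick clamp, driving cpu.weight until latency is in band.
fn main() {
    let target_ms = 5.0;
    let plant = |weight: f64| 800.0 / weight; // latency_ms = k / weight (toy)
    let (k_p, deadband, max_adj) = (0.5, 0.10, 0.20);

    let mut weight = 100.0; // starting cpu.weight → 8 ms observed latency
    let mut ticks = 0;
    loop {
        let error = (plant(weight) - target_ms) / target_ms;
        if error.abs() < deadband {
            break; // inside the convergence band: no adjustment applied
        }
        let adj = (k_p * error).clamp(-max_adj, max_adj);
        weight *= 1.0 + adj;
        ticks += 1;
        assert!(ticks < 20, "failed to converge");
    }
    // Converges within a few ticks, final latency inside the ±10% band.
    assert!(ticks <= 5);
    assert!((plant(weight) - target_ms).abs() / target_ms < deadband);
}
```

In this run the loop settles in three adjustments (weight 100 → 120 → 140 → 150), with each step monotonically shrinking the error, matching the overdamped behavior described above.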
Intent Admission Control:
Intents are advisory, not guaranteed. When a cgroup sets intent.latency_ns = 5000000
(5ms), the kernel attempts to meet it but does not reject the intent if resources are
insufficient. Instead:
- If the intent cannot be met: intent.status reports latency_met: false with the
actual observed P99 latency.
- Clamping: intent values are clamped to physically achievable bounds. An
intent.latency_ns = 1 (1 nanosecond) is silently clamped to the system's minimum
achievable scheduling latency (~10μs on a typical x86 system).
- Contradictions (e.g., intent.latency_ns = 1000 with intent.efficiency = 100)
are logged as warnings in intent.status with contradiction: latency_vs_efficiency.
The optimizer prioritizes the latency target.
Intent Feedback:
The intent.status file (defined above) provides real-time feedback on intent
satisfaction. See the intent.status definition in the cgroup interface listing above
for the full field set.
Multi-Tenant Isolation:
The cgroup hierarchy IS the authority for resource isolation. Intents operate within
existing cgroup limits:
- A child cgroup's intent cannot cause resource consumption exceeding the parent's
cpu.max, memory.max, or power.max limits.
- Parent limits are a hard ceiling. Intents are soft optimization within that ceiling.
- Cross-tenant interference is prevented by existing cgroup isolation — the intent
optimizer adjusts knobs for one cgroup without affecting other cgroups' guarantees
(cpu.min, memory.min are respected).
7.10.6 Performance Impact¶
The optimization loop runs once per second as a background kernel thread. Each iteration reads per-cgroup metrics, runs the inference engine, and writes adjusted parameters. The total cost scales linearly with the number of cgroups that have active intents:
| Active intent cgroups | Typical iteration cost | Notes |
|---|---|---|
| ≤16 | ~100-500 μs | Sub-millisecond; inference engine inline |
| 17-100 | ~1-5 ms | Still negligible as a fraction of the 1-second period |
| >100 | ~10-50 ms | Should use `intent_optimizer_batch_size` to split |
For the default case (≤50 cgroups), the amortised overhead is well under 0.5% CPU.
The actual scheduling/allocation decisions use the same fast paths as before. Only the cgroup parameters change. Hot-path performance: identical to Linux.
7.10.7 Explainability Interface¶
Intent optimization (Section 7.10) reports whether intents are met and what adjustments were made, but administrators also need to understand why a specific performance target is not being met, what the system tried and rejected, and what action they could take to help. The explainability interface provides this deep diagnostic view.
sysfs interface (per-cgroup, read-only):
/sys/fs/cgroup/<group>/intent.explain
bottleneck: cpu|memory|io|accelerator|power|network
bottleneck_detail: "CPU saturated: 4/4 cores at 100%, cpu.max reached"
adjustments_attempted: 5
adjustments_rejected: 2
rejected_reasons: ["cpu.max ceiling reached", "power budget Section 7.4 constraint"]
recommendation: "Increase cpu.max from 400000 to 600000"
conflicting_intents: ["cgroup:/prod/db has higher cpu.weight, consuming 3/4 cores"]
Each field is populated by the optimization loop (Section 7.10) at the end of each cycle. The
bottleneck field identifies the single most constrained resource. The recommendation
field suggests the smallest configuration change that would allow the intent to be met.
The conflicting_intents field lists other cgroups whose intents are competing for the
same resource.
Structured tracepoint: umka_tp_stable_intent_explain is emitted every optimization
cycle for cgroups with unmet intents. Fields: cgroup path, intent type, target value,
actual value, bottleneck type, attempted adjustments (count), rejection reasons (array).
This enables perf / BPF-based monitoring of intent optimization across the system.
Adjustment history log: Per-cgroup ring buffer of the last 64 adjustments, exposed via:
/sys/fs/cgroup/<group>/intent.adjustment_history
# Each entry:
#   timestamp: 1708012345.123456
#   parameter: cpu.max
#   old_value: 400000
#   new_value: 500000
#   reason: "latency_p99 target 5ms, actual 8ms, cpu was bottleneck"
#   effect: "latency_p99 dropped from 8ms to 4.2ms at next cycle"
The ring buffer is fixed-size (64 entries, ~8 KiB per cgroup) and wraps around. It provides a complete causal trail: what changed, why, and what happened as a result.
Integration with umkactl: umkactl intent explain <cgroup> provides a
human-readable summary combining intent.status + intent.explain +
intent.adjustment_history into a single diagnostic view. Example output:
$ umkactl intent explain /prod/web
Intent: latency_p99 ≤ 5ms
Status: NOT MET (actual: 8.1ms)
Bottleneck: CPU (4/4 cores at 100%, cpu.max = 400000)
Recommendation: Increase cpu.max to 600000
Conflicting: /prod/db (cpu.weight=200, consuming 3/4 cores)
Last 3 adjustments:
[12:01:05] cpu.weight 100→150 — effect: p99 9.3ms→8.5ms
[12:01:06] io.weight 100→200 — effect: p99 8.5ms→8.3ms (not bottleneck)
[12:01:07] cpu.weight 150→200 — rejected: power budget Section 7.4 constraint
7.10.8 Integration with ML Policy Framework¶
The Intent Optimizer (Section 7.10.5) is a natural consumer of the ML Policy
framework (Section 23.1). The PD
controller's gain constants and stability parameters are tunable via
PolicyUpdateMsg, giving the ML framework control over how aggressively the
intent optimizer responds to workload changes — without replacing the optimizer
itself.
Tunable parameters exposed to the ML Policy framework:
| `param_id` | Field | Default | Bounds | Effect |
|---|---|---|---|---|
| `intent_kp_cpu_weight` | `PdControllerState.k_p` (CPU weight dim) | 500 | [50, 2000] | Proportional gain (permille) for CPU weight adjustments |
| `intent_kp_frequency` | `PdControllerState.k_p` (frequency dim) | 300 | [50, 2000] | Proportional gain (permille) for frequency adjustments |
| `intent_kd_cpu_weight` | `PdControllerState.k_d` (CPU weight dim) | 1000 | [0, 5000] | Derivative gain (permille) for CPU weight damping |
| `intent_kd_frequency` | `PdControllerState.k_d` (frequency dim) | 600 | [0, 5000] | Derivative gain (permille) for frequency damping |
| `intent_max_adjustment` | `MAX_INTENT_ADJUSTMENT` | 200 | [50, 500] | Maximum per-tick adjustment (permille of current allocation) |
| `intent_hold_time_ms` | Minimum hold time | 5000 | [1000, 30000] | Minimum milliseconds between adjustments |
| `intent_convergence_band` | Convergence criterion | 100 | [10, 300] | Permille of target within which intent is "met" |
All values are integer (permille for fractional quantities). The PD controller's
f64 fields are derived: k_p = param_value as f64 / 1000.0. This keeps
PolicyUpdateMsg integer-only while preserving sub-1% controller precision.
Why this is better than a standalone PD controller:
- **Temporal decay**: If the ML policy service crashes, the PD controller gains revert to defaults over `decay_period_ms`. A standalone controller either keeps running with stale gains or stops entirely.
- **Per-cgroup tuning**: Different workloads can receive different gain profiles via the cgroup override mechanism (Section 23.1). A latency-sensitive database gets aggressive gains; a batch job gets conservative ones.
- **Observability**: All gain changes flow through the `PolicyUpdateMsg` audit log (Section 23.1), making controller behavior fully traceable.
- **Stability guarantee**: The ML framework's bounded parameter ranges (enforced by `KernelParamStore` clamping) prevent a misbehaving policy service from setting gains that cause oscillation. The bounds in the table above are chosen so that even the extreme corners (`k_p` = 2.0, `k_d` = 0.0, `max_adjustment` = 50%) produce a stable system (damping ratio > 0.3 for all resource dimensions).
The PD controller remains the default. When no ML policy service is running, the intent optimizer uses the default gains from the table above. The ML framework is an enhancement layer — it makes the optimizer adaptive, not dependent.
7.11 Core Provisioning and Workload Partitioning¶
Inspired by: Akaros many-core OS (Berkeley, 2009–2018) — the MCP model with provisioning/allocation split. Adapted for production use on 8 architectures with Linux cgroup compatibility. IP status: Akaros was BSD-licensed academic research. The provisioning/allocation separation is a general scheduling concept; this section specifies UmkaOS's original design.
7.11.1 Problem¶
Modern workloads span a wide spectrum of CPU isolation needs. A latency-sensitive database engine needs dedicated cores with no OS noise — timer ticks, RCU callbacks, and workqueue items cause measurable tail-latency spikes. A parallel HPC job needs all N cores to start simultaneously (gang scheduling) or the entire wavefront stalls. A batch analytics pipeline needs cores only when available and should yield them instantly when the latency-sensitive workload reclaims.
Linux provides isolcpus (boot-time, static), cpuset.cpus (cgroup pinning but no
noise elimination), and nohz_full (per-core tickless, boot-time). None of these
supports dynamic provisioning, backfill, or gang allocation. Combining them requires
error-prone manual coordination across boot parameters, cgroup configuration, and
IRQ affinity masks.
UmkaOS unifies these capabilities into a single cgroup-aware provisioning model with three core classes, dynamic reconfiguration, and a backfill protocol that eliminates wasted cycles on dedicated cores.
Cross-references: EEVDF scheduler (Section 7.1), CBS bandwidth guarantees (Section 7.6), intent-based management (Section 7.10), ML policy integration (Section 23.1), cgroup namespacing (Section 17.2).
7.11.2 Core Classes¶
Every online CPU core is classified into exactly one of three classes at any given time. The classification is dynamic — a core's class can change at runtime via the cgroup provisioning interface.
// umka-core/src/sched/provision.rs
/// Classification of a CPU core's scheduling mode.
///
/// Each online core belongs to exactly one class. The class determines what
/// kernel services run on that core and how the scheduler treats it.
///
/// Transitions are: LL → CG (on provision), CG → Backfill (on provisionee
/// idle), Backfill → CG (on provisionee reclaim), CG → LL (on deprovision).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum CoreClass {
/// Low-Latency: default class. Time-shared via EEVDF (Section 7.1.2).
/// Full kernel services: timer tick, RCU callbacks, workqueue items,
/// softirqs, hardware interrupts. Standard Linux scheduling behavior.
LL = 0,
/// Coarse-Grained: dedicated to a single cgroup. No timer tick
/// (tickless, equivalent to `nohz_full`). RCU callbacks offloaded to
/// LL cores (equivalent to `rcu_nocbs`). No workqueue items. No softirqs.
/// Only IPIs (for cross-core coordination and backfill reclaim) and page
/// faults (cannot be disabled — application needs memory) are handled.
/// OS noise target: <1μs per second.
CG = 1,
/// Backfill: transient state for a CG core whose provisionee is idle.
/// Available for batch work from the backfill queue, but reclaimable
/// within 10μs via IPI preemption. Transitions back to CG when the
/// provisionee reclaims.
Backfill = 2,
}
OS noise elimination on CG cores — kernel services disabled vs. retained:
| Component | CG Core Behavior | Rationale |
|---|---|---|
| Timer tick | Disabled (tickless, `nohz_full` equivalent) | Eliminates periodic interrupts |
| RCU callbacks | Offloaded to LL cores (`rcu_nocbs` equivalent) | No RCU processing on CG |
| Workqueue items | Not scheduled on CG cores | Dispatched to LL cores only |
| Softirqs | Deferred to LL core via IPI bounce | No network/timer softirqs on CG |
| Page faults | Handled locally | Cannot disable — application needs memory |
| IPIs | Received | Cross-core coordination, backfill reclaim |
| Hardware interrupts | Affinity-masked away from CG cores | Only IPIs remain |
Per-architecture implementation of CG noise elimination:
| Arch | Timer Disable | RCU Offload | IRQ Affinity |
|---|---|---|---|
| x86-64 | APIC timer one-shot, no periodic tick | Per-CPU `rcu_nocbs` flag | IOAPIC destination mask |
| AArch64 | `CNTP_CTL_EL0.IMASK` bit set | Same flag | GIC SPI affinity registers |
| ARMv7 | Cortex-A timer IMASK | Same flag | GIC-400 ITARGETSR |
| RISC-V | SBI timer not re-armed | Same flag | PLIC enable bitmap |
| PPC32 | Decrementer not re-armed | Same flag | OpenPIC destination mask |
| PPC64LE | Decrementer not re-armed | Same flag | XIVE EQ target |
| s390x | Clock comparator not re-armed | Same flag | SIGP cpu mask (directed interrupts) |
| LoongArch64 | `CSR.TCFG` timer disabled | Same flag | EIOINTC routing vector |
The mechanism differs per architecture, but the guarantee is identical: no involuntary kernel entry on CG cores except page faults and IPIs.
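The disabled-vs-retained split in the table above can be modeled as a pure function of core class. The following is a minimal, self-contained sketch — the `CoreServices` struct and `services_for` function are illustrative stand-ins, not kernel API; the Backfill row follows the "temporarily behaves like an LL core" rule from the backfill protocol (Section 7.11.5), and hardware-IRQ routing is assumed to stay masked during backfill since re-routing affinity is costly:

```rust
/// Simplified stand-in for the kernel's CoreClass (Section 7.11.2).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CoreClass { LL, CG, Backfill }

/// Which kernel services are active on a core of a given class.
/// Hypothetical helper mirroring the CG-noise table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CoreServices {
    pub timer_tick: bool, // periodic tick enabled
    pub rcu_local: bool,  // RCU callbacks processed locally
    pub workqueue: bool,  // workqueue items dispatched here
    pub softirq: bool,    // softirqs processed locally
    pub hw_irq: bool,     // non-IPI hardware interrupts routed here
}

pub fn services_for(class: CoreClass) -> CoreServices {
    match class {
        // LL: full kernel services (standard Linux-like behavior).
        CoreClass::LL => CoreServices {
            timer_tick: true, rcu_local: true, workqueue: true,
            softirq: true, hw_irq: true,
        },
        // CG: everything off; only page faults and IPIs (not modeled here)
        // cause kernel entry.
        CoreClass::CG => CoreServices {
            timer_tick: false, rcu_local: false, workqueue: false,
            softirq: false, hw_irq: false,
        },
        // Backfill: behaves like LL for the backfill thread, but (assumed)
        // hardware IRQ affinity remains masked away from the core.
        CoreClass::Backfill => CoreServices {
            timer_tick: true, rcu_local: true, workqueue: true,
            softirq: true, hw_irq: false,
        },
    }
}

fn main() {
    // A CG core takes no tick, no local RCU, no workqueue, no softirq.
    assert!(!services_for(CoreClass::CG).timer_tick);
    println!("CG services: {:?}", services_for(CoreClass::CG));
}
```

The point of the table-as-function view: class transitions (CG → Backfill → CG) reduce to applying the service delta between two rows, which is what the backfill dispatch and CG-restoration steps in Section 7.11.5 do.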
7.11.3 Provisioning and Allocation¶
Provisioning and allocation are distinct operations, inspired by Akaros's separation of promise (provisioning) from grant (allocation):
- Provisioning: The kernel records that a cgroup is entitled to N cores of a given class. The cores are not yet assigned — this is a capacity reservation.
- Allocation: The kernel selects specific cores from the provisioned pool and assigns them to the cgroup. The cgroup's threads begin running on these cores.
This separation allows the kernel to promise capacity before threads are ready to use it, enables gang allocation (atomically allocating all N cores), and supports backfill of idle provisioned cores.
// umka-core/src/sched/provision.rs
/// A provisioning record: the kernel's promise that a cgroup can obtain
/// up to `core_count` cores of the specified class.
///
/// Provisioning is requested via the `cpu.provision_count` cgroup knob.
/// The kernel validates that the system has enough free cores to honor the
/// provision before accepting it.
// Kernel-internal struct. Not KABI. Not wire.
#[repr(C)]
pub struct CoreProvision {
/// Owning cgroup's unique identifier.
pub cgroup_id: u64,
/// Number of cores provisioned for this cgroup.
pub core_count: u16,
/// Core class: LL or CG. LL provisioning reserves cores in the EEVDF
/// pool (guarantees capacity). CG provisioning dedicates cores exclusively.
pub core_class: CoreClass,
/// If true, cores are exclusively reserved — no backfill when idle.
/// Only allowed when the system has at least 2× the provisioned count
/// in free LL cores (prevents over-reservation that starves the system).
pub provision_hard: bool,
/// If true, all provisioned cores must be allocated atomically
/// (all-or-none). Used for parallel workloads where partial allocation
/// is worse than no allocation (e.g., MPI jobs, GPU compute pipelines).
pub gang_mode: bool,
/// Explicit padding: 3 bytes between gang_mode (offset 12) and
/// gang_timeout_ms (offset 16, u32 requires 4-byte alignment).
pub _pad1: [u8; 3],
/// Timeout for gang allocation in milliseconds. If the scheduler cannot
/// allocate all N cores within this duration, the behavior depends on
/// gang_partial_ok (future extension). Default: 100ms.
pub gang_timeout_ms: u32,
/// Priority for backfill work on idle CG cores belonging to this
/// provision. Lower values are evicted first when the provisionee
/// reclaims its cores. Default: 0. Range: -32768 to 32767.
pub backfill_priority: i16,
/// Padding for alignment. Zero-initialized, must be zero.
pub _pad: [u8; 2],
}
const_assert!(size_of::<CoreProvision>() == 24);
/// The runtime state of an allocation derived from a CoreProvision.
pub struct CoreAllocation {
/// Back-reference to the provisioning record.
///
/// # Safety
///
/// The `CoreProvision` is owned by the cgroup's `CpuController`
/// extension. It outlives all `CoreAllocation`s that reference it,
/// because the cgroup teardown path releases all allocations
/// (returning cores to the global pool) before freeing the
/// provision. The pointer is valid for the entire `CoreAllocation`
/// lifecycle (Provisioned -> Active -> Released).
pub provision: *const CoreProvision,
/// Currently granted core IDs. The array is populated from index 0.
/// `allocated_count` indicates how many entries are valid.
pub allocated_cores: [CpuId; MAX_GANG_STACK_SIZE],
/// Number of valid entries in `allocated_cores`.
pub allocated_count: u16,
/// Current allocation state.
pub state: CoreAllocState,
}
/// Maximum number of cores in a single gang allocation's stack-allocated array.
/// The effective limit is `min(online_cpus(), 256)` — discovered at boot. The
/// 256 upper bound sizes the per-request stack allocation (~2 KB for `[CpuId; 256]`),
/// fitting comfortably within the 16 KB kernel stack. Systems with >256 cores
/// use heap-allocated gang arrays via `Box<[CpuId]>`. At boot,
/// `init_gang_allocator()` reads the online CPU count and selects stack or heap
/// backing accordingly. The constant is compile-time-fixed (not runtime tunable)
/// because it determines the stack frame layout.
pub const MAX_GANG_STACK_SIZE: usize = 256;
/// State machine for a core allocation.
///
/// Transitions:
/// Idle → Provisioned (provision accepted)
/// Provisioned → Allocated (cores granted)
/// Allocated → Backfilling (provisionee idle, backfill work assigned)
/// Backfilling → Reclaiming (provisionee wakes, reschedule IPI sent)
/// Reclaiming → Allocated (IPI acknowledged, provisionee running)
/// Allocated → Provisioned (cores released, provision retained)
/// Provisioned → Idle (provision revoked)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum CoreAllocState {
/// No provisioning active for this cgroup.
Idle = 0,
/// Cores are provisioned (reserved) but not yet allocated.
Provisioned = 1,
/// Cores are allocated and running the provisionee's threads.
Allocated = 2,
/// Provisionee is idle; cores are running backfill work.
Backfilling = 3,
/// Provisionee is reclaiming cores from backfill state.
/// Transient: lasts only until reschedule IPI is acknowledged (~2-10μs).
Reclaiming = 4,
}
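The allocation state machine can be checked with a small table-driven predicate. This is a hedged sketch — `is_valid_transition` is illustrative, not a kernel function — covering the transitions listed above plus the transient Reclaiming hop used by the backfill protocol (Section 7.11.5):

```rust
/// Simplified stand-in for the kernel's CoreAllocState.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CoreAllocState { Idle, Provisioned, Allocated, Backfilling, Reclaiming }

/// Returns true iff `from → to` is a legal edge in the allocation
/// state machine. Illustrative helper, not kernel API.
pub fn is_valid_transition(from: CoreAllocState, to: CoreAllocState) -> bool {
    use CoreAllocState::*;
    matches!(
        (from, to),
        (Idle, Provisioned)          // provision accepted
        | (Provisioned, Allocated)   // cores granted
        | (Allocated, Backfilling)   // provisionee idle, backfill assigned
        | (Backfilling, Reclaiming)  // provisionee wakes, resched IPI sent
        | (Reclaiming, Allocated)    // IPI acknowledged, provisionee running
        | (Allocated, Provisioned)   // cores released, provision retained
        | (Provisioned, Idle)        // provision revoked
    )
}

fn main() {
    // Walk the full backfill round-trip: every hop must be legal.
    use CoreAllocState::*;
    let path = [Idle, Provisioned, Allocated, Backfilling, Reclaiming, Allocated];
    assert!(path.windows(2).all(|w| is_valid_transition(w[0], w[1])));
    // Skipping Provisioned is not allowed.
    assert!(!is_valid_transition(Idle, Allocated));
}
```

A predicate like this is the kind of invariant the kernel can assert in debug builds at every state store, catching protocol bugs (e.g., a reclaim path that forgets the Reclaiming hop) immediately.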
7.11.4 Cgroup Interface¶
New control files under the cpu controller, additive to the existing interface
(Section 7.6):
/sys/fs/cgroup/<group>/cpu.provision_count
# Number of cores to provision for this cgroup.
# 0 = no provisioning (default). The cgroup uses standard EEVDF scheduling.
# N > 0 = request N cores. The kernel validates that sufficient free cores
# exist before accepting the write. Returns -ENOSPC if the system cannot
# honor the request.
/sys/fs/cgroup/<group>/cpu.core_class
# Core class for provisioned cores.
# "ll" (default) = Low-Latency. Provisioned LL cores remain in the EEVDF
# pool but are capacity-reserved for this cgroup.
# "cg" = Coarse-Grained. Provisioned cores are dedicated exclusively.
# OS noise is eliminated (see the Section 7.11.2 table).
/sys/fs/cgroup/<group>/cpu.provision_hard
# "0" (default) = soft provisioning. Idle CG cores accept backfill work.
# "1" = hard provisioning. Idle CG cores remain idle (no backfill).
# Only accepted when the system has >= 2x the provisioned count in free
# LL cores. Returns -ENOSPC otherwise.
/sys/fs/cgroup/<group>/cpu.gang_mode
# "0" (default) = cores allocated individually as they become available.
# "1" = gang mode. All provisioned cores are allocated atomically.
# Allocation blocks up to cpu.gang_timeout_ms. If the gang cannot be
# formed within the timeout, the write to cpu.provision_count returns
# -EAGAIN.
/sys/fs/cgroup/<group>/cpu.gang_timeout_ms
# Timeout for gang allocation in milliseconds. Default: 100.
# Only meaningful when cpu.gang_mode = 1.
/sys/fs/cgroup/<group>/cpu.allocated_cores
# (read-only) Space-separated list of currently allocated core IDs.
# Empty if no cores are allocated.
# Example: "4 5 12 13"
/sys/fs/cgroup/<group>/cpu.provision_status
# (read-only) Current provisioning state.
# Values: "none", "provisioned", "allocated", "backfilling"
Capability requirements:
- Writing cpu.provision_count requires CAP_SYS_NICE (same capability used
for scheduler policy changes). Without it, writes return -EPERM.
- Writing cpu.core_class or cpu.provision_hard requires CAP_SYS_ADMIN
(these affect system-wide core allocation policy). Without it, writes return -EPERM.
- cpu.gang_mode and cpu.gang_timeout_ms require CAP_SYS_NICE.
- Inside a user namespace, these capabilities apply to the owning user namespace
(matching cgroup delegation rules in Section 17.2.6).
Validation rules:
- cpu.provision_count must not exceed the number of online cores minus a minimum
LL reserve (default: 2 cores, configurable via sysctl kernel.min_ll_cores).
- The sum of all cpu.provision_count values across all cgroups must not exceed
online_cores - min_ll_cores. Returns -ENOSPC on overcommit.
- cpu.core_class = cg requires cpu.provision_count > 0. Setting core_class
to cg without provisioning returns -EINVAL.
- Nested cgroups: a child's provision comes out of its parent's provision budget,
analogous to cpu.guarantee nesting (Section 7.6).
7.11.5 Backfill Protocol¶
When a provisionee's threads are all idle on a CG core, the core is wasted unless repurposed. The backfill protocol allows batch work to use idle CG cores while guaranteeing that the provisionee can reclaim them within 10μs.
Protocol steps:
1. Idle detection: The last runnable thread of the provisionee on a CG core enters sleep (blocks on I/O, futex, etc.). The per-CPU idle handler detects that no provisionee threads remain runnable. State transition: Allocated → Backfilling.
2. Backfill dispatch: The scheduler checks the global backfill queue (a priority queue sorted by `CoreProvision.backfill_priority`, lower = evicted first). If batch work is available, the scheduler performs a full context switch to the highest-priority backfill thread. The CG core temporarily behaves like an LL core for the backfill thread (timer tick re-enabled, RCU callbacks local).
3. Reclaim trigger: The provisionee's thread becomes runnable (woken by I/O completion, futex wake, signal). The `try_to_wake_up` path detects the target core is backfilling (via `CpuLocal::core_class == Backfill`) and sends a standard reschedule IPI — the same IPI used by EEVDF load balancing (Section 7.1). State transition: Backfilling → Reclaiming.
4. Reclaim execution: The backfill core receives the reschedule IPI. The handler sets `need_resched` on the backfill CPU's `CpuLocalBlock` (standard mechanism, Section 3.2). On interrupt return, `schedule()` runs `pick_next_task()`, which re-applies the cgroup filter and selects the provisionee's thread. State transition: Reclaiming → Allocated.
5. Backfill thread disposition: The preempted backfill thread is migrated to an LL core by the EEVDF load balancer (standard migration path). If it was in a syscall, the syscall either completes (if near completion) or is interrupted with `EINTR` — standard preemption semantics, no special handling.
6. CG restoration: Once the provisionee's thread is running, the core reverts to CG mode: timer tick disabled, RCU callbacks offloaded, workqueue items removed from the core's queue.
Provisionee CG Core State Backfill Thread
─────────── ────────────── ───────────────
Running ──→ Allocated (not present)
All idle ──→ Backfilling ←── Scheduled
Wakes up ──→ Reclaiming ──→ Preempted (resched IPI)
Running ──→ Allocated Migrated to LL
Timing guarantees:
- Reschedule IPI delivery: <1μs on all architectures (IPI is the highest-priority interrupt).
- Context switch on IPI return: 1-4μs (architecture-dependent, see table below).
- Total reclaim latency (provisionee wake → provisionee running): ≤10μs.
| Arch | IPI Delivery | Context Switch | Total Reclaim |
|---|---|---|---|
| x86-64 | ~0.5μs | ~1.5-3μs | ~2-5μs |
| AArch64 | ~1μs | ~3-6μs | ~5-8μs |
| ARMv7 | ~1μs | ~3-5μs | ~4-7μs |
| RISC-V | ~2μs | ~3-6μs | ~5-9μs |
| PPC32 | ~1μs | ~3-5μs | ~4-7μs |
| PPC64LE | ~1μs | ~2-4μs | ~3-6μs |
7.11.6 Gang Scheduling Protocol¶
Gang scheduling ensures that a parallel workload's threads all start simultaneously. Without gang scheduling, threads that start at different times waste cycles spinning on barriers or synchronization primitives, waiting for the last thread to be scheduled.
Protocol:
1. A cgroup with `cpu.gang_mode = 1` writes a value N to `cpu.provision_count`.
2. The provisioning subsystem reserves N cores (validating availability).
3. When the cgroup's threads become runnable, the scheduler attempts to allocate all N cores from the provisioned set simultaneously.
4. If N cores are available: all are allocated atomically. Each core receives an IPI directing it to schedule the cgroup's thread. All threads begin executing within one IPI round-trip (~1-5μs skew within a single NUMA domain; cross-NUMA adds 5-50μs depending on topology).
5. If fewer than N cores are available: the scheduler waits up to `gang_timeout_ms` (default 100ms, tunable via `/sys/fs/cgroup/<group>/cpu.gang_timeout_ms`). 100ms allows time for a tier switch or driver reload to complete (~50-150ms typical) while avoiding indefinite blocking of system suspend or workload rescheduling. During the wait, the cgroup's threads remain in `TASK_UNINTERRUPTIBLE` state (not consuming CPU).
6. On timeout: allocation fails. `cpu.provision_status` reads `provisioned` (not `allocated`). The kernel logs `KERN_INFO` with the shortfall count. User space can retry, reduce `cpu.provision_count`, or fall back to non-gang scheduling.
7. Allocated gang cores are guaranteed until explicitly released (the cgroup writes `cpu.provision_count = 0` or is destroyed). The scheduler does not preempt gang cores — they run until the provisionee voluntarily yields or exits.
NUMA-aware gang allocation: The scheduler prefers allocating all gang cores
from the same NUMA node. If the preferred node has insufficient free cores, the
scheduler spills to the nearest NUMA node (lowest NumaTopology::distance()). A
cross-NUMA gang emits tracepoint umka_tp_stable_sched_gang_cross_numa to alert
operators. For strict NUMA locality, set ProvisionNumaHint.allow_cross_numa = false
— gang allocation will fail with -ENOSPC rather than cross NUMA boundaries.
Gang teardown: When a gang-allocated cgroup writes cpu.provision_count = 0 or
is destroyed, all allocated cores are simultaneously released. Each core receives a
standard reschedule IPI, transitions to LL class, and re-enables full kernel services
(timer tick, RCU, workqueue). Teardown completes within 1ms (IPI fan-out +
per-core re-initialization).
7.11.7 Integration with Existing Scheduler¶
The provisioning subsystem is an extension of EEVDF, not a parallel path. CG cores use the same scheduler infrastructure with modified parameters — they don't bypass the scheduler.
CG cores as an EEVDF scheduling class:
CG cores run EEVDF like LL cores, but with a restricted candidate set:
- A CG core's runqueue accepts only tasks from the owning cgroup (and backfill tasks, at lowest priority). This is enforced by the runqueue's `cgroup_filter: Option<CgroupId>` field — when set, `pick_next_task()` skips tasks not matching the filter.
- Within the owning cgroup, EEVDF operates normally: virtual deadline ordering, fair share, preemption on eligibility. If the cgroup has 8 threads on 4 CG cores, EEVDF schedules them fairly across those 4 cores.
- When all cgroup tasks are blocked, the cgroup filter temporarily opens to admit backfill tasks (state → Backfilling). On provisionee wake, the filter re-closes and backfill tasks are preempted.
This means: no separate scheduling path for CG cores. The difference is configuration (restricted runqueue, disabled tick, offloaded RCU), not a different algorithm. The existing EEVDF load balancer, priority handling, and preemption logic all work unchanged on CG cores.
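The cgroup-filter restriction can be sketched as a filtered pick. The types below are simplified stand-ins — real EEVDF orders eligible tasks by virtual deadline in a red-black tree, and the filter field belongs to the per-CPU runqueue:

```rust
/// Simplified task record: just what the filter and ordering need.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Task {
    pub cgroup: u64,    // owning cgroup id
    pub vdeadline: u64, // EEVDF virtual deadline (lower = sooner)
    pub backfill: bool, // true if this is batch backfill work
}

/// Pick the next task, honoring the cgroup filter. When the filter is
/// closed (Some), only the owner's non-backfill tasks are candidates;
/// when open (None, Backfilling state), backfill tasks are admitted.
pub fn pick_next(runqueue: &[Task], cgroup_filter: Option<u64>) -> Option<Task> {
    runqueue
        .iter()
        .filter(|t| match cgroup_filter {
            Some(cg) => t.cgroup == cg && !t.backfill, // owner only
            None => true,                              // backfill admitted
        })
        // EEVDF ordering simplified to "earliest virtual deadline".
        .min_by_key(|t| t.vdeadline)
        .copied()
}

fn main() {
    let rq = [
        Task { cgroup: 7, vdeadline: 100, backfill: false },
        Task { cgroup: 9, vdeadline: 50, backfill: true },
    ];
    // Filter closed on cgroup 7: the earlier-deadline backfill task is skipped.
    assert_eq!(pick_next(&rq, Some(7)).unwrap().cgroup, 7);
    // Filter open: the backfill task wins on deadline.
    assert!(pick_next(&rq, None).unwrap().backfill);
}
```

Note how reclaim falls out for free: re-closing the filter and re-running the pick is all that `schedule()` has to do after the reschedule IPI.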
Backfill preemption via standard mechanism:
Backfill reclaim uses the existing CpuLocalBlock.need_resched flag
(Section 3.2), not a custom IPI protocol.
When a provisionee task wakes on a backfilling core:
1. The waking path (`try_to_wake_up`) detects that the target core is backfilling (core class = Backfill in `CpuLocal` state).
2. It sends a standard reschedule IPI to the core (the same IPI used by EEVDF load balancing for migration, Section 7.1).
3. The IPI handler stores `true` to `need_resched` (`.store(true, Relaxed)`).
4. On interrupt return, `schedule()` runs the EEVDF `pick_next_task()`, which selects the provisionee task (cgroup filter re-applied) over the backfill task.
5. The backfill task is migrated to an LL core by the load balancer.

No new preemption mechanism is introduced. The only new element is the `core_class == Backfill` check in `try_to_wake_up` to prioritize the IPI.
NUMA-aware provisioning:
Provisioned cores are NUMA-aware. The provisioning subsystem:
- Prefers cores on the same NUMA node as the cgroup's memory allocations (determined by the cgroup's `cpuset.mems` or auto-detected from the first allocation's NUMA node).
- Gang allocations are NUMA-local when possible: all N cores come from the same NUMA node. If one node has insufficient free cores, allocation spills to the nearest NUMA node (lowest distance in `NumaTopology::distance()`).
- Cross-NUMA gang allocation is logged as a tracepoint (`umka_tp_stable_sched_gang_cross_numa`) since it may incur NUMA penalties. The operator can increase `cpu.provision_count` or reduce the gang size.
/// NUMA preference for core provisioning.
pub struct ProvisionNumaHint {
/// Preferred NUMA node. Auto-detected from cgroup memory or explicit.
pub preferred_node: NumaNodeId,
/// Allow cross-NUMA allocation if preferred node is full.
/// Default: true. Set to false for strict NUMA locality.
pub allow_cross_numa: bool,
}
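Node selection with this hint might look like the following sketch. The free-core map and the `distance` closure are illustrative stand-ins for `NumaTopology`; the function names are hypothetical:

```rust
/// Mirror of the hint struct above, with NumaNodeId simplified to usize.
pub struct ProvisionNumaHint {
    pub preferred_node: usize,
    pub allow_cross_numa: bool,
}

/// Choose a NUMA node for an allocation of `needed` cores:
/// preferred node first, else the nearest node with capacity,
/// unless strict locality forbids crossing NUMA boundaries.
pub fn choose_node(
    hint: &ProvisionNumaHint,
    free_per_node: &[u16],                  // free core count per node
    distance: &dyn Fn(usize, usize) -> u32, // NumaTopology::distance() stand-in
    needed: u16,
) -> Option<usize> {
    if free_per_node[hint.preferred_node] >= needed {
        return Some(hint.preferred_node);
    }
    if !hint.allow_cross_numa {
        // Strict locality: fail (-ENOSPC) rather than cross NUMA.
        return None;
    }
    // Spill to the nearest other node with capacity.
    (0..free_per_node.len())
        .filter(|&n| n != hint.preferred_node && free_per_node[n] >= needed)
        .min_by_key(|&n| distance(hint.preferred_node, n))
}

fn main() {
    // Two nodes: node 0 (preferred) has 2 free cores, node 1 has 8.
    let dist = |a: usize, b: usize| -> u32 { if a == b { 10 } else { 20 } };
    let hint = ProvisionNumaHint { preferred_node: 0, allow_cross_numa: true };
    assert_eq!(choose_node(&hint, &[2, 8], &dist, 4), Some(1)); // spills
    let strict = ProvisionNumaHint { preferred_node: 0, allow_cross_numa: false };
    assert_eq!(choose_node(&strict, &[2, 8], &dist, 4), None);  // -ENOSPC
}
```

A caller that gets `None` here would surface `-ENOSPC` to the `cpu.provision_count` write, matching the strict-locality behavior described above.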
CBS bandwidth (Section 7.6): Does not apply to
CG cores. A CG core has 100% bandwidth by definition — it runs only one cgroup's
threads. The CBS server for the cgroup, if configured, applies only to LL cores
where the cgroup's threads may also run (if cpu.provision_count < the cgroup's
thread count, excess threads run on LL cores under EEVDF+CBS).
Power management (Section 7.7): CG cores participate in power budgeting. If the system hits a RAPL or thermal limit, CG cores are frequency-throttled (not descheduled). The provisionee retains its cores but at reduced frequency. This is preferable to losing cores entirely, which would break gang scheduling invariants.
Intent-based management (Section 7.10):
The intent.latency_ns target maps to core class recommendations:
- latency_ns < 100_000 (< 100μs): recommend CG provisioning.
- 100_000 ≤ latency_ns < 10_000_000 (100μs–10ms): recommend LL with CBS guarantee.
- latency_ns ≥ 10_000_000 or latency_ns = 0: no provisioning recommendation.
The intent optimizer (Section 7.10) can automatically
write cpu.provision_count and cpu.core_class based on observed latency, subject
to the same validation rules as manual writes.
ML policy (Section 23.1): The ML policy
framework receives provisioning state as an observation channel: current core class
per CPU, allocation states, backfill utilization, reclaim latency histograms. The
policy can recommend provisioning changes (e.g., "workload X's backfill utilization
is 0% — switch from soft to hard provisioning to avoid backfill overhead") via
PolicyUpdateMsg. These recommendations flow through the intent optimizer, not
directly to the provisioning subsystem.
7.11.8 Relationship to Linux cpuset/isolcpus¶
UmkaOS provisioning is a strict superset of Linux's core isolation mechanisms. The following table maps Linux concepts to UmkaOS equivalents:
| Linux Mechanism | UmkaOS Equivalent | Difference |
|---|---|---|
| `isolcpus=2,3` (boot param) | `cpu.core_class = cg` + `cpu.provision_count = 2` | Dynamic, not boot-time |
| `nohz_full=2,3` (boot param) | Automatic on CG cores | Per-core, dynamic |
| `rcu_nocbs=2,3` (boot param) | Automatic on CG cores | Per-core, dynamic |
| `cpuset.cpus = 2,3` | `cpu.allocated_cores` (read-only output) | Provisioning is separate from pinning |
| `irqaffinity=0,1` (boot param) | Automatic IRQ masking on CG cores | Per-core, dynamic |
| No equivalent | Backfill protocol | Idle CG cores are not wasted |
| No equivalent | Gang scheduling | Atomic multi-core allocation |
| No equivalent | Intent-driven provisioning | Automatic core class selection |
Linux compatibility: cpuset.cpus continues to work as in Linux. Writing to
cpuset.cpus on a cgroup that also has cpu.provision_count > 0 is rejected with
-EBUSY — the two mechanisms are mutually exclusive. cpu.provision_count is an
UmkaOS extension; Linux applications that do not use it experience standard EEVDF
scheduling on LL cores, identical to Linux behavior.
7.11.9 Performance Bounds¶
All performance targets are measured on the specified reference hardware. Actual values may differ on other platforms but must remain within the stated bounds for certified hardware.
| Operation | Bound | Typical (x86-64) | Typical (AArch64) | Notes / Reference Hardware |
|---|---|---|---|---|
| Backfill reclaim (provisionee wake → running) | ≤ 10μs | 2-5μs | 5-8μs | AMD EPYC 9004 / Ampere Altra |
| Gang allocation (N ≤ 16 cores) | ≤ 100ms timeout | < 1ms | < 2ms | Lightly loaded system |
| Provisioning change (add/remove cores) | ≤ 1ms | ~200μs | ~400μs | Bitmap update + IPI fan-out |
| CG→LL transition (deprovision) | ≤ 1ms | ~300μs | ~500μs | Re-enable tick + RCU + workqueue |
| OS noise on CG core | < 1μs/sec | ~0.3μs/sec | ~0.5μs/sec | FTQ benchmark (Akaros methodology) |
OS noise measurement methodology: OS noise is measured using the Fixed Time Quantum (FTQ) benchmark, which measures the number of iterations of a tight loop per fixed time quantum. Deviations from the expected count indicate kernel interference. The target of <1μs per second means that over a 1-second measurement window, the total time lost to kernel interference (page faults excluded, as they are application-triggered) is less than 1 microsecond. This matches or exceeds the best published results from Akaros FTQ benchmarks on comparable hardware.