Chapter 22: AI/ML Policy Framework
Companion to Chapter 21: AI/ML and Accelerators. This chapter contains §22.1: AI/ML Policy Framework — Closed-Loop Kernel Intelligence. See 21-accelerators.md for §21.1–§21.6 (hardware accelerator framework).
22.1 AI/ML Policy Framework: Closed-Loop Kernel Intelligence
Section 21.4 defines the in-kernel inference engine — a tiny integer-only model running on the hot path (~500-5000 cycles). Section 21.4.11 defines Tier 2 inference services — more powerful models in userspace. But these sections describe the inference plumbing, not the policy integration: what kernel knobs can ML actually adjust, how does the kernel emit telemetry so ML has data to learn from, and how do results from an external "big model" (LLM, large transformer, RL agent) flow back into the kernel to affect scheduler behavior until the next tuning cycle?
This section defines the complete closed-loop framework.
22.1.1 Design Principles
Advisory, not authoritative. Every ML-adjusted parameter has a heuristic fallback. If the ML layer crashes, misbehaves, or is absent, the kernel runs its built-in heuristics. ML improves average-case performance; it never replaces correctness.
Bounded parameters. Each tunable parameter has [min, max] bounds enforced at the
kernel enforcement point. An ML model cannot set eevdf_weight_scale = 1000000 any more
than a sysctl can — the kernel clamps it. This makes the parameter space safe even if
the model is adversarially manipulated.
Temporal decay. Parameters set by ML automatically revert to their defaults after
decay_period_ms milliseconds without a refresh. If the Tier 2 service crashes or
becomes unreachable, the kernel gradually returns to baseline behavior. No explicit
"ML service is down" signal is needed.
Workload-specific, not global. ML tuning is cgroup-scoped: the ML layer can set different parameters for a latency-sensitive web server cgroup and a batch analytics cgroup running on the same machine. Global parameters exist but are the minority.
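To make the cgroup scoping concrete, parameter resolution can be sketched as a two-level lookup: a per-cgroup override wins over an ML-set global value, which in turn falls back to the registered default. The names below (`CgroupParamTable`, `resolve_param`) are illustrative stand-ins, not part of the KABI:

```rust
use std::collections::HashMap;

/// Illustrative two-level resolution: cgroup override → ML-set global → default.
struct CgroupParamTable {
    default_value: i64,            // registered heuristic default
    global: Option<i64>,           // ML-set machine-wide value, if any
    per_cgroup: HashMap<u32, i64>, // cgroup_id → ML-set override
}

fn resolve_param(t: &CgroupParamTable, cgroup_id: u32) -> i64 {
    t.per_cgroup
        .get(&cgroup_id)
        .copied()
        .or(t.global)
        .unwrap_or(t.default_value)
}
```

With this shape, temporal decay simply deletes the override or global entry, and resolution falls back to the default automatically.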
Latency tiers for AI decisions:
| Tier | Latency | Mechanism | Examples |
|---|---|---|---|
| A | < 1 μs | Pure heuristic | Page fault handler, IRQ routing |
| B | 1–50 μs | In-kernel model (Section 21.4) | Page prefetch stride, I/O queue reorder |
| C | 50 μs–5 s | Tier 2 service round-trip | NUMA migration, compression selection, power budget |
| D | 5 s–5 min | Tier 2 + external "big model" | Scheduler workload characterization, full EAS recalibration, anomaly root-cause |
22.1.2 Kernel Observation Bus
Every kernel subsystem that participates in ML tuning emits observations via a zero-cost macro. Observations are stored in per-CPU ring buffers and consumed asynchronously by Tier 2 policy services.
// umka-core/src/ml/observation.rs
/// Identifies the emitting subsystem.
#[repr(u16)]
pub enum SubsystemId {
Scheduler = 1,
MemoryManager = 2,
TcpStack = 3,
BlockIo = 4,
PowerManager = 5,
FmaHealth = 6,
NvmeDriver = 7,
NetworkDriver = 8,
// Room for up to 65535 subsystem IDs; new subsystems just add an entry.
}
/// Compact observation emitted by a kernel subsystem.
/// 64 bytes total — fits in one cache line.
#[repr(C)]
pub struct KernelObservation {
pub timestamp_ns: u64, // Monotonic clock (TSC-derived)
pub subsystem: SubsystemId, // Source subsystem
pub obs_type: u16, // Subsystem-defined event type (see tables below)
pub cgroup_id: u32, // Originating cgroup (0 = kernel/no cgroup)
pub cpu_id: u16, // CPU where the event occurred
pub _pad: u16,
pub features: [i32; 10], // Up to 10 integer feature dimensions
}
/// Per-CPU observation ring buffer.
/// Lock-free SPSC: kernel writes (producer), Tier 2 service reads (consumer).
/// Consumer attaches exactly one reader handle per subsystem per CPU.
pub struct ObservationRing {
pub buf: [KernelObservation; 4096], // ~256 KB per CPU per subsystem
pub head: AtomicU64, // Write pointer (kernel)
pub tail: AtomicU64, // Read pointer (Tier 2)
pub overflow: AtomicU64, // Dropped observation count
}
The observe_kernel! macro emits observations with zero overhead when no consumer
is registered (a static-key NOP, patched to a live branch only when a consumer
registers). It depends on the collect_features! helper macro to pack feature
arguments into the fixed-size [i32; 10] features array:
/// Packs up to 10 feature expressions into a `[i32; 10]` array for use in
/// `KernelObservation::features`. Accepts 1–10 positional expressions; unused
/// slots are zero-filled. All expressions must be convertible to `i32`.
///
/// # Example
/// ```
/// let arr = collect_features!(latency_ns as i32, runqueue_len, cpu_id as i32);
/// // arr == [latency_ns as i32, runqueue_len, cpu_id as i32, 0, 0, 0, 0, 0, 0, 0]
/// ```
macro_rules! collect_features {
    // Accept 1–10 feature expressions (trailing comma allowed) and copy them
    // into a zero-initialized [i32; 10]. The compiler unrolls the repetition
    // into straight-line stores; no run-time iteration survives optimization.
    ($($feat:expr),+ $(,)?) => {{
        let mut __features = [0i32; 10];
        let mut __idx = 0usize;
        $(
            __features[__idx] = $feat as i32;
            __idx += 1;
        )+
        let _ = __idx; // silence the final unused-assignment lint
        __features
    }};
}

Implementation note: the production macro instead uses a counting-macro pattern that expands to an array literal with exactly the right number of `, 0` zero-fillers, which additionally rejects more than 10 features at compile time (the indexed form above would panic at run time on an 11th feature). Either way the result is a `[i32; 10]` with no heap allocation and no run-time cost beyond the stores themselves.
The observe_kernel! macro itself:
/// Emit a kernel observation.
/// Overhead when disabled: 1–3 cycles (static branch miss rate ~0%).
/// Overhead when enabled: ~10–30 cycles (TSC read + ring buffer write).
///
/// # Example
/// ```
/// observe_kernel!(SubsystemId::Scheduler, SchedObs::TaskWoke,
/// cgroup_id, latency_ns as i32, runqueue_len, prev_cpu);
/// ```
macro_rules! observe_kernel {
($subsystem:expr, $obs_type:expr, $cgroup:expr, $($feat:expr),+) => {{
// Static key: a single NOP when no consumer is registered; the runtime
// patcher rewrites it to a JMP into the body when a Tier 2 service registers.
if static_key_enabled!(OBSERVE_ENABLED[$subsystem as usize]) {
let __obs = KernelObservation {
timestamp_ns: crate::arch::current::cpu::read_tsc_ns(),
subsystem: $subsystem,
obs_type: $obs_type as u16,
cgroup_id: $cgroup,
cpu_id: CpuLocal::cpu_id() as u16,
_pad: 0,
features: collect_features!($($feat),+),
};
ObservationRing::push_current_cpu($subsystem, __obs);
}
}}
}
Aggregation. Tier 2 services typically consume raw observations and aggregate them into feature vectors over configurable windows (100ms, 1s, 10s, 60s). The ring buffer provides raw data; aggregation policy is entirely in Tier 2 userspace.
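The consumer side of this contract can be sketched in userspace: drain the SPSC ring with acquire/release ordering, then bucket observations into fixed time windows. The simplified `Obs`/`Ring` types below stand in for `KernelObservation`/`ObservationRing` (drop-on-overflow, power-of-two capacity); `window_means` is a hypothetical aggregator over `features[0]`:

```rust
use std::sync::atomic::{AtomicU64, Ordering::{Acquire, Relaxed, Release}};

const RING_CAP: usize = 4096; // power of two, mirrors ObservationRing

// Simplified userspace stand-ins for KernelObservation / ObservationRing.
#[derive(Clone, Copy, Default)]
struct Obs {
    timestamp_ns: u64,
    features: [i32; 10],
}

struct Ring {
    buf: Vec<Obs>,
    head: AtomicU64, // producer (kernel) write index
    tail: AtomicU64, // consumer (Tier 2) read index
}

impl Ring {
    fn new() -> Self {
        Ring { buf: vec![Obs::default(); RING_CAP], head: AtomicU64::new(0), tail: AtomicU64::new(0) }
    }

    /// Producer side: drop (return false) on overflow rather than block.
    fn push(&mut self, o: Obs) -> bool {
        let head = self.head.load(Relaxed);
        if head - self.tail.load(Acquire) >= RING_CAP as u64 {
            return false; // ring full — the real kernel bumps `overflow` here
        }
        self.buf[head as usize & (RING_CAP - 1)] = o;
        self.head.store(head + 1, Release); // publish after the slot is written
        true
    }

    /// Consumer side: drain everything published so far.
    fn drain(&self, out: &mut Vec<Obs>) {
        let head = self.head.load(Acquire);
        let mut tail = self.tail.load(Relaxed);
        while tail < head {
            out.push(self.buf[tail as usize & (RING_CAP - 1)]);
            tail += 1;
        }
        self.tail.store(tail, Release); // free the slots for the producer
    }
}

/// Bucket features[0] (e.g. wakeup latency) into fixed windows → (window, mean).
fn window_means(obs: &[Obs], window_ns: u64) -> Vec<(u64, f64)> {
    let mut acc: std::collections::BTreeMap<u64, (i64, u64)> = Default::default();
    for o in obs {
        let e = acc.entry(o.timestamp_ns / window_ns).or_insert((0, 0));
        e.0 += o.features[0] as i64;
        e.1 += 1;
    }
    acc.into_iter().map(|(w, (sum, n))| (w, sum as f64 / n as f64)).collect()
}
```

A real consumer would run one `drain` loop per CPU per subscribed subsystem and feed the per-window means into its feature vectors.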
22.1.3 Tunable Parameter Store
Every kernel subsystem that accepts ML-driven tuning registers its parameters in a
global KernelParamStore. Parameters are read by the kernel with a single atomic load
(~1–3 cycles); writes require a CAS plus a version increment.
// umka-core/src/ml/params.rs
/// A single tunable kernel parameter.
/// Layout: 128 bytes — two cache lines; `version` guards against torn reads.
#[repr(C, align(64))]
pub struct KernelTunableParam {
/// Unique monotonic ID assigned at registration time.
pub param_id: u32,
pub subsystem: SubsystemId,
pub param_name: [u8; 24], // Null-terminated ASCII name
/// Current value (ML-adjusted). Read with AtomicI64::load(Relaxed).
pub current: AtomicI64,
pub default_value: i64,
pub min_value: i64,
pub max_value: i64,
/// After this many ms without a refresh from the ML layer, the parameter
/// automatically decays to `default_value`. 0 = no decay (permanent until reset).
pub decay_period_ms: u32,
/// Monotonic version, incremented on every successful update.
/// Consumers can detect parameter staleness by watching for version changes.
pub version: AtomicU64,
/// Timestamp of last ML-driven update (ns). 0 = never updated.
pub last_updated_ns: AtomicU64,
}
/// Global parameter registry. Up to 1024 tunable parameters total.
/// Initialized at boot from per-subsystem `register_param!` calls.
pub struct KernelParamStore {
pub params: [Option<KernelTunableParam>; 1024],
pub count: AtomicU32,
pub store_lock: SpinLock<()>, // Protects registration; not held during reads
}
Reading a parameter costs a single relaxed atomic load — effectively the same as reading a hardcoded global:
// Hot-path read pattern — no lock, single atomic load:
let weight_scale = PARAM_STORE.get(ParamId::SchedEevdfWeightScale)
.map_or(100, |p| p.current.load(Relaxed));
Decay enforcement runs from the scheduler tick on each CPU:
// Called from schedule_tick() — O(registered params); parameters with decay
// disabled (decay_period_ms == 0) or never touched by ML are skipped.
fn enforce_param_decay(now_ns: u64) {
    for param in PARAM_STORE.active_params() {
        if param.decay_period_ms == 0 { continue; } // 0 = no decay
        let last = param.last_updated_ns.load(Relaxed);
        if last == 0 { continue; } // never ML-updated; already at default
        if now_ns.saturating_sub(last) > param.decay_period_ms as u64 * 1_000_000 {
            param.current.store(param.default_value, Release);
            param.version.fetch_add(1, Release);
        }
    }
}
22.1.4 Policy Consumer KABI (Tier 2 → Kernel)
The vtable depends on two supporting types that the kernel hands to the service at registration; they are defined first:
// umka-core/src/ml/observation.rs (continued)
/// Per-CPU set of observation ring buffers, one per KernelObservation category.
/// Populated by the `observe_kernel!` macro; consumed by policy services.
/// The kernel mmaps this structure read-only into the Tier 2 service address space.
pub struct ObservationRingSet {
/// Per-category ring buffers. Indexed by `SubsystemId as usize`.
/// Entries for unregistered subsystems are zeroed and must not be read.
pub rings: [ObservationRing; OBSERVATION_CATEGORY_COUNT],
/// CPU this ring set belongs to (for NUMA-aware allocation by the consumer).
pub cpu_id: u32,
/// Total observations dropped due to ring overflow since last reset.
/// Updated atomically by the kernel producer; read-only for consumers.
pub dropped: AtomicU64,
}
/// Number of observation categories (= max SubsystemId discriminant + 1,
/// rounded up to the next power of two for index-by-enum access).
pub const OBSERVATION_CATEGORY_COUNT: usize = 16;
// umka-core/src/ml/params.rs (continued)
/// Read-only shadow view of the KernelTunableParam store for use by policy consumers.
/// Updated atomically on each policy epoch (a complete pass of decay enforcement);
/// consumers read from this without needing to acquire the param store write lock.
///
/// The kernel mmaps this structure into Tier 2 service address space as read-only.
/// Consumers detect staleness by watching `epoch`: if `epoch` changes between two reads
/// of `params[i]`, re-read the slot.
pub struct KernelParamStoreShadow {
/// Epoch counter; incremented on every param store update.
/// Odd epoch = update in progress; even epoch = stable snapshot.
/// Consumers must spin-wait for an even epoch before reading `params`.
pub epoch: AtomicU64,
/// Snapshot of all tunable parameters as of `epoch`.
/// Indexed by `KernelTunableParam::param_id` (0..MAX_TUNABLE_PARAMS).
/// Entries whose corresponding param is unregistered hold `i64::MIN` as a sentinel.
pub params: [AtomicI64; MAX_TUNABLE_PARAMS],
/// Timestamp of the last shadow update (monotonic clock, nanoseconds since boot).
pub last_update_ns: AtomicU64,
}
/// Maximum number of tunable parameters tracked in `KernelParamStoreShadow`.
/// `KernelParamStore` supports up to 1024 registered parameters; the shadow
/// uses the same bound so indexing is identical.
pub const MAX_TUNABLE_PARAMS: usize = 1024;
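The odd/even epoch protocol is a seqlock: the writer bumps `epoch` to odd before touching a slot and back to even afterwards, and a reader retries until it observes the same even epoch on both sides of its read. A minimal single-slot userspace sketch (`SeqShadow` is an illustrative stand-in for `KernelParamStoreShadow`):

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering::{Acquire, Release}};

/// Minimal single-slot seqlock mirroring KernelParamStoreShadow's epoch protocol.
struct SeqShadow {
    epoch: AtomicU64, // even = stable snapshot, odd = update in progress
    slot: AtomicI64,
}

impl SeqShadow {
    /// Writer (kernel side): odd epoch while writing, even once stable.
    fn publish(&self, v: i64) {
        self.epoch.fetch_add(1, Release); // even → odd
        self.slot.store(v, Release);
        self.epoch.fetch_add(1, Release); // odd → even
    }

    /// Reader (Tier 2 side): retry until an even epoch brackets the read.
    fn read_slot(&self) -> i64 {
        loop {
            let e1 = self.epoch.load(Acquire);
            if e1 % 2 != 0 {
                continue; // update in progress — spin-wait for an even epoch
            }
            let v = self.slot.load(Acquire);
            if self.epoch.load(Acquire) == e1 {
                return v; // epoch unchanged: v is a consistent snapshot
            }
        }
    }
}
```

The real shadow applies the same bracket around a whole pass over `params`, so a consumer either sees the complete previous snapshot or the complete new one.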
A Tier 2 policy driver that wishes to read observations and write parameter updates
implements the PolicyConsumerVTable:
#[repr(C)]
pub struct PolicyConsumerVTable {
pub vtable_size: u64,
pub version: u32,
/// Kernel calls this on registration to give the service access to
/// observation ring buffer handles and the parameter store.
///
/// # Safety
/// `ctx` must remain valid for the lifetime of the service registration.
pub on_register: unsafe extern "C" fn(
ctx: *mut ServiceCtx,
obs_rings: *const ObservationRingSet, // One ring per CPU per subscribed subsystem
param_store: *const KernelParamStoreShadow, // Read-only view of current parameters
) -> KabiResult,
/// Kernel delivers a batch of observations from the ring buffer.
/// Called on a background kernel thread; must return promptly (<100μs).
/// Heavy processing should be deferred to the service's own thread pool.
///
/// # Safety
/// `obs` points to `count` valid `KernelObservation` entries.
pub on_observations: unsafe extern "C" fn(
ctx: *mut ServiceCtx,
obs: *const KernelObservation,
count: u32,
) -> KabiResult,
}
/// Message sent from Tier 2 policy service to the kernel to update a parameter.
/// Submitted via a dedicated policy update ring buffer (separate from observations).
///
/// Layout is fixed at 128 bytes for KABI stability. All implicit compiler padding
/// is made explicit via named `_pad*` fields so that the layout is identical
/// across compiler versions, optimization levels, and target architectures.
/// Future fields may be added within `_reserved`; increment `PolicyConsumerVTable::version`
/// when doing so.
#[repr(C)]
pub struct PolicyUpdateMsg {
/// Which parameter to update.
pub param_id: u32,
/// Explicit padding to align `new_value` to 8 bytes.
pub _pad0: [u8; 4],
/// New value — kernel clamps to [min_value, max_value] before applying.
pub new_value: i64,
/// Parameter is valid for this many ms (0 = use param's default decay_period).
/// If the service crashes, the parameter decays after this interval.
pub valid_for_ms: u32,
/// Explicit padding to align `model_seq` to 8 bytes.
pub _pad1: [u8; 4],
/// Monotonic sequence number from the ML model that produced this update.
/// Updates whose seq is lower than the last applied one for this parameter
/// are silently dropped (guards against reordered or replayed messages).
pub model_seq: u64,
/// Optional: restrict update to a specific cgroup (0 = global / kernel-wide).
pub cgroup_id: u32,
/// Explicit padding to 128-byte total size for KABI stability.
/// Future fields may be added here; increment `version` when adding.
pub _reserved: [u8; 92],
}
const _: () = assert!(core::mem::size_of::<PolicyUpdateMsg>() == 128);
The kernel validates all PolicyUpdateMsg entries before applying:
1. param_id must exist in KernelParamStore
2. new_value is clamped to [min_value, max_value] (never rejected; clamped silently)
3. Caller must hold Capability::KernelMlTune (Tier 2 services receive this at registration if granted by the operator; see Section 22.1.8)
4. model_seq must be ≥ the last applied seq for this (param_id, cgroup_id) pair
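The validation steps above can be collapsed into a single apply function. The `Store`/`Param` types and the `apply_update` name are illustrative stand-ins for the kernel-side handler, and the capability check from step 3 is assumed to have happened when the update ring was mapped, so it does not appear here:

```rust
use std::collections::HashMap;

// Illustrative stand-ins for the kernel-side structures (not the real KABI types).
struct Param { current: i64, min: i64, max: i64 }

struct Store {
    params: HashMap<u32, Param>,           // param_id → parameter
    applied_seq: HashMap<(u32, u32), u64>, // (param_id, cgroup_id) → last seq
}

enum ApplyError { UnknownParam, StaleSeq }

/// Apply one PolicyUpdateMsg; returns the value actually stored.
fn apply_update(
    store: &mut Store,
    param_id: u32,
    cgroup_id: u32,
    new_value: i64,
    model_seq: u64,
) -> Result<i64, ApplyError> {
    // Step 1: param_id must exist.
    let p = store.params.get_mut(&param_id).ok_or(ApplyError::UnknownParam)?;
    // Step 4: model_seq must be ≥ the last applied seq for this pair.
    let last = store.applied_seq.entry((param_id, cgroup_id)).or_insert(0);
    if model_seq < *last {
        return Err(ApplyError::StaleSeq);
    }
    *last = model_seq;
    // Step 2: clamp silently — out-of-range values are never rejected.
    let clamped = new_value.clamp(p.min, p.max);
    p.current = clamped; // real kernel: AtomicI64 store + version bump
    Ok(clamped)
}
```

Note that clamping rather than rejecting keeps a misbehaving model harmless without starving it of feedback: the audit log records both the requested and the applied value.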
22.1.5 Subsystem Integration Catalog
Each subsystem that participates in ML tuning registers its parameters at boot. The tables below define the initial parameter sets and observation types.
Scheduler (Section 6)
Observation types (SchedObs):
| obs_type | features[0..9] | Meaning |
|---|---|---|
| `TaskWoke` | latency_ns, runq_len, cpu, prev_cpu, cgroup_id | Task wakeup latency |
| `MigrateDecision` | src_cpu, dst_cpu, task_weight, queue_diff, benefit_ns | NUMA/load migration |
| `PreemptionEvent` | preemptor_prio, preemptee_prio, cgroup_id, — | Preemption occurred |
| `RunqueueStats` | runq_len, avg_vruntime, nr_throttled, nr_rt, cgroup_id | Per-CPU runqueue snapshot (every 10ms) |
| `EasDecision` | task_cgroup, chosen_cpu, energy_delta_uw, load_delta, — | EAS placement |
Tunable parameters (SchedParam):
| param_name | default | min | max | decay | Effect |
|---|---|---|---|---|---|
| `eevdf_weight_scale` | 100 | 50 | 200 | 60s | Scale factor for virtual deadline computation (100 = baseline) |
| `migration_benefit_threshold` | 1000 | 100 | 50000 | 30s | Minimum ns benefit to justify task migration |
| `preemption_latency_budget` | 1000 | 100 | 10000 | 30s | Maximum μs a lower-priority task runs before preemption check |
| `eas_energy_bias` | 50 | 0 | 100 | 60s | 0 = performance, 100 = max energy saving in EAS decisions |
| `cfs_burst_quota_us` | 0 | 0 | 100000 | 60s | CFS burst tolerance for cgroup (0 = disabled) |
Memory Manager (Section 4)
Observation types (MemObs):
| obs_type | features[0..9] | Meaning |
|---|---|---|
| `PageFault` | cgroup, fault_type, addr_band, file_offset_band, prefetch_hit | Page fault event |
| `EvictionDecision` | cgroup, evicted_page_type, lru_age, refault_distance, — | Which page was evicted |
| `NumaMigration` | src_node, dst_node, cgroup, pages_moved, benefit_ns | NUMA migration result |
| `RefaultRecord` | cgroup, file_inode, page_offset, time_since_evict_ms | Previously evicted page faulted again |
| `MemPressure` | node, free_pages, anon_pages, file_pages, slab_pages | Memory pressure snapshot (every 1s) |
Tunable parameters (MemParam):
| param_name | default | min | max | decay | Effect |
|---|---|---|---|---|---|
| `reclaim_aggressiveness` | 100 | 25 | 400 | 30s | LRU reclaim rate relative to baseline |
| `prefetch_window_pages` | 8 | 1 | 128 | 30s | Max pages to prefetch per fault event |
| `numa_migration_threshold` | 200 | 50 | 2000 | 60s | Minimum benefit (ns) per page to trigger NUMA migration |
| `compress_entropy_threshold` | 128 | 64 | 255 | 60s | Page entropy (0–255) above which compression is skipped |
| `swap_local_ratio` | 80 | 0 | 100 | 30s | % of swap that goes to local vs RDMA remote swap (Section 5) |
TCP / Network (Section 14)
Observation types (NetObs):
| obs_type | features[0..9] | Meaning |
|---|---|---|
| `CongestionEvent` | cgroup, cwnd, rtt_us, retransmits, bandwidth_mbps | Congestion window event |
| `FlowStats` | cgroup, bytes_sent, bytes_recv, rtt_p99_us, loss_pct | Per-flow statistics snapshot |
| `RouteDecision` | src_addr_band, dst_addr_band, chosen_dev, alternative_dev, latency_us | Routing decision |
Tunable parameters (NetParam):
| param_name | default | min | max | decay | Effect |
|---|---|---|---|---|---|
| `tcp_initial_cwnd_scale` | 10 | 2 | 100 | 30s | Initial congestion window (segments) per cgroup |
| `bbr_probe_rtt_interval_ms` | 10000 | 200 | 60000 | 60s | BBR min-RTT probe interval |
| `tcp_pacing_gain_pct` | 125 | 100 | 200 | 30s | BBR pacing gain percentage |
| `ecn_aggressiveness` | 1 | 0 | 3 | 60s | 0=off, 1=ECT(1), 2=ECT(0), 3=always mark |
I/O Scheduler (Section 14.13)
Tunable parameters (IoParam):
| param_name | default | min | max | decay | Effect |
|---|---|---|---|---|---|
| `readahead_pages` | 32 | 1 | 512 | 30s | Readahead window in pages |
| `queue_depth_target` | 32 | 4 | 1024 | 30s | Target NVMe queue depth per cgroup |
| `latency_target_us` | 0 | 0 | 100000 | 30s | 0 = throughput mode; >0 = latency target (μs) |
Power Manager (Section 6)
Tunable parameters (PowerParam):
| param_name | default | min | max | decay | Effect |
|---|---|---|---|---|---|
| `rapl_package_power_w` | 0 | 5 | 400 | 10s | CPU package power cap (W); 0 = no cap |
| `cpu_freq_min_mhz` | 0 | 0 | 10000 | 10s | Minimum CPU frequency; 0 = hardware default |
| `accel_power_cap_w` | 0 | 0 | 1000 | 10s | Per-accelerator power cap; 0 = hardware default |
| `thermal_target_c` | 95 | 60 | 105 | 5s | Thermal throttle target (°C) |
FMA / Security (Section 16)
Tunable parameters (FmaParam):
| param_name | default | min | max | decay | Effect |
|---|---|---|---|---|---|
| `anomaly_alert_threshold` | 80 | 10 | 100 | 60s | Score (0–100) above which FMA raises alert |
| `health_check_interval_ms` | 1000 | 100 | 60000 | 0 | How frequently to poll device health counters |
| `error_rate_window_ms` | 5000 | 100 | 300000 | 0 | Error rate measurement window |
22.1.6 Heavy Model Integration Pattern
The "big model" pattern enables a Tier 2 driver to call any model — a large transformer,
an LLM, a remote inference service, or a custom RL policy — and feed the results back as
PolicyUpdateMsg entries. The kernel is unaware of where the result came from; it only
sees a bounded parameter update through the standard mechanism.
Data flow:
╔═══════════════════════════════════════════════════════════════════╗
║ UmkaOS Core (Tier 0, Ring 0) ║
║ ║
║ [observe_kernel! macro calls] ──► [ObservationRing per CPU] ║
║ ║
║ [KernelParamStore] ◄── [PolicyUpdateMsg ring] ◄── [validation] ║
╚══════════════╬══════════════════════════════╬═════════════════════╝
║ mmap ObservationRings ║ PolicyUpdateMsg
▼ (read-only, zero-copy) ║ (write ring, validated)
╔══════════════════════════════════╗ ║
║ Tier 2 Policy Service Process ║ ║
║ (Ring 3, hardware-isolated) ║ ║
║ ║ ║
║ ┌──────────────────────────┐ ║ ║
║ │ Observation aggregator │ ║ ║
║ │ (100ms/1s/10s windows) │ ║ ║
║ └──────────┬───────────────┘ ║ ║
║ │ feature vector ║ ║
║ ▼ ║ ║
║ ┌──────────────────────────┐ ║ ║
║ │ Inference layer │ ║ ║
║ │ (in-process small model │ ║ ║
║ │ OR call big model ──►──╫────╫─►external)║
║ │ ◄── result ────────────╫────╫─◄──────── ║
║ └──────────┬───────────────┘ ║ ║
║ │ PolicyUpdateMsg ║ ║
║ └────────────────────╫────────────╝
║ ║
╚══════════════════════════════════╝
External big model call — concrete example:
A Tier 2 scheduler policy service runs a 5-minute characterization cycle:
Every 5 minutes:
1. Drain last 5 minutes of Scheduler + Memory ObservationRings into feature matrix
(per-cgroup stats: avg task latency, cache miss rate, memory pressure, IPC, etc.)
2. If local small model is confident (prediction score > 0.85):
→ Apply parameter updates directly (Tier C path, ~100ms latency)
3. If confidence is low OR first boot OR significant workload shift detected:
→ Serialize feature matrix as JSON/protobuf
→ Call external inference service via UNIX socket or HTTP:
POST /analyze { features: [...], cgroup_ids: [...] }
→ Service runs large model (XGBoost, small transformer, RL policy)
(100ms – 5s latency acceptable at this tier)
→ Parse response: { cgroup_id: 42, param_id: "eevdf_weight_scale", value: 130 }
4. For each (param_id, new_value) in response:
→ Validate cgroup ownership (service can only tune cgroups it owns)
→ Submit PolicyUpdateMsg with valid_for_ms = 300_000 (expires in 5 min)
→ Kernel applies on next parameter read (atomic store, ~5 cycles)
5. Log all updates to FMA audit ring (Section 19.1)
→ {timestamp, service_id, model_version, param_id, old_value, new_value}
The external service can be:
- A local Python/Rust process running PyTorch, XGBoost, or scikit-learn
- A remote inference microservice (gRPC or HTTP) running on another node
- An LLM with a structured output schema (for workload characterization and root-cause analysis)
- A reinforcement learning agent maintaining state across tuning cycles
None of this requires any kernel changes: the kernel only sees PolicyUpdateMsg entries.
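The confidence gate in steps 2–3 reduces to a small dispatch function: trust the local model when it is confident and no workload shift was detected, otherwise consult the big model, falling back to the local answer if the endpoint is unreachable. Everything here (`Prediction`, `choose_value`, the closure-based remote call) is an illustrative sketch, not service code:

```rust
/// Output of the local small model for one parameter (illustrative type).
struct Prediction {
    value: i64,
    confidence: f64, // 0.0–1.0 prediction score
}

const CONFIDENCE_THRESHOLD: f64 = 0.85; // the gate from step 2 above

/// Tier C when confident; Tier D (big model) otherwise. `remote` returns
/// None when the external endpoint is unreachable — the local answer is
/// then used, so the loop stays advisory and never blocks on the network.
fn choose_value(local: Prediction, workload_shift: bool, remote: impl Fn() -> Option<i64>) -> i64 {
    if local.confidence > CONFIDENCE_THRESHOLD && !workload_shift {
        local.value
    } else {
        remote().unwrap_or(local.value)
    }
}
```

Whichever branch wins, the chosen value is submitted as a bounded PolicyUpdateMsg, so the kernel treats both paths identically.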
22.1.7 Model Weight Update Flow
When a Tier 2 service trains or fine-tunes an in-kernel model (Section 21.4), it
ships updated weights via the existing sysfs interface (Section 21.4.4). The update
is atomic from the kernel's perspective:
Tier 2 online learning loop:
Every N minutes (configurable per service):
1. Extract recent training data from ObservationRings + ground-truth outcomes
(ground truth: observed page refault rates for prefetch model; actual I/O
latencies for I/O scheduler model)
2. Run mini-batch update in Tier 2 userspace (full FP, no kernel restrictions)
3. Quantize new weights to INT8/INT16 using the .isleml binary format (Section 21.4.9)
4. Validate model offline:
- Pass the model through the load-time validator (Section 21.4.9)
- Run 1000 representative inputs; compare against previous model
- Accept only if accuracy delta > -2% (do not regress more than 2 points)
5. Write to /sys/kernel/umka/inference/models/<model_name>/model.bin
→ Kernel receives write, invokes load-time validator (Section 21.4.9)
→ On validation pass: CAS swap of AtomicModelRef pointer
→ Old model freed after RCU grace period
→ New weights active for next inference call (within microseconds)
6. If validation fails: keep previous model, log failure to FMA ring
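Step 3's quantization can be sketched as symmetric per-tensor INT8: choose the scale from the largest-magnitude weight and round each FP32 weight to one of 255 signed steps. This is one common scheme, shown for intuition — the .isleml format (Section 21.4.9) defines the authoritative encoding:

```rust
/// Symmetric per-tensor INT8 quantization sketch.
/// Real value ≈ q as f32 * scale; error per weight is bounded by scale / 2.
fn quantize_int8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    if max_abs == 0.0 {
        return (vec![0; weights.len()], 1.0); // all-zero tensor: any scale works
    }
    let scale = max_abs / 127.0;
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Inverse mapping, used by the offline accuracy comparison in step 4.
fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

The step 4 acceptance check then runs the representative inputs through both the FP and the dequantized model and rejects the update if accuracy regresses by more than 2 points.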
AtomicModelRef — the kernel's handle on the active in-kernel model:
// umka-core/src/inference/model.rs
/// RCU-protected reference to the active in-kernel model.
/// Replaced atomically on weight update; old model freed after grace period.
pub struct AtomicModelRef {
pub ptr: AtomicPtr<KernelModel>, // Null = use heuristic fallback
}
impl AtomicModelRef {
/// Hot-path inference: load model pointer, run inference, drop RCU guard.
/// The rcu_read_guard prevents concurrent model replacement from freeing
/// the model while inference is in progress.
pub fn infer(&self, features: &[i32]) -> Option<i32> {
let _guard = rcu_read_lock();
let model = unsafe { self.ptr.load(Acquire).as_ref()? };
if !model.active.load(Relaxed) { return None; }
Some(model.run(features))
}
/// Called from sysfs write handler on model.bin write.
/// Validates, then atomically replaces the active model.
pub fn update(&self, new_model: Box<KernelModel>) -> Result<(), ModelError> {
new_model.validate()?;
let new_ptr = Box::into_raw(new_model);
let old_ptr = self.ptr.swap(new_ptr, AcqRel);
if !old_ptr.is_null() {
// Free old model after all CPUs pass through a quiescent state.
rcu_call(move || unsafe { drop(Box::from_raw(old_ptr)) });
}
Ok(())
}
}
22.1.8 Security and Capability Model
KernelMlTune capability. Applying PolicyUpdateMsg entries to the kernel requires
the Capability::KernelMlTune capability. This is an UmkaOS-specific capability (not
a Linux CAP_*), held by the Tier 2 service process when:
1. The operator has granted it at service registration time, OR
2. The service is one of UmkaOS's reference policy services and is cryptographically signed
Without KernelMlTune, a process can read observations from its own cgroup but cannot
write parameter updates.
Bounds enforcement. Every new_value in PolicyUpdateMsg is silently clamped to
[min_value, max_value] before being stored. An ML service that produces out-of-range
values is not rejected — the clamping is the safety mechanism.
Namespace isolation. A containerized Tier 2 service sees only its own cgroup's
parameters and observations. A service in cgroup docker/myapp cannot read memory
pressure observations from system.slice, nor can it set rapl_package_power_w globally.
Global parameter updates require both Capability::KernelMlTune and CAP_SYS_ADMIN.
Audit log. Every parameter update is logged to the FMA ring (Section 19.1) with:
- {ts_ns, service_id, model_version, param_id, cgroup_id, old_value, new_value}
- Log entries are write-once; tampering with the audit ring requires CAP_SYS_ADMIN
- The FMA ring is accessible to security monitoring tools via umkafs /System/Kernel/MLAudit
Adversarial protection:
- Rate limiting: max 1000 PolicyUpdateMsg entries per service per second
- Consistency bounds: if a parameter oscillates by > 50% within 10s, an FMA alert
is raised (may indicate a misbehaving service or adversarial input to the ML model)
- Model versioning: models are refused if their validation accuracy is below 60% on
the standard benchmark set embedded in the kernel binary at build time
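The oscillation check can be sketched as a sliding window over applied values. The exact ratio base is left open above; the sketch below (with the hypothetical name `OscillationDetector`) flags a window whose max−min swing exceeds 50% of the window minimum:

```rust
use std::collections::VecDeque;

/// Sliding-window oscillation detector (illustrative; the ratio definition —
/// (max - min) > 50% of the window minimum — is one possible choice).
struct OscillationDetector {
    window_ns: u64,
    samples: VecDeque<(u64, i64)>, // (timestamp_ns, applied value)
}

impl OscillationDetector {
    fn new(window_ns: u64) -> Self {
        Self { window_ns, samples: VecDeque::new() }
    }

    /// Record an applied value; returns true if an FMA alert should be raised.
    fn record(&mut self, now_ns: u64, value: i64) -> bool {
        self.samples.push_back((now_ns, value));
        // Expire samples older than the window.
        while let Some(&(t, _)) = self.samples.front() {
            if now_ns.saturating_sub(t) > self.window_ns {
                self.samples.pop_front();
            } else {
                break;
            }
        }
        let min = self.samples.iter().map(|&(_, v)| v).min().unwrap();
        let max = self.samples.iter().map(|&(_, v)| v).max().unwrap();
        let base = min.abs().max(1); // avoid division; compare cross-multiplied
        (max - min) * 2 > base       // swing > 50% of the window minimum
    }
}
```

An alert here does not block the update — the parameter is still applied (and still clamped); the detector only surfaces a likely misbehaving or manipulated service to FMA.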
22.1.9 Reference Policy Services
UmkaOS ships the following Tier 2 policy services (all optional, loaded on demand):
| Service | Model | Parameters tuned | Observations consumed | Cadence |
|---|---|---|---|---|
| `umka-ml-sched` | Gradient-boosted trees (XGBoost) | eevdf_weight_scale, migration_benefit_threshold, eas_energy_bias | TaskWoke, RunqueueStats, EasDecision | Every 60s |
| `umka-ml-numa` | Gradient-boosted regression | numa_migration_threshold, swap_local_ratio | NumaMigration, RefaultRecord, MemPressure | Every 30s |
| `umka-ml-compress` | Random forest | compress_entropy_threshold, reclaim_aggressiveness | PageFault, EvictionDecision, MemPressure | Every 30s |
| `umka-ml-power` | Contextual bandit (online RL) | rapl_package_power_w, accel_power_cap_w, thermal_target_c | EAS observations, accel utilization | Every 10s |
| `umka-ml-anomaly` | Isolation forest | anomaly_alert_threshold, health_check_interval_ms | FmaHealth device metrics | Every 5s |
| `umka-ml-tcp` | Online linear model | tcp_initial_cwnd_scale, bbr_probe_rtt_interval_ms | CongestionEvent, FlowStats | Every 5s |
Each service is a standalone Tier 2 driver (~500-2000 LOC) implementing the
PolicyConsumerVTable. They are independently upgradable, independently crashable
(Tier 2 restart in ~10ms, Section 10.2), and independently optional.
The umka-ml-sched service supports a "big model" extension point: if the environment
variable UMKA_ML_SCHED_REMOTE_ENDPOINT is set, it will call an external inference
service (Section 22.1.6) for low-confidence workload characterizations, falling back
to its local XGBoost model when the remote endpoint is unavailable.
22.1.10 Performance Impact
Observation emission:
| Operation | Cycles | Notes |
|---|---|---|
| Static key check (disabled) | 1–3 | Predicted-not-taken branch |
| TSC read | 5–10 | RDTSC or arch-specific |
| Ring buffer write | 10–20 | One cache line write |
| Total (enabled) | ~25 cycles | ~10ns at 2.5 GHz |
Parameter read (hot path):
| Operation | Cycles | Notes |
|---|---|---|
| `AtomicI64::load(Relaxed)` | 1–3 | Same cost as reading a global variable |
Parameter update (Tier 2 → kernel):
| Operation | Cycles | Notes |
|---|---|---|
| Ring buffer submission | ~20 | From Tier 2 userspace |
| Kernel validation + CAS | ~15 | Including capability check |
| Total end-to-end (async) | ~250,000 (~100μs) | Dominated by scheduling latency for async path |
Decay enforcement (per schedule_tick):
The enforce_param_decay scan visits every registered parameter once per tick on each
CPU. With 50 parameters and a 4ms tick period, that is ~50 comparisons per tick per
CPU; at ~3 cycles per comparison, ~150 cycles ≈ 60ns. Negligible.