Chapter 22: AI/ML Policy Framework

Companion to Chapter 21: AI/ML and Accelerators. This chapter contains §22.1: AI/ML Policy Framework — Closed-Loop Kernel Intelligence. See 21-accelerators.md for §21.1–§21.6 (hardware accelerator framework).


22.1 AI/ML Policy Framework: Closed-Loop Kernel Intelligence

Section 21.4 defines the in-kernel inference engine — a tiny integer-only model running on the hot path (~500-5000 cycles). Section 21.4.11 defines Tier 2 inference services — more powerful models in userspace. But these sections describe the inference plumbing, not the policy integration: what kernel knobs can ML actually adjust, how does the kernel emit telemetry so ML has data to learn from, and how do results from an external "big model" (LLM, large transformer, RL agent) flow back into the kernel to affect scheduler behavior until the next tuning cycle?

This section defines the complete closed-loop framework.

22.1.1 Design Principles

Advisory, not authoritative. Every ML-adjusted parameter has a heuristic fallback. If the ML layer crashes, misbehaves, or is absent, the kernel runs its built-in heuristics. ML improves average-case performance; it never replaces correctness.

Bounded parameters. Each tunable parameter has [min, max] bounds that the kernel enforces at the point where updates are applied. An ML model cannot set eevdf_weight_scale = 1000000 any more than a sysctl can — the kernel clamps it. This makes the parameter space safe even if the model is adversarially manipulated.

Temporal decay. Parameters set by ML automatically revert to their defaults after decay_period_ms milliseconds without a refresh. If the Tier 2 service crashes or becomes unreachable, the kernel gradually returns to baseline behavior. No explicit "ML service is down" signal is needed.

Workload-specific, not global. ML tuning is cgroup-scoped: the ML layer can set different parameters for a latency-sensitive web server cgroup and a batch analytics cgroup running on the same machine. Global parameters exist but are the minority.

Latency tiers for AI decisions:

| Tier | Latency | Mechanism | Examples |
|------|---------|-----------|----------|
| A | < 1 μs | Pure heuristic | Page fault handler, IRQ routing |
| B | 1–50 μs | In-kernel model (Section 21.4) | Page prefetch stride, I/O queue reorder |
| C | 50 μs–5 s | Tier 2 service round-trip | NUMA migration, compression selection, power budget |
| D | 5 s–5 min | Tier 2 + external "big model" | Scheduler workload characterization, full EAS recalibration, anomaly root-cause |

22.1.2 Kernel Observation Bus

Every kernel subsystem that participates in ML tuning emits observations via a zero-cost macro. Observations are stored in per-CPU ring buffers and consumed asynchronously by Tier 2 policy services.

// umka-core/src/ml/observation.rs

/// Identifies the emitting subsystem.
#[repr(u16)]
pub enum SubsystemId {
    Scheduler     = 1,
    MemoryManager = 2,
    TcpStack      = 3,
    BlockIo       = 4,
    PowerManager  = 5,
    FmaHealth     = 6,
    NvmeDriver    = 7,
    NetworkDriver = 8,
    // Room for up to 65535 subsystem IDs; new subsystems just add an entry.
}

/// Compact observation emitted by a kernel subsystem.
/// 64 bytes total — fits in one cache line.
#[repr(C)]
pub struct KernelObservation {
    pub timestamp_ns:    u64,            // Monotonic clock (TSC-derived)
    pub subsystem:       SubsystemId,    // Source subsystem
    pub obs_type:        u16,            // Subsystem-defined event type (see tables below)
    pub cgroup_id:       u32,            // Originating cgroup (0 = kernel/no cgroup)
    pub cpu_id:          u16,            // CPU where the event occurred
    pub _pad:            u16,
    pub features:        [i32; 10],      // Up to 10 integer feature dimensions
}

/// Per-CPU observation ring buffer.
/// Lock-free SPSC: kernel writes (producer), Tier 2 service reads (consumer).
/// Consumer attaches exactly one reader handle per subsystem per CPU.
pub struct ObservationRing {
    pub buf:      [KernelObservation; 4096],  // ~256 KB per CPU per subsystem
    pub head:     AtomicU64,                  // Write pointer (kernel)
    pub tail:     AtomicU64,                  // Read pointer (Tier 2)
    pub overflow: AtomicU64,                  // Dropped observation count
}
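
observe_kernel! (below) pushes through ObservationRing::push_current_cpu, which this section does not otherwise define. A minimal sketch of the producer side, assuming free-running head/tail counters; a real implementation would wrap slots in UnsafeCell rather than casting away const-ness:

impl ObservationRing {
    /// Producer-side push (kernel, current CPU). SPSC: no CAS needed on `head`.
    /// When the ring is full the observation is dropped and counted; the hot
    /// path never blocks on a slow consumer.
    pub fn push(&self, obs: KernelObservation) {
        let head = self.head.load(Relaxed);
        let tail = self.tail.load(Acquire);
        if head.wrapping_sub(tail) >= self.buf.len() as u64 {
            self.overflow.fetch_add(1, Relaxed);
            return;
        }
        let slot = (head % self.buf.len() as u64) as usize;
        // Sketch only: writes through a shared reference for brevity.
        unsafe {
            let p = self.buf.as_ptr().add(slot) as *mut KernelObservation;
            p.write(obs);
        }
        // Publish the slot to the consumer after the payload write completes.
        self.head.store(head.wrapping_add(1), Release);
    }
}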

The observe_kernel! macro emits observations with zero overhead when no consumer is registered (one byte static key, branch predicted-not-taken). It depends on the collect_features! helper macro to pack feature arguments into the fixed-size [i32; 10] features array:

/// Packs up to 10 feature expressions into a `[i32; 10]` array for use in
/// `KernelObservation::features`. Accepts 1–10 positional expressions; unused
/// slots are zero-filled. Each expression is cast with `as i32`, and passing
/// more than 10 expressions trips the array bounds check at the call site.
///
/// # Example
/// ```
/// let arr = collect_features!(latency_ns as i32, runqueue_len, cpu_id as i32);
/// // arr == [latency_ns as i32, runqueue_len, cpu_id as i32, 0, 0, 0, 0, 0, 0, 0]
/// ```
macro_rules! collect_features {
    ($($feat:expr),+ $(,)?) => {{
        let mut __features = [0i32; 10];
        let mut __idx = 0usize;
        $(
            // One store per feature expression, unrolled at compile time.
            __features[__idx] = ($feat) as i32;
            __idx += 1;
        )+
        let _ = __idx; // reads the final increment so no lint fires
        __features
    }};
}

Implementation note: the repetition in collect_features! is unrolled at compile time, so after optimization each slot store uses a constant index. The result is always a [i32; 10] with no run-time cost beyond evaluating the feature expressions themselves; zero-filling of unused slots is resolved by the array initializer.

With collect_features! in place, the observe_kernel! macro itself:

/// Emit a kernel observation.
/// Overhead when disabled: 1–3 cycles (static branch miss rate ~0%).
/// Overhead when enabled:  ~10–30 cycles (TSC read + ring buffer write).
///
/// # Example
/// ```
/// observe_kernel!(SubsystemId::Scheduler, SchedObs::TaskWoke,
///     cgroup_id, latency_ns as i32, runqueue_len, prev_cpu);
/// ```
macro_rules! observe_kernel {
    ($subsystem:expr, $obs_type:expr, $cgroup:expr, $($feat:expr),+) => {{
        // Static key site: a single NOP (0x90) when a consumer is registered
        // (fall through into the body), patched to a JMP over the body when
        // none is. The runtime patcher flips it as Tier 2 services register
        // and unregister.
        if static_key_enabled!(OBSERVE_ENABLED[$subsystem as usize]) {
            let __obs = KernelObservation {
                timestamp_ns: crate::arch::current::cpu::read_tsc_ns(),
                subsystem:    $subsystem,
                obs_type:     $obs_type as u16,
                cgroup_id:    $cgroup,
                cpu_id:       CpuLocal::cpu_id() as u16,
                _pad:         0,
                features:     collect_features!($($feat),+),
            };
            ObservationRing::push_current_cpu($subsystem, __obs);
        }
    }}
}

Aggregation. Tier 2 services typically consume raw observations and aggregate them into feature vectors over configurable windows (100ms, 1s, 10s, 60s). The ring buffer provides raw data; aggregation policy is entirely in Tier 2 userspace.
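
For illustration, the per-window state such an aggregator keeps can be as small as a count, a running sum, and a running max per feature dimension. WindowAgg is a hypothetical Tier 2 type, not part of any interface defined here:

struct WindowAgg {
    count: u64,
    sum:   [i64; 10],   // per-dimension running sum
    max:   [i32; 10],   // per-dimension running max
}

impl WindowAgg {
    fn add(&mut self, obs: &KernelObservation) {
        self.count += 1;
        for (i, &f) in obs.features.iter().enumerate() {
            self.sum[i] += f as i64;
            self.max[i] = self.max[i].max(f);
        }
    }

    /// Integer mean of dimension `i`, matching the integer-only feature pipeline.
    fn mean(&self, i: usize) -> i64 {
        if self.count == 0 { 0 } else { self.sum[i] / self.count as i64 }
    }
}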

22.1.3 Tunable Parameter Store

Every kernel subsystem that accepts ML-driven tuning registers its parameters in a global KernelParamStore. Parameters are read by the kernel with a single atomic load (~1–3 cycles); writes require a CAS plus a version increment.

// umka-core/src/ml/params.rs

/// A single tunable kernel parameter.
/// Layout: 128 bytes (two cache lines); `current` and `version` are atomics,
/// so readers never observe torn values.
#[repr(C, align(64))]
pub struct KernelTunableParam {
    /// Unique monotonic ID assigned at registration time.
    pub param_id:         u32,
    pub subsystem:        SubsystemId,
    pub param_name:       [u8; 24],       // Null-terminated ASCII name

    /// Current value (ML-adjusted). Read with AtomicI64::load(Relaxed).
    pub current:          AtomicI64,
    pub default_value:    i64,
    pub min_value:        i64,
    pub max_value:        i64,

    /// After this many ms without a refresh from the ML layer, the parameter
    /// automatically decays to `default_value`. 0 = no decay (permanent until reset).
    pub decay_period_ms:  u32,

    /// Monotonic version, incremented on every successful update.
    /// Consumers can detect parameter staleness by watching for version changes.
    pub version:          AtomicU64,

    /// Timestamp of last ML-driven update (ns). 0 = never updated.
    pub last_updated_ns:  AtomicU64,
}

/// Global parameter registry. Up to 1024 tunable parameters total.
/// Initialized at boot from per-subsystem `register_param!` calls.
pub struct KernelParamStore {
    pub params:    [Option<KernelTunableParam>; 1024],
    pub count:     AtomicU32,
    pub store_lock: SpinLock<()>,  // Protects registration; not held during reads
}
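
register_param! itself is not specified in this section; a plausible boot-time registration call, with fields mirroring KernelTunableParam, might look like:

// Hypothetical scheduler registration at boot (sketch; macro syntax assumed):
register_param!(
    subsystem:       SubsystemId::Scheduler,
    param_name:      "eevdf_weight_scale",
    default_value:   100,
    min_value:       50,
    max_value:       200,
    decay_period_ms: 60_000,   // revert to default after 60s without refresh
);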

Reading a parameter costs essentially the same as reading a hardcoded global (one relaxed atomic load):

// Hot-path read pattern — no lock, single atomic load:
let weight_scale = PARAM_STORE.get(ParamId::SchedEevdfWeightScale)
    .map_or(100, |p| p.current.load(Relaxed));

Decay enforcement runs from the scheduler tick on each CPU:

// Called from schedule_tick(): O(registered_params), a compare or two per entry
fn enforce_param_decay(now_ns: u64) {
    for param in PARAM_STORE.active_params() {
        if param.decay_period_ms == 0 {
            continue; // 0 = no decay, permanent until reset
        }
        let last = param.last_updated_ns.load(Relaxed);
        // last == 0 means the ML layer never touched this param.
        if last != 0 && now_ns.saturating_sub(last) > param.decay_period_ms as u64 * 1_000_000 {
            // Store once, not on every subsequent tick after expiry.
            if param.current.load(Relaxed) != param.default_value {
                param.current.store(param.default_value, Release);
                param.version.fetch_add(1, Release);
            }
        }
    }
}

22.1.4 Policy Consumer KABI (Tier 2 → Kernel)

The vtable relies on two supporting types that the kernel passes at registration; these are specified first:

// umka-core/src/ml/observation.rs  (continued)

/// Per-CPU set of observation ring buffers, one per observation category
/// (i.e., per SubsystemId). Populated by the `observe_kernel!` macro; consumed
/// by policy services. The kernel mmaps this structure read-only into the
/// Tier 2 service address space.
pub struct ObservationRingSet {
    /// Per-category ring buffers. Indexed by `SubsystemId as usize`.
    /// Entries for unregistered subsystems are zeroed and must not be read.
    pub rings: [ObservationRing; OBSERVATION_CATEGORY_COUNT],
    /// CPU this ring set belongs to (for NUMA-aware allocation by the consumer).
    pub cpu_id: u32,
    /// Total observations dropped due to ring overflow since last reset.
    /// Updated atomically by the kernel producer; read-only for consumers.
    pub dropped: AtomicU64,
}

/// Number of observation categories (= max SubsystemId discriminant + 1,
/// rounded up to the next power of two for index-by-enum access).
pub const OBSERVATION_CATEGORY_COUNT: usize = 16;

// umka-core/src/ml/params.rs  (continued)

/// Read-only shadow view of the KernelTunableParam store for use by policy consumers.
/// Refreshed by the kernel on every param store change; consumers read it without
/// taking the param store registration lock.
///
/// The kernel mmaps this structure into Tier 2 service address space as read-only.
/// Consumers detect torn reads via `epoch` (seqlock pattern): if `epoch` changes
/// between two reads of `params[i]`, re-read the slot.
pub struct KernelParamStoreShadow {
    /// Seqlock epoch counter: incremented once before and once after every
    /// shadow update. Odd epoch = update in progress; even epoch = stable snapshot.
    /// Consumers must spin-wait for an even epoch before reading `params`.
    pub epoch: AtomicU64,
    /// Snapshot of all tunable parameters as of `epoch`.
    /// Indexed by `KernelTunableParam::param_id` (0..MAX_TUNABLE_PARAMS).
    /// Entries whose corresponding param is unregistered hold `i64::MIN` as a sentinel.
    pub params: [AtomicI64; MAX_TUNABLE_PARAMS],
    /// Timestamp of the last shadow update (monotonic clock, nanoseconds since boot).
    pub last_update_ns: AtomicU64,
}

/// Maximum number of tunable parameters tracked in `KernelParamStoreShadow`.
/// `KernelParamStore` supports up to 1024 registered parameters; the shadow
/// uses the same bound so indexing is identical.
pub const MAX_TUNABLE_PARAMS: usize = 1024;
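
Consumer-side reads follow the seqlock protocol described on epoch. A minimal sketch, assuming the shadow is already mmapped into the service address space:

/// Read one slot from the shadow, retrying across concurrent updates.
fn read_shadow_param(shadow: &KernelParamStoreShadow, param_id: usize) -> Option<i64> {
    loop {
        let e1 = shadow.epoch.load(Acquire);
        if e1 % 2 == 1 {
            core::hint::spin_loop();   // update in progress, wait for even epoch
            continue;
        }
        let raw = shadow.params[param_id].load(Acquire);
        if shadow.epoch.load(Acquire) == e1 {
            // i64::MIN is the "unregistered" sentinel defined above.
            return if raw == i64::MIN { None } else { Some(raw) };
        }
        // Epoch moved between the two reads: re-read the slot.
    }
}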

A Tier 2 policy driver that wishes to read observations and write parameter updates implements the PolicyConsumerVTable:

#[repr(C)]
pub struct PolicyConsumerVTable {
    pub vtable_size: u64,
    pub version:     u32,

    /// Kernel calls this on registration to give the service access to
    /// observation ring buffer handles and the parameter store.
    ///
    /// # Safety
    /// `ctx` must remain valid for the lifetime of the service registration.
    pub on_register: unsafe extern "C" fn(
        ctx:        *mut ServiceCtx,
        obs_rings:  *const ObservationRingSet,   // One ring per CPU per subscribed subsystem
        param_store: *const KernelParamStoreShadow, // Read-only view of current parameters
    ) -> KabiResult,

    /// Kernel delivers a batch of observations from the ring buffer.
    /// Called on a background kernel thread; must return promptly (<100μs).
    /// Heavy processing should be deferred to the service's own thread pool.
    ///
    /// # Safety
    /// `obs` points to `count` valid `KernelObservation` entries.
    pub on_observations: unsafe extern "C" fn(
        ctx:   *mut ServiceCtx,
        obs:   *const KernelObservation,
        count: u32,
    ) -> KabiResult,
}

/// Message sent from Tier 2 policy service to the kernel to update a parameter.
/// Submitted via a dedicated policy update ring buffer (separate from observations).
///
/// Layout is fixed at 128 bytes for KABI stability. All implicit compiler padding
/// is made explicit via named `_pad*` fields so that the layout is identical
/// across compiler versions, optimization levels, and target architectures.
/// Future fields may be added within `_reserved`; increment `PolicyConsumerVTable::version`
/// when doing so.
#[repr(C)]
pub struct PolicyUpdateMsg {
    /// Which parameter to update.
    pub param_id:       u32,
    /// Explicit padding to align `new_value` to 8 bytes.
    pub _pad0:          [u8; 4],
    /// New value — kernel clamps to [min_value, max_value] before applying.
    pub new_value:      i64,
    /// Parameter is valid for this many ms (0 = use param's default decay_period).
    /// If the service crashes, the parameter decays after this interval.
    pub valid_for_ms:   u32,
    /// Explicit padding to align `model_seq` to 8 bytes.
    pub _pad1:          [u8; 4],
    /// Monotonic sequence number from the ML model that produced this update.
    /// Out-of-order or duplicate updates (lower seq than current) are silently dropped.
    pub model_seq:      u64,
    /// Optional: restrict update to a specific cgroup (0 = global / kernel-wide).
    pub cgroup_id:      u32,
    /// Explicit padding to 128-byte total size for KABI stability.
    /// Future fields may be added here; increment `version` when adding.
    pub _reserved:      [u8; 92],
}
const _: () = assert!(core::mem::size_of::<PolicyUpdateMsg>() == 128);

The kernel validates every PolicyUpdateMsg before applying it:

1. param_id must exist in KernelParamStore
2. new_value is clamped to [min_value, max_value] (never rejected; clamped silently)
3. The caller must hold Capability::KernelMlTune (Tier 2 services receive this at registration if granted by the operator; see Section 22.1.8)
4. model_seq must be ≥ the last applied seq for this (param_id, cgroup_id) pair
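
Put together, the kernel-side apply path is a short function. A sketch only, assuming helper accessors (by_id, has_capability, last_applied_seq) and a PolicyError type that this section does not define:

fn apply_policy_update(
    store:  &KernelParamStore,
    msg:    &PolicyUpdateMsg,
    caller: &ServiceCtx,
    now_ns: u64,
) -> Result<(), PolicyError> {
    // 1. The parameter must exist.
    let param = store.by_id(msg.param_id).ok_or(PolicyError::UnknownParam)?;
    // 3. Capability check (numbering follows the list above).
    if !caller.has_capability(Capability::KernelMlTune) {
        return Err(PolicyError::NoCapability);
    }
    // 4. Silently drop stale or duplicate updates.
    if msg.model_seq <= caller.last_applied_seq(msg.param_id, msg.cgroup_id) {
        return Ok(());
    }
    // (Cgroup ownership validation omitted for brevity.)
    // 2. Clamp, never reject, out-of-range values.
    let v = msg.new_value.clamp(param.min_value, param.max_value);
    param.current.store(v, Release);
    param.last_updated_ns.store(now_ns, Release);
    param.version.fetch_add(1, Release);
    Ok(())
}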

22.1.5 Subsystem Integration Catalog

Each subsystem that participates in ML tuning registers its parameters at boot. The tables below define the initial parameter sets and observation types.

Scheduler (Section 6)

Observation types (SchedObs):

| obs_type | features[0..9] | Meaning |
|----------|----------------|---------|
| TaskWoke | latency_ns, runq_len, cpu, prev_cpu, cgroup_id | Task wakeup latency |
| MigrateDecision | src_cpu, dst_cpu, task_weight, queue_diff, benefit_ns | NUMA/load migration |
| PreemptionEvent | preemptor_prio, preemptee_prio, cgroup_id, — | Preemption occurred |
| RunqueueStats | runq_len, avg_vruntime, nr_throttled, nr_rt, cgroup_id | Per-CPU runqueue snapshot (every 10ms) |
| EasDecision | task_cgroup, chosen_cpu, energy_delta_uw, load_delta, — | EAS placement |

Tunable parameters (SchedParam):

| param_name | default | min | max | decay | Effect |
|------------|---------|-----|-----|-------|--------|
| eevdf_weight_scale | 100 | 50 | 200 | 60s | Scale factor for virtual deadline computation (100 = baseline) |
| migration_benefit_threshold | 1000 | 100 | 50000 | 30s | Minimum ns benefit to justify task migration |
| preemption_latency_budget | 1000 | 100 | 10000 | 30s | Maximum μs a lower-priority task runs before preemption check |
| eas_energy_bias | 50 | 0 | 100 | 60s | 0 = performance, 100 = max energy saving in EAS decisions |
| cfs_burst_quota_us | 0 | 0 | 100000 | 60s | CFS burst tolerance for cgroup (0 = disabled) |

Memory Manager (Section 4)

Observation types (MemObs):

| obs_type | features[0..9] | Meaning |
|----------|----------------|---------|
| PageFault | cgroup, fault_type, addr_band, file_offset_band, prefetch_hit | Page fault event |
| EvictionDecision | cgroup, evicted_page_type, lru_age, refault_distance, — | Which page was evicted |
| NumaMigration | src_node, dst_node, cgroup, pages_moved, benefit_ns | NUMA migration result |
| RefaultRecord | cgroup, file_inode, page_offset, time_since_evict_ms | Previously-evicted page faulted again |
| MemPressure | node, free_pages, anon_pages, file_pages, slab_pages | Memory pressure snapshot (every 1s) |

Tunable parameters (MemParam):

| param_name | default | min | max | decay | Effect |
|------------|---------|-----|-----|-------|--------|
| reclaim_aggressiveness | 100 | 25 | 400 | 30s | LRU reclaim rate relative to baseline |
| prefetch_window_pages | 8 | 1 | 128 | 30s | Max pages to prefetch per fault event |
| numa_migration_threshold | 200 | 50 | 2000 | 60s | Minimum benefit (ns) per page to trigger NUMA migration |
| compress_entropy_threshold | 128 | 64 | 255 | 60s | Page entropy (0–255) above which compression is skipped |
| swap_local_ratio | 80 | 0 | 100 | 30s | % of swap that goes to local vs RDMA remote swap (Section 5) |

TCP / Network (Section 14)

Observation types (NetObs):

| obs_type | features[0..9] | Meaning |
|----------|----------------|---------|
| CongestionEvent | cgroup, cwnd, rtt_us, retransmits, bandwidth_mbps | Congestion window event |
| FlowStats | cgroup, bytes_sent, bytes_recv, rtt_p99_us, loss_pct | Per-flow statistics snapshot |
| RouteDecision | src_addr_band, dst_addr_band, chosen_dev, alternative_dev, latency_us | Routing decision |

Tunable parameters (NetParam):

| param_name | default | min | max | decay | Effect |
|------------|---------|-----|-----|-------|--------|
| tcp_initial_cwnd_scale | 10 | 2 | 100 | 30s | Initial congestion window (segments) per cgroup |
| bbr_probe_rtt_interval_ms | 10000 | 200 | 60000 | 60s | BBR min-RTT probe interval |
| tcp_pacing_gain_pct | 125 | 100 | 200 | 30s | BBR pacing gain percentage |
| ecn_aggressiveness | 1 | 0 | 3 | 60s | 0 = off, 1 = ECT(1), 2 = ECT(0), 3 = always mark |

I/O Scheduler (Section 14.13)

Tunable parameters (IoParam):

| param_name | default | min | max | decay | Effect |
|------------|---------|-----|-----|-------|--------|
| readahead_pages | 32 | 1 | 512 | 30s | Readahead window in pages |
| queue_depth_target | 32 | 4 | 1024 | 30s | Target NVMe queue depth per cgroup |
| latency_target_us | 0 | 0 | 100000 | 30s | 0 = throughput mode; >0 = latency target (μs) |

Power Manager (Section 6)

Tunable parameters (PowerParam):

| param_name | default | min | max | decay | Effect |
|------------|---------|-----|-----|-------|--------|
| rapl_package_power_w | 0 | 5 | 400 | 10s | CPU package power cap (W); 0 = no cap |
| cpu_freq_min_mhz | 0 | 0 | 10000 | 10s | Minimum CPU frequency; 0 = hardware default |
| accel_power_cap_w | 0 | 0 | 1000 | 10s | Per-accelerator power cap; 0 = hardware default |
| thermal_target_c | 95 | 60 | 105 | 5s | Thermal throttle target (°C) |

Note that rapl_package_power_w has a default (0, "no cap") lying outside its [5, 400] bounds: the bounds constrain only ML-written values, while decay restores the sentinel default.

FMA / Security (Section 16)

Tunable parameters (FmaParam):

| param_name | default | min | max | decay | Effect |
|------------|---------|-----|-----|-------|--------|
| anomaly_alert_threshold | 80 | 10 | 100 | 60s | Score (0–100) above which FMA raises alert |
| health_check_interval_ms | 1000 | 100 | 60000 | 0 | How frequently to poll device health counters |
| error_rate_window_ms | 5000 | 100 | 300000 | 0 | Error rate measurement window |

22.1.6 Heavy Model Integration Pattern

The "big model" pattern enables a Tier 2 driver to call any model — a large transformer, an LLM, a remote inference service, or a custom RL policy — and feed the results back as PolicyUpdateMsg entries. The kernel is unaware of where the result came from; it only sees a bounded parameter update through the standard mechanism.

Data flow:

╔═══════════════════════════════════════════════════════════════════╗
║ UmkaOS Core (Tier 0, Ring 0)                                      ║
║                                                                   ║
║  [observe_kernel! macro calls] ──► [ObservationRing per CPU]      ║
║                                                                   ║
║  [KernelParamStore] ◄── [validation] ◄── [PolicyUpdateMsg ring]   ║
╚════════╦══════════════════════════════════════╦═══════════════════╝
         ║ mmap ObservationRings                ║ PolicyUpdateMsg
         ▼ (read-only, zero-copy)               ║ (write ring, validated)
╔═════════════════════════════════════╗        ║
║ Tier 2 Policy Service Process       ║        ║
║ (Ring 3, hardware-isolated)         ║        ║
║                                     ║        ║
║  ┌──────────────────────────┐       ║        ║
║  │ Observation aggregator   │       ║        ║
║  │ (100ms/1s/10s windows)   │       ║        ║
║  └──────────┬───────────────┘       ║        ║
║             │ feature vector        ║        ║
║             ▼                       ║        ║
║  ┌──────────────────────────┐       ║        ║
║  │ Inference layer:         │       ║        ║
║  │ in-process small model,  │       ║        ║
║  │ or external big model    │       ║        ║
║  └──────────┬───────────────┘       ║        ║
║             │ PolicyUpdateMsg       ║        ║
║             └───────────────────────╫────────╝
║                                     ║
╚═════════════════════════════════════╝

External big model call — concrete example:

A Tier 2 scheduler policy service runs a 5-minute characterization cycle:

Every 5 minutes:
  1. Drain last 5 minutes of Scheduler + Memory ObservationRings into feature matrix
     (per-cgroup stats: avg task latency, cache miss rate, memory pressure, IPC, etc.)

  2. If local small model is confident (prediction score > 0.85):
     → Apply parameter updates directly (Tier C path, ~100ms latency)

  3. If confidence is low OR first boot OR significant workload shift detected:
     → Serialize feature matrix as JSON/protobuf
     → Call external inference service via UNIX socket or HTTP:
       POST /analyze { features: [...], cgroup_ids: [...] }
     → Service runs large model (XGBoost, small transformer, RL policy)
       (100ms – 5s latency acceptable at this tier)
     → Parse response: { cgroup_id: 42, param_id: "eevdf_weight_scale", value: 130 }

  4. For each (param_id, new_value) in response:
     → Validate cgroup ownership (service can only tune cgroups it owns)
     → Submit PolicyUpdateMsg with valid_for_ms = 300_000 (expires in 5 min)
     → Kernel applies on next parameter read (atomic store, ~5 cycles)

  5. Log all updates to FMA audit ring (Section 19.1)
     → {timestamp, service_id, model_version, param_id, old_value, new_value}
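
In Rust terms, steps 3 and 4 reduce to a short function. A sketch only: FeatureMatrix, ParamSuggestion, serialize_features, http_post_json, submit_policy_update, and next_model_seq are hypothetical helpers standing in for the service's actual transport and ring-submission code:

fn characterize_remotely(features: &FeatureMatrix) -> Result<(), PolicyError> {
    // Step 3: serialize and call the external model (100ms–5s is fine at Tier D).
    let body = serialize_features(features);              // JSON or protobuf
    let suggestions: Vec<ParamSuggestion> =
        http_post_json("/analyze", &body)?;

    // Step 4: one PolicyUpdateMsg per suggestion, expiring with the 5-minute cycle.
    for s in suggestions {
        // (Cgroup ownership validation happens before submission.)
        submit_policy_update(PolicyUpdateMsg {
            param_id:     s.param_id,
            _pad0:        [0; 4],
            new_value:    s.value,
            valid_for_ms: 300_000,
            _pad1:        [0; 4],
            model_seq:    next_model_seq(),
            cgroup_id:    s.cgroup_id,
            _reserved:    [0; 92],
        })?;
    }
    Ok(())
}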

The external service can be:

- A local Python/Rust process running PyTorch, XGBoost, or scikit-learn
- A remote inference microservice (gRPC or HTTP) running on another node
- An LLM with a structured output schema (for workload characterization and root-cause analysis)
- A reinforcement learning agent maintaining state across tuning cycles

None of this requires any kernel changes: the kernel only sees PolicyUpdateMsg entries.

22.1.7 Model Weight Update Flow

When a Tier 2 service trains or fine-tunes an in-kernel model (Section 21.4), it ships updated weights via the existing sysfs interface (Section 21.4.4). The update is atomic from the kernel's perspective:

Tier 2 online learning loop:

  Every N minutes (configurable per service):
    1. Extract recent training data from ObservationRings + ground-truth outcomes
       (ground truth: observed page refault rates for prefetch model; actual I/O
       latencies for I/O scheduler model)

    2. Run mini-batch update in Tier 2 userspace (full FP, no kernel restrictions)

    3. Quantize new weights to INT8/INT16 using the .isleml binary format (Section 21.4.9)

    4. Validate model offline:
       - Pass the model through the load-time validator (Section 21.4.9)
       - Run 1000 representative inputs; compare against previous model
       - Accept only if accuracy delta > -2% (do not regress more than 2 points)

    5. Write to /sys/kernel/umka/inference/models/<model_name>/model.bin
       → Kernel receives write, invokes load-time validator (Section 21.4.9)
       → On validation pass: CAS swap of AtomicModelRef pointer
       → Old model freed after RCU grace period
       → New weights active for next inference call (within microseconds)

    6. If validation fails: keep previous model, log failure to FMA ring
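
The shipping step itself is an ordinary file write from Tier 2 userspace. A minimal sketch of step 5, assuming std is available to the service and a model named prefetch:

// Kernel-side validation happens on write; if it fails, the write errors out
// and the previous model remains active (step 6).
fn ship_weights(quantized: &[u8]) -> std::io::Result<()> {
    std::fs::write(
        "/sys/kernel/umka/inference/models/prefetch/model.bin",
        quantized,
    )
}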

AtomicModelRef — the kernel's handle on the active in-kernel model:

// umka-core/src/inference/model.rs

/// RCU-protected reference to the active in-kernel model.
/// Replaced atomically on weight update; old model freed after grace period.
pub struct AtomicModelRef {
    pub ptr: AtomicPtr<KernelModel>,  // Null = use heuristic fallback
}

impl AtomicModelRef {
    /// Hot-path inference: load model pointer, run inference, drop RCU guard.
    /// The rcu_read_guard prevents concurrent model replacement from freeing
    /// the model while inference is in progress.
    pub fn infer(&self, features: &[i32]) -> Option<i32> {
        let _guard = rcu_read_lock();
        let model = unsafe { self.ptr.load(Acquire).as_ref()? };
        if !model.active.load(Relaxed) { return None; }
        Some(model.run(features))
    }

    /// Called from sysfs write handler on model.bin write.
    /// Validates, then atomically replaces the active model.
    pub fn update(&self, new_model: Box<KernelModel>) -> Result<(), ModelError> {
        new_model.validate()?;
        let new_ptr = Box::into_raw(new_model);
        let old_ptr = self.ptr.swap(new_ptr, AcqRel);
        if !old_ptr.is_null() {
            // Free old model after all CPUs pass through a quiescent state.
            rcu_call(move || unsafe { drop(Box::from_raw(old_ptr)) });
        }
        Ok(())
    }
}
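
At a call site, infer composes with the heuristic fallback required by Section 22.1.1. PREFETCH_MODEL and heuristic_prefetch_window are illustrative names:

// Advisory, not authoritative: None (no model loaded, or model inactive)
// falls back to the built-in heuristic.
let window = PREFETCH_MODEL
    .infer(&features)
    .unwrap_or_else(|| heuristic_prefetch_window(&features));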

22.1.8 Security and Capability Model

CAP_ML_TUNE capability. Applying PolicyUpdateMsg entries to the kernel requires the Capability::KernelMlTune capability. This is an UmkaOS-specific capability (not a Linux CAP_*), held by the Tier 2 service process when:

1. The operator has granted it at service registration time, OR
2. The service is one of UmkaOS's reference policy services and is cryptographically signed

Without KernelMlTune, a process can read observations from its own cgroup but cannot write parameter updates.

Bounds enforcement. Every new_value in PolicyUpdateMsg is silently clamped to [min_value, max_value] before being stored. An ML service that produces out-of-range values is not rejected — the clamping is the safety mechanism.

Namespace isolation. A containerized Tier 2 service sees only its own cgroup's parameters and observations. A service in cgroup docker/myapp cannot read memory pressure observations from system.slice, nor can it set rapl_package_power_w globally. Global parameter updates require CAP_ML_TUNE + CAP_SYS_ADMIN.

Audit log. Every parameter update is logged to the FMA ring (Section 19.1) with:

- {ts_ns, service_id, model_version, param_id, cgroup_id, old_value, new_value}
- Log entries are write-once; tampering with the audit ring requires CAP_SYS_ADMIN
- The FMA ring is accessible to security monitoring tools via umkafs /System/Kernel/MLAudit

Adversarial protection:

- Rate limiting: at most 1000 PolicyUpdateMsg entries per service per second
- Consistency bounds: if a parameter oscillates by more than 50% within 10s, an FMA alert is raised (this may indicate a misbehaving service or adversarial input to the ML model); a sketch of the check follows below
- Model versioning: models are refused if their validation accuracy is below 60% on the standard benchmark set embedded in the kernel binary at build time
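
A sketch of the oscillation check, with state tracked per (param_id, cgroup_id); ParamWindow and oscillation_alert are illustrative names:

struct ParamWindow {
    min: i64,
    max: i64,
    window_start_ns: u64,
}

/// True when the parameter swung by more than 50% of the window midpoint
/// within the 10s window (the FMA alert condition; overflow handling elided).
fn oscillation_alert(w: &mut ParamWindow, value: i64, now_ns: u64) -> bool {
    const WINDOW_NS: u64 = 10_000_000_000;
    if now_ns.saturating_sub(w.window_start_ns) > WINDOW_NS {
        *w = ParamWindow { min: value, max: value, window_start_ns: now_ns };
        return false;
    }
    w.min = w.min.min(value);
    w.max = w.max.max(value);
    let mid = (w.min + w.max) / 2;
    mid != 0 && (w.max - w.min) * 100 > mid.abs() * 50
}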

22.1.9 Reference Policy Services

UmkaOS ships the following Tier 2 policy services (all optional, loaded on demand):

| Service | Model | Parameters tuned | Observations consumed | Cadence |
|---------|-------|------------------|-----------------------|---------|
| umka-ml-sched | Gradient-boosted trees (XGBoost) | eevdf_weight_scale, migration_benefit_threshold, eas_energy_bias | TaskWoke, RunqueueStats, EasDecision | Every 60s |
| umka-ml-numa | Gradient-boosted regression | numa_migration_threshold, swap_local_ratio | NumaMigration, RefaultRecord, MemPressure | Every 30s |
| umka-ml-compress | Random forest | compress_entropy_threshold, reclaim_aggressiveness | PageFault, EvictionDecision, MemPressure | Every 30s |
| umka-ml-power | Contextual bandit (online RL) | rapl_package_power_w, accel_power_cap_w, thermal_target_c | EAS observations, accel utilization | Every 10s |
| umka-ml-anomaly | Isolation forest | anomaly_alert_threshold, health_check_interval_ms | FmaHealth device metrics | Every 5s |
| umka-ml-tcp | Online linear model | tcp_initial_cwnd_scale, bbr_probe_rtt_interval_ms | CongestionEvent, FlowStats | Every 5s |

Each service is a standalone Tier 2 driver (~500-2000 LOC) implementing the PolicyConsumerVTable. They are independently upgradable, independently crashable (Tier 2 restart in ~10ms, Section 10.2), and independently optional.
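
A minimal skeleton of such a service, as a sketch: KabiResult::Ok, enqueue_for_aggregation, and the surrounding registration plumbing are assumptions here, not part of the KABI defined above:

unsafe extern "C" fn sched_on_register(
    _ctx:         *mut ServiceCtx,
    _obs_rings:   *const ObservationRingSet,
    _param_store: *const KernelParamStoreShadow,
) -> KabiResult {
    // Stash the ring and shadow pointers for the service's worker threads.
    KabiResult::Ok
}

unsafe extern "C" fn sched_on_observations(
    _ctx:  *mut ServiceCtx,
    obs:   *const KernelObservation,
    count: u32,
) -> KabiResult {
    // Must return within 100μs: copy the batch out and wake a worker.
    let batch = core::slice::from_raw_parts(obs, count as usize);
    enqueue_for_aggregation(batch);
    KabiResult::Ok
}

static VTABLE: PolicyConsumerVTable = PolicyConsumerVTable {
    vtable_size:     core::mem::size_of::<PolicyConsumerVTable>() as u64,
    version:         1,
    on_register:     sched_on_register,
    on_observations: sched_on_observations,
};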

The umka-ml-sched service supports a "big model" extension point: if the environment variable UMKA_ML_SCHED_REMOTE_ENDPOINT is set, it will call an external inference service (Section 22.1.6) for low-confidence workload characterizations, falling back to its local XGBoost model when the remote endpoint is unavailable.

22.1.10 Performance Impact

Observation emission:

| Operation | Cycles | Notes |
|-----------|--------|-------|
| Static key check (disabled) | 1–3 | Predicted-not-taken branch |
| TSC read | 5–10 | RDTSC or arch-specific equivalent |
| Ring buffer write | 10–20 | One cache line write |
| Total (enabled) | ~25 | ~10 ns at 2.5 GHz |

Parameter read (hot path):

| Operation | Cycles | Notes |
|-----------|--------|-------|
| AtomicI64::load(Relaxed) | 1–3 | Same cost as reading a global variable |

Parameter update (Tier 2 → kernel):

| Operation | Cost | Notes |
|-----------|------|-------|
| Ring buffer submission | ~20 cycles | From Tier 2 userspace |
| Kernel validation + CAS | ~15 cycles | Including capability check |
| Total end-to-end (async) | ~100 μs | Dominated by scheduling latency on the async path |

Decay enforcement (per schedule_tick):

The enforce_param_decay scan visits all registered parameters once per tick on each CPU. With 50 parameters and a 4ms tick period, that is ~50 comparisons per CPU per tick; at ~3 cycles per comparison, roughly 150 cycles ≈ 60ns. Negligible.