Chapter 19: Observability and Diagnostics
Fault management architecture, stable tracepoints, debugging/ptrace, unified object namespace (umkafs)
19.1 Fault Management Architecture
Inspired by: Solaris/illumos FMA. IP status: Clean — generic engineering concepts, clean-room implementation from public principles.
19.1.1 Problem
Hardware fails gradually. ECC memory corrects bit flips before they become fatal. NVMe drives report wear leveling and reallocated sectors. PCIe links log correctable errors. Network interfaces track CRC errors. These are warning signals.
Linux largely ignores them. Userspace tools (mcelog, smartctl, rasdaemon) scrape
ad-hoc kernel interfaces. There is no unified kernel-level framework for collecting
hardware telemetry, diagnosing trends, and taking corrective action before a crash.
UmkaOS already has crash recovery (Section 10.8). FMA extends this from reactive ("driver crashed, reload it") to proactive ("this DIMM is degrading, retire its pages before data corruption occurs").
19.1.2 Architecture
+------------------------------------------------------------------+
| FAULT MANAGEMENT ENGINE |
| (UmkaOS Core, kernel-internal) |
| |
| Telemetry Diagnosis Response |
| Collector ---> Engine ---> Executor |
| |
| - Per-device - Rule-based - Retire memory pages |
| health buffers - Threshold - Demote driver tier |
| - Ring buffer detection - Disable device |
| (lock-free) - Correlation - Alert via printk/uevent |
| - NUMA-aware (multi-signal) - Migrate I/O (if possible) |
+------------------------------------------------------------------+
^ |
| Health reports via KABI | Actions via registry
| v
+------------------+ +-------------------+
| Device Drivers | | Device Registry |
| (Tier 0/1/2) | | (Section 10.5) |
+------------------+ +-------------------+
19.1.3 Telemetry Collection
Drivers report health data to the kernel through a new KABI method:
// Appended to KernelServicesVTable (Option<...> for backward compat)
/// Report a health telemetry event.
pub fma_report_health: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
event_class: HealthEventClass,
event_code: u32,
severity: HealthSeverity,
data: *const u8,
data_len: u32,
) -> IoResultCode>,
/// Health event classification.
#[repr(u32)]
pub enum HealthEventClass {
/// Memory: ECC corrected error, uncorrectable error, scrub result.
Memory = 0,
/// Storage: SMART attribute change, wear level, reallocated sector.
Storage = 1,
/// Network: CRC errors, link flaps, packet drops.
Network = 2,
/// PCIe: Correctable error, AER event, link retraining.
Pcie = 3,
/// Thermal: Over-temperature warning, throttling.
Thermal = 4,
/// Power: Voltage out of range, power supply degradation.
Power = 5,
/// Generic: Driver-defined health event.
Generic = 6,
/// Accelerator: GPU, FPGA, or other accelerator health event.
Accelerator = 7,
}
impl HealthEventClass {
/// Number of variants in this enum. Used to size arrays indexed by
/// event class so the length is defined in one place. Must be kept in
/// sync when a variant is added.
pub const COUNT: usize = 8;
}
#[repr(u8)]
pub enum HealthSeverity {
/// Error was automatically corrected (e.g., ECC single-bit correction).
/// Logged but no action required. Health score penalty: 1 (Section 19.1.9).
Corrected = 0,
/// Informational: no action needed, for trending.
Info = 1,
/// Warning: threshold approaching, admin should investigate.
Warning = 2,
/// Degraded: component partially failed, corrective action taken.
Degraded = 3,
/// Critical: imminent failure, immediate action required.
Critical = 4,
/// Fatal — unrecoverable failure. Triggers device offline or kernel panic
/// per escalation policy (Section 19.1.9). Health score penalty: 100.
Fatal = 5,
}
The data field carries event-specific payload. Standard payload formats per class:
/// Memory health payload.
#[repr(C)]
pub struct MemoryHealthData {
/// Physical address of the error (0 = unknown).
pub phys_addr: u64,
/// DIMM identifier (SMBIOS handle or ACPI proximity).
pub dimm_id: u32,
/// Error type: 0 = correctable, 1 = uncorrectable.
pub error_type: u32,
/// ECC syndrome (for diagnosis).
pub syndrome: u64,
/// Cumulative correctable error count for this DIMM.
pub cumulative_ce_count: u64,
}
/// Storage health payload.
#[repr(C)]
pub struct StorageHealthData {
/// SMART attribute ID.
pub attribute_id: u8,
/// Current value.
pub current: u8,
/// Worst recorded value.
pub worst: u8,
/// Threshold for failure.
pub threshold: u8,
pub _pad_align: [u8; 4], // Explicit padding for u64 alignment (repr(C))
/// Raw attribute value (vendor-specific).
pub raw_value: u64,
/// Percentage life remaining (0-100, 0xFF = unknown).
pub life_remaining_pct: u8,
pub _pad: [u8; 7],
}
/// Network health payload.
#[repr(C)]
pub struct NetworkHealthData {
/// Interface index (matches the device registry's interface ID).
pub if_index: u32,
/// CRC error count since last report.
pub crc_errors: u32,
/// Link flap count since last report.
pub link_flaps: u32,
/// Packet drop count (RX + TX) since last report.
pub packet_drops: u32,
/// Current link speed in Mbps (0 = link down).
pub link_speed_mbps: u32,
/// Link state: 0 = down, 1 = up.
pub link_up: u8,
pub _pad: [u8; 3],
}
/// Thermal health payload.
#[repr(C)]
pub struct ThermalHealthData {
/// Current temperature in millidegrees Celsius (e.g., 72500 = 72.5 C).
pub temp_millicelsius: i32,
/// Thermal throttling threshold in millidegrees Celsius.
pub throttle_threshold_mc: i32,
/// Critical shutdown threshold in millidegrees Celsius.
pub critical_threshold_mc: i32,
/// Whether the device is currently throttled (0 = no, 1 = yes).
pub throttled: u8,
/// Thermal zone identifier (device-specific).
pub zone_id: u8,
pub _pad: [u8; 2],
}
/// PCIe health payload.
#[repr(C)]
pub struct PcieHealthData {
/// BDF (bus:device.function).
pub bdf: u32,
/// AER correctable error status register.
pub cor_status: u32,
/// AER uncorrectable error status register.
pub uncor_status: u32,
/// Current link speed (GT/s * 10, e.g., 80 = 8.0 GT/s).
pub link_speed: u16,
/// Current link width (x1, x4, x8, x16).
pub link_width: u8,
/// Link retraining count.
pub retrain_count: u8,
}
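Because these payloads cross the KABI boundary, their `repr(C)` layouts must stay stable. A minimal standalone sketch (reproducing two of the payloads above) shows how compile-time size assertions can pin the layouts so a field change fails the build instead of silently breaking the ABI:

```rust
use std::mem::{align_of, size_of};

/// Storage health payload (as defined above; repr(C) for a stable KABI layout).
#[repr(C)]
pub struct StorageHealthData {
    pub attribute_id: u8,
    pub current: u8,
    pub worst: u8,
    pub threshold: u8,
    pub _pad_align: [u8; 4], // Explicit padding so raw_value is 8-byte aligned.
    pub raw_value: u64,
    pub life_remaining_pct: u8,
    pub _pad: [u8; 7], // Trailing padding keeps the size a multiple of 8.
}

/// PCIe health payload (as defined above).
#[repr(C)]
pub struct PcieHealthData {
    pub bdf: u32,
    pub cor_status: u32,
    pub uncor_status: u32,
    pub link_speed: u16,
    pub link_width: u8,
    pub retrain_count: u8,
}

// Compile-time layout checks: a KABI-breaking field change fails the build.
const _: () = assert!(size_of::<StorageHealthData>() == 24);
const _: () = assert!(align_of::<StorageHealthData>() == 8);
const _: () = assert!(size_of::<PcieHealthData>() == 16);
const _: () = assert!(align_of::<PcieHealthData>() == 4);
```

If a field is reordered or resized, the `const` assertions fail at compile time rather than at the driver/kernel boundary at runtime.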
19.1.4 Telemetry Buffer
The FMA engine maintains a per-device circular buffer of health events:
// Kernel-internal
/// Lock-free multiple-producer single-consumer (MPSC) ring buffer.
///
/// `fma_emit()` is NMI-safe: both a normal interrupt handler and an NMI handler
/// can call it concurrently for the same device ring, creating two concurrent
/// producers. A simple SPSC design (plain store to write_idx) would have a
/// write-index race in this scenario. MPSC with compare-exchange on `write_idx`
/// eliminates the race: each producer atomically claims its slot before writing.
///
/// # Memory ordering
///
/// **Producer** (claim a slot, write data, publish):
/// 1. Claim slot: CAS loop on `write_idx` from `head` to `head + 1` (wrapping)
/// with `AcqRel`/`Relaxed` ordering. Only the winner proceeds; losers retry.
/// If `head - read_idx == N` (buffer full), return `Err(FmaError::RingFull)`.
/// 2. Write entry at `slots[head & (N-1)]` with a **plain store**. No ordering
/// is needed here: the slot is exclusively owned by this producer until step 3.
/// 3. Set `published[head & (N-1)]` flag with `Ordering::Release`. The consumer
/// polls this flag before reading the slot, so the Release pairs with the
/// consumer's Acquire load on the flag.
///
/// **Consumer** (single reader — e.g., FMA processing kthread):
/// 1. Load `read_idx` (consumer-owned; no atomic needed in single-consumer design,
/// but AtomicU64 is used for consistency and future flexibility).
/// 2. Spin-check `published[read_idx & (N-1)]` with `Ordering::Acquire` until set.
/// The consumer is a kthread (not interrupt context) so brief spinning is safe.
/// 3. Read `slots[read_idx & (N-1)]` with a plain load (visibility guaranteed by
/// the Acquire load of the published flag in step 2).
/// 4. Clear `published[read_idx & (N-1)]` with `Ordering::Relaxed` (slot now free).
/// 5. Advance `read_idx` with `Ordering::Relaxed` (only the consumer modifies it).
///
/// **Overflow check** (producer, step 1 above):
/// Load `read_idx` with `Ordering::Acquire` inside the CAS loop. A stale value
/// may cause the producer to see the buffer as fuller than it is — this is safe
/// (producer returns RingFull unnecessarily) but never causes a write past the end.
///
/// N must be a power of two. T must be Copy.
///
/// Storage is inline (no heap allocation), making it suitable for per-device
/// health telemetry structs that live in device state pages.
///
/// Invariants:
/// - `write_idx` is shared by all producers. Producers claim slots via CAS.
/// - `read_idx` is owned by the consumer. Only the consumer writes to it.
/// - `published[i]` is set by the producer after writing slot `i` and cleared
/// by the consumer after reading slot `i`.
/// - `write_idx - read_idx <= N` at all times.
/// - When the buffer is full, `push()` returns `Err(FmaError::RingFull)`.
/// - Both indices use wrapping arithmetic; masking by `N-1` gives slot index.
#[repr(C)]
pub struct CircularBuffer<T: Copy, const N: usize> {
/// Shared write index (wrapping u64). Producers claim slots via CAS on this field.
pub write_idx: AtomicU64,
_pad0: [u8; 56], // Separate write_idx from read_idx (cache line = 64 bytes).
/// Consumer read index (wrapping u64). Only the consumer advances this.
pub read_idx: AtomicU64,
_pad1: [u8; 56],
/// Per-slot publication flags. A producer sets `published[i]` with Release
/// after writing `slots[i]`; the consumer clears it with Relaxed after reading.
pub published: [AtomicBool; N],
/// Storage slots. Index = (idx & (N-1)).
pub slots: [UnsafeCell<MaybeUninit<T>>; N],
}
impl<T: Copy, const N: usize> CircularBuffer<T, N> {
/// Push an item. Returns `Err(FmaError::RingFull)` if the ring is full.
/// Safe to call from multiple concurrent producers including NMI context.
pub fn push(&self, item: T) -> Result<(), FmaError>;
/// Pop an item. Returns None if empty.
/// Must only be called from a single consumer context (kthread).
pub fn pop(&self) -> Option<T>;
/// Number of items currently in the buffer.
pub fn len(&self) -> usize;
/// True if no items are available.
pub fn is_empty(&self) -> bool;
}
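A userspace sketch may help make the claim/publish protocol concrete. This is an illustration, not the kernel implementation: `FmaError::RingFull` is stood in by `()`, and `pop()` returns `None` instead of briefly spinning on a claimed-but-unpublished slot:

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Simplified sketch of the MPSC ring described above.
pub struct CircularBuffer<T: Copy, const N: usize> {
    write_idx: AtomicU64,                   // Shared; producers claim slots via CAS.
    read_idx: AtomicU64,                    // Consumer-owned.
    published: [AtomicBool; N],             // Per-slot publication flags.
    slots: [UnsafeCell<MaybeUninit<T>>; N], // Inline storage.
}

// Safety: slot contents are handed off via the published flags (Release/Acquire).
unsafe impl<T: Copy + Send, const N: usize> Sync for CircularBuffer<T, N> {}

impl<T: Copy, const N: usize> CircularBuffer<T, N> {
    pub fn new() -> Self {
        assert!(N.is_power_of_two());
        Self {
            write_idx: AtomicU64::new(0),
            read_idx: AtomicU64::new(0),
            published: std::array::from_fn(|_| AtomicBool::new(false)),
            slots: std::array::from_fn(|_| UnsafeCell::new(MaybeUninit::uninit())),
        }
    }

    /// Producer side: claim a slot by CAS, write it, publish it.
    pub fn push(&self, item: T) -> Result<(), ()> {
        loop {
            let head = self.write_idx.load(Ordering::Relaxed);
            // Overflow check: a stale read_idx can only make the ring look
            // fuller than it is, never cause a write past the end.
            if head.wrapping_sub(self.read_idx.load(Ordering::Acquire)) >= N as u64 {
                return Err(()); // ring full
            }
            // Step 1: claim. Only the CAS winner owns slot `head`.
            if self
                .write_idx
                .compare_exchange_weak(head, head.wrapping_add(1),
                                       Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
            {
                let slot = head as usize & (N - 1);
                // Step 2: plain write; the slot is exclusively ours until step 3.
                unsafe { (*self.slots[slot].get()).write(item) };
                // Step 3: publish. Pairs with the consumer's Acquire load.
                self.published[slot].store(true, Ordering::Release);
                return Ok(());
            }
        }
    }

    /// Consumer side (single consumer only).
    pub fn pop(&self) -> Option<T> {
        let tail = self.read_idx.load(Ordering::Relaxed);
        let slot = tail as usize & (N - 1);
        // Acquire pairs with the producer's Release in push() step 3.
        if !self.published[slot].load(Ordering::Acquire) {
            return None; // empty, or slot claimed but not yet published
        }
        let item = unsafe { (*self.slots[slot].get()).assume_init() };
        self.published[slot].store(false, Ordering::Relaxed); // slot now free
        self.read_idx.store(tail.wrapping_add(1), Ordering::Release);
        Some(item)
    }
}
```

Note how the full check compares wrapping indices directly (`write_idx - read_idx >= N`), so no modular arithmetic on the indices themselves is needed; masking by `N - 1` happens only when locating a slot.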
pub struct DeviceHealthLog {
/// Device node this log belongs to.
device_id: DeviceNodeId,
/// Circular buffer of recent events (fixed size per device).
events: CircularBuffer<HealthEvent, 256>,
/// Counters by event class (for fast threshold checks).
/// Size is derived from the enum variant count, not hardcoded.
class_counts: [AtomicU64; HealthEventClass::COUNT],
/// Timestamp of first event in current window (for rate detection).
window_start_ns: u64,
/// NUMA node (allocate buffer on device's NUMA node).
numa_node: i32,
/// Maximum events per second per (device, event_class) pair.
/// Events exceeding this rate are counted but not stored in the
/// circular buffer. Default: 100.
rate_limit_per_sec: u32,
/// Number of events suppressed by rate limiting since last reset.
/// When non-zero, a single "rate_limited" meta-event is recorded
/// in the circular buffer with the suppressed count.
suppressed_count: AtomicU64,
}
#[repr(C)]
pub struct HealthEvent {
pub timestamp_ns: u64,
pub class: HealthEventClass,
pub code: u32,
pub severity: HealthSeverity,
pub data: [u8; 64], // Inline payload (avoids allocation)
pub data_len: u32,
}
Backpressure design (prevents event storms from overwhelming the telemetry path):
The ingestion path uses a two-level design:
- Fast path: Atomic counter increment per (device, event_class) pair. Zero
allocation. Every event is counted in class_counts regardless of rate.
- Slow path: Detailed event stored in the circular buffer only if the rate is
below rate_limit_per_sec (default: 100 events/second per class). Above the
threshold, a single "rate_limited" meta-event is recorded with the
suppressed_count value, then the counter resets. This ensures the buffer
contains representative events without being flooded during error storms.
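The two-level path can be sketched as a standalone model. This is an illustration under simplifying assumptions: the real path hangs this state off `DeviceHealthLog`, whereas here the window is a 1-second bucket keyed by a caller-supplied timestamp:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const RATE_LIMIT_PER_SEC: u64 = 100; // Default per-(device, class) storage budget.

/// Per-(device, class) ingestion state for the two-level design.
pub struct IngestState {
    class_count: AtomicU64,      // Fast path: every event is counted here.
    window_start_s: AtomicU64,   // Start of the current 1-second window.
    stored_in_window: AtomicU64, // Detailed events stored this window.
    suppressed: AtomicU64,       // Events suppressed since the last meta-event.
}

/// What the slow path should do with the incoming event.
#[derive(Debug, PartialEq)]
pub enum Ingest {
    Store,                     // Store the detailed event in the circular buffer.
    StoreRateLimitedMeta(u64), // Store one "rate_limited" meta-event (suppressed count).
    Suppress,                  // Count only; do not store.
}

impl IngestState {
    pub const fn new() -> Self {
        Self {
            class_count: AtomicU64::new(0),
            window_start_s: AtomicU64::new(0),
            stored_in_window: AtomicU64::new(0),
            suppressed: AtomicU64::new(0),
        }
    }
}

pub fn ingest(state: &IngestState, now_s: u64) -> Ingest {
    // Fast path: unconditional atomic increment, zero allocation.
    state.class_count.fetch_add(1, Ordering::Relaxed);

    // Window rollover: reset the budget and flush any suppressed count as a
    // single meta-event, so the buffer records that a storm occurred.
    if state.window_start_s.swap(now_s, Ordering::Relaxed) != now_s {
        state.stored_in_window.store(0, Ordering::Relaxed);
        let suppressed = state.suppressed.swap(0, Ordering::Relaxed);
        if suppressed > 0 {
            return Ingest::StoreRateLimitedMeta(suppressed);
        }
    }

    // Slow path: store detailed events only while under the per-window budget.
    if state.stored_in_window.fetch_add(1, Ordering::Relaxed) < RATE_LIMIT_PER_SEC {
        Ingest::Store
    } else {
        state.suppressed.fetch_add(1, Ordering::Relaxed);
        Ingest::Suppress
    }
}
```

During an error storm the buffer thus holds the first 100 detailed events per second per class, plus one meta-event per window summarizing what was dropped, while `class_count` stays exact.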
19.1.5 Diagnosis Engine
The diagnosis engine is a rule-based evaluator that runs when new telemetry arrives.
// Kernel-internal
pub struct DiagnosisRule {
/// Human-readable name (for logging).
name: ArrayString<64>,
/// Match criteria.
event_class: HealthEventClass,
min_severity: HealthSeverity,
/// Threshold: number of matching events within time window.
count_threshold: u32,
window_seconds: u32,
/// Rate: events per second sustained over window.
rate_threshold: Option<u32>,
/// Value-based threshold: fires when a specific field in the health event payload
/// falls below (or exceeds) a specified level, independent of event counts.
///
/// For example, NVMe wear rules ("life < 10%") use:
/// field_id = DiagField::LifeRemainingPct, threshold = 10
/// The diagnosis engine reads the field identified by `field_id` from the health
/// event payload via the `DiagFieldAccessor` trait and fires the rule when the
/// field's value drops below `threshold`.
/// When `None`, only `count_threshold` / `rate_threshold` are evaluated.
value_threshold: Option<ValueThreshold>,
/// Correlation: require events from multiple related devices.
correlation: Option<CorrelationRule>,
/// Action to take when rule fires.
action: DiagnosisAction,
}
/// Enum-based field identifiers for scalar values in health event payloads.
/// Used instead of string-based field names because Rust has no runtime
/// reflection -- struct fields don't carry name metadata at runtime. Each
/// health event payload type implements `DiagFieldAccessor` to map these
/// enum variants to the actual struct field values.
#[repr(u16)]
pub enum DiagField {
/// NVMe endurance: remaining drive life as a percentage (0-100).
LifeRemainingPct = 0,
/// Current temperature in millidegrees Celsius.
Temperature = 1,
/// Cumulative correctable error count.
CorrectableErrors = 2,
/// Cumulative uncorrectable error count.
UncorrectableErrors = 3,
/// Available spare capacity as a percentage (NVMe).
AvailableSparePct = 4,
/// Power-on hours.
PowerOnHours = 5,
/// Media errors (NVMe).
MediaErrors = 6,
/// Memory correctable error count (DIMM health).
MemoryCorrectableErrors = 7,
}
/// Trait implemented by each health event payload type (e.g., `StorageHealthData`,
/// `MemoryHealthData`). Maps `DiagField` variants to the corresponding scalar
/// value in the payload struct. Returns `None` if the field is not applicable
/// to this payload type (e.g., `LifeRemainingPct` on a DIMM health event).
pub trait DiagFieldAccessor {
/// Extract the scalar value for the given field, or `None` if not applicable.
fn field_value(&self, field: DiagField) -> Option<u64>;
}
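As an illustration of how a rule like "NVMe life < 10%" reads its field, here is a hypothetical impl of the trait for a simplified storage payload. The enum and trait are repeated in reduced form so the sketch stands alone; field meanings follow Section 19.1.3:

```rust
/// Field identifiers (reduced subset of the DiagField enum above).
#[derive(Clone, Copy, PartialEq)]
#[repr(u16)]
pub enum DiagField {
    LifeRemainingPct = 0,
    Temperature = 1,
}

pub trait DiagFieldAccessor {
    /// Extract the scalar value for the given field, or None if not applicable.
    fn field_value(&self, field: DiagField) -> Option<u64>;
}

/// Storage health payload, simplified to the fields this sketch needs.
pub struct StorageHealthData {
    pub life_remaining_pct: u8, // 0-100, 0xFF = unknown
    pub raw_value: u64,
}

impl DiagFieldAccessor for StorageHealthData {
    fn field_value(&self, field: DiagField) -> Option<u64> {
        match field {
            // 0xFF means the drive does not report life remaining.
            DiagField::LifeRemainingPct if self.life_remaining_pct != 0xFF => {
                Some(self.life_remaining_pct as u64)
            }
            // Temperature is not part of the storage payload: not applicable.
            _ => None,
        }
    }
}

/// Value-threshold check: fires when the field is strictly below the threshold.
pub fn value_rule_fires(p: &impl DiagFieldAccessor, field: DiagField, threshold: u32) -> bool {
    matches!(p.field_value(field), Some(v) if v < threshold as u64)
}
```

A non-applicable field returns `None` and therefore never fires, which is what lets one generic evaluator walk every payload type without runtime reflection.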
/// Threshold applied to a specific scalar field in the health event payload.
/// Used for percentage- or gauge-based rules (e.g., NVMe wear life).
#[repr(C)]
pub struct ValueThreshold {
/// Identifies which field in the health event payload to compare.
/// E.g., `DiagField::LifeRemainingPct` for NVMe endurance data.
field_id: DiagField,
/// Numeric trigger level. The rule fires when the field's value is
/// strictly less than this value (e.g., `10` means "< 10%").
threshold: u32,
}
pub enum DiagnosisAction {
/// Log a message and emit uevent. No automatic correction.
Alert { message: ArrayString<128> },
/// Retire specific physical pages (memory errors).
RetirePages,
/// Demote the device's driver to a lower tier.
DemoteTier,
/// Disable the device entirely.
DisableDevice,
/// Mark device as degraded (informational, for admin).
MarkDegraded,
/// Trigger live evolution (Section 12.6) to proactively replace a degrading component.
TriggerEvolution { target_component: ArrayString<64> },
}
/// Used by the diagnosis engine (Section 19.1.5) to detect correlated failures
/// across multiple related devices. When a DiagnosisRule includes a CorrelationRule,
/// the engine evaluates the threshold across all devices sharing the specified
/// property (e.g., same memory controller, same PCIe root complex, same NUMA node).
/// The event correlation engine (Section 19.4, Health namespace) uses these rules to
/// populate cross-device fault reports visible under /Health/ByDevice/.
pub struct CorrelationRule {
/// Require events from N distinct devices sharing this property.
/// Example: "memory_controller" — multiple DIMMs on the same controller,
/// "pcie_root" — multiple devices on the same PCIe root complex.
shared_property: ArrayString<32>,
min_devices: u32,
}
Default rules (built-in, administrator can override via /sys):
| Rule | Class | Threshold | Window | Action |
|---|---|---|---|---|
| DIMM degradation | Memory | 100 CE | 1 hour | Alert + RetirePages |
| DIMM failure | Memory | 1 UE | instant | DisableDevice + Alert |
| NVMe wear out | Storage | life < 10% | — | Alert |
| NVMe critical wear | Storage | life < 3% | — | MarkDegraded + Alert |
| PCIe link unstable | PCIe | 10 retrains | 1 minute | Alert |
| PCIe link failing | PCIe | 50 retrains | 1 minute | DemoteTier + Alert |
| NIC error storm | Network | 1000 CRC errors | 1 minute | Alert |
| Thermal throttling | Thermal | 5 events | 10 minutes | Alert |
| PCIe proactive swap | PCIe | 30 retrains | 10 minutes | TriggerEvolution + Alert |
| NVMe proactive swap | Storage | life < 5% | — | TriggerEvolution + Alert |
Evaluation algorithm: When a new health event arrives, the diagnosis engine evaluates rules synchronously on the health telemetry workqueue (not in interrupt context). Evaluation proceeds as follows:
1. Filter: Select rules where event_class matches the incoming event and
   min_severity <= event.severity. This reduces the candidate set from all
   registered rules (~20-50 default + admin-added) to typically 1-5 rules.
   Filtering is O(n) in the total rule count; with <100 rules, this is
   sub-microsecond.
2. Evaluate thresholds (for each candidate rule):
   - Count threshold: Increment the per-rule event counter (stored in a
     per-device RuleState array). If the counter exceeds count_threshold within
     window_seconds, the rule fires. Stale events are expired lazily using a
     sliding window (circular buffer of timestamps).
   - Rate threshold (if set): Compute the event rate over the window. Fire if
     the rate exceeds rate_threshold events/sec.
   - Value threshold (if set): Read the field via DiagFieldAccessor and compare
     against the threshold. Fire immediately on breach.
   - Correlation check (if correlation is set): Query the per-device counters
     for all devices sharing shared_property. Fire only if min_devices distinct
     devices have independently breached their thresholds within the window.
3. Execute actions: For each fired rule, execute the DiagnosisAction. If
   multiple rules fire on the same event, all actions execute (no mutual
   exclusion). Actions that modify device state (DemoteTier, DisableDevice)
   are idempotent.
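The count-threshold step (a per-rule sliding window of timestamps with lazy expiry) can be sketched as a standalone model; the kernel keeps this state in the per-device RuleState array:

```rust
use std::collections::VecDeque;

/// Per-(device, rule) count-threshold state: a sliding window of event
/// timestamps, expired lazily when a new event arrives.
pub struct RuleState {
    timestamps_ns: VecDeque<u64>,
    count_threshold: usize,
    window_ns: u64,
}

impl RuleState {
    pub fn new(count_threshold: u32, window_seconds: u32) -> Self {
        Self {
            timestamps_ns: VecDeque::new(),
            count_threshold: count_threshold as usize,
            window_ns: window_seconds as u64 * 1_000_000_000,
        }
    }

    /// Record one matching event; returns true when the rule fires.
    /// Firing at exactly `count_threshold` events matches the default-rule
    /// table (e.g., "100 CE in 1 hour" fires on the 100th event).
    pub fn on_event(&mut self, now_ns: u64) -> bool {
        // Lazy expiry: drop timestamps that have fallen out of the window.
        while let Some(&oldest) = self.timestamps_ns.front() {
            if now_ns.saturating_sub(oldest) > self.window_ns {
                self.timestamps_ns.pop_front();
            } else {
                break;
            }
        }
        self.timestamps_ns.push_back(now_ns);
        self.timestamps_ns.len() >= self.count_threshold
    }
}
```

Lazy expiry keeps the hot path O(expired events) with no timer: old timestamps are only discarded when the next event for the same rule arrives.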
FMA Diagnosis Rule Specification
Rule Language — a small declarative DSL compiled to a rule-table at module registration time:
// Syntax: each rule is a WHEN...THEN expression.
// Registers the rule under the given rule_id at the given confidence (0-100).
RULE <rule_id> [confidence=N]:
WHEN
event(<source>) [.field <op> <value>] // match condition
[AND event(...) ...] // multiple conditions (all must fire within window_ms)
[WITHIN <window_ms> ms] // temporal window (default: no window = single event)
[COUNT >= <n>] // rate threshold: N events in window
THEN
<action>(<args>)
// Field operators: ==, !=, >, >=, <, <=, CONTAINS, MATCHES (regex, max 64 chars)
// Actions: suspect(<fru>), repair_action(<verb>), retire(<fru>), alert(<severity>, <msg>)
// Severity levels: INFO, WARNING, FAULT, CRITICAL
// FRU: device_id string, e.g., "nvme0", "cpu3", "mem_channel:0"
Example rules:
RULE nvme_timeout_suspect [confidence=80]:
WHEN event(nvme_io_timeout) COUNT >= 3 WITHIN 60000 ms
THEN suspect("nvme0")
RULE nvme_confirmed_fault [confidence=95]:
WHEN event(nvme_hw_error) AND event(nvme_timeout)
WITHIN 5000 ms
THEN repair_action("offline_drive"), alert(FAULT, "NVMe drive fault detected")
RULE mem_corrected_error [confidence=60]:
WHEN event(edac_ce) COUNT >= 100 WITHIN 3600000 ms
THEN suspect("mem_channel:0"), alert(WARNING, "High CE rate on DIMM")
RULE mem_uncorrected_fault [confidence=99]:
WHEN event(edac_ue)
THEN retire("mem_page:{page_addr}"), alert(CRITICAL, "UE: retiring page")
FMA Rule DSL — Formal Grammar
The FMA rule language uses a WHEN/THEN syntax for expressing diagnosis rules. The following BNF grammar is the authoritative specification; the rule compiler accepts exactly this grammar (no extensions, no shortcuts). The compact examples in the previous subsection are informal illustrations of the same WHEN/THEN style.
rule ::= "RULE" rule_id ":" rule_body
rule_id ::= identifier
rule_body ::= "WHEN" condition_expr "THEN" action_list
["PRIORITY" priority_level]
["COOLDOWN" duration]
condition_expr ::= condition_term (("AND" | "OR") condition_term)*
| "NOT" condition_expr
| "(" condition_expr ")"
condition_term ::=
event_match
| metric_comparison
| state_predicate
| history_match
event_match ::= "EVENT" "(" event_type ["," attribute_filter] ")"
event_type ::= identifier -- e.g., UncorrectedEcc, NvmeCrcError, TierCrash
attribute_filter ::= attribute_name "=" literal_value
| attribute_filter "," attribute_filter
attribute_name ::= identifier "." identifier -- e.g., device.pci_addr
metric_comparison ::=
metric_path comparison_op numeric_literal [unit_suffix]
| metric_path "IN" "[" numeric_literal ".." numeric_literal "]"
metric_path ::= identifier ("." identifier)* -- e.g., cpu.temp_celsius
comparison_op ::= ">" | ">=" | "<" | "<=" | "==" | "!="
unit_suffix ::= "%" | "ms" | "us" | "ns" | "MB" | "GB" | "C"
state_predicate ::=
"STATE" "(" device_ref ")" "==" device_state
| "HEALTHY" "(" device_ref ")"
| "DEGRADED" "(" device_ref ")"
| "FAILED" "(" device_ref ")"
device_ref ::= identifier | "SELF" | "PEER" "(" identifier ")"
device_state ::= "Healthy" | "Degraded" | "Failed" | "Missing" | "Suspect"
history_match ::=
"HISTORY" "(" event_type "," count_expr "," duration ")"
count_expr ::= numeric_literal ("+" | "-" | "*") numeric_literal
| numeric_literal
action_list ::= action ("," action)*
action ::=
"ALERT" "(" severity "," string_literal ")"
| "ISOLATE" "(" device_ref ")"
| "RELOAD" "(" device_ref ")"
| "THROTTLE" "(" device_ref "," numeric_literal unit_suffix ")"
| "REPLACE_PREFERRED" "(" device_ref ")"
| "FMA_TICKET" "(" severity "," string_literal ")"
| "CALL" "(" handler_name ["," argument_list] ")"
severity ::= "INFO" | "WARNING" | "CRITICAL" | "FATAL"
priority_level ::= numeric_literal -- 0 (lowest) to 255 (highest)
duration ::= numeric_literal ("s" | "ms" | "m" | "h")
literal_value ::= numeric_literal | string_literal | boolean_literal
numeric_literal ::= ["-"] digit+ ["." digit+]
string_literal ::= '"' character* '"'
boolean_literal ::= "true" | "false"
identifier ::= [a-zA-Z_] [a-zA-Z0-9_]*
digit ::= [0-9]
Extended example rules (illustrating grammar usage):
RULE ecc_threshold_exceeded:
WHEN EVENT(UncorrectedEcc, device.pci_addr = "0000:01:00.0")
AND HISTORY(UncorrectedEcc, 3, 1h)
THEN ALERT(CRITICAL, "ECC errors exceed threshold; memory DIMM likely failing"),
FMA_TICKET(CRITICAL, "Schedule DIMM replacement on host {}"),
REPLACE_PREFERRED(SELF)
PRIORITY 200
COOLDOWN 24h
RULE nvme_crc_degraded:
WHEN EVENT(NvmeCrcError) AND io.error_rate > 5%
THEN ALERT(WARNING, "NVMe CRC error rate elevated"),
THROTTLE(SELF, 50%)
PRIORITY 100
COOLDOWN 5m
Rule evaluation: Rules are stored in an RcuCell<BTreeMap<RulePriority, Vec<CompiledRule>>>
and iterated in descending priority order. Higher-priority rules are checked first; evaluation
stops after the first matching rule fires (unless the rule sets a CONTINUE flag).
Cooldown is enforced per rule per device via a BTreeMap<(RuleId, DeviceId), Instant>.
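A sketch of that cooldown check, using u64 nanosecond timestamps in place of Instant so the example is deterministic:

```rust
use std::collections::BTreeMap;

type RuleId = u32;
type DeviceId = u64;

/// Cooldown table: last firing time per (rule, device), as described above.
pub struct CooldownTable {
    last_fired_ns: BTreeMap<(RuleId, DeviceId), u64>,
}

impl CooldownTable {
    pub fn new() -> Self {
        Self { last_fired_ns: BTreeMap::new() }
    }

    /// Returns true (and records the firing) if the rule may fire now;
    /// false if it is still cooling down for this device.
    pub fn try_fire(&mut self, rule: RuleId, dev: DeviceId,
                    now_ns: u64, cooldown_ns: u64) -> bool {
        match self.last_fired_ns.get(&(rule, dev)) {
            Some(&last) if now_ns.saturating_sub(last) < cooldown_ns => false,
            _ => {
                self.last_fired_ns.insert((rule, dev), now_ns);
                true
            }
        }
    }
}
```

Keying by (rule, device) means one noisy device in cooldown does not mask the same rule firing on a different device.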
Registration API:
/// Register a FMA diagnosis rule module.
pub fn fma_rule_register(module: &FmaRuleModule) -> Result<FmaRuleHandle, FmaError>;
pub struct FmaRuleModule {
/// Module name (unique; max 64 chars).
pub name: &'static str,
/// Rule text (DSL source; compiled at registration time).
pub rules: &'static str,
/// Maximum 256 rules per module.
pub max_rules: u32,
}
Matching algorithm: Rules are compiled to a flat table (FmaRuleTable). On each event receipt:
1. Filter rules by source (hash map, O(1) per event).
2. For single-event rules: check field conditions; if all pass, trigger action.
3. For multi-event rules: insert event into a per-rule sliding window (ring buffer, size=COUNT threshold). If all conditions satisfied within the temporal window, trigger action.
4. Rule evaluation is lock-free (RCU-protected rule table; per-CPU event queues drained by the FMA processing thread).
Action vocabulary:
- suspect(fru): Mark the named FRU as suspected in the fault case.
- retire(fru): Offline and retire the FRU (persistent, survives reboot).
- repair_action(verb): Execute a named repair action registered by the driver (e.g., "offline_drive", "reset_link", "throttle_frequency").
- alert(severity, msg): Emit a FMA fault event to umkafs /Health/faults/.
- diagnose(rule_id): Chain to another rule for further diagnosis.
19.1.6 Response Executor
Actions integrate with existing kernel subsystems:
RetirePages: The memory manager (Section 4.1) is asked to remove
specific physical pages from the buddy allocator. Any process mapping those pages is
transparently migrated to a replacement page (copy-on-read). This is how Linux handles
hardware-poisoned pages (memory_failure()), but triggered proactively.
DemoteTier: The device registry (Section 10.5) transitions the device's driver to a lower isolation tier. This uses the existing crash-recovery reload mechanism but without a crash — clean stop, change tier, restart.
DisableDevice: The registry transitions the device to Error state. The driver is
stopped. The device node remains in the tree (for introspection) but accepts no I/O.
TriggerEvolution: The response executor invokes the Live Evolution framework (Section 12.6) to proactively hot-swap a degrading component before it fails. Example: FMA detects a pattern of increasing PCIe correctable errors on a bus serving a Tier 1 NIC driver. The diagnosis engine fires, and instead of waiting for a hard failure, the response executor triggers Section 12.6.3's component replacement flow to swap the NIC driver to a degraded-mode variant (e.g., conservative I/O scheduler, reduced-bandwidth path). The replacement follows the same state serialization and quiescence protocol as any live evolution (Section 12.6.3), but is initiated automatically by FMA rather than by an administrator. This closes the gap between reactive fault handling and proactive system evolution: the kernel treats impending hardware failure as a trigger for self-repair rather than merely an alert.
FMA Action Execution Order:
When FMA triggers on a fault event, registered actions execute in priority order (not registration order). Higher-priority actions run first and can abort the chain on failure.
Priority levels (highest to lowest):
| Priority | Action Type | Failure behavior | Description |
|---|---|---|---|
| 1 | DisableDevice | Abort chain | Remove device from service immediately. Non-optional — isolation is the safety guarantee. If isolation fails, further actions are meaningless (device state unknown). |
| 2 | DemoteTier | Abort if non-best_effort | Reduce device bandwidth/request rate by moving it to a lower isolation tier. May be marked best_effort for degraded-but-operational scenarios. |
| 3 | Alert / MarkDegraded / TriggerEvolution | Continue (always best_effort) | Send notification to registered listeners (/dev/oom, syslog, SNMP trap) or initiate proactive replacement. Failure is logged, not fatal. |
| 4 | RetirePages / FMA ring write | Continue (always best_effort) | Append to FMA event ring and persistent log, retire affected pages. Failure increments fma_dropped_events counter. |
Execution rules:
- Actions execute in a dedicated kworker thread (not in interrupt context). The
  interrupt handler only enqueues the FaultEvent — all action execution is
  deferred to process context.
- Timeout per action: 5 seconds. An action that does not complete in 5 seconds
  is treated as a failure (same as returning an error).
- A non-best_effort action failure aborts the chain at that point — lower-priority
  actions do not run.
- Multiple actions of the same type (e.g., two Alert actions) execute in
  registration order within the same priority level.
- After DisableDevice, the device is removed from the I/O path before DemoteTier
  runs. This ordering prevents DemoteTier from being applied to an
  already-disabled device (which would be a no-op or error).
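The priority-and-abort semantics can be modeled in a few lines. This is a hypothetical standalone sketch: real actions are kernel operations subject to the 5-second timeout, stubbed here as plain functions:

```rust
/// Outcome of running one action.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum ActionResult { Ok, Failed }

/// One entry in a fault's action chain.
pub struct ChainedAction {
    pub priority: u8,             // 1 (highest) .. 4 (lowest), as in the table above.
    pub best_effort: bool,        // best_effort failures never abort the chain.
    pub run: fn() -> ActionResult,
}

/// Execute actions in ascending priority number (1 = highest priority first).
/// Returns the number of actions that ran before the chain completed or aborted.
pub fn run_chain(actions: &mut [ChainedAction]) -> usize {
    // Stable sort keeps registration order within the same priority level.
    actions.sort_by_key(|a| a.priority);
    let mut ran = 0;
    for a in actions.iter() {
        let result = (a.run)(); // Real executor also applies the 5-second timeout.
        ran += 1;
        if result == ActionResult::Failed && !a.best_effort {
            break; // Non-best_effort failure aborts the chain here.
        }
    }
    ran
}
```

The stable sort is the key design point: priority decides ordering across levels, while registration order is preserved within a level, matching the two rules above.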
Design note: UmkaOS's priority-ordered action chain goes beyond Linux EDAC, which primarily counts and logs errors. UmkaOS actively isolates faulty hardware before it can cause data corruption downstream, implementing a "fail-safe first" principle.
19.1.7 Linux Interface Exposure
Entirely through standard mechanisms:
sysfs (per-device, under the device registry's sysfs tree):
/sys/devices/.../health/
status # "ok", "warning", "degraded", "critical"
events_total # Total health events received
ce_count # Correctable error count (memory)
ue_count # Uncorrectable error count (memory)
life_remaining # Percentage (storage)
link_retrains # Count (PCIe)
temperature # Current (thermal)
procfs:
/proc/umka/fma/
rules # Current diagnosis rules (read/write for admin)
events # Recent events across all devices (ring buffer dump)
retired_pages # List of retired physical pages
statistics # Aggregate counters
uevent: Standard hotplug mechanism for pushing notifications to userspace.
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:1f.2
SUBSYSTEM=pci
UMKA_HEALTH=degraded
UMKA_HEALTH_REASON=pcie_link_unstable
dmesg/printk: Standard kernel log for all alerts.
19.1.8 FaultEvent — Structured FMA Event Type
FaultEvent is the typed event enum passed to fma_emit(). Each variant carries
structured fields so consumers (FMA rules engine, userspace, tracepoints) can react
without parsing text messages.
/// Typed FMA event — passed to `fma_emit()` to record a hardware fault.
/// Stored in the per-device FMA ring (Section 19.1.4, 256-entry lock-free MPSC).
pub enum FaultEvent {
/// ECC corrected (single-bit) memory error.
MemoryCe {
mci_idx: u32, // Memory controller index
csrow: u32, // CSROW (chip-select row)
channel: u32, // DRAM channel
page: u64, // Physical page number
offset: u64, // Offset within page
syndrome: u32, // ECC syndrome (if available, else 0)
},
/// ECC uncorrectable (multi-bit) memory error.
MemoryUe {
mci_idx: u32,
csrow: u32,
channel: u32,
page: u64,
/// True if the error is uncorrectable and the affected memory range must be
/// retired. Fatal UEs trigger device offline and optionally kernel panic
/// per the FMA escalation policy (Section 19.1.9).
fatal: bool,
/// Physical byte offset within page where UE occurred.
/// Set to `u64::MAX` if the hardware does not report a sub-page address.
offset: u64,
},
/// PCIe AER correctable error.
PcieAerCe {
bus_dev_fn: u32, // PCI BDF encoding
status: u32, // AER correctable status bits
},
/// PCIe AER uncorrectable error.
PcieAerUe {
bus_dev_fn: u32,
status: u32,
severity: u32, // Fatal vs. non-fatal
},
/// Storage SMART health threshold crossed.
StorageSmart {
device_id: u64, // FMA device handle
attr_id: u8, // SMART attribute number
value: u8, // Normalized value
threshold: u8, // Failure threshold
},
/// Thermal throttling event.
Thermal {
cpu_or_device: u64, // FMA device handle
temp_milli_c: i32, // Temperature in milli-°C
throttle_pct: u8, // Throttle percentage applied
},
/// Generic driver-defined fault event.
Generic {
device_id: u64,
event_code: u32,
payload: [u8; 16],
},
}
/// Emit a fault event to the per-device FMA ring.
/// Lock-free, NMI-safe (ring buffer + atomic tail increment).
/// May be called from interrupt or NMI context.
pub fn fma_emit(event: FaultEvent);
Existing tools that benefit without modification:
- rasdaemon — uses kernel tracepoints (specifically the RAS tracepoints in
/sys/kernel/debug/tracing/events/ras/) to collect hardware error events and stores
them in a SQLite database; UmkaOS's sysfs health attributes provide an additional data
source that rasdaemon can be extended to consume
- Prometheus node_exporter — can scrape sysfs files
- smartctl — storage health still works via the standard interface
- systemd — can react to uevents
19.1.9 Health Score and Escalation Policy
The diagnosis engine (Section 19.1.5) detects fault patterns and fires actions (Section 19.1.6), but those actions are binary: either a specific rule fires or it does not. A complementary health score model provides a continuous signal that drives graduated escalation — throttling before draining, draining before offlining — and makes escalation decisions auditable and predictable.
19.1.9.1 Health Score Model
Each device maintains a rolling health score H ∈ [0, 100]. Score 100 = fully healthy,
0 = critically unhealthy. The score is stored in DeviceHealthLog as an AtomicU32
(scaled: stored value = H * 100, giving two decimal places of precision without
floating-point in the kernel).
The score decays toward 100 over time (natural recovery) at the rate specified per severity. Each error event immediately decreases the score based on its severity:
| Severity | Score Penalty | Recovery Rate | Notes |
|---|---|---|---|
| Informational | 0 | — | No impact; for trending only |
| Corrected | 1 | +1 per minute | ECC single-bit corrections, PCIe correctable |
| Warning | 5 | +2 per minute | Threshold approaching, single anomaly |
| Degraded | 15 | +1 per minute | Partial failure, corrective action in progress |
| Critical | 40 | +0.5 per minute | Imminent failure |
| Fatal | 100 | No recovery | Unrecoverable; score is clamped to 0 |
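The penalty-and-clamp arithmetic can be sketched with the scaled `AtomicU32` described above. This is a minimal userspace sketch, not the UmkaOS implementation; the `HealthScore` type and its method names are hypothetical:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Fixed-point score: stored value = H * 100, so two decimal places
// of precision without floating-point arithmetic.
const SCALE: u32 = 100;
const MAX: u32 = 100 * SCALE; // H == 100.00

pub struct HealthScore(AtomicU32);

impl HealthScore {
    pub fn new() -> Self {
        HealthScore(AtomicU32::new(MAX))
    }

    /// Apply a severity penalty in whole points; saturates at 0
    /// (a Fatal event's penalty of 100 clamps the score here).
    pub fn apply_penalty(&self, points: u32) {
        let p = points.saturating_mul(SCALE);
        let mut cur = self.0.load(Ordering::Relaxed);
        // CAS loop instead of fetch_sub, which would wrap below zero.
        while let Err(seen) = self.0.compare_exchange_weak(
            cur,
            cur.saturating_sub(p),
            Ordering::Relaxed,
            Ordering::Relaxed,
        ) {
            cur = seen;
        }
    }

    /// Maintenance-timer recovery; `rate_scaled` is hundredths of a point
    /// per tick (e.g. +0.5/minute == 50). Clamps at 100.
    pub fn recover(&self, rate_scaled: u32) {
        let mut cur = self.0.load(Ordering::Relaxed);
        while let Err(seen) = self.0.compare_exchange_weak(
            cur,
            (cur + rate_scaled).min(MAX),
            Ordering::Relaxed,
            Ordering::Relaxed,
        ) {
            cur = seen;
        }
    }

    /// Whole-point score H in [0, 100].
    pub fn points(&self) -> u32 {
        self.0.load(Ordering::Relaxed) / SCALE
    }
}
```

The CAS loops matter: a plain `fetch_sub` would wrap below zero on a large penalty, and `fetch_add` could overshoot the 100-point ceiling.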
The existing HealthSeverity enum (Section 19.1.3) covers Info, Warning,
Degraded, and Critical. The health score model additionally uses Corrected (between
Info and Warning) and Fatal (above Critical) — both are defined in Section 19.1.3
and referenced here for their penalty assignments in the score table above. Corrected
covers hardware events that were automatically corrected at the hardware level (ECC
single-bit fixes, PCIe correctable errors) — worth tracking for trend analysis, but
imposing minimal penalty. Fatal covers non-recoverable hardware errors (ECC
uncorrectable, PCIe fatal AER, storage uncorrectable media error) and immediately
drives H to 0.
Recovery: Score recovery is applied by the FMA maintenance timer (runs every 60
seconds). The timer adds the per-severity recovery rate to each active device's score,
clamping at 100. A device that has received no new error events for long enough returns
to H == 100 automatically. A device with Fatal events never recovers — the score
stays at 0 until the device is explicitly re-enabled by an administrator (which also
resets the score).
Recent-error weighting: Errors that arrived within the last 60 seconds are weighted
3× when computing the effective score for threshold comparisons. The raw stored score
reflects all historical events; the effective score used for escalation decisions
amplifies recent bursts. The weighting is computed on-the-fly when an escalation check
runs — the stored AtomicU32 always holds the true current score without amplification.
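The 3× weighting can be expressed compactly. A hedged sketch, assuming penalty points from the last 60 seconds are tracked separately; `effective_score` is a hypothetical helper name, and all values are whole points of H:

```rust
/// Hypothetical effective-score computation for escalation checks.
/// `raw` already includes every penalty once; errors newer than 60 s are
/// weighted 3x, so their penalty is charged twice more on top of `raw`.
fn effective_score(raw: u32, recent_penalty_points: u32) -> u32 {
    raw.saturating_sub(2 * recent_penalty_points)
}
```

For example, a raw score of 85 with 5 recent penalty points yields an effective score of 75 for threshold comparisons, while the stored score remains 85.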
19.1.9.2 Escalation Thresholds
```rust
/// FMA escalation level — determines what corrective action the FMA engine
/// takes based on the device's current health score.
///
/// Escalation is evaluated after every health event and also by the FMA
/// maintenance timer. When the effective score (with recent-error weighting)
/// crosses a threshold, the corresponding escalation action is triggered.
/// Actions are idempotent: triggering `Drain` on an already-draining device
/// is a no-op.
pub enum FmaEscalation {
    /// Effective H >= 70: device is healthy. No throttling.
    Healthy,
    /// Effective H in [40, 70): emit throttle advisory.
    /// Reduces the device's request rate to `target_iops_fraction` of its
    /// configured maximum. The I/O scheduler honors this via the device's
    /// CBS bandwidth server ([Section 6.3](06-scheduling.md#63-cpu-bandwidth-guarantees)).
    /// The fraction is proportional to how far H has fallen below 70:
    ///   fraction = 0.5 + 0.5 * (H - 40) / 30   (linear interpolation)
    /// At H == 70: fraction = 1.0 (no throttling).
    /// At H == 40: fraction = 0.5 (half rate).
    Throttle { target_iops_fraction: f32 },
    /// Effective H in [20, 40): begin draining the device.
    /// No new I/O assignments are made to this device. In-flight I/O is
    /// allowed to complete. New I/O is redirected to other devices in the
    /// same storage/network pool, if available.
    Drain,
    /// Effective H in (0, 20) OR 3+ Fatal events within the last 60 seconds:
    /// offline the device.
    ///
    /// - Non-critical devices: offline immediately, send SIGHUP to all processes
    ///   with open handles to this device so they can react gracefully.
    /// - Critical devices ([Section 19.1.9.3](#19193-critical-device-designation)):
    ///   attempt failover first. If failover succeeds, offline the original device.
    ///   If failover is unavailable, continue to the `Panic` escalation.
    Offline { try_failover: bool },
    /// Effective H == 0 OR Fatal event on a non-redundant critical device with no
    /// available failover target: kernel panic with FMA diagnostic.
    ///
    /// The panic message includes the device canonical name, the triggering event,
    /// the last 16 entries from the device's `DeviceHealthLog` ring, and the
    /// current health score. This information is also saved to pstore
    /// ([Section 19.6](19-observability.md#196-pstore-panic-log-persistence)) for post-mortem analysis.
    Panic { reason: FmaPanicReason },
}

/// Reason for an FMA-triggered kernel panic.
pub enum FmaPanicReason {
    /// Health score reached 0 with no failover available.
    HealthScoreZero { device_name: ArrayString<64> },
    /// Fatal event on a non-redundant critical device.
    FatalEventNoCriticalFailover { device_name: ArrayString<64> },
}
```
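The threshold bands can be summarized as a pure function of the effective score. This is an illustrative sketch only: the tier names stand in for the `FmaEscalation` variants, and the 3+-Fatal burst rule for Offline and the failover logic are omitted:

```rust
/// Hypothetical mapping from effective score H to escalation tier and
/// target I/O fraction, following the bands documented above.
fn escalation_for(h: u32) -> (&'static str, f32) {
    match h.min(100) {
        70..=100 => ("healthy", 1.0),
        // Linear interpolation between the thresholds:
        // fraction = 0.5 + 0.5 * (H - 40) / 30
        hh @ 40..=69 => ("throttle", 0.5 + 0.5 * (hh as f32 - 40.0) / 30.0),
        20..=39 => ("drain", 0.0),
        1..=19 => ("offline", 0.0),
        _ => ("panic", 0.0), // H == 0 (panic only if critical with no failover)
    }
}
```

Note how the bands are half-open exactly as in the enum comments: H == 70 is healthy (fraction 1.0) and H == 40 is the deepest throttle (fraction 0.5) before draining begins.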
19.1.9.3 Critical Device Designation
A device is designated "critical" if removing it from service would render the system
unbootable or unmanageable. Only critical devices trigger the try_failover path and
can ultimately cause an FmaEscalation::Panic.
A device is critical if any of the following apply:
- It is the block device containing the root filesystem (determined by matching the device's identity against the kernel's boot parameters at startup).
- It is the block device hosting the EFI system partition (on UEFI systems).
- It is the primary management network interface on a headless system (a system with
no local console, determined by the
umka.mgmt_netdev=boot parameter or by the absence of a graphics device).
Non-critical devices — GPUs, secondary NICs, non-boot storage, accelerators — always
take the Offline path without triggering a panic. The rationale: panicking to protect
a non-critical device causes more harm than losing that device.
Critical device designation is fixed at boot. An administrator cannot promote a device to critical after the kernel has started (to prevent privilege escalation via FMA).
19.1.9.4 Hysteresis and Re-admission
A device that has been placed in Drain state does not immediately return to Active
when its health score recovers. To prevent oscillation:
- Transition from `Drain` back to `Active` requires both:
  - Effective health score `H > 60` (well clear of the drain threshold).
  - At least 5 minutes have elapsed since the last new error event of `Warning` severity or above.
- Transition from `Offline` requires administrator action: writing `enable` to the device's umkafs `control` file (Section 19.4.8). The health score is reset to 80 on re-enable (not 100) to reflect that the device has a recent failure history.
These hysteresis rules are enforced by the FMA maintenance timer. The timer re-evaluates escalation state for all devices every 60 seconds.
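The re-admission rule reduces to a small predicate. A hypothetical sketch (names are illustrative, not from the UmkaOS source):

```rust
/// Hysteresis check for Drain -> Active re-admission, per the rules above:
/// effective H must exceed 60 AND at least 5 minutes must have passed since
/// the last Warning-or-worse event.
fn can_readmit(effective_h: u32, secs_since_last_warning: u64) -> bool {
    effective_h > 60 && secs_since_last_warning >= 5 * 60
}

/// Score assigned when an administrator re-enables an Offline device
/// (deliberately not 100, reflecting recent failure history).
const REENABLE_SCORE: u32 = 80;
```

Requiring `H > 60` rather than `H >= 40` is what prevents a device from ping-ponging across the drain threshold as its score slowly recovers.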
19.1.10 FMA to umkafs Publishing
The FMA ring (Section 19.1.4) and health score (Section 19.1.9) are
kernel-internal structures. The umkafs /Health/devices/<dev>/ subtree
(Section 19.4.3) makes them visible to userspace. This section specifies the
bridge between the two.
Publishing mechanism: FMA health events are published to umkafs synchronously via
fma_publish_event(), which runs in the context of the device driver reporting the
event (not in a worker thread). No deferred workqueue is involved. This keeps latency
low for monitoring tools polling umkafs — the file contents reflect the latest event as
soon as fma_report_health() returns to the caller.
Published paths (per device with canonical name <dev> as assigned by
Section 19.4.10.1):
| umkafs path | Content | Update frequency |
|---|---|---|
| `/Health/devices/<dev>/status` | Current escalation state: healthy / throttle / drain / offline / failed | On every state transition |
| `/Health/devices/<dev>/score` | Health score H (0-100) as ASCII integer | On every event and every 60 s (maintenance timer) |
| `/Health/devices/<dev>/events` | Last 16 events from the FMA ring as JSON lines (newest last) | On every new event (tracks ring-tail pointer) |
| `/Health/devices/<dev>/event_count` | Total events since device initialization (u64, ASCII decimal) | On every new event |
| `/Health/devices/<dev>/critical` | 1 if device is designated critical (Section 19.1.9.3), 0 otherwise | Static after boot |
Read semantics: all umkafs FMA files are read-atomically — a single read() system
call returns a consistent snapshot. Readers do not need to hold any lock. The FMA
subsystem uses RCU for the ring-tail pointer (so readers see a coherent tail position)
and a seqcount for the health score (so readers always see a complete score value, never
a partially-written intermediate). Both primitives are the same ones used throughout the
FMA hot path; no additional synchronization cost is added by umkafs access.
Write semantics: /Health/devices/<dev>/status is read-only for userspace. Writing
to it returns EPERM. Administrative actions (force offline, re-enable) are performed
via the device's dedicated control file: /Health/devices/<dev>/control (see
Section 19.4.8). This prevents accidental status corruption while keeping
the admin interface consistent with the rest of umkafs.
Event format in the events file — each line is one JSON object:
```json
{"ts_ns":1234567890,"severity":"Warning","class":"IoError","detail":"CRC mismatch on sector 4096","score_after":85}
```
Fields:
| Field | Type | Description |
|---|---|---|
| ts_ns | u64 | Nanoseconds since boot at event arrival |
| severity | string | One of Info, Corrected, Warning, Degraded, Critical, Fatal |
| class | string | HealthEventClass variant name (e.g., Storage, Memory, Pcie) |
| detail | string | Human-readable description (from HealthEvent.data, interpreted per class) |
| score_after | u8 | Health score H immediately after applying this event's penalty |
The events file holds at most 16 lines. When the FMA ring has more than 16 entries,
only the 16 most recent are rendered. This bound is intentional: events is for live
monitoring (recent activity), not for long-term storage. Full event history is available
via the procfs ring dump at /proc/umka/fma/events (Section 19.1.7).
19.2 Stable Tracepoint ABI
Inspired by: DTrace's philosophy of always-on, zero-cost, production-grade tracing. IP status: Clean — design decisions applied to eBPF (Linux mechanism), not DTrace code. Tracepoints as stable interfaces is a policy choice, not a patentable invention.
19.2.1 Problem
Linux tracepoints are considered unstable internal API. They change name, arguments, and semantics between kernel versions. Tools like bpftrace and BCC scripts break regularly. The community's position: "tracepoints are for kernel developers, not users."
This is a policy choice, not a technical necessity. UmkaOS can make a different choice.
Section 18.1.4 specifies full eBPF support. This section defines a layer on top: a set of versioned, stable, documented tracepoints that applications can depend on across kernel updates.
19.2.2 Two Categories of Tracepoints
Category 1: Stable Tracepoints (umka_tp_stable_*) — Versioned, documented, covered by the same "never break without deprecation" policy as the userspace ABI. These are the tracepoints that monitoring tools, profilers, and production observability systems can depend on.
Category 2: Debug Tracepoints (umka_tp_debug_*) — Unstable, may change between any two releases. For kernel developer use only. Same policy as Linux tracepoints.
19.2.3 Stable Tracepoint Interface
```rust
// umka-core/src/trace/stable.rs (kernel-internal)

/// A stable tracepoint definition.
pub struct StableTracepoint {
    /// Tracepoint name (e.g., "umka_tp_stable_syscall_entry").
    /// Once published, this name never changes.
    pub name: &'static str,
    /// Category for organization.
    pub category: TracepointCategory,
    /// Version of this tracepoint's argument format.
    /// Starts at 1. Bumped when arguments are added (append-only).
    pub version: u32,
    /// Argument schema (for bpftool and documentation).
    pub args: &'static [TracepointArg],
    /// Probe function pointer (None when no eBPF program is attached).
    /// When None, the tracepoint has ZERO overhead (branch on static key).
    pub probe: AtomicPtr<()>,
}

#[repr(u32)]
pub enum TracepointCategory {
    Syscall = 0,    // Syscall entry/exit
    Scheduler = 1,  // Context switch, wakeup, migration
    Memory = 2,     // Page fault, allocation, reclaim, compression
    Block = 3,      // Block I/O submit, complete
    Network = 4,    // Packet TX/RX, socket events
    Filesystem = 5, // VFS operations, page cache
    Driver = 6,     // Driver load, unload, crash, recovery
    Power = 7,      // PM transitions, frequency changes
    Security = 8,   // Capability checks, access denials
    Fma = 9,        // Health events, diagnosis actions
}

pub struct TracepointArg {
    pub name: &'static str,
    pub arg_type: TracepointArgType,
    pub description: &'static str,
}

#[repr(u32)]
pub enum TracepointArgType {
    U64 = 0,
    I64 = 1,
    U32 = 2,
    I32 = 3,
    Str = 4,       // Pointer + length
    Bytes = 5,     // Pointer + length
    Pid = 6,       // Process ID (u32)
    Tid = 7,       // Thread ID (u32)
    DeviceId = 8,  // DeviceNodeId (u64)
    Timestamp = 9, // Nanoseconds since boot (u64)
}
```
19.2.4 Zero-Overhead When Disabled
Tracepoints use static keys (same mechanism as Linux). When no eBPF program is attached, the tracepoint site is a NOP instruction. There is zero overhead in the common case — no branch prediction cost, no cache pollution from tracepoint data collection. (Note: always-on aggregation counters in Section 19.2.7 add ~1-2 ns per event via per-CPU atomic increments, independent of tracepoint enablement.)
```rust
// At the tracepoint site in kernel code:
// This compiles to a NOP when no probe is attached.
// When a probe is attached, it becomes a call.
umka_trace_stable!(syscall_entry, {
    pid: current_pid(),
    tid: current_tid(),
    syscall_nr: nr,
    arg0: args[0],
    arg1: args[1],
    arg2: args[2],
});
```
When an eBPF program attaches to this tracepoint, the runtime patches the NOP to a CALL instruction (instruction patching, same as Linux static keys). When the program detaches, it's patched back to NOP.
19.2.5 Stable Tracepoint Catalog
Initial set of stable tracepoints (version 1):
Syscall:
| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_syscall_entry | pid, tid, nr, arg0-5, timestamp | Syscall entry |
| umka_tp_stable_syscall_exit | pid, tid, nr, ret, timestamp | Syscall return |
Scheduler:
| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_sched_switch | prev_pid, prev_tid, next_pid, next_tid, prev_state, cpu | Context switch |
| umka_tp_stable_sched_wakeup | pid, tid, target_cpu, timestamp | Task wakeup |
| umka_tp_stable_sched_migrate | pid, tid, orig_cpu, dest_cpu, timestamp | Task migration |
Memory:
| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_page_fault | pid, address, flags, timestamp | Page fault entry |
| umka_tp_stable_page_alloc | order, gfp_flags, numa_node, timestamp | Page allocation |
| umka_tp_stable_page_reclaim | nr_reclaimed, nr_scanned, priority | Reclaim cycle |
| umka_tp_stable_zpool_compress | original_size, compressed_size, algorithm | Page compressed |
| umka_tp_stable_zpool_decompress | compressed_size, latency_ns | Page decompressed |
Block I/O:
| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_block_submit | device_id, sector, size, op, timestamp | I/O submitted |
| umka_tp_stable_block_complete | device_id, sector, size, op, latency_ns, error | I/O completed |
Network:
| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_net_rx | device_id, len, protocol (ethertype, u16 network byte order), timestamp | Packet received |
| umka_tp_stable_net_tx | device_id, len, protocol (ethertype, u16 network byte order), timestamp | Packet transmitted |
| umka_tp_stable_tcp_connect | pid, saddr, sport, daddr, dport | TCP connection |
| umka_tp_stable_tcp_close | pid, saddr, sport, daddr, dport, duration_ns | TCP close |
Driver:
| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_driver_load | device_id, driver_name, tier, timestamp | Driver loaded |
| umka_tp_stable_driver_crash | device_id, driver_name, tier, fault_type | Driver crash |
| umka_tp_stable_driver_recover | device_id, driver_name, recovery_time_ns | Driver recovered |
| umka_tp_stable_tier_demote | device_id, driver_name, old_tier, new_tier | Tier demotion |
FMA:
| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_fma_event | device_id, class, severity, code | Health event |
| umka_tp_stable_fma_action | device_id, action, reason | Corrective action taken |
19.2.6 Versioning Rules
Same philosophy as KABI:
- Tracepoint names are permanent. Once published, a stable tracepoint name never changes and never disappears (without multi-release deprecation).
- Arguments are append-only. New arguments can be added at the end. Existing arguments never change position, type, or semantics.
- Version field tracks argument schema. eBPF programs can check the tracepoint version to know which arguments are available.
- Deprecation requires 2+ major releases. A deprecated tracepoint continues to fire (possibly with stale/dummy data) for at least two major releases before removal.
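The append-only rule means a consumer can gate on the version field before touching newer arguments. A hypothetical sketch, with a plain function standing in for an eBPF program and `TpEvent` as an illustrative stand-in for the tracepoint context:

```rust
/// Illustrative tracepoint context: version plus positional arguments,
/// which per the rules above only ever grow at the end.
struct TpEvent {
    version: u32,
    args: Vec<u64>,
}

/// Suppose (hypothetically) a latency_ns argument was appended at index 5
/// in version 2 of some tracepoint. A consumer compiled against v2 reads it
/// only after checking the version, so it stays correct on older kernels.
fn latency_ns(ev: &TpEvent) -> Option<u64> {
    if ev.version >= 2 {
        ev.args.get(5).copied()
    } else {
        None
    }
}
```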
19.2.7 Built-In Aggregation Maps
For common observability patterns, provide pre-built eBPF maps that aggregate in-kernel:
```
/sys/kernel/umka/trace/
    syscall_latency_hist   # Per-syscall latency histogram (log2 buckets)
    block_latency_hist     # Per-device I/O latency histogram
    sched_latency_hist     # Scheduling latency histogram
    net_packet_count       # Per-interface packet counters
    page_fault_count       # Per-process page fault counters
```
These are always-on aggregation counters, separate from the tracepoint mechanism described in Section 19.2.4. The distinction is important:
- Tracepoints (Section 19.2.4) have zero overhead when disabled. When no eBPF program is attached, the tracepoint site is a NOP instruction — no branch, no cost. They are designed for on-demand, deep inspection.
- Aggregation counters are always active but are not tracepoint-based. They are simple atomic increments (e.g., `AtomicU64::fetch_add(1, Relaxed)`) embedded directly in the relevant code paths (syscall entry, block I/O completion, packet RX/TX, etc.). Their cost is a single atomic increment per event — typically 1-2 ns — which is negligible compared to the operation being measured. The histogram buckets are updated in-kernel with no eBPF program required.
Aggregation counters are per-CPU (PerCpu<AggregationCounters>), avoiding cross-core
cache line bouncing. The per-CPU design ensures fetch_add operations are cache-local
— the x86 lock prefix cost is negligible (~1 ns) when the cache line is CPU-local
because no cross-core coherence traffic is generated.
Observability tools (Prometheus, Grafana agents, etc.) can scrape the aggregation counter files directly without enabling any tracepoints.
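The log2-bucket histograms behind these files can be sketched as follows. This is a hedged userspace illustration, assuming a hypothetical `LatencyHist` type; the per-CPU placement (`PerCpu<...>`) is omitted:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// One bucket per power of two: bucket i covers [2^(i-1), 2^i) nanoseconds.
const BUCKETS: usize = 64;

struct LatencyHist {
    buckets: [AtomicU64; BUCKETS],
}

impl LatencyHist {
    fn new() -> Self {
        LatencyHist {
            buckets: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    /// One relaxed atomic increment per event — the ~1-2 ns cost cited above.
    fn record(&self, latency_ns: u64) {
        // 64 - leading_zeros == floor(log2(x)) + 1 for x > 0; 0 lands in bucket 0.
        let idx = (64 - latency_ns.leading_zeros()) as usize;
        self.buckets[idx.min(BUCKETS - 1)].fetch_add(1, Ordering::Relaxed);
    }

    fn count(&self, idx: usize) -> u64 {
        self.buckets[idx].load(Ordering::Relaxed)
    }
}
```

A 1000 ns sample lands in bucket 10 (the [512, 1024) range); the relaxed ordering is sufficient because each counter is independent and only ever summed by readers.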
19.2.8 Linux Tool Compatibility
bpftrace: Works unmodified. Stable tracepoints appear as standard tracepoints:
```shell
bpftrace -e 'tracepoint:umka_stable:syscall_entry { @[args->nr] = count(); }'
```
perf: Works unmodified. Tracepoints visible via perf list:
```shell
perf list 'umka_stable:*'
perf record -e umka_stable:block_complete -a sleep 10
```
bpftool: Works unmodified. Can list and inspect programs attached to stable tracepoints.
19.2.9 Audit Subsystem
UmkaOS provides a structured audit subsystem for security-relevant events, built on the tracepoint infrastructure from Sections 19.2.1–19.2.8. Unlike Linux's auditd — a separate subsystem with its own filtering language, netlink protocol, and dispatcher daemon — UmkaOS's audit is integrated directly with the capability model and eBPF tracepoints. One mechanism serves both observability and security auditing, operating in one of two explicitly distinct delivery modes:
- Best-effort mode (default): Audit record emission is always non-blocking. If the per-CPU ring buffer is full, the oldest unread record is overwritten. No thread ever blocks. System availability is prioritized over audit completeness.
- Strict mode (`audit_strict=true` boot parameter): Audit record emission blocks with a configurable timeout (default 10 ms) when the ring buffer is full, applying backpressure to maintain gapless per-CPU sequences. This is NOT contradictory with the non-blocking hot path: in strict mode, blocking occurs only on buffer overflow (a rare backpressure event), not on every record emission.
These modes are detailed in Section 19.2.9.2 (delivery semantics) and Section 19.2.9.3 (overflow policy).
19.2.9.1 Security Audit Events
Audited events fall into six categories. All events in the Capability and Authentication categories are audited by default; other categories follow configurable policy.
| Category | Events | Example |
|---|---|---|
| Capability | grant, revoke, attenuate, delegate | Process A grants CAP_NET to process B |
| Authentication | login, logout, failed auth | SSH login attempt from 10.0.0.1 |
| Access control | permission denied, capability check | Open /etc/shadow: CAP_READ denied |
| Process lifecycle | exec, exit, setuid-equiv | Process 1234 exec /usr/bin/sudo |
| Driver | load, crash, tier change | NVMe driver crashed, recovering |
| Configuration | sysctl change, policy load | Security policy reloaded |
Each event fires a stable tracepoint in the Security category (TracepointCategory,
Section 19.2.3), so the same eBPF tooling works for security monitoring. Audit tracepoints
are non-blocking on the hot path: they enqueue the audit record into a per-CPU lock-free
ring buffer (the same zero-overhead static-key mechanism as all other tracepoints). The
delivery guarantee is enforced asynchronously by the drain thread and the configured
delivery mode (Section 19.2.9.2). In strict mode, the producer blocks only when the
ring buffer is full (backpressure), not on every record emission. In best-effort mode
(the default), the oldest unread record is overwritten and no thread ever blocks.
19.2.9.2 Audit Record Format
```rust
/// Audit event type discriminator.
///
/// UmkaOS audit event IDs use a base of 3000 (`UMKA_AUDIT_BASE`) to avoid collision
/// with all Linux audit message type ranges. Linux's `audit.h` allocates:
/// - 1000-2099: kernel messages (commands, events, anomalies, integrity, etc.)
/// - 2100-2999: userspace messages (`AUDIT_FIRST_USER_MSG2` .. `AUDIT_LAST_USER_MSG2`)
/// By starting at 3000, UmkaOS IDs are above all currently allocated Linux ranges.
/// The Linux compatibility layer (Section 19.2.9.5) translates UmkaOS audit events to
/// standard Linux audit message types (e.g., `AUDIT_AVC`, `AUDIT_USER_AUTH`) for
/// tools like auditd and ausearch.
///
/// Base constant: `UMKA_AUDIT_BASE = 3000`.
#[repr(u16)]
pub enum AuditEventType {
    /// Capability grant (cap_id, target_pid, permissions).
    CapGrant = 3000,
    /// Capability denial (cap_id, requested_permissions, reason).
    CapDeny = 3001,
    /// Capability revocation (cap_id, holder_pid).
    CapRevoke = 3002,
    /// Process lifecycle (fork, exec, exit).
    ProcessLifecycle = 3010,
    /// Security policy change (module load/unload, policy update).
    PolicyChange = 3020,
    /// Driver isolation event (tier change, crash, recovery).
    DriverIsolation = 3030,
    /// Authentication event (login, sudo, key use).
    Auth = 3040,
    /// Administrative action (sysctl, mount, module load).
    AdminAction = 3050,
    /// Audit drain thread started (meta-event, Section 19.2.9.3).
    /// Emitted by the audit subsystem's initialization code to make
    /// the drain thread's lifecycle observable even though its internal
    /// operations are not individually audited (to prevent self-deadlock).
    AuditDrainStart = 3060,
    /// Audit drain thread stopped (meta-event, Section 19.2.9.3).
    AuditDrainStop = 3061,
    /// Batch HMAC seal marker (Section 19.2.9.4).
    HmacBatchSeal = 3062,
}

/// A single audit record. Fixed-size header followed by variable-length detail.
/// Written atomically to the per-CPU audit ring buffer.
///
/// Field order is chosen for natural alignment with `#[repr(C)]`: all u64 fields
/// first, then u32 fields, then u16 fields, then u8 fields, with explicit padding
/// to avoid compiler-inserted holes. Total header size: 56 bytes (no hidden padding).
#[repr(C)]
pub struct AuditRecord {
    // --- 8-byte aligned fields (offset 0) ---
    /// Monotonic nanosecond timestamp (same clock as tracepoints).
    pub timestamp_ns: u64,
    /// Per-CPU monotonically increasing sequence number. Each CPU maintains
    /// its own independent sequence counter. In strict mode, sequences are
    /// gapless (producers block on overflow); in best-effort mode, gaps
    /// indicate lost records and are detected via sequence discontinuities.
    /// On drain, the merge-sort thread interleaves per-CPU chains and
    /// produces a combined audit log with (cpu_id, per_cpu_sequence)
    /// tuples for total ordering reconstruction.
    pub sequence: u64,
    /// Capability handle used (or attempted) for this operation.
    /// `CapHandle` is a 64-bit opaque handle indexing into the process's
    /// capability table (Section 8.1.1, umka-core capability model). User space
    /// holds handles, never raw capability data. Defined as:
    /// `pub struct CapHandle(u64);`
    pub subject_cap: CapHandle,
    /// Opaque identifier for the target resource (inode, device, endpoint).
    pub object_id: u64,
    // --- 4-byte aligned fields (offset 32) ---
    /// Process ID of the subject.
    pub pid: u32,
    /// Thread ID of the subject.
    pub tid: u32,
    /// User ID of the subject (mapped from capability-based identity).
    pub uid: u32,
    // --- 2-byte aligned fields (offset 44) ---
    /// CPU that emitted this record. Together with `sequence`, provides a
    /// globally unique, tamper-evident record identifier.
    pub cpu_id: u16,
    /// The audit event type (which category + specific event).
    pub event_type: AuditEventType,
    /// Length of the variable-length detail payload that follows.
    pub detail_len: u16,
    // --- 1-byte aligned fields (offset 50) ---
    /// Outcome of the audited operation.
    pub result: AuditResult,
    /// Explicit padding to 8-byte alignment boundary (56 bytes total).
    /// Prevents compiler-inserted hidden padding in `#[repr(C)]`.
    pub _pad: [u8; 5],
    // Variable-length detail follows: structured key=value pairs
    // encoded as length-prefixed UTF-8 strings.
}

#[repr(u8)]
pub enum AuditResult {
    /// Operation succeeded.
    Success = 0,
    /// Operation denied by capability check.
    Denied = 1,
    /// Operation failed for a non-security reason.
    Error = 2,
}
```
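The claimed 56-byte layout can be checked mechanically. The following standalone sketch reproduces the header with `CapHandle`, `AuditEventType`, and `AuditResult` reduced to their underlying integer widths (the field names follow the spec; the reduction to plain integers is ours):

```rust
use std::mem::{align_of, size_of};

// Standalone reproduction of the AuditRecord header for layout checking.
// CapHandle is a u64 newtype; AuditEventType is #[repr(u16)];
// AuditResult is #[repr(u8)].
#[repr(C)]
struct AuditRecordHeader {
    timestamp_ns: u64, // offset 0
    sequence: u64,     // offset 8
    subject_cap: u64,  // offset 16
    object_id: u64,    // offset 24
    pid: u32,          // offset 32
    tid: u32,          // offset 36
    uid: u32,          // offset 40
    cpu_id: u16,       // offset 44
    event_type: u16,   // offset 46
    detail_len: u16,   // offset 48
    result: u8,        // offset 50
    _pad: [u8; 5],     // offsets 51-55 -> 56 bytes total, align 8
}
```

Summing the fields confirms the comment in the spec: 32 bytes of u64s, 12 of u32s, 6 of u16s, 1 of u8, and 5 explicit padding bytes, with no compiler-inserted holes.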
Records are written to a per-CPU lock-free ring buffer, then drained to persistent
storage by a pool of N drain threads (default: max(2, num_cpus / 64), tunable
via umka.audit.drain_threads). Each CPU maintains its own monotonic sequence counter
without requiring a global atomic counter (which would defeat per-CPU parallelism).
CPU rings are statically partitioned among drain threads (CPUs 0..K to thread 0,
K+1..2K to thread 1, etc.) to avoid contention. Each drain thread merges and sorts
its assigned per-CPU chains by timestamp. A final lightweight merge of pre-sorted
per-thread outputs produces the combined record stream with
(cpu_id, per_cpu_sequence) tuples for total ordering.
Timestamp ordering: The merge-sort uses (timestamp_ns, cpu_id, per_cpu_sequence) as the composite sort key — timestamp is primary, with CPU ID and per-CPU sequence as tiebreakers. TSC synchronization (Section 6.5) bounds cross-CPU skew to <100ns on modern hardware. For systems with larger skew (NUMA, pre-invariant-TSC), the drain thread applies a per-CPU offset correction table calibrated at boot.
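The composite sort key translates directly to a tuple key. A sketch with a hypothetical trimmed-down record type:

```rust
/// Illustrative subset of an audit record for merge ordering.
struct Rec {
    timestamp_ns: u64,
    cpu_id: u16,
    sequence: u64,
}

/// Drain-thread merge order: timestamp primary, (cpu_id, per-CPU sequence)
/// as tiebreakers, matching the composite key described above.
fn merge_order(records: &mut Vec<Rec>) {
    records.sort_by_key(|r| (r.timestamp_ns, r.cpu_id, r.sequence));
}
```

Because per-CPU chains are already sorted, the production path described above only needs a k-way merge rather than a full sort; `sort_by_key` here just demonstrates the ordering relation.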
Variable-length record framing with overwrite safety. Each AuditRecord has a
fixed-size 56-byte header followed by a variable-length detail payload (length given by
detail_len). Since records vary in total size, the ring buffer uses a composed framing
and overwrite-safety protocol. Each ring slot has the following byte layout:
```
+-----------+-----------+-------------------------------------+-----------+
| start_seq | length    | AuditRecord header + detail payload | end_seq   |
| (u32, 4B) | (u32, 4B) | (56 + detail_len bytes)             | (u32, 4B) |
+-----------+-----------+-------------------------------------+-----------+
      ^                                  ^                          ^
 Overwrite safety          Framing (record boundary)       Overwrite safety
```
Framing protocol (record boundary detection):
- The `length` field (second u32) stores the total size of the framed entry: 4 (start_seq) + 4 (length) + 56 (header) + detail_len (payload) + 4 (end_seq).
- When a record would wrap around the end of the ring buffer (remaining space < total framed size), a skip marker is written: a slot where `length == 0`. This signals the consumer to skip the remaining bytes and continue from offset 0. The record is then written starting at offset 0.
- Consumers read the `length` field at the current position. If `length == 0` (skip marker), they advance to the buffer start. Otherwise, they read `length` bytes as a complete framed record.
- The skip-marker value `0` cannot conflict with `start_seq` because `start_seq` is the first u32 in the slot while `length` is the second. Consumers distinguish them by position within the slot layout.
- This skip-marker approach wastes at most `max_record_size - 1` bytes per wrap-around but avoids the complexity of split-record handling in the lock-free read path. The wasted space is bounded because `detail_len` is capped at 4096 bytes (audit records with payloads exceeding this limit are truncated and tagged with a `DETAIL_TRUNCATED` flag).
Overwrite safety protocol (torn-read prevention):
- Each entry is bracketed by `start_seq` (first u32) and `end_seq` (last u32).
- The producer writes each entry in four steps:
  1. `start_seq = write_seq | 1` (odd = write in progress)
  2. Write the length field and entry data.
  3. `end_seq = write_seq` (even = commit target)
  4. `start_seq = write_seq` (even = write complete, matches `end_seq`), then increment `write_seq` by 2.
- The consumer reads `start_seq`, copies the entry, then reads `end_seq`. If `start_seq != end_seq` or `start_seq` is odd, the entry was being overwritten during the read — the consumer discards it and advances to the next entry.
- The key invariant: when a write is complete, both `start_seq` and `end_seq` hold the same even value. During a write, `start_seq` is odd. A concurrent overwrite will first set `start_seq` to a new odd value, which the consumer detects on the post-copy check.
- This ensures consumers never observe torn data. The two protocols compose cleanly: framing uses the `length` field to find record boundaries, while the sequence stamps at the slot's edges detect concurrent overwrites.
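The consumer-side post-copy check reduces to a small predicate. A sketch (the function name is illustrative):

```rust
/// Commit check for one ring slot, per the invariant above: a completed
/// write leaves both stamps equal and even; an odd start_seq means a write
/// (or a concurrent overwrite) is in progress.
fn entry_committed(start_seq: u32, end_seq: u32) -> bool {
    start_seq & 1 == 0 && start_seq == end_seq
}
```

In the real protocol this predicate is evaluated against the `start_seq` read before the copy and the `end_seq` read after it, so any overwrite that begins mid-copy flips one of the stamps and fails the check.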
Sequence numbers are monotonically increasing within each CPU's stream. The gap-detection behavior depends on the configured delivery mode:
- Strict mode (`audit_strict=true`): Sequence numbers are gapless. The producing thread blocks (with configurable timeout, default 10 ms) when the ring buffer is full, applying backpressure to maintain the gapless invariant. Any gap in a single CPU's sequence under strict mode indicates a bug or tampering — the audit subsystem treats this as a critical security event, emitting a synthetic `AUDIT_LOST` record and raising an alert via the FMA subsystem (Section 19.1).
- Best-effort mode (default): Gaps are possible when the ring buffer overflows (oldest records are overwritten). Consumers detect lost records by observing discontinuities in the per-CPU sequence numbers. The `records_dropped` counter (per-CPU `AtomicU64`) tracks how many records were lost, and the drain thread includes the drop count in periodic "audit health" meta-records so that log consumers can quantify the gap.
19.2.9.3 Audit Policy Engine
The audit policy engine determines which events are recorded. Default policy:
- Always audit: all capability denials, all authentication events, all exec calls.
- Configurable: capability grants, driver lifecycle, configuration changes.
Policy is expressed as eBPF programs attached to audit tracepoints — the same mechanism used for security monitoring (Section 8.7). The delivery guarantee depends on the configured mode: best-effort (default) or strict (guaranteed delivery).
Overflow policy. The two delivery modes described above (Section 19.2.9.2) govern overflow behavior:
- Best-effort mode (default, audit_strict=false): Drop-oldest policy. When the per-CPU ring buffer is full, the oldest unread record is overwritten. The kernel prioritizes system availability over audit completeness.
- Strict mode (audit_strict=true boot parameter): The emitting thread blocks with a configurable timeout (default 10 ms) rather than overwriting records, maintaining gapless per-CPU sequences. To prevent self-deadlock, the drain thread is exempt from auditing its own operations via the non-audited I/O path:
  - The drain thread writes directly to a dedicated audit partition (configured at boot via the audit_log_dev= kernel parameter) using raw block I/O (submit_bio directly to the block device) rather than the VFS write() syscall path that would trigger audit events. The audit partition must be a separate block device or partition — not a file on a journaled filesystem — because raw submit_bio to data blocks on a journaled filesystem like ext4/XFS would bypass the journal and corrupt its consistency guarantees. The audit subsystem manages its own simple log-structured layout on this partition (sequential append with a header containing magic, version, and write offset).
  - On systems where a dedicated partition is unavailable, the fallback audit_log_path= parameter specifies a regular file, but in this mode the drain thread uses the VFS write path with the PF_NOAUDIT flag (no audit recursion) instead of raw submit_bio. This is slower but preserves filesystem integrity.
  - The block device's I/O completion path is also marked non-auditable via a per-thread flag (PF_NOAUDIT) set on the drain thread at creation.
  - This design ensures the drain thread can always make forward progress even when the audit ring buffers are full, breaking the circular dependency.

To mitigate the resulting audit blind spot, the drain thread's own activation and deactivation are logged as meta-events (types AuditDrainStart and AuditDrainStop, Section 19.2.9.1) by the audit subsystem's initialization code, not by the drain thread itself. This ensures the drain thread's lifecycle is observable even though its internal operations are not individually audited. RT tasks (SCHED_FIFO/SCHED_RR) always use the drop-oldest policy regardless of strict mode, to preserve real-time scheduling guarantees.
Timeout behavior in strict mode: After the 10 ms timeout expires and the ring
buffer is still full, the current record IS emitted by force-evicting the oldest
unread entry (converting temporarily to best-effort for that single entry). This
preserves the guarantee that the current event is never lost — only the oldest,
already-in-buffer event can be evicted (which the drain thread is expected to have
forwarded already). The eviction is tracked via two mechanisms: a per-CPU
forced_eviction_count: AtomicU64 counter (incremented on each forced eviction),
and a per-CPU missed_sequences sideband buffer that records the sequence number of
each evicted entry. The drain thread reads the missed_sequences buffer and emits
a synthetic AUDIT_LOST record for each evicted sequence, ensuring the gap is visible
to verifiers even though the evicted record's content is irretrievably lost. This
approach avoids both deadlock (the producing thread never blocks indefinitely) and
silent loss of the current security event (the event being audited right now is the
one most likely to be attacker-relevant).
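The two tracking mechanisms can be sketched as follows. The structure and method names beyond forced_eviction_count and missed_sequences are ours, and the Mutex-guarded Vec is a simplification — a real sideband buffer would itself be a small lock-free ring.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Per-CPU forced-eviction bookkeeping: a counter plus a sideband list of
/// evicted sequence numbers, drained later to emit synthetic AUDIT_LOST records.
pub struct EvictionLog {
    pub forced_eviction_count: AtomicU64,
    // Simplification: the real sideband buffer would be a small lock-free ring.
    missed_sequences: Mutex<Vec<u64>>,
}

impl EvictionLog {
    pub fn new() -> Self {
        EvictionLog {
            forced_eviction_count: AtomicU64::new(0),
            missed_sequences: Mutex::new(Vec::new()),
        }
    }

    /// Producer path: called when the oldest entry is force-evicted.
    pub fn record_eviction(&self, seq: u64) {
        self.forced_eviction_count.fetch_add(1, Ordering::Relaxed);
        self.missed_sequences.lock().unwrap().push(seq);
    }

    /// Drain thread: take the evicted sequence numbers; the caller emits one
    /// synthetic AUDIT_LOST record per returned sequence.
    pub fn drain(&self) -> Vec<u64> {
        std::mem::take(&mut *self.missed_sequences.lock().unwrap())
    }
}
```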
/// Attach an audit policy program. Unlike debug tracepoints, audit programs
/// participate in the guaranteed-delivery path.
pub fn attach_audit_policy(
tracepoint: &'static StableTracepoint,
prog: &VerifiedBpfProgram,
) -> Result<AuditPolicyHandle, AuditError> {
// Verified eBPF program acts as filter: returns true to audit, false to skip.
// The program can inspect all tracepoint arguments to make the decision.
// ...
}
Rate limiting. To prevent audit log flooding from misbehaving or malicious processes, each event type has a configurable rate limit (default: 10 000 events/second per type). When the rate limit is hit, the audit subsystem coalesces events into a single summary record (e.g., "5 327 additional CAP_READ denials from pid 4001 suppressed in last 1s") rather than silently dropping them. The summary record consumes a sequence number in the per-CPU stream, preserving the sequence continuity invariant (in strict mode) while bounding log volume.
The rate limiter is structured as a three-layer defense against DoS via high-rate audit-generating workloads (e.g., a process rapidly calling audited syscalls to fill the ring and stall other tenants):
/// Audit event rate limiter. Prevents DoS via high-rate audit-generating workloads
/// (e.g., a process rapidly calling audited syscalls to fill the ring and stall others).
///
/// **Three-layer defense**:
/// 1. Per-cgroup token bucket: limits events from one cgroup.
/// 2. Global byte-rate limiter: limits total audit throughput.
/// 3. Emergency throttle: drops events (with an "audit: lost N events" record) when
/// the ring consumer (auditd) is too slow.
pub struct AuditRateLimiter {
/// Per-cgroup token bucket. Key: cgroup ID.
    /// Each cgroup gets `AUDIT_PER_CGROUP_BURST` tokens, refilled at
    /// `AUDIT_PER_CGROUP_RATE_EPS` events/second.
pub per_cgroup: SpinLock<BTreeMap<CgroupId, TokenBucket>>,
/// Global byte-rate token bucket: limits total audit data throughput.
/// Prevents one audit-heavy cgroup from consuming the entire ring even within
/// its per-cgroup event limit (e.g., via large audit records).
pub global_byte_rate: SpinLock<TokenBucket>,
/// Count of events dropped since last "audit: lost N" message.
pub events_lost: AtomicU64,
/// Count of bytes dropped since last "audit: lost N bytes" message.
pub bytes_lost: AtomicU64,
}
/// Token bucket for rate limiting.
/// Tokens represent events (for per-cgroup limiter) or bytes (for global limiter).
pub struct TokenBucket {
/// Current token count (0..=capacity).
pub tokens: u64,
/// Maximum token count (burst size).
pub capacity: u64,
/// Token refill rate in tokens per second.
pub refill_rate_per_sec: u64,
/// Timestamp of last refill (nanoseconds since boot).
pub last_refill_ns: u64,
}
impl TokenBucket {
/// Try to consume `n` tokens. Returns true if tokens were available (event allowed).
/// Refills tokens based on elapsed time since last refill before checking.
    pub fn try_consume(&mut self, n: u64, now_ns: u64) -> bool {
        let elapsed_ns = now_ns.saturating_sub(self.last_refill_ns);
        // u128 intermediate avoids overflow for high refill rates or long idle gaps.
        let new_tokens =
            (elapsed_ns as u128 * self.refill_rate_per_sec as u128 / 1_000_000_000) as u64;
        // Advance the refill timestamp only when at least one whole token was
        // credited; otherwise frequent calls with sub-token elapsed intervals
        // would discard the fractional remainder and starve the refill.
        if new_tokens > 0 {
            self.tokens = self.tokens.saturating_add(new_tokens).min(self.capacity);
            self.last_refill_ns = now_ns;
        }
if self.tokens >= n {
self.tokens -= n;
true
} else {
false
}
}
}
/// Per-cgroup audit rate: 1000 events/second with a burst of 5000.
pub const AUDIT_PER_CGROUP_RATE_EPS: u64 = 1_000;
pub const AUDIT_PER_CGROUP_BURST: u64 = 5_000;
/// Global audit byte rate: 10 MB/second with a burst of 50 MB.
pub const AUDIT_GLOBAL_BYTE_RATE: u64 = 10 * 1024 * 1024;
pub const AUDIT_GLOBAL_BYTE_BURST: u64 = 50 * 1024 * 1024;
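To make the default limits concrete, a standalone restatement of the bucket logic (field names simplified; illustration only) shows the burst-then-refill behavior:

```rust
/// Simplified restatement of TokenBucket for a worked example.
struct Bucket {
    tokens: u64,
    capacity: u64,
    rate_per_sec: u64,
    last_refill_ns: u64,
}

impl Bucket {
    fn try_consume(&mut self, n: u64, now_ns: u64) -> bool {
        let elapsed = now_ns.saturating_sub(self.last_refill_ns);
        // Credit whole tokens for the elapsed interval, capped at capacity.
        let refill = (elapsed as u128 * self.rate_per_sec as u128 / 1_000_000_000) as u64;
        if refill > 0 {
            self.tokens = self.tokens.saturating_add(refill).min(self.capacity);
            self.last_refill_ns = now_ns;
        }
        if self.tokens >= n {
            self.tokens -= n;
            true
        } else {
            false
        }
    }
}
```

With the per-cgroup defaults (burst 5 000, refill 1 000 events/second), a 5 000-event burst at t=0 is admitted in full, the 5 001st event is rejected, and one second later exactly 1 000 further events pass.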
Rate limiting enforcement in the audit event submission path:
fn audit_submit(event: AuditEvent, cgroup: CgroupId) -> AuditResult {
let now_ns = clock_monotonic_ns();
let event_size = event.serialized_size();
// 1. Per-cgroup token bucket check
if !limiter.per_cgroup.lock()
.entry(cgroup).or_insert_with(default_bucket)
.try_consume(1, now_ns)
{
limiter.events_lost.fetch_add(1, Relaxed);
return AuditResult::DroppedRateLimit;
}
// 2. Global byte-rate check
if !limiter.global_byte_rate.lock().try_consume(event_size as u64, now_ns) {
limiter.bytes_lost.fetch_add(event_size as u64, Relaxed);
return AuditResult::DroppedRateLimit;
}
// 3. Ring fullness check (emergency throttle)
if audit_ring.is_full() {
limiter.events_lost.fetch_add(1, Relaxed);
// Emit "audit: lost N events" record if threshold crossed
maybe_emit_lost_record(&limiter);
return AuditResult::DroppedRingFull;
}
audit_ring.push(event);
AuditResult::Accepted
}
The maybe_emit_lost_record function emits a synthetic AuditEventType::LostEvents
record into the ring whenever events_lost crosses a reporting threshold (default:
every 100 dropped events or every 1 second since the last report, whichever comes
first). This preserves the property that sequence gaps in strict mode always have an
explicit explanation record, even when the explanation itself is the loss record.
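The reporting-threshold decision reduces to a small predicate. The helper name and signature here are hypothetical, mirroring the thresholds stated above (100 drops or 1 second, whichever comes first, and never when nothing was dropped):

```rust
/// Hypothetical predicate mirroring the reporting thresholds described above:
/// emit a LostEvents record every 100 drops or every 1 s, whichever comes
/// first — and never when nothing was dropped.
fn should_emit_lost_record(dropped_since_last: u64, ns_since_last: u64) -> bool {
    const DROP_THRESHOLD: u64 = 100;
    const INTERVAL_NS: u64 = 1_000_000_000;
    dropped_since_last >= DROP_THRESHOLD
        || (dropped_since_last > 0 && ns_since_last >= INTERVAL_NS)
}
```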
19.2.9.4 Tamper-Evident Log Chain
Each CPU maintains an independent HMAC chain over its audit records, making retroactive tampering detectable without cross-CPU synchronization:
Initial: key[0] = HMAC-SHA256(boot_secret, cpu_id || "audit-chain-v1")
Evolution: key[n] = HMAC-SHA256(key[n-1], "evolve")
After computing key[n], key[n-1] is securely erased (zeroed).
Per-record: hmac[n] = HMAC-SHA256(key[n], hmac[n-1] || serialize(record[n]))
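The chain bookkeeping can be sketched as follows. The keyed hash here is a toy stand-in (std's DefaultHasher) so the example needs no crypto crate — the real chain uses HMAC-SHA256 throughout, and the Chain type and its methods are our illustration, not the kernel's API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy keyed hash standing in for HMAC-SHA256 (illustration only).
fn toy_mac(key: u64, data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    data.hash(&mut h);
    h.finish()
}

/// Per-CPU chain state: the current (evolving) key and the previous MAC.
pub struct Chain {
    key: u64,
    prev_mac: u64,
}

impl Chain {
    /// key[0] = MAC(boot_secret, cpu_id || "audit-chain-v1")
    pub fn new(boot_secret: u64, cpu_id: u32) -> Self {
        let mut seed = cpu_id.to_le_bytes().to_vec();
        seed.extend_from_slice(b"audit-chain-v1");
        Chain { key: toy_mac(boot_secret, &seed), prev_mac: 0 }
    }

    /// mac[n] = MAC(key[n], mac[n-1] || record); then evolve the key and
    /// overwrite ("erase") key[n-1].
    pub fn append(&mut self, record: &[u8]) -> u64 {
        let mut input = self.prev_mac.to_le_bytes().to_vec();
        input.extend_from_slice(record);
        let mac = toy_mac(self.key, &input);
        self.key = toy_mac(self.key, b"evolve"); // old key is overwritten
        self.prev_mac = mac;
        mac
    }
}
```

Because the derivation is deterministic from boot_secret, an offline verifier re-runs exactly this sequence; the per-CPU seed makes each chain independently verifiable.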
Important: The HMAC is NOT stored in the per-CPU ring buffer. The AuditRecord
struct in the ring buffer contains only the event data (timestamp, event type, subject,
object, detail). The HMAC is computed by the drain thread when it reads records from
the per-CPU ring buffer and writes them to the persistent audit log. In the persistent
format, each record (strict mode) or each batch (batched mode) is stored alongside its
HMAC. This separation keeps the hot-path ring buffer append fast (no cryptographic
operations on the recording CPU) while ensuring tamper evidence in the durable log.
Performance cost. HMAC-SHA256 on a typical 56-byte audit record header plus
variable-length detail (average ~200-300 bytes total) takes approximately 200-500 ns
per record on modern hardware. At sustained high audit rates (100K records/sec in an
audit-heavy workload), this costs 20-50 ms/sec of CPU time, roughly 2-5% of one core.
For most workloads (1K-10K records/sec) the cost is negligible. To mitigate the cost
for non-security-critical events, HMAC computation is configurable per event category:
security-critical categories (CapGrant, CapDeny, CapRevoke, Auth) always
use strict mode (per-record HMAC with per-record key evolution); other categories
can be configured to use batched mode or no HMAC at all. The two HMAC modes
operate as follows:
- Strict HMAC mode (per-record): Each record receives its own HMAC. The HMAC key evolves after every record: key[n] = HMAC-SHA256(key[n-1], "evolve"), and key[n-1] is erased. This provides per-record tamper evidence and forward secrecy.
- Batched HMAC mode (per-batch): A single HMAC covers N consecutive records (default N=16). All records in a batch are serialized and HMACed together under a single key. The key evolves per batch, not per record: after computing the HMAC for batch B, the key evolves once to produce the key for batch B+1, and the previous key is erased. This amortizes the cryptographic cost across N records while still providing tamper evidence at batch granularity and forward secrecy at batch boundaries. Within a batch, individual record tampering is still detectable because the batch HMAC covers all records in sequence.
Batch framing: In batched mode, the drain thread accumulates N records and then
writes a batch seal — a special entry with AuditEventType::HmacBatchSeal containing
the batch HMAC, the batch sequence range (first and last sequence numbers covered),
and the evolved key's public commitment (SHA-256 of the next key, enabling verifiers
to detect key-evolution breaks). Individual records within a batch carry no HMAC;
tamper evidence is provided solely by the batch seal. The verifier reads records from
the persistent log until it encounters an HmacBatchSeal, verifies the batch HMAC
over all preceding unsealed records (by re-serializing and re-HMACing them), then
continues to the next batch. If the log ends mid-batch (e.g., crash before the drain
thread wrote the seal), the trailing unsealed records are flagged as unverifiable in
the verification report.
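The verifier's batch walk can be sketched as follows. The MAC is again a toy stand-in for HMAC-SHA256, and the Entry type is a deliberate simplification: a real seal also carries the sequence range and the key commitment, which are omitted here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy keyed hash standing in for HMAC-SHA256 (illustration only).
fn stub_mac(key: u64, data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    data.hash(&mut h);
    h.finish()
}

/// Simplified persistent-log entry: records interleaved with batch seals.
pub enum Entry {
    Record(Vec<u8>),
    BatchSeal(u64), // batch MAC (seq range and key commitment omitted here)
}

/// Walk the log, verifying each sealed batch and evolving the key per batch.
/// Returns the count of trailing unsealed records (crash before the seal was
/// written), which the report flags as unverifiable rather than tampered.
pub fn verify(log: &[Entry], mut key: u64) -> Result<usize, &'static str> {
    let mut pending: Vec<u8> = Vec::new();
    let mut unsealed = 0usize;
    for entry in log {
        match entry {
            Entry::Record(r) => {
                pending.extend_from_slice(r); // re-serialize batch contents
                unsealed += 1;
            }
            Entry::BatchSeal(mac) => {
                if stub_mac(key, &pending) != *mac {
                    return Err("batch MAC mismatch: tampering detected");
                }
                key = stub_mac(key, b"evolve"); // key evolves once per batch
                pending.clear();
                unsealed = 0;
            }
        }
    }
    Ok(unsealed)
}
```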
The per-category HMAC policy is set via /proc/umka/audit/hmac_policy.
The boot_secret is derived from: (a) TPM-sealed entropy if a TPM is present (preferred),
or (b) hardware RNG (RDRAND/RNDR/platform-specific entropy source) collected during early
boot if no TPM. There is no static fallback key embedded in the kernel image, because a
key readable from the kernel binary would defeat tamper evidence (any attacker with access
to the image could forge the HMAC chain). On systems without both TPM and hardware RNG,
the audit subsystem generates a random seed from whatever entropy is available at boot
(interrupt timing jitter, memory contents) and stores it only in kernel memory. This seed
is lost on reboot, which means the HMAC chain cannot be verified across reboots without a
TPM. However, within a single boot the chain provides full tamper evidence, and the
inability to verify across reboots is an acceptable tradeoff: it prevents an attacker who
obtains the kernel image from forging audit records. For offline log verification, the
SHA-256 hash of the boot_secret (not the secret itself) is recorded in the first audit
record of each boot, allowing a verifier with the original secret to replay the HMAC chain.
Each CPU's HMAC chain starts from a known initial value at boot, seeded with the CPU ID to ensure chains are distinguishable. The key material is stored in umka-core kernel memory, protected by the kernel's core isolation domain (inaccessible to drivers and all userspace). On x86, this corresponds to PKEY 0; on other architectures, equivalent protection is provided by the platform's isolation mechanism (Section 10.2). Even a compromised Tier 2 driver cannot read or forge the audit HMAC.
Tier 1 caveat: Tier 1 drivers run in Ring 0 and can in principle execute the platform's domain-switch instruction (e.g., WRPKRU on x86) to access PKEY 0 memory. This is a documented tradeoff of the Tier 1 trust model (Section 10.2): Tier 1 drivers are considered trusted for integrity but isolated for fault containment. The "inaccessible to drivers" guarantee above applies fully to Tier 2 drivers (Ring 3, hardware-enforced) and provides fault-isolation-grade protection against Tier 1 drivers (a malicious Tier 1 driver that deliberately bypasses isolation is outside the threat model — such a driver would be promoted to Tier 0 or rejected).
Forward secrecy. In strict HMAC mode, the key evolves after every record: key[n]
is derived from key[n-1], and key[n-1] is securely erased (zeroed) immediately after
computing key[n]. In batched HMAC mode, the key evolves after every batch (not every
record), so forward secrecy granularity is per-batch rather than per-record. In both
modes, forward secrecy applies to live key material in kernel memory: compromising
the current in-memory HMAC key does not allow forging past records (or past batches),
because previous keys have been erased from memory. An attacker who captures the current
key can forge records from that point onward but cannot reconstruct earlier keys.
Forward secrecy does not conflict with post-crash verification, because verification
is performed by an offline verifier that holds the original boot_secret, not by the
running kernel. The verifier derives the full key sequence deterministically from
boot_secret (the same KDF used during recording: key[0] = HMAC-SHA256(boot_secret,
cpu_id || "audit-chain-v1"), key[n] = HMAC-SHA256(key[n-1], "evolve")). Erased
in-memory keys are not recoverable from the running kernel; boot_secret is the
durable verification root. On TPM-equipped systems with HMAC key checkpointing
enabled (see below), derived chain keys may be written to persistent storage in
TPM-encrypted form — forward secrecy in that configuration is bounded by the
checkpoint interval (at most 1,000 derivations). On systems without checkpointing,
no derived keys are ever persisted. On TPM-equipped systems the boot_secret is TPM-sealed
(see "TPM-sealed audit key" below) and is the only secret that must be protected for
offline verification. On non-TPM systems, the SHA-256 hash of the boot_secret is
recorded in the first audit record so verifiers can confirm they hold the correct secret
before replaying the chain.
HMAC key checkpointing: To prevent O(N) re-derivation after a crash, the current
HMAC chain key is checkpointed to persistent storage every 1,000 derivations (or
every 60 seconds, whichever comes first). The checkpoint is encrypted with a TPM-sealed
key (Section 8.2) and stored alongside the audit log. On crash recovery, re-derivation
starts from the most recent checkpoint rather than the root — at most 1,000 HMAC
derivations (~50 μs at ~50 ns per HMAC) rather than potentially millions. The
checkpoint interval is configurable via audit.hmac_checkpoint_interval. The
checkpoint itself is a single 64-byte write (32-byte key + 32-byte HMAC of the key
for integrity) — negligible I/O overhead.
Crash recovery. If the system crashes mid-chain (e.g., between writing a record's data and computing its HMAC), the HMAC chain must be resumable. On boot, the audit subsystem performs the following recovery procedure for each CPU's persisted chain:
- Read the last complete record (one with a valid HMAC field) from the persisted log.
- Re-derive key[record_seq] from the boot_secret (which is either TPM-unsealed or re-entered by the administrator for offline verification), and verify that record's HMAC. The boot_secret is the durable root; in-memory keys that were erased during normal operation can always be re-derived from it.
- If verification succeeds, the chain is intact up to that record. Any data following the last valid HMAC is treated as an incomplete record: it is preserved in the log but tagged with a special AUDIT_CRASH_TRUNCATED sentinel (hmac field set to all 0xFF bytes) and a flag indicating the record is unverified.
- The new boot starts a fresh HMAC chain (new boot_secret, new key[0] per CPU). The first record of the new chain includes a back-reference to the last verified record of the previous boot (boot_id, cpu_id, sequence number) so that verifiers can link chains across reboots.
Records with the AUDIT_CRASH_TRUNCATED sentinel are excluded from HMAC chain
verification but remain in the log for forensic analysis. Log consumers and the
umka-audit-verify tool recognize the sentinel and report "N crash-truncated records
found" rather than treating them as tampering.
Verification. Any consumer of the audit log can verify each per-CPU chain
independently by re-deriving the key sequence from boot_secret and replaying the
HMAC computation over that CPU's stored records. A mismatch at position N in CPU C's
chain proves that record N (or a predecessor on that CPU) was modified or deleted after
writing. The per-CPU design means a single CPU's chain can be verified without needing
records from other CPUs. The umka-audit-verify tool accepts the boot_secret (or,
on TPM systems, unseals it automatically) and performs the full chain replay offline.
Threat model refinement. The HMAC chain prevents retroactive tampering by unprivileged code or compromised Tier 2 drivers — they cannot read the HMAC key from Core memory. A Tier 0/Core compromise (full kernel compromise) CAN read the key and rewrite history. This is an intentional limitation: UmkaOS's audit system provides a strong integrity guarantee against all threats SHORT OF a full kernel compromise. A full kernel compromise is out of scope for audit integrity (if the kernel is compromised, the attacker can modify anything). For regulated environments requiring hardware-backed audit integrity even against kernel compromise, use the TPM-backed audit root described below.
TPM-backed audit root (when TPM 2.0 present). When a TPM 2.0 is available, the audit subsystem uses hardware-backed key management that elevates the integrity guarantee beyond what software-only HMAC chains can provide:
- Hardware entropy: At boot, the initial HMAC key key[0] is generated by TPM2_GetRandom() (not by the kernel's CSPRNG) — a hardware entropy source. This ensures the key is not derivable from any software-observable state.
- PCR-sealed key storage: key[0] is immediately sealed to a TPM PCR policy: TPM2_Create(parent=srk, sensitive=key[0], policy=PCR_7_AND_PCR_10). The plaintext is overwritten (zeroed) after sealing. PCR 7 covers the Secure Boot chain (firmware, bootloader, kernel image); PCR 10 covers the IMA measurement log (all executed code and loaded drivers). If an attacker modifies the kernel or early boot components, the audit key becomes unavailable — the system cannot produce valid audit HMACs, making the compromise visible to remote attestation.
- Session-bound unseal: The kernel only holds the sealed blob. To compute HMACs, it calls TPM2_Unseal(), which returns the key only if PCRs 7 and 10 match (verified boot + IMA measurement log intact). The unseal call happens once at boot, and the key is held in a TPM-protected session (TPM2_StartAuthSession with the tpmKey parameter) bound to the audit drain thread. The session is invalidated on any PCR change, preventing a runtime attacker from continuing to use the key after compromising the boot chain.
- Hardware monotonic counter: The TPM's hardware monotonic counter (TPM2_NV_Increment on a counter index) is incremented on each audit epoch boundary (every 1000 records), providing a tamper-evident record count that survives reboot. An attacker who replays an old audit log cannot produce a monotonic counter value consistent with the current epoch. The counter value is included in the epoch's batch seal record.
- Export with TPM quote: When userspace calls ioctl(AUDIT_EXPORT), the kernel includes a TPM quote (TPM2_Quote over PCRs 7, 10, and the audit counter index) alongside the audit records. An external verifier can check: PCR values -> boot chain integrity -> audit chain validity. The quote is signed by the TPM's attestation key (AK), which chains to the TPM's endorsement key (EK) and the manufacturer's certificate. This provides a hardware-rooted proof that the audit records were produced by a verified kernel on verified hardware.
Append-only enforcement without TPM. In environments without TPM (containers, VMs,
embedded systems), the audit records are cryptographically chained (each record includes
the HMAC of the previous record). While an attacker with kernel access can recompute the
chain, they cannot produce records with timestamps that predate records already exported
to an external syslog server. Configure auditd to stream records to a remote syslog in
near-real-time (< 1s buffer) via the NETLINK_AUDIT socket (Section 19.2.9.5) for
effective append-only semantics in practice. The combination of HMAC chains (detecting
tampering between export intervals) and near-real-time export (minimizing the window for
undetected tampering) provides a practical integrity guarantee for non-TPM environments.
19.2.9.5 Linux Compatibility
UmkaOS exports audit records in formats that standard Linux audit tools understand, so existing security infrastructure works without modification.
auditctl / ausearch / aureport. UmkaOS translates its audit records to Linux audit
format (type=SYSCALL msg=audit(...), type=AVC, type=USER_AUTH, etc.) and delivers
them to userspace via the NETLINK_AUDIT socket (see below). The kernel never writes
to /var/log/audit/audit.log directly — that is the responsibility of the userspace
audit daemon (auditd or go-audit), which receives records from the netlink socket
and handles persistence. This avoids circular dependencies (audit subsystem depending
on VFS, block I/O, etc.) and maintains kernel/userspace separation. The auditctl
command sets filter rules, which are internally compiled to eBPF audit policy programs.
audit netlink socket. UmkaOS implements the NETLINK_AUDIT protocol so that auditd or
go-audit can receive events in real time. The translation layer maps UmkaOS capability events
to the closest Linux audit message types (e.g., capability denial maps to type=AVC).
journald integration. Audit records are also forwarded to the systemd journal with
structured fields (_AUDIT_TYPE=, _AUDIT_ID=, OBJECT_PID=, etc.), making them
queryable via journalctl:
journalctl _AUDIT_TYPE=AVC --since "5 minutes ago"
journalctl _AUDIT_TYPE=USER_AUTH _HOSTNAME=prod-web-01
syslog forwarding. For centralized log collection (Splunk, Elasticsearch, Graylog),
audit records can be forwarded over syslog (RFC 5424) with configurable facility and
severity mapping via /etc/umka/audit.conf.
19.3 Debugging and Process Inspection
Inspired by: Solaris/illumos mdb, Linux ptrace, seL4 capability-gated debugging. IP status: Clean — ptrace is a standard POSIX interface; capability-gating access control is a design policy, not a patentable mechanism.
19.3.1 Capability-Gated ptrace
Linux problem: ptrace is a powerful but coarse-grained tool. A single
PTRACE_ATTACH call gives the debugger complete control over the target process:
read and write arbitrary memory, read and write registers, inject signals, single-step
instructions, intercept syscalls. Access control is limited to UID checks and the
optional Yama LSM (/proc/sys/kernel/yama/ptrace_scope). In container environments,
ptrace across namespace boundaries is blocked by user namespace checks, but within a
namespace any process with the same UID (or CAP_SYS_PTRACE) can attach to any other.
There is no way to grant partial debug access — it is all or nothing.
UmkaOS design: Each ptrace operation requires a specific combination of capabilities. The debugger must hold explicit capability tokens scoped to the target process and the operations it intends to perform.
| Operation | Required Capability |
|---|---|
| PTRACE_ATTACH / PTRACE_SEIZE | CAP_DEBUG on target process |
| PTRACE_CONT / PTRACE_DETACH | CAP_DEBUG on target process |
| PTRACE_KILL | CAP_DEBUG + CAP_KILL (or same UID) on target process |
| PTRACE_INTERRUPT | CAP_DEBUG on target process |
| PTRACE_PEEKDATA / PTRACE_POKEDATA | CAP_DEBUG + READ or WRITE on target address space |
| PTRACE_PEEKTEXT / PTRACE_POKETEXT | CAP_DEBUG + READ or WRITE on target address space (aliases for PEEKDATA/POKEDATA — UmkaOS uses a unified address space) |
| PTRACE_GETSIGINFO / PTRACE_SETSIGINFO | CAP_DEBUG + READ or WRITE on target address space |
| PTRACE_GETREGS / PTRACE_SETREGS | CAP_DEBUG + READ or WRITE on register state |
| PTRACE_GETFPREGS / PTRACE_SETFPREGS | CAP_DEBUG + READ or WRITE on register state |
| PTRACE_SINGLESTEP | CAP_DEBUG + EXECUTE control |
| PTRACE_SYSCALL | CAP_DEBUG + SYSCALL_TRACE |
// umka-core/src/debug/ptrace.rs
/// ptrace request type, corresponding to Linux's PTRACE_* constants.
/// Passed to the `ptrace(2)` syscall as the `request` argument.
#[repr(u32)]
pub enum PtraceRequest {
/// Read a word from tracee's memory at `addr`. Returns the word value.
PeekData = 2,
/// Read a word from tracee's text segment at `addr`.
PeekText = 1,
/// Write a word to tracee's memory at `addr`.
PokeData = 5,
/// Write a word to tracee's text segment at `addr`.
PokeText = 4,
/// Get the tracee's general-purpose registers.
GetRegs = 12,
/// Set the tracee's general-purpose registers.
SetRegs = 13,
/// Get the tracee's floating-point registers.
GetFpRegs = 14,
/// Set the tracee's floating-point registers.
SetFpRegs = 15,
/// Attach to the tracee, stopping it with SIGSTOP.
Attach = 16,
/// Detach from the tracee, optionally delivering a signal.
Detach = 17,
/// Single-step the tracee: deliver SIGTRAP after the next instruction.
SingleStep = 9,
/// Continue the tracee, optionally delivering a signal.
Cont = 7,
/// Send SIGKILL to the tracee.
Kill = 8,
/// Resume tracee, stopping at next syscall entry/exit.
Syscall = 24,
/// Get the siginfo_t for the tracee's current stop signal.
GetSigInfo = 0x4202,
/// Set the siginfo_t for the tracee's current stop signal.
SetSigInfo = 0x4203,
/// Attach without stopping (PTRACE_SEIZE, Linux 3.4+).
Seize = 0x4206,
/// Interrupt a PTRACE_SEIZE'd tracee (PTRACE_INTERRUPT, Linux 3.4+).
Interrupt = 0x4207,
    // Resuming after PTRACE_SEIZE reuses `Cont` (request value 7); it is only
    // valid once the tracee has been seized, so no separate request value is
    // defined here.
}
/// Validate that the calling process holds sufficient capabilities
/// for the requested ptrace operation on the target.
fn check_ptrace_cap(
caller: &Process,
target: &Process,
request: PtraceRequest,
) -> Result<(), CapError> {
// Caller must hold CAP_DEBUG scoped to the target's object ID.
let debug_cap = caller.cap_table.lookup(
target.object_id(),
PermissionBits::DEBUG,
)?;
// Additional permission checks based on the operation.
match request {
PtraceRequest::PeekData { .. } | PtraceRequest::PeekText { .. }
| PtraceRequest::GetSigInfo => {
caller.cap_table.check(debug_cap, PermissionBits::READ)?;
}
PtraceRequest::PokeData { .. } | PtraceRequest::PokeText { .. }
| PtraceRequest::SetSigInfo { .. } => {
caller.cap_table.check(debug_cap, PermissionBits::WRITE)?;
}
PtraceRequest::GetRegs | PtraceRequest::GetFpRegs => {
caller.cap_table.check(debug_cap, PermissionBits::READ)?;
}
PtraceRequest::SetRegs { .. } | PtraceRequest::SetFpRegs { .. } => {
caller.cap_table.check(debug_cap, PermissionBits::WRITE)?;
}
PtraceRequest::SingleStep => {
caller.cap_table.check(debug_cap, PermissionBits::EXECUTE)?;
}
PtraceRequest::Syscall => {
caller.cap_table.check(debug_cap, PermissionBits::SYSCALL_TRACE)?;
}
PtraceRequest::Kill => {
caller.cap_table.check(debug_cap, PermissionBits::KILL)?;
}
PtraceRequest::Attach | PtraceRequest::Seize
| PtraceRequest::Cont | PtraceRequest::Detach
| PtraceRequest::Interrupt => {
// CAP_DEBUG alone is sufficient for these operations.
}
}
Ok(())
}
Namespace isolation: The CAP_DEBUG capability is scoped to the target's namespace.
A debugger in namespace A cannot debug a process in namespace B unless it holds a
CAP_DEBUG token that was explicitly delegated across the namespace boundary (using
the standard capability delegation mechanism from Section 8.1). Cross-namespace debugging
requires both a CAP_DEBUG on the target and a CAP_NS_TRAVERSE on every intermediate
namespace. This makes container breakout via ptrace structurally impossible — there is
no ambient authority to override.
seccomp interaction: When a debugger attaches to a seccomp-sandboxed process, the
sandbox remains in effect. The debugger can observe syscalls (with PTRACE_SYSCALL) but
cannot inject syscalls that the target's seccomp filter would deny. This prevents a
class of attacks where a debugger is used to bypass seccomp restrictions.
Domain isolation interaction: As described in Section 10.2, ptrace reads/writes to domain-protected memory go through the kernel-mediated PKRU path. The debugger never gains direct access to the target's isolation domain. Capability checks happen before the kernel performs the PKRU switch.
19.3.2 Hardware Debug Registers
Each architecture provides hardware breakpoint and watchpoint registers. UmkaOS exposes
these through the ptrace interface, with debug register state saved and restored as
part of ArchContext on every context switch.
| Architecture | HW Breakpoints | HW Watchpoints | Mechanism |
|---|---|---|---|
| x86-64 | 4 total (DR0-DR3, shared) | shared with breakpoints | DR7 configures each of DR0-DR3 as breakpoint or watchpoint (4 total in any combination); DR6 reports status |
| AArch64 | 2-16 (implementation defined) | 2-16 (implementation defined) | DBGBCR/DBGBVR, DBGWCR/DBGWVR |
| ARMv7 | 2-16 (implementation defined) | 2-16 (implementation defined) | DBGBCR/DBGBVR via cp14 |
| RISC-V | via trigger module | via trigger module | tselect, tdata1-tdata3 CSRs |
| PPC32 | 1 (IAC1) | 1-2 (DAC1, DAC2) | IAC/DAC SPRs, DBCR0/DBCR1 control |
| PPC64LE | 1 (CIABR) | 1 (DAWR0), 2 on POWER10 (DAWR0/1) | CIABR, DAWR0/DAWRX0 SPRs |
// umka-core/src/arch/x86_64/context.rs (excerpt)

/// x86-64 debug register state, saved/restored on context switch.
#[repr(C)]
pub struct DebugRegState {
    /// Address breakpoint registers.
    pub dr0: u64,
    pub dr1: u64,
    pub dr2: u64,
    pub dr3: u64,
    /// Debug status register (read on debug exception, cleared after).
    pub dr6: u64,
    /// Debug control register (enables breakpoints, sets conditions).
    pub dr7: u64,
}
// umka-core/src/arch/aarch64/context.rs (excerpt)
// Per ARM DDI 0487 (ARM Architecture Reference Manual):
// ID_AA64DFR0_EL1.BRPs field + 1 gives the actual count; 16 is the architectural maximum.
// (DBGDIDR is the AArch32 equivalent; AArch64 uses ID_AA64DFR0_EL1.)
const MAX_HW_BREAKPOINTS: usize = 16; // AArch64: up to 16 (ID_AA64DFR0_EL1.BRPs + 1)
const MAX_HW_WATCHPOINTS: usize = 16; // AArch64: up to 16 (ID_AA64DFR0_EL1.WRPs + 1)
// x86-64: DR0-DR3 = 4 breakpoints or 4 watchpoints (same registers, different DR7 config).
// Runtime: query the actual count via ID_AA64DFR0_EL1 on AArch64; assume 4 on x86-64.

/// AArch64 debug register state. The number of breakpoint/watchpoint
/// register pairs is discovered at boot via ID_AA64DFR0_EL1.
pub struct DebugRegState {
    /// Breakpoint control/value register pairs.
    pub bcr: [u32; MAX_HW_BREAKPOINTS],
    pub bvr: [u64; MAX_HW_BREAKPOINTS],
    /// Watchpoint control/value register pairs.
    pub wcr: [u32; MAX_HW_WATCHPOINTS],
    pub wvr: [u64; MAX_HW_WATCHPOINTS],
    /// Actual number of pairs available on this CPU.
    pub num_brps: u8,
    pub num_wrps: u8,
}
Debug register state is part of ArchContext (Section 7.1.1) and is saved/restored on
every context switch. When a hardware breakpoint or watchpoint fires, the CPU raises
a debug exception: #DB on x86-64, a Breakpoint exception (EC 0x30/0x31) on AArch64,
and a breakpoint exception via the trigger module on RISC-V. Hardware watchpoints on
AArch64 instead raise a Watchpoint exception (EC 0x34/0x35).
The exception handler checks whether the faulting thread is being ptraced. If so, the
event is delivered to the debugger as a SIGTRAP with si_code set to
TRAP_HWBKPT. If the thread is not being debugged, the signal is delivered directly
to the process (default action: terminate with core dump).
19.3.3 Core Dump Generation
On receipt of a fatal signal (SIGSEGV, SIGABRT, SIGFPE, SIGBUS, SIGILL,
SIGSYS), the kernel generates an ELF core dump before terminating the process.
Contents of a core dump:
- Register state (general-purpose, floating-point, vector, debug registers)
- Memory mappings (VMA list with permissions, file backing, offsets)
- Writable memory segments (stack, heap, anonymous mappings)
- Signal information (`siginfo_t` for the fatal signal)
- Auxiliary vector (`AT_*` entries)
- Thread list with per-thread register state (for multi-threaded processes)
Capability gating: Core dump generation requires write access to the dump
destination. The process's capability set must include WRITE on the target path
(a filesystem location) or WRITE on the pipe to a handler program (configured via
/proc/sys/kernel/core_pattern, same as Linux). If the process does not hold the
required capability, no dump is written and the kernel logs a diagnostic message.
Core dump filter: A per-process bitmask controls which VMA types are included,
compatible with Linux's /proc/pid/coredump_filter:
| Bit | VMA Type | Default |
|---|---|---|
| 0 | Anonymous private | on |
| 1 | Anonymous shared | on |
| 2 | File-backed private | off |
| 3 | File-backed shared | off |
| 4 | ELF headers | on |
| 5 | Private huge pages | on |
| 6 | Shared huge pages | off |
| 7 | Private DAX pages | off |
| 8 | Shared DAX pages | off |
Compressed core dumps: When umka.coredump_compress=zstd is set (boot parameter
or runtime sysctl), core dumps are compressed with zstd at level 3 before writing.
For large processes (multi-GB heaps), this reduces dump size by 5-10x and reduces I/O
time, making core dumps practical in production. The resulting file has a standard
zstd frame that tools can decompress before loading into GDB.
19.3.4 Kernel Debugging and Crash Dumps
Kernel panic handler: On kernel panic, the handler (Section 3.2.3, Tier 0 code) captures a comprehensive snapshot of system state:
- Register state of all CPUs — The panicking CPU sends an IPI (or NMI on x86-64 if IPIs are not functioning) to all other CPUs. Each CPU saves its register state to a per-CPU crash save area and halts.
- Kernel stack — The faulting CPU's kernel stack is captured, with ORC-based unwinding (Section 19.7.9.1) to produce a symbolic backtrace.
- Kernel log buffer — The most recent 64 KB of the kernel ring buffer (printk output) is included.
- Capability table state — A summary of the capability table (number of entries, recent grants/revocations) for post-mortem security analysis.
- Driver registry state — Status of all registered drivers, including tier, device bindings, and crash counts.
The panic handler writes all of this into the reserved crash region as an ELF core dump (see Section 10.8 for crash recovery and the NVMe polled write-out path).
kdump equivalent: For systems that require maximum crash dump reliability, UmkaOS
supports reserving a crash kernel memory region at boot (umka.crashkernel=256M). On
panic, kexec loads the crash kernel, which boots into a minimal environment with a
single purpose: write the dump to persistent storage and reboot. The crash kernel is
stripped to the minimum — serial driver (Tier 0), block driver (Tier 0, polled mode),
ELF writer, and nothing else. No scheduler, no interrupts, no capability system.
GDB remote stub: Development builds can enable a built-in GDB remote protocol stub
(umka.gdb=serial or umka.gdb=net). This provides full kernel debugging over a
serial port or UDP connection:
- Set breakpoints in kernel code (software breakpoints via `int3`/`brk`/`ebreak`)
- Single-step through kernel execution paths
- Read and write kernel memory and registers
- Inspect per-CPU state, thread lists, and capability tables
- Attach to a running kernel or halt at boot (`umka.gdb_wait=1`)
The GDB stub is compiled out of release builds (#[cfg(feature = "gdb-stub")]). It
is never present in production kernels — this is a development-only facility.
Required GDB RSP (Remote Serial Protocol) packets. The stub implements the following packet set, which provides full-featured kernel debugging compatible with unmodified GDB clients (including IDE integrations such as CLion, VS Code, and Eclipse CDT):
| Packet | Purpose | Notes |
|---|---|---|
| `?` | Halt reason | Returns `S05` (SIGTRAP) or `S09` (SIGKILL) |
| `g` / `G` | Read/write all registers | Architecture-specific register set (x86-64: 24 regs, AArch64: 34 regs, RISC-V: 33 regs) |
| `p n` / `P n=r` | Read/write single register | Register number `n` is arch-specific; GDB target descriptions declare the mapping |
| `m addr,len` / `M addr,len:data` | Read/write memory | Kernel virtual addresses; physical memory access via `qRcmd` monitor commands |
| `c [addr]` / `s [addr]` | Continue / single-step | Single-step uses hardware debug facilities: x86-64 `EFLAGS.TF`, AArch64 `MDSCR_EL1.SS`, RISC-V `dcsr.step` |
| `Z0,addr,len` / `z0,addr,len` | Software breakpoint insert/remove | Replaces the instruction with `int3` (x86-64) / `brk #0` (AArch64) / `ebreak` (RISC-V); original bytes saved in a per-CPU breakpoint table |
| `Z1,addr,len` / `z1,addr,len` | Hardware breakpoint insert/remove | Uses debug registers: x86-64 DR0-DR3 with DR7 condition bits, AArch64 BVR0-BVR15/BCR0-BCR15, RISC-V trigger module tdata1/tdata2 |
| `Z2,addr,len` / `z2,addr,len` | Write watchpoint insert/remove | x86-64: DR0-DR3 with DR7 W condition; AArch64: WVR/WCR; RISC-V: trigger module with store match |
| `Z3,addr,len` / `z3,addr,len` | Read watchpoint insert/remove | x86-64: DR0-DR3 with DR7 R condition; AArch64: WVR/WCR with load match; RISC-V: trigger module with load match |
| `Z4,addr,len` / `z4,addr,len` | Access watchpoint (read+write) insert/remove | x86-64: DR0-DR3 with DR7 RW condition; AArch64/RISC-V: combined load+store match |
| `qSupported` | Feature negotiation | Reports `PacketSize=4096`, `vContSupported+`, `swbreak+`, `hwbreak+`, `qXfer:features:read+` |
| `vCont` | Extended continue/step | `vCont;c` continues all CPUs; `vCont;s:N` single-steps CPU N while others remain halted |
| `Hg tid` / `Hc tid` | Set thread for subsequent register/continue ops | tid = CPU number + 1 (GDB thread IDs are 1-based); `Hg 0` selects the CPU that hit the breakpoint |
| `qfThreadInfo` / `qsThreadInfo` | Enumerate threads (CPUs) | One "thread" per online CPU; `qfThreadInfo` returns the first batch, `qsThreadInfo` subsequent batches, terminated by `l` |
| `qC` | Current thread | Returns the CPU that triggered the debug event (breakpoint, watchpoint, or single-step trap) |
| `qRcmd,hex` | Remote monitor command | Supported commands: `reset` (warm reboot), `panic` (trigger kernel panic for crash dump), `dump <addr> <len>` (hex dump of physical memory), `cpuinfo` (print per-CPU state summary) |
| `qXfer:features:read:target.xml` | Target description | Provides GDB with the architecture-specific register layout so register names, sizes, and types display correctly |
Multi-CPU handling. When one CPU hits a breakpoint or watchpoint, the GDB stub must halt all other CPUs to present a consistent view of kernel state:
- The trapping CPU disables its local interrupts and enters the GDB stub event loop.
- The stub sends an NMI (x86-64), FIQ (AArch64), or IPI (RISC-V, ARMv7) to all other online CPUs with a `HALT_FOR_DEBUG` flag in the IPI payload.
- Each receiving CPU saves its full register state into a per-CPU `GdbCpuState` struct and spins in a wait loop polling an atomic `resume` flag.
- The GDB stub presents each CPU as a separate GDB thread (thread ID = CPU number + 1). `Hg N` selects CPU N for register reads/writes; `Hc -1` continues all CPUs when the user issues a continue command.
- On continue (`c` or `vCont;c`), the stub sets the `resume` flag for all CPUs. Each CPU restores its register state and returns to its interrupted context. The trapping CPU resumes last, after re-enabling its breakpoint.
// umka-core/src/debug/gdb_stub.rs

/// Per-CPU register state saved when halted for GDB debugging.
/// Each online CPU saves its state here when it receives the
/// `HALT_FOR_DEBUG` IPI, and restores from here on resume.
pub struct GdbCpuState {
    /// Full register file (architecture-specific).
    /// x86-64: 24 registers (RAX-R15, RIP, RFLAGS, CS, SS, DS, ES, FS, GS).
    /// AArch64: X0-X30, SP, PC, PSTATE.
    /// RISC-V: x0-x31, pc.
    pub regs: SavedRegs,
    /// The CPU's current halt reason, or `None` if it was halted by IPI
    /// rather than by hitting a breakpoint/watchpoint itself.
    pub halt_reason: Option<GdbHaltReason>,
    /// Atomic flag polled by the halted CPU. The GDB stub sets this to
    /// `true` when the user issues a continue/step command.
    pub resume: AtomicBool,
    /// If single-stepping, this is `true` and the CPU will re-enter the
    /// stub after executing one instruction.
    pub single_step: bool,
}

/// Reason a CPU entered the GDB stub.
pub enum GdbHaltReason {
    /// Software breakpoint (`int3` / `brk` / `ebreak`).
    SwBreakpoint { addr: u64 },
    /// Hardware breakpoint (debug register match on instruction fetch).
    HwBreakpoint { addr: u64 },
    /// Write watchpoint (debug register match on store).
    WriteWatchpoint { addr: u64 },
    /// Read watchpoint (debug register match on load).
    ReadWatchpoint { addr: u64 },
    /// Access watchpoint (debug register match on load or store).
    AccessWatchpoint { addr: u64 },
    /// Single-step trap (TF flag / SS bit / dcsr.step).
    SingleStep,
    /// Halted by IPI from another CPU's debug event.
    HaltedByIpi,
}
Transport layer. The GDB stub communicates over either a serial port
(umka.gdb=serial, default baud 115200, 8N1) or a UDP socket
(umka.gdb=net,addr=10.0.0.1,port=1234). The serial transport uses the
standard GDB RSP framing: $packet-data#checksum, with +/-
acknowledgment. The UDP transport uses the same framing but without
acknowledgment (UDP is already checksummed; retransmission is handled at
the GDB client level). The stub processes one packet at a time in a
polling loop — no interrupts or DMA are used for transport I/O while the
kernel is halted, ensuring deterministic behavior.
Driver crash debugging: When a Tier 1 driver crashes, the fault handler (Section 19.1) captures the crash context before initiating recovery. The captured state includes:
- Driver thread register state (all register classes)
- Isolation domain state (`PKRU` value, domain assignment)
- Driver-private memory snapshot (pages in the driver's isolation domain, up to a configurable limit, default 4 MB)
- Recent ring buffer entries from the driver's communication channels
- IOMMU mapping state for the driver's devices
This context is written to:
/sys/kernel/umka/drivers/{name}/crash_dump
The file persists until the next driver load or until explicitly cleared. Tools like
umka-crashdump or GDB (with UmkaOS-aware scripts) can parse the dump for root cause
analysis without reproducing the crash.
19.3.5 /proc/pid Interface
UmkaOS provides compatibility with the Linux /proc/pid interface that debuggers,
profilers, and monitoring tools depend on. Each entry is capability-checked
individually.
| Path | Content | Capability Required |
|---|---|---|
| `/proc/pid/maps` | Memory mappings (address, perms, offset, device, inode, path) | `CAP_DEBUG` or same-process |
| `/proc/pid/mem` | Process memory (seek + read/write) | `CAP_DEBUG` + `READ`/`WRITE` |
| `/proc/pid/status` | Task state, memory usage, capability summary | None (public fields) or `CAP_DEBUG` (private fields) |
| `/proc/pid/stack` | Kernel stack trace | `CAP_DEBUG` + `KERNEL_READ` |
| `/proc/pid/syscall` | Current syscall number and arguments | `CAP_DEBUG` |
| `/proc/pid/wchan` | Wait channel (function name where task is sleeping) | None |
| `/proc/pid/coredump_filter` | Core dump VMA filter bitmask | Same-process or `CAP_DEBUG` |
Per-access capability checking: Unlike Linux, where /proc/pid/mem is checked at
open() time and then freely readable, UmkaOS checks capabilities on every read() and
write() call. This eliminates TOCTOU vulnerabilities where a capability is revoked
between open and access — a revoked CAP_DEBUG takes effect immediately, even on
already-open file descriptors.
// umka-compat/src/procfs/mem.rs

/// Read handler for /proc/pid/mem.
/// Capability is checked on every read, not just on open.
fn proc_pid_mem_read(
    file: &ProcFile,
    buf: &mut [u8],
    offset: u64,
) -> Result<usize, IoError> {
    let caller = current_process();
    let target = file.target_process();
    // Re-check capability on every access (not cached from open).
    if caller.pid() != target.pid() {
        caller.cap_table.lookup(
            target.object_id(),
            PermissionBits::DEBUG | PermissionBits::READ,
        ).map_err(|_| IoError::PermissionDenied)?;
    }
    // Perform the read through the kernel's memory access path.
    // For domain-protected memory, this goes through the PKRU
    // mediation path (Section 10.2).
    target.address_space().read_remote(offset, buf)
}
Public vs. private fields in /proc/pid/status: Fields like State, Pid,
PPid, Uid, Gid, and VmSize are considered public and readable without
CAP_DEBUG (same as Linux). Fields like VmPeak, VmData, VmStk, CapInh,
CapPrm, CapEff, Seccomp, and voluntary_ctxt_switches are private and require
CAP_DEBUG on the target. This prevents information leakage that could aid
side-channel or timing attacks while preserving compatibility with tools like ps and
top that only read public fields.
19.4 Unified Object Namespace
Inspired by: Windows NT Object Manager, Plan 9 namespace concepts. IP status: Clean — basic OS design concept from Multics (1960s), any NT patents expired (filed 1989-1993, expired 2009-2013).
19.4.1 Problem
Linux organizes kernel resources through multiple unrelated mechanisms:
- Files/sockets/pipes → file descriptors (integer indices into a per-process table)
- Processes → PIDs (global integer namespace)
- Signals → signal numbers (per-process bitmask)
- IPC → SysV IPC keys, POSIX named semaphores, futex addresses
- Devices → `/dev` nodes (major/minor numbers)
- Kernel tunables → `/proc/sys` (sysctl)
- Device tree → `/sys` (sysfs)
- Timer resources → timerfd, POSIX timers (separate handle space)
- Event notification → eventfd, epoll (yet another handle space)
There is no unified way to enumerate "all kernel resources held by process X" or "all kernel resources related to device Y." Each subsystem has its own introspection mechanism (or none at all).
UmkaOS already has a capability system (Section 8.1) where every resource is accessed through capability tokens. The object namespace makes this explicit and queryable — a hierarchical tree where every kernel object has a canonical path and uniform access control.
19.4.2 Design: Kernel-Internal Object Tree
// umka-core/src/namespace/mod.rs (kernel-internal)

// Note: `HashMap` and `Vec` in these definitions are kernel-internal equivalents
// (slab-backed hash table and bounded array, respectively), not `std` types. The
// kernel uses `SlabHashMap` (slab-allocated, interrupt-safe, with RCU-protected
// lookup) and `BoundedVec` (capacity-limited, slab-allocated). The `std`-style
// type names are used here for readability.

/// Every kernel resource is an Object.
pub struct Object {
    /// Unique object ID (monotonically increasing, never reused).
    pub id: ObjectId,
    /// Object type (from existing cap/mod.rs ObjectType).
    pub object_type: ObjectType,
    /// Reference count.
    pub refcount: AtomicU32,
    /// Capability security descriptor.
    pub security: SecurityDescriptor,
    /// Type-specific data (union).
    pub data: ObjectData,
}

/// The namespace tree.
pub struct ObjectNamespace {
    /// Root directory of the namespace.
    root: ObjectDirectory,
    /// Sparse index for O(1) average-case lookup by ObjectId.
    ///
    /// ObjectIds are monotonically increasing and never reused, so a dense
    /// array indexed by ObjectId would grow without bound as objects are
    /// created and destroyed over the system's lifetime (e.g., a long-running
    /// server creates millions of short-lived process objects). A hash map
    /// provides O(1) average lookup while using memory proportional to the
    /// number of *live* objects, not to the total number ever allocated.
    ///
    /// The key is the ObjectId's numeric value. Entries are inserted on
    /// registration and removed when the object's refcount reaches zero.
    index: HashMap<u64, *mut Object>,
}

/// A directory in the namespace — contains named references to objects.
pub struct ObjectDirectory {
    /// This directory's object identity.
    object: Object,
    /// Named entries (sorted for binary search). For directories with fewer than
    /// 4096 entries, a sorted Vec provides cache-friendly binary search and compact
    /// memory layout. Directories exceeding 4096 entries (e.g., /Processes on busy
    /// servers) switch to a BTreeMap internally for O(log n) insertion, avoiding
    /// the O(n) element-shift cost of Vec insertion at scale.
    entries: Vec<(ArrayString<64>, ObjectEntry)>,
}

pub enum ObjectEntry {
    /// Direct reference to an object.
    Object(ObjectId),
    /// Subdirectory.
    ///
    /// `Box` is required here: `ObjectEntry` is stored inside `ObjectDirectory`,
    /// which is stored inside `ObjectEntry::Directory`. Without indirection, the
    /// type would be infinitely sized and Rust would reject it at compile time.
    Directory(Box<ObjectDirectory>),
    /// Symbolic link to another path in the namespace.
    Symlink(ArrayString<256>),
}
19.4.3 Namespace Layout
/ (root)
+-- Devices (device registry mirror)
| +-- pci0000:00
| | +-- 0000:00:1f.2 (SATA controller)
| | +-- 0000:03:00.0 (NVMe)
| | +-- 0000:04:00.0 (NIC)
| +-- usb1
| | +-- usb1-1 (hub)
| | +-- usb1-1.1 (keyboard)
|
+-- Drivers (loaded driver instances)
| +-- umka-nvme (driver object)
| +-- umka-e1000 (driver object)
| +-- umka-xhci (driver object)
|
+-- Processes (process objects)
| +-- 1 (init/systemd)
| | +-- Threads
| | | +-- 1 (main thread)
| | | +-- 2 (worker thread)
| | +-- Handles (fd table: capabilities)
| | | +-- 0 (stdin - pipe)
| | | +-- 1 (stdout - tty)
| | | +-- 3 (socket)
| | +-- Memory (VMA tree)
| | +-- Capabilities (capability set)
| +-- 42 (some user process)
|
+-- Memory (physical memory regions)
| +-- Node0 (NUMA node 0)
| +-- Node1 (NUMA node 1)
|
+-- Network (network stack objects)
| +-- Interfaces
| | +-- eth0 (NIC)
| | +-- lo (loopback)
| +-- Sockets (open sockets)
|
+-- IPC (IPC endpoints)
| +-- Pipes
| +-- SharedMemory
| +-- Semaphores
|
+-- Security (security policy objects)
| +-- Capabilities (capability type registry)
| +-- LSM (security module state)
|
+-- Health (FMA — Section 19.1)
| +-- ByDevice
| | +-- 0000:03:00.0 (NVMe health)
| +-- RetiredPages
| +-- DiagnosisRules
|
+-- Scheduler (scheduler state)
| +-- RunQueues
| +-- CbsServers (Section 6.3)
|
+-- Tracing (Section 19.2)
+-- StableTracepoints
+-- AggregationMaps
19.4.4 What The Namespace Provides
Uniform enumeration: "Show me everything related to device 0000:03:00.0" is a
namespace traversal starting at /Devices/pci0000:00/0000:03:00.0, following links
to its driver instance (/Drivers/umka-nvme), its health data
(/Health/ByDevice/0000:03:00.0), and any processes with open handles to it.
Uniform security: Every object in the namespace has a SecurityDescriptor that ties
into the capability system. Access checks are uniform regardless of object type.
Uniform lifecycle: Objects are reference-counted. When refcount hits zero, the type-specific destructor runs. No per-subsystem cleanup code — the namespace manages object lifetime uniformly.
Cross-reference discovery: "What processes have handles to this device?" is a query
across /Processes/*/Handles/*, filtering by target ObjectId. Without the namespace,
this requires per-subsystem ad-hoc code.
19.4.5 Registration Strategy: Eager vs Lazy
Not all kernel objects are registered with equal urgency. High-frequency objects would add unacceptable overhead if registered on every creation:
Eagerly registered (created infrequently, high value for introspection):
- Devices, drivers, processes, NUMA nodes, cgroups, IPC endpoints, security policies.
- Registered at creation time, deregistered at destruction.
Lazily registered (created/destroyed at high frequency):
- File descriptors, sockets, VMAs (virtual memory areas), anonymous pages.
- The namespace entry is created on first query, not on creation. The namespace
maintains a per-process "generation counter"; when a query finds a stale generation,
it re-syncs from the kernel's authoritative data structures (fd table, VMA tree).
- This means /proc/<pid>/umka/objects may have brief inconsistencies (a just-closed
fd might still appear) but the hot path (open/close/read/write) has zero overhead.
Memory budget per eagerly-registered object:
| Component | Per Object | Notes |
|---|---|---|
| Object struct (fixed fields) | ~64 bytes | ID, type, refcount, security |
| Namespace entry (name + link) | ~80 bytes | ArrayString<64> + pointer |
| SlabHashMap index entry | ~16 bytes | Pointer + occupancy bit |
| Total per object | ~160 bytes | Sum of the above |
Typical system: ~2000 eagerly registered objects (devices + drivers + processes) = ~320 KB baseline. File descriptors are lazy, so they don't add to this baseline.
19.4.6 Linux Interface Exposure — Standard Mechanisms
The namespace is kernel-internal. Linux applications never see it. But UmkaOS-specific tools can access it through standard Linux interfaces as additive extensions:
Via procfs (new entries under /proc/umka/, additive):
/proc/umka/objects/
summary # Total object count by type
by_type/
Device # List of all device objects
Process # List of all process objects
FileDescriptor # List of all open FDs system-wide
Socket # List of all sockets
Capability # List of all capability grants
by_id/
<object_id> # Full details of a specific object
/proc/umka/namespace/
tree # Full namespace tree (text dump, similar to NT WinObj)
resolve/<path> # Resolve a namespace path to an object
/proc/<pid>/umka/
capabilities # Full capability set for this process
objects # All objects this process holds handles to
namespace_view # This process's view of the namespace
Via sysfs (additive attributes on existing device nodes):
/sys/devices/.../umka_object_id # Object ID in the namespace
/sys/devices/.../umka_refcount # Current reference count
/sys/devices/.../umka_capabilities # Capabilities granted for this device
Via a pseudo-filesystem (umkafs, mountable):
mount -t umkafs none /mnt/umka
# Now the object namespace is browsable as a filesystem:
ls /mnt/umka/Devices/pci0000:00/
cat /mnt/umka/Processes/42/Capabilities
cat /mnt/umka/Health/ByDevice/0000:03:00.0/status
This is the same pattern as Linux's debugfs, tracefs, configfs — a pseudo-
filesystem for kernel introspection. No new syscalls. Standard open/read/readdir.
Any Linux tool that can read files can introspect the namespace.
19.4.7 umkafs Detail
// umka-compat/src/umkafs/mod.rs

/// umkafs: pseudo-filesystem exposing the object namespace.
/// Mounted via: mount -t umkafs none /mountpoint
///
/// Read-only by default. Write access (for admin operations like
/// forcing a driver reload or revoking a capability) requires
/// CAP_SYS_ADMIN (mapped to UmkaOS's admin capability set).
pub struct UmkaFs {
    /// Reference to the kernel's object namespace.
    namespace: &'static ObjectNamespace,
    /// Mount options.
    options: UmkaFsMountOptions,
}

pub struct UmkaFsMountOptions {
    /// Show only objects matching this type filter.
    pub type_filter: Option<ObjectType>,
    /// Show only objects accessible to this UID.
    pub uid_filter: Option<u32>,
    /// Maximum depth of directory listing (avoid huge listings).
    pub max_depth: u32,
    /// Enable write operations (default: false).
    pub writable: bool,
}
umkafs file format for object details:
$ cat /mnt/umka/Devices/pci0000:00/0000:03:00.0
type: Device
id: 4217
refcount: 3
state: Active
driver: umka-nvme
tier: 1
bus: pci
vendor: 0x144d
device: 0xa808
class: 0x010802
numa_node: 0
health: ok
power: D0Active
capabilities_granted: 2
cap[0]: DMA_ACCESS (perms: READ|WRITE)
cap[1]: INTERRUPT (perms: MANAGE_IRQ)
handles_held_by:
process 1 (systemd): fd 7
process 834 (postgres): fd 12
process 835 (postgres): fd 13
19.4.8 Admin Operations via umkafs (write)
With writable mount option and admin capabilities:
# Force a driver reload (triggers crash recovery path on a healthy driver)
echo "reload" > /mnt/umka/Drivers/umka-nvme/control

# Revoke a specific capability
echo "revoke 4217:0" > /mnt/umka/Security/Capabilities/control

# Disable a device (sets to Error state in registry)
echo "disable" > /mnt/umka/Devices/pci0000:00/0000:04:00.0/control

# Change FMA diagnosis rule threshold
echo "threshold 200" > /mnt/umka/Health/DiagnosisRules/dimm_degradation/control

# Force tier demotion
echo "demote 2" > /mnt/umka/Drivers/umka-e1000/tier
These are all standard file write operations. Any shell script, ansible playbook, or management tool can use them. No custom CLI tools required.
19.4.9 How Subsystems Register Objects
Each kernel subsystem registers its objects with the namespace during initialization:
// Example: Device registry registers each device node.
// In umka-core/src/registry/mod.rs:
fn register_device_node(&mut self, node: &DeviceNode) {
    // Note: `format!()` here represents `ArrayString::from_fmt()` (stack-allocated,
    // fixed-capacity formatting). The kernel does not use heap-allocated `String`.
    let path = format!("/Devices/{}", node.sysfs_path());
    self.namespace.register(path, Object::from_device(node));
}

// Example: Process creation registers the process object.
// In umka-core/src/sched/process.rs:
fn create_process(&mut self, ...) -> Process {
    let proc = Process::new(...);
    // `format!()` again stands in for `ArrayString::from_fmt()`, as above.
    let path = format!("/Processes/{}", proc.pid);
    self.namespace.register(path, Object::from_process(&proc));
    proc
}

// Objects are automatically deregistered when they are destroyed
// (refcount -> 0 triggers namespace removal).
19.4.10 Device Naming and Registration
The /Devices/ hierarchy is kernel-arbitrated: drivers do not choose their own names.
The kernel's device model assigns a canonical name at device enumeration time, before
any driver binds. This prevents naming collisions and ensures the namespace layout is
stable and predictable across reboots.
19.4.10.1 Canonical Naming Convention
Each device type uses a deterministic naming scheme based on its bus topology:
| Device type | Name format | Example |
|---|---|---|
| PCI device | `pci-<domain>:<bus>:<slot>.<func>` | `pci-0000:02:00.0` |
| USB device | `usb-<bus>:<port>` | `usb-1:3` |
| Platform device | `platform-<name>-<index>` | `platform-serial-0` |
| Virtio device | `virtio-<type>-<id>` | `virtio-blk-0` |
| ACPI device | `acpi-<hid>-<uid>` | `acpi-PNP0501-0` |
| NVMe namespace | `nvme-<ctrl>n<ns>` | `nvme-0n1` |
These names are derived purely from bus-reported topology data (domain, bus, slot, function, port, UID). They are stable across reboots as long as the hardware topology does not change, and they are globally unique within a running system because bus topology values are assigned by hardware and are physically distinct.
19.4.10.2 Kernel-Arbitrated Registration
Drivers register with the device model via umkafs_register_device(handle), where
handle is the device's DeviceHandle obtained from the KABI
(Section 10.5). The kernel computes the canonical name from
the device's bus topology and returns it to the driver. The driver does not supply a
name.
/// Register a device with the umkafs namespace.
///
/// The kernel computes the canonical `/Devices/` name from the device's
/// bus topology (PCI BDF, USB bus:port, ACPI HID:UID, etc.) and creates
/// the namespace entry. The canonical name is returned to the caller so
/// the driver can log it and reference it in diagnostics.
///
/// Returns `Err(IslefsError::Exists)` if a device with the same canonical
/// name is already registered (indicates a bus enumeration bug — see
/// Section 19.4.10.3).
///
/// # Safety
///
/// `handle` must be a valid `DeviceHandle` obtained from the KABI during
/// driver initialization. It must not have been passed to
/// `umkafs_register_device` previously.
pub unsafe fn umkafs_register_device(
handle: DeviceHandle,
) -> Result<ArrayString<64>, IslefsError>;
/// Remove a device's umkafs namespace entry.
///
/// Called during driver teardown. Automatically removes all symlinks
/// pointing to this device's canonical path.
///
/// # Safety
///
/// `handle` must be a valid registered `DeviceHandle`. After this call,
/// the handle must not be used with any umkafs API.
pub unsafe fn umkafs_deregister_device(handle: DeviceHandle);
19.4.10.3 Collision Prevention
The kernel MUST reject umkafs_register_device() calls that would produce a duplicate
canonical name. Because canonical names are derived deterministically from hardware bus
topology, a duplicate can only arise if the device model assigned the same bus address
to two distinct devices simultaneously — which indicates a bus enumeration bug in the
kernel, not a driver bug.
On rejection:
- umkafs_register_device() returns Err(IslefsError::Exists).
- The kernel emits a KERN_WARNING via printk identifying both the existing and
attempted registration, including the canonical name and the bus topology of each
conflicting device.
- The event is also recorded as a HealthEventClass::Generic health event
(Section 19.1.3) on the conflicting device, with
HealthSeverity::Warning, so it appears in the FMA ring and can trigger diagnosis
rules.
The rejected device is not registered. The driver that called umkafs_register_device
for the duplicate receives Err and must fail its initialization, causing the device
model to place the device in Error state.
19.4.10.4 Human-Readable Alias Symlinks
In addition to the canonical path (e.g., /Devices/pci-0000:02:00.0), the device
model publishes human-readable symlinks under class-specific alias directories:
/Devices/by-class/
    block/
        sda   -> ../../pci-0000:02:00.0    (first block device)
        sdb   -> ../../pci-0000:03:00.0    (second block device)
    net/
        eth0  -> ../../pci-0000:04:00.0    (first Ethernet NIC)
    nvme/
        nvme0 -> ../../pci-0000:05:00.0    (first NVMe controller)
Aliases are assigned in device discovery order. Unlike canonical names, aliases may
potentially collide if two drivers independently attempt to claim the same alias (for
example, if two NVMe drivers both try to register the nvme0 alias). Collision
handling for aliases differs from canonical names:
- The first registration of a given alias wins.
- Subsequent registrations that would produce a duplicate alias are assigned a
deduplicated name <alias>~<n>, where n is the lowest positive integer that
produces a unique name (e.g., sda~1, sda~2).
- A KERN_WARNING is emitted identifying both the winning and losing alias, along
with the canonical names of both devices.
Alias collisions are expected to be uncommon (they require two devices of the same
class discovered in ambiguous order) and are not treated as errors.
Aliases are reconstructed at each boot in discovery order and are not guaranteed stable
across reboots (unlike canonical names). Persistent stable naming for userspace (e.g.,
/dev/disk/by-id/) is the responsibility of udev rules operating on canonical names
and device attributes, not of the umkafs alias directory.
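The alias deduplication rule above can be sketched as a pure function. This is a hypothetical helper for illustration; the real device model operates on fixed-size string buffers inside the kernel, not heap-allocated Strings.

```rust
use std::collections::HashSet;

/// First registration wins; later collisions get `<alias>~<n>` with the
/// lowest positive integer n that produces a unique name.
fn dedup_alias(requested: &str, taken: &HashSet<String>) -> String {
    if !taken.contains(requested) {
        return requested.to_string();
    }
    // Scan n = 1, 2, ... until a free name is found. Bounded in practice by
    // the number of same-class devices, so a linear scan is acceptable.
    let mut n = 1u32;
    loop {
        let candidate = format!("{requested}~{n}");
        if !taken.contains(&candidate) {
            return candidate;
        }
        n += 1;
    }
}
```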
19.4.11 Relationship to Existing Interfaces
The namespace does NOT replace /proc, /sys, or /dev. Those remain as the Linux-
compatible interfaces that existing tools depend on. The namespace is an additional
unified view:
| Interface | Audience | Purpose | Status |
|---|---|---|---|
| /proc | Linux tools (ps, top, etc.) | Process info, kernel stats | Required for compat |
| /sys | Linux tools (udev, lspci, etc.) | Device tree, attributes | Required for compat |
| /dev | Linux tools (everything) | Device access | Required for compat |
| /proc/umka/* | UmkaOS-aware admin tools | Object namespace queries | New, additive |
| umkafs | UmkaOS-aware admin tools | Full namespace browse/control | New, additive, optional |
The namespace is the kernel-internal source of truth. /proc, /sys, and /dev
are generated from it (just as /sys is generated from the device registry, and /proc
from process state). The difference is that the namespace provides a unified view
where cross-subsystem queries are natural.
19.4.12 Unified Management CLI (islectl)
Multiple kernel subsystems expose sysfs/procfs interfaces that administrators interact
with through different tools (umka-mltool, veritysetup, sysctl, direct sysfs
writes). islectl provides a single-entry-point CLI for UmkaOS system administration.
Design principle: islectl is a userspace tool that reads from the Unified Object
Namespace (Section 19.4) and writes to sysfs/procfs. It is a convenience layer, not a privileged
daemon — every operation it performs is also possible via direct sysfs writes or existing
tools. This ensures the kernel never depends on islectl and that scripting/automation
can bypass it entirely.
Subcommands:
| Subcommand | Description | Kernel sections |
|---|---|---|
| islectl device list\|info\|health | Device registry queries | Section 10.5 |
| islectl driver load\|unload\|tier | Driver management | Section 11.1, Section 10.4 |
| islectl policy list\|swap\|compare | Policy module management | Section 18.7 |
| islectl intent set\|status\|explain | Intent management | Section 6.7 |
| islectl fma rules\|events\|status | Fault management queries | Section 19.1 |
| islectl evolve status\|apply\|rollback | Live evolution management | Section 12.6 |
| islectl cluster join\|leave\|status\|nodes | Cluster orchestration | Section 5.1 |
| islectl power budget\|status | Power budgeting | Section 6.4 |
Output format: JSON by default (machine-parseable for automation pipelines).
--human flag enables formatted tables for interactive use. --watch streams updates
in real time (poll-based on sysfs inotify).
Cluster mode: When run on a cluster node, subcommands accept a --cluster flag to
operate on the cluster fabric via distributed IPC (Section 5.1). islectl cluster status shows
all nodes. islectl --node=N device list queries a specific remote node. Cluster
operations are strictly read-only unless explicitly confirmed with --yes (to prevent
accidental cross-node mutations).
Implementation phases:
- Phase 3: Basic device, driver, and policy commands
- Phase 4: FMA, intent, and evolution management
- Phase 5: Cluster commands and cross-node operations
19.5 EDAC — Error Detection and Correction Framework
EDAC (Error Detection and Correction) is the kernel framework for reporting CPU and memory hardware errors to userspace. On production servers, EDAC is the primary early-warning system for failing DIMMs: correctable errors (CEs) reliably precede uncorrectable errors (UEs), giving the system operator time to replace hardware before data loss occurs.
19.5.1 Architecture
EDAC decouples hardware-specific memory controller drivers (e.g., amd64_edac for
AMD processors, Intel iMC drivers for Intel platforms) from the common error reporting
infrastructure. The framework:
- Collects error counts from hardware via polling or MCE (Machine Check Exception) notifications.
- Aggregates errors per DIMM, per channel, per memory controller.
- Exposes a sysfs hierarchy at /sys/bus/edac/ for Linux-compatible tooling
(edac-util, rasdaemon).
- Integrates with FMA (Section 19.1) for correlated fault detection and proactive response.
The EDAC framework lives in umka-core/src/edac/. Hardware-specific drivers
(e.g., umka-core/src/edac/amd64.rs) register McController instances at boot
via edac_mc_add_mc() after the memory controller is detected on the bus.
19.5.2 Core Data Structures
// umka-core/src/edac/mod.rs

/// A memory controller — one per CPU socket in NUMA systems.
/// Registered by hardware-specific drivers at boot.
pub struct McController {
    /// Index across all registered controllers (mc0, mc1, ...).
    pub mc_idx: u32,
    /// DRAM type capabilities this controller supports (DDR4, DDR5, LPDDR5, ...).
    pub mtype_cap: MemTypeCapability,
    /// Active EDAC mode (what error detection hardware enforces).
    pub edac_mode: EdacMode,
    /// Error detection capabilities this controller advertises.
    pub edac_cap: EdacCapability,
    /// Device tree node for this controller.
    pub dev: Arc<DevNode>,
    /// Chip-select rows (ranks) managed by this controller.
    pub csrows: Vec<Arc<CsRow>>,
    /// Hardware-specific operations table.
    pub ops: &'static McOps,
    /// Driver name, e.g. "amd64_edac". Null-terminated, fixed length.
    pub ctl_name: [u8; 32],
    /// Controller label, e.g. "MC#0". Null-terminated, fixed length.
    pub mc_name: [u8; 32],
    /// Hardware scrub mode (none, HW background scrub, software scrub).
    pub scrub_mode: ScrubMode,
    /// Total correctable errors observed since boot (or last reset).
    pub ce_count: AtomicU64,
    /// Total uncorrectable errors observed since boot (or last reset).
    pub ue_count: AtomicU64,
    /// Polling interval in milliseconds. 0 = interrupt-driven only.
    pub poll_msec: u32,
}

/// Detected EDAC mode (error detection/correction capability in use).
#[repr(u8)]
pub enum EdacMode {
    /// No error detection.
    None = 0,
    /// Error detection only (no correction).
    EcOnly = 1,
    /// Single-bit error correction, double-bit detection (SECDED).
    Secded = 2,
    /// S4ECD4ED: 4-bit symbol correction, 4-bit detection.
    S4ecd4ed = 3,
    /// S8ECD8ED: 8-bit symbol correction, 8-bit detection.
    S8ecd8ed = 4,
    /// S16ECD16ED: 16-bit symbol correction, 16-bit detection.
    S16ecd16ed = 5,
}

/// One chip-select row — maps to one or two physical DIMMs.
pub struct CsRow {
    pub csrow_idx: u32,
    /// First physical page frame in this rank's address range.
    pub first_page: u64,
    /// Last physical page frame in this rank's address range (inclusive).
    pub last_page: u64,
    /// Page address mask used for address decoding.
    pub page_mask: u64,
    /// Smallest addressable unit in bytes (ECC granule).
    pub grain: u32,
    /// DRAM package type.
    pub dtype: DimmType,
    /// Per-channel state within this chip-select row.
    pub channels: Vec<CsRowChannel>,
    pub ce_count: AtomicU64,
    pub ue_count: AtomicU64,
}

/// One DRAM channel within a chip-select row.
pub struct CsRowChannel {
    pub chan_idx: u32,
    pub ce_count: AtomicU64,
    /// DIMM label. Human-readable slot identifier, settable from userspace.
    /// Example: "CPU_SrcID#0_MC#0_Chan#0_DIMM#0"
    pub label: [u8; 32],
    /// DIMM information, if a DIMM is populated in this slot.
    pub dimm: Option<Arc<DimmInfo>>,
}

/// Physical DIMM information.
pub struct DimmInfo {
    pub dimm_idx: u32,
    pub dtype: DimmType,
    pub mtype: MemType,
    pub edac_mode: EdacMode,
    /// Size in pages.
    pub nr_pages: u64,
    /// ECC granule in bytes.
    pub grain: u32,
    pub ce_count: AtomicU64,
    pub ue_count: AtomicU64,
    /// Human-readable label (set by userspace via sysfs write, e.g. from DMI slot data).
    pub label: [u8; 64],
}

/// Hardware-specific operations provided by each memory controller driver.
pub struct McOps {
    /// Poll hardware for new CE/UE counts.
    /// Called every `poll_msec` ms from a kernel timer, or on MCE notification.
    /// Must be async-safe (called from both process and interrupt contexts).
    pub check_error: fn(mci: &mut McController),
    /// Inject a synthetic error for hardware validation (e.g., post-RAS testing).
    /// Optional — not all platforms support error injection.
    pub inject_error: Option<fn(mci: &McController, inject: &EdacInject) -> Result<(), EdacError>>,
    /// Reset CE/UE counters back to zero.
    /// Requires `CAP_SYS_ADMIN`. Called from sysfs write to `reset_counters`.
    pub reset_counters: Option<fn(mci: &mut McController)>,
}
19.5.3 Error Reporting
When McOps::check_error() detects new errors, or when the MCE handler decodes a
memory error, it calls the EDAC reporting functions:
// umka-core/src/edac/report.rs

/// Report a correctable error (CE).
///
/// Called from both the polling path (preemptible process context) and the
/// MCE handler (NMI context on x86-64). The implementation must be NMI-safe:
/// no sleeping, no heap allocation, no unbounded loops.
///
/// # Arguments
/// - `mci`: The memory controller that detected the error.
/// - `csrow`: Chip-select row index where the error was located.
/// - `channel`: Channel index within the chip-select row.
/// - `page`: Physical page frame number where the error occurred (0 = unknown).
/// - `offset`: Byte offset within the page (0 = unknown).
/// - `grain`: ECC granule size in bytes.
/// - `syndrome`: ECC syndrome reported by hardware (opaque; hardware-specific decode).
/// - `msg`: Additional context string (static, NMI-safe).
pub fn edac_mc_handle_ce(
    mci: &McController,
    csrow: u32,
    channel: u32,
    page: u64,
    offset: u32,
    grain: u32,
    syndrome: u32,
    msg: &'static str,
) {
    // 1. Increment counters atomically (Relaxed ordering: counters are
    //    monotonically increasing; total ordering not required here).
    mci.ce_count.fetch_add(1, Ordering::Relaxed);
    let row = &mci.csrows[csrow as usize];
    row.ce_count.fetch_add(1, Ordering::Relaxed);
    row.channels[channel as usize].ce_count.fetch_add(1, Ordering::Relaxed);

    // 2. Write to kernel log (pr_warn equivalent). NMI-safe printk path.
    klog!(Warn, "EDAC MC{}: CE page 0x{:x} offset 0x{:x} grain {} syndrome 0x{:x} {}",
        mci.mc_idx, page, offset, grain, syndrome, msg);

    // 3. Emit FMA event (Section 19.1). Written to the per-device FMA ring.
    //    FMA event emission is lock-free and NMI-safe (ring buffer + atomic tail).
    fma_emit(FaultEvent::MemoryCe {
        mci_idx: mci.mc_idx,
        csrow,
        channel,
        page,
        syndrome,
    });

    // 4. Threshold check: if CE rate exceeds CE_ADVISORY_PER_DAY on this DIMM,
    //    escalate to FMA advisory. Rate is computed by FMA from the event timestamps
    //    in the FMA ring; EDAC itself only records the raw count.
    //    The threshold is configurable via sysfs (see Section 19.5.4).
}

/// Report an uncorrectable error (UE). The affected memory page is compromised.
///
/// A UE means a multi-bit error that ECC could not correct. The contents of the
/// affected memory are unreliable. If that memory held kernel code or critical
/// data structures, a panic is the safest response.
pub fn edac_mc_handle_ue(
    mci: &McController,
    csrow: u32,
    channel: u32,
    page: u64,
    msg: &'static str,
) {
    mci.ue_count.fetch_add(1, Ordering::Relaxed);
    mci.csrows[csrow as usize].ue_count.fetch_add(1, Ordering::Relaxed);
    klog!(Crit, "EDAC MC{}: UE page 0x{:x} csrow {} channel {} {}",
        mci.mc_idx, page, csrow, channel, msg);
    fma_emit(FaultEvent::MemoryUe {
        mci_idx: mci.mc_idx,
        csrow,
        channel,
        page,
        fatal: true,
        offset: u64::MAX, // Sub-page offset not reported by EDAC UE handler.
    });

    // Panic if configured (sysctl kernel.panic_on_uncorrected_error).
    // Default is true on production builds; false on developer builds.
    if sysctl_read_bool("kernel.panic_on_uncorrected_error") {
        kernel_panic("EDAC: uncorrectable memory error");
        // unreachable
    }

    // If not panicking, attempt to offline the affected page.
    // memory_hotplug::offline_page() migrates any present mappings away
    // and marks the page as PG_hwpoison so the page allocator never reuses it.
    // This can fail (e.g., page holds a pinned DMA buffer); failure is logged.
    if page != 0 {
        if let Err(e) = memory_hotplug::offline_page(page) {
            klog!(Err, "EDAC MC{}: failed to offline page 0x{:x}: {:?}", mci.mc_idx, page, e);
        }
    }
}
CE threshold and advisory — The per-DIMM CE rate is computed by the FMA engine
from timestamps in the FMA event ring. When the rate exceeds the configured threshold
(default: 100 CE/day per DIMM), the FMA engine transitions the device to
DeviceHealth::Degrading and emits a maintenance advisory. The threshold is
configurable per-controller at /sys/bus/edac/devices/mc0/ce_count_limit.
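The FMA-side rate check can be illustrated with a small sketch. This is an assumed helper, not kernel API: it counts CE timestamps from the FMA ring that fall inside the trailing 24-hour window and compares against the per-DIMM advisory limit.

```rust
const SECS_PER_DAY: u64 = 86_400;

/// Hypothetical FMA helper: does the CE count in the trailing 24 hours
/// exceed the per-DIMM advisory limit (default 100 CE/day)?
fn ce_rate_exceeds(timestamps_secs: &[u64], now_secs: u64, limit_per_day: usize) -> bool {
    let window_start = now_secs.saturating_sub(SECS_PER_DAY);
    // Timestamps come from FMA ring events; order does not matter here.
    let recent = timestamps_secs.iter().filter(|&&t| t >= window_start).count();
    recent > limit_per_day
}
```

When this returns true, the FMA engine would transition the device to DeviceHealth::Degrading, as described above.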
MCE integration on x86-64 — The MCE handler (invoked from NMI context) calls
mce_log_event() which, after MCE severity triage, calls edac_mc_handle_ce or
edac_mc_handle_ue. AMD platforms register an edac_mce_amd decode handler that
translates AMD MCA bank bits (MCA_STATUS, MCA_ADDR) into EDAC events. Intel
platforms use the iMC driver's MCE decode path analogously.
19.5.4 sysfs Interface
EDAC registers its sysfs hierarchy under /sys/bus/edac/devices/. All writes
require CAP_SYS_ADMIN.
/sys/bus/edac/devices/
    mc0/                       # Memory controller 0
        ce_count               # (r)  Total CEs since boot or last reset
        ue_count               # (r)  Total UEs since boot or last reset
        reset_counters         # (w)  Write "1" to reset all CE/UE counters
        ce_count_limit         # (rw) CE/day advisory threshold (default: 100)
        mc_name                # (r)  Controller name, e.g. "F19h_MC0"
        edac_mode              # (r)  Active EDAC mode, e.g. "SECDED"
        size_mb                # (r)  Total memory under this controller (MB)
        scrub_mode             # (rw) Active scrub mode; "none" / "hw_scrub"
        csrow0/                # Chip-select row 0
            ce_count           # (r)  CE count for this rank
            ue_count           # (r)  UE count for this rank
            size_mb            # (r)  Rank size in MB
            edac_mode          # (r)  EDAC mode for this rank
            dev_type           # (r)  DIMM package: RDIMM / LRDIMM / UDIMM / NVDIMM
            mem_type           # (r)  DRAM standard: DDR4 / DDR5 / LPDDR5
            ch0_ce_count       # (r)  CE count for channel 0
            ch0_dimm_label     # (rw) DIMM slot label (set by userspace from DMI data)
            ch1_ce_count       # (r)  CE count for channel 1
            ch1_dimm_label     # (rw) DIMM slot label for channel 1
19.5.5 umkafs Integration
EDAC data is also mirrored in the umkafs unified namespace (Section 19.4), enabling cross-subsystem queries alongside device health, scheduler state, and cluster topology:
/Memory/EDAC/
    mc0/
        ce_count
        ue_count
        ce_count_limit
        csrow0/
            ce_count
            ue_count
            ch0_ce_count
            ch0_dimm_label
            ch1_ce_count
            ch1_dimm_label
The umkafs entries are thin wrappers over the same McController/CsRow data
structures — reads are lock-free atomic loads; writes are forwarded to the same
sysfs write handlers (same permission checks apply).
19.5.6 Integration with FMA
Each edac_mc_handle_ce/ue call emits a FaultEvent into the FMA ring
(Section 19.1.3). The FMA diagnosis engine correlates these events:
| Pattern | FMA Diagnosis | Response |
|---|---|---|
| Rising CE rate on one DIMM | DeviceHealth::Degrading | Maintenance advisory; retire pages if possible |
| UE event | DeviceHealth::Failed | Attempt memory_hotplug::offline_page; alert operator |
| Multiple DIMMs in same rank failing | Possible MC fault | Escalate to platform-level alert |
| CE burst correlated with thermal event | Thermal-induced instability | Correlate with FMA thermal events |
The FMA ring can be consumed by the umka-ml-anomaly service
(Section 22.1) for time-series analysis of
CE rates. The service fits an exponential moving average over the CE event
timestamps and alerts if the rate exceeds three standard deviations above the
per-DIMM baseline.
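The three-sigma test can be sketched as follows, assuming a standard exponentially weighted mean/variance update over per-interval CE counts; the actual umka-ml-anomaly model may differ in its smoothing and windowing choices.

```rust
/// Hypothetical userspace baseline tracker: EWMA mean plus exponentially
/// weighted variance of per-interval CE counts.
struct CeBaseline {
    mean: f64,
    var: f64,
    alpha: f64, // smoothing factor, e.g. 0.1
}

impl CeBaseline {
    fn new(alpha: f64) -> Self {
        Self { mean: 0.0, var: 0.0, alpha }
    }

    /// Feed one interval's CE count; returns true if it exceeds the current
    /// baseline by more than three standard deviations.
    fn observe(&mut self, count: f64) -> bool {
        let sigma = self.var.sqrt();
        // Only alert once a baseline exists (mean > 0 after warm-up).
        let anomalous = self.mean > 0.0 && count > self.mean + 3.0 * sigma;
        // Exponentially weighted mean/variance update.
        let delta = count - self.mean;
        self.mean += self.alpha * delta;
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta);
        anomalous
    }
}
```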
Implementation phases:
- Phase 1: Framework registration (edac_mc_add_mc, edac_mc_handle_ce/ue), sysfs hierarchy
- Phase 2: AMD (amd64_edac) and Intel iMC drivers
- Phase 4: umkafs integration, FMA correlation rules, umka-ml-anomaly CE-rate monitor
19.5.7 Polling Mechanism
Hardware memory controllers expose ECC error counts through memory-mapped registers or model-specific registers (MSRs). UmkaOS reads these via a dedicated polling controller that runs on a kworker thread at a configurable interval.
19.5.7.1 Controller and Driver Interface
// umka-core/src/edac/poller.rs

/// EDAC polling controller — one singleton per system, initialized at boot.
///
/// Owns the kworker thread and the registered driver list. Drivers register
/// at bus-scan time (PCI enumeration for Intel iMC / AMD UMC controllers)
/// and deregister on hot-remove.
pub struct EdacPoller {
    /// Polling interval in milliseconds. Default: 1000 ms.
    /// Range: 100 ms (aggressive, ~0.1% overhead) to 60000 ms (conservative).
    /// Readable and writable via `/System/Kernel/edac/poll_interval_ms`.
    /// A write takes effect on the next wakeup; no restart is required.
    pub poll_interval_ms: AtomicU32,
    /// Registered EDAC drivers. One entry per memory controller instance.
    /// Protected by `RwLock` so that bus-scan (writer) and the poller thread
    /// (reader) do not contend on steady-state polls.
    pub drivers: RwLock<Vec<Box<dyn EdacDriver>>>,
    /// Handle to the dedicated kworker thread (`edac_poller/0`).
    /// Created at EDAC framework init; runs until system shutdown.
    pub poller: KworkerHandle,
    /// Monotonically increasing correctable error count since boot.
    /// Incremented under the driver's `RwLock` read guard; readable via
    /// `/System/Kernel/edac/ce_count` without locking.
    pub total_ce_count: AtomicU64,
    /// Monotonically increasing uncorrectable error count since boot.
    pub total_ue_count: AtomicU64,
    /// Sliding window of UE timestamps (seconds since boot), used to detect
    /// UE bursts that trigger a panic. Ring of `UE_BURST_WINDOW` entries.
    pub ue_window: SpscRing<u64, UE_BURST_WINDOW>,
}

/// Maximum UE events in the burst window before a panic is triggered.
/// A burst of 3 UEs within 60 seconds indicates multi-bit failure spreading
/// beyond what page-offlining can contain; continuing risks silently corrupt data.
pub const UE_PANIC_THRESHOLD: usize = 3;

/// Width of the UE burst detection window in seconds.
pub const UE_BURST_WINDOW_SECS: u64 = 60;

/// Ring capacity for the UE burst window (must be >= UE_PANIC_THRESHOLD).
pub const UE_BURST_WINDOW: usize = 8;

/// Per-memory-controller EDAC driver interface.
///
/// Implemented by hardware-specific drivers (e.g., `amd64_edac`, `intel_imc_edac`).
/// All methods are called from the poller kthread (preemptible process context)
/// unless noted otherwise. Implementations may sleep, allocate memory, and acquire
/// non-IRQ-safe locks.
pub trait EdacDriver: Send + Sync {
    /// Human-readable controller name for logging and sysfs, e.g., `"AMD_UMC_0"`,
    /// `"Intel_IMC_0"`. Must be unique across all registered drivers.
    fn name(&self) -> &str;

    /// Poll hardware ECC registers once and return all errors detected since the
    /// previous call to `clear_counts()`.
    ///
    /// The implementation reads the relevant status/count registers from hardware
    /// (e.g., AMD `MCG_STATUS`, `MCA_STATUS_UMC`; Intel `MC{n}_STATUS` MSRs or
    /// iMC MMIO error registers) and translates them to `EdacError` values.
    ///
    /// Returns an empty `Vec` if no new errors have occurred since the last poll.
    /// A non-empty return does not imply a fatal condition — correctable errors
    /// are expected on aging hardware.
    fn poll(&mut self) -> Vec<EdacError>;

    /// Clear the hardware error count registers so that the next `poll()` call
    /// reports only newly detected errors, not cumulative counts.
    ///
    /// Called immediately after `poll()` by the poller thread. On platforms where
    /// clearing is not possible (read-only sticky registers), the driver must
    /// track the previous value internally and return the delta in `poll()`.
    fn clear_counts(&mut self);
}

/// A single ECC error event detected by polling a hardware memory controller.
pub struct EdacError {
    /// Error classification.
    pub kind: EdacErrorKind,
    /// Physical memory address where the error occurred, if the hardware
    /// provides address decoding. `None` if the controller does not record
    /// the fault address (e.g., LPDDR5 in some mobile configurations).
    pub phys_addr: Option<PhysAddr>,
    /// DIMM physical location decoded from the hardware address (channel,
    /// slot, rank, bank, row, column). Fields not decoded by the hardware
    /// are set to `u32::MAX`.
    pub location: EdacLocation,
    /// ECC syndrome bits as reported by hardware. Opaque; interpretation
    /// is hardware-specific and provided for diagnostic/logging purposes only.
    pub syndrome: u64,
    /// Count of how many times this error (same `phys_addr`) has been seen
    /// in the current polling epoch. Most hardware reports 1 per poll cycle;
    /// some aggregate counts across multiple addresses.
    pub count: u32,
}

/// Physical DIMM location decoded by the hardware memory controller driver.
/// Fields that the hardware does not provide are `u32::MAX`.
pub struct EdacLocation {
    pub channel: u32,
    pub slot: u32,
    pub rank: u32,
    pub bank: u32,
    pub row: u32,
    pub column: u32,
}

/// Classification of a detected ECC error.
pub enum EdacErrorKind {
    /// Correctable Error (CE): single-bit error that the ECC hardware corrected
    /// transparently. The read data delivered to the CPU was correct. CEs do not
    /// cause data loss. A rising CE rate on a single DIMM location is the
    /// canonical predictor of an imminent uncorrectable error.
    Correctable,

    /// Uncorrectable Error (UE): multi-bit error that ECC could not correct.
    /// The data at the affected physical address is unreliable. If that address
    /// holds kernel code, a page table, or a live task's memory, a panic is the
    /// safest response to prevent silent corruption from propagating.
    Uncorrectable,

    /// Correctable Error at a high rate: more than 100 CEs per hour on the same
    /// DIMM location. The error is still hardware-corrected, but the rate indicates
    /// the DIMM is degrading. Page-offlining is recommended before a UE occurs.
    /// Emitted as a separate variant (not `Correctable`) so the poller can take
    /// proactive action without waiting for the FMA CE-rate threshold check.
    CorrectableHighRate,
}
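The sticky-register fallback described in the clear_counts() documentation (track the previous raw value and report only the delta) can be sketched as a small self-contained helper; read_hw_ce_counter and the driver plumbing are omitted, and only the delta arithmetic is shown.

```rust
/// Hypothetical delta tracker for read-only sticky error counters.
/// The driver stores the last raw hardware value and reports only the
/// difference on each poll.
struct StickyCounter {
    last_raw: u64,
}

impl StickyCounter {
    fn new(initial_raw: u64) -> Self {
        Self { last_raw: initial_raw }
    }

    /// Return the number of new errors since the previous poll.
    /// `wrapping_sub` handles hardware counter wraparound.
    fn delta(&mut self, raw: u64) -> u64 {
        let new = raw.wrapping_sub(self.last_raw);
        self.last_raw = raw;
        new
    }
}
```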
19.5.7.2 Polling Algorithm
The poller kthread (edac_poller/0) runs the following loop:
EDAC poller loop (runs every poll_interval_ms milliseconds):

    for each driver in EdacPoller.drivers (read lock held):
        errors = driver.poll()
        driver.clear_counts()

        for each error in errors:
            match error.kind:

                Correctable | CorrectableHighRate:
                    EdacPoller.total_ce_count += error.count
                    csrow.ce_count += error.count      // McController.ue_count unchanged
                    emit FaultEvent::MemoryCe { ... } to FMA ring
                    if error.kind == CorrectableHighRate:
                        emit FaultEvent with HealthSeverity::Warning
                        if error.phys_addr is Some(addr):
                            offline_page(addr)         // remove from allocator pool
                            klog!(Warn, "EDAC {}: CE high-rate, page 0x{addr:x} offlined", driver.name())

                Uncorrectable:
                    EdacPoller.total_ue_count += error.count
                    csrow.ue_count += error.count
                    emit FaultEvent::MemoryUe { ... } with HealthSeverity::Critical
                    if error.phys_addr is Some(addr):
                        offline_page(addr)
                        klog!(Crit, "EDAC {}: UE, page 0x{addr:x} offlined", driver.name())

                    // Record UE timestamp in burst window.
                    ue_window.push(now_secs())
                    // Evict entries older than UE_BURST_WINDOW_SECS.
                    while ue_window.front() < now_secs() - UE_BURST_WINDOW_SECS:
                        ue_window.pop_front()

                    // Burst check: too many UEs in the sliding window → panic.
                    if ue_window.len() >= UE_PANIC_THRESHOLD:
                        kernel_panic(
                            "EDAC: {} uncorrectable memory errors in {}s — halting",
                            UE_PANIC_THRESHOLD, UE_BURST_WINDOW_SECS
                        )
                        // unreachable
Page offlining. offline_page(phys_addr) calls
memory_hotplug::offline_page() (Section 4.2),
which removes the page from the buddy allocator's free pool. The page is never
allocated again. Existing mappings of the page are not forcibly invalidated —
the process or kernel subsystem that holds the mapping may read corrupt data, but
further allocations will not land on the bad page. A UE in a kernel mapping
triggers a panic before offline_page is called, since the kernel cannot safely
continue with a bad kernel page.
Panic threshold rationale. Three UEs in 60 seconds indicates a fault condition beyond individual page-offlining: either the memory controller itself is failing, or a hardware fault is affecting a wide address range. Continuing risks silently corrupt data in pages that have not yet been detected as bad. A panic with a clear EDAC message is the correct response; the operator can analyze the FMA event log to determine the scope of the failure.
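The burst detector from the poller loop can be modeled as a standalone sketch, using a VecDeque in place of the kernel's fixed-capacity SPSC ring; the kernel_panic call is replaced by a boolean return so the logic can be exercised in isolation.

```rust
use std::collections::VecDeque;

const UE_PANIC_THRESHOLD: usize = 3;
const UE_BURST_WINDOW_SECS: u64 = 60;

/// Model of the UE burst window (illustration; kernel uses SpscRing).
struct UeBurstWindow {
    timestamps: VecDeque<u64>, // seconds since boot
}

impl UeBurstWindow {
    fn new() -> Self {
        Self { timestamps: VecDeque::new() }
    }

    /// Record one UE; returns true when the panic threshold is reached
    /// within the sliding window (the kernel would panic at this point).
    fn record(&mut self, now_secs: u64) -> bool {
        self.timestamps.push_back(now_secs);
        // Evict entries older than the window width.
        while let Some(&front) = self.timestamps.front() {
            if front + UE_BURST_WINDOW_SECS < now_secs {
                self.timestamps.pop_front();
            } else {
                break;
            }
        }
        self.timestamps.len() >= UE_PANIC_THRESHOLD
    }
}
```

Three UEs spread over several minutes never trip the threshold; three within one window do, matching the rationale above.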
19.5.7.3 Userspace Visibility
Error counters and configuration are exposed through two parallel namespaces:
sysfs (Linux-compatible, for edac-util, rasdaemon):
/sys/bus/edac/devices/mc{n}/
    ce_count                   (r)  Total CE count since boot (or last reset)
    ue_count                   (r)  Total UE count since boot (or last reset)
    reset_counters             (w)  Write "1" to clear all counters (CAP_SYS_ADMIN)
    scrub_mode                 (rw) "none" or "hw_scrub"
    csrow{m}/ch{c}_ce_count    (r)  Per-channel CE count for rank m, channel c
    csrow{m}/ue_count          (r)  Per-rank UE count
umkafs (cross-subsystem queries, Prometheus scraping):
/System/Kernel/edac/
    poll_interval_ms           (rw) Polling interval; range [100, 60000]; default 1000
    ce_count                   (r)  System-wide total CE count since boot
    ue_count                   (r)  System-wide total UE count since boot
    dimm/{location}/ce_count   (r)  Per-DIMM CE count (location = "mc0_csrow0_ch0")
    dimm/{location}/ue_count   (r)  Per-DIMM UE count
POLLPRI notification. A UE event sets the POLLPRI flag on the memory
controller's character device /dev/edac/mc{n}. Userspace monitoring daemons
(e.g., mcelog, edac-util) may poll()/epoll() on this fd to receive
prompt notification without busy-waiting. CE events do not set POLLPRI; they
are only reported through the sysfs/umkafs counters and the FMA event ring.
19.6 pstore — Panic Log Persistence
pstore (persistent storage) preserves kernel panic logs, oops messages, and console output across reboots by writing them to non-volatile storage during the panic path. On production datacenter hosts, pstore is the primary mechanism for post-mortem analysis of kernel panics when a full kdump capture is impractical (insufficient crashkernel reservation, early-boot panics before kdump is armed, or EFI-only hosts).
19.6.1 Architecture
pstore decouples the persistence backend (EFI variables, ramoops DRAM region, BERT ACPI table) from the event producers (panic handler, oops handler, console, MCE log). The framework:
- Calls all registered backends on panic/oops via the KmsgDumper interface.
- Mounts pstorefs at /sys/fs/pstore/ after boot (done by systemd-pstore.service
or manually by the administrator).
- Each saved record appears as a virtual file: dmesg-efi-12345, console-efi-12345,
mce-efi-12346, etc.
- Files are removed from pstorefs (and erased from the backend) via unlink() or
automatically by systemd-pstore.service, which copies them to
/var/lib/systemd/pstore/ first.
The pstore framework lives in umka-core/src/pstore/. Backend drivers live in
umka-core/src/pstore/efi.rs, umka-core/src/pstore/ramoops.rs, and (read-only)
umka-core/src/pstore/bert.rs.
19.6.2 Backend Interface
// umka-core/src/pstore/mod.rs

/// Registered pstore backend. One instance per storage medium.
/// Backends are registered at boot via `pstore_register()`.
pub struct PstoreInfo {
    /// Backend name, e.g. "efi-pstore", "ramoops". Null-terminated.
    pub name: [u8; 32],
    /// Which record types this backend handles.
    pub flags: PstoreFlags,
    /// Maximum panic reason severity this backend accepts.
    pub max_reason: KmsgDumpReason,
    /// Write one pstore record during panic/oops.
    ///
    /// # Safety
    ///
    /// Called from NMI/panic context. Must not sleep, allocate heap memory,
    /// or acquire non-spinlock mutexes. The implementation must complete in
    /// bounded time (no retry loops on EFI runtime errors).
    ///
    /// Returns the backend-assigned record ID on success, used later by `erase`.
    pub write: fn(info: &PstoreInfo, record: &mut PstoreRecord) -> Result<u64, PstoreError>,
    /// Read the next available record (enumeration).
    ///
    /// Called repeatedly at pstorefs mount time until it returns `None`.
    /// Must be callable from process context (may sleep for slow backends).
    pub read: fn(info: &PstoreInfo, iter: &mut PstoreReadIter) -> Option<PstoreRecord>,
    /// Erase record `id` from the backend.
    ///
    /// Called when userspace unlinks the corresponding pstorefs file.
    /// Must be callable from process context.
    pub erase: fn(info: &PstoreInfo, id: u64) -> Result<(), PstoreError>,
}

/// A single pstore record (one logical unit of saved state).
pub struct PstoreRecord {
    /// Record type tag.
    pub type_: PstoreType,
    /// Backend-assigned ID (set by `write`, used by `erase`).
    pub id: u64,
    /// Monotonically increasing panic counter (survives reboot via EFI variable).
    pub count: u32,
    /// Reason the kernel was dumping (panic, oops, MCE, ...).
    pub reason: KmsgDumpReason,
    /// Part number for multi-part records (0-based). Large logs are split into
    /// backend-sized chunks and stored as parts 0, 1, 2, ...
    pub part: u32,
    /// True if `buf` contains LZ4-compressed data.
    pub compressed: bool,
    /// Number of valid bytes in `buf`.
    pub size: usize,
    /// Pre-allocated, statically-sized buffer (NMI-safe — no heap).
    /// Size determined by the backend at registration time.
    pub buf: &'static mut [u8],
}

/// Panic dump reason, in order of increasing severity.
#[repr(u32)]
pub enum KmsgDumpReason {
    Panic = 1,
    Oops = 2,
    Emerg = 3,
    Restart = 4,
    Halt = 5,
    Poweroff = 6,
    SoftRestart = 7,
    Mce = 8,
}

bitflags! {
    /// Which record types a backend supports.
    pub struct PstoreFlags: u32 {
        const DMESG   = 0x0001;
        const CONSOLE = 0x0002;
        const FTRACE  = 0x0004;
        const MCE     = 0x0008;
        const PMSG    = 0x0010;
    }
}
19.6.3 EFI Backend (efi_pstore)
The EFI backend stores records as EFI non-volatile NVRAM variables, which survive across hard resets and power cycles on UEFI platforms.
Variable naming — Each record part becomes one EFI variable:
VendorGuid: {9f2f919e-b88e-4e47-...}
VariableName: "dump-type0-<part>-<count>-<timestamp>"
where part = part number, count = panic counter, timestamp = seconds since
epoch from the EFI GetTime() runtime service.
Size limit — EFI variable size is platform-specific. The backend queries
QueryVariableInfo() at registration time and stores the result in max_var_size.
Typical values: 8–64 KB per variable. Large dmesg logs are split into
ceil(uncompressed_size / max_chunk_size) parts.
Compression — Each chunk is LZ4-compressed before writing. The compression buffer
is statically allocated at registration time (no heap allocation in panic path). If
LZ4 cannot compress a chunk below max_var_size, the chunk is stored uncompressed
and PstoreRecord::compressed is set to false.
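The compress-or-store-raw decision can be sketched as a pure function. The compressor is a stand-in closure (the kernel uses its in-tree LZ4 encoder over a static buffer); the point is the fallback rule: if the compressed form does not fit the variable size limit, store the chunk uncompressed and clear the compressed flag.

```rust
/// Hypothetical encoding step for one pstore chunk. Returns the bytes to
/// store and whether they are compressed. Assumes the caller has already
/// split the log so that chunk.len() <= max_var_size.
fn encode_chunk(
    chunk: &[u8],
    max_var_size: usize,
    compress: impl Fn(&[u8]) -> Vec<u8>,
) -> (Vec<u8>, bool) {
    let compressed = compress(chunk);
    if compressed.len() <= max_var_size {
        (compressed, true)
    } else {
        // Fallback: store raw, flag cleared (incompressible data can
        // expand under LZ4 framing overhead).
        (chunk.to_vec(), false)
    }
}
```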
NMI-safety on x86-64 — EFI runtime services (SetVariable) are not
unconditionally NMI-safe on all firmware implementations. The EFI backend disables
interrupts (cli) and acquires an NMI-safe spinlock (no scheduler interaction)
before entering EFI runtime. This matches the approach used by Linux's
efi_call_rts() path.
// umka-core/src/pstore/efi.rs
pub struct EfiPstoreBackend {
pub info: PstoreInfo,
/// Maximum bytes per EFI variable (from QueryVariableInfo).
pub max_var_size: usize,
/// Pre-allocated compression output buffer (static; no heap allocation in the panic path).
pub compress_buf: [u8; 65536],
/// Monotonically increasing panic count, stored in one dedicated EFI variable.
pub panic_count: AtomicU32,
/// Lock serializing EFI runtime service calls (NMI-safe spinlock).
pub efi_lock: NmiSpinlock,
}
impl EfiPstoreBackend {
/// Boot-time initialization: read panic_count from EFI, register with pstore.
pub fn init() -> Result<(), PstoreError> {
let backend = Self::new()?;
pstore_register(backend.info)
}
}
Boot-time enumeration — At pstorefs mount, EfiPstoreBackend::read() calls
GetNextVariableName() in a loop collecting all variables matching the GUID. It
reassembles multi-part records (sorted by part number), decompresses if
compressed = true, and yields PstoreRecord instances for each complete log.
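The reassembly step can be sketched independently of EFI: sort parts by part number, then concatenate. A hedged sketch, where `reassemble` and its `(part_number, data)` pair representation are illustrative (the real backend works on PstoreRecord instances):

```rust
/// Reassemble one multi-part record from EFI variables read back in arbitrary
/// firmware enumeration order. Each element is a (part_number, data) pair.
/// Illustrative sketch only.
fn reassemble(mut parts: Vec<(u32, Vec<u8>)>) -> Vec<u8> {
    // Sort by part number, then concatenate payloads in order.
    parts.sort_by_key(|p| p.0);
    parts.into_iter().flat_map(|(_, data)| data).collect()
}

fn main() {
    // GetNextVariableName() yields parts in arbitrary order; output is ordered.
    let parts = vec![(1, b" world".to_vec()), (0, b"hello".to_vec())];
    assert_eq!(reassemble(parts), b"hello world".to_vec());
}
```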
Erase path — EfiPstoreBackend::erase() calls SetVariable() with
DataSize = 0, which deletes the EFI variable. If systemd-pstore.service is
running, it erases all records after copying them to /var/lib/systemd/pstore/,
preventing EFI NVRAM exhaustion across many panics.
19.6.4 Ramoops Backend
For systems without EFI — embedded boards, some RISC-V platforms, BIOS-based x86-64 hosts — ramoops writes to a reserved DRAM region that survives soft resets (warm reboots) but not power loss.
Configuration — via device-tree node or ACPI SSDT:
ramoops {
compatible = "ramoops";
memory-region = <&pstore_reserved>; /* Reserved DRAM, carved out at early boot */
record-size = <0x20000>; /* 128 KB per dmesg record slot */
console-size = <0x40000>; /* 256 KB console ring */
ftrace-size = <0>; /* No ftrace backend in this config */
max-reason = <1>; /* PANIC only */
ecc = <16>; /* 16-byte ECC for ramoops header */
};
In-memory layout — The reserved region is divided into fixed-size slots. Each
slot begins with a RamoopsHeader:
/// Header preceding each ramoops record slot (stored in DRAM).
#[repr(C, packed)]
pub struct RamoopsHeader {
/// Magic number: 0x43415441 ("CATA") — validates slot is written.
pub magic: u32,
/// Panic count at time of write.
pub count: u32,
/// Wall-clock timestamp (seconds since epoch) at time of write.
pub time: u64,
/// True if the data following this header is LZ4-compressed.
pub compressed: u8,
/// ECC block size in bytes (for header integrity check). 0 = no ECC.
pub ecc_size: u8,
/// Size of this header structure (for forward compatibility).
pub header_size: u32,
/// Number of valid data bytes following the header.
pub data_size: u32,
}
On boot, the ramoops driver maps the reserved region, reads each slot, validates the magic and (if configured) the ECC check over the header, and presents valid slots as pstorefs files. Slots with invalid magic are treated as empty.
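The boot-time check can be sketched against raw slot bytes. A hedged sketch, assuming little-endian field order matching the packed RamoopsHeader; `slot_is_valid` is an illustrative helper that omits the optional ECC verification:

```rust
/// Magic from RamoopsHeader.
const RAMOOPS_MAGIC: u32 = 0x4341_5441; // "CATA"

/// Packed header length: magic(4) + count(4) + time(8) + compressed(1)
/// + ecc_size(1) + header_size(4) + data_size(4) = 26 bytes.
const HEADER_LEN: usize = 26;

/// Validate one ramoops slot: magic matches and the payload fits in the slot.
/// Illustrative sketch; the real driver also runs the ECC check when configured.
fn slot_is_valid(slot: &[u8]) -> bool {
    if slot.len() < HEADER_LEN {
        return false;
    }
    let magic = u32::from_le_bytes(slot[0..4].try_into().unwrap());
    let data_size = u32::from_le_bytes(slot[22..26].try_into().unwrap()) as usize;
    magic == RAMOOPS_MAGIC && HEADER_LEN + data_size <= slot.len()
}

fn main() {
    let mut slot = vec![0u8; 128];
    slot[0..4].copy_from_slice(&RAMOOPS_MAGIC.to_le_bytes());
    slot[22..26].copy_from_slice(&32u32.to_le_bytes()); // 32 bytes of payload
    assert!(slot_is_valid(&slot));
    assert!(!slot_is_valid(&vec![0u8; 128])); // zeroed slot = empty
}
```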
19.6.5 pstorefs
pstorefs is a RAM-backed virtual filesystem mounted at /sys/fs/pstore/. It is
populated at mount time by iterating all registered backends' read() functions.
Each record is represented as a read-only file:
| Filename | Contents |
|---|---|
| dmesg-efi-&lt;id&gt; | Kernel log tail from the last panic (part 0) |
| dmesg-efi-&lt;id&gt;-&lt;part&gt; | Continuation parts of the same log |
| console-efi-&lt;id&gt; | Console output captured during panic |
| mce-efi-&lt;id&gt; | MCE information decoded at panic time |
| dmesg-ramoops-&lt;id&gt; | Kernel log from ramoops backend |
Reading — Files are read-only. If the stored record is LZ4-compressed, the pstorefs file layer decompresses on first read into a per-file page cache. Reads thereafter are served from the page cache.
Deletion — unlink() on a pstorefs file calls PstoreInfo::erase() on the
owning backend, removing the record from NVRAM/DRAM. This is the mechanism by which
systemd-pstore.service reclaims NVRAM space.
systemd-pstore.service interaction:
1. Service starts on every boot.
2. Mounts /sys/fs/pstore/ if not already mounted.
3. Copies each file to /var/lib/systemd/pstore/<hostname>/<boot-id>/.
4. Calls unlink() on each file to trigger backend erase.
5. If /var/lib/systemd/pstore/ is full (configurable threshold): rotates oldest
panic logs out before copying new ones.
19.6.6 Panic Handler Integration
The kernel panic path calls kmsg_dump(KmsgDumpReason::Panic) which iterates all
registered KmsgDumper callbacks in priority order. pstore registers at priority
KMSG_DUMP_PSTORE. The pstore dump callback:
// umka-core/src/pstore/dump.rs
/// KmsgDumper callback — called from panic/oops context.
///
/// # Constraints
///
/// - NMI context on x86-64 (MCE-originated panics): no sleeping,
/// no heap allocation, no non-NMI-safe locks.
/// - Process context otherwise (software panics, oops).
/// - Must complete in bounded time; EFI backends have a firmware timeout
/// of ~10 seconds per variable write.
pub fn pstore_kmsg_dump(dumper: &mut KmsgDumper, reason: KmsgDumpReason) {
// 1. Check if reason >= min(backend.max_reason) for any registered backend.
// If no backend wants this reason, return immediately.
if !pstore_has_backend_for(reason) {
return;
}
// 2. Compute which backends will accept this record.
let backends = pstore_backends_for_flags(PstoreFlags::DMESG);
// 3. Read from the kernel message ring buffer tail using the kmsg_dump iterator.
// Iterates newest-first; we collect up to the backend's chunk capacity.
let mut iter = dumper.iter_lines();
let mut part = 0u32;
while let Some(chunk) = collect_chunk(&mut iter, MAX_CHUNK_BYTES) {
// 4. Compress chunk (LZ4). Uses statically allocated compression workspace.
let (buf, compressed) = lz4_compress_static(chunk);
// 5. Write to each registered backend.
for backend in &backends {
let mut record = PstoreRecord {
type_: PstoreType::Dmesg,
id: 0, // assigned by backend write()
count: PANIC_COUNT.load(Ordering::Relaxed),
reason,
part,
compressed,
size: buf.len(),
buf: backend.record_buf_mut(),
};
record.buf[..buf.len()].copy_from_slice(buf);
record.size = buf.len();
let _ = (backend.info.write)(&backend.info, &mut record);
// Errors are silently dropped: we are in panic context and cannot
// meaningfully recover. The panic continues regardless.
}
part += 1;
}
}
19.6.7 umkafs Integration
pstore records from the most recent panic are also reflected in the umkafs unified namespace (Section 19.4):
/System/Kernel/PanicLog/
last_panic_time # Wall-clock timestamp of most recent panic (epoch seconds)
last_panic_reason # Panic string (first line of dmesg-* record)
panic_count # Monotonically increasing counter across all panics
records/ # Symlinks into /sys/fs/pstore/
dmesg-efi-12345 -> /sys/fs/pstore/dmesg-efi-12345
console-efi-12346 -> /sys/fs/pstore/console-efi-12346
last_panic_time and last_panic_reason are populated at pstorefs mount time from
the most recent record (highest count). They are read-only; clearing requires
unlink() of the underlying pstorefs file (which requires CAP_SYS_ADMIN).
Implementation phases:
- Phase 1: pstore framework, EFI backend, pstorefs
- Phase 2: ramoops backend (for non-EFI architectures: ARMv7, RISC-V, PPC32)
- Phase 3: umkafs integration, systemd-pstore.service compatibility
- Phase 4: BERT (read-only ACPI firmware-reported errors) backend
19.7 Performance Monitoring Unit (perf_event_open)
The perf_event_open subsystem exposes CPU hardware performance counters, kernel
software events, and tracepoints to userspace through a uniform file-descriptor
interface. It is the foundation for perf stat, perf record, bpftrace, BCC
tools, and JVM JIT profiling.
UmkaOS implements the full Linux perf_event_open interface — identical syscall
number, identical perf_event_attr wire format, identical ioctl codes, identical
/proc/sys knobs — while replacing Linux's internal implementation with a design
that avoids the global perf_event_mutex, uses per-CPU contexts as the primary
scheduling unit, and offloads NMI-context work to per-CPU kernel threads.
19.7.1 Syscall Interface
perf_event_open(attr, pid, cpu, group_fd, flags) → fd | errno
Syscall number: 298 (x86-64). All six UmkaOS architectures use the same ABI value as their Linux counterpart (see Section 19.7.13 for the full table).
Parameters:
| Parameter | Type | Meaning |
|---|---|---|
| attr | *const PerfEventAttr | Event configuration (see Section 19.7.2) |
| pid | i32 | 0 = current task; -1 = all tasks on cpu; >0 = specific task (requires PTRACE_MODE_READ or CAP_SYS_PTRACE) |
| cpu | i32 | ≥0 = pin to specific CPU; -1 = follow task across CPUs (requires pid != -1) |
| group_fd | i32 | -1 = standalone event; or fd of group leader (events measured as an atomic group sharing the same set of hardware counters) |
| flags | u64 | See flag table below |
flags bitmask:
| Flag | Value | Effect |
|---|---|---|
| PERF_FLAG_FD_NO_GROUP | 0x01 | Ignore group_fd even if non-negative |
| PERF_FLAG_FD_OUTPUT | 0x02 | Redirect output to group_fd's ring buffer |
| PERF_FLAG_PID_CGROUP | 0x04 | pid is a cgroup fd, not a task pid |
| PERF_FLAG_FD_CLOEXEC | 0x08 | Set O_CLOEXEC on the returned fd |
Return value and errors:
| Errno | Condition |
|---|---|
| EINVAL | Invalid attr fields, unknown event type, or conflicting flags |
| EPERM | Insufficient privilege for the requested event (see Section 19.7.13) |
| EMFILE | Per-process fd limit exhausted |
| ENODEV | Event type not supported on this CPU (e.g., Intel PT on AMD) |
| EACCES | perf_event_paranoid blocks the request (see Section 19.7.13) |
| ENOENT | Referenced tracepoint event ID does not exist |
| ESRCH | Target pid does not exist |
| EOPNOTSUPP | Requested sample type or branch filter not supported by this PMU |
The returned file descriptor supports read(), mmap(), ioctl(), poll()/epoll(),
and close(). It does not support write().
19.7.2 perf_event_attr Wire Format
The perf_event_attr struct is exactly 136 bytes (PERF_ATTR_SIZE_VER8) on all
architectures and matches the Linux ABI. The kernel accepts any size field
value — if size is larger than the kernel's known struct size, trailing bytes
are silently ignored (forward compatibility). If size is smaller, missing
fields are treated as zero (backward compatibility).
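The compatibility rule amounts to a copy-in that truncates or zero-extends. A minimal sketch (`copy_in_attr` and the `known_size` parameter are illustrative; the real path goes through copy_from_user):

```rust
/// Apply the size-field compatibility rule: copy min(user_size, known_size)
/// bytes and zero-fill the rest. `known_size` stands for the kernel's
/// sizeof(PerfEventAttr). Illustrative sketch only.
fn copy_in_attr(user_buf: &[u8], known_size: usize) -> Vec<u8> {
    let mut kernel_view = vec![0u8; known_size];
    let n = user_buf.len().min(known_size);
    kernel_view[..n].copy_from_slice(&user_buf[..n]);
    kernel_view
}

fn main() {
    // Newer userspace (larger struct): trailing bytes are ignored.
    let newer = vec![0xAAu8; 160];
    assert_eq!(copy_in_attr(&newer, 136).len(), 136);
    // Older userspace (smaller struct): missing fields read as zero.
    let older = vec![0xBBu8; 96];
    let view = copy_in_attr(&older, 136);
    assert_eq!(view[95], 0xBB);
    assert_eq!(view[96], 0);
}
```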
/// Performance event attributes — wire-format compatible with Linux perf_event_attr.
/// Size: 136 bytes (PERF_ATTR_SIZE_VER8). Must not grow without a corresponding
/// bump of the `size` versioning protocol.
///
/// # Safety
///
/// This struct is shared with userspace via the `perf_event_open` syscall.
/// All fields must be read with copy-in (via `copy_from_user`), never via raw pointer.
#[repr(C)]
pub struct PerfEventAttr {
/// Event type (PERF_TYPE_*). Selects which PMU driver handles this event.
pub event_type: u32,
/// sizeof(perf_event_attr) as seen by the calling binary. Kernel uses this
/// to determine which fields are present. See backward/forward compat above.
pub size: u32,
/// Event-specific configuration word. Meaning depends on event_type:
/// - HARDWARE: PERF_COUNT_HW_* constant
/// - SOFTWARE: PERF_COUNT_SW_* constant
/// - TRACEPOINT: tracepoint ID from umkafs/tracefs
/// - HW_CACHE: encoded cache level/op/result (see Section 19.7.3)
/// - RAW: raw MSR event code, programmed directly into PMU
pub config: u64,
/// If FREQ flag set: sampling frequency in Hz (kernel adjusts period dynamically).
/// If FREQ flag clear: count between samples (fixed period).
pub sample_period_or_freq: u64,
/// PERF_SAMPLE_* bitmask. Determines what data is recorded per sample.
pub sample_type: u64,
/// PERF_FORMAT_* bitmask. Determines layout of data returned by read().
pub read_format: u64,
/// Packed bitfield of boolean event modifiers (disabled, inherit, pinned, …).
pub flags: PerfEventFlags,
/// If PERF_WATERMARK: ring buffer watermark in bytes before wakeup.
/// Otherwise: number of samples before wakeup.
pub wakeup_events_or_watermark: u32,
/// Breakpoint type (PERF_TYPE_BREAKPOINT only): HW_BREAKPOINT_X/R/W/RW.
pub bp_type: u32,
/// For BREAKPOINT: breakpoint address. For RAW/HW_CACHE: extended config1.
pub bp_addr_or_config1: u64,
/// For BREAKPOINT: breakpoint length (1/2/4/8). For RAW/HW_CACHE: config2.
pub bp_len_or_config2: u64,
/// PERF_SAMPLE_BRANCH_* bitmask for branch stack sampling (LBR on x86-64).
pub branch_sample_type: u64,
/// User-space register mask for PERF_SAMPLE_REGS_USER.
pub sample_regs_user: u64,
/// Stack dump size in bytes for PERF_SAMPLE_STACK_USER (max 65528).
pub sample_stack_user: u32,
/// Clock source ID for PERF_SAMPLE_TIME. -1 = default (CLOCK_MONOTONIC_RAW).
pub clockid: i32,
/// Interrupt-context register mask for PERF_SAMPLE_REGS_INTR.
pub sample_regs_intr: u64,
/// AUX ring buffer watermark in bytes (for Intel PT / ARM SPE).
pub aux_watermark: u32,
/// Maximum callchain depth for PERF_SAMPLE_CALLCHAIN.
pub sample_max_stack: u16,
pub _reserved_2: u16,
/// AUX sample size limit (Intel PT / ARM SPE per-sample cap).
pub aux_sample_size: u32,
pub _reserved_3: u32,
/// Signal data for PERF_SAMPLE_SIGTRAP (sent to task on overflow).
pub sig_data: u64,
/// Extended event config word (Intel: config3 = extended PEBS options).
pub config3: u64,
}
bitflags! {
/// Boolean modifiers packed into a u64 bitfield within perf_event_attr.
/// Bit positions match Linux exactly.
pub struct PerfEventFlags: u64 {
/// Event starts disabled; requires PERF_EVENT_IOC_ENABLE.
const DISABLED = 1 << 0;
/// Children created via fork() inherit this event.
const INHERIT = 1 << 1;
/// Pin event to PMU slot; return EBUSY if unavailable.
const PINNED = 1 << 2;
/// Exclusive use of PMU; no other events on this CPU while active.
const EXCLUSIVE = 1 << 3;
/// Count only in user space (ring 3).
const EXCLUDE_USER = 1 << 4;
/// Count only in kernel space (ring 0).
const EXCLUDE_KERNEL = 1 << 5;
/// Exclude hypervisor events.
const EXCLUDE_HV = 1 << 6;
/// Exclude idle task.
const EXCLUDE_IDLE = 1 << 7;
/// Enable ring buffer mmap().
const MMAP = 1 << 8;
/// Include comm events (task name changes).
const COMM = 1 << 9;
/// Use sampling frequency (sample_period_or_freq = Hz), not fixed period.
const FREQ = 1 << 10;
/// Inherit counter value across fork (not just enable/disable).
const INHERIT_STAT = 1 << 11;
/// Enable event on exec.
const ENABLE_ON_EXEC = 1 << 12;
/// Emit PERF_RECORD_TASK on fork/exit.
const TASK = 1 << 13;
/// Use ring buffer watermark (wakeup_events_or_watermark = bytes).
const WATERMARK = 1 << 14;
/// Precise IP level 1 (PEBS on x86-64): allow constant skid.
const PRECISE_IP_1 = 1 << 15;
/// Precise IP level 2: request zero skid (hardware best-effort).
const PRECISE_IP_2 = 1 << 16;
/// Include mmap records for executable mappings.
const MMAP_DATA = 1 << 17;
/// Include sample_id in all non-SAMPLE record types.
const SAMPLE_ID_ALL = 1 << 18;
/// Exclude events from non-host (guest) context.
const EXCLUDE_HOST = 1 << 19;
/// Exclude events from host context (count only in guest).
const EXCLUDE_GUEST = 1 << 20;
/// Exclude kernel callchains.
const EXCLUDE_CALLCHAIN_KERNEL = 1 << 21;
/// Exclude user callchains.
const EXCLUDE_CALLCHAIN_USER = 1 << 22;
/// Generate extended mmap records that carry inode and device data.
const MMAP2 = 1 << 23;
/// Emit PERF_RECORD_COMM with exec flag on exec.
const COMM_EXEC = 1 << 24;
/// Use the clock selected by the clockid field for timestamps.
const USE_CLOCKID = 1 << 25;
/// Include context-switch records in the ring buffer.
const CONTEXT_SWITCH = 1 << 26;
/// Write ring buffer from the end (tail first, for overwrite mode).
const WRITE_BACKWARD = 1 << 27;
/// Emit PERF_RECORD_NAMESPACES when a task's namespaces change.
const NAMESPACES = 1 << 28;
/// Emit PERF_RECORD_KSYMBOL for new kernel symbols (eBPF JIT).
const KSYMBOL = 1 << 29;
/// Emit PERF_RECORD_BPF_EVENT for BPF program load/unload.
const BPF_EVENT = 1 << 30;
/// Enable AUX output interleaved in the main ring buffer.
const AUX_OUTPUT = 1 << 31;
/// Emit PERF_RECORD_CGROUP for cgroup switches.
const CGROUP = 1 << 32;
/// Send SIGTRAP to task on sample overflow.
const SIGTRAP = 1 << 37;
}
}
19.7.3 Event Types (PERF_TYPE_*)
The event_type field selects the PMU driver and determines how config is
interpreted.
PERF_TYPE_HARDWARE (0) — CPU hardware counters. The config field selects a
generic hardware event; the PMU driver maps each to the appropriate hardware-specific
event code for the current microarchitecture.
| config | Constant | Description |
|---|---|---|
| 0 | PERF_COUNT_HW_CPU_CYCLES | CPU clock cycles |
| 1 | PERF_COUNT_HW_INSTRUCTIONS | Instructions retired |
| 2 | PERF_COUNT_HW_CACHE_REFERENCES | Last-level cache references |
| 3 | PERF_COUNT_HW_CACHE_MISSES | Last-level cache misses |
| 4 | PERF_COUNT_HW_BRANCH_INSTRUCTIONS | Branch instructions retired |
| 5 | PERF_COUNT_HW_BRANCH_MISSES | Mispredicted branches |
| 6 | PERF_COUNT_HW_BUS_CYCLES | Bus cycles |
| 7 | PERF_COUNT_HW_STALLED_CYCLES_FRONTEND | Cycles stalled in the frontend |
| 8 | PERF_COUNT_HW_STALLED_CYCLES_BACKEND | Cycles stalled in the backend |
| 9 | PERF_COUNT_HW_REF_CPU_CYCLES | Reference CPU cycles (unthrottled) |
If a generic event is unavailable on the host PMU, perf_event_open returns
EOPNOTSUPP.
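The mapping step amounts to a per-microarchitecture lookup. A hedged sketch using the well-known Intel architectural encodings (umask << 8 | event-select); the function name and table shape are illustrative, not the UmkaOS driver:

```rust
/// Generic hardware event selectors (PERF_COUNT_HW_*).
const PERF_COUNT_HW_CPU_CYCLES: u64 = 0;
const PERF_COUNT_HW_INSTRUCTIONS: u64 = 1;
const PERF_COUNT_HW_BRANCH_MISSES: u64 = 5;

/// Map a generic event to a raw IA32_PERFEVTSELx code (umask << 8 | event).
/// The codes are Intel's architectural events; the lookup itself is a sketch.
fn map_generic_to_intel_raw(config: u64) -> Option<u64> {
    match config {
        PERF_COUNT_HW_CPU_CYCLES => Some(0x003c),    // UnHalted Core Cycles
        PERF_COUNT_HW_INSTRUCTIONS => Some(0x00c0),  // Instructions Retired
        PERF_COUNT_HW_BRANCH_MISSES => Some(0x00c5), // Branch Misses Retired
        _ => None, // caller reports EOPNOTSUPP
    }
}

fn main() {
    assert_eq!(map_generic_to_intel_raw(PERF_COUNT_HW_CPU_CYCLES), Some(0x3c));
    assert_eq!(map_generic_to_intel_raw(42), None);
}
```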
PERF_TYPE_SOFTWARE (1) — Kernel software counters. No hardware counter
slots are consumed; counting is performed in software at the corresponding kernel
callsites via per-CPU atomic increments.
config |
Constant | Description |
|---|---|---|
| 0 | PERF_COUNT_SW_CPU_CLOCK |
CPU wall-clock time (CLOCK_MONOTONIC_RAW) |
| 1 | PERF_COUNT_SW_TASK_CLOCK |
Task CPU time (only while task runs) |
| 2 | PERF_COUNT_SW_PAGE_FAULTS |
Page faults (minor + major) |
| 3 | PERF_COUNT_SW_CONTEXT_SWITCHES |
Context switches |
| 4 | PERF_COUNT_SW_CPU_MIGRATIONS |
Task migrations between CPUs |
| 5 | PERF_COUNT_SW_PAGE_FAULTS_MIN |
Minor page faults (no I/O) |
| 6 | PERF_COUNT_SW_PAGE_FAULTS_MAJ |
Major page faults (I/O required) |
| 7 | PERF_COUNT_SW_ALIGNMENT_FAULTS |
Alignment faults (fixup path) |
| 8 | PERF_COUNT_SW_EMULATION_FAULTS |
Emulated instruction faults |
| 10 | PERF_COUNT_SW_DUMMY |
No-op placeholder event |
| 11 | PERF_COUNT_SW_BPF_OUTPUT |
BPF program output via bpf_perf_event_output |
PERF_TYPE_TRACEPOINT (2) — Kernel tracepoints. The config field is the
numeric tracepoint ID exposed through umkafs at
/System/Kernel/Tracing/events/<subsystem>/<name>/id. These IDs match Linux's
tracefs (/sys/kernel/tracing/events/) for tool compatibility. See Section 19.2
for the tracepoint ABI.
PERF_TYPE_HW_CACHE (3) — Cache-level events. The config field encodes three
sub-fields packed into a single u64:
config = cache_id | (cache_op_id << 8) | (cache_result_id << 16)
| Sub-field | Values |
|---|---|
| cache_id | 0=L1D, 1=L1I, 2=LLC, 3=DTLB, 4=ITLB, 5=BPU, 6=NODE |
| cache_op_id | 0=READ, 1=WRITE, 2=PREFETCH |
| cache_result_id | 0=ACCESS, 1=MISS |
Not all combinations are available on all microarchitectures. Unavailable
combinations return EOPNOTSUPP.
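The packing is a direct bit-or. A minimal sketch (`hw_cache_config` is an illustrative helper name):

```rust
/// Pack the three HW_CACHE sub-fields into one config word:
/// config = cache_id | (cache_op_id << 8) | (cache_result_id << 16)
fn hw_cache_config(cache_id: u64, cache_op_id: u64, cache_result_id: u64) -> u64 {
    cache_id | (cache_op_id << 8) | (cache_result_id << 16)
}

fn main() {
    // L1D (0) READ (0) MISS (1): L1 data cache read misses.
    assert_eq!(hw_cache_config(0, 0, 1), 0x1_0000);
    // LLC (2) WRITE (1) ACCESS (0).
    assert_eq!(hw_cache_config(2, 1, 0), 0x0102);
}
```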
PERF_TYPE_RAW (4) — Raw PMU event. The config field is the raw event-select
value programmed directly into the PMU's event-select register (e.g.,
IA32_PERFEVTSELx on Intel x86-64, PMEVTYPER<n>_EL0 on AArch64). Format is
PMU-specific and documented in the CPU vendor's software optimization manual.
Requires perf_event_paranoid ≤ 1 or CAP_PERFMON.
PERF_TYPE_BREAKPOINT (5) — Hardware data/instruction breakpoint. The
bp_addr_or_config1 field is the watch address, bp_len_or_config2 is the access
width (1, 2, 4, or 8 bytes), and bp_type is the access type:
| bp_type | Value | Meaning |
|---|---|---|
| HW_BREAKPOINT_EMPTY | 0 | Breakpoint disabled |
| HW_BREAKPOINT_R | 1 | Read watchpoint |
| HW_BREAKPOINT_W | 2 | Write watchpoint |
| HW_BREAKPOINT_RW | 3 | Read or write watchpoint |
| HW_BREAKPOINT_X | 4 | Instruction execution breakpoint |
Breakpoints use hardware debug registers (DR0-DR3 on x86-64,
DBGBCRn_EL1/DBGWCRn_EL1 on AArch64, DBGBCRn/DBGWCRn on ARMv7). Execution
breakpoints fire before the instruction commits; data watchpoints trap after the
triggering load/store completes.
PERF_TYPE_MAX (6+) — Dynamic PMU types registered by drivers at boot time.
Examples: Intel PT (Processor Trace), ARM SPE (Statistical Profiling Extension),
platform uncore PMUs (memory controllers, PCIe root complexes). Each dynamic PMU
has a type number allocated by pmu_register() at driver init time; userspace
discovers it via /sys/bus/event_source/devices/<name>/type.
19.7.4 Sample Type Flags (PERF_SAMPLE_*)
The sample_type bitmask in perf_event_attr determines what data is recorded in
each ring buffer sample. Each set bit appends a fixed or variable-length field to
the PERF_RECORD_SAMPLE record, in the order listed below. The same sample_type
value is stored in perf.data headers so that perf report and other tools know
the record layout without kernel involvement.
| Bit | Value | Field added to sample record |
|---|---|---|
| PERF_SAMPLE_IP | 0x0000001 | Instruction pointer at time of sample |
| PERF_SAMPLE_TID | 0x0000002 | pid (process ID) and tid (thread ID) as two u32 words |
| PERF_SAMPLE_TIME | 0x0000004 | Timestamp in nanoseconds (CLOCK_MONOTONIC_RAW by default) |
| PERF_SAMPLE_ADDR | 0x0000008 | Faulted virtual address (memory events, requires hardware support) |
| PERF_SAMPLE_READ | 0x0000010 | Counter values at sample time (layout per read_format) |
| PERF_SAMPLE_CALLCHAIN | 0x0000020 | Kernel and user callchain (depth limited by sample_max_stack) |
| PERF_SAMPLE_ID | 0x0000040 | Unique event ID (for matching records across multiplexed events) |
| PERF_SAMPLE_CPU | 0x0000080 | CPU number and a reserved padding word |
| PERF_SAMPLE_PERIOD | 0x0000100 | Sample period at the time of the overflow |
| PERF_SAMPLE_STREAM_ID | 0x0000200 | Group leader event ID |
| PERF_SAMPLE_RAW | 0x0000400 | Raw tracepoint data bytes (for PERF_TYPE_TRACEPOINT) |
| PERF_SAMPLE_BRANCH_STACK | 0x0000800 | Branch stack (LBR on Intel x86-64, BRBE on AArch64) |
| PERF_SAMPLE_REGS_USER | 0x0001000 | User-space register file (mask from sample_regs_user) |
| PERF_SAMPLE_STACK_USER | 0x0002000 | User stack dump (up to sample_stack_user bytes) |
| PERF_SAMPLE_WEIGHT | 0x0004000 | Event weight / memory access latency in cycles |
| PERF_SAMPLE_DATA_SRC | 0x0008000 | Memory data source (L1/L2/LLC/DRAM/remote hit/miss encoding) |
| PERF_SAMPLE_IDENTIFIER | 0x0010000 | Duplicate event ID placed first in record (for fast demux) |
| PERF_SAMPLE_TRANSACTION | 0x0020000 | TSX transaction flags and abort reason |
| PERF_SAMPLE_REGS_INTR | 0x0040000 | Register file at interrupt time (mask from sample_regs_intr) |
| PERF_SAMPLE_PHYS_ADDR | 0x0080000 | Physical address of PERF_SAMPLE_ADDR (requires CAP_SYS_ADMIN) |
| PERF_SAMPLE_AUX | 0x0100000 | AUX area data appended inline in sample |
| PERF_SAMPLE_CGROUP | 0x0200000 | Cgroup ID at sample time |
| PERF_SAMPLE_DATA_PAGE_SIZE | 0x0400000 | Page size backing the sampled data address |
| PERF_SAMPLE_CODE_PAGE_SIZE | 0x0800000 | Page size backing the sample IP |
| PERF_SAMPLE_WEIGHT_STRUCT | 0x1000000 | Extended weight struct (latency, type, variance fields) |
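Because each set bit appends its field in a fixed order, a profiling tool composes sample_type by or-ing bits. A sketch with the subset a typical call-graph profile needs (constant values as defined above):

```rust
// PERF_SAMPLE_* bits (subset).
const PERF_SAMPLE_IP: u64 = 0x0000001;
const PERF_SAMPLE_TID: u64 = 0x0000002;
const PERF_SAMPLE_TIME: u64 = 0x0000004;
const PERF_SAMPLE_CALLCHAIN: u64 = 0x0000020;

fn main() {
    // What `perf record -g`-style sampling would request.
    let sample_type =
        PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME | PERF_SAMPLE_CALLCHAIN;
    assert_eq!(sample_type, 0x27);
    // Resulting record layout: ip, then pid/tid, then time, then callchain.
}
```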
19.7.5 Internal Data Structures
The internal design uses three layers: per-event state (PerfEvent), per-CPU
hardware context (PerfEventContext), and an optional task-level overlay
(TaskPerfCtx). UmkaOS uses per-CPU contexts as the primary scheduling unit,
matching PMU hardware reality. Task-pinned events attach an overlay on top of the
CPU context and are swapped in/out on each context switch.
// umka-core/src/perf/event.rs
/// Per-event kernel object. One instance per open perf fd.
///
/// Reference-counted via `Arc<PerfEvent>`. Weak references are used for group
/// sibling links to avoid cycles: the group leader holds `Arc` references to
/// siblings, and siblings hold a `Weak` back to the leader.
pub struct PerfEvent {
/// Copy of userspace-supplied attributes, validated at open time.
pub attr: PerfEventAttr,
/// Static reference to the PMU driver that owns this event.
/// Immutable after open; never null.
pub pmu: &'static dyn PmuOps,
/// The per-CPU context that owns this event.
pub ctx: Arc<PerfEventContext>,
/// Group leader (None if this event is its own leader).
pub group_leader: Option<Weak<PerfEvent>>,
/// Sibling events in the same group (only populated for group leaders).
pub siblings: Mutex<Vec<Weak<PerfEvent>>>,
/// Ring buffer for sample data. None if `attr.mmap` was not set.
/// Shared with userspace via Arc; same physical pages, no data copy.
pub ring_buffer: Option<Arc<PerfRingBuffer>>,
/// Current scheduling state (PerfEventState).
pub state: AtomicU32,
/// Accumulated event count. Updated on context-out and explicit reads.
pub count: AtomicU64,
/// Total nanoseconds this event has been enabled (for multiplexing scale).
pub time_enabled_ns: AtomicU64,
/// Total nanoseconds this event has run on PMU hardware.
pub time_running_ns: AtomicU64,
/// Optional overflow callback. Invoked from per-CPU kthread (Stage 2).
pub overflow_handler: Option<PerfOverflowFn>,
/// CPU this event is pinned to, or -1 to follow the owning task.
pub cpu: i32,
/// Hardware counter slot index assigned by the PMU driver.
/// -1 means the event is not currently scheduled onto hardware.
pub hw_counter_idx: i32,
/// Unique event ID assigned at open time (used with PERF_FORMAT_ID).
pub id: u64,
}
/// Scheduling state of a PerfEvent.
#[repr(u32)]
pub enum PerfEventState {
/// Explicitly disabled by userspace (attr.disabled or IOC_DISABLE).
Off = 0,
/// Desired but hardware resource unavailable (slot conflict).
Error = 1,
/// Enabled but not currently on PMU: preempted by multiplexing.
Inactive = 2,
/// Currently programmed into a hardware counter and counting.
Active = 3,
}
/// Overflow callback type.
///
/// Called from the per-CPU perf sampler kthread (not NMI context).
/// May allocate memory and acquire non-NMI-safe locks.
pub type PerfOverflowFn = fn(event: &Arc<PerfEvent>, regs: &SavedRegs, count: u64);
// umka-core/src/perf/context.rs
/// Maximum hardware performance counters simultaneously active on one CPU.
/// Intel: 4-8 general + 3 fixed = 11 max. AMD: 6 general = 6 max.
/// ARM PMUv3: 6 general + 1 cycle = 7 max. Use 16 as safe upper bound.
pub const PERF_MAX_ACTIVE: usize = 16;
/// Per-CPU PMU context.
///
/// One instance per CPU, allocated at boot time and stored in the CpuLocal area
/// (see Section 3.2). There is no global PMU lock. Each CPU context is protected
/// by its own lock. Cross-CPU operations (e.g., reading a task event from a
/// remote CPU) use IPI + per-CPU lock sequences.
///
/// # Hot Path
///
/// `active[0..active_count]` is accessed on **every context switch**
/// (`perf_schedule_in` / `perf_schedule_out`). This path MUST be lock-free.
/// The previous design used `SpinlockIrq<Vec<Arc<PerfEvent>>>` which acquired a
/// lock on every context switch — unacceptable at context switch frequencies of
/// 100k–1M/sec/CPU. The new design uses a fixed-size array of raw event pointers
/// with an atomic count so the context switch path never acquires a lock, never
/// touches a reference count, and never touches the heap allocator.
pub struct PerfEventContext {
/// CPU index this context belongs to.
pub cpu: u32,
/// Active event pointers: `active[0..active_count]` are valid non-null
/// pointers to `PerfEvent` objects kept alive by `events_lock.active_events`.
///
/// Lock-free read during context switch: load `active_count` with
/// `Ordering::Acquire`, then iterate `active[0..count]`.
///
/// Written only during event add/remove (infrequent), under `events_lock`.
/// Removal decrements `active_count` with `Ordering::Release` BEFORE
/// clearing the pointer, ensuring lockless readers never see a stale index.
pub active: [*const PerfEvent; PERF_MAX_ACTIVE],
/// Number of valid entries in `active[]`. Atomic so the context switch path
/// can read it without holding `events_lock`.
pub active_count: AtomicU32,
/// Lock protecting modifications to `active[]` and the mutable event lists.
/// NOT held during context switch — only during event add/remove/rotate.
pub events_lock: Mutex<PerfEventMutable>,
/// Number of general-purpose hardware counter slots on this CPU.
pub hpc_slots: u8,
/// Number of fixed-function counter slots (e.g., Intel fixed PMC0-2).
pub fixed_slots: u8,
/// Task-level overlay: swapped on context switch for pid-pinned events.
/// None if no task-pinned events are active on this CPU.
pub task_ctx: Option<Arc<TaskPerfCtx>>,
/// Total nanoseconds this context has been enabled.
pub time_enabled_ns: AtomicU64,
/// Total nanoseconds events have been running on PMU hardware.
pub time_running_ns: AtomicU64,
/// Per-CPU kthread handle for async sample processing (Section 19.7.9).
pub sampler_thread: KthreadHandle,
/// SPSC ring from NMI handler to sampler kthread (Stage 1 → Stage 2).
pub raw_sample_queue: SpscRing<RawSample, RAW_SAMPLE_QUEUE_DEPTH>,
/// Bitmask of hardware counter slots currently in use.
pub enabled_counter_mask: u64,
}
/// Mutable event state, modified only under `PerfEventContext::events_lock`.
pub struct PerfEventMutable {
/// Authoritative list of active events (keeps `Arc<PerfEvent>` alive so that
/// the raw pointers in `PerfEventContext::active[]` remain valid).
pub active_events: ArrayVec<Arc<PerfEvent>, PERF_MAX_ACTIVE>,
/// Events waiting for a hardware counter slot (multiplexing queue).
pub pending_events: VecDeque<Arc<PerfEvent>>,
}
// SAFETY: `PerfEventContext::active[]` contains raw pointers to `PerfEvent`
// objects. Those objects are kept alive by `Arc<PerfEvent>` entries in
// `PerfEventMutable::active_events`. Raw pointers are set before `active_count`
// is incremented (Release store) and cleared only after `active_count` is
// decremented (Release store), so lockless readers always see consistent state.
unsafe impl Send for PerfEventContext {}
unsafe impl Sync for PerfEventContext {}
/// Task-level PMU overlay.
///
/// Created on the first `perf_event_open` with `pid != -1` for that task.
/// Stored in the `Task` struct and swapped into the owning CPU context on
/// schedule-in; removed on schedule-out.
pub struct TaskPerfCtx {
/// Back-reference to the owning task (Weak to avoid cycle).
pub task: Weak<Task>,
/// Events tracking this specific task (pid-pinned).
pub task_events: Mutex<Vec<Arc<PerfEvent>>>,
}
/// Capacity of the raw-sample SPSC queue (entries; must be a power of 2).
/// Sized to absorb bursts at the maximum PMI rate before the kthread drains it.
/// At 100 kHz max sample rate and 1 ms kthread wake latency: 100 entries minimum.
/// 512 provides comfortable headroom for multi-counter simultaneous overflow.
const RAW_SAMPLE_QUEUE_DEPTH: usize = 512;
/// Minimal sample record written by the NMI handler (Stage 1).
/// Transferred to the sampler kthread for full record construction (Stage 2).
#[repr(C)]
pub struct RawSample {
/// Instruction pointer at time of PMI.
pub ip: u64,
/// Stack pointer (for user stack unwind in kthread).
pub sp: u64,
/// Frame pointer (0 if frame pointer elimination is active).
pub fp: u64,
/// Nanosecond timestamp from the timekeeping fast path (no lock required).
pub timestamp_ns: u64,
/// CPU number.
pub cpu: u32,
/// PID of the interrupted task.
pub pid: u32,
/// TID of the interrupted task.
pub tid: u32,
/// Saved general-purpose registers at time of interrupt.
pub regs: SavedRegs,
/// Index into `active[]` identifying the overflowed counter.
pub event_idx: u16,
}
Context switch fast path. perf_schedule_out and perf_schedule_in are called
on every context switch for every CPU that has active perf events. They MUST be
lock-free. The fixed-size active[] array and atomic active_count make this
possible: the context switch path reads the count once with Ordering::Acquire,
then iterates the raw pointer array — no lock acquisition, no reference count
manipulation, no heap access.
// umka-core/src/perf/context.rs (context switch fast path)
/// Called on context switch out — removes the outgoing task's events from PMU hardware.
///
/// # Safety
///
/// The caller (scheduler) guarantees no concurrent modification of `active[]`
/// during this call: `events_lock` cannot be held across a `schedule()` call
/// because `schedule()` is not nestable with `events_lock`. The `active_count`
/// Acquire load pairs with the Release store in `perf_event_add`/`perf_event_del`,
/// ensuring the pointer writes are visible before the count is visible here.
pub fn perf_schedule_out(ctx: &PerfEventContext) {
let count = ctx.active_count.load(Ordering::Acquire) as usize;
// SAFETY: active[0..count] are valid non-null pointers (invariant maintained
// by events_lock: pointers are written before count is incremented, and
// count is decremented before pointers are cleared).
for i in 0..count {
let event = unsafe { &*ctx.active[i] };
event.pmu.event_stop(event, PERF_EF_UPDATE);
event.pmu.event_del(event, 0);
}
}
/// Called on context switch in — programs the incoming task's events into PMU hardware.
///
/// # Safety
///
/// Same invariants as `perf_schedule_out`.
pub fn perf_schedule_in(ctx: &PerfEventContext) {
let count = ctx.active_count.load(Ordering::Acquire) as usize;
// SAFETY: active[0..count] are valid non-null pointers (same invariant).
for i in 0..count {
let event = unsafe { &*ctx.active[i] };
event.pmu.event_add(event, PERF_EF_START).ok();
}
}
Event add slow path. Adding and removing events is infrequent (user calls
perf_event_open / closes the fd). These paths hold events_lock and update both
the Arc-owning active_events list (which keeps the objects alive) and the raw
pointer array read by the fast path:
// umka-core/src/perf/context.rs (event add/remove slow paths)
/// Add a new event to this CPU context.
///
/// Slow path: acquires `events_lock`. Must not be called from context switch
/// or NMI context.
pub fn perf_event_add(ctx: &PerfEventContext, event: Arc<PerfEvent>) -> Result<()> {
let mut mutable = ctx.events_lock.lock();
if mutable.active_events.len() >= PERF_MAX_ACTIVE {
// No free HW slot: queue for multiplexing rotation.
mutable.pending_events.push_back(event);
return Ok(());
}
// Initialize and start the hardware counter.
event.pmu.event_init(&event)?;
event.pmu.event_add(&event, PERF_EF_START)?;
let idx = mutable.active_events.len();
// Store raw pointer BEFORE incrementing count (Release ordering on the
// count store pairs with Acquire on the fast-path load).
ctx.active[idx] = Arc::as_ptr(&event);
mutable.active_events.push(event);
// Release store: makes the pointer write visible to lockless readers
// before they can observe the incremented count.
ctx.active_count.fetch_add(1, Ordering::Release);
Ok(())
}
/// Remove an event from this CPU context.
///
/// Slow path: acquires `events_lock`. Decrements `active_count` with Release
/// BEFORE removing the pointer so lockless readers at the old count can still
/// safely dereference `active[old_count - 1]`.
pub fn perf_event_del(ctx: &PerfEventContext, event: &Arc<PerfEvent>) -> Result<()> {
let mut mutable = ctx.events_lock.lock();
let pos = mutable.active_events
.iter()
.position(|e| Arc::ptr_eq(e, event))
.ok_or(Error::NotFound)?;
event.pmu.event_stop(event, PERF_EF_UPDATE);
event.pmu.event_del(event, 0);
// Swap-remove to fill the hole: move the last pointer into pos.
let last = mutable.active_events.len() - 1;
// Decrement count first (Release) so lockless readers stop at `last`.
ctx.active_count.fetch_sub(1, Ordering::Release);
// Now fill the hole with the last element's pointer (safe: readers
// use the decremented count and never reach index `last` again).
ctx.active[pos] = Arc::as_ptr(&mutable.active_events[last]);
mutable.active_events.swap_remove(pos);
// Promote a pending event if one is waiting.
if let Some(next) = mutable.pending_events.pop_front() {
let _ = perf_event_add_locked(ctx, &mut mutable, next);
}
Ok(())
}
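The Release/Acquire publish protocol between the slow path and the lockless readers can be demonstrated in miniature. This is an illustrative model only — `Slots`, `publish`, and `read_all` are invented names, and `AtomicPtr` stands in for the raw `active[]` slots (which in the kernel are additionally serialized on the writer side by `events_lock`):

```rust
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

struct Slots {
    active: [AtomicPtr<u64>; 4],
    count: AtomicUsize,
}

impl Slots {
    fn publish(&self, p: *mut u64) {
        let idx = self.count.load(Ordering::Relaxed);
        // Write the pointer first...
        self.active[idx].store(p, Ordering::Relaxed);
        // ...then publish with a Release store on the count. A reader that
        // Acquire-loads the new count is guaranteed to see the pointer.
        self.count.store(idx + 1, Ordering::Release);
    }

    fn read_all(&self) -> u64 {
        // Lockless reader: Acquire-load the count, then walk the pointers.
        let n = self.count.load(Ordering::Acquire);
        let mut sum = 0;
        for i in 0..n {
            let p = self.active[i].load(Ordering::Relaxed);
            // SAFETY: publish() stores only valid pointers below the count.
            sum += unsafe { *p };
        }
        sum
    }
}

fn main() {
    let slots = Slots {
        active: std::array::from_fn(|_| AtomicPtr::new(std::ptr::null_mut())),
        count: AtomicUsize::new(0),
    };
    let a = Box::into_raw(Box::new(10u64));
    let b = Box::into_raw(Box::new(32u64));
    slots.publish(a);
    slots.publish(b);
    assert_eq!(slots.read_all(), 42);
    println!("sum = {}", slots.read_all());
    unsafe { drop(Box::from_raw(a)); drop(Box::from_raw(b)); }
}
```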
// umka-core/src/perf/ring_buffer.rs
/// Shared ring buffer between kernel and userspace.
///
/// The same physical pages are mapped twice: into the kernel's direct-map for
/// writing, and into the userspace address space (via mmap) for reading.
/// No data is ever copied between the two mappings.
pub struct PerfRingBuffer {
/// Pointer to the 4 KB mmap header page. Writable by kernel, mapped
/// read-only into userspace as the first page of the mmap region.
pub header: NonNull<PerfMmapPage>,
/// Pointer to the data pages. Writable by kernel.
/// `data_size` must be a power of 2.
pub data: NonNull<[u8]>,
/// Size of the data region in bytes (power of 2).
pub data_size: usize,
/// Kernel write position in bytes (wraps mod data_size).
/// Written with Release ordering; userspace reads with Acquire.
pub data_head: AtomicU64,
/// Optional AUX buffer for Intel PT / ARM SPE trace output.
/// Tuple: (pointer to AUX pages, size in bytes).
pub aux_data: Option<(NonNull<[u8]>, usize)>,
/// Kernel AUX write position.
pub aux_head: AtomicU64,
/// Physical page descriptors for the data region. Kept alive until the
/// last Arc reference to this PerfRingBuffer is dropped, which may be
/// after the perf fd is closed if userspace still has an mmap open.
pub pages: Vec<Arc<PhysPage>>,
/// Samples dropped because the ring buffer was full.
pub lost_samples: AtomicU64,
}
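The power-of-2 `data_size` requirement lets both sides compute buffer positions with a mask instead of a modulo, while `data_head` and `data_tail` increase monotonically and never wrap back. A sketch of the arithmetic (helper names are illustrative):

```rust
/// Free bytes in the ring: capacity minus outstanding (unconsumed) bytes.
/// head and tail grow monotonically; wrapping_sub handles u64 wraparound.
fn ring_free_space(data_head: u64, data_tail: u64, data_size: u64) -> u64 {
    debug_assert!(data_size.is_power_of_two());
    data_size - data_head.wrapping_sub(data_tail)
}

/// Byte offset into the data region for the next write.
fn write_offset(data_head: u64, data_size: u64) -> u64 {
    data_head & (data_size - 1) // power-of-2 mask replaces `% data_size`
}

fn main() {
    let size = 64 * 1024u64;
    // Kernel has written 70000 bytes total; userspace has consumed 10000.
    let (head, tail) = (70_000u64, 10_000u64);
    assert_eq!(ring_free_space(head, tail, size), size - 60_000);
    assert_eq!(write_offset(head, size), 70_000 % size);
    println!("{}", ring_free_space(head, tail, size));
}
```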
19.7.6 perf_event_mmap_page Header
The first 4 KB page of the mmap() region is the PerfMmapPage header. It
exposes kernel-maintained state to userspace without system calls: ring buffer
positions, time-conversion coefficients, and an optional direct PMC read shortcut
(rdpmc on x86-64).
The layout exactly matches Linux's perf_event_mmap_page:
/// First page of the perf mmap region — stable ABI shared with userspace.
///
/// Fields in the first 1 KB are stable across kernel versions.
/// Fields at and beyond offset 1 KB are architecture-specific PMC shortcuts;
/// userspace must check `capabilities` before using them.
#[repr(C)]
pub struct PerfMmapPage {
/// Kernel ABI version. Currently 0. Userspace checks on open.
pub version: u32,
/// Compatibility version. Must be 0 for userspace to proceed.
pub compat_version: u32,
/// Sequence lock word. Userspace must retry if this changes during read.
pub lock: u32,
/// PMC index for `rdpmc` (1-based; 0 = not available).
pub index: u32,
/// Offset added to `rdpmc` result to get the signed event count.
pub offset: i64,
/// Total time this event has been enabled (nanoseconds).
pub time_enabled: u64,
/// Total time this event has been running on PMU (nanoseconds).
pub time_running: u64,
/// Capability flags:
/// bit 0 = cap_bit0 (always 1; reserved)
/// bit 1 = cap_user_time (time_* fields are valid)
/// bit 2 = cap_user_rdpmc (index/offset are valid for rdpmc)
/// bit 3 = cap_user_time_zero (time_zero is valid)
pub capabilities: u64,
/// PMC bit width (for sign-extending the rdpmc result).
pub pmc_width: u16,
/// Shift for TSC-to-nanosecond conversion: ns = (tsc * time_mult) >> time_shift.
pub time_shift: u16,
/// Multiplier for TSC-to-nanosecond conversion.
pub time_mult: u32,
/// Nanosecond offset added after TSC scaling.
pub time_offset: i64,
/// Nanosecond base time at the TSC reference point for time_zero.
pub time_zero: u64,
/// Size of this struct in bytes (currently 4096).
pub size: u32,
pub _reserved_1: u32,
/// TSC snapshot used as the reference for time_zero.
pub time_cycles: u64,
/// TSC mask for CPUs with < 64-bit TSC.
pub time_mask: u64,
/// Padding to 1 KB boundary.
pub _pad: [u8; 928],
// Ring buffer control — at offset 1024 (0x400):
/// Write head in bytes. Updated by kernel with Release ordering.
pub data_head: AtomicU64,
/// Read tail in bytes. Updated by userspace with Release ordering.
pub data_tail: AtomicU64,
/// Byte offset from mmap start to first data byte (= 4096 = one page).
pub data_offset: u64,
/// Data region size in bytes.
pub data_size: u64,
// AUX ring buffer control (Intel PT / ARM SPE):
/// AUX write head. Updated by kernel.
pub aux_head: AtomicU64,
/// AUX read tail. Updated by userspace.
pub aux_tail: AtomicU64,
/// Byte offset from mmap start to first AUX byte.
pub aux_offset: u64,
/// AUX region size in bytes.
pub aux_size: u64,
}
Counting without syscalls — For events where cap_user_rdpmc = 1, userspace
can read the PMU counter directly using the rdpmc instruction (x86-64) or the
equivalent on other architectures:
/* x86-64 userspace pseudocode — no syscall required */
uint64_t perf_rdpmc(struct perf_event_mmap_page *pc) {
uint32_t seq, idx;
int64_t count;
do {
seq = __atomic_load_n(&pc->lock, __ATOMIC_ACQUIRE);
idx = pc->index;
count = pc->offset;
if (idx)
count += (int64_t)__builtin_ia32_rdpmc(idx - 1);
} while (__atomic_load_n(&pc->lock, __ATOMIC_ACQUIRE) != seq);
return (uint64_t)count;
}
This seqlock protocol is identical to Linux's; perf stat uses it to achieve
near-zero overhead in counting mode.
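Userspace can likewise convert raw TSC readings to kernel-clock nanoseconds using the `time_mult` / `time_shift` / `time_zero` / `time_cycles` fields published in the header (after checking `cap_user_time_zero`). A minimal sketch of the conversion formula documented in the struct comments above; the mult/shift values for a hypothetical 3 GHz TSC are illustrative:

```rust
/// Convert a TSC delta to nanoseconds using the mmap-page coefficients,
/// per the documented formula: ns = (tsc * time_mult) >> time_shift.
/// A 128-bit intermediate avoids overflow for large deltas.
fn tsc_delta_to_ns(tsc_delta: u64, time_mult: u32, time_shift: u16) -> u64 {
    ((tsc_delta as u128 * time_mult as u128) >> time_shift) as u64
}

/// Absolute kernel-clock time from a raw TSC read, anchored at
/// (time_zero, time_cycles).
fn tsc_to_ns(tsc: u64, time_zero: u64, time_cycles: u64,
             time_mult: u32, time_shift: u16) -> u64 {
    time_zero + tsc_delta_to_ns(tsc.wrapping_sub(time_cycles), time_mult, time_shift)
}

fn main() {
    // Hypothetical 3 GHz TSC: with time_shift = 31,
    // time_mult = round(1e9 * 2^31 / 3e9) = 715827883.
    let (mult, shift) = (715_827_883u32, 31u16);
    let ns = tsc_delta_to_ns(3_000_000_000, mult, shift); // one second of cycles
    assert!((999_999_000..=1_000_001_000).contains(&ns));
    assert_eq!(tsc_to_ns(3_000_000_000, 500, 0, mult, shift), ns + 500);
    println!("{}", ns);
}
```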
19.7.7 PMU Driver Trait (PmuOps)
UmkaOS replaces Linux's struct pmu (a C struct of nullable function pointers with
manual lifetime discipline) with the PmuOps trait. The trait enforces at compile
time that every required method is implemented and that the driver is
Send + Sync + 'static, making per-CPU access safe without runtime checks.
// umka-core/src/perf/pmu.rs
/// PMU hardware driver interface.
///
/// Implementors are registered at boot via `pmu_register()` and live for the
/// lifetime of the kernel. All methods are called from process context unless
/// explicitly marked as interrupt-disabled paths.
pub trait PmuOps: Send + Sync + 'static {
/// PMU driver name, e.g. "intel-core", "arm-pmuv3", "amd-core".
fn name(&self) -> &'static str;
/// Dynamic PMU type number allocated by `pmu_register()`.
/// For built-in types (HARDWARE, SOFTWARE, etc.) returns the fixed constant.
fn pmu_type(&self) -> u32;
/// Validate and initialize a newly created event for this PMU.
///
/// Called once at `perf_event_open()` time, from process context.
/// Must verify `event.attr.config` is a valid event code for this PMU,
/// map generic `PERF_TYPE_HARDWARE` codes to hardware-specific event selects,
/// allocate any per-event private state, and set `event.hw_counter_idx = -1`.
fn event_init(&self, event: &mut PerfEvent) -> Result<(), PmuError>;
/// Schedule event onto the PMU hardware.
///
/// Called with interrupts disabled on the schedule-in path or when the
/// multiplexer installs this event. Assigns a free hardware counter slot
/// and programs the event-select and counter-value MSRs/CSRs/SPRs.
///
/// `flags` bits:
/// - `PERF_EF_START` (0x01): call `event_start` immediately after adding.
/// - `PERF_EF_RELOAD` (0x02): event is returning from multiplexer; reload
/// the saved partial count (for correct scaling).
fn event_add(&self, event: &mut PerfEvent, flags: u32) -> Result<(), PmuError>;
/// Remove event from the PMU (without destroying it).
///
/// Called with interrupts disabled on schedule-out or when the multiplexer
/// rotates this event out. Must save the current hardware count to
/// `event.count` before clearing the PMU registers.
///
/// `flags` bits:
/// - `PERF_EF_UPDATE` (0x04): accumulate current hardware count into
/// `event.count` before removing.
fn event_del(&self, event: &mut PerfEvent, flags: u32);
/// Start counting (unmask / enable the hardware counter).
///
/// Called after `event_add` when `attr.disabled` is clear, or in response
/// to `PERF_EVENT_IOC_ENABLE`. The counter was already programmed by
/// `event_add`; this call only unmasks it.
fn event_start(&self, event: &mut PerfEvent, flags: u32);
/// Stop counting (mask / disable the hardware counter).
///
/// Called in response to `PERF_EVENT_IOC_DISABLE` or before `event_del`.
/// Does not remove the event from the PMU; `event_start` can resume it.
fn event_stop(&self, event: &mut PerfEvent, flags: u32);
/// Read the current hardware counter value and accumulate into `event.count`.
///
/// Called in response to a `read()` syscall on the perf fd, or before
/// `event_del` with `PERF_EF_UPDATE`. Must be idempotent (calling twice
/// without an intervening `event_start` must not double-count).
fn event_read(&self, event: &mut PerfEvent) -> u64;
/// Context switch in: install the incoming task's events onto the PMU.
///
/// Called by the context-switch path after the task switch completes.
/// Swaps in `ctx.task_ctx` (if any) and calls `event_add` for each
/// task-pinned event.
fn event_schedule_in(&self, ctx: &mut PerfEventContext);
/// Context switch out: remove the outgoing task's events from the PMU.
///
/// Called before the task switch. Calls `event_del` (with `PERF_EF_UPDATE`)
/// for each task-pinned event and removes the task overlay from `ctx.task_ctx`.
fn event_schedule_out(&self, ctx: &mut PerfEventContext);
/// Return static hardware capabilities of this PMU.
fn capabilities(&self) -> PmuCapabilities;
/// Handle a sampling overflow interrupt (NMI on x86-64, PMI on others).
///
/// Called from the architecture interrupt handler. Must do the absolute
/// minimum: push a `RawSample` to `queue` (non-blocking) and reprogram
/// the counter. Full sample construction happens in the sampler kthread.
///
/// Must not: allocate memory, take sleeping locks, or unwind the call stack.
///
/// Default implementation: no-op (for non-sampling PMUs).
fn event_overflow(
&self,
_event: &PerfEvent,
_raw: &mut RawSample,
_queue: &SpscRing<RawSample, RAW_SAMPLE_QUEUE_DEPTH>,
) {
}
}
/// Static capability descriptor for a PMU.
pub struct PmuCapabilities {
/// Number of general-purpose hardware counters.
pub num_gp_counters: u8,
/// Number of fixed-function counters (cycles, instructions, ref-cycles, etc.).
pub num_fixed_counters: u8,
/// Bit width of each counter register (typically 48 on modern CPUs).
pub counter_width: u8,
/// Supports overflow interrupt (sampling mode).
pub supports_sampling: bool,
/// Supports hardware callchain (e.g., Intel LBR, ARM BRBE).
pub supports_callchain: bool,
/// Supports cgroup-scoped filtering in hardware.
pub supports_cgroup_events: bool,
/// Supports precise instruction pointer (PEBS on Intel, SPE on AArch64).
pub supports_precise_ip: bool,
/// Supports branch stack sampling.
pub supports_branch_stack: bool,
/// Maximum branch stack depth (0 if not supported).
pub max_branch_depth: u8,
}
/// Error returned by PMU driver operations.
#[derive(Debug)]
pub enum PmuError {
/// The requested event code is not supported by this PMU.
UnsupportedEvent,
/// All hardware counter slots are occupied.
NoSlotAvailable,
/// Event conflicts with a concurrently scheduled exclusive event.
ExclusiveConflict,
/// PMU hardware returned an error status.
HardwareError(u32),
}
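The start/stop/read lifecycle the trait prescribes can be illustrated with a toy software counter against a reduced version of the trait. Everything here is illustrative — `MiniPmu` is a cut-down stand-in for `PmuOps` (counting methods only), and a real driver would program MSRs rather than flip an atomic flag:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Reduced, illustrative slice of the PmuOps lifecycle:
/// add -> start -> (events occur) -> stop -> read.
trait MiniPmu {
    fn event_start(&self);
    fn event_stop(&self);
    fn event_read(&self) -> u64;
}

struct SoftwareCounter {
    enabled: AtomicU64, // 0 = masked, 1 = counting
    count: AtomicU64,
}

impl SoftwareCounter {
    /// Software events call this from the instrumented code path.
    fn tick(&self) {
        if self.enabled.load(Ordering::Relaxed) == 1 {
            self.count.fetch_add(1, Ordering::Relaxed);
        }
    }
}

impl MiniPmu for SoftwareCounter {
    fn event_start(&self) { self.enabled.store(1, Ordering::Relaxed); }
    fn event_stop(&self)  { self.enabled.store(0, Ordering::Relaxed); }
    // Idempotent, as event_read requires: repeat reads return the same
    // accumulated value without resetting the counter.
    fn event_read(&self) -> u64 { self.count.load(Ordering::Relaxed) }
}

fn main() {
    let c = SoftwareCounter { enabled: AtomicU64::new(0), count: AtomicU64::new(0) };
    c.tick();              // ignored: counter is stopped
    c.event_start();
    for _ in 0..5 { c.tick(); }
    c.event_stop();
    c.tick();              // ignored again
    assert_eq!(c.event_read(), 5);
    assert_eq!(c.event_read(), 5); // idempotent
    println!("{}", c.event_read());
}
```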
19.7.8 Architecture PMU Implementations
Each architecture provides a PmuOps implementation registered at boot by the
architecture hardware initialization code. PMU characteristics are discovered at
runtime (number of counters from CPUID / PMU control registers), not compile-time
constants.
| Architecture | Driver name | GP counters | Fixed counters | Sampling | Precise IP |
|---|---|---|---|---|---|
| x86-64 (Intel, gen ≥ 3) | intel-core | 4–8 (CPUID.0AH) | 3 (cycles, instrs, ref-cycles) | Yes (PEBS) | Yes (PEBS exact IP) |
| x86-64 (AMD Zen2+) | amd-core | 6 | 0 | Yes (NMI on overflow) | No |
| AArch64 (PMUv3) | arm-pmuv3 | ≥6 (PMCR_EL0.N) | 1 (PMCCNTR_EL0) | Yes (GIC PMI) | Yes (with SPE) |
| ARMv7 | arm-pmu | ≥4 (PMCR.N) | 1 (PMCCNTR) | Yes (GIC PMI) | No |
| RISC-V (SBI PMU ext.) | riscv-pmu | Firmware-defined | 3 (HPMCOUNTER3-5) | Firmware-dependent | No |
| PPC64LE (POWER10) | power-pmu | 6 (PMC1-6) | 2 (MMCR0 fixed) | Yes (PM exception) | No |
Intel Core PMU (intel-core).
Counter count and event availability are discovered via CPUID leaf 0x0A
(Architectural Performance Monitoring):
- EAX[15:8]: number of general-purpose counters per logical processor.
- EAX[23:16]: counter bit width.
- EBX[6:0]: bitmask of available architectural events (0 = event available).
Events are programmed into IA32_PERFEVTSELx MSRs (event select, unit mask,
USR/OS filters, edge detect, interrupt enable) and counters read from IA32_PMCx
MSRs. Fixed-function counters use IA32_FIXED_CTR_CTRL and IA32_FIXED_CTRx.
PEBS (Precise Event-Based Sampling) is activated by setting the corresponding bit
in IA32_PEBS_ENABLE; the CPU writes complete PEBS records to a ring buffer in
a kernel-allocated PEBS buffer mapped at a fixed virtual address per logical CPU.
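The IA32_PERFEVTSELx encoding can be sketched with a small helper. The bit positions follow the Intel SDM architectural layout (event select [7:0], unit mask [15:8], USR bit 16, OS bit 17, INT bit 20, EN bit 22); the `evtsel` helper name itself is illustrative:

```rust
// IA32_PERFEVTSELx field layout (Intel architectural performance monitoring).
const EVTSEL_USR: u64 = 1 << 16; // count at CPL > 0 (user mode)
const EVTSEL_OS:  u64 = 1 << 17; // count at CPL 0 (kernel mode)
const EVTSEL_INT: u64 = 1 << 20; // raise PMI on counter overflow
const EVTSEL_EN:  u64 = 1 << 22; // counter enable

/// Build an event-select MSR value from an event code and unit mask.
fn evtsel(event: u8, umask: u8, user: bool, os: bool, sample: bool) -> u64 {
    let mut v = event as u64 | ((umask as u64) << 8) | EVTSEL_EN;
    if user   { v |= EVTSEL_USR; }
    if os     { v |= EVTSEL_OS; }
    if sample { v |= EVTSEL_INT; }
    v
}

fn main() {
    // Architectural "unhalted core cycles": event 0x3C, umask 0x00,
    // counting in both user and kernel mode, no sampling interrupt.
    let v = evtsel(0x3C, 0x00, true, true, false);
    assert_eq!(v, 0x43003c);
    println!("{:#x}", v);
}
```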
AMD Core PMU (amd-core).
Programs PERFEVTSEL0–PERFEVTSEL5 and reads PERFCTR0–PERFCTR5. On Zen2+,
the extended event-select format encodes bits [11:8] of the event code into
PERFEVTSEL[35:32]. Overflow generates a standard x86-64 NMI via the local APIC
LVT performance-counter entry.
ARM PMUv3 (arm-pmuv3).
The count of general-purpose counters (PMCR_EL0.N) is read at boot and may range
from 0 to 31. Events are programmed into PMEVTYPER<n>_EL0 (event type, filter
flags) and counted in PMEVCNTR<n>_EL0. The dedicated cycle counter uses
PMCCNTR_EL0 and PMCCFILTR_EL0. Overflow interrupts arrive via a GIC PPI
(per-processor interrupt) configured by firmware (typically IRQ 23 in the PPI
range). If ID_AA64DFR0_EL1.PMSVer >= 1, the ARM Statistical Profiling Extension
is available for precise-IP sampling.
RISC-V SBI PMU (riscv-pmu).
Counter management goes through the SBI PMU extension (RISC-V SBI spec chapter 11).
The kernel calls sbi_pmu_counter_get_info() at boot to enumerate available
counters and their capabilities. sbi_pmu_counter_start() and
sbi_pmu_counter_stop() program and deprogram counters. Overflow notification
depends on the SBI firmware implementation; UmkaOS requires at minimum SBI v0.3 with
PMU extension ID 0x504D55.
POWER PMU (power-pmu).
Events are programmed via MMCR0, MMCR1, MMCRA, and MMCR2 SPRs. Counter
values are read from PMC1–PMC6. A Performance Monitor Alert (MMCR0[PMAO]) is
raised on overflow and handled via the dedicated POWER performance monitor
exception vector. On POWER10, memory latency events are
available through the combination of MMCRA load/store sampling mode and the
PM_MEM_LATENCY_* event codes.
19.7.9 Sampling Overflow Handling
Sampling events trigger a PMI (Performance Monitoring Interrupt) — an NMI on x86-64 delivered via the APIC Performance Counter vector, and a regular interrupt on other architectures — when the hardware counter overflows from the programmed start value back through zero.
Two-stage design. Linux performs full callchain unwinding and stack copying directly in the NMI handler. This is unsafe on deeply nested kernel frames and risks stack overflow or incorrect unwinding. UmkaOS splits the work:
Stage 1 — NMI/interrupt handler (minimal work):
// umka-core/src/arch/x86_64/perf/overflow.rs
/// PMU overflow NMI handler (x86-64).
///
/// # Safety
///
/// Called from NMI context. Must not sleep, allocate heap memory, or acquire
/// any lock that is not NMI-safe. Must complete in bounded time.
pub unsafe fn perf_nmi_handler(regs: &SavedRegs) {
// Read overflow status: IA32_PERF_GLOBAL_STATUS (MSR 0x38E).
let overflow_mask = rdmsr(IA32_PERF_GLOBAL_STATUS);
if overflow_mask == 0 {
return; // Spurious NMI — not from the PMU.
}
// Acknowledge overflow bits before re-enabling counters.
wrmsr(IA32_PERF_GLOBAL_OVF_CTRL, overflow_mask);
let ctx = CpuLocal::get().perf_ctx;
// Lock-free read: load count with Acquire, then iterate raw pointers.
// NMI handlers must not acquire locks (risk of deadlock if NMI fires while
// events_lock is held). The fixed-size active[] array with atomic count makes
// this safe — see PerfEventContext doc for pointer lifetime invariants.
let active_count = ctx.active_count.load(Ordering::Acquire) as usize;
for idx in 0..active_count {
if overflow_mask & (1u64 << idx) == 0 {
continue;
}
// SAFETY: active[0..active_count] are valid non-null pointers.
let event = unsafe { &*ctx.active[idx] };
let raw = RawSample {
ip: regs.ip,
sp: regs.sp,
fp: regs.bp,
timestamp_ns: timekeeping_fast_ns(),
cpu: cpu_id() as u32,
pid: current_task().pid,
tid: current_task().tid,
regs: *regs,
event_idx: idx as u16,
};
// Non-blocking push. If queue is full, increment lost_samples and drop.
if ctx.raw_sample_queue.try_push(raw).is_err() {
if let Some(rb) = &event.ring_buffer {
rb.lost_samples.fetch_add(1, Ordering::Relaxed);
}
}
// Reprogram counter to -(period) so next overflow fires after `period` events.
let width = event.pmu.capabilities().counter_width;
let period = event.attr.sample_period_or_freq;
let start_val = (1u64 << width).wrapping_sub(period);
wrmsr(IA32_PMC0 + idx as u32, start_val);
}
// Re-enable global performance counter control.
wrmsr(IA32_PERF_GLOBAL_CTRL, ctx.enabled_counter_mask);
}
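The counter reprogramming at the end of the handler relies on wraparound arithmetic: writing 2^width − period (i.e., −period in the counter's width) makes the counter overflow back through zero after exactly `period` events. A quick sanity check, assuming the 48-bit width typical of modern x86 PMUs:

```rust
/// Start value written to the PMC: the counter overflows back through zero
/// after exactly `period` increments.
fn overflow_start_value(width: u8, period: u64) -> u64 {
    (1u64 << width).wrapping_sub(period)
}

fn main() {
    let (width, period) = (48u8, 100_000u64);
    let start = overflow_start_value(width, period);
    let mask = (1u64 << width) - 1;
    // Simulate `period` increments modulo 2^width: the counter lands on 0,
    // which is the moment the overflow PMI fires.
    assert_eq!((start + period) & mask, 0);
    println!("{:#x}", start);
}
```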
Stage 2 — per-CPU sampler kthread (full work):
// umka-core/src/perf/sampler.rs
/// Per-CPU perf sampler kthread.
///
/// Runs at `SCHED_FIFO` priority 5 (low RT priority, below real-time
/// workloads), pinned to its CPU.
/// Woken by a flag set in the scheduler tick when `raw_sample_queue` is
/// non-empty. Performs all work that is unsafe in NMI context: callchain
/// unwinding, user stack copying, ring buffer writes.
pub fn perf_sampler_thread(cpu: u32) -> ! {
loop {
kthread_park_wait(); // Blocks until woken.
let ctx = CpuLocal::for_cpu(cpu).perf_ctx;
while let Some(raw) = ctx.raw_sample_queue.try_pop() {
// Resolve the event that overflowed.
let event = match ctx.events_lock.lock()
.active_events
.get(raw.event_idx as usize)
.map(Arc::clone)
{
Some(e) => e,
None => continue,
};
let rb = match event.ring_buffer.as_ref().map(Arc::clone) {
Some(rb) => rb,
None => continue,
};
// Build the complete perf record (may sleep, may allocate).
let record = build_perf_sample(&event, &raw);
ring_buffer_write(&rb, &record);
// Invoke optional overflow handler (e.g., attached eBPF program).
if let Some(handler) = event.overflow_handler {
handler(&event, &raw.regs, event.attr.sample_period_or_freq);
}
// Wake userspace readers if the wakeup threshold is reached.
if ring_buffer_should_wakeup(&rb, &event) {
rb.waitqueue.wake_all();
}
}
}
}
/// Common perf event header prepended to every ring buffer record.
/// Matches Linux `struct perf_event_header` exactly (8 bytes).
#[repr(C)]
pub struct PerfEventHeader {
/// Record type. `PERF_RECORD_SAMPLE = 9` for sampled events.
/// Other values: `PERF_RECORD_MMAP = 1`, `PERF_RECORD_LOST = 2`,
/// `PERF_RECORD_COMM = 3`, `PERF_RECORD_EXIT = 4`, `PERF_RECORD_THROTTLE = 5`,
/// `PERF_RECORD_UNTHROTTLE = 6`, `PERF_RECORD_FORK = 7`, `PERF_RECORD_READ = 8`.
pub type_: u32,
/// Misc flags encoding the CPU execution mode at sample time.
/// `PERF_RECORD_MISC_KERNEL = 1`, `PERF_RECORD_MISC_USER = 2`,
/// `PERF_RECORD_MISC_HYPERVISOR = 3`, `PERF_RECORD_MISC_GUEST_KERNEL = 4`,
/// `PERF_RECORD_MISC_GUEST_USER = 5`.
pub misc: u16,
/// Total size of this record in bytes, including the header.
/// The ring buffer consumer advances `data_tail` by this amount after
/// reading the record. Records are always aligned to 8 bytes.
pub size: u16,
}
/// Special call chain context markers written into `callchain[]` to indicate
/// a mode switch in the call stack (e.g., user → kernel boundary).
/// These match Linux `PERF_CONTEXT_*` constants exactly.
pub const PERF_CONTEXT_HV: u64 = u64::MAX - 31;
pub const PERF_CONTEXT_KERNEL: u64 = u64::MAX - 127;
pub const PERF_CONTEXT_USER: u64 = u64::MAX - 511;
pub const PERF_CONTEXT_GUEST: u64 = u64::MAX - 2047;
pub const PERF_CONTEXT_GUEST_KERNEL: u64 = u64::MAX - 2175;
pub const PERF_CONTEXT_GUEST_USER: u64 = u64::MAX - 2559;
/// Maximum call chain depth recorded in a `PerfRecordSample`.
/// Matches Linux `PERF_MAX_STACK_DEPTH`. The call chain is terminated by
/// `PERF_CONTEXT_*` markers when the unwinder crosses a privilege boundary.
pub const PERF_MAX_STACK_DEPTH: usize = 127;
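A ring buffer consumer splits the callchain at these markers to separate the kernel and user portions of the stack. A self-contained sketch (the two constants are repeated from above; `split_callchain` and the addresses are fabricated for illustration):

```rust
const PERF_CONTEXT_KERNEL: u64 = u64::MAX - 127;
const PERF_CONTEXT_USER: u64 = u64::MAX - 511;

/// Split a raw callchain into (kernel_ips, user_ips) using the context
/// markers. Entries before the first marker are ignored.
fn split_callchain(chain: &[u64]) -> (Vec<u64>, Vec<u64>) {
    let (mut kernel, mut user) = (Vec::new(), Vec::new());
    let mut in_user = None; // None until the first context marker
    for &ip in chain {
        match ip {
            PERF_CONTEXT_KERNEL => in_user = Some(false),
            PERF_CONTEXT_USER => in_user = Some(true),
            _ => match in_user {
                Some(true) => user.push(ip),
                Some(false) => kernel.push(ip),
                None => {}
            },
        }
    }
    (kernel, user)
}

fn main() {
    let chain = [
        PERF_CONTEXT_KERNEL, 0xffff_8000_0001_0000, 0xffff_8000_0002_0000,
        PERF_CONTEXT_USER, 0x5555_0000_1000,
    ];
    let (k, u) = split_callchain(&chain);
    assert_eq!(k.len(), 2);
    assert_eq!(u, vec![0x5555_0000_1000]);
    println!("kernel={} user={}", k.len(), u.len());
}
```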
/// A single perf sample record written to the perf mmap ring buffer.
///
/// This struct represents the in-memory layout of a `PERF_RECORD_SAMPLE` record
/// as it appears in the ring buffer. It is binary-compatible with Linux perf ABI,
/// allowing unmodified userspace tools (`perf`, `bpftrace`, BCC) to consume it.
///
/// # Variable-length layout
///
/// Not all fields are present in every record. The `perf_event_attr::sample_type`
/// bitmask controls which optional fields are included. Fields are written in the
/// order they appear in this struct, with absent fields skipped entirely (not
/// zeroed). The `header.size` field specifies the total byte length of the record
/// as written; the ring buffer consumer must use `header.size` — not `sizeof` this
/// struct — to advance past the record.
///
/// Fields after `callchain_nr` / `callchain` are only present if the corresponding
/// `PERF_SAMPLE_*` bit is set:
///
/// | Field(s) | `PERF_SAMPLE_*` bit | Value if absent |
/// |------------------|--------------------------|-----------------|
/// | `id` | `PERF_SAMPLE_IDENTIFIER` | not written |
/// | `ip` | `PERF_SAMPLE_IP` | not written |
/// | `pid`, `tid` | `PERF_SAMPLE_TID` | not written |
/// | `time_ns` | `PERF_SAMPLE_TIME` | not written |
/// | `addr` | `PERF_SAMPLE_ADDR` | not written |
/// | `cpu`, `_cpu_res`| `PERF_SAMPLE_CPU` | not written |
/// | `period` | `PERF_SAMPLE_PERIOD` | not written |
/// | `callchain_nr` + `callchain[0..callchain_nr]` | `PERF_SAMPLE_CALLCHAIN` | not written |
/// | `raw_size` + `raw_data[0..raw_size]` | `PERF_SAMPLE_RAW` | not written |
///
/// The `encode()` method serializes only the fields selected by `sample_type`
/// into a flat `Vec<u8>`, which is then written to the ring buffer via
/// `ring_buffer_write()`.
///
/// # Linux ABI compatibility
///
/// Field ordering and types match the Linux kernel's `perf_sample_data` and the
/// on-disk `perf.data` format parsed by `perf report`. The `PERF_RECORD_SAMPLE`
/// type value is 9, matching `enum perf_event_type` in Linux.
///
/// Alignment: the record as a whole is padded to an 8-byte boundary so that the
/// next record's header starts 8-byte aligned. `header.size` includes this padding.
pub struct PerfRecordSample {
/// Common perf event header (8 bytes). Always present.
pub header: PerfEventHeader,
/// Sample identifier (present if `PERF_SAMPLE_IDENTIFIER` set, 8 bytes).
/// Matches the `perf_event`'s `id` field assigned at `perf_event_open` time.
/// Placed first (before `ip`) when `PERF_SAMPLE_IDENTIFIER` is set, so that
/// the consumer can identify the event without parsing the full record layout.
pub id: u64,
/// Instruction pointer at sample time (present if `PERF_SAMPLE_IP` set).
pub ip: u64,
/// PID and TID of the sampled thread (present if `PERF_SAMPLE_TID` set).
pub pid: u32,
pub tid: u32,
/// Timestamp in nanoseconds (monotonic, from `timekeeping_fast_ns()`)
/// (present if `PERF_SAMPLE_TIME` set).
pub time_ns: u64,
/// Memory address that triggered the sample, for memory-access events
/// (present if `PERF_SAMPLE_ADDR` set). Zero if not applicable.
pub addr: u64,
/// CPU number on which the sample was taken (present if `PERF_SAMPLE_CPU` set).
pub cpu: u32,
/// Reserved padding to align `cpu` pair to 8 bytes. Always zero.
pub _cpu_res: u32,
/// Sampling period (number of events between consecutive samples)
/// (present if `PERF_SAMPLE_PERIOD` set).
pub period: u64,
/// Number of call chain entries that follow (present if `PERF_SAMPLE_CALLCHAIN` set).
/// A value of 0 means no call chain was captured (e.g., unwinding failed).
pub callchain_nr: u64,
/// Call chain instruction pointers (present if `PERF_SAMPLE_CALLCHAIN` set).
/// Only `callchain[0..callchain_nr]` are written to the ring buffer.
/// `PERF_CONTEXT_*` markers are interspersed to indicate mode transitions
/// (e.g., `PERF_CONTEXT_USER` separates the kernel and user portions of the
/// stack). Maximum depth is `PERF_MAX_STACK_DEPTH = 127` entries (inclusive
/// of context markers); the unwinder stops when this limit is reached.
pub callchain: [u64; PERF_MAX_STACK_DEPTH],
/// Raw record length in bytes (present if `PERF_SAMPLE_RAW` set).
/// The `raw_data` that follows is `raw_size` bytes long, then padded to
/// align the next field to 8 bytes.
pub raw_size: u32,
// raw_data follows in the ring buffer: [u8; raw_size], padded to 8-byte alignment.
// It is not a fixed array member here because its length is variable.
// The `encode()` method appends raw_data bytes directly after raw_size.
/// `sample_type` bitmask from `perf_event_attr` that was active when this
/// sample was captured. Determines which of the above fields are valid.
/// Not written to the ring buffer (internal use only for `encode()`).
pub sample_type: u64,
/// Raw sample data bytes (present if `PERF_SAMPLE_RAW` set).
/// Contains the raw tracepoint record or BPF-provided data, up to 65535 bytes.
/// Not directly represented in the ring buffer as a fixed array — appended
/// as `raw_size` bytes after the `raw_size` field during `encode()`.
pub raw_data: Vec<u8>,
}
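A sketch of how the `encode()` method described above might serialize the header-selected fields. This is a hypothetical minimal encoder handling only three `PERF_SAMPLE_*` bits, little-endian assumed; a real implementation appends the remaining fields in the same documented order:

```rust
const PERF_SAMPLE_IP: u64 = 1 << 0;
const PERF_SAMPLE_TID: u64 = 1 << 1;
const PERF_SAMPLE_TIME: u64 = 1 << 2;
const PERF_RECORD_SAMPLE: u32 = 9;

/// Serialize IP/TID/TIME fields selected by `sample_type`, pad to an
/// 8-byte boundary, then patch header.size with the final length.
fn encode_sample(sample_type: u64, ip: u64, pid: u32, tid: u32, time_ns: u64) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&PERF_RECORD_SAMPLE.to_le_bytes()); // header.type_
    buf.extend_from_slice(&2u16.to_le_bytes());               // header.misc = MISC_USER
    buf.extend_from_slice(&0u16.to_le_bytes());               // header.size (patched below)
    if sample_type & PERF_SAMPLE_IP != 0 { buf.extend_from_slice(&ip.to_le_bytes()); }
    if sample_type & PERF_SAMPLE_TID != 0 {
        buf.extend_from_slice(&pid.to_le_bytes());
        buf.extend_from_slice(&tid.to_le_bytes());
    }
    if sample_type & PERF_SAMPLE_TIME != 0 { buf.extend_from_slice(&time_ns.to_le_bytes()); }
    while buf.len() % 8 != 0 { buf.push(0); }                 // pad to 8-byte alignment
    let size = buf.len() as u16;
    buf[6..8].copy_from_slice(&size.to_le_bytes());           // patch header.size
    buf
}

fn main() {
    let rec = encode_sample(PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME,
                            0x4000_1000, 1234, 1234, 99);
    // 8 (header) + 8 (ip) + 8 (pid+tid) + 8 (time) = 32 bytes, already aligned.
    assert_eq!(rec.len(), 32);
    assert_eq!(u16::from_le_bytes([rec[6], rec[7]]), 32);
    println!("{}", rec.len());
}
```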
/// Construct a complete `PERF_RECORD_SAMPLE` from a `RawSample`.
fn build_perf_sample(event: &PerfEvent, raw: &RawSample) -> PerfRecordSample {
let st = event.attr.sample_type;
let mut rec = PerfRecordSample::new(st);
if st & PERF_SAMPLE_IP != 0 { rec.ip = raw.ip; }
if st & PERF_SAMPLE_TID != 0 { rec.pid = raw.pid; rec.tid = raw.tid; }
if st & PERF_SAMPLE_TIME != 0 { rec.time_ns = raw.timestamp_ns; }
if st & PERF_SAMPLE_CPU != 0 { rec.cpu = raw.cpu; }
if st & PERF_SAMPLE_PERIOD != 0 { rec.period = event.attr.sample_period_or_freq; }
if st & PERF_SAMPLE_ID != 0 { rec.id = event.id; }
if st & PERF_SAMPLE_CALLCHAIN != 0 {
rec.callchain = callchain_unwind(
raw.ip, raw.sp, raw.fp,
event.attr.sample_max_stack,
event.attr.flags.contains(PerfEventFlags::EXCLUDE_CALLCHAIN_KERNEL),
event.attr.flags.contains(PerfEventFlags::EXCLUDE_CALLCHAIN_USER),
);
}
if st & PERF_SAMPLE_REGS_USER != 0 {
rec.regs_user = Some(raw.regs.clone());
}
if st & PERF_SAMPLE_STACK_USER != 0 {
// copy_from_user is safe in kthread context (process context, not NMI).
rec.stack_user = copy_user_stack(raw.sp, event.attr.sample_stack_user as usize);
}
rec
}
The sampler kthread is woken by a check in scheduler_tick():
// umka-core/src/sched/tick.rs (relevant fragment)
pub fn scheduler_tick(cpu: u32) {
let ctx = CpuLocal::for_cpu(cpu).perf_ctx;
if !ctx.raw_sample_queue.is_empty() {
ctx.sampler_thread.wake();
}
perf_rotate_context(ctx); // multiplexing rotation (Section 19.7.11)
}
Sampler Thread Configuration:
- Thread priority: SCHED_FIFO at priority 5 (below all RT workloads at priority 10+, above normal timesharing). This allows the sampler to preempt normal tasks for accurate samples without affecting RT responsiveness. One sampler thread per CPU (perf_sampler/{cpu_id}).
- Wake source: each CPU's PMU (Performance Monitoring Unit) overflow interrupt sets a per-CPU perf_sample_pending flag. The interrupt handler sets the flag and does NOT call the sampler directly (interrupt context). The scheduler's post_irq_work() hook calls sampler_thread_wakeup() at the next scheduling point after the interrupt.
- Wake throttle: the sampler thread enforces a minimum inter-wake interval of 1000/sample_rate_hz milliseconds. If the PMU fires faster than this (e.g., a very hot loop causing frequent PMU overflows), samples are coalesced: the sampler processes all pending sample records in a single wakeup instead of waking once per overflow. Maximum coalesced samples per wakeup: 256 (to bound sampler thread runtime).
- CPU budget: the sampler thread has a CBS bandwidth limit of 5% CPU per core by default, configurable via the perf_sampler_cpu_budget_pct sysctl. If the budget is exceeded, the sampler yields and posts remaining samples in the next scheduling slot.
- Buffer overflow: if the per-CPU sample ring buffer fills before the sampler thread drains it (ring size: 2048 samples × 128 bytes = 256 KiB per CPU), overflow samples are counted in perf_overflow_count (accessible via umkafs) and dropped. The sampler thread increases its wakeup frequency by 25% for 10 seconds after a drop to reduce future overflow probability.
19.7.9.1 Callchain Unwinder Specification
The callchain_unwind() function referenced in build_perf_sample above
performs kernel and user stack unwinding to produce the callchain[] array
embedded in PERF_RECORD_SAMPLE records. This section specifies the unwinding
algorithm, validation strategy, and performance budget.
Primary algorithm — ORC (Oops Rewind Capability) unwinder. UmkaOS uses the same design as Linux's ORC unwinder. ORC generates a compact per-instruction unwind table during compilation, stored in two ELF sections:
- .orc_unwind — array of OrcEntry structs, one per code range, specifying how to recover the previous frame's stack pointer and return address from the current SP and IP.
- .orc_unwind_ip — sorted array of instruction pointer offsets corresponding 1:1 with .orc_unwind entries.
Each OrcEntry encodes the SP offset, the frame pointer register (if any), and
the return address location relative to the computed previous SP. ORC lookup is
an O(log N) binary search on the .orc_unwind_ip array, keyed by the current
instruction pointer. ORC handles code compiled without frame pointers and
correctly unwinds through interrupt frames, exception frames, and signal
trampoline frames because the compiler generates explicit entries for prologue
and epilogue transitions.
/// Compact ORC unwind entry. Stored in `.orc_unwind` section.
/// One entry per code range (typically per function, with extra entries
/// for prologue/epilogue/exceptional regions).
#[repr(C, packed)]
pub struct OrcEntry {
/// Signed offset from the current SP to the previous frame's SP.
/// A value of `ORC_REG_UNDEFINED` indicates the end of the chain.
pub sp_offset: i16,
/// Signed offset from the computed previous SP to the return address.
pub ra_offset: i16,
/// Register used as the base for SP recovery:
/// `ORC_REG_SP` (current stack pointer), `ORC_REG_FP` (frame pointer),
/// or `ORC_REG_UNDEFINED` (end of chain / not recoverable).
pub sp_reg: u8,
/// Register used as the base for FP recovery (if frame pointers are in use).
pub fp_reg: u8,
/// Flags: `ORC_TYPE_CALL` (normal call frame), `ORC_TYPE_REGS` (full
/// interrupt/exception register save), `ORC_TYPE_REGS_PARTIAL` (partial
/// register save, e.g., fast syscall entry).
pub entry_type: u8,
/// Padding for alignment.
pub _pad: u8,
}
/// Register encoding constants for `OrcEntry::sp_reg` and `fp_reg`.
pub const ORC_REG_UNDEFINED: u8 = 0;
pub const ORC_REG_SP: u8 = 1;
pub const ORC_REG_FP: u8 = 2;
/// Entry type constants for `OrcEntry::entry_type`.
pub const ORC_TYPE_CALL: u8 = 0;
pub const ORC_TYPE_REGS: u8 = 1;
pub const ORC_TYPE_REGS_PARTIAL: u8 = 2;
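The O(log N) lookup and the per-frame SP/RA recovery can be sketched as follows. This is illustrative only: `OrcEntryLite` is a pared-down stand-in for the full `OrcEntry`, `orc_find` and `orc_step` are hypothetical names, and the sketch assumes the `.orc_unwind_ip` offsets have already been resolved to absolute addresses in a sorted slice:

```rust
/// Pared-down ORC entry: just the fields needed for one unwind step.
struct OrcEntryLite {
    sp_offset: i16, // current base -> previous frame's SP
    ra_offset: i16, // previous SP -> return-address slot
    sp_reg: u8,     // ORC_REG_SP, ORC_REG_FP, or ORC_REG_UNDEFINED
}

const ORC_REG_SP: u8 = 1;
const ORC_REG_FP: u8 = 2;

/// Binary search: index of the last entry whose start IP is <= `ip`.
fn orc_find(ip_table: &[u64], ip: u64) -> Option<usize> {
    let idx = ip_table.partition_point(|&start| start <= ip);
    if idx == 0 { None } else { Some(idx - 1) }
}

/// One unwind step: compute (previous SP, address of the return-address
/// slot) from the matched entry and the current SP/FP.
fn orc_step(e: &OrcEntryLite, sp: u64, fp: u64) -> Option<(u64, u64)> {
    let base = match e.sp_reg {
        ORC_REG_SP => sp,
        ORC_REG_FP => fp,
        _ => return None, // ORC_REG_UNDEFINED: end of chain
    };
    let prev_sp = base.wrapping_add(e.sp_offset as i64 as u64);
    let ra_slot = prev_sp.wrapping_add(e.ra_offset as i64 as u64);
    Some((prev_sp, ra_slot))
}
```

The full unwinder then reads the return address out of `ra_slot` (after stack validation, Section below) and repeats the lookup with that address as the new IP.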
Fallback — frame pointer unwinding. If the ORC table lookup fails for a
given IP (e.g., code in a dynamically loaded eBPF JIT region, or a Tier 2
driver loaded without ORC tables), the unwinder falls back to frame pointer
chain walking: follow rbp (x86-64) / x29 (AArch64) / s0 (RISC-V) /
r11 (ARMv7) to the previous frame, read the return address from the
conventional location adjacent to the frame pointer. Frame pointer unwinding
requires that kernel code is compiled with -fno-omit-frame-pointer (enforced
in the UmkaOS build system for all in-kernel code; Tier 1 driver code is similarly
required to preserve frame pointers). The unwinder marks frames recovered via
the fallback path with an internal FRAME_FP_FALLBACK flag (not exposed to
userspace) for diagnostic purposes.
Per-Architecture Frame Layouts and Fallback Unwinding:
Beyond the in-kernel ORC unwinder described above, two further strategies are involved:
1. DWARF unwinding (offline only): uses .eh_frame / .debug_frame CFI records. Accurate but slow (~1-5 μs per frame); used by userspace tools for offline symbolization, never on the sampling hot path.
2. Frame pointer unwinding (hot-path fallback): O(1) per frame; requires -fno-omit-frame-pointer (the UmkaOS kernel is always compiled with frame pointers; Tier 2 drivers may not be).
Per-architecture frame pointer layout:
| Architecture | Frame pointer register | Return address location | Stack layout |
|---|---|---|---|
| x86-64 | RBP | [RBP+8] (caller's RIP) | push RBP; mov RBP, RSP at function entry |
| AArch64 | X29 (FP) | [X29+8] (LR saved by callee) | stp x29, x30, [sp, #-16]!; mov x29, sp |
| ARMv7 | R11 | [R11-4] (return address = saved LR) | push {fp, lr}; add fp, sp, #4 |
| RISC-V 64 | S0/FP (X8) | [S0-8] (saved RA) | sd ra, -8(sp); addi s0, sp, frame_size |
| PPC64LE | R1 (SP) | [R1+16] (LR save area, ABI-defined) | ABI-defined back-chain word at [R1+0] |
| PPC32 | R1 (SP) | [R1+4] (LR save area) | Back-chain at [R1+0], LR at [R1+4] |
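The x86-64 row of the table above can be made concrete as a small walk: the saved RBP lives at [RBP+0] and the return address at [RBP+8]. In this sketch, `fp_walk_x86_64` is a hypothetical name and `mem` simulates a word-addressable stack so the walk is testable in isolation; real code dereferences validated kernel addresses instead:

```rust
use std::collections::HashMap;

/// Frame-pointer chain walk for x86-64: saved RBP at [RBP+0], return
/// address at [RBP+8]. `mem` maps address -> u64 word (simulated stack).
fn fp_walk_x86_64(mem: &HashMap<u64, u64>, mut fp: u64, max_frames: usize) -> Vec<u64> {
    let mut chain = Vec::new();
    while chain.len() < max_frames && fp != 0 {
        // Return address is one word above the saved frame pointer.
        let Some(&ra) = mem.get(&(fp + 8)) else { break };
        chain.push(ra);
        // Follow the saved RBP link to the caller's frame.
        let Some(&prev_fp) = mem.get(&fp) else { break };
        // Stacks grow down, so the caller's frame sits at a higher address.
        if prev_fp <= fp {
            break;
        }
        fp = prev_fp;
    }
    chain
}
```

The monotonicity check (`prev_fp <= fp`) doubles as a cheap loop detector: a corrupted frame pointer that points back into an already-walked frame terminates the chain rather than spinning.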
Fallback for missing frame pointers:
When frame pointer unwinding fails (frame pointer is zero or points outside the stack):
1. Re-attempt ORC (Oops Rewind Capability) unwinding if ORC metadata is available for the code address; ORC is the primary algorithm, but its tables may be absent for JIT-generated or out-of-tree code. ORC is a compact, architecture-specific alternative to DWARF generated at compile time (scripts/orc_gen).
2. If ORC is unavailable, scan the stack for plausible return addresses. Scanning
rules: (a) pointer must be word-aligned (8-byte-aligned on 64-bit architectures,
4-byte-aligned on 32-bit); (b) pointer must fall within an executable .text segment
mapped for the current kernel or a loaded driver; (c) the 1-5 bytes immediately
preceding the candidate address must decode as a CALL / BL / jalr instruction
(verified by checking the opcode byte pattern for each architecture); (d) at most
256 stack words (2 KB on 64-bit) are scanned per frame level before the scanner
advances to the next candidate or stops; (e) at most 32 heuristic frames total
are recovered before the walk terminates.
3. Mark heuristically-recovered frames with FRAME_HEURISTIC flag in the sample; symbolizer skips CFI validation for these.
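Scanning rules (a) through (c) above reduce to a plausibility predicate applied to each scanned stack word. A sketch under stated assumptions: `plausible_return_addr` is a hypothetical name, and the per-architecture opcode check from rule (c) is abstracted into a caller-supplied predicate:

```rust
/// Plausibility filter for the heuristic stack scan: rule (a) word
/// alignment, rule (b) membership in an executable text range, rule (c)
/// a call-class instruction immediately before the candidate.
fn plausible_return_addr(
    candidate: u64,
    text_ranges: &[(u64, u64)], // half-open [start, end) executable ranges
    decodes_as_call_before: impl Fn(u64) -> bool,
) -> bool {
    candidate % 8 == 0 // rule (a): 8-byte alignment on 64-bit targets
        && text_ranges.iter().any(|&(s, e)| candidate >= s && candidate < e) // rule (b)
        && decodes_as_call_before(candidate) // rule (c)
}
```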
Sampling overhead: Frame pointer walk costs ~3-10ns per frame (pointer dereference + bounds check). Maximum 128 frames unwound per sample; truncated stacks are marked TRUNCATED.
Stack validity checking. Before following any pointer during unwinding, the unwinder validates that the pointer lies within a known kernel stack region:
- The current task's kernel stack (task.stack_base .. task.stack_base + STACK_SIZE).
- The per-CPU IRQ stack (if the unwinder is walking through an interrupt frame).
- The per-CPU exception stack (e.g., x86-64 IST stacks for double fault, NMI, MCE).
If a candidate frame pointer or return address fails validation, the unwinder
terminates the chain at that point and writes a PERF_CONTEXT_MAX sentinel
(equivalent to [invalid frame]) into the callchain array. It does not fault
or panic — an invalid frame simply truncates the trace. For user-mode portions
of the callchain, addresses are validated against the task's user-mode VMA
mappings; unmapped user addresses similarly terminate the user portion of the
unwind.
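The region check above is a simple containment test over the known stack ranges. A minimal sketch, with `frame_ptr_valid` as a hypothetical name and each region expressed as (base, size):

```rust
/// Validate a candidate frame pointer against the known stack regions
/// (task kernel stack, per-CPU IRQ stack, per-CPU exception stacks).
/// The candidate must be word-aligned and leave room for one frame
/// record (two u64 words: saved FP + return address).
fn frame_ptr_valid(ptr: u64, regions: &[(u64, u64)]) -> bool {
    regions.iter().any(|&(base, size)| {
        ptr % 8 == 0 && ptr >= base && ptr.saturating_add(16) <= base + size
    })
}
```

A pointer that fails this check truncates the chain; it is never dereferenced, which is what guarantees the "does not fault or panic" property above.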
Maximum depth. Up to 64 frames for the kernel portion and 64 for the user
portion of a callchain, subject to an overall cap of
PERF_MAX_STACK_DEPTH = 127 entries including PERF_CONTEXT_* markers
(matching Linux). The callchain_unwind() function accepts a max_frames
parameter (capped at PERF_MAX_STACK_DEPTH) and returns the actual number of
frames unwound.
Performance budget. Less than 1 us for typical 10-frame kernel stacks. The
ORC binary search is O(log N) over approximately 100K-500K entries for a typical
kernel image, which completes in under 100 ns. The per-frame SP/RA recovery is
a few memory loads per frame. The dominant cost for deep stacks (> 20 frames) or
cold paths is L2/L3 cache misses when accessing the .orc_unwind table or stack
pages that have been evicted. The sampler kthread (Stage 2) runs in process
context, so these cache misses are tolerable — they do not affect NMI latency.
// umka-core/src/perf/unwind.rs
/// Unwind the kernel call stack starting from the given register state.
///
/// Returns a callchain of instruction pointers ordered from innermost
/// (most recent call) to outermost (entry point / syscall boundary).
/// `PERF_CONTEXT_KERNEL` is prepended before the kernel frames unless
/// `exclude_kernel` is set; `PERF_CONTEXT_USER` is inserted before any
/// user-mode frames unless `exclude_user` is set.
///
/// The second element of the returned tuple is the number of entries written
/// (including context markers). Unwinding stops at `max_frames`, at the first
/// invalid frame, or at the bottom of the call stack — whichever comes first.
///
/// # Algorithm selection
///
/// 1. Look up `ip` in the ORC table (`.orc_unwind_ip` binary search).
/// 2. If found, use the `OrcEntry` to compute the previous SP and RA.
/// 3. If not found, fall back to frame-pointer walking from `fp`.
/// 4. Validate every pointer against known stack regions before dereferencing.
pub fn callchain_unwind(
ip: u64,
sp: u64,
fp: u64,
max_frames: u32,
exclude_kernel: bool,
exclude_user: bool,
) -> ([u64; PERF_MAX_STACK_DEPTH], u64) {
// Returns (callchain array, number of valid entries).
// Implementation walks the ORC table with frame-pointer fallback.
unimplemented!("specified above")
}
19.7.10 Ring Buffer Write Protocol
The ring buffer uses a single-producer (kernel) / single-consumer (userspace) lock-free protocol. The kernel is the sole writer; userspace is the sole reader.
Kernel write path:
// umka-core/src/perf/ring_buffer.rs
/// Write a complete perf record to the ring buffer.
///
/// # Ordering
///
/// Sample data is written to the circular buffer before `data_head` is updated.
/// `data_head` is stored with Release ordering. Userspace reads `data_head`
/// with Acquire ordering, ensuring it sees the complete sample data.
///
/// On ARMv7 and RISC-V (no hardware TSO), a `SeqCst` fence is emitted before
/// the `data_head` store to prevent the store from being reordered ahead of the
/// data writes by the CPU's out-of-order write buffer.
pub fn ring_buffer_write(rb: &PerfRingBuffer, record: &PerfRecordSample) {
let bytes = record.encode();
let len = bytes.len() as u64;
let head = rb.data_head.load(Ordering::Acquire);
let tail = {
// SAFETY: header pointer is valid and kernel-writable for the ring
// buffer lifetime. Tail is only written by userspace (Release) and
// read here (Acquire) — no race with the kernel write path.
unsafe { (*rb.header.as_ptr()).data_tail.load(Ordering::Acquire) }
};
// Full condition: ring has no room for this record.
if head.wrapping_sub(tail) + len > rb.data_size as u64 {
rb.lost_samples.fetch_add(1, Ordering::Relaxed);
emit_lost_record(rb, 1);
return;
}
// Write bytes into the circular data buffer (wrapping at data_size).
let data = unsafe {
core::slice::from_raw_parts_mut(rb.data.as_ptr() as *mut u8, rb.data_size)
};
let offset = (head as usize) & (rb.data_size - 1);
ring_copy(data, offset, &bytes);
// Fence before head update on weakly-ordered architectures.
#[cfg(any(target_arch = "arm", target_arch = "riscv64"))]
core::sync::atomic::fence(Ordering::SeqCst);
let new_head = head + len;
rb.data_head.store(new_head, Ordering::Release);
// Mirror into the mmap header page for userspace.
unsafe {
(*rb.header.as_ptr()).data_head.store(new_head, Ordering::Release);
}
}
/// Copy `src` into `buf` starting at `offset`, wrapping around at `buf.len()`.
fn ring_copy(buf: &mut [u8], offset: usize, src: &[u8]) {
let cap = buf.len();
let first = (cap - offset).min(src.len());
buf[offset..offset + first].copy_from_slice(&src[..first]);
if first < src.len() {
buf[..src.len() - first].copy_from_slice(&src[first..]);
}
}
Userspace read protocol (no system call required):
/* Pseudocode — purely userspace, no kernel involvement */
void perf_drain_ring(struct perf_event_mmap_page *hdr, void *data) {
uint64_t size = hdr->data_size;
uint64_t head = __atomic_load_n(&hdr->data_head, __ATOMIC_ACQUIRE);
uint64_t tail = hdr->data_tail;
while (tail != head) {
struct perf_event_header *rec =
(void *)((char *)data + (tail & (size - 1)));
handle_record(rec);
tail += rec->size;
}
__atomic_store_n(&hdr->data_tail, tail, __ATOMIC_RELEASE);
}
mmap size constraints. The mmap() call on a perf fd must request exactly
(1 + 2^n) pages: 1 header page followed by 2^n data pages. Requesting any
other size returns EINVAL. Typical configurations used by tools:
| mmap pages | Data region | Typical use |
|---|---|---|
| 1 + 2 | 8 KB | Breakpoint events |
| 1 + 8 | 32 KB | Light profiling |
| 1 + 64 | 256 KB | Default perf record |
| 1 + 512 | 2 MB | High-frequency sampling |
| 1 + 2048 | 8 MB | Intel PT / ARM SPE AUX buffer |
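The "exactly 1 + 2^n pages" rule above is a one-line check on the requested page count. A sketch of the kernel-side validation (hypothetical helper name; the real syscall path maps a `false` result to EINVAL):

```rust
/// Enforce the mmap size rule: exactly 1 header page plus a
/// power-of-two number of data pages.
fn perf_mmap_pages_valid(nr_pages: u64) -> bool {
    match nr_pages.checked_sub(1) {
        // Zero data pages fails too: 0 is not a power of two.
        Some(data_pages) => data_pages.is_power_of_two(),
        None => false, // zero pages requested
    }
}
```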
PERF_RECORD_LOST emission. When the ring is full and a sample is dropped,
UmkaOS emits a PERF_RECORD_LOST record (type 2) to the ring buffer if there is
space for the small fixed-size lost record. This matches Linux behavior and
allows perf report to report the number of dropped samples.
19.7.11 Event Multiplexing
When more events are requested than the PMU has hardware counter slots, UmkaOS
rotates events in and out of the PMU on each scheduler tick. Accumulated counts
are scaled by the time_enabled / time_running ratio to give approximate
full-interval counts.
Rotation algorithm. UmkaOS uses deterministic round-robin rotation. Linux uses a semi-random scheduling order that can starve low-priority events; UmkaOS's deterministic order ensures every event gets equal time.
The rotation runs in perf_rotate_context(), called from scheduler_tick():
// umka-core/src/perf/context.rs
/// Rotate events in a CPU context: swap out active set, swap in pending set.
///
/// Called from the scheduler tick with interrupts disabled on `ctx.cpu`.
/// Holds `events_lock` for the duration — this is acceptable because rotation
/// is an infrequent slow path (once per tick, not once per context switch).
/// The context switch hot path (`perf_schedule_in`/`perf_schedule_out`) is
/// lock-free and will not contend here.
pub fn perf_rotate_context(ctx: &mut PerfEventContext) {
let total_slots = ctx.hpc_slots as usize + ctx.fixed_slots as usize;
let now_ns = timekeeping_fast_ns();
let mut mutable = ctx.events_lock.lock();
// 1. Save counts, update time_running, remove active events from hardware.
for event in mutable.active_events.iter() {
ctx_pmu(ctx).event_del(event, PERF_EF_UPDATE);
event.time_running_ns.fetch_add(
now_ns.wrapping_sub(event.last_schedule_in_ns.load(Ordering::Relaxed)),
Ordering::Relaxed,
);
}
// 2. Move all active events to the back of the pending queue.
// Zero the active_count first (Release) so the NMI handler's lockless
// read sees an empty active set while we rebuild it.
ctx.active_count.store(0, Ordering::Release);
let drained: Vec<_> = mutable.active_events.drain(..).collect();
mutable.pending_events.extend(drained);
// 3. Pull the next batch from the front of the pending queue and
// rebuild the raw pointer array.
let count = total_slots.min(mutable.pending_events.len());
for i in 0..count {
let event = mutable.pending_events.pop_front().unwrap();
// last_schedule_in_ns is atomic: events are shared via Arc.
event.last_schedule_in_ns.store(now_ns, Ordering::Relaxed);
// Store pointer before incrementing count.
ctx.active[i] = Arc::as_ptr(&event);
mutable.active_events.push(event);
}
// Release store: makes all pointer writes visible before the new count.
ctx.active_count.store(count as u32, Ordering::Release);
// 4. Program the new active set into hardware.
for event in mutable.active_events.iter() {
ctx_pmu(ctx).event_add(event, PERF_EF_START | PERF_EF_RELOAD);
}
// 5. Update time_enabled for all events (active + pending).
let elapsed = now_ns.wrapping_sub(ctx.last_rotate_ns);
ctx.last_rotate_ns = now_ns;
for event in mutable.active_events.iter() {
event.time_enabled_ns.fetch_add(elapsed, Ordering::Relaxed);
}
for event in mutable.pending_events.iter() {
event.time_enabled_ns.fetch_add(elapsed, Ordering::Relaxed);
}
}
Scaling formula. Userspace computes the full-interval estimate from the
read() return value when PERF_FORMAT_TOTAL_TIME_ENABLED and
PERF_FORMAT_TOTAL_TIME_RUNNING are set:
scaled_count = raw_count × (time_enabled_ns / time_running_ns)
The kernel returns these values in the read() response:
/// Layout returned by read() on a perf fd.
/// Which fields are present depends on attr.read_format.
#[repr(C)]
pub struct PerfReadFormat {
/// Raw accumulated event count (always present).
pub value: u64,
/// Total nanoseconds the event has been enabled.
/// Present if PERF_FORMAT_TOTAL_TIME_ENABLED is set.
pub time_enabled: u64,
/// Total nanoseconds the event has been running on PMU hardware.
/// Present if PERF_FORMAT_TOTAL_TIME_RUNNING is set.
pub time_running: u64,
/// Event unique ID.
/// Present if PERF_FORMAT_ID is set.
pub id: u64,
/// Samples lost since last read.
/// Present if PERF_FORMAT_LOST is set (kernel 6.0+ extension).
pub lost: u64,
}
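Applying the scaling formula to a `read()` result is a single widening multiply-divide. A userspace-side sketch (`scale_count` is a hypothetical name; `u128` intermediate math avoids overflow when a large raw count meets a long enabled interval):

```rust
/// scaled = raw * time_enabled / time_running.
/// Returns None when the event never ran on hardware (no estimate possible).
fn scale_count(raw: u64, time_enabled_ns: u64, time_running_ns: u64) -> Option<u64> {
    if time_running_ns == 0 {
        return None;
    }
    // Widen before multiplying so raw * time_enabled cannot overflow u64.
    Some((raw as u128 * time_enabled_ns as u128 / time_running_ns as u128) as u64)
}
```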
Pinned events. Events with PerfEventFlags::PINNED set are never rotated out.
If a pinned event cannot be scheduled because no hardware slot is available, it
transitions to PerfEventState::Error. Subsequent read() calls on that fd
return 0 (end-of-file), matching Linux behavior for pinned events in the error
state. Userspace may recover by closing the fd and reopening with fewer
concurrent events.
Group atomicity. Events sharing the same group (same group_fd) are scheduled
atomically. Either all siblings fit simultaneously into available hardware slots, or
none are scheduled. This guarantees that ratios computed between group members
(instructions per cycle, cache miss rate) are always measured over the same time
interval with no inter-event skew.
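The all-or-nothing admission rule can be sketched as a pass over the pending groups against the free slot count; `admit_groups` is a hypothetical name and group membership is reduced to a size for clarity:

```rust
/// All-or-nothing group admission: a group is scheduled only when every
/// sibling fits in the remaining hardware slots at once. Returns the
/// indices of the groups admitted this rotation.
fn admit_groups(group_sizes: &[usize], mut free_slots: usize) -> Vec<usize> {
    let mut admitted = Vec::new();
    for (i, &size) in group_sizes.iter().enumerate() {
        if size <= free_slots {
            free_slots -= size; // the whole group goes on together
            admitted.push(i);
        }
        // else: skip the entire group; partial scheduling would skew ratios
    }
    admitted
}
```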
19.7.12 ioctl Operations
All ioctl codes are identical to Linux (same numeric values, same semantics).
The third argument to ioctl(fd, request, arg) is the arg; for codes with
PERF_IOC_FLAG_GROUP support, the arg | PERF_IOC_FLAG_GROUP form applies the
operation to all events in the group simultaneously.
| ioctl request | Code | arg type | Description |
|---|---|---|---|
| PERF_EVENT_IOC_ENABLE | 0x2400 | — | Enable event (begin counting) |
| PERF_EVENT_IOC_DISABLE | 0x2401 | — | Disable event (stop counting; count is preserved) |
| PERF_EVENT_IOC_REFRESH | 0x2402 | int | Enable for arg overflow samples, then auto-disable |
| PERF_EVENT_IOC_RESET | 0x2403 | — | Reset accumulated count to zero (does not affect enabled state) |
| PERF_EVENT_IOC_PERIOD | 0x40082404 | u64 * | Update sample period; takes effect after the next overflow |
| PERF_EVENT_IOC_SET_OUTPUT | 0x2405 | int (fd) | Redirect samples to another fd's ring buffer |
| PERF_EVENT_IOC_SET_FILTER | 0x40082406 | char * | Set ftrace filter string (TRACEPOINT events only) |
| PERF_EVENT_IOC_ID | 0x80082407 | u64 * | Write the event's unique ID into *arg |
| PERF_EVENT_IOC_SET_BPF | 0x40042408 | int (bpf fd) | Attach BPF_PROG_TYPE_PERF_EVENT program to this event |
| PERF_EVENT_IOC_PAUSE_OUTPUT | 0x40042409 | int (bool) | Pause or resume ring buffer output without disabling counting |
| PERF_EVENT_IOC_QUERY_BPF | 0xc008240a | perf_event_query_bpf * | Retrieve IDs of all attached BPF programs |
| PERF_EVENT_IOC_MODIFY_ATTRIBUTES | 0x4008240b | perf_event_attr * | Modify writable attributes of an existing event in-place |
PERF_EVENT_IOC_SET_BPF. Attaches an eBPF program of type
BPF_PROG_TYPE_PERF_EVENT to the event. The program runs in the sampler kthread
context (Stage 2, Section 19.7.9) on every overflow sample, before the sample is
written to the ring buffer. The program may filter samples by returning 0 (drop),
or may call bpf_perf_event_output() to write a custom record to a
BPF_MAP_TYPE_PERF_EVENT_ARRAY. Requires CAP_BPF + CAP_PERFMON. See
Section 18.5 for eBPF verifier and JIT details.
PERF_EVENT_IOC_MODIFY_ATTRIBUTES. Permitted modifications are limited to
fields that can be changed safely on a live event:
sample_period_or_freq, wakeup_events_or_watermark, clockid, and sig_data.
Attempting to change event_type, config, or structural flags returns EINVAL.
19.7.13 /proc/sys Tunables and Linux Compatibility
UmkaOS exports the same /proc/sys/kernel/perf_* tunable knobs as Linux, with
identical names, semantics, and default values.
/proc/sys/kernel/perf_event_paranoid
Controls which callers may open perf events. Checked at perf_event_open() time
against the caller's credentials and capabilities.
| Value | Restriction |
|---|---|
| -1 | No restriction. All events available to all users. |
| 0 | Raw tracepoints and kernel events allowed to all users. |
| 1 (default) | CPU-wide hardware events allowed for non-root processes; per-process events always allowed. |
| 2 | Only aggregate kernel-wide stats allowed for non-root; no per-process hardware events. |
| 3 | perf_event_open returns EACCES for all non-root callers. |
CAP_SYS_ADMIN or CAP_PERFMON bypasses all paranoid restrictions.
CAP_PERFMON (introduced in Linux 5.8) is the preferred least-privilege capability
for performance monitoring; UmkaOS recognizes it with the same semantics.
/proc/sys/kernel/perf_event_max_sample_rate
Maximum sampling frequency in Hz. Default: 100000. Sampling frequency requests
above this value (when PerfEventFlags::FREQ is set) are silently clamped. The
kernel dynamically reduces this limit if cumulative PMI processing time exceeds
perf_cpu_time_max_percent of available CPU time, and recovers it as load
decreases. Reduction and recovery are reported to userspace via
PERF_RECORD_THROTTLE and PERF_RECORD_UNTHROTTLE ring buffer records.
/proc/sys/kernel/perf_event_mlock_kb
Maximum kilobytes of ring buffer memory that may be mlock()-ed (pinned against
swap) per user account. Default: 516 KB, sufficient for one default-sized
perf record ring buffer (1 header page + 128 data pages). Each mmap() of a
ring buffer deducts from the caller's per-user quota.
/proc/sys/kernel/perf_cpu_time_max_percent
Maximum percentage of CPU time that PMI processing may consume before the kernel
throttles the sample rate. Default: 25. When exceeded, UmkaOS reduces the active
sample period and emits PERF_RECORD_THROTTLE. Recovery (PERF_RECORD_UNTHROTTLE)
occurs when the PMI load drops below half the threshold for at least one second.
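The throttle/unthrottle hysteresis above (throttle on exceeding the limit; recover only after load stays below half the limit for a full second) can be sketched as a tiny state machine. `ThrottleState` and `update_throttle` are illustrative names, not the kernel's actual types:

```rust
/// Tracks whether sampling is currently throttled and, if so, since
/// when the PMI load has stayed below the recovery threshold.
struct ThrottleState {
    throttled: bool,
    below_since_ns: Option<u64>,
}

/// Advance the state machine; returns the record type to emit, if any.
fn update_throttle(
    st: &mut ThrottleState,
    pmi_load_pct: u32,
    limit_pct: u32,
    now_ns: u64,
) -> Option<&'static str> {
    if !st.throttled {
        if pmi_load_pct > limit_pct {
            st.throttled = true;
            st.below_since_ns = None;
            return Some("PERF_RECORD_THROTTLE");
        }
    } else if pmi_load_pct < limit_pct / 2 {
        // Start (or continue) the one-second recovery window.
        let since = *st.below_since_ns.get_or_insert(now_ns);
        if now_ns.saturating_sub(since) >= 1_000_000_000 {
            st.throttled = false;
            st.below_since_ns = None;
            return Some("PERF_RECORD_UNTHROTTLE");
        }
    } else {
        st.below_since_ns = None; // load rose again; restart the window
    }
    None
}
```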
umkafs unified namespace. The same knobs are accessible via umkafs (Section 19.4):
/System/Kernel/Perf/
paranoid # rw; maps to perf_event_paranoid
max_sample_rate # rw; maps to perf_event_max_sample_rate
mlock_kb # rw; maps to perf_event_mlock_kb
cpu_time_max_percent # rw; maps to perf_cpu_time_max_percent
pmu/
<name>/ # one directory per registered PmuOps driver
type # ro; dynamic PMU type number
cpumask # ro; CPUs where this PMU is available
nr_addr_filters # ro; number of address filters supported
format/ # ro; event format descriptors (per-field files)
events/ # ro; vendor-named canonical event aliases
Syscall number table. All values match the Linux ABI for the corresponding architecture:
| Architecture | Syscall number |
|---|---|
| x86-64 | 298 |
| AArch64 | 241 |
| ARMv7 | 364 |
| RISC-V 64 | 241 |
| PPC32 | 319 |
| PPC64LE | 319 |
Tool compatibility. The following widely-used tools work without modification on UmkaOS:
| Tool | Mechanism | Notes |
|---|---|---|
| perf stat | perf_event_open + read() + userspace rdpmc | Full group event support including PERF_FORMAT_* |
| perf record | perf_event_open + mmap() + ring buffer | All PERF_RECORD_* types emitted |
| perf report | Reads perf.data | No kernel interaction; purely userspace |
| perf script | Reads perf.data | No kernel interaction |
| bpftrace | perf_event_open + PERF_EVENT_IOC_SET_BPF | Full compat; BPF_PROG_TYPE_PERF_EVENT |
| BCC (profile, offcputime, funclatency, etc.) | Same as bpftrace | Full compat |
| pmu-tools (toplev, ocperf) | PERF_TYPE_RAW event codes | Arch-specific; works on Intel/AMD/ARM |
| JVM via perf-map-agent | Reads /tmp/perf-PID.map for JIT symbols | Pure userspace file protocol; zero kernel changes |
| perf c2c (false sharing detection) | PERF_SAMPLE_DATA_SRC + PERF_SAMPLE_ADDR | Requires hardware data-source support (Intel/AMD) |
| Intel VTune Profiler (CLI) | perf_event_open + raw Intel events | Intel x86-64 only |
Implementation phases:
- Phase 1: Core infrastructure — PerfEvent, PerfEventContext, PmuOps trait,
PERF_TYPE_SOFTWARE events, ring buffer mmap protocol, basic ioctl set
(ENABLE/DISABLE/RESET/ID), perf_event_paranoid enforcement, Intel Core PMU driver.
- Phase 2: Sampling — NMI handler (x86-64), per-CPU sampler kthread,
callchain unwinding (frame-pointer), PERF_TYPE_HARDWARE for Intel and AMD,
PERF_RECORD_SAMPLE ring buffer records, PERF_RECORD_LOST.
- Phase 3: Tracepoint events (PERF_TYPE_TRACEPOINT), hardware cache events
(PERF_TYPE_HW_CACHE), ARM PMUv3 driver, RISC-V SBI PMU driver, PPC64LE
POWER PMU driver.
- Phase 4: Advanced features — Intel PEBS precise-IP, LBR branch stack
(PERF_SAMPLE_BRANCH_STACK), Intel PT AUX buffer, ARM SPE, PERF_TYPE_BREAKPOINT,
eBPF integration (PERF_EVENT_IOC_SET_BPF), PERF_RECORD_MMAP2.
- Phase 5: Multiplexing with deterministic rotation, cgroup-scoped events
(PERF_FLAG_PID_CGROUP), PERF_EVENT_ATTR_INHERIT across fork, uncore PMU
drivers (memory controller, PCIe root complex), dynamic sample rate throttling.
19.8 Kernel Parameter Store (Typed Sysctl)
Replaces
/proc/sys/sysctl text files with a typed, schema-bearing parameter store exposed through umkafs. All existing/proc/sys/paths continue to work unchanged as a compatibility shim.
19.8.1 Problem with /proc/sys/ Sysctl
Every Linux kernel tunable is a plain text file. This creates several operational problems:
- No type information at the interface. Writing "abc" to a numeric sysctl returns EINVAL, but only at write time. There is no machine-readable way to discover that the file expects an integer rather than a string.
- No range information. A monitoring agent cannot determine that net.core.somaxconn accepts values from 1 to 65535 without consulting out-of-tree documentation.
- No enumeration schema. Tools that want to list all tunables with their types, defaults, and constraints must scrape kernel source or man pages — there is no structured registry.
- No per-namespace scoping metadata. Some sysctls are per-network-namespace, others are host-global; this distinction is not queryable at the interface.
19.8.2 Design: Typed Parameter Descriptors
Each kernel tunable is described by a KernelParam struct registered at link time.
The registration is zero-cost: the descriptor lives in a read-only ELF section
(.umka_params), and the umkafs driver iterates that section to populate the
namespace. No runtime registration call is needed.
// umka-core/src/params/mod.rs
/// A kernel tunable parameter descriptor. Registered at compile time via macro.
pub struct KernelParam {
/// Canonical dot-separated name (e.g., "net.core.somaxconn").
pub name: &'static str,
/// Type and constraints.
pub schema: ParamSchema,
/// Human-readable one-line description.
pub description: &'static str,
/// Whether CAP_SYS_ADMIN is required to write this parameter.
pub privileged: bool,
/// Whether the parameter is scoped per-user-namespace (false = host-global only).
pub per_namespace: bool,
/// Read the current value.
pub getter: fn() -> ParamValue,
/// Validate and apply a new value.
pub setter: fn(ParamValue) -> Result<(), ParamError>,
}
/// Type and constraints for a kernel parameter.
pub enum ParamSchema {
U32 { min: u32, max: u32, default: u32 },
U64 { min: u64, max: u64, default: u64 },
I32 { min: i32, max: i32, default: i32 },
Bool { default: bool },
Str { max_len: usize, default: &'static str },
Enum { choices: &'static [&'static str], default: usize },
/// Bitmask of named flags. Humans write `"flag1|flag2"`, kernel parses bitmask.
Flags { known_flags: &'static [(&'static str, u64)], default: u64 },
}
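The human-readable `Flags` form noted in the doc comment ("flag1|flag2") maps to the bitmask via the schema's known-flags table. A minimal parsing sketch; `parse_flags` is a hypothetical name, and an unknown flag name rejects the whole write rather than being silently ignored:

```rust
/// Parse "flag1|flag2" against the known-flags table, OR-ing the bits.
/// Returns None if any name is not in the table.
fn parse_flags(text: &str, known: &[(&str, u64)]) -> Option<u64> {
    text.split('|').try_fold(0u64, |acc, name| {
        known
            .iter()
            .find(|(n, _)| *n == name.trim())
            .map(|(_, bit)| acc | bit)
    })
}
```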
/// A typed parameter value, as read from or written to the parameter store.
pub enum ParamValue {
U32(u32),
U64(u64),
I32(i32),
Bool(bool),
Str(BoundedStr),
Enum(usize),
Flags(u64),
}
/// Error returned when a write fails schema validation.
pub enum ParamError {
/// Value is outside the declared `min`/`max` range.
OutOfRange { min: u64, max: u64 },
/// String is not one of the declared enum choices.
InvalidChoice { choices: &'static [&'static str] },
/// Value tag does not match the parameter's declared type.
TypeMismatch,
/// Write rejected by the setter (e.g., value conflicts with hardware state).
SetterRejected,
}
19.8.3 umkafs Layout
Each parameter appears as a directory under /System/Kernel/params/. The
directory name is the canonical dot-separated parameter name. Within the directory,
fixed-name metadata files provide the schema and current value:
/System/Kernel/params/<name>/
value rw Current value. Text on read; text or binary on write.
default ro Factory default value (text).
schema ro JSON object: type, constraints, default.
description ro One-line human-readable description.
privileged ro "1" if CAP_SYS_ADMIN is required to write, "0" otherwise.
namespace ro "per-ns" or "global".
The top-level directory also exposes a single aggregate file for bulk enumeration:
/System/Kernel/params/.schema.json ro JSON array of all parameter schemas.
Example schema file for net.core.somaxconn:
{"type":"u32","min":1,"max":65535,"default":128,"privileged":true,"namespace":"per-ns"}
19.8.4 Read and Write Protocol
Text reads (for shell scripts and human inspection):
cat /System/Kernel/params/net.core.somaxconn/value
128
Binary reads (for programmatic, type-safe access — avoids string parsing):
/* Binary wire format: 1-byte tag followed by the typed value. */
struct ParamValue {
uint8_t tag; /* 1=U32, 2=U64, 3=I32, 4=Bool, 5=Str, 6=Enum, 7=Flags */
union {
uint32_t u32_val;
uint64_t u64_val;
int32_t i32_val;
uint8_t bool_val;
uint64_t enum_idx;
uint64_t flags_val;
char str_val[256];
};
};
int fd = open("/System/Kernel/params/net.core.somaxconn/value", O_RDONLY);
struct ParamValue pv;
read(fd, &pv, sizeof(pv));
/* pv.tag == 1 (U32); pv.u32_val == 128 */
Text writes (validated against the schema; EINVAL on type or range error):
echo 4096 > /System/Kernel/params/net.core.somaxconn/value
Binary writes (exact typed value; no text parsing):
struct ParamValue pv = { .tag = 1 /* U32 */, .u32_val = 4096 };
write(fd, &pv, sizeof(pv));
Both text and binary writes pass through the same schema validation path before the setter is called.
19.8.5 Schema Validation on Write
// umka-core/src/params/write.rs
/// Validate `raw` bytes against the parameter schema and, if valid, call the setter.
///
/// # Errors
///
/// Returns `ParamError` if the value fails type or range validation, or if the
/// setter rejects the value.
pub fn param_write(param: &KernelParam, raw: &[u8]) -> Result<(), ParamError> {
let value = parse_text_or_binary(raw)?;
match (¶m.schema, &value) {
(ParamSchema::U32 { min, max, .. }, ParamValue::U32(v)) => {
if v < min || v > max {
return Err(ParamError::OutOfRange {
min: u64::from(*min),
max: u64::from(*max),
});
}
}
(ParamSchema::U64 { min, max, .. }, ParamValue::U64(v)) => {
if v < min || v > max {
return Err(ParamError::OutOfRange { min: *min, max: *max });
}
}
(ParamSchema::I32 { min, max, .. }, ParamValue::I32(v)) => {
if v < min || v > max {
return Err(ParamError::OutOfRange {
min: *min as u64,
max: *max as u64,
});
}
}
(ParamSchema::Enum { choices, .. }, ParamValue::Enum(idx)) => {
if *idx >= choices.len() {
return Err(ParamError::InvalidChoice { choices: *choices });
}
}
(ParamSchema::Flags { known_flags, .. }, ParamValue::Flags(v)) => {
let all_known: u64 = known_flags.iter().fold(0, |acc, (_, bit)| acc | bit);
if *v & !all_known != 0 {
return Err(ParamError::OutOfRange { min: 0, max: all_known });
}
}
// Bool carries no range constraint; Str is bounded only by max_len.
(ParamSchema::Bool { .. }, ParamValue::Bool(_)) => {}
(ParamSchema::Str { max_len, .. }, ParamValue::Str(s)) => {
if s.len() > *max_len {
return Err(ParamError::OutOfRange { min: 0, max: *max_len as u64 });
}
}
_ => return Err(ParamError::TypeMismatch),
}
// Schema checks passed; the setter may still reject (e.g., hardware conflicts).
(param.setter)(value)
}
19.8.6 Registration Macro
Kernel subsystems register parameters using a declarative macro. The macro emits
a KernelParam descriptor into the .umka_params ELF section; the linker script
groups all descriptors into a contiguous array bounded by __umka_params_start
and __umka_params_end. The umkafs driver walks that range at mount time to
populate /System/Kernel/params/. There is no runtime list insertion and no
dynamic allocation during registration.
// Example registration in umka-net/src/core/socket.rs
kernel_param! {
name: "net.core.somaxconn",
schema: ParamSchema::U32 { min: 1, max: 65535, default: 128 },
description: "Maximum backlog for listen() sockets.",
privileged: true,
per_namespace: true,
getter: || ParamValue::U32(NET_SOMAXCONN.load(Ordering::Relaxed)),
setter: |v| {
if let ParamValue::U32(n) = v {
NET_SOMAXCONN.store(n, Ordering::Relaxed);
Ok(())
} else {
Err(ParamError::TypeMismatch)
}
},
}
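A plausible expansion of the macro invocation above, showing the link-time mechanism: the macro reduces to a `static` descriptor placed in the `.umka_params` section, which the linker script gathers into one contiguous array. `MiniParam` here is a pared-down stand-in for the full `KernelParam` struct, and the getter is a placeholder:

```rust
/// Pared-down stand-in for `KernelParam`, for illustration only.
pub struct MiniParam {
    pub name: &'static str,
    pub privileged: bool,
    pub getter: fn() -> u32,
}

fn read_somaxconn() -> u32 {
    128 // placeholder; the real getter loads an atomic
}

#[link_section = ".umka_params"]
#[used] // keep the descriptor even though nothing references it directly
static PARAM_NET_CORE_SOMAXCONN: MiniParam = MiniParam {
    name: "net.core.somaxconn",
    privileged: true,
    getter: read_somaxconn,
};
```

Because the descriptor is a `static` in a read-only section, registration costs nothing at runtime; the umkafs driver simply walks `__umka_params_start..__umka_params_end` at mount time.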
19.8.7 eBPF Access
eBPF programs can read (but not write) kernel parameters via the
bpf_param_read(name, value_out) helper. The eBPF verifier resolves the
name string at JIT time, validates that the referenced parameter exists and is
readable, and replaces the helper call with a direct memory load of the
parameter's current value. There is no runtime string parsing in the hot path.
Write access is intentionally excluded from eBPF: parameter writes require
CAP_SYS_ADMIN enforcement that the verifier cannot safely model for all
program attachment points.
19.8.8 Bulk Enumeration
Orchestration tools and monitoring agents can enumerate all parameters without iterating individual directories:
ls /System/Kernel/params/
fs.file-max/
kernel.perf_event_paranoid/
net.core.somaxconn/
net.ipv4.tcp_rmem/
vm.swappiness/
...
# Machine-readable schema for all parameters in one read:
cat /System/Kernel/params/.schema.json
[
  {"name":"net.core.somaxconn","type":"u32","min":1,"max":65535,
   "default":128,"privileged":true,"namespace":"per-ns",
   "description":"Maximum backlog for listen() sockets."},
  {"name":"vm.swappiness","type":"u32","min":0,"max":200,
   "default":60,"privileged":true,"namespace":"global",
   "description":"Tendency of the kernel to swap anonymous memory."},
  ...
]
The .schema.json file is generated on the fly: the umkafs driver iterates the
.umka_params section on each read. The file is never stored on disk.
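That generation pass can be sketched as a single loop over the descriptor array, emitting the JSON keys shown in the example above by hand (no serializer dependency). The struct fields and helper name are illustrative; in the kernel the array comes from the .umka_params section rather than a static slice:

```rust
// Sketch: rendering .schema.json from the parameter descriptors.
struct KernelParam {
    name: &'static str,
    min: u64,
    max: u64,
    default: u64,
    privileged: bool,
    per_namespace: bool,
    description: &'static str,
}

static UMKA_PARAMS: &[KernelParam] = &[KernelParam {
    name: "net.core.somaxconn",
    min: 1,
    max: 65535,
    default: 128,
    privileged: true,
    per_namespace: true,
    description: "Maximum backlog for listen() sockets.",
}];

/// One JSON object per descriptor, matching the keys in the example above.
fn render_schema_json() -> String {
    let entries: Vec<String> = UMKA_PARAMS
        .iter()
        .map(|p| {
            format!(
                concat!(
                    "{{\"name\":\"{}\",\"type\":\"u32\",\"min\":{},\"max\":{},",
                    "\"default\":{},\"privileged\":{},\"namespace\":\"{}\",",
                    "\"description\":\"{}\"}}"
                ),
                p.name,
                p.min,
                p.max,
                p.default,
                p.privileged,
                if p.per_namespace { "per-ns" } else { "global" },
                p.description,
            )
        })
        .collect();
    format!("[{}]", entries.join(","))
}

fn main() {
    println!("{}", render_schema_json());
}
```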
19.8.9 Linux Compatibility
/proc/sys/ text files continue to work identically. Each existing sysctl
path is implemented as a read-write passthrough in the compat layer that
delegates to the corresponding KernelParam getter and setter. Shell scripts,
Ansible playbooks, and container runtimes that write to /proc/sys/net/core/somaxconn
or similar paths observe unchanged behavior.
The sysctl(8) tool works without modification: it reads and writes
/proc/sys/ paths, which pass through the compat shim.
The sysctl(2) syscall (long deprecated upstream and removed in Linux 5.5;
emulated here for binary compatibility) is handled by umka-compat. The syscall
translates the legacy integer key array into a dot-separated parameter name,
then dispatches to the same getter/setter as the umkafs and /proc/sys/ paths.
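That translation is a walk down a small tree of legacy integer constants. A sketch under the assumption of a tiny excerpt table; the numeric values follow Linux's uapi/linux/sysctl.h for this one path, but the full shim would carry the complete mapping:

```rust
// Sketch of umka-compat's sysctl(2) key translation: walk the legacy
// integer key array through a name tree to produce the dotted parameter
// name. The table below is a tiny illustrative excerpt, not the full
// uapi/linux/sysctl.h mapping.

enum Node {
    Dir(&'static str, &'static [(i32, Node)]),
    Leaf(&'static str),
}

static ROOT: &[(i32, Node)] = &[(
    3, // CTL_NET
    Node::Dir("net", &[(
        1, // NET_CORE
        Node::Dir("core", &[(
            18, // NET_CORE_SOMAXCONN
            Node::Leaf("somaxconn"),
        )]),
    )]),
)];

/// Translate e.g. [3, 1, 18] -> "net.core.somaxconn".
/// Returns None for unknown keys or a key array that stops at a directory.
fn translate(keys: &[i32]) -> Option<String> {
    let mut table = ROOT;
    let mut parts: Vec<&str> = Vec::new();
    for (i, k) in keys.iter().enumerate() {
        let (_, node) = table.iter().find(|(id, _)| id == k)?;
        match node {
            Node::Dir(name, children) => {
                parts.push(*name);
                table = *children;
            }
            Node::Leaf(name) => {
                parts.push(*name);
                // The leaf must be the final key in the array.
                return if i + 1 == keys.len() { Some(parts.join(".")) } else { None };
            }
        }
    }
    None
}

fn main() {
    assert_eq!(translate(&[3, 1, 18]).as_deref(), Some("net.core.somaxconn"));
    assert!(translate(&[3, 1, 99]).is_none());
    println!("translate ok");
}
```

Once the dotted name exists, the legacy path is indistinguishable from a umkafs or /proc/sys/ access: all three converge on the same KernelParam descriptor.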
/System/Kernel/params/ is an UmkaOS addition. No Linux kernel or userspace
tool uses this path; it is purely additive.
Implementation phases:
- Phase 1: Core infrastructure — KernelParam, ParamSchema, ParamValue,
kernel_param! macro, .umka_params ELF section, umkafs driver population.
Migrate all parameters currently exported via /proc/sys/ to typed descriptors.
- Phase 2: /proc/sys/ compat shim backed by KernelParam getters/setters.
sysctl(2) syscall dispatch. Text and binary read/write protocol on value file.
- Phase 3: .schema.json aggregate file. eBPF bpf_param_read helper and
verifier support. Per-namespace parameter scoping enforcement.
- Phase 4: sysctl(8) extended output mode showing type and range annotations
(opt-in flag; default output unchanged for compat).
Cross-references: Section 19.4 (umkafs unified namespace), Section 15 (Linux compatibility layer), Section 1.1 (UmkaOS design principles).