
Chapter 20: Observability and Diagnostics

Fault management architecture, stable tracepoints, debugging/ptrace, unified object namespace (umkafs)


Observability is built around the Fault Management Architecture (FMA): a unified sense-think-act remediation loop with structured error diagnosis and autonomous response (driver reload, page retirement, device quarantine). Stable tracepoints provide an ABI-stable versioned interface for performance tools — tracepoint IDs survive kernel updates. umkafs (/ukfs/) exposes kernel state as a native filesystem separate from sysfs (which remains for Linux compatibility). EDAC provides proactive page retirement driven by correctable-error trending. PMU and pstore provide hardware profiling and panic log persistence.

20.1 Fault Management Architecture

Inspired by: Solaris/illumos FMA. IP status: Clean — generic engineering concepts, clean-room implementation from public principles.

20.1.1 Problem

Hardware fails gradually. ECC memory corrects bit flips before they become fatal. NVMe drives report wear leveling and reallocated sectors. PCIe links log correctable errors. Network interfaces track CRC errors. These are warning signals.

Linux largely ignores them. Userspace tools (mcelog, smartctl, rasdaemon) scrape ad-hoc kernel interfaces. There is no unified kernel-level framework for collecting hardware telemetry, diagnosing trends, and taking corrective action before a crash.

UmkaOS already has crash recovery (Section 11.9). FMA extends this from reactive ("driver crashed, reload it") to proactive ("this DIMM is degrading, retire its pages before data corruption occurs").

20.1.2 Architecture

+------------------------------------------------------------------+
|                     FAULT MANAGEMENT ENGINE                      |
|                  (UmkaOS Core, kernel-internal)                  |
|                                                                  |
|  Telemetry        Diagnosis          Response                    |
|  Collector  --->  Engine       --->  Executor                    |
|                                                                  |
|  - Per-device     - Rule-based       - Retire memory pages       |
|    health buffers   threshold        - Demote driver tier        |
|  - Ring buffer      detection        - Disable device            |
|    (lock-free)    - Correlation      - Alert via printk/uevent   |
|  - NUMA-aware       (multi-signal)   - Migrate I/O (if possible) |
+------------------------------------------------------------------+
         ^                                       |
         |  Health reports via KABI              |  Actions via registry
         |                                       v
+------------------+                   +-------------------+
| Device Drivers   |                   | Device Registry   |
| (Tier 0/1/2)     |                   | (Section 11.4)    |
+------------------+                   +-------------------+

20.1.3 Telemetry Collection

Drivers report health data to the kernel through a new KABI method:

// Appended to KernelServicesVTable (Option<...> for backward compat)

/// Report a health telemetry event.
pub fma_report_health: Option<unsafe extern "C" fn(
    device_handle: DeviceHandle,
    event_class: HealthEventClass,
    event_code: u32,
    severity: HealthSeverity,
    data: *const u8,
    data_len: u32,
) -> IoResultCode>,
/// Health event classification.
#[repr(u32)]
pub enum HealthEventClass {
    /// Memory: ECC corrected error, uncorrectable error, scrub result.
    Memory          = 0,
    /// Storage: SMART attribute change, wear level, reallocated sector.
    Storage         = 1,
    /// Network: CRC errors, link flaps, packet drops.
    Network         = 2,
    /// PCIe: Correctable error, AER event, link retraining.
    Pcie            = 3,
    /// Thermal: Over-temperature warning, throttling.
    Thermal         = 4,
    /// Power: Voltage out of range, power supply degradation.
    Power           = 5,
    /// Generic: Driver-defined health event.
    Generic         = 6,
    /// Accelerator: GPU, FPGA, or other accelerator health event.
    Accelerator     = 7,
    /// CPU: MCE, thermal throttle, microcode error, or CPU-specific fault.
    Cpu             = 8,
    /// Performance: Accelerator submission timeout, sustained throughput
    /// degradation, or latency anomaly. Distinct from `Accelerator` (which
    /// covers hardware faults like ECC or thermal events) — `Performance`
    /// tracks behavioral degradation that may not indicate a hardware fault.
    Performance     = 9,
}

impl HealthEventClass {
    /// Number of variants in this enum.  Used to size arrays indexed by
    /// event class, avoiding hardcoded constants.
    /// **Maintenance rule**: This value MUST be updated when adding or removing
    /// enum variants. A compile-time assertion (`const_assert!(COUNT == ...)`)
    /// in the implementation crate guards against stale values.
    pub const COUNT: usize = 10;
}
const_assert!(HealthEventClass::COUNT == 10);

#[repr(u8)]
pub enum HealthSeverity {
    /// Error was automatically corrected (e.g., ECC single-bit correction).
    /// Logged but no action required. Health score penalty: 0.
    Corrected       = 0,
    /// Informational: no action needed, for trending.
    Info            = 1,
    /// Warning: threshold approaching, admin should investigate.
    Warning         = 2,
    /// Degraded: component partially failed, corrective action taken.
    Degraded        = 3,
    /// Critical: imminent failure, immediate action required.
    Critical        = 4,
    /// Fatal — unrecoverable failure. Triggers device offline or kernel panic
    /// per escalation policy (Section 20.1.9). Health score penalty: 100.
    Fatal           = 5,
}

The data field carries event-specific payload. Standard payload formats per class:

/// Memory health payload.
#[repr(C)]
pub struct MemoryHealthData {
    /// Physical address of the error (0 = unknown).
    pub phys_addr: u64,
    /// DIMM identifier (SMBIOS handle or ACPI proximity).
    pub dimm_id: u32,
    /// Error type: 0 = correctable, 1 = uncorrectable.
    pub error_type: u32,
    /// ECC syndrome (for diagnosis).
    pub syndrome: u64,
    /// Cumulative correctable error count for this DIMM.
    pub cumulative_ce_count: u64,
    /// Cumulative uncorrectable error count for this DIMM.
    pub cumulative_ue_count: u64,
    /// DIMM temperature in millidegrees Celsius (0 = unknown/unsupported).
    /// Read from SMBIOS Type 28 or in-band thermal sensor if available.
    pub temp_millidegrees: u32,
    pub _pad: u32,
}
// MemoryHealthData: u64(8) + u32(4)*2 + u64(8)*3 + u32(4)*2 = 48 bytes.
const_assert!(size_of::<MemoryHealthData>() == 48);

/// Storage health payload.
#[repr(C)]
pub struct StorageHealthData {
    /// SMART attribute ID.
    pub attribute_id: u8,
    /// Current value.
    pub current: u8,
    /// Worst recorded value.
    pub worst: u8,
    /// Threshold for failure.
    pub threshold: u8,
    pub _pad_align: [u8; 4], // Explicit padding for u64 alignment (repr(C))
    /// Raw attribute value (vendor-specific).
    pub raw_value: u64,
    /// Percentage life remaining (0-100, 0xFF = unknown).
    pub life_remaining_pct: u8,
    pub _pad: [u8; 7],
}
// StorageHealthData: 1+1+1+1+4+8+1+7 = 24 bytes.
const_assert!(size_of::<StorageHealthData>() == 24);

/// Network health payload.
#[repr(C)]
pub struct NetworkHealthData {
    /// Interface index (matches the device registry's interface ID).
    pub if_index: u32,
    /// CRC error count since last report.
    pub crc_errors: u32,
    /// Link flap count since last report.
    pub link_flaps: u32,
    /// Packet drop count (RX + TX) since last report.
    pub packet_drops: u32,
    /// Current link speed in Mbps (0 = link down).
    pub link_speed_mbps: u32,
    /// Link state: 0 = down, 1 = up.
    pub link_up: u8,
    pub _pad: [u8; 3],
}
// NetworkHealthData: 4+4+4+4+4+1+3 = 24 bytes.
const_assert!(size_of::<NetworkHealthData>() == 24);

/// Thermal health payload.
#[repr(C)]
pub struct ThermalHealthData {
    /// Current temperature in millidegrees Celsius (e.g., 72500 = 72.5 C).
    pub temp_millicelsius: i32,
    /// Thermal throttling threshold in millidegrees Celsius.
    pub throttle_threshold_mc: i32,
    /// Critical shutdown threshold in millidegrees Celsius.
    pub critical_threshold_mc: i32,
    /// Whether the device is currently throttled (0 = no, 1 = yes).
    pub throttled: u8,
    /// Thermal zone identifier (device-specific).
    pub zone_id: u8,
    pub _pad: [u8; 2],
}
// ThermalHealthData: i32(4)*3 + u8(1)*2 + [u8;2] = 16 bytes.
const_assert!(size_of::<ThermalHealthData>() == 16);

/// PCIe health payload.
/// Layout (20 bytes): bdf(4) + cor_status(4) + uncor_status(4) +
/// link_speed(2) + link_width(1) + _pad(1) + retrain_count(4) = 20.
#[repr(C)]
pub struct PcieHealthData {
    /// BDF (bus:device.function).
    pub bdf: u32,
    /// AER correctable error status register.
    pub cor_status: u32,
    /// AER uncorrectable error status register.
    pub uncor_status: u32,
    /// Current link speed (GT/s * 10, e.g., 80 = 8.0 GT/s).
    pub link_speed: u16,
    /// Current link width (x1, x4, x8, x16).
    pub link_width: u8,
    /// Explicit padding for u32 alignment. Must be zeroed.
    pub _pad: u8,
    /// Link retraining count.
    // u32: wraps in ~8,171 years at 1 retrain/minute. Safe for 50-year uptime.
    pub retrain_count: u32,
}
const_assert!(core::mem::size_of::<PcieHealthData>() == 20);
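The size comments above can be verified mechanically: `#[repr(C)]` fixes field order, so the padding arithmetic is checkable with `size_of`. A host-side sketch (plain Rust, reproducing two of the payload structs without kernel types):

```rust
// Mirrors StorageHealthData: 4 x u8 + explicit 4-byte pad so `raw_value`
// is 8-byte aligned, then u64 + u8 + 7-byte tail pad = 24 bytes.
#[repr(C)]
struct StorageHealthData {
    attribute_id: u8,
    current: u8,
    worst: u8,
    threshold: u8,
    _pad_align: [u8; 4],
    raw_value: u64,
    life_remaining_pct: u8,
    _pad: [u8; 7],
}

// Mirrors PcieHealthData: 3 x u32 + u16 + u8 + explicit 1-byte pad
// (restoring 4-byte alignment for the trailing u32) = 20 bytes.
#[repr(C)]
struct PcieHealthData {
    bdf: u32,
    cor_status: u32,
    uncor_status: u32,
    link_speed: u16,
    link_width: u8,
    _pad: u8,
    retrain_count: u32,
}
```

Dropping any `_pad` field would change the size via implicit padding, which the `const_assert!` checks above are designed to catch.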

Hwmon integration: Hardware monitoring sensors (temperature, voltage, fan speed) registered via the HwmonDevice trait (Section 13.13) emit FaultEvent::Thermal events through the standard fma_report_health() KABI method above. There is no separate hwmon-to-FMA bypass path. All hwmon threshold violations (e.g., temp1_crit exceeded) are routed through this telemetry collection interface, ensuring that the FMA diagnosis engine has a unified view of all hardware health events regardless of their source (EDAC, SMART, hwmon, PCIe AER).

20.1.4 Telemetry Buffer

The FMA engine maintains a per-device circular buffer of health events:

// Kernel-internal

// Health telemetry uses `BoundedMpmcRing<T, N>` (Section 3.1.11.4) — the
// canonical kernel-wide lock-free ring buffer. It is NMI-safe (multiple
// concurrent producers including NMI context), inline-storage (no heap
// allocation), and supports the MPSC pattern used by fma_emit() producers
// and the single FMA processing kthread consumer.

pub struct DeviceHealthLog {
    /// Device node this log belongs to.
    device_id: DeviceNodeId,

    /// Circular buffer of recent events (fixed size per device).
    events: BoundedMpmcRing<HealthEvent, 256>,

    /// Counters by event class (for fast threshold checks).
    /// Size is derived from the enum variant count, not hardcoded.
    class_counts: [AtomicU64; HealthEventClass::COUNT],

    /// Timestamp of first event in current window (for rate detection).
    /// AtomicU64: read by `fma_emit()` (NMI-safe, lock-free hot path) and written
    /// by the FMA kthread when rotating rate windows. Without atomicity, a
    /// concurrent read from NMI context races with the kthread write — a data
    /// race that is undefined behavior on non-x86 architectures (ARM, RISC-V
    /// do not guarantee atomic u64 reads for plain types).
    window_start_ns: AtomicU64,

    /// NUMA node (allocate buffer on device's NUMA node).
    numa_node: i32,

    /// Maximum events per second per (device, event_class) pair.
    /// Events exceeding this rate are counted but not stored in the
    /// circular buffer. Default: 100.
    rate_limit_per_sec: u32,

    /// Number of events suppressed by rate limiting since last reset.
    /// When non-zero, a single "rate_limited" meta-event is recorded
    /// in the circular buffer with the suppressed count.
    suppressed_count: AtomicU64,
}

#[repr(C)]
pub struct HealthEvent {
    pub timestamp_ns: u64,
    pub class: HealthEventClass,      // offset  8, 4 bytes
    pub code: u32,                    // offset 12, 4 bytes
    pub severity: HealthSeverity,     // offset 16, 1 byte (#[repr(u8)])
    pub _severity_pad: [u8; 3],       // offset 17, 3 bytes explicit padding
    pub data_len: u32,                // offset 20, 4 bytes (moved before data to eliminate tail padding)
    pub data: [u8; 64],              // offset 24, 64 bytes — inline payload (avoids allocation)
}
const_assert!(size_of::<HealthEvent>() == 88);
// Layout: 8+4+4+1+3(pad)+4+64 = 88 bytes, explicit padding, no implicit holes.

Backpressure design (prevents event storms from overwhelming the telemetry path):

The ingestion path uses a two-level design:

- Fast path: Atomic counter increment per (device, event_class) pair. Zero allocation. Every event is counted in class_counts regardless of rate.
- Slow path: Detailed event stored in the circular buffer only if the rate is below rate_limit_per_sec (default: 100 events/second per class). Above the threshold, a single "rate_limited" meta-event is recorded with the suppressed_count value, then the counter resets.

This ensures the buffer contains representative events without being flooded during error storms.
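A minimal host-side sketch of this two-level path. The per-class atomic indexing and NMI-safe details are simplified, and window rotation happens inline rather than on the FMA kthread, so this illustrates the counting/suppression logic only:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Outcome of ingesting one event.
#[derive(Debug, PartialEq)]
enum Ingest {
    /// Stored in the circular buffer.
    Stored,
    /// Over the rate limit: counted, but not stored.
    Suppressed,
    /// First event after a window with suppressions; carries the
    /// suppressed count for the "rate_limited" meta-event.
    RateLimitedMeta(u64),
}

/// Per-(device, event_class) ingestion state (illustrative reduction).
struct IngestState {
    total_count: AtomicU64,  // fast path: every event is counted
    window_start_ns: u64,
    stored_in_window: u32,
    suppressed: u64,
    rate_limit_per_sec: u32, // default 100
}

impl IngestState {
    fn ingest(&mut self, now_ns: u64) -> Ingest {
        const WINDOW_NS: u64 = 1_000_000_000;
        // Fast path: unconditional count, zero allocation.
        self.total_count.fetch_add(1, Ordering::Relaxed);
        // Rotate the rate window once a full second has elapsed.
        if now_ns.saturating_sub(self.window_start_ns) >= WINDOW_NS {
            self.window_start_ns = now_ns;
            self.stored_in_window = 1;
            if self.suppressed > 0 {
                let n = self.suppressed;
                self.suppressed = 0;
                return Ingest::RateLimitedMeta(n); // one meta-event, then reset
            }
            return Ingest::Stored;
        }
        // Slow path: store only while under the per-second limit.
        if self.stored_in_window < self.rate_limit_per_sec {
            self.stored_in_window += 1;
            Ingest::Stored
        } else {
            self.suppressed += 1;
            Ingest::Suppressed
        }
    }
}
```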

Kernel-internal safe wrapper for fma_report_health:

/// Report a health telemetry event. Safe wrapper around the KABI vtable
/// function pointer. Called by drivers and subsystems to report hardware
/// health events to the FMA diagnosis engine.
pub fn fma_report_health(
    device: DeviceHandle,
    class: HealthEventClass,
    code: u32,
    severity: HealthSeverity,
    data: &[u8],
) -> IoResultCode {
    // SAFETY: vtable pointer is valid for the lifetime of the kernel.
    // data pointer and length are derived from a valid slice.
    let vtable = kernel_services_vtable();
    match vtable.fma_report_health {
        Some(f) => unsafe { f(device, class, code, severity, data.as_ptr(), data.len() as u32) },
        None => IoResultCode::ENOSYS,
    }
}
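The `Option`-based vtable slot makes the backward-compatibility behavior easy to demonstrate: a driver built against a newer KABI degrades to ENOSYS on a kernel whose vtable predates the slot. A reduced stand-in (types and the `stub_report` function are illustrative, not the real KABI definitions):

```rust
#[derive(Debug, PartialEq)]
enum IoResultCode {
    Success,
    ENOSYS,
}

/// Reduced vtable: the appended slot is Option so older kernels,
/// which never populated it, leave it as None.
struct KernelServicesVTable {
    fma_report_health: Option<fn(data: &[u8]) -> IoResultCode>,
}

/// Illustrative stand-in for the kernel-side implementation.
fn stub_report(_data: &[u8]) -> IoResultCode {
    IoResultCode::Success
}

/// Safe-wrapper pattern: dispatch through the slot, fall back to ENOSYS.
fn report_health(vt: &KernelServicesVTable, data: &[u8]) -> IoResultCode {
    match vt.fma_report_health {
        Some(f) => f(data),
        None => IoResultCode::ENOSYS, // older kernel: telemetry unsupported
    }
}
```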

FaultEvent source registry: Subsystems register event types via fma_register_event_source(class: HealthEventClass, source_name: &str, event_codes: &[(u32, &str)]) -> Result<(), FmaError>. Each (class, event_code) pair is globally unique — registration fails with EEXIST on collision. Built-in sources (EDAC, PCIe AER, thermal) are registered at boot. Driver-provided sources register during probe.

20.1.4.1 FMA Event Source Registry

All subsystems that emit FaultEvents:

| Subsystem | HealthEventClass | Event codes | Source |
|-----------|------------------|-------------|--------|
| EDAC | Memory | CE (0x01), UE (0x02), Scrub (0x03) | Section 20.6 |
| PCIe AER | Pcie | CorrErr (0x01), UncorrErr (0x02), LinkRetrain (0x03) | Section 11.4 |
| NVMe SMART | Storage | WearLevel (0x01), Reallocated (0x02), Temperature (0x03) | Section 15.19 |
| NIC driver | Network | CrcError (0x01), LinkFlap (0x02), RingCorruption (0x03) | Section 16.13 |
| Thermal | Thermal | OverTemp (0x01), Throttling (0x02) | Section 7.4 |
| CPU MCE | Cpu | MachineCheck (0x01), ThermalThrottle (0x02) | Section 2.23 |
| Crash recovery | Generic | DriverCrash (0x1001), DeviceResetFailed (0x1003) | Section 11.9 |
| Accelerator | Accelerator | EccError (0x01), SubmitTimeout (0x02) | Section 22.1 |
| DSM | Performance | WritebackDegraded (0x01), ProbeMiss (0x02) | Section 6.12 |

New subsystems register via fma_register_event_source() (Section 20.1).

20.1.5 Diagnosis Engine

The diagnosis engine is a rule-based evaluator that runs when new telemetry arrives.

// Kernel-internal

pub struct DiagnosisRule {
    /// Human-readable name (for logging).
    name: ArrayString<64>,

    /// Match criteria.
    event_class: HealthEventClass,
    min_severity: HealthSeverity,

    /// Threshold: number of matching events within time window.
    count_threshold: u32,
    window_seconds: u32,

    /// Rate: events per second sustained over window.
    rate_threshold: Option<u32>,

    /// Value-based threshold: fires when a specific field in the health event payload
    /// falls below (or exceeds) a specified level, independent of event counts.
    ///
    /// For example, NVMe wear rules ("life < 10%") use:
    ///   field_id = DiagField::LifeRemainingPct, threshold = 10
    /// The diagnosis engine reads the field identified by `field_id` from the health
    /// event payload via the `DiagFieldAccessor` trait and fires the rule when the
    /// field's value drops below `threshold`.
    /// When `None`, only `count_threshold` / `rate_threshold` are evaluated.
    value_threshold: Option<ValueThreshold>,

    /// Correlation: require events from multiple related devices.
    correlation: Option<CorrelationRule>,

    /// Action to take when rule fires.
    action: DiagnosisAction,
}

/// Enum-based field identifiers for scalar values in health event payloads.
/// Used instead of string-based field names because Rust has no runtime
/// reflection -- struct fields don't carry name metadata at runtime. Each
/// health event payload type implements `DiagFieldAccessor` to map these
/// enum variants to the actual struct field values.
#[repr(u16)]
pub enum DiagField {
    /// NVMe endurance: remaining drive life as a percentage (0-100).
    LifeRemainingPct    = 0,
    /// Current temperature in millidegrees Celsius.
    Temperature         = 1,
    /// Cumulative correctable error count.
    CorrectableErrors   = 2,
    /// Cumulative uncorrectable error count.
    UncorrectableErrors = 3,
    /// Available spare capacity as a percentage (NVMe).
    AvailableSparePct   = 4,
    /// Power-on hours.
    PowerOnHours        = 5,
    /// Media errors (NVMe).
    MediaErrors         = 6,
    /// Memory correctable error count (DIMM health).
    /// For `MemoryHealthData` payloads, this returns the same value as
    /// `CorrectableErrors` (both map to `cumulative_ce_count`). The
    /// distinction exists because `CorrectableErrors` is the generic variant
    /// usable by any payload type, while this variant is memory-specific.
    /// For `StorageHealthData`, both return `None`.
    MemoryCorrectableErrors = 7,
}

/// Trait implemented by each health event payload type (e.g., `StorageHealthData`,
/// `MemoryHealthData`). Maps `DiagField` variants to the corresponding scalar
/// value in the payload struct. Returns `None` if the field is not applicable
/// to this payload type (e.g., `LifeRemainingPct` on a DIMM health event).
pub trait DiagFieldAccessor {
    /// Extract the scalar value for the given field, or `None` if not applicable.
    /// Returns `i64` (not `u64`) to handle signed values: negative temperatures,
    /// signed error codes, and threshold deltas. Matches `DiagPayloadDescriptor::field_value`
    /// return type for consistency across all field extraction paths.
    fn field_value(&self, field: DiagField) -> Option<i64>;
}

Health Payload Type Registry

Health event types register their DiagFieldAccessor implementation at boot time via a static inventory pattern:

Each subsystem declares its health payload type descriptor:

pub trait DiagPayloadDescriptor: Send + Sync {
    /// Unique type identifier (e.g., 0x01 = Storage, 0x02 = Memory).
    fn type_id(&self) -> u8;
    /// Human-readable name for diagnostics.
    fn name(&self) -> &'static str;
    /// Supported fields for value-based diagnosis rules.
    fn supported_fields(&self) -> &'static [DiagField];
    /// Extract a named field value from the raw health event payload.
    /// The descriptor knows the payload layout (keyed by `type_id()`),
    /// so it encapsulates the transmute from `[u8]` to the typed struct
    /// internally. Returns `None` if `field` is not supported or `data`
    /// is too short.
    fn field_value(&self, data: &[u8], data_len: u32, field: DiagField) -> Option<i64>;
}

/// Global registry of health payload types. Two-phase initialization:
/// 1. During boot, subsystems call `register_diag_payload()` which inserts
///    into a temporary `SpinLock<ArrayVec<...>>` (cold path).
/// 2. `fma_freeze_registry()` moves the contents into the OnceCell. After
///    freeze, reads are a plain `&ArrayVec` with no synchronization.
///
/// The diagnosis engine evaluates rules on health events (warm/hot FMA path);
/// locking a SpinLock on every health event to read a static registry would
/// be unnecessary overhead.
static DIAG_PAYLOAD_REGISTRY_BOOT: SpinLock<ArrayVec<&'static dyn DiagPayloadDescriptor, 32>>
    = SpinLock::new(ArrayVec::new());
pub static DIAG_PAYLOAD_REGISTRY: OnceCell<ArrayVec<&'static dyn DiagPayloadDescriptor, 32>>
    = OnceCell::new();

Registration example:

// In storage subsystem init:
fn storage_fma_init() {
    DIAG_PAYLOAD_REGISTRY_BOOT.lock().push(&STORAGE_HEALTH_DESCRIPTOR);
}

When a health event arrives, FMA looks up the type_id in the registry to obtain the appropriate DiagFieldAccessor for field-level rule evaluation. Custom health payloads (e.g., accelerator-specific telemetry) register through the same mechanism during their subsystem's init phase.
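Lookup against the frozen registry is a plain linear scan over a slice, keyed by `type_id`. A host-side sketch with a reduced trait and a hypothetical `StorageDescriptor` (names here are illustrative):

```rust
/// Reduced descriptor trait: just enough for type_id-based lookup.
trait DiagPayloadDescriptor: Sync {
    fn type_id(&self) -> u8;
    fn name(&self) -> &'static str;
}

/// Hypothetical descriptor for the storage health payload (type 0x01).
struct StorageDescriptor;

impl DiagPayloadDescriptor for StorageDescriptor {
    fn type_id(&self) -> u8 { 0x01 }
    fn name(&self) -> &'static str { "storage" }
}

static STORAGE: StorageDescriptor = StorageDescriptor;

/// Post-freeze lookup: no locking, a linear scan over at most 32 entries.
fn lookup(
    registry: &[&'static dyn DiagPayloadDescriptor],
    id: u8,
) -> Option<&'static dyn DiagPayloadDescriptor> {
    registry.iter().copied().find(|d| d.type_id() == id)
}
```

With a registry bounded at 32 entries, a linear scan is cache-friendly and faster in practice than a hashed structure would be.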

/// Example: `DiagFieldAccessor` implementation for `StorageHealthData`.
/// Each health payload type provides a similar mapping from `DiagField`
/// variants to its own struct fields.
impl DiagFieldAccessor for StorageHealthData {
    fn field_value(&self, field: DiagField) -> Option<i64> {
        match field {
            DiagField::LifeRemainingPct    => Some(self.life_remaining_pct as i64),
            DiagField::Temperature         => None, // not applicable to storage
            DiagField::CorrectableErrors   => None,
            DiagField::UncorrectableErrors => None,
            DiagField::AvailableSparePct   => None, // carried separately in NVMe-specific payload
            DiagField::PowerOnHours        => None,
            DiagField::MediaErrors         => None,
            DiagField::MemoryCorrectableErrors => None,
        }
    }
}

/// Example: `DiagFieldAccessor` implementation for `MemoryHealthData`.
/// Demonstrates a payload type where multiple fields return real values.
/// The EDAC subsystem (Section 20.6) emits `MemoryHealthData` events
/// with per-DIMM correctable/uncorrectable error counts and temperature.
impl DiagFieldAccessor for MemoryHealthData {
    fn field_value(&self, field: DiagField) -> Option<i64> {
        match field {
            DiagField::CorrectableErrors      => Some(self.cumulative_ce_count as i64),
            DiagField::UncorrectableErrors    => Some(self.cumulative_ue_count as i64),
            DiagField::Temperature            => Some(self.temp_millidegrees as i64),
            DiagField::MemoryCorrectableErrors => Some(self.cumulative_ce_count as i64),
            // Memory payloads do not carry storage-specific fields.
            DiagField::LifeRemainingPct       => None,
            DiagField::AvailableSparePct      => None,
            DiagField::PowerOnHours           => None,
            DiagField::MediaErrors            => None,
        }
    }
}

/// Threshold applied to a specific scalar field in the health event payload.
/// Used for percentage- or gauge-based rules (e.g., NVMe wear life).
// kernel-internal, not KABI — diagnosis policy struct, not exposed to userspace.
#[repr(C)]
pub struct ValueThreshold {
    /// Identifies which field in the health event payload to compare.
    /// E.g., `DiagField::LifeRemainingPct` for NVMe endurance data.
    field_id: DiagField,

    /// Numeric trigger level. The rule fires when the field's value is
    /// strictly less than this value (e.g., `10` means "< 10%").
    threshold: u32,
}
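Putting `ValueThreshold` and `DiagFieldAccessor` together, value-based rule evaluation reduces to a field read plus a strict-less-than comparison. A self-contained sketch using reduced versions of the types above (the `StorageHealth` stand-in and `value_rule_fires` helper are illustrative):

```rust
#[derive(Clone, Copy)]
enum DiagField {
    LifeRemainingPct,
    Temperature,
}

trait DiagFieldAccessor {
    fn field_value(&self, field: DiagField) -> Option<i64>;
}

/// Reduced stand-in for StorageHealthData.
struct StorageHealth {
    life_remaining_pct: u8,
}

impl DiagFieldAccessor for StorageHealth {
    fn field_value(&self, field: DiagField) -> Option<i64> {
        match field {
            DiagField::LifeRemainingPct => Some(self.life_remaining_pct as i64),
            DiagField::Temperature => None, // not carried by this payload
        }
    }
}

/// A value threshold fires only when the field applies to the payload
/// and its value is strictly below the trigger level ("< 10%" semantics).
fn value_rule_fires<A: DiagFieldAccessor>(payload: &A, field: DiagField, threshold: u32) -> bool {
    matches!(payload.field_value(field), Some(v) if v < threshold as i64)
}
```

Note that a non-applicable field (`None`) never fires the rule, so a misconfigured rule degrades to a no-op rather than a false positive.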

pub enum DiagnosisAction {
    /// Log a message and emit uevent. No automatic correction.
    Alert { message: ArrayString<128> },

    /// Retire specific physical pages (memory errors). The `start_pfn` identifies
    /// the first page frame to retire; `count` is the number of contiguous pages.
    RetirePages { start_pfn: u64, count: u32 },

    /// Demote the device's driver to a lower tier.
    DemoteTier,

    /// Disable the device entirely.
    DisableDevice,

    /// Mark device as degraded (informational, for admin).
    MarkDegraded,

    /// Trigger live evolution ([Section 13.18](13-device-classes.md#live-kernel-evolution)) to proactively replace a degrading component.
    TriggerEvolution { target_component: ArrayString<64> },

    /// Signal the scheduler to avoid a NUMA node with degraded memory (ECC
    /// uncorrectable errors, offline DIMMs). The scheduler's NUMA placement
    /// policy (NUMA balancing, EEVDF/CBS scheduling chapter) treats the flagged
    /// node as a last-resort migration target until the degradation is resolved.
    AvoidNumaNode { node_id: u32 },
}

/// Demote a device to a lower isolation tier. Called by FMA when diagnosis
/// rules determine the device is unreliable at its current tier.
/// Returns Ok(new_tier) on success, Err if no lower tier available.
pub fn fma_demote_device(
    dev: &DeviceHandle,
    reason: &DiagnosisReport,
) -> Result<IsolationTier, FmaError> {
    // 1. Signal driver: prepare for tier demotion (DrainNotify).
    // 2. Wait for in-flight I/O to complete (timeout: 5 seconds).
    // 3. Unload driver from current isolation domain.
    // 4. Reload driver in next-lower tier (Tier1 → Tier2, Tier2 → disabled).
    // 5. Transition DeviceState: Active → Degraded.
    // 6. Log FMA event with diagnosis report.
}

/// Disable a device entirely. Called by FMA when uncorrectable errors
/// exceed threshold. Device remains visible in sysfs (for diagnostics)
/// but all I/O returns -EIO.
pub fn fma_disable_device(
    dev: &DeviceHandle,
    reason: &DiagnosisReport,
) -> Result<(), FmaError> {
    // 1. Remove device from all active I/O paths.
    // 2. Transition DeviceState: * → Error.
    // 3. All pending I/O: complete with -EIO.
    // 4. Device node remains in /sys for introspection.
}

Device state transitions are managed by the device registry (Section 11.4). FMA actions
invoke registry APIs that perform the actual tier demotion or device disable.

/// Used by the diagnosis engine (Section 20.1.5) to detect correlated failures
/// across multiple related devices. When a DiagnosisRule includes a CorrelationRule,
/// the engine evaluates the threshold across all devices sharing the specified
/// property (e.g., same memory controller, same PCIe root complex, same NUMA node).
/// The event correlation engine (umkafs, Health namespace) uses these rules to
/// populate cross-device fault reports visible under /Health/ByDevice/.
pub struct CorrelationRule {
    /// Stable identifier for this correlation rule (unique within the
    /// parent DiagnosisRule's scope). Used for sysfs control, logging, and
    /// cooldown tracking (the FMA engine stores per-rule cooldown state
    /// keyed by `(rule_id, DeviceId)`).
    pub rule_id: ArrayString<32>,

    /// Temporal window in nanoseconds within which correlated events must
    /// occur. Events older than `window_ns` are expired from the sliding
    /// window. A value of 0 means "instantaneous" — all events must be
    /// present in the current evaluation cycle (no history).
    pub window_ns: u64,

    /// Minimum number of matching events (summed across all devices in
    /// the correlation group) required within `window_ns` before the rule
    /// fires. This is an aggregate count — individual devices may
    /// contribute one or more events each.
    pub min_events: u32,

    /// Minimum number of distinct devices that must have contributed at
    /// least one event within `window_ns`. Ensures failures are spatially
    /// distributed, not repeated errors on a single device.
    pub min_devices: u32,

    /// Property matchers that define the correlation group. All matchers
    /// must be satisfied (AND semantics) — a device is included in the
    /// group only if it shares every listed property with the originating
    /// device. Example: `[("memory_controller", "mc0"), ("numa_node", "0")]`
    /// matches DIMMs on memory controller mc0 in NUMA node 0.
    pub properties: ArrayVec<PropertyMatcher, 4>,
}

/// A single property matcher for correlation grouping.
pub struct PropertyMatcher {
    /// Property name to match (e.g., "memory_controller", "pcie_root",
    /// "numa_node", "usb_hub"). Resolved by the bus-specific
    /// `DiagPropertyResolver` registered for this property name.
    pub property_name: ArrayString<32>,

    /// Expected value, or empty to mean "same as the originating device."
    /// When empty, the resolver determines the originating device's value
    /// for this property and matches all peer devices that share it.
    /// When set, only devices whose property value equals this string
    /// are included (e.g., "0" for a specific NUMA node).
    pub expected_value: ArrayString<32>,
}
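The empty-versus-set `expected_value` semantics reduce to a single predicate per matcher, with a device joining the group only if every matcher accepts it (AND semantics). A sketch with a hypothetical `matcher_includes` helper:

```rust
/// Decide whether a peer device satisfies one PropertyMatcher.
/// `origin_value` is the originating device's value for this property,
/// `peer_value` the candidate's, and `expected_value` the matcher's
/// configured string (empty = "same as the originating device").
fn matcher_includes(origin_value: &str, peer_value: &str, expected_value: &str) -> bool {
    if expected_value.is_empty() {
        peer_value == origin_value // implicit "same as origin" matching
    } else {
        peer_value == expected_value // explicit value matching
    }
}
```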

/// Resolves a shared-property name to the set of devices that share that
/// property with a given device.  One resolver is registered per bus type
/// at subsystem init (e.g., `PciePropertyResolver`, `UsbPropertyResolver`,
/// `PlatformPropertyResolver`).  The diagnosis engine calls all registered
/// resolvers when evaluating a `CorrelationRule`.
pub trait DiagPropertyResolver {
    /// Return all devices that share `property` with `device`.
    /// For example, `PciePropertyResolver::resolve("pcie_root", dev)` returns
    /// every device on the same PCIe root complex as `dev`.
    ///
    /// Returns at most 64 device IDs (bounded by `ArrayVec` capacity).
    /// If more than 64 devices share the property, only the first 64 are
    /// returned — correlation rules should set `min_devices` well below 64.
    fn resolve(&self, property: &str, device: DeviceId) -> ArrayVec<DeviceId, 64>;
}

/// Registration table for property resolvers.  Bus subsystems call
/// `diag_register_resolver()` during their init sequence (e.g., PCIe
/// init registers `PciePropertyResolver` which handles `"pcie_root"`,
/// `"pcie_switch"`, and `"pcie_slot"` properties; USB init registers
/// `UsbPropertyResolver` which handles `"usb_controller"` and
/// `"usb_hub"`).  The table is RCU-protected for lock-free read access
/// from the diagnosis evaluation hot path.
pub fn diag_register_resolver(resolver: &'static dyn DiagPropertyResolver) -> Result<(), FmaError>;
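The `window_ns` / `min_events` / `min_devices` logic of `CorrelationRule` can be sketched host-side. Device resolution via the property matchers is assumed to have already produced the event list; the `correlation_fires` helper is illustrative:

```rust
/// Evaluate a correlation rule over a sliding window.
/// `events` holds (timestamp_ns, device_id) pairs from all devices in
/// the correlation group.
fn correlation_fires(
    events: &[(u64, u32)],
    now_ns: u64,
    window_ns: u64,
    min_events: u32,
    min_devices: u32,
) -> bool {
    // Expire events older than the window. window_ns == 0 degenerates to
    // "current evaluation cycle only" (only timestamps equal to now pass).
    let recent: Vec<u32> = events
        .iter()
        .filter(|&&(ts, _)| now_ns.saturating_sub(ts) <= window_ns)
        .map(|&(_, dev)| dev)
        .collect();
    // Aggregate count across the whole group.
    if (recent.len() as u32) < min_events {
        return false;
    }
    // Spatial distribution: count distinct contributing devices.
    let mut devs = recent;
    devs.sort_unstable();
    devs.dedup();
    devs.len() as u32 >= min_devices
}
```

The `min_devices` check is what separates a correlated fault (e.g., several DIMMs behind one failing memory controller) from one noisy device repeating the same error.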

Default rules (built-in, administrator can override via /sys):

| Rule | Class | Threshold | Window | Action |
|------|-------|-----------|--------|--------|
| DIMM degradation | Memory | 100 CE | 1 hour | Alert + RetirePages |
| DIMM failure | Memory | 1 UE | instant | DisableDevice + Alert |
| NVMe wear out | Storage | life < 10% | n/a | Alert |
| NVMe critical wear | Storage | life < 3% | n/a | MarkDegraded + Alert |
| PCIe link unstable | PCIe | 10 retrains | 1 minute | Alert |
| PCIe link failing | PCIe | 50 retrains | 1 minute | DemoteTier + Alert |
| NIC error storm | Network | 1000 CRC errors | 1 minute | Alert |
| Thermal throttling | Thermal | 5 events | 10 minutes | Alert |
| PCIe proactive swap | PCIe | 30 retrains | 10 minutes | TriggerEvolution + Alert |
| NVMe proactive swap | Storage | life < 5% | n/a | TriggerEvolution + Alert |

Evaluation algorithm: When a new health event arrives, the diagnosis engine evaluates rules synchronously on the health telemetry workqueue (not in interrupt context). Evaluation proceeds as follows:

  1. Filter: Select rules where event_class matches the incoming event and min_severity <= event.severity. This reduces the candidate set from all registered rules (~20-50 default + admin-added) to typically 1-5 rules. Filtering is O(n) in the total rule count; with <100 rules, this is sub-microsecond.

  2. Evaluate thresholds (for each candidate rule):
     - Count threshold: Increment the per-rule event counter (stored in a per-device RuleState array). If the counter exceeds count_threshold within window_seconds, the rule fires. Stale events are expired lazily using a sliding window (circular buffer of timestamps).
     - Rate threshold (if set): Compute the event rate over the window. Fire if the rate exceeds rate_threshold events/sec.
     - Value threshold (if set): Read the field via DiagFieldAccessor and compare against the threshold. Fire immediately on breach.

  3. Correlation check (if correlation is set): Query the per-device counters for all devices sharing shared_property. Fire only if min_devices distinct devices have independently breached their thresholds within the window.

  4. Execute actions: Rules are evaluated in descending priority order (see compiled rule evaluation below). Evaluation stops after the first matching rule fires, unless the rule sets a CONTINUE flag — in that case, the rule's action executes and evaluation proceeds to the next lower-priority rule. This first-match-stops semantic prevents redundant actions on the same event while allowing explicit multi-action chains via CONTINUE. Actions that modify device state (DemoteTier, DisableDevice) are idempotent.
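The count-threshold step can be sketched as a timestamp ring with lazy expiry. This is an illustrative userspace sketch, not the kernel code: `WINDOW_CAP` and the firing comparison (`>=`, matching the DSL's `COUNT >= N`) are assumptions here; the in-kernel version lives in the per-device RuleState array.

```rust
/// Sliding-window counter: fires when `count_threshold` or more events
/// land within `window_ns`. Stale timestamps are expired lazily on insert.
const WINDOW_CAP: usize = 128; // illustrative bound, not from the spec

pub struct SlidingWindow {
    timestamps: [u64; WINDOW_CAP], // ring of event timestamps (ns)
    head: usize,                   // index of oldest live entry
    len: usize,                    // number of live entries
}

impl SlidingWindow {
    pub fn new() -> Self {
        SlidingWindow { timestamps: [0; WINDOW_CAP], head: 0, len: 0 }
    }

    /// Record one event at `now_ns`; return true if the rule fires.
    pub fn record(&mut self, now_ns: u64, window_ns: u64, count_threshold: usize) -> bool {
        // Lazily expire entries that fell out of the window.
        while self.len > 0 {
            let oldest = self.timestamps[self.head];
            if now_ns.saturating_sub(oldest) > window_ns {
                self.head = (self.head + 1) % WINDOW_CAP;
                self.len -= 1;
            } else {
                break;
            }
        }
        // Insert the new event (overwrite the oldest entry if full).
        let tail = (self.head + self.len) % WINDOW_CAP;
        self.timestamps[tail] = now_ns;
        if self.len < WINDOW_CAP {
            self.len += 1;
        } else {
            self.head = (self.head + 1) % WINDOW_CAP;
        }
        self.len >= count_threshold
    }
}
```

Because expiry happens on the insert path, an idle device costs nothing; the window self-cleans the next time an event arrives.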

20.1.5.1 FMA Diagnosis Rule Specification

Rule Language — a small declarative DSL compiled to a rule-table at module registration time:

// Syntax: each rule is a WHEN...THEN expression.
// Registers the rule under the given rule_id at the given confidence (0-100).

RULE <rule_id>:
    WHEN
        EVENT(<source>) [.field <op> <value>]   // match condition
        [AND EVENT(...) ...]                    // multiple conditions (all must fire within window_ms)
        [WITHIN <window_ms> ms]                 // temporal window (default: no window = single event)
        [COUNT >= <n>]                          // rate threshold: N events in window
    THEN
        <ACTION>(<args>)
        [CONFIDENCE <N>]                        // confidence level 0-100 (in rule body, per BNF)

// Field operators: ==, !=, >, >=, <, <=, CONTAINS, MATCHES (regex, max 64 chars)
// Actions: SUSPECT(<fru>), ISOLATE(<dev>), RELOAD(<dev>), ALERT(<severity>, <msg>),
//          THROTTLE(<dev>, <limit>), RETIRE(<resource>), REPLACE_PREFERRED(<dev>),
//          FMA_TICKET(<severity>, <msg>), CALL(<handler>)
// Severity levels: INFO, WARNING, CRITICAL, FATAL
// FRU: device_id string, e.g., "nvme0", "cpu3", "mem_channel:0"

Example rules:

RULE nvme_timeout_suspect:
    WHEN EVENT(nvme_io_timeout) COUNT >= 3 WITHIN 60000 ms
    THEN SUSPECT("nvme0")
    CONFIDENCE 80

RULE nvme_confirmed_fault:
    WHEN EVENT(nvme_hw_error) AND EVENT(nvme_timeout)
         WITHIN 5000 ms
    THEN ISOLATE(SELF), ALERT(CRITICAL, "NVMe drive fault detected")
    CONFIDENCE 95

RULE mem_corrected_error:
    WHEN EVENT(edac_ce) COUNT >= 100 WITHIN 3600000 ms
    THEN SUSPECT("mem_channel:0"), ALERT(WARNING, "High CE rate on DIMM")
    CONFIDENCE 60

RULE mem_uncorrected_fault:
    WHEN EVENT(edac_ue)
    THEN RETIRE("mem_page:{page_addr}"), ALERT(CRITICAL, "UE: retiring page")
    CONFIDENCE 99

20.1.5.1.1 FMA Rule DSL — Formal Grammar

Quick reference — keywords and their roles at a glance:

| Keyword | Role | Example |
|---------|------|---------|
| RULE | Names a rule | RULE nvme_fault: |
| WHEN | Condition expression | WHEN EVENT(nvme_hw_error) |
| AND, OR, NOT | Boolean operators | WHEN A AND NOT B |
| EVENT(type) | Match a specific event | EVENT(edac_ue) |
| COUNT >= N WITHIN T | Threshold over time window | COUNT >= 3 WITHIN 60000 ms |
| THEN | Action list | THEN SUSPECT("nvme0") |
| ALERT(sev, msg) | Raise an FMA alert | ALERT(CRITICAL, "fault") |
| ISOLATE(dev) | Remove device from service | ISOLATE(SELF) |
| RELOAD(dev) | Trigger Tier 1 driver reload | RELOAD(SELF) |
| SUSPECT(dev) | Mark device suspect | SUSPECT("nvme0") |
| RETIRE(resource) | Permanently remove resource | RETIRE("mem_page:0x...") |
| PRIORITY N | Rule evaluation order (higher = first) | PRIORITY 10 |
| COOLDOWN T | Minimum interval between firings | COOLDOWN 300000 ms |

The FMA rule language uses a WHEN/THEN syntax for expressing diagnosis rules. The following BNF grammar is the authoritative specification; the rule compiler accepts exactly this grammar (no extensions, no shortcuts).

rule         ::= "RULE" rule_id ":" rule_body
rule_id      ::= identifier

rule_body    ::= "WHEN" condition_expr "THEN" action_list
                 ["CONFIDENCE" confidence_level]
                 ["PRIORITY" priority_level]
                 ["COOLDOWN" duration]

condition_expr ::= condition_term (("AND" | "OR") condition_term)*
                 | "NOT" condition_expr
                 | "(" condition_expr ")"

condition_term ::=
    event_match
  | metric_comparison
  | state_predicate
  | history_match

event_match  ::= "EVENT" "(" event_type ["," attribute_filter] ")"
event_type   ::= identifier  -- e.g., UncorrectedEcc, NvmeCrcError, TierCrash

attribute_filter ::= attribute_name "=" literal_value
                   | attribute_filter "," attribute_filter
attribute_name   ::= identifier "." identifier  -- e.g., device.pci_addr

metric_comparison ::=
    metric_path comparison_op numeric_literal [unit_suffix]
  | metric_path "IN" "[" numeric_literal ".." numeric_literal "]"

metric_path  ::= identifier ("." identifier)*  -- e.g., cpu.temp_celsius
comparison_op ::= ">" | ">=" | "<" | "<=" | "==" | "!="
unit_suffix  ::= "%" | "ms" | "us" | "ns" | "MB" | "GB" | "C"

state_predicate ::=
    "STATE" "(" device_ref ")" "==" device_state
  | "HEALTHY" "(" device_ref ")"
  | "DEGRADED" "(" device_ref ")"
  | "ERROR" "(" device_ref ")"

device_ref   ::= identifier | "SELF" | "PEER" "(" identifier ")"
device_state ::= "Active" | "Degraded" | "Error" | "Missing" | "Suspect"

history_match ::=
    "HISTORY" "(" event_type "," count_expr "," duration ")"

count_expr   ::= numeric_literal ("+" | "-" | "*") numeric_literal
               | numeric_literal

action_list  ::= action ("," action)*

action       ::=
    "ALERT" "(" severity "," string_literal ")"
  | "ISOLATE" "(" device_ref ")"
  | "RELOAD" "(" device_ref ")"
  | "THROTTLE" "(" device_ref "," numeric_literal unit_suffix ")"
  | "REPLACE_PREFERRED" "(" device_ref ")"
  | "SUSPECT" "(" device_ref ")"
  | "RETIRE" "(" resource_ref ")"
  | "FMA_TICKET" "(" severity "," string_literal ")"
  | "CALL" "(" handler_name ["," argument_list] ")"

severity     ::= "INFO" | "WARNING" | "CRITICAL" | "FATAL"
confidence_level ::= numeric_literal  -- 0 (no confidence) to 100 (certain)
priority_level ::= numeric_literal  -- 0 (lowest) to 255 (highest)
duration     ::= numeric_literal ("s" | "ms" | "m" | "h")

literal_value ::= numeric_literal | string_literal | boolean_literal
numeric_literal ::= ["-"] digit+ ["." digit+]
string_literal ::= '"' character* '"'
boolean_literal ::= "true" | "false"
identifier   ::= [a-zA-Z_] [a-zA-Z0-9_]*
digit        ::= [0-9]
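As a worked example of the terminal productions, the `identifier` rule translates directly into a character-class check. This is an illustrative sketch, not the actual rule compiler's lexer:

```rust
/// Check a token against the grammar's `identifier` production:
/// identifier ::= [a-zA-Z_] [a-zA-Z0-9_]*
pub fn is_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    // First character: ASCII letter or underscore (required, so empty fails).
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false,
    }
    // Remaining characters: ASCII letter, digit, or underscore.
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}
```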

Extended example rules (illustrating grammar usage):

RULE ecc_threshold_exceeded:
WHEN EVENT(UncorrectedEcc, device.pci_addr = "0000:01:00.0")
  AND HISTORY(UncorrectedEcc, 3, 1h)
THEN ALERT(CRITICAL, "ECC errors exceed threshold; memory DIMM likely failing"),
     FMA_TICKET(CRITICAL, "Schedule DIMM replacement on host {}"),
     REPLACE_PREFERRED(SELF)
PRIORITY 200
COOLDOWN 24h

RULE nvme_crc_degraded:
WHEN EVENT(NvmeCrcError) AND io.error_rate > 5%
THEN ALERT(WARNING, "NVMe CRC error rate elevated"),
     THROTTLE(SELF, 50%)
PRIORITY 100
COOLDOWN 5m

Compiled rule types — The DSL compiler produces the following structures at rule-module registration time:

/// Rule identifier: the DSL rule name from `RULE name:` declarations.
/// Non-integer type → BTreeMap key is justified per collection policy.
pub type RuleId = ArrayString<64>;

/// A single compiled FMA rule — the runtime representation of one RULE...WHEN...THEN block.
pub struct CompiledRule {
    /// Rule name from the DSL `RULE name:` declaration. Used as the
    /// cooldown map key and in FMA event log messages.
    pub name: RuleId,
    /// Compiled condition tree (evaluated against incoming events).
    pub condition: CompiledCondition,
    /// Action to execute when the condition fires.
    pub action: DiagAction,
    /// Rule priority (0 = lowest, 255 = highest). Controls evaluation order.
    pub priority: u8,
}

/// Condition tree produced by the DSL compiler. Leaf nodes are threshold
/// checks; interior nodes are boolean combinators (AND/OR, max 4 children
/// per node to bound stack depth during recursive evaluation).
pub enum CompiledCondition {
    /// Field value exceeds a fixed threshold.
    ThresholdAbove { field: &'static str, threshold: i64 },
    /// Field value falls below a fixed threshold.
    ThresholdBelow { field: &'static str, threshold: i64 },
    /// Field value changes faster than the allowed rate (delta/sec).
    RateOfChange { field: &'static str, max_delta_per_sec: i64 },
    /// All child conditions must be satisfied (short-circuit evaluation).
    And(ArrayVec<CompiledCondition, 4>),
    /// At least one child condition must be satisfied (short-circuit evaluation).
    Or(ArrayVec<CompiledCondition, 4>),
}

/// Maximum rules per module (matches FmaRuleModule.max_rules).
pub const MAX_FMA_RULES: usize = 256;

/// Flat table of compiled rules for one module.  Stored in an RCU cell
/// so the FMA evaluation hot path reads the table lock-free.
pub type FmaRuleTable = ArrayVec<CompiledRule, MAX_FMA_RULES>;
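Evaluating this tree is a short-circuiting recursive walk. The sketch below is a simplified userspace rendition: `Vec` stands in for `ArrayVec<_, 4>`, a plain closure stands in for `DiagFieldAccessor`, and the `RateOfChange` leaf is omitted.

```rust
/// Simplified stand-in for the kernel's `CompiledCondition`.
pub enum Cond {
    ThresholdAbove { field: &'static str, threshold: i64 },
    ThresholdBelow { field: &'static str, threshold: i64 },
    And(Vec<Cond>),
    Or(Vec<Cond>),
}

/// Recursive evaluation with short-circuiting, mirroring the bounded
/// (max 4 children per node) tree walk described above.
pub fn eval(cond: &Cond, read_field: &dyn Fn(&str) -> i64) -> bool {
    match cond {
        Cond::ThresholdAbove { field, threshold } => read_field(field) > *threshold,
        Cond::ThresholdBelow { field, threshold } => read_field(field) < *threshold,
        // `all`/`any` short-circuit exactly like the kernel evaluator.
        Cond::And(children) => children.iter().all(|c| eval(c, read_field)),
        Cond::Or(children) => children.iter().any(|c| eval(c, read_field)),
    }
}
```

The bounded fan-out (4 children per node) keeps recursion depth small; in the kernel that bound is what makes stack usage on the evaluation hot path predictable.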

Rule evaluation: Rules are stored in an RcuCell<FmaRuleTable> — a flat array pre-sorted by descending priority at registration time. Evaluation is a linear scan with early exit on first match: higher-priority rules are checked first; evaluation stops after the first matching rule fires (unless the rule sets a CONTINUE flag). Worst case is O(MAX_FMA_RULES) but typical is O(1–10) due to early exit. Cooldown is enforced per rule per device via a BTreeMap<(RuleId, DeviceId), u64> (value = last-fired timestamp in nanoseconds from ktime_get_ns(); Instant is not available in #![no_std]). The BTreeMap key is a composite non-integer type, so BTreeMap is the correct collection per policy.
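The per-rule, per-device cooldown check described above might look like the following sketch, with `String` standing in for `ArrayString<64>` and a caller-supplied timestamp standing in for `ktime_get_ns()`:

```rust
use std::collections::BTreeMap;

type RuleId = String; // kernel uses ArrayString<64>
type DeviceId = u64;

/// Per-(rule, device) cooldown tracker. `try_fire` returns true (and
/// records the timestamp) only if `cooldown_ns` has elapsed since the
/// last firing of this rule on this device.
pub struct CooldownMap {
    last_fired: BTreeMap<(RuleId, DeviceId), u64>,
}

impl CooldownMap {
    pub fn new() -> Self {
        CooldownMap { last_fired: BTreeMap::new() }
    }

    pub fn try_fire(&mut self, rule: &str, dev: DeviceId,
                    now_ns: u64, cooldown_ns: u64) -> bool {
        let key = (rule.to_string(), dev);
        match self.last_fired.get(&key) {
            // Still cooling down: suppress the firing.
            Some(&t) if now_ns.saturating_sub(t) < cooldown_ns => false,
            // First firing, or cooldown elapsed: record and fire.
            _ => {
                self.last_fired.insert(key, now_ns);
                true
            }
        }
    }
}
```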

Registration API:

/// Register a FMA diagnosis rule module.
pub fn fma_rule_register(module: &FmaRuleModule) -> Result<FmaRuleHandle, FmaError>;

pub struct FmaRuleModule {
    /// Module name (unique; max 64 chars).
    pub name: &'static str,
    /// Rule text (DSL source; compiled at registration time).
    pub rules: &'static str,
    /// Maximum rules this module is expected to produce. Registration fails
    /// with `FmaError::TooManyRules` if DSL compilation exceeds this limit.
    /// Must be <= MAX_FMA_RULES (256).
    pub max_rules: u32,
}

Matching algorithm: Rules are compiled to a flat table (FmaRuleTable). On each event receipt:

  1. Filter rules by source (hash map, O(1) per event).
  2. For single-event rules: check field conditions; if all pass, trigger the action.
  3. For multi-event rules: insert the event into a per-rule sliding window (ring buffer, size = COUNT threshold). If all conditions are satisfied within the temporal window, trigger the action.
  4. Rule evaluation is lock-free (RCU-protected rule table; per-CPU event queues drained by the FMA processing thread).

Action vocabulary:

  • SUSPECT(fru): Mark the named FRU as suspected in the fault case.
  • RETIRE(resource): Offline and retire the resource (persistent, survives reboot).
  • ISOLATE(dev): Remove the device from service (offline the device, stop its driver).
  • RELOAD(dev): Trigger Tier 1 driver reload for the device. Implementation: fma_action_reload() in Section 11.9.
  • ALERT(severity, msg): Emit an FMA fault event to umkafs /Health/faults/.
  • THROTTLE(dev, limit): Reduce device throughput to the specified limit.
  • REPLACE_PREFERRED(dev): Trigger live evolution to proactively replace a degrading component.
  • FMA_TICKET(severity, msg): Create a persistent fault ticket for operator action.
  • CALL(handler, args...): Invoke a named handler registered by the driver (e.g., "reset_link", "throttle_frequency").

20.1.6 Response Executor

Actions integrate with existing kernel subsystems:

RetirePages: The memory manager (Section 4.1) is asked to remove specific physical pages from the buddy allocator. Any process mapping those pages is transparently migrated to a replacement page (copy-on-read). This is how Linux handles hardware-poisoned pages (memory_failure()), but triggered proactively.

RetirePages partial failure: If offline_page() fails for some pages in a RetirePages batch (e.g., page is locked, page is kernel text): the successful pages remain offline, the failed pages are re-queued for retry after 60 seconds. After 3 failed retries, the page is marked PG_HWPOISON and an FMA event RetirePageFailed is emitted.
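The retry policy for a failed page reduces to a small state-step function. The names below (`PageFate`, `retire_step`) are illustrative, not the kernel's API; the 60-second re-queue delay and the RetirePageFailed emission are elided.

```rust
/// Outcome of one retire attempt for a single page. `offline_ok` is the
/// result of the (elided) `offline_page()` call; `retries` is how many
/// retries have already failed for this page.
pub enum PageFate {
    Retired,                    // page successfully offlined
    Requeued { retries: u8 },   // retry again after the 60 s delay
    HwPoisoned,                 // 3 retries exhausted: mark PG_HWPOISON
}

/// Apply the policy above: successful pages stay offline, failed pages
/// are re-queued up to 3 times, then marked poisoned.
pub fn retire_step(offline_ok: bool, retries: u8) -> PageFate {
    if offline_ok {
        PageFate::Retired
    } else if retries < 3 {
        PageFate::Requeued { retries: retries + 1 }
    } else {
        PageFate::HwPoisoned
    }
}
```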

ReloadDriver: The response executor calls fma_action_reload() (Section 11.9) which invokes the crash recovery subsystem's driver reload sequence (steps 6-8). This action is the mechanism that closes the FMA-to-recovery loop: DriverCrash events flow from the crash recovery subsystem into FMA (step 3a of the recovery sequence), and FMA's RELOAD action flows back to trigger the actual reload. The target tier is determined by the crash recovery subsystem's auto-demotion logic (based on crash_count), not by FMA — FMA decides whether to reload, the crash recovery subsystem decides at which tier.

DemoteTier: The device registry (Section 11.4) transitions the device's driver to a lower isolation tier. This uses the existing crash-recovery reload mechanism but without a crash — clean stop, change tier, restart.

DisableDevice: The registry transitions the device to Error state. The driver is stopped. The device node remains in the tree (for introspection) but accepts no I/O.

TriggerEvolution: The response executor invokes the Live Evolution framework (Section 13.18) to proactively hot-swap a degrading component before it fails. Example: FMA detects a pattern of increasing PCIe correctable errors on a bus serving a Tier 1 NIC driver. The diagnosis engine fires, and instead of waiting for a hard failure, the response executor triggers Section 13.18's component replacement flow to swap the NIC driver to a degraded-mode variant (e.g., conservative I/O scheduler, reduced-bandwidth path). The replacement follows the same state serialization and quiescence protocol as any live evolution (Section 13.18), but is initiated automatically by FMA rather than by an administrator. This closes the gap between reactive fault handling and proactive system evolution — the kernel treats impending hardware failure as a trigger for self-repair rather than merely an alert.

FMA Action Execution Order:

When FMA triggers on a fault event, registered actions execute in priority order (not registration order). Higher-priority actions run first and can abort the chain on failure.

Priority levels (highest to lowest):

| Priority | Action Type | Failure behavior | Description |
|----------|-------------|------------------|-------------|
| 1 | DisableDevice | Abort chain | Remove device from service immediately. Non-optional — isolation is the safety guarantee. If isolation fails, further actions are meaningless (device state unknown). |
| 2 | DemoteTier | Abort if non-best_effort | Reduce device bandwidth/request rate by moving it to a lower isolation tier. May be marked best_effort for degraded-but-operational scenarios. |
| 3 | Alert / MarkDegraded / TriggerEvolution | Continue (always best_effort) | Send notification to registered listeners (/dev/oom, syslog, SNMP trap) or initiate proactive replacement. Failure is logged, not fatal. |
| 4 | RetirePages / FMA ring write | Continue (always best_effort) | Append to FMA event ring and persistent log, retire affected pages. Failure increments fma_dropped_events counter. |

Execution rules:

  1. Actions execute in a dedicated kworker thread (not in interrupt context). Interrupt handler only enqueues the FaultEvent — all action execution is deferred to process context.
  2. Timeout per action: 5 seconds. An action that does not complete in 5 seconds is treated as a failure (same as returning an error).
  3. A non-best_effort action failure aborts the chain at that point — lower-priority actions do not run.
  4. Multiple actions of the same type (e.g., two Alert actions): execute in registration order within the same priority level.
  5. After DisableDevice: the device is removed from the I/O path before DemoteTier runs. This ordering prevents DemoteTier from being applied to an already-disabled device (which would be a no-op or error).
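The execution rules above (priority ordering, abort on non-best_effort failure, registration order within a priority level) can be sketched as a single loop. The types here are illustrative; the per-action 5-second timeout and kworker deferral are elided.

```rust
#[derive(Clone, Copy, PartialEq)]
pub enum Outcome { Ok, Failed }

pub struct ChainAction {
    pub priority: u8,       // 1 = highest (runs first), per the table above
    pub best_effort: bool,  // failure is logged but does not abort the chain
    pub run: fn() -> Outcome,
}

/// Execute actions in ascending priority number (1 before 4). A failing
/// non-best_effort action aborts the chain; best_effort failures continue.
/// Returns the number of actions that actually ran.
pub fn run_chain(actions: &mut [ChainAction]) -> usize {
    // Stable sort preserves registration order within a priority level (rule 4).
    actions.sort_by_key(|a| a.priority);
    let mut ran = 0;
    for a in actions.iter() {
        ran += 1;
        if (a.run)() == Outcome::Failed && !a.best_effort {
            break; // rule 3: lower-priority actions do not run
        }
    }
    ran
}
```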

FMA/PMU kthread ordering: FMA kthread runs at SCHED_OTHER priority 0 (nice 10). PMU sampler kthread runs under CBS guarantee (5% CPU budget). To prevent recursive feedback: (1) PMU sampler does NOT emit FMA events — it writes directly to the perf ring buffer. (2) FMA tracepoints (fma_event_report, fma_action_taken) are exempt from PMU sampling — the PMU subsystem maintains an exclusion list for self-referential tracepoints.

Design note: UmkaOS's priority-ordered action chain is a deliberate improvement over Linux EDAC, which only logs faults. UmkaOS actively isolates faulty hardware before it can cause downstream data corruption, implementing a "fail-safe first" principle.

Integration with the health-score model: The four-level action-priority chain above handles rule-fired events (binary triggers). The health-score and escalation model (Section 20.1) provides a complementary continuous signal: as the device health score degrades through the Throttle/Offline/Panic thresholds, each escalation level maps to the corresponding action priority (Throttle ≈ Priority 2, Offline ≈ Priority 1, Panic = kernel panic). Both paths feed the same response executor.

20.1.6.1 Coordination Between Rule-Fired Actions and Health Score Escalation

The two escalation models (rule-fired binary actions and continuous health score) are unified through a single DeviceEscalationState per device. Neither model operates independently — they are inputs to the same state machine.

Rule-fired actions drive health score transitions: When the diagnosis engine fires a rule-based action, the response executor applies a corresponding health score penalty before executing the action. This ensures the health score always reflects the device's true state, even when the triggering event was a binary rule match rather than a severity-weighted health event:

| Rule-Fired Action | Health Score Effect | Resulting Escalation Level |
|-------------------|---------------------|----------------------------|
| ALERT(Warning, ...) | −5 (same as Warning severity event) | May cross into Throttle (<70) |
| THROTTLE(dev, limit) | −15 (same as Degraded severity event) | Throttle (if not already) |
| ISOLATE(dev) | −40 (same as Critical severity event) | Offline |
| RELOAD(dev) | −15 (Degraded-equivalent; driver reload is a soft recovery) | May cross into Throttle |
| RETIRE(resource) | −5 per retired page (capped at −40 total) | Depends on accumulation |
| REPLACE_PREFERRED(dev) | −15 (proactive replacement implies degradation) | Throttle |

Health score thresholds trigger rule-fired actions: When the health score crosses a threshold due to accumulated severity-weighted events (not due to a rule-fired action), the escalation state machine generates a synthetic action:

| Health Score Transition | Synthetic Action Generated |
|-------------------------|----------------------------|
| H drops below 70 (enter Throttle) | THROTTLE(dev, permille) with permille computed from H |
| H drops below 20 (enter Offline) | ISOLATE(dev) — device taken offline |
| H reaches 0 (enter Panic, critical device) | Kernel panic with FMA diagnostic |
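Both mappings reduce to a saturating score update plus a threshold lookup. A minimal sketch, assuming the score is clamped to 0-100 and using the thresholds above (<70 Throttle, <20 Offline, 0 Panic):

```rust
/// Escalation levels driven by the continuous health score.
#[derive(Debug, PartialEq)]
pub enum Escalation { Healthy, Throttle, Offline, Panic }

/// Map a health score (0-100) to its escalation level.
pub fn escalation(score: u8) -> Escalation {
    match score {
        0 => Escalation::Panic,
        1..=19 => Escalation::Offline,
        20..=69 => Escalation::Throttle,
        _ => Escalation::Healthy,
    }
}

/// Apply a rule-fired action's penalty (e.g., ISOLATE = 40, THROTTLE = 15),
/// saturating at zero, and return the resulting escalation level.
pub fn apply_penalty(score: &mut u8, penalty: u8) -> Escalation {
    *score = score.saturating_sub(penalty);
    escalation(*score)
}
```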

Feedback loop prevention: To avoid cascading escalation (rule fires → score drops → threshold crosses → synthetic action fires → score drops further), the following guard is enforced:

  • Health score penalties from synthetic actions (generated by threshold crossings) are NOT applied. Only the original severity-weighted event that caused the threshold crossing contributes to the score. The synthetic action is purely an effect, not a new input.
  • Health score penalties from explicit rule-fired actions (fired by diagnosis engine rules) ARE applied, because the rule match represents new diagnostic information that the health score should reflect.

External health score adjustments: Monitoring agents (e.g., IPMI BMC, out-of-band management) can inject health events via the FMA telemetry API (fma_report_event() with HealthSeverity and device context). These events are processed identically to hardware-reported events: the severity penalty is applied to the health score, and threshold crossings trigger the same escalation state machine. There is no separate "external" escalation path — external inputs are first-class health events.

20.1.7 Linux Interface Exposure

Entirely through standard mechanisms:

sysfs (per-device, under the device registry's sysfs tree):

/sys/devices/.../health/
    status          # "ok", "warning", "degraded", "critical"
    events_total    # Total health events received
    ce_count        # Correctable error count (memory)
    ue_count        # Uncorrectable error count (memory)
    life_remaining  # Percentage (storage)
    link_retrains   # Count (PCIe)
    temperature     # Current (thermal)

procfs:

/proc/umka/fma/
    rules           # Current diagnosis rules (read/write for admin)
    events          # Recent events across all devices (ring buffer dump)
    retired_pages   # List of retired physical pages
    statistics      # Aggregate counters

uevent: Standard hotplug mechanism for pushing notifications to userspace.

ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:1f.2
SUBSYSTEM=pci
UMKA_HEALTH=degraded
UMKA_HEALTH_REASON=pcie_link_unstable

dmesg/printk: Standard kernel log for all alerts.

20.1.8 FaultEvent — Structured FMA Event Type

FaultEvent is the typed event enum passed to fma_emit(). Each variant carries structured fields so consumers (FMA rules engine, userspace, tracepoints) can react without parsing text messages.

/// Typed FMA event — passed to `fma_emit()` to record a hardware fault.
/// Stored in the per-device FMA ring (Section 20.1.2, 256-entry lock-free MPSC).
pub enum FaultEvent {
    /// ECC corrected (single-bit) memory error.
    MemoryCe {
        mci_idx: u32,       // Memory controller index
        csrow:   u32,       // CSROW (chip-select row)
        channel: u32,       // DRAM channel
        page:    u64,       // Physical page number
        offset:  u64,       // Offset within page
        syndrome: u32,      // ECC syndrome (if available, else 0)
    },
    /// ECC uncorrectable (multi-bit) memory error.
    MemoryUe {
        mci_idx: u32,
        csrow:   u32,
        channel: u32,
        page:    u64,
        /// True if the error is uncorrectable and the affected memory range must be
        /// retired. Fatal UEs trigger device offline and optionally kernel panic
        /// per the FMA escalation policy (Section 20.1.9).
        fatal:   bool,
        /// Physical byte offset within page where UE occurred.
        /// Set to `u64::MAX` if the hardware does not report a sub-page address.
        offset:  u64,
    },
    /// PCIe AER correctable error.
    PcieAerCe {
        bus_dev_fn: u32,    // PCI BDF encoding
        status:     u32,    // AER correctable status bits
    },
    /// PCIe AER uncorrectable error.
    PcieAerUe {
        bus_dev_fn: u32,
        status:     u32,
        severity:   u32,    // Fatal vs. non-fatal
    },
    /// Storage SMART health threshold crossed.
    StorageSmart {
        device_id:  u64,    // FMA device handle
        attr_id:    u8,     // SMART attribute number
        value:      u8,     // Normalized value
        threshold:  u8,     // Failure threshold
    },
    /// Thermal throttling event.
    Thermal {
        cpu_or_device: u64, // FMA device handle
        temp_milli_c:  i32, // Temperature in milli-°C
        throttle_pct:  u8,  // Throttle percentage applied
    },
    /// Generic driver-defined fault event.
    Generic {
        device_id:  u64,
        event_code: u32,
        payload:    [u8; 16],
    },
    /// Driver crash — a Tier 1 or Tier 2 driver has panicked or been killed.
    /// Emitted by the crash recovery subsystem ([Section 11.9](11-drivers.md#crash-recovery-and-state-preservation))
    /// so that FMA can track crash frequency and escalate (e.g., demote tier, disable device).
    DriverCrash {
        device:      u64,       // DeviceNodeId of the device whose driver crashed
        driver_id:   u64,       // DriverId (unique driver instance identifier)
        tier:        u8,        // IsolationTier (1 = Tier 1, 2 = Tier 2)
        crash_count: u32,       // Cumulative crash count for this driver instance
    },
    /// DMA quiescence timeout — a device did not stop DMA within the required deadline.
    /// Emitted when the 100 ms DMA quiescence poll expires during driver teardown
    /// ([Section 11.6](11-drivers.md#device-services-and-boot--dma-quiescence-and-flr)).
    DmaTimeout {
        bus_dev_fn:  u32,       // PCI BDF encoding of the offending device
        timeout_ms:  u32,       // Timeout that expired (typically 100)
    },
    /// PCIe FLR timeout — Function Level Reset did not complete within the deadline.
    /// Emitted when FLR completion polling exceeds the configured timeout (500 ms default)
    /// ([Section 11.9](11-drivers.md#crash-recovery-and-state-preservation--flr-timeout-recovery)).
    /// The device is permanently faulted after this event.
    PcieFlrTimeout {
        bus_dev_fn:  u32,       // PCI BDF encoding
        vendor_id:   u16,       // PCI vendor ID
        device_id:   u16,       // PCI device ID
        escalation:  u8,        // Escalation result: 0 = SBR succeeded, 1 = SBR failed,
                                //   2 = slot power cycle succeeded, 3 = permanent fault
    },
    /// CPU Machine Check Exception. Emitted by the MCE handler when the
    /// hardware reports a machine check error (x86 `MCi_STATUS`, ARM SEA/SEI).
    /// Carries the physical address (if available) and the bank index.
    CpuMce {
        /// Physical address of the faulting location, or `u64::MAX` if the
        /// hardware does not report a valid address (ADDRV bit not set).
        phys_addr:   u64,
        /// Machine check bank index (x86: 0..N_BANKS-1, typically 0-31).
        bank:        u8,
    },
    /// Network link flap — a NIC's link state toggled (up→down or down→up)
    /// more frequently than expected. Emitted when the FMA link-flap detector
    /// observes more than `link_flap_threshold` transitions within a
    /// configurable window (default: 5 transitions in 30 seconds).
    NetworkLinkFlap {
        /// Interface index of the NIC whose link is flapping.
        ifindex:     u32,
    },
    /// Performance degradation detected by a device health monitor.
    /// Emitted when a device's measured performance metric falls below the
    /// expected baseline by a configurable threshold (e.g., NVMe latency
    /// exceeds 2x baseline, NIC throughput drops below 50% of line rate).
    PerformanceDegraded {
        /// FMA device handle of the degraded device.
        device:      DeviceHandle,
        /// Subsystem-specific metric identifier. Interpretation depends on
        /// the device class: NVMe: 0=read_latency, 1=write_latency;
        /// NIC: 0=rx_throughput, 1=tx_throughput, 2=pps.
        metric:      u32,
    },
    /// Device reset failed — a hardware reset (FLR, SBR, or slot power cycle)
    /// did not restore the device to a functional state. Emitted after all
    /// reset escalation attempts have been exhausted. The device is marked
    /// permanently faulted and removed from service.
    DeviceResetFailed {
        /// FMA device handle of the device whose reset failed.
        device:      DeviceHandle,
        /// Type of reset that was attempted last:
        ///   0 = FLR (Function Level Reset)
        ///   1 = SBR (Secondary Bus Reset)
        ///   2 = Slot power cycle
        ///   3 = Platform-specific reset (e.g., ACPI _RST)
        reset_type:  u8,
    },
}

/// Emit a fault event to the per-device FMA ring.
/// Lock-free, NMI-safe (ring buffer + atomic tail increment).
/// May be called from interrupt or NMI context.
///
/// **Dual emission**: `fma_emit()` performs two actions at the same callsite:
/// 1. Converts `FaultEvent` to `HealthEvent` and pushes it into the per-device
///    `BoundedMpmcRing<HealthEvent, 256>` for the FMA diagnosis engine.
/// 2. Fires the appropriate `umka_tp_stable_fma_*` tracepoint
///    ([Section 20.2](#stable-tracepoint-abi--per-subsystem-fma-tracepoints)) with the full
///    structured payload, enabling perf/eBPF consumers to observe hardware faults
///    in real time without polling the FMA ring.
///
/// The tracepoint fires unconditionally (subject to the static-key NOP optimization
/// described in [Section 20.2](#stable-tracepoint-abi--zero-overhead-when-disabled)): if no eBPF
/// program is attached to the tracepoint, the fire is a NOP instruction. If a program
/// is attached, it receives the full `FaultEvent` fields as tracepoint arguments.
///
/// Both operations are lock-free and NMI-safe. The tracepoint fire is a per-CPU
/// atomic probe pointer load + indirect call (when enabled) — no additional
/// synchronization beyond the ring buffer's existing atomic tail increment.
pub fn fma_emit(event: FaultEvent);

/// Emit a panic-context fault event to the pre-allocated per-CPU crash buffer.
/// Called from the kernel panic handler. This function is NMI-safe:
///
/// - **No locks**: writes to the per-CPU crash buffer via a plain store to a
///   pre-allocated, pre-pinned memory region. No spinlocks, no mutexes.
/// - **No allocation**: the crash buffer is allocated once at boot time
///   (one `FmaCrashRecord` per CPU) and never freed.
/// - **No function pointers**: no indirect calls (tracepoints, LSM hooks)
///   that could fault if the kernel is in an inconsistent state.
///
/// The crash buffer survives for consumption by pstore (persistent storage)
/// and kdump (crash dump). On reboot, the pstore driver reads the crash
/// buffer from the reserved memory region and exposes it via
/// `/sys/fs/pstore/fma-crash-*`.
///
/// # Design
///
/// Each CPU has a single `FmaCrashRecord` slot. If multiple panics race
/// (e.g., NMI during panic), the last writer wins — this is acceptable
/// because the system is already in a terminal state.
pub fn fma_panic_event(reason: &'static str);

/// Pre-allocated per-CPU crash buffer for panic-context FMA events.
/// Allocated at boot from the early memblock allocator (before slab is
/// available). Placed in a pstore-registered memory region so contents
/// survive warm reboot.
#[repr(C, align(256))]
pub struct FmaCrashRecord {
    /// Monotonic timestamp at panic time (read from TSC/cntvct directly,
    /// no timekeeping locks).
    pub timestamp_ns: u64,
    /// CPU that panicked.
    pub cpu_id: u32,
    /// Panic reason (null-terminated, truncated to 128 bytes).
    pub reason: [u8; 128],
    /// Current task PID at panic time (read from per-CPU current_task).
    pub pid: u32,
    /// Current task comm (process name, 16 bytes matching Linux TASK_COMM_LEN).
    pub comm: [u8; 16],
    /// Stack pointer at panic entry (for offline stack unwinding by kdump).
    pub stack_ptr: u64,
    /// Instruction pointer at panic entry.
    pub instruction_ptr: u64,
    /// Effective isolation tier at crash time: 0=Tier0 (native or promoted),
    /// 1=Tier1, 2=Tier2. For promoted-T0 drivers (originally Tier 1 but
    /// running as Tier 0 due to missing hardware isolation), this is 0.
    pub effective_tier: u8,
    /// The tier declared in the driver manifest, before any promotion.
    /// For native Tier 0 drivers, this equals `effective_tier` (0).
    /// For promoted-T0 drivers (e.g., Tier 1 on RISC-V without hardware
    /// isolation), `declared_tier` is 1 while `effective_tier` is 0.
    /// This lets crash analysts distinguish native-T0 from promoted-T0.
    pub declared_tier: u8,
    /// Padding for 256-byte placement alignment (cache-line isolation
    /// between per-CPU slots, avoiding false sharing).
    pub _pad: [u8; 78],
}
// FmaCrashRecord: 8+4+128+4+16+8+8+1+1+78 = 256 bytes per CPU.
// repr(C, align(256)) ensures each per-CPU slot is on its own cache-line
// set, preventing false sharing between CPUs during panic. Persistent
// across reboot (pstore-registered memory region).
const_assert!(core::mem::size_of::<FmaCrashRecord>() == 256);

/// Per-CPU crash buffer array. Indexed by cpu_id.
/// Allocated at boot, registered with pstore as a persistent memory region.
pub static FMA_CRASH_BUFFERS: PerCpu<FmaCrashRecord> = PerCpu::zeroed();
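The `reason` field's fixed-size, allocation-free copy can be sketched in isolation. This is a simplified stand-in for the per-CPU record write; truncating to 127 bytes plus a NUL terminator is an assumption consistent with the "null-terminated, truncated to 128 bytes" doc comment above.

```rust
/// Fill the 128-byte `reason` field of a crash record: copy the panic
/// reason, truncate if needed, zero-fill the rest (which also guarantees
/// NUL termination). Plain stores only — safe to call from panic context.
pub fn fill_reason(buf: &mut [u8; 128], reason: &str) {
    let bytes = reason.as_bytes();
    let n = bytes.len().min(127); // always leave room for the NUL terminator
    buf[..n].copy_from_slice(&bytes[..n]);
    buf[n..].fill(0); // zero the remainder, including the terminator
}
```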

20.1.8.1 FaultEvent to HealthEvent Conversion

fma_emit() stores events in the per-device BoundedMpmcRing<HealthEvent, 256>, but callers pass a typed FaultEvent enum. The conversion maps each FaultEvent variant to the flat HealthEvent representation used by the ring buffer, diagnosis engine, and sysfs export. The conversion is inline and allocation-free because it may run in NMI context.

impl FaultEvent {
    /// Convert a typed `FaultEvent` into a flat `HealthEvent` suitable for
    /// the per-device ring buffer. The conversion is infallible and allocation-free.
    ///
    /// # Mapping rules
    ///
    /// - `class`: Derived from the variant discriminant (e.g., `MemoryCe` → `Memory`,
    ///   `PcieAerCe` → `Pcie`, `StorageSmart` → `Storage`).
    /// - `code`: A per-class event code uniquely identifying the variant within its class.
    ///   Codes are stable and never reused.
    /// - `severity`: Derived from the variant semantics (CE → `Corrected`, UE fatal →
    ///   `Fatal`, UE non-fatal → `Critical`, SMART threshold → `Warning`, etc.).
    /// - `data[64]`: The variant's structured fields serialized in little-endian,
    ///   packed layout. The field order matches the `FaultEvent` variant field
    ///   declaration order. Unused trailing bytes are zero-filled.
    /// - `data_len`: Actual number of valid bytes written to `data[]`.
    pub fn to_health_event(&self, timestamp_ns: u64) -> HealthEvent {
        match self {
            // --- Memory subsystem ---
            // class = Memory (0), code = 0x0001
            // data layout: [mci_idx: u32, csrow: u32, channel: u32, page: u64,
            //               offset: u64, syndrome: u32]  = 32 bytes
            FaultEvent::MemoryCe { mci_idx, csrow, channel, page, offset, syndrome } => {
                let mut data = [0u8; 64];
                data[0..4].copy_from_slice(&mci_idx.to_le_bytes());
                data[4..8].copy_from_slice(&csrow.to_le_bytes());
                data[8..12].copy_from_slice(&channel.to_le_bytes());
                data[12..20].copy_from_slice(&page.to_le_bytes());
                data[20..28].copy_from_slice(&offset.to_le_bytes());
                data[28..32].copy_from_slice(&syndrome.to_le_bytes());
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Memory,
                    code: 0x0001,
                    severity: HealthSeverity::Corrected,
                    data,
                    data_len: 32,
                }
            }
            // class = Memory (0), code = 0x0002
            // data layout: [mci_idx: u32, csrow: u32, channel: u32, page: u64,
            //               fatal: u8, _pad: [u8;3], offset: u64]  = 32 bytes
            FaultEvent::MemoryUe { mci_idx, csrow, channel, page, fatal, offset } => {
                let mut data = [0u8; 64];
                data[0..4].copy_from_slice(&mci_idx.to_le_bytes());
                data[4..8].copy_from_slice(&csrow.to_le_bytes());
                data[8..12].copy_from_slice(&channel.to_le_bytes());
                data[12..20].copy_from_slice(&page.to_le_bytes());
                data[20] = *fatal as u8;
                // bytes 21-23: padding (zero-filled)
                data[24..32].copy_from_slice(&offset.to_le_bytes());
                let severity = if *fatal {
                    HealthSeverity::Fatal
                } else {
                    HealthSeverity::Critical
                };
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Memory,
                    code: 0x0002,
                    severity,
                    data,
                    data_len: 32,
                }
            }
            // --- PCIe subsystem ---
            // class = Pcie (3), code = 0x0001
            // data layout: [bus_dev_fn: u32, status: u32]  = 8 bytes
            FaultEvent::PcieAerCe { bus_dev_fn, status } => {
                let mut data = [0u8; 64];
                data[0..4].copy_from_slice(&bus_dev_fn.to_le_bytes());
                data[4..8].copy_from_slice(&status.to_le_bytes());
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Pcie,
                    code: 0x0001,
                    severity: HealthSeverity::Corrected,
                    data,
                    data_len: 8,
                }
            }
            // class = Pcie (3), code = 0x0002
            // data layout: [bus_dev_fn: u32, status: u32, severity: u32]  = 12 bytes
            FaultEvent::PcieAerUe { bus_dev_fn, status, severity: sev } => {
                let mut data = [0u8; 64];
                data[0..4].copy_from_slice(&bus_dev_fn.to_le_bytes());
                data[4..8].copy_from_slice(&status.to_le_bytes());
                data[8..12].copy_from_slice(&sev.to_le_bytes());
                let health_sev = if *sev & 0x1 != 0 {
                    HealthSeverity::Fatal
                } else {
                    HealthSeverity::Critical
                };
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Pcie,
                    code: 0x0002,
                    severity: health_sev,
                    data,
                    data_len: 12,
                }
            }
            // --- Storage subsystem ---
            // class = Storage (1), code = 0x0001
            // data layout: [device_id: u64, attr_id: u8, value: u8, threshold: u8]  = 11 bytes
            FaultEvent::StorageSmart { device_id, attr_id, value, threshold } => {
                let mut data = [0u8; 64];
                data[0..8].copy_from_slice(&device_id.to_le_bytes());
                data[8] = *attr_id;
                data[9] = *value;
                data[10] = *threshold;
                let severity = if *value <= *threshold {
                    HealthSeverity::Critical
                } else {
                    HealthSeverity::Warning
                };
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Storage,
                    code: 0x0001,
                    severity,
                    data,
                    data_len: 11,
                }
            }
            // --- Thermal subsystem ---
            // class = Thermal (4), code = 0x0001
            // data layout: [cpu_or_device: u64, temp_milli_c: i32, throttle_pct: u8]  = 13 bytes
            FaultEvent::Thermal { cpu_or_device, temp_milli_c, throttle_pct } => {
                let mut data = [0u8; 64];
                data[0..8].copy_from_slice(&cpu_or_device.to_le_bytes());
                data[8..12].copy_from_slice(&temp_milli_c.to_le_bytes());
                data[12] = *throttle_pct;
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Thermal,
                    code: 0x0001,
                    severity: HealthSeverity::Warning,
                    data,
                    data_len: 13,
                }
            }
            // --- Generic driver-defined ---
            // class = Generic (6) by default. Drivers that need a specific class
            // (e.g., Cpu, Memory) should use the corresponding typed FaultEvent
            // variant instead.
            // code = event_code as provided by the driver.
            // data layout: [device_id: u64, payload: [u8; 16]]  = 24 bytes
            FaultEvent::Generic { device_id, event_code, payload } => {
                let mut data = [0u8; 64];
                data[0..8].copy_from_slice(&device_id.to_le_bytes());
                data[8..24].copy_from_slice(payload);
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Generic,
                    code: *event_code,
                    severity: HealthSeverity::Info,
                    data,
                    data_len: 24,
                }
            }
            // --- Driver crash ---
            // class = Generic (6), code = 0x1001
            // Note: code 0x1001 (not 0x0001) cannot collide with driver-provided
            // Generic event codes: drivers use the 0x0001+ range, kernel-internal
            // Generic codes use the 0x1000+ range, and the unique key for FMA
            // events is (class, code).
            // data layout: [device: u64, driver_id: u64, tier: u8, crash_count: u32]  = 21 bytes
            FaultEvent::DriverCrash { device, driver_id, tier, crash_count } => {
                let mut data = [0u8; 64];
                data[0..8].copy_from_slice(&device.to_le_bytes());
                data[8..16].copy_from_slice(&driver_id.to_le_bytes());
                data[16] = *tier;
                data[17..21].copy_from_slice(&crash_count.to_le_bytes());
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Generic,
                    code: 0x1001,
                    severity: HealthSeverity::Critical,
                    data,
                    data_len: 21,
                }
            }
            // --- DMA timeout ---
            // class = Pcie (3), code = 0x0003
            // data layout: [bus_dev_fn: u32, timeout_ms: u32]  = 8 bytes
            FaultEvent::DmaTimeout { bus_dev_fn, timeout_ms } => {
                let mut data = [0u8; 64];
                data[0..4].copy_from_slice(&bus_dev_fn.to_le_bytes());
                data[4..8].copy_from_slice(&timeout_ms.to_le_bytes());
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Pcie,
                    code: 0x0003,
                    severity: HealthSeverity::Warning,
                    data,
                    data_len: 8,
                }
            }
            // --- PCIe FLR timeout ---
            // class = Pcie (3), code = 0x0004
            // data layout: [bus_dev_fn: u32, vendor_id: u16, device_id: u16, escalation: u8]  = 9 bytes
            FaultEvent::PcieFlrTimeout { bus_dev_fn, vendor_id, device_id, escalation } => {
                let mut data = [0u8; 64];
                data[0..4].copy_from_slice(&bus_dev_fn.to_le_bytes());
                data[4..6].copy_from_slice(&vendor_id.to_le_bytes());
                data[6..8].copy_from_slice(&device_id.to_le_bytes());
                data[8] = *escalation;
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Pcie,
                    code: 0x0004,
                    severity: HealthSeverity::Fatal,
                    data,
                    data_len: 9,
                }
            }
            // --- CPU MCE ---
            // class = Cpu (8), code = 0x0001
            // data layout: [phys_addr: u64, bank: u8]  = 9 bytes
            FaultEvent::CpuMce { phys_addr, bank } => {
                let mut data = [0u8; 64];
                data[0..8].copy_from_slice(&phys_addr.to_le_bytes());
                data[8] = *bank;
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Cpu,
                    code: 0x0001,
                    severity: HealthSeverity::Critical,
                    data,
                    data_len: 9,
                }
            }
            // --- Network link flap ---
            // class = Network (2), code = 0x0001
            // data layout: [ifindex: u32]  = 4 bytes
            FaultEvent::NetworkLinkFlap { ifindex } => {
                let mut data = [0u8; 64];
                data[0..4].copy_from_slice(&ifindex.to_le_bytes());
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Network,
                    code: 0x0001,
                    severity: HealthSeverity::Warning,
                    data,
                    data_len: 4,
                }
            }
            // --- Performance degradation ---
            // class = Performance (9), code = 0x0001
            // data layout: [device: u64, metric: u32]  = 12 bytes
            FaultEvent::PerformanceDegraded { device, metric } => {
                let mut data = [0u8; 64];
                let dev_val: u64 = (*device).into();
                data[0..8].copy_from_slice(&dev_val.to_le_bytes());
                data[8..12].copy_from_slice(&metric.to_le_bytes());
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Performance,
                    code: 0x0001,
                    severity: HealthSeverity::Warning,
                    data,
                    data_len: 12,
                }
            }
            // --- Device reset failed ---
            // class = Generic (6), code = 0x1003
            // data layout: [device: u64, reset_type: u8]  = 9 bytes
            FaultEvent::DeviceResetFailed { device, reset_type } => {
                let mut data = [0u8; 64];
                let dev_val: u64 = (*device).into();
                data[0..8].copy_from_slice(&dev_val.to_le_bytes());
                data[8] = *reset_type;
                HealthEvent {
                    timestamp_ns,
                    class: HealthEventClass::Generic,
                    code: 0x1003,
                    severity: HealthSeverity::Fatal,
                    data,
                    data_len: 9,
                }
            }
        }
    }
}

Variant-to-class/code mapping summary:

| FaultEvent variant | HealthEventClass | code | severity | data_len |
|---|---|---|---|---|
| MemoryCe | Memory (0) | 0x0001 | Corrected | 32 |
| MemoryUe | Memory (0) | 0x0002 | Fatal / Critical (based on `fatal` field) | 32 |
| PcieAerCe | Pcie (3) | 0x0001 | Corrected | 8 |
| PcieAerUe | Pcie (3) | 0x0002 | Fatal / Critical (based on AER severity bit 0) | 12 |
| DmaTimeout | Pcie (3) | 0x0003 | Warning | 8 |
| PcieFlrTimeout | Pcie (3) | 0x0004 | Fatal | 9 |
| StorageSmart | Storage (1) | 0x0001 | Critical / Warning (value ≤ threshold → Critical) | 11 |
| Thermal | Thermal (4) | 0x0001 | Warning | 13 |
| Generic | Generic (6) | driver-provided | Info | 24 |
| DriverCrash | Generic (6) | 0x1001 | Critical | 21 |
| CpuMce | Cpu (8) | 0x0001 | Critical | 9 |
| NetworkLinkFlap | Network (2) | 0x0001 | Warning | 4 |
| PerformanceDegraded | Performance (9) | 0x0001 | Warning | 12 |
| DeviceResetFailed | Generic (6) | 0x1003 | Fatal | 9 |

Codes within each class are stable (never reused after deprecation). New FaultEvent variants added in future versions receive new codes, never reuse existing ones.
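Because the packed layouts are part of the stable (class, code) contract, a consumer can decode payloads by fixed offset. The following sketch shows a hypothetical userspace-side decoder for the MemoryCe payload (class Memory, code 0x0001); the struct and function names are illustrative, not part of the kernel API, but the offsets mirror the encoder above.

```rust
/// Decoded fields of a MemoryCe health event (class = Memory, code = 0x0001).
/// Layout mirrors the in-kernel encoder:
/// [mci_idx:u32][csrow:u32][channel:u32][page:u64][offset:u64][syndrome:u32] = 32 bytes.
#[derive(Debug, PartialEq)]
pub struct MemoryCeFields {
    pub mci_idx: u32,
    pub csrow: u32,
    pub channel: u32,
    pub page: u64,
    pub offset: u64,
    pub syndrome: u32,
}

/// Parse the little-endian packed payload. Returns None if `data_len`
/// does not match the 32 bytes MemoryCe always serializes.
pub fn decode_memory_ce(data: &[u8; 64], data_len: u16) -> Option<MemoryCeFields> {
    if data_len != 32 {
        return None;
    }
    let u32_at = |o: usize| u32::from_le_bytes(data[o..o + 4].try_into().unwrap());
    let u64_at = |o: usize| u64::from_le_bytes(data[o..o + 8].try_into().unwrap());
    Some(MemoryCeFields {
        mci_idx: u32_at(0),
        csrow: u32_at(4),
        channel: u32_at(8),
        page: u64_at(12),
        offset: u64_at(20),
        syndrome: u32_at(28),
    })
}
```

Because codes are never reused, such a decoder stays valid across kernel updates; new variants only ever add new (class, code) pairs.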

Existing tools that benefit without modification:

  • rasdaemon — uses kernel tracepoints (specifically the RAS tracepoints in /sys/kernel/debug/tracing/events/ras/) to collect hardware error events and stores them in a SQLite database; UmkaOS's sysfs health attributes provide an additional data source that rasdaemon can be extended to consume
  • Prometheus node_exporter — can scrape sysfs files
  • smartctl — storage health still works via the standard interface
  • systemd — can react to uevents

20.1.9 Health Score and Escalation Policy

The diagnosis engine (Section 20.1) detects fault patterns and fires actions (Section 20.1), but those actions are binary: either a specific rule fires or it does not. A complementary health score model provides a continuous signal that drives graduated escalation — throttling before draining, draining before offlining — and makes escalation decisions auditable and predictable.

20.1.9.1 Health Score Model

Each device maintains a rolling health score H ∈ [0, 100]. Score 100 = fully healthy, 0 = critically unhealthy. The score is stored in DeviceHealthLog as an AtomicU32 (scaled: stored value = H * 100, giving two decimal places of precision without floating-point in the kernel).

The score decays toward 100 over time (natural recovery) at the rate specified per severity. Each error event immediately decreases the score based on its severity:

| Severity | Score Penalty | Recovery Rate | Notes |
|---|---|---|---|
| Corrected (0) | 0 | — | Hardware auto-corrected; tracked for trends only |
| Info (1) | 1 | +1 per minute | Informational, no action needed |
| Warning (2) | 5 | +2 per minute | Threshold approaching, single anomaly |
| Degraded (3) | 15 | +1 per minute | Partial failure, corrective action in progress |
| Critical (4) | 40 | +0.5 per minute | Imminent failure |
| Fatal (5) | 100 | No recovery | Unrecoverable; score is clamped to 0 |

The HealthSeverity enum (Section 20.1) defines six levels in ascending severity order: Corrected (0) < Info (1) < Warning (2) < Degraded (3) < Critical (4) < Fatal (5). The numeric value is the severity rank. Corrected is the lowest severity — hardware auto-corrected events (ECC single-bit fixes, PCIe correctable errors) that impose no health penalty but are worth tracking for trend analysis. Fatal is the highest — unrecoverable hardware errors (ECC uncorrectable, PCIe fatal AER, storage uncorrectable media error) that immediately drive the health score to 0.

Recovery: Score recovery is applied by the FMA maintenance timer (runs every 60 seconds). The timer adds the per-severity recovery rate to each active device's score, clamping at 100. A device that has received no new error events for long enough returns to H == 100 automatically. A device with Fatal events never recovers — the score stays at 0 until the device is explicitly re-enabled by an administrator (which also resets the score).

Health score storage: DeviceHealth.health_score is an AtomicU32 with Relaxed loads / Release stores. The FMA engine updates the score from the device's IRQ-affinity CPU; other CPUs read the latest value within one memory ordering window. Health checks are warm-path (per-error-event, not per-packet), so a single atomic suffices — CpuLocal (the register-based fixed-layout hot-path struct with ~10 kernel-wide fields) is reserved for kernel-global per-CPU data, not per-device data.

Recent-error weighting: Errors that arrived within the last 60 seconds are weighted 3× when computing the effective score for threshold comparisons. The raw stored score reflects all historical events; the effective score used for escalation decisions amplifies recent bursts. The weighting is computed on-the-fly when an escalation check runs — the stored AtomicU32 always holds the true current score without amplification.
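The scaled fixed-point scheme can be sketched as plain integer arithmetic. This is a simplified model of the penalty/recovery math, not the kernel's atomic hot path; the constants come from the severity table above, and the choice to recover at the rate of the worst recent severity is one plausible reading of the spec.

```rust
/// Simplified model of the scaled health score: stored value = H * 100,
/// giving two decimal places of precision without floating point.
const SCALE: u32 = 100;
const MAX_SCORE: u32 = 100 * SCALE; // H == 100

/// Penalty in scaled units for each severity rank 0..=5 (per the table).
fn penalty(severity: u8) -> u32 {
    match severity {
        0 => 0,          // Corrected: no penalty, trend tracking only
        1 => SCALE,      // Info: 1 point
        2 => 5 * SCALE,  // Warning
        3 => 15 * SCALE, // Degraded
        4 => 40 * SCALE, // Critical
        _ => MAX_SCORE,  // Fatal: clamps the score to 0
    }
}

/// Apply an event's penalty immediately, saturating at 0.
fn apply_event(score: u32, severity: u8) -> u32 {
    score.saturating_sub(penalty(severity))
}

/// One maintenance-timer tick (60 s): recover toward 100 at the
/// per-severity rate, in scaled units per minute (+0.5/min = 50 units).
fn recover(score: u32, worst_recent_severity: u8) -> u32 {
    let rate = match worst_recent_severity {
        2 => 2 * SCALE,  // Warning: +2 per minute
        4 => SCALE / 2,  // Critical: +0.5 per minute
        5 => 0,          // Fatal: never recovers
        _ => SCALE,      // Info / Degraded: +1 per minute
    };
    (score + rate).min(MAX_SCORE)
}
```

Keeping the score in hundredths lets the +0.5/minute Critical rate stay exact integer arithmetic (50 scaled units), which is why the kernel stores H * 100 rather than H.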

20.1.9.2 Escalation Thresholds

/// FMA escalation level — determines what corrective action the FMA engine
/// takes based on the device's current health score.
///
/// Escalation is evaluated after every health event and also by the FMA
/// maintenance timer. When the effective score (with recent-error weighting)
/// crosses a threshold, the corresponding escalation action is triggered.
/// Actions are idempotent: triggering `Drain` on an already-draining device
/// is a no-op.
pub enum FmaEscalation {
    /// Effective H >= 70: device is healthy. No throttling.
    Healthy,

    /// Effective H in [40, 70): throttle I/O rate proportionally
    /// ([Section 7.6](07-scheduling.md#cpu-bandwidth-guarantees)).
    /// The permille value is proportional to how far H has fallen below 70:
    ///   permille = 500 + 500 * (H - 40) / 30   (integer linear interpolation)
    /// At H == 70: permille = 1000 (no throttling).
    /// At H == 40: permille = 500 (half rate).
    /// Range: [500, 1000]. No floating-point arithmetic in kernel code.
    Throttle { target_iops_permille: u16 },

    /// Effective H in [10, 40): drain all in-flight I/O and quiesce the device.
    /// New I/O submissions return `-EBUSY`. Existing submissions complete normally.
    /// The device transitions to `DeviceState::Degraded` in the device registry.
    /// If the error rate stabilizes (no new events for `fma_stabilization_window_ms`),
    /// the device may be promoted back to `Throttle` or `Healthy`.
    /// If errors continue, escalation proceeds to `Offline`.
    Drain,

    /// Effective H in [0, 10): take device offline. For critical devices
    /// ([Section 20.1](#fault-management-architecture--critical-device-designation)):
    ///   attempt failover first. If failover succeeds, offline the original device.
    ///   If failover is unavailable, continue to the `Panic` escalation.
    Offline { try_failover: bool },

    /// Effective H == 0 OR Fatal event on a non-redundant critical device with no
    /// available failover target: kernel panic with FMA diagnostic.
    ///
    /// The panic message includes the device canonical name, the triggering event,
    /// the last 16 entries from the device's `DeviceHealthLog` ring, and the
    /// current health score. This information is also saved to pstore
    /// ([Section 20.7](#pstore-panic-log-persistence)) for post-mortem analysis.
    Panic { reason: FmaPanicReason },
}

/// Reason for an FMA-triggered kernel panic.
pub enum FmaPanicReason {
    /// Health score reached 0 with no failover available.
    HealthScoreZero { device_name: ArrayString<64> },
    /// Fatal event on a non-redundant critical device.
    FatalEventNoCriticalFailover { device_name: ArrayString<64> },
}
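The threshold bands and the integer interpolation for the throttle permille can be condensed into one decision function. A minimal sketch follows; the helper name and simplified signature are illustrative, and Panic's full preconditions (non-redundant critical device, no failover target) are reduced here to a single flag.

```rust
/// Simplified escalation decision for an effective health score H in [0, 100].
#[derive(Debug, PartialEq)]
enum Escalation {
    Healthy,
    Throttle { target_iops_permille: u16 },
    Drain,
    Offline,
    Panic,
}

fn escalation_for(h: u32, is_critical: bool) -> Escalation {
    if h >= 70 {
        Escalation::Healthy
    } else if h >= 40 {
        // Integer linear interpolation: 1000 at H == 70, 500 at H == 40.
        // No floating point; range is [500, 1000].
        let permille = (500 + 500 * (h - 40) / 30) as u16;
        Escalation::Throttle { target_iops_permille: permille }
    } else if h >= 10 {
        Escalation::Drain
    } else if h >= 1 {
        Escalation::Offline
    } else if is_critical {
        // H == 0 on a critical device with no failover: panic path.
        Escalation::Panic
    } else {
        Escalation::Offline
    }
}
```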

20.1.9.3 Device State Mapping

The following table shows how EDAC/hardware health reporting maps through FMA escalation to the device registry's DeviceState:

| Source | DeviceHealth (EDAC) | FmaEscalation | DeviceState (Registry) |
|---|---|---|---|
| No errors | H >= 70 | Healthy | Active |
| Correctable errors accumulating | 40 <= H < 70 | Throttle { permille } | Active (throttled) |
| Correctable threshold exceeded | 10 <= H < 40 | Drain | Degraded |
| Uncorrectable error | H < 10 | Offline { try_failover } | Error |
| Fatal / non-redundant critical | H == 0 | Panic { reason } | N/A (kernel panic) |

The FMA engine is the sole authority for health-to-state transitions. Device drivers and EDAC handlers emit health events; the FMA diagnosis engine computes the health score; and the FMA response executor applies the corresponding DeviceState transition via the device registry API.

20.1.9.4 Critical Device Designation

A device is designated "critical" if removing it from service would render the system unbootable or unmanageable. Only critical devices trigger the try_failover path and can ultimately cause an FmaEscalation::Panic.

A device is critical if any of the following apply:

  • It is the block device containing the root filesystem (determined by matching the device's identity against the kernel's boot parameters at startup).
  • It is the block device hosting the EFI system partition (on UEFI systems).
  • It is the primary management network interface on a headless system (a system with no local console, determined by the umka.mgmt_netdev= boot parameter or by the absence of a graphics device).

Non-critical devices — GPUs, secondary NICs, non-boot storage, accelerators — always take the Offline path without triggering a panic. The rationale: panicking to protect a non-critical device causes more harm than losing that device.

Critical device designation is fixed at boot. An administrator cannot promote a device to critical after the kernel has started (to prevent privilege escalation via FMA).
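Because the designation is fixed at boot, the check reduces to a pure predicate over boot-time state. The following sketch assumes a hypothetical `BootCriticalInfo` struct populated once during early boot from the boot parameters; the real kernel matches richer device identities, not bare u64 IDs.

```rust
/// Boot-time snapshot of which device IDs are critical. Populated once
/// during early boot from the kernel boot parameters; immutable afterward.
/// (Illustrative struct; the real check matches full device identities.)
struct BootCriticalInfo {
    root_block_dev: u64,        // device holding the root filesystem
    esp_block_dev: Option<u64>, // EFI system partition (UEFI systems only)
    mgmt_netdev: Option<u64>,   // from umka.mgmt_netdev=, headless systems only
}

/// A device is critical iff it matches one of the boot-time designations.
/// No runtime promotion path exists, by design.
fn is_critical(device_id: u64, boot: &BootCriticalInfo) -> bool {
    device_id == boot.root_block_dev
        || boot.esp_block_dev == Some(device_id)
        || boot.mgmt_netdev == Some(device_id)
}
```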

20.1.9.5 Hysteresis and Re-admission

A device that has been placed in Drain state does not immediately return to Active when its health score recovers. To prevent oscillation:

  • Transition from Drain back to Active requires both:
      • Effective health score H > 60 (well clear of the drain threshold).
      • At least 5 minutes have elapsed since the last new error event of Warning severity or above.
  • Transition from Offline requires administrator action: writing enable to the device's umkafs control file (Section 20.5). The health score is reset to 80 on re-enable (not 100) to reflect that the device has a recent failure history.

These hysteresis rules are enforced by the FMA maintenance timer. The timer re-evaluates escalation state for all devices every 60 seconds.
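The two Drain-to-Active conditions combine into a single predicate that the maintenance timer evaluates per device. A sketch, with illustrative names; the real timer would also consult the recent-error weighting from Section 20.1.9.1:

```rust
/// Hysteresis check for promoting a draining device back to Active.
/// Both conditions must hold: score well above the drain threshold,
/// AND a quiet period since the last Warning-or-worse event.
const READMIT_SCORE: u32 = 60;                        // H must strictly exceed this
const READMIT_QUIET_NS: u64 = 5 * 60 * 1_000_000_000; // 5 minutes in nanoseconds

fn may_readmit(effective_score: u32, now_ns: u64, last_warning_or_worse_ns: u64) -> bool {
    effective_score > READMIT_SCORE
        && now_ns.saturating_sub(last_warning_or_worse_ns) >= READMIT_QUIET_NS
}
```

Requiring H > 60 rather than H >= 40 (the drain threshold) is what prevents oscillation: a device hovering just above the threshold stays drained until it demonstrates sustained recovery.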

20.1.10 FMA to umkafs Publishing

The FMA ring (Section 20.1) and health score (Section 20.1) are kernel-internal structures. The umkafs /Health/devices/<dev>/ subtree (Section 20.5) makes them visible to userspace. This section specifies the bridge between the two.

Publishing mechanism: FMA health events are published to umkafs synchronously via fma_publish_event(), which runs in the context of the device driver reporting the event (not in a worker thread). No deferred workqueue is involved. This keeps latency low for monitoring tools polling umkafs — the file contents reflect the latest event as soon as fma_report_health() returns to the caller.

/// Publish an FMA health event to the umkafs namespace.
/// Called from `fma_report_health()` after the event is recorded in the FMA ring.
///
/// # Arguments
/// * `device` — Device identifier for the umkafs path hierarchy.
/// * `event` — The health event to publish (already recorded in the FMA ring).
///
/// Updates the umkafs files under `/Health/devices/<dev>/` atomically:
/// status, score, events (ring tail), event_count.
fn fma_publish_event(device: &DeviceHandle, event: &FmaHealthEvent);

Published paths (per device with canonical name <dev> as assigned by Section 20.5):

| umkafs path | Content | Update frequency |
|---|---|---|
| /Health/devices/<dev>/status | Current escalation state: healthy / throttle / drain / offline / failed | On every state transition |
| /Health/devices/<dev>/score | Health score H (0-100) as ASCII integer | On every event and every 60s (maintenance timer) |
| /Health/devices/<dev>/events | Last 16 events from the FMA ring as JSON lines (newest last) | On every new event (tracks ring-tail pointer) |
| /Health/devices/<dev>/event_count | Total events since device initialization (u64, ASCII decimal) | On every new event |
| /Health/devices/<dev>/critical | 1 if device is designated critical (Section 20.1), 0 otherwise | Static after boot |

Read semantics: all umkafs FMA files are read-atomically — a single read() system call returns a consistent snapshot. Readers do not need to hold any lock. The FMA subsystem uses RCU for the ring-tail pointer (so readers see a coherent tail position) and a seqcount for the health score (so readers always see a complete score value, never a partially-written intermediate). Both primitives are the same ones used throughout the FMA hot path; no additional synchronization cost is added by umkafs access.

Write semantics: /Health/devices/<dev>/status is read-only for userspace. Writing to it returns EPERM. Administrative actions (force offline, re-enable) are performed via the device's dedicated control file: /Health/devices/<dev>/control (see Section 20.5). This prevents accidental status corruption while keeping the admin interface consistent with the rest of umkafs.

Event format in the events file — each line is one JSON object:

{"ts_ns":1234567890,"severity":"Warning","class":"IoError","detail":"CRC mismatch on sector 4096","score_after":85}

Fields:

| Field | Type | Description |
|---|---|---|
| ts_ns | u64 | Nanoseconds since boot at event arrival |
| severity | string | One of Corrected, Info, Warning, Degraded, Critical, Fatal |
| class | string | HealthEventClass variant name (e.g., Storage, Memory, Pcie) |
| detail | string | Human-readable description (from HealthEvent.data, interpreted per class) |
| score_after | u8 | Health score H immediately after applying this event's penalty |

The events file holds at most 16 lines. When the FMA ring has more than 16 entries, only the 16 most recent are rendered. This bound is intentional: events is for live monitoring (recent activity), not for long-term storage. Full event history is available via the procfs ring dump at /proc/umka/fma/events (Section 20.1).
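For reference, a line of the events file can be produced with straightforward formatting. This is a userspace-style sketch using heap allocation (the in-kernel renderer would write into a preallocated buffer), and it assumes detail strings are pre-escaped, which holds for kernel-generated descriptions:

```rust
/// One rendered line of /Health/devices/<dev>/events (illustrative type).
struct EventLine<'a> {
    ts_ns: u64,
    severity: &'a str,
    class: &'a str,
    detail: &'a str,
    score_after: u8,
}

/// Render one JSON object per line, matching the events file format.
/// `detail` is assumed pre-escaped (kernel strings contain no quotes).
fn render_event_line(e: &EventLine) -> String {
    format!(
        "{{\"ts_ns\":{},\"severity\":\"{}\",\"class\":\"{}\",\"detail\":\"{}\",\"score_after\":{}}}",
        e.ts_ns, e.severity, e.class, e.detail, e.score_after
    )
}
```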


20.2 Stable Tracepoint ABI

Inspired by: DTrace's philosophy of always-on, zero-cost, production-grade tracing. IP status: Clean — design decisions applied to eBPF (Linux mechanism), not DTrace code. Tracepoints as stable interfaces is a policy choice, not a patentable invention.

20.2.1 Problem

Linux tracepoints are considered unstable internal API. They change name, arguments, and semantics between kernel versions. Tools like bpftrace and BCC scripts break regularly. The community's position: "tracepoints are for kernel developers, not users."

This is a policy choice, not a technical necessity. UmkaOS can make a different choice.

Section 19.2 specifies full eBPF support. This section defines a layer on top: a set of versioned, stable, documented tracepoints that applications can depend on across kernel updates.

20.2.2 Two Categories of Tracepoints

Category 1: Stable Tracepoints (umka_tp_stable_*) — Versioned, documented, covered by the same "never break without deprecation" policy as the userspace ABI. These are the tracepoints that monitoring tools, profilers, and production observability systems can depend on.

Category 2: Debug Tracepoints (umka_tp_debug_*) — Unstable, may change between any two releases. For kernel developer use only. Same policy as Linux tracepoints.

20.2.3 Stable Tracepoint Interface

// umka-core/src/trace/stable.rs (kernel-internal)

/// A stable tracepoint definition.
pub struct StableTracepoint {
    /// Tracepoint name (e.g., "umka_tp_stable_syscall_entry").
    /// Once published, this name never changes.
    pub name: &'static str,

    /// Numeric tracepoint ID, assigned at registration time by a global
    /// `AtomicU32` counter (`TRACEPOINT_ID_COUNTER`). This ID is the value
    /// that userspace passes as `perf_event_attr.config` when opening a
    /// `PERF_TYPE_TRACEPOINT` event ([Section 20.8](#performance-monitoring-unit)).
    ///
    /// Exposed to userspace at:
    ///   `/ukfs/kernel/tracing/events/<category>/<name>/id`
    /// (and the Linux-compatible symlink `/sys/kernel/tracing/events/<category>/<name>/id`)
    ///
    /// IDs are assigned monotonically starting from 1 and never reused.
    /// A u32 counter supports >4 billion tracepoint registrations — sufficient
    /// for 50-year uptime even with aggressive dynamic tracepoint creation,
    /// because tracepoints are registered at boot or module load (bounded by
    /// the number of kernel subsystems, not by time).
    ///
    /// **Policy exemption**: Exempt from the u64-for-all-identifiers policy
    /// ([Section 1.3](01-overview.md#performance-budget)). This is a
    /// registration-bounded counter (thousands of registrations across the
    /// kernel lifetime), not a time-bounded counter. u32 provides >4 billion
    /// slots, which is orders of magnitude beyond the number of kernel
    /// subsystems + modules that could ever register tracepoints.
    pub id: AtomicU32,

    /// Category for organization.
    pub category: TracepointCategory,

    /// Version of this tracepoint's argument format.
    /// Starts at 1. Bumped when arguments are added (append-only).
    pub version: u32,

    /// Argument schema (for bpftool and documentation).
    pub args: &'static [TracepointArg],

    /// Probe function pointer (None when no eBPF program is attached).
    /// When None, the tracepoint has ZERO overhead (branch on static key).
    pub probe: AtomicPtr<()>,
}

/// Global monotonic counter for tracepoint ID assignment.
/// Starts at 1 (ID 0 is reserved as "invalid / not yet registered").
/// Incremented with `Relaxed` ordering: `fetch_add` is atomic, so assigned
/// IDs are unique even under concurrent registration, and no other memory
/// location is ordered against this counter. (In practice registration is
/// serialized anyway: it happens at boot or under a registration lock.)
static TRACEPOINT_ID_COUNTER: AtomicU32 = AtomicU32::new(1);

/// Register a tracepoint: assigns a unique numeric ID and exposes it
/// in the tracefs namespace. Called during subsystem init (boot time)
/// or driver module load. Must not be called from interrupt context.
pub fn register_tracepoint(tp: &StableTracepoint) {
    let id = TRACEPOINT_ID_COUNTER.fetch_add(1, Ordering::Relaxed);
    tp.id.store(id, Ordering::Release);
    // Create tracefs entry:
    //   /ukfs/kernel/tracing/events/<category>/<name>/id  → contains `id` as decimal
    //   /ukfs/kernel/tracing/events/<category>/<name>/format → argument schema
    tracefs_create_tracepoint_entry(tp, id);
}

#[repr(u32)]
pub enum TracepointCategory {
    Syscall     = 0,    // Syscall entry/exit
    Scheduler   = 1,    // Context switch, wakeup, migration
    Memory      = 2,    // Page fault, allocation, reclaim, compression
    Block       = 3,    // Block I/O submit, complete
    Network     = 4,    // Packet TX/RX, socket events
    Filesystem  = 5,    // VFS operations, page cache
    Driver      = 6,    // Driver load, unload, crash, recovery
    Power       = 7,    // PM transitions, frequency changes
    Security    = 8,    // Capability checks, access denials
    Fma         = 9,    // Health events, diagnosis actions
    Accelerator = 10,   // GPU, FPGA, inference engine events
}

impl TracepointCategory {
    /// Number of defined categories. Used for array sizing in category-indexed
    /// data structures (e.g., per-category tracepoint counters).
    pub const COUNT: usize = 11;
}

pub struct TracepointArg {
    pub name: &'static str,
    pub arg_type: TracepointArgType,
    pub description: &'static str,
}

#[repr(u32)]
pub enum TracepointArgType {
    U64         = 0,
    I64         = 1,
    U32         = 2,
    I32         = 3,
    Str         = 4,    // Pointer + length
    Bytes       = 5,    // Pointer + length
    Pid         = 6,    // Process ID (u32)
    Tid         = 7,    // Thread ID (u32)
    DeviceId    = 8,    // DeviceNodeId (u64)
    Timestamp   = 9,    // Nanoseconds since boot (u64)
}

20.2.4 Zero-Overhead When Disabled

Tracepoints use static keys (same mechanism as Linux). When no eBPF program is attached, the tracepoint site is a NOP instruction. There is zero overhead in the common case — no branch prediction cost, no cache pollution from tracepoint data collection. (Note: always-on aggregation counters in Section 20.2 add ~1-2 ns per event via per-CPU atomic increments, independent of tracepoint enablement.)

// At the tracepoint site in kernel code:

// This compiles to a NOP when no probe is attached.
// When a probe is attached, it becomes a call.
umka_trace_stable!(syscall_entry, {
    pid: current_pid(),
    tid: current_tid(),
    syscall_nr: nr,
    arg0: args[0],
    arg1: args[1],
    arg2: args[2],
});

When an eBPF program attaches to this tracepoint, the runtime patches the NOP to a CALL instruction (instruction patching, same as Linux static keys). When the program detaches, it's patched back to NOP.

Tracepoint dispatch table: Each tracepoint has a TracepointCallsite struct containing the static key, the attached probe list (RcuList<BpfProg>), and the tracepoint's stable ID. When perf_event_open() or bpf(BPF_LINK_CREATE) attaches a BPF program to a tracepoint, the program is appended to the probe list under RCU protection. The dispatch path iterates the list on each tracepoint hit. The global TRACEPOINT_TABLE: XArray<TracepointCallsite> maps tracepoint IDs to callsites.

Tracepoint-to-BPF dispatch: when a tracepoint fires, the kernel calls probe_fn for each attached BPF program. probe_fn is registered via bpf_raw_tracepoint_open(tp_id, prog_fd). The TracepointCallsite maintains the RCU-protected list of attached BpfProg references. Dispatch cost: one RCU read + one indirect call per attached program.

RCU semantics for tracepoint probes: All tracepoint probe callbacks execute within an implicit RCU read-side critical section. The tracepoint infrastructure calls rcu_read_lock() before the first probe and rcu_read_unlock() after the last probe returns. Consequences:

  • Probes MUST NOT sleep (no mutex, no GFP_KERNEL allocation).
  • Probes CAN safely dereference RCU-protected pointers (task structs, routing entries, capability tables) without additional locking.
  • Probe unregistration uses synchronize_rcu() to ensure no probe is executing when its function pointer is freed.

This matches Linux's tracepoint RCU model.

eBPF latency budget on critical tracepoints: Critical-path tracepoints (sched_switch, sched_wakeup, irq_handler_entry) enforce a per-invocation eBPF execution budget of 1 us (cycle watchdog). If a BPF program exceeds this budget, it is immediately detached from the tracepoint and an FMA event eBpfBudgetExceeded is emitted. The budget is enforced by the instruction counter in the BPF JIT — no timer needed. Non-critical tracepoints (sys_enter, net_dev_xmit) use the default 10 ms budget.
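The budget check above can be sketched as a userspace-testable model. `AttachedProg` and `enforce_budget` are hypothetical illustration names — the real enforcement lives in the BPF JIT's instruction counter, not in a standalone function like this:

```rust
/// Sketch (hypothetical types) of the per-invocation eBPF budget check:
/// a program that overruns its budget on a critical tracepoint is
/// detached and an FMA event would be emitted.

struct AttachedProg {
    /// Per-invocation execution budget (1_000 ns on critical tracepoints,
    /// 10_000_000 ns on non-critical ones, per the text above).
    budget_ns: u64,
    detached: bool,
}

/// Account one probe invocation that consumed `elapsed_ns`.
/// Returns false (and detaches the program) on budget overrun.
fn enforce_budget(prog: &mut AttachedProg, elapsed_ns: u64) -> bool {
    if elapsed_ns > prog.budget_ns {
        prog.detached = true; // would also emit an eBpfBudgetExceeded FMA event
        return false;
    }
    true
}

fn main() {
    // Critical-path tracepoint: 1 us budget.
    let mut prog = AttachedProg { budget_ns: 1_000, detached: false };
    assert!(enforce_budget(&mut prog, 800));    // within budget
    assert!(!enforce_budget(&mut prog, 1_500)); // overran: detached
    assert!(prog.detached);
}
```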

20.2.5 Stable Tracepoint Catalog

Initial set of stable tracepoints (version 1).

Namespace conventions for all stable tracepoints:

  • All pid and tid fields report init PID namespace values (global PIDs/TIDs). Container-local PIDs are not available in tracepoint arguments — correlate via pid_vnr() in eBPF programs attached to these tracepoints if container-local PIDs are needed. This matches AuditRecord.pid/tid semantics (see below).
  • All timestamp fields use CLOCK_MONOTONIC_RAW (raw hardware clock, NOT adjusted for time namespace offsets). Tracepoint timestamps are always in the init time namespace for cross-container consistency. Same clock source as AuditRecord.timestamp_ns (Section 20.3).

Syscall:

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_syscall_entry | pid, tid, nr, arg0-5, timestamp | Syscall entry |
| umka_tp_stable_syscall_exit | pid, tid, nr, ret, timestamp | Syscall return |

Scheduler:

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_sched_switch | prev_pid, prev_tid, prev_comm: [u8; 16], prev_prio: i32, prev_state, next_pid, next_tid, next_comm: [u8; 16], next_prio: i32, cpu | Context switch. Callsite: emitted from context_switch() (Section 7.3) immediately after the register-level context switch completes (arch-specific switch_to() return point). Fires once per context switch; the tracepoint is in the Tier 0 (umka-core) hot path. |
| umka_tp_stable_sched_wakeup | pid, tid, target_cpu, timestamp | Task wakeup |
| umka_tp_stable_sched_migrate | pid, tid, orig_cpu, dest_cpu, timestamp | Task migration |
| umka_tp_stable_sched_gang_cross_numa | gang_id: u64, requested_cores: u32, home_node: u32, spill_node: u32, spill_count: u32 | Gang allocation spilled across NUMA nodes. Emitted when core_provision_gang() cannot satisfy the request from a single NUMA node and allocates cores from a secondary node (Section 7.11). Operators use this to detect NUMA-suboptimal gang placements. |

Memory:

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_page_fault | pid, address, flags, timestamp | Page fault entry |
| umka_tp_stable_page_alloc | order, gfp_flags, numa_node, timestamp | Page allocation |
| umka_tp_stable_page_reclaim | nr_reclaimed, nr_scanned, priority | Reclaim cycle |
| umka_tp_stable_compress_pool_compress | original_size, compressed_size, algorithm | Page compressed |
| umka_tp_stable_compress_pool_decompress | compressed_size, latency_ns | Page decompressed |

Block I/O:

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_block_submit | device_id, sector, size, op, timestamp | I/O submitted |
| umka_tp_stable_block_complete | device_id, sector, size, op, latency_ns, error | I/O completed |

Network:

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_net_rx | device_id, len, protocol (ethertype, u16 network byte order), timestamp | Packet received |
| umka_tp_stable_net_tx | device_id, len, protocol (ethertype, u16 network byte order), timestamp | Packet transmitted |
| umka_tp_stable_tcp_connect | pid, saddr, sport, daddr, dport | TCP connection |
| umka_tp_stable_tcp_close | pid, saddr, sport, daddr, dport, duration_ns | TCP close |

Filesystem:

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_vfs_open | pid, tid, inode (u64), flags (u32), mode (u32), latency_ns, timestamp | File open (after path resolution) |
| umka_tp_stable_vfs_fsync | pid, tid, inode (u64), datasync (u8), latency_ns, error (i32), timestamp | fsync/fdatasync completion |
| umka_tp_stable_writeback_pages | device_id, inode (u64), nr_pages (u64), reason (u8), timestamp | Writeback batch submitted for an inode |

Power:

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_cpu_frequency | cpu (u32), old_khz (u32), new_khz (u32), timestamp | CPU frequency transition |
| umka_tp_stable_pm_suspend | device_id, state (u8), latency_ns, timestamp | Device PM state transition (suspend/resume). state: 0=D0, 1=D1, 2=D2, 3=D3 |

Driver:

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_driver_load | device_id, driver_name, tier, timestamp | Driver loaded |
| umka_tp_stable_driver_crash | device_id, driver_name, tier, fault_type | Driver crash |
| umka_tp_stable_driver_recover | device_id, driver_name, recovery_time_ns | Driver recovered |
| umka_tp_stable_tier_change | device_id, driver_name, old_tier, new_tier, reason | Tier change (reason: manual/auto_demote/auto_promote) |

FMA (generic):

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_fma_event | device_id, class, severity, code, data_len, data[64] | Generic health event (all subsystems) |
| umka_tp_stable_fma_action | device_id, action, reason, health_score_before, health_score_after | Corrective action taken |

20.2.5.1 Per-Subsystem FMA Tracepoints

The generic umka_tp_stable_fma_event carries the full HealthEvent payload (including the raw data[64] blob), but eBPF programs that need to inspect structured fields would have to manually parse the binary layout. Per-subsystem FMA tracepoints expose the structured fields directly as named arguments, enabling zero-parse filtering in eBPF (e.g., "alert on syndrome 0x7F from DIMM slot 3").

These tracepoints are fired by fma_emit() (Section 20.1) at the same callsite as the generic umka_tp_stable_fma_event. Both the generic and the subsystem-specific tracepoint fire for every event — eBPF programs attach to whichever provides the most convenient argument layout.

FMA Memory:

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_fma_memory_ce | device_id (u64), mci_idx (u32), csrow (u32), channel (u32), page (u64), offset (u64), syndrome (u32), timestamp (u64) | ECC correctable (single-bit) memory error |
| umka_tp_stable_fma_memory_ue | device_id (u64), mci_idx (u32), csrow (u32), channel (u32), page (u64), fatal (u8), offset (u64), timestamp (u64) | ECC uncorrectable (multi-bit) memory error |

FMA PCIe:

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_fma_pcie_aer_ce | device_id (u64), bus_dev_fn (u32), aer_status (u32), timestamp (u64) | PCIe AER correctable error. bus_dev_fn is the standard PCI BDF encoding (bus[23:16], device[15:11], function[10:8]) |
| umka_tp_stable_fma_pcie_aer_ue | device_id (u64), bus_dev_fn (u32), aer_status (u32), aer_severity (u32), timestamp (u64) | PCIe AER uncorrectable error. aer_severity bit 0: 1 = fatal, 0 = non-fatal |

FMA Storage:

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_fma_storage_smart | device_id (u64), attr_id (u8), value (u8), threshold (u8), timestamp (u64) | SMART health threshold event. value ≤ threshold indicates failure |

FMA Thermal:

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_fma_thermal | device_id (u64), temp_milli_c (i32), throttle_pct (u8), timestamp (u64) | Thermal throttling event |

FMA Generic (driver-defined):

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_fma_generic | device_id (u64), event_code (u32), payload[16] (bytes), timestamp (u64) | Generic driver-defined fault event |

All per-subsystem FMA tracepoints follow the append-only versioning rules in Section 20.2: new fields may be appended in future versions, but existing fields never change position or type. The version field in each StableTracepoint definition starts at 1 for the initial schema above.

20.2.5.2 PMU Tracepoints

Performance Monitoring Unit tracepoints for sampler lifecycle, overflow handling, and hardware event multiplexing (Section 20.8).

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_perf_sampler_wakeup | cpu (u32), samples_processed (u64), duration_ns (u64), timestamp (u64) | PMU sampler kthread woken on NMI/PMI overflow. samples_processed = number of samples drained from the per-CPU ring in this wakeup. duration_ns = wall-clock time spent draining. |
| umka_tp_stable_perf_overflow_drop | cpu (u32), queue_depth (u32), event_id (u64), timestamp (u64) | Sample dropped because the per-CPU sample ring buffer is full. queue_depth = ring occupancy at drop time. event_id = the perf_event that overflowed. Frequent drops indicate the sampler kthread cannot keep up — the operator should reduce sampling frequency or increase ring buffer size. |
| umka_tp_stable_perf_multiplex_rotate | cpu (u32), group_id (u64), nr_events (u32), nr_active (u32), time_slice_ns (u64), timestamp (u64) | Hardware event group rotated for time-sharing. Fired when the multiplexer swaps the active event set on a PMU with fewer counters than monitored events. nr_events = total events in the group, nr_active = events that fit in hardware counters simultaneously, time_slice_ns = duration of the just-completed rotation slot. |

20.2.6 Versioning Rules

Same philosophy as KABI:

  1. Tracepoint names are permanent. Once published, a stable tracepoint name never changes and never disappears (without multi-release deprecation).
  2. Arguments are append-only. New arguments can be added at the end. Existing arguments never change position, type, or semantics.
  3. Version field tracks argument schema. eBPF programs can check the tracepoint version to know which arguments are available.
  4. Deprecation requires 2+ major releases. A deprecated tracepoint continues to fire (possibly with stale/dummy data) for at least two major releases before removal.
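Rules 2 and 3 combine into a simple consumer-side pattern: read only the arguments whose introducing version is at or below the tracepoint's advertised version. A minimal sketch — ArgSpec, available_args, and the example schema are hypothetical illustrations, not kernel API:

```rust
/// Sketch of append-only schema gating: each argument records the schema
/// version in which it was appended; a consumer facing an older tracepoint
/// simply sees fewer arguments, so old programs keep working when new
/// arguments are appended.

#[derive(Clone, Copy)]
struct ArgSpec {
    name: &'static str,
    /// Schema version in which this argument was appended.
    since_version: u32,
}

/// Hypothetical schema: v1 shipped pid + latency_ns, v2 appended numa_node.
fn demo_schema() -> [ArgSpec; 3] {
    [
        ArgSpec { name: "pid", since_version: 1 },
        ArgSpec { name: "latency_ns", since_version: 1 },
        ArgSpec { name: "numa_node", since_version: 2 },
    ]
}

/// Arguments a consumer may safely read from a tracepoint reporting `tp_version`.
fn available_args(schema: &[ArgSpec], tp_version: u32) -> Vec<&'static str> {
    schema
        .iter()
        .filter(|a| a.since_version <= tp_version)
        .map(|a| a.name)
        .collect()
}

fn main() {
    assert_eq!(available_args(&demo_schema(), 1), vec!["pid", "latency_ns"]);
    assert_eq!(available_args(&demo_schema(), 2).len(), 3);
}
```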

20.2.7 Built-In Aggregation Maps

For common observability patterns, provide pre-built eBPF maps that aggregate in-kernel:

/ukfs/kernel/tracing/
    syscall_latency_hist    # Per-syscall latency histogram (log2 buckets)
    block_latency_hist      # Per-device I/O latency histogram
    sched_latency_hist      # Scheduling latency histogram
    net_packet_count        # Per-interface packet counters
    page_fault_count        # Per-process page fault counters

These are always-on aggregation counters, separate from the tracepoint mechanism described in Section 20.2. The distinction is important:

  • Tracepoints (Section 20.2) have zero overhead when disabled. When no eBPF program is attached, the tracepoint site is a NOP instruction — no branch, no cost. They are designed for on-demand, deep inspection.
  • Aggregation counters are always active but are not tracepoint-based. They are simple atomic increments (e.g., AtomicU64::fetch_add(1, Relaxed)) embedded directly in the relevant code paths (syscall entry, block I/O completion, packet RX/TX, etc.). Their cost is a single atomic increment per event — typically 1-2 ns — which is negligible compared to the operation being measured. The histogram buckets are updated in-kernel with no eBPF program required.

Aggregation counters are per-CPU (PerCpu<AggregationCounters>), avoiding cross-core cache line bouncing. The per-CPU design ensures fetch_add operations are cache-local — the x86 lock prefix cost is negligible (~1 ns) when the cache line is CPU-local because no cross-core coherence traffic is generated.

Histogram bucket specification: Latency histograms (syscall_latency_hist, block_latency_hist, sched_latency_hist) use 26 log2 buckets covering 1 ns to ~33.5 ms: bucket i spans [2^i ns, 2^(i+1) ns) for i = 0..25, plus a final overflow bucket for values >= 2^25 ns (~33.5 ms). Each bucket is a per-CPU AtomicU64 counter. Total per-CPU memory: 26 buckets x 8 bytes = 208 bytes per histogram. On read (sysfs scrape), per-CPU values are summed to produce a system-wide histogram. The log2 bucketing matches bpftrace @hist() semantics.
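The bucket-index computation reduces to a floor-log2 with clamping. A minimal sketch, assuming the layout above (25 regular buckets plus one overflow bucket; bucket_index and HIST_BUCKETS are illustrative names):

```rust
/// Sketch of the log2 bucket index for the 26-bucket latency histograms:
/// bucket i covers [2^i ns, 2^(i+1) ns); the last bucket is the overflow
/// bucket for latencies >= 2^25 ns (~33.5 ms).
const HIST_BUCKETS: usize = 26;

fn bucket_index(latency_ns: u64) -> usize {
    match latency_ns {
        0 | 1 => 0,                                 // sub-ns clamps into bucket 0
        n if n >= (1u64 << 25) => HIST_BUCKETS - 1, // overflow bucket
        n => (63 - n.leading_zeros()) as usize,     // floor(log2(n))
    }
}

fn main() {
    assert_eq!(bucket_index(1), 0);      // [1, 2) ns
    assert_eq!(bucket_index(1_500), 10); // [1024, 2048) ns
    assert_eq!(bucket_index(u64::MAX), HIST_BUCKETS - 1);
}
```

On the real per-CPU histogram, the hot path would then be a single `buckets[bucket_index(ns)].fetch_add(1, Relaxed)` on the local CPU's counter array, matching the 1-2 ns cost claimed above.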

Observability tools (Prometheus, Grafana agents, etc.) can scrape the aggregation counter files directly without enabling any tracepoints.

20.2.8 Linux Tool Compatibility

bpftrace: Works unmodified. Stable tracepoints appear as standard tracepoints:

bpftrace -e 'tracepoint:umka_stable:syscall_entry { @[args->nr] = count(); }'

perf: Works unmodified. Tracepoints visible via perf list:

perf list 'umka_stable:*'
perf record -e umka_stable:block_complete -a sleep 10

bpftool: Works unmodified. Can list and inspect programs attached to stable tracepoints.

20.2.9 Audit Subsystem

UmkaOS provides a structured audit subsystem for security-relevant events, built on the tracepoint infrastructure (Section 20.2). Unlike Linux's auditd — a separate subsystem with its own filtering language, netlink protocol, and dispatcher daemon — UmkaOS's audit is integrated directly with the capability model and eBPF tracepoints. One mechanism serves both observability and security auditing, operating in one of two explicitly distinct delivery modes:

  • Best-effort mode (default): Audit record emission is always non-blocking. If the per-CPU ring buffer is full, the oldest unread record is overwritten. No thread ever blocks. System availability is prioritized over audit completeness.
  • Strict mode (audit_strict=true boot parameter): Audit record emission blocks with a configurable timeout (default 10 ms) when the ring buffer is full, applying backpressure to maintain gapless per-CPU sequences. This is NOT contradictory with the non-blocking hot path -- in strict mode, blocking occurs only on buffer overflow (a rare backpressure event), not on every record emission.

These modes are detailed in Section 20.2 (delivery semantics) and Section 20.2 (overflow policy).

20.2.9.1 Security Audit Events

Audited events fall into six categories. All events in the Capability and Authentication categories are audited by default; other categories follow configurable policy.

| Category | Events | Example |
|---|---|---|
| Capability | grant, revoke, attenuate, delegate | Process A grants CAP_NET to process B |
| Authentication | login, logout, failed auth | SSH login attempt from 10.0.0.1 |
| Access control | permission denied, capability check | Open /etc/shadow: CAP_READ denied |
| Process lifecycle | exec, exit, setuid-equiv | Process 1234 exec /usr/bin/sudo |
| Driver | load, crash, tier change | NVMe driver crashed, recovering |
| Configuration | sysctl change, policy load | Security policy reloaded |

Each event fires a stable tracepoint in the Security category (TracepointCategory, Section 20.2), so the same eBPF tooling works for security monitoring. Audit tracepoints are non-blocking on the hot path: they enqueue the audit record into a per-CPU lock-free ring buffer (the same zero-overhead static-key mechanism as all other tracepoints). The delivery guarantee is enforced asynchronously by the drain thread and the configured delivery mode (Section 20.2). In strict mode, the producer blocks only when the ring buffer is full (backpressure), not on every record emission. In best-effort mode (the default), the oldest unread record is overwritten and no thread ever blocks.

20.2.9.2 Audit Record Format

/// Audit event type discriminator.
///
/// UmkaOS audit event IDs use a base of 3000 (`UMKA_AUDIT_BASE`) to avoid collision
/// with all Linux audit message type ranges. Linux's `audit.h` allocates:
///   - 1000-2099: kernel messages (commands, events, anomalies, integrity, etc.)
///   - 2100-2999: userspace messages (`AUDIT_FIRST_USER_MSG2` .. `AUDIT_LAST_USER_MSG2`)
/// By starting at 3000, UmkaOS IDs are above all currently allocated Linux ranges.
/// The Linux compatibility layer (Section 20.2.9.5) translates UmkaOS audit events to
/// standard Linux audit message types (e.g., `AUDIT_AVC`, `AUDIT_USER_AUTH`) for
/// tools like auditd and ausearch.
///
/// Base constant: `UMKA_AUDIT_BASE = 3000`.
/// Each audit category is allocated a range of 10 discriminants for future
/// sub-types (e.g., CapGrant 3000-3009, ProcessLifecycle 3010-3019).
/// Capability events (3000-3002) use 3 of their 10 slots.
/// Syscall audit skips from 3050 to 3070 (AdminAction 3050-3059 reserved;
/// 3060-3069 holds audit meta-events: drain-thread lifecycle, HMAC batch
/// seals, and lost-event markers — see the variants below).
#[repr(u16)]
pub enum AuditEventType {
    /// Capability grant (cap_id, target_pid, permissions).
    CapGrant        = 3000,
    /// Capability denial (cap_id, requested_permissions, reason).
    CapDeny         = 3001,
    /// Capability revocation (cap_id, holder_pid).
    CapRevoke       = 3002,
    /// Process lifecycle (fork, exec, exit).
    ProcessLifecycle = 3010,
    /// Security policy change (module load/unload, policy update).
    PolicyChange    = 3020,
    /// Driver isolation event (tier change, crash, recovery).
    DriverIsolation = 3030,
    /// Authentication event (login, sudo, key use).
    Auth            = 3040,
    /// Administrative action (sysctl, mount, module load).
    AdminAction     = 3050,
    /// Syscall audit record (syscall number, args, result).
    Syscall         = 3070,
    /// File access audit record (path, inode, mode, ownership).
    FileAccess      = 3080,
    /// Process exec audit record (argv entries).
    Exec            = 3090,
    /// Audit drain thread started (meta-event, Section 20.2.9.3).
    /// Emitted by the audit subsystem's initialization code to make
    /// the drain thread's lifecycle observable even though its internal
    /// operations are not individually audited (to prevent self-deadlock).
    AuditDrainStart = 3060,
    /// Audit drain thread stopped (meta-event, Section 20.2.9.3).
    AuditDrainStop  = 3061,
    /// Batch HMAC seal marker (Section 20.2.9.4).
    HmacBatchSeal   = 3062,
    /// Synthetic record emitted when audit events are dropped due to
    /// ring-full backpressure. Carries the count of lost events since
    /// the last LostEvents record. See `maybe_emit_lost_record()`.
    LostEvents      = 3063,
}

/// A single audit record. Fixed-size header followed by variable-length detail.
/// Written atomically to the per-CPU audit ring buffer.
///
/// Kernel-internal layout, not part of KABI.
///
/// Field order is chosen for natural alignment with `#[repr(C)]`: all u64 fields
/// first, then u32 fields, then u16 fields, then u8 fields, with explicit padding
/// to avoid compiler-inserted holes. Total header size: 56 bytes (no hidden padding).
#[repr(C)]
pub struct TamperAuditRecord {
    // --- 8-byte aligned fields (offset 0) ---

    /// Monotonic nanosecond timestamp (`CLOCK_MONOTONIC_RAW` — raw hardware
    /// clock, NOT adjusted for time namespace offsets). Audit timestamps are
    /// always in the init time namespace for cross-container consistency.
    /// Same clock source as tracepoint timestamps.
    pub timestamp_ns: u64,
    /// Per-CPU monotonically increasing sequence number. Each CPU maintains
    /// its own independent sequence counter. In strict mode, sequences are
    /// gapless (producers block on overflow); in best-effort mode, gaps
    /// indicate lost records and are detected via sequence discontinuities.
    /// On drain, the merge-sort thread interleaves per-CPU chains and
    /// produces a combined audit log with (cpu_id, per_cpu_sequence)
    /// tuples for total ordering reconstruction.
    pub sequence: u64,
    /// Capability handle used (or attempted) for this operation.
    /// `CapHandle` is a 64-bit opaque handle indexing into the process's
    /// capability table (Section 9.1.1, umka-core capability model). User space
    /// holds handles, never raw capability data. Defined as:
    /// `pub struct CapHandle(u64);`
    pub subject_cap: CapHandle,
    /// Opaque identifier for the target resource (inode, device, endpoint).
    pub object_id: u64,

    // --- 4-byte aligned fields (offset 32) ---

    /// Process ID of the subject in the init PID namespace (global PID).
    /// Always reports the init_pid_ns value for compatibility with Linux
    /// audit tools (`ausearch -p`). Container-local PIDs are available
    /// via the `detail` payload when the audit context includes a PID
    /// namespace other than init_pid_ns.
    pub pid: u32,
    /// Thread ID of the subject in the init PID namespace (global TID).
    pub tid: u32,
    /// User ID of the subject (mapped from capability-based identity).
    pub uid: u32,

    // --- 2-byte aligned fields (offset 44) ---

    /// CPU that emitted this record. Together with `sequence`, provides a
    /// globally unique, tamper-evident record identifier.
    pub cpu_id: u16,
    /// The audit event type (which category + specific event).
    pub event_type: AuditEventType,
    /// Length of the variable-length detail payload that follows.
    pub detail_len: u16,

    // --- 1-byte aligned fields (offset 50) ---

    /// Outcome of the audited operation.
    pub result: AuditResult,

    /// Record-level flags. Bitfield:
    /// - `AUDIT_DETAIL_TRUNCATED` (0x01): The variable-length detail payload was
    ///   truncated because it exceeded the per-record detail size limit. Consumers
    ///   must treat the TLV payload as incomplete when this flag is set.
    pub flags: u8,

    /// Explicit padding to 8-byte alignment boundary (56 bytes total).
    /// Prevents compiler-inserted hidden padding in `#[repr(C)]`.
    pub _pad: [u8; 4],

    // Variable-length detail follows immediately after `_pad`, encoded
    // in TLV (Type-Length-Value) format. See "Detail TLV Encoding" below.
}
// TamperAuditRecord fixed header: u64(8)*4 + u32(4)*3 + u16(2)*3 + u8(1)*2 + [u8;4] = 56 bytes.
// Audit ring buffer record header — consumed by auditd/ausearch tools.
const_assert!(core::mem::size_of::<TamperAuditRecord>() == 56);

/// Detail payload was truncated due to per-record size limit.
pub const AUDIT_DETAIL_TRUNCATED: u8 = 0x01;

Detail TLV Encoding. The variable-length detail payload uses a compact key-value encoding. Each pair is:

+-------------------+--------------------+-------------------+--------------------+
| key_len (u16, LE) | key_bytes (UTF-8)  | val_len (u16, LE) | val_bytes (UTF-8)  |
+-------------------+--------------------+-------------------+--------------------+

Pairs are repeated contiguously until detail_len bytes are consumed. Maximum total detail payload: 4096 bytes. Records with payloads exceeding 4096 bytes are truncated and tagged with the AUDIT_DETAIL_TRUNCATED flag in the record header.

Required keys per AuditEventType:

| AuditEventType | Required keys |
|---|---|
| Syscall | "syscall", "success", "exit", "ppid", "pid", "uid", "gid", "comm", "exe" |
| FileAccess | "name", "inode", "dev", "mode", "ouid", "ogid" |
| Exec | "argc", "a0", "a1", "a2", "a3" (first 4 argv entries) |
| CapGrant / CapDeny | "cap", "target_obj", "perm_bits" |
| Auth | "op", "acct", "hostname", "addr", "terminal" |
| AdminAction | "op", "key", "old_val", "new_val" |

Values are encoded as UTF-8 strings (numeric values are formatted as decimal ASCII). The drain thread and userspace audit consumers parse the TLV stream using key_len/val_len boundaries — no separator characters or escaping are needed. Zero-length keys are invalid; zero-length values are permitted (key present but value empty).
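The TLV layout above can be exercised in userspace with a few lines. A sketch following that layout — tlv_encode / tlv_decode are hypothetical helper names, not kernel API:

```rust
/// Sketch of the detail TLV encoding: each pair is
/// key_len (u16, LE) | key bytes | val_len (u16, LE) | val bytes,
/// repeated contiguously with no separators or escaping.

fn tlv_encode(pairs: &[(&str, &str)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (k, v) in pairs {
        out.extend_from_slice(&(k.len() as u16).to_le_bytes());
        out.extend_from_slice(k.as_bytes());
        out.extend_from_slice(&(v.len() as u16).to_le_bytes());
        out.extend_from_slice(v.as_bytes());
    }
    out
}

fn tlv_decode(mut buf: &[u8]) -> Vec<(String, String)> {
    let mut pairs = Vec::new();
    while buf.len() >= 2 {
        let klen = u16::from_le_bytes([buf[0], buf[1]]) as usize;
        let key = String::from_utf8(buf[2..2 + klen].to_vec()).unwrap();
        buf = &buf[2 + klen..];
        let vlen = u16::from_le_bytes([buf[0], buf[1]]) as usize;
        let val = String::from_utf8(buf[2..2 + vlen].to_vec()).unwrap();
        buf = &buf[2 + vlen..];
        pairs.push((key, val));
    }
    pairs
}

fn main() {
    let enc = tlv_encode(&[("syscall", "59"), ("comm", "bash")]);
    assert_eq!(enc.len(), 25); // (2+7+2+2) + (2+4+2+4) bytes
    let dec = tlv_decode(&enc);
    assert_eq!(dec[0], ("syscall".to_string(), "59".to_string()));
    assert_eq!(dec[1].1, "bash");
}
```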

#[repr(u8)]
pub enum AuditResult {
    /// Operation succeeded.
    Success = 0,
    /// Operation denied by capability check.
    Denied  = 1,
    /// Operation failed for a non-security reason.
    Error   = 2,
}

Records are written to a per-CPU byte ring buffer (1 MiB capacity, using the SPSC ring protocol from Section 3.6) — an NMI-safe lock-free ring. The ring is NOT SpscRing<TamperAuditRecord, N> (a fixed-size element ring), because TamperAuditRecord has a variable-length detail payload; instead, a byte-granularity ring with the framing protocol described below is used. Each CPU maintains its own monotonic sequence counter, avoiding a global atomic counter that would defeat per-CPU parallelism.

Rings are drained to persistent storage by a pool of N drain threads (default: max(2, num_cpus / 64), tunable via umka.audit.drain_threads). CPU rings are statically partitioned among drain threads (CPUs 0..K to thread 0, K+1..2K to thread 1, etc.) to avoid contention. Each drain thread merges and sorts its assigned per-CPU chains by timestamp; a final lightweight merge of the pre-sorted per-thread outputs produces the combined record stream with (cpu_id, per_cpu_sequence) tuples for total ordering.

Timestamp ordering: The merge-sort uses (timestamp_ns, cpu_id, per_cpu_sequence) as the composite sort key — timestamp is primary, with CPU ID and per-CPU sequence as tiebreakers. TSC synchronization (Section 7.8) bounds cross-CPU skew to <100ns on modern hardware. For systems with larger skew (NUMA, pre-invariant-TSC), the drain thread applies a per-CPU offset correction table calibrated at boot.
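The composite sort key can be sketched directly. RecordKey / merge_records are illustrative names, and a real drain thread would do an N-way streaming merge of pre-sorted per-CPU chains rather than a full sort:

```rust
/// Sketch of the drain-side total ordering: records from all CPUs are
/// ordered by (timestamp_ns, cpu_id, sequence) — timestamp primary,
/// CPU ID and per-CPU sequence as tiebreakers.

#[derive(Clone, Copy)]
struct RecordKey {
    timestamp_ns: u64,
    cpu_id: u16,
    sequence: u64, // per-CPU monotonic sequence
}

fn merge_records(mut records: Vec<RecordKey>) -> Vec<RecordKey> {
    records.sort_by_key(|r| (r.timestamp_ns, r.cpu_id, r.sequence));
    records
}

fn main() {
    let merged = merge_records(vec![
        RecordKey { timestamp_ns: 200, cpu_id: 0, sequence: 7 },
        RecordKey { timestamp_ns: 100, cpu_id: 1, sequence: 3 },
        RecordKey { timestamp_ns: 100, cpu_id: 0, sequence: 4 }, // tie breaks on cpu_id
    ]);
    assert_eq!(merged[0].cpu_id, 0);
    assert_eq!(merged[1].cpu_id, 1);
    assert_eq!(merged[2].timestamp_ns, 200);
}
```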

Variable-length record framing with overwrite safety. Each TamperAuditRecord has a fixed-size 56-byte header followed by a variable-length detail payload (length given by detail_len). Since records vary in total size, the ring buffer uses a composed framing and overwrite-safety protocol. Each ring slot has the following byte layout:

+-------------+-------------+-------------------------------------------+------------+
| start_seq   | length      | TamperAuditRecord header + detail payload | end_seq    |
| (u32, 4B)   | (u32, 4B)   | (56 + detail_len bytes)                   | (u32, 4B)  |
+-------------+-------------+-------------------------------------------+------------+
 ^                           ^                                           ^
 Overwrite safety            Framing (record boundary)                   Overwrite safety

Framing protocol (record boundary detection):

  • The length field (second u32) stores the total size of the framed entry: 4 (start_seq) + 4 (length) + 56 (header) + detail_len (payload) + 4 (end_seq).
  • When a record would wrap around the end of the ring buffer (remaining space < total framed size), a skip marker is written: a slot where length == 0. This signals the consumer to skip the remaining bytes and continue from offset 0. The record is then written starting at offset 0.
  • Consumers read the length field at the current position. If length == 0 (skip marker), they advance to the buffer start. Otherwise, they read length bytes as a complete framed record.
  • The skip-marker value 0 cannot conflict with start_seq because start_seq is the first u32 in the slot while length is the second. Consumers distinguish them by position within the slot layout.
  • This skip-marker approach wastes at most max_record_size - 1 bytes per wrap-around but avoids the complexity of split-record handling in the lock-free read path. The wasted space is bounded because detail_len is capped at 4096 bytes (records with larger payloads are truncated and tagged with the AUDIT_DETAIL_TRUNCATED flag).
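The framing arithmetic and wrap decision above reduce to a few lines. A sketch with hypothetical names (framed_size, write_offset); the actual skip-marker write is omitted for brevity:

```rust
/// Sketch of the audit ring framing arithmetic:
/// start_seq (4) + length (4) + 56-byte header + detail + end_seq (4).
const HEADER_SIZE: usize = 56; // TamperAuditRecord fixed header

fn framed_size(detail_len: usize) -> usize {
    4 + 4 + HEADER_SIZE + detail_len + 4
}

/// Offset at which the next record is written: the current offset if the
/// framed entry fits before the ring end, otherwise 0 (after the producer
/// writes a length == 0 skip marker at the current offset).
fn write_offset(ring_size: usize, offset: usize, detail_len: usize) -> usize {
    if ring_size - offset >= framed_size(detail_len) {
        offset
    } else {
        0 // wrap: skip marker written, record restarts at buffer start
    }
}

fn main() {
    assert_eq!(framed_size(0), 68);
    assert_eq!(framed_size(100), 168);
    // 1 MiB ring with 200 bytes left: a 100-byte detail fits (168 <= 200).
    assert_eq!(write_offset(1 << 20, (1 << 20) - 200, 100), (1 << 20) - 200);
    // Only 100 bytes left: wrap to offset 0.
    assert_eq!(write_offset(1 << 20, (1 << 20) - 100, 100), 0);
}
```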

Overwrite safety protocol (torn-read prevention):

  • Each entry is bracketed by start_seq (first u32) and end_seq (last u32).
  • The producer writes each entry in four steps:
      1. start_seq = write_seq | 1 (odd = write in progress)
      2. length + entry data
      3. end_seq = write_seq (even = commit target)
      4. start_seq = write_seq (even = write complete, matches end_seq)
    After step 4, write_seq is incremented by 2.
  • The consumer reads start_seq, copies the entry, reads end_seq. If start_seq != end_seq or start_seq is odd, the entry was being overwritten during the read — the consumer discards it and advances to the next entry.
  • The key invariant: when a write is complete, both start_seq and end_seq hold the same even value. During a write, start_seq is odd. A concurrent overwrite will first set start_seq to a new odd value, which the consumer detects on the post-copy check.
  • This ensures consumers never observe torn data. The two protocols compose cleanly: framing uses the length field to find record boundaries, while the sequence stamps at the slot's edges detect concurrent overwrites.
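The consumer-side validity check from the protocol above fits in a single predicate. A sketch (entry_valid is an illustrative name; the real consumer performs the copy between the two sequence reads):

```rust
/// Sketch of the overwrite-safety check: an entry copied by the consumer is
/// valid only if the sequence stamps at the slot's edges match and are even
/// (odd start_seq = write in progress; mismatch = overwritten mid-read).

fn entry_valid(start_seq: u32, end_seq: u32) -> bool {
    start_seq == end_seq && start_seq % 2 == 0
}

fn main() {
    assert!(entry_valid(4, 4));  // complete write: even, matching stamps
    assert!(!entry_valid(5, 4)); // write in progress: start_seq is odd
    assert!(!entry_valid(6, 4)); // overwritten during read: stamps mismatch
}
```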

Sequence numbers are monotonically increasing within each CPU's stream. The gap-detection behavior depends on the configured delivery mode:

  • Strict mode (audit_strict=true): Sequence numbers are gapless. The producing thread blocks (with configurable timeout, default 10 ms) when the ring buffer is full, applying backpressure to maintain the gapless invariant. Any gap in a single CPU's sequence under strict mode indicates a bug or tampering — the audit subsystem treats this as a critical security event, emitting a synthetic LostEvents record (AuditEventType::LostEvents) and raising an alert via the FMA subsystem (Section 20.1).
  • Best-effort mode (default): Gaps are possible when the ring buffer overflows (oldest records are overwritten). Consumers detect lost records by observing discontinuities in the per-CPU sequence numbers. The records_dropped counter (per-CPU AtomicU64) tracks how many records were lost, and the drain thread includes the drop count in periodic "audit health" meta-records so that log consumers can quantify the gap.
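Best-effort gap detection reduces to summing discontinuities in a CPU's sequence stream. A minimal sketch (count_lost is a hypothetical consumer-side helper):

```rust
/// Sketch of best-effort-mode gap detection: given the (monotonically
/// increasing) per-CPU sequence numbers actually read by a consumer,
/// the number of overwritten records is the sum of the gaps.

fn count_lost(sequences: &[u64]) -> u64 {
    sequences
        .windows(2)
        .map(|w| w[1] - w[0] - 1) // a gap of k records contributes k
        .sum()
}

fn main() {
    assert_eq!(count_lost(&[1, 2, 3, 4]), 0);     // gapless (strict-mode invariant)
    assert_eq!(count_lost(&[1, 2, 5, 6, 10]), 5); // records 3-4 and 7-9 overwritten
}
```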

20.2.9.3 Audit Policy Engine

The audit policy engine determines which events are recorded. Default policy:

  • Always audit: all capability denials, all authentication events, all exec calls.
  • Configurable: capability grants, driver lifecycle, configuration changes.

Policy is expressed as eBPF programs attached to audit tracepoints — the same mechanism used for security monitoring (Section 9.8). The delivery guarantee depends on the configured mode: best-effort (default) or strict (guaranteed delivery).

Overflow policy. The two delivery modes described above (Section 20.2) govern overflow behavior:

  • Best-effort mode (default, audit_strict=false): Drop-oldest policy. When the per-CPU ring buffer is full, the oldest unread record is overwritten. The kernel prioritizes system availability over audit completeness.
  • Strict mode (audit_strict=true boot parameter): The emitting thread blocks with a configurable timeout (default 10 ms) rather than overwriting records, maintaining gapless per-CPU sequences. To prevent self-deadlock, the drain thread is exempt from auditing its own operations via the non-audited I/O path:
  • The drain thread writes directly to a dedicated audit partition (configured at boot via audit_log_dev= kernel parameter) using raw block I/O (submit_bio directly to the block device) rather than the VFS write() syscall path that would trigger audit events. The audit partition must be a separate block device or partition — not a file on a journaled filesystem — to avoid bypassing filesystem journaling (raw submit_bio to data blocks on a journaled filesystem like ext4/XFS would corrupt the journal's consistency guarantees). The audit subsystem manages its own simple log-structured layout on this partition (sequential append with a header containing magic, version, and write offset).
  • On systems where a dedicated partition is unavailable, the fallback audit_log_path= parameter specifies a regular file, but in this mode the drain thread uses the VFS write path with the PF_NOAUDIT flag (no audit recursion) instead of raw submit_bio. This is slower but preserves filesystem integrity.
  • The block device's I/O completion path is also marked non-auditable via a per-thread flag (PF_NOAUDIT) set on the drain thread at creation.
  • This design ensures the drain thread can always make forward progress even when the audit ring buffers are full, breaking the circular dependency. To mitigate the resulting audit blind spot, the drain thread's own activation and deactivation are logged as meta-events (type AuditDrainStart and AuditDrainStop, Section 20.2) by the audit subsystem's initialization code, not by the drain thread itself. This ensures the drain thread's lifecycle is observable even though its internal operations are not individually audited.
  • RT tasks (SCHED_FIFO / SCHED_RR) always use the drop-oldest policy regardless of strict mode, to preserve real-time scheduling guarantees.

Timeout behavior in strict mode: After the 10 ms timeout expires and the ring buffer is still full, the current record IS emitted by force-evicting the oldest unread entry (converting temporarily to best-effort for that single entry). This preserves the guarantee that the current event is never lost — only the oldest, already-in-buffer event can be evicted (which the drain thread is expected to have forwarded already). The eviction is tracked via two mechanisms: a per-CPU forced_eviction_count: AtomicU64 counter (incremented on each forced eviction), and a per-CPU missed_sequences sideband buffer that records the sequence number of each evicted entry. The drain thread reads the missed_sequences buffer and emits a synthetic AUDIT_LOST record for each evicted sequence, ensuring the gap is visible to verifiers even though the evicted record's content is irretrievably lost. This approach avoids both deadlock (the producing thread never blocks indefinitely) and silent loss of the current security event (the event being audited right now is the one most likely to be attacker-relevant).
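The forced-eviction bookkeeping can be sketched as below, with `VecDeque` standing in for the per-CPU ring. The field names (`forced_eviction_count`, `missed_sequences`) follow the text; the real structures are per-CPU and use atomic counters.

```rust
use std::collections::VecDeque;

struct AuditRing {
    entries: VecDeque<(u64, Vec<u8>)>, // (sequence number, record bytes)
    capacity: usize,
    forced_eviction_count: u64,
    missed_sequences: Vec<u64>, // sideband buffer drained into AUDIT_LOST records
}

impl AuditRing {
    /// Called only after the strict-mode timeout has expired with the ring
    /// still full: evict the oldest unread entry so the *current* record
    /// (the one most likely to be attacker-relevant) is never lost.
    fn push_with_forced_eviction(&mut self, seq: u64, record: Vec<u8>) {
        if self.entries.len() == self.capacity {
            let (evicted_seq, _) = self.entries.pop_front().expect("ring is full");
            self.forced_eviction_count += 1;
            self.missed_sequences.push(evicted_seq);
        }
        self.entries.push_back((seq, record));
    }
}

fn main() {
    let mut ring = AuditRing {
        entries: VecDeque::new(),
        capacity: 2,
        forced_eviction_count: 0,
        missed_sequences: Vec::new(),
    };
    ring.push_with_forced_eviction(0, vec![1]);
    ring.push_with_forced_eviction(1, vec![2]);
    ring.push_with_forced_eviction(2, vec![3]); // full: seq 0 is force-evicted
    assert_eq!(ring.forced_eviction_count, 1);
    assert_eq!(ring.missed_sequences, vec![0]); // drain emits AUDIT_LOST for seq 0
}
```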

/// Attach an audit policy program. Unlike debug tracepoints, audit programs
/// participate in the guaranteed-delivery path.
pub fn attach_audit_policy(
    tracepoint: &'static StableTracepoint,
    prog: &BpfProg,
) -> Result<AuditPolicyHandle, AuditError> {
    // Verified eBPF program acts as filter: returns true to audit, false to skip.
    // The program can inspect all tracepoint arguments to make the decision.
    // ...
}

Rate limiting. To prevent audit log flooding from misbehaving or malicious processes, each event type has a configurable rate limit (default: 10 000 events/second per type). When the rate limit is hit, the audit subsystem coalesces events into a single summary record (e.g., "5 327 additional CAP_READ denials from pid 4001 suppressed in last 1s") rather than silently dropping them. The summary record consumes a sequence number in the per-CPU stream, preserving the sequence continuity invariant (in strict mode) while bounding log volume.

The rate limiter is structured as a three-layer defense against DoS via high-rate audit-generating workloads (e.g., a process rapidly calling audited syscalls to fill the ring and stall other tenants):

/// Audit event rate limiter. Prevents DoS via high-rate audit-generating workloads
/// (e.g., a process rapidly calling audited syscalls to fill the ring and stall others).
///
/// **Three-layer defense**:
/// 1. Per-cgroup token bucket: limits events from one cgroup.
/// 2. Global byte-rate limiter: limits total audit throughput.
/// 3. Emergency throttle: drops events (with an "audit: lost N events" record) when
///    the ring consumer (auditd) is too slow.
pub struct AuditRateLimiter {
    /// Per-cgroup token bucket. Key: cgroup ID (u64).
    /// Each cgroup gets `AUDIT_PER_CGROUP_BURST` tokens; refilled at `AUDIT_PER_CGROUP_RATE_EPS`.
    ///
    /// XArray: O(1) lookup by integer cgroup ID with native RCU-compatible reads.
    /// XArray's internal lock replaces the external SpinLock. N bounded by active
    /// cgroup count (typically <100).
    /// SpinLock wraps each TokenBucket because `try_consume(&mut self)`
    /// requires exclusive access, but `xa_load()` returns `&T` (shared ref).
    pub per_cgroup: XArray<SpinLock<TokenBucket>>,

    /// Global byte-rate token bucket: limits total audit data throughput.
    /// Prevents one audit-heavy cgroup from consuming the entire ring even within
    /// its per-cgroup event limit (e.g., via large audit records).
    pub global_byte_rate: SpinLock<TokenBucket>,

    /// Count of events dropped since last "audit: lost N" message.
    pub events_lost: AtomicU64,

    /// Count of bytes dropped since last "audit: lost N bytes" message.
    pub bytes_lost: AtomicU64,
}

/// Token bucket for rate limiting.
/// Tokens represent events (for per-cgroup limiter) or bytes (for global limiter).
pub struct TokenBucket {
    /// Current token count (0..=capacity).
    pub tokens: u64,
    /// Maximum token count (burst size).
    pub capacity: u64,
    /// Token refill rate in tokens per second.
    pub refill_rate_per_sec: u64,
    /// Timestamp of last refill (nanoseconds since boot).
    pub last_refill_ns: u64,
}

impl TokenBucket {
    /// Try to consume `n` tokens. Returns true if tokens were available (event allowed).
    /// Refills tokens based on elapsed time since last refill before checking.
    pub fn try_consume(&mut self, n: u64, now_ns: u64) -> bool {
        let elapsed_ns = now_ns.saturating_sub(self.last_refill_ns);
        // Use u128 intermediate to avoid u64 overflow. Without this, the
        // multiplication overflows after ~29 minutes of inactivity with
        // refill_rate_per_sec = 10_485_760 (AUDIT_GLOBAL_BYTE_RATE), silently
        // producing a much smaller value and incorrectly rate-limiting events.
        let new_tokens = ((elapsed_ns as u128) * (self.refill_rate_per_sec as u128)
            / 1_000_000_000) as u64;
        // saturating_add: guards against u64 overflow when the computed
        // new_tokens value is itself very large (the min() cap applies after).
        self.tokens = self.tokens.saturating_add(new_tokens).min(self.capacity);
        self.last_refill_ns = now_ns;
        if self.tokens >= n {
            self.tokens -= n;
            true
        } else {
            false
        }
    }
}

/// Per-cgroup audit rate: 1000 events/second with a burst of 5000.
pub const AUDIT_PER_CGROUP_RATE_EPS: u64 = 1_000;
pub const AUDIT_PER_CGROUP_BURST: u64 = 5_000;

/// Global audit byte rate: 10 MB/second with a burst of 50 MB.
pub const AUDIT_GLOBAL_BYTE_RATE: u64 = 10 * 1024 * 1024;
pub const AUDIT_GLOBAL_BYTE_BURST: u64 = 50 * 1024 * 1024;
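A worked example of the refill arithmetic with the global byte-rate constant, showing both the normal case and why the u128 intermediate matters (`refill` restates the multiplication from `try_consume`):

```rust
// Token refill: elapsed_ns * rate / 1e9, with a u128 intermediate.
fn refill(elapsed_ns: u64, rate_per_sec: u64) -> u64 {
    ((elapsed_ns as u128 * rate_per_sec as u128) / 1_000_000_000) as u64
}

fn main() {
    const AUDIT_GLOBAL_BYTE_RATE: u64 = 10 * 1024 * 1024; // 10 MB/s

    // 100 ms of inactivity accrues one tenth of the per-second rate: 1 MiB.
    assert_eq!(refill(100_000_000, AUDIT_GLOBAL_BYTE_RATE), 1024 * 1024);

    // 30 minutes of inactivity: a plain u64 multiply would overflow...
    let elapsed_ns: u64 = 30 * 60 * 1_000_000_000;
    assert!(elapsed_ns.checked_mul(AUDIT_GLOBAL_BYTE_RATE).is_none());
    // ...while the u128 intermediate yields the correct token count
    // (capped to `capacity` by the caller).
    assert_eq!(refill(elapsed_ns, AUDIT_GLOBAL_BYTE_RATE), 1800 * AUDIT_GLOBAL_BYTE_RATE);
}
```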

Rate limiting enforcement in the audit event submission path:

fn audit_submit(event: AuditEvent, cgroup: CgroupId) -> AuditResult {
    let now_ns = clock_monotonic_ns();
    let event_size = event.serialized_size();

    // 1. Per-cgroup token bucket check (XArray API)
    // Use xa_store_if_absent to avoid TOCTOU race: two CPUs racing to create
    // the first bucket for the same cgroup would both xa_load(miss), both
    // xa_store(new bucket), and the second store silently overwrites the first,
    // resetting the token count. xa_store_if_absent is atomic: if another CPU
    // inserted between our miss and our store, we get the existing entry back.
    let bucket = match limiter.per_cgroup.xa_load(cgroup) {
        Some(b) => b,
        None => {
            let b = default_bucket();
            limiter.per_cgroup.xa_store_if_absent(cgroup, b)
        }
    };
    if !bucket.lock().try_consume(1, now_ns) {
        limiter.events_lost.fetch_add(1, Relaxed);
        return AuditResult::DroppedRateLimit;
    }

    // 2. Global byte-rate check
    if !limiter.global_byte_rate.lock().try_consume(event_size as u64, now_ns) {
        limiter.bytes_lost.fetch_add(event_size as u64, Relaxed);
        return AuditResult::DroppedRateLimit;
    }

    // 3. Ring fullness — drop-oldest policy (overwrite oldest unread record).
    // Drop-newest would allow an attacker to fill the ring with benign events
    // to suppress audit of malicious actions. Drop-oldest ensures the most
    // recent events are always available for forensic analysis.
    if audit_ring.is_full() {
        audit_ring.advance_tail(1);  // discard oldest unread record
        limiter.events_lost.fetch_add(1, Relaxed);
        maybe_emit_lost_record(&limiter);
    }

    audit_ring.push(event);
    AuditResult::Accepted
}

The maybe_emit_lost_record function emits a synthetic AuditEventType::LostEvents record into the ring whenever events_lost crosses a reporting threshold (default: every 100 dropped events or every 1 second since the last report, whichever comes first). This preserves the property that sequence gaps in strict mode always have an explicit explanation record, even when the explanation itself is the loss record.
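The threshold decision inside maybe_emit_lost_record can be sketched as below. `LostReporter` is a hypothetical single-threaded stand-in; the real counters are per-CPU atomics.

```rust
struct LostReporter {
    dropped_since_report: u64,
    last_report_ns: u64,
}

impl LostReporter {
    /// Record one dropped event; returns true when a synthetic
    /// LostEvents record should be emitted now (every 100 drops or
    /// every 1 second since the last report, whichever comes first).
    fn record_drop(&mut self, now_ns: u64) -> bool {
        const DROP_THRESHOLD: u64 = 100;
        const REPORT_INTERVAL_NS: u64 = 1_000_000_000;
        self.dropped_since_report += 1;
        if self.dropped_since_report >= DROP_THRESHOLD
            || now_ns.saturating_sub(self.last_report_ns) >= REPORT_INTERVAL_NS
        {
            self.dropped_since_report = 0;
            self.last_report_ns = now_ns;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut r = LostReporter { dropped_since_report: 0, last_report_ns: 0 };
    // 99 rapid drops stay below the count threshold...
    for _ in 0..99 {
        assert!(!r.record_drop(1)); // now = 1 ns: interval not yet elapsed
    }
    // ...the 100th triggers a report.
    assert!(r.record_drop(1));
    // A single drop more than 1 s after the last report also triggers one.
    assert!(r.record_drop(1 + 1_500_000_000));
}
```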

20.2.9.4 Tamper-Evident Log Chain

Each CPU maintains an independent HMAC chain over its audit records, making retroactive tampering detectable without cross-CPU synchronization:

Initial:   key[0] = HMAC-SHA256(boot_secret, cpu_id || "audit-chain-v1")
Evolution: key[n] = HMAC-SHA256(key[n-1], "evolve")
After computing key[n], key[n-1] is securely erased (zeroed).
Per-record: hmac[n] = HMAC-SHA256(key[n], hmac[n-1] || serialize(record[n]))

Important: The HMAC is NOT stored in the per-CPU ring buffer. The TamperAuditRecord struct in the ring buffer contains only the event data (timestamp, event type, subject, object, detail). The HMAC is computed by the drain thread when it reads records from the per-CPU ring buffer and writes them to the persistent audit log. In the persistent format, each record (strict mode) or each batch (batched mode) is stored alongside its HMAC. This separation keeps the hot-path ring buffer append fast (no cryptographic operations on the recording CPU) while ensuring tamper evidence in the durable log.

Performance cost. HMAC-SHA256 on a typical 56-byte audit record header plus variable-length detail (average ~200-300 bytes total) takes approximately 200-500 ns per record on modern hardware. At sustained high audit rates (100K records/sec in an audit-heavy workload), this costs 20-50 ms/sec of CPU time, roughly 2-5% of one core. For most workloads (1K-10K records/sec) the cost is negligible. To mitigate the cost for non-security-critical events, HMAC computation is configurable per event category: security-critical categories (CapGrant, CapDeny, CapRevoke, Auth) always use strict mode (per-record HMAC with per-record key evolution); other categories can be configured to use batched mode or no HMAC at all. The two HMAC modes operate as follows:

  • Strict HMAC mode (per-record): Each record receives its own HMAC. The HMAC key evolves after every record: key[n] = HMAC-SHA256(key[n-1], "evolve"), and key[n-1] is erased. This provides per-record tamper evidence and forward secrecy.
  • Batched HMAC mode (per-batch): A single HMAC covers N consecutive records (default N=16). All records in a batch are serialized and HMACed together under a single key. The key evolves per-batch (not per-record): after computing the HMAC for batch B, the key evolves once to produce the key for batch B+1, and the previous key is erased. This amortizes the cryptographic cost across N records while still providing tamper evidence at batch granularity and forward secrecy at batch boundaries. Within a batch, individual record tampering is still detectable because the batch HMAC covers all records in sequence.

Batch framing: In batched mode, the drain thread accumulates N records and then writes a batch seal — a special entry with AuditEventType::HmacBatchSeal containing the batch HMAC, the batch sequence range (first and last sequence numbers covered), and the evolved key's public commitment (SHA-256 of the next key, enabling verifiers to detect key-evolution breaks). Individual records within a batch carry no HMAC; tamper evidence is provided solely by the batch seal. The verifier reads records from the persistent log until it encounters an HmacBatchSeal, verifies the batch HMAC over all preceding unsealed records (by re-serializing and re-HMACing them), then continues to the next batch. If the log ends mid-batch (e.g., crash before the drain thread wrote the seal), the trailing unsealed records are flagged as unverifiable in the verification report.

The per-category HMAC policy is set via /proc/umka/audit/hmac_policy.

The boot_secret is derived from: (a) TPM-sealed entropy if a TPM is present (preferred), or (b) hardware RNG (RDRAND/RNDR/platform-specific entropy source) collected during early boot if no TPM. There is no static fallback key embedded in the kernel image, because a key readable from the kernel binary would defeat tamper evidence (any attacker with access to the image could forge the HMAC chain). On systems without both TPM and hardware RNG, the audit subsystem generates a random seed from whatever entropy is available at boot (interrupt timing jitter, memory contents) and stores it only in kernel memory. This seed is lost on reboot, which means the HMAC chain cannot be verified across reboots without a TPM. However, within a single boot the chain provides full tamper evidence, and the inability to verify across reboots is an acceptable tradeoff: it prevents an attacker who obtains the kernel image from forging audit records. For offline log verification, the SHA-256 hash of the boot_secret (not the secret itself) is recorded in the first audit record of each boot, allowing a verifier with the original secret to replay the HMAC chain.

Each CPU's HMAC chain starts from a known initial value at boot, seeded with the CPU ID to ensure chains are distinguishable. The key material is stored in umka-core kernel memory, protected by the kernel's core isolation domain (inaccessible to drivers and all userspace). On x86, this corresponds to PKEY 0; on other architectures, equivalent protection is provided by the platform's isolation mechanism (Section 11.2). Even a compromised Tier 2 driver cannot read or forge the audit HMAC.

Tier 1 caveat: Tier 1 drivers run in Ring 0 and can in principle execute the platform's domain-switch instruction (e.g., WRPKRU on x86) to access PKEY 0 memory. This is a documented tradeoff of the Tier 1 trust model (Section 11.2): Tier 1 drivers are considered trusted for integrity but isolated for fault containment. The "inaccessible to drivers" guarantee above applies fully to Tier 2 drivers (Ring 3, hardware-enforced) and provides fault-isolation-grade protection against Tier 1 drivers (a malicious Tier 1 driver that deliberately bypasses isolation is outside the threat model — such a driver would be moved to Tier 0 or Tier 2, or rejected).

Forward secrecy. In strict HMAC mode, the key evolves after every record: key[n] is derived from key[n-1], and key[n-1] is securely erased (zeroed) immediately after computing key[n]. In batched HMAC mode, the key evolves after every batch (not every record), so forward secrecy granularity is per-batch rather than per-record. In both modes, forward secrecy applies to live key material in kernel memory: compromising the current in-memory HMAC key does not allow forging past records (or past batches), because previous keys have been erased from memory. An attacker who captures the current key can forge records from that point onward but cannot reconstruct earlier keys.

Forward secrecy does not conflict with post-crash verification, because verification is performed by an offline verifier that holds the original boot_secret, not by the running kernel. The verifier derives the full key sequence deterministically from boot_secret (the same KDF used during recording: key[0] = HMAC-SHA256(boot_secret, cpu_id || "audit-chain-v1"), key[n] = HMAC-SHA256(key[n-1], "evolve")). Erased in-memory keys are not recoverable from the running kernel; boot_secret is the durable verification root. On TPM-equipped systems with HMAC key checkpointing enabled (see below), derived chain keys may be written to persistent storage in TPM-encrypted form — forward secrecy in that configuration is bounded by the checkpoint interval (at most 1,000 derivations). On systems without checkpointing, no derived keys are ever persisted. On TPM-equipped systems the boot_secret is TPM-sealed (see "TPM-sealed audit key" below) and is the only secret that must be protected for offline verification. On non-TPM systems, the SHA-256 hash of the boot_secret is recorded in the first audit record so verifiers can confirm they hold the correct secret before replaying the chain.

HMAC key checkpointing: To prevent O(N) re-derivation after a crash, the current HMAC chain key is checkpointed to persistent storage every 1,000 derivations (or every 60 seconds, whichever comes first). The checkpoint is encrypted with a TPM-sealed key (Section 9.3) and stored alongside the audit log. On crash recovery, re-derivation starts from the most recent checkpoint rather than the root — at most 1,000 HMAC derivations (a few hundred microseconds at the per-HMAC costs cited in Section 20.2.9.4) rather than potentially millions. The checkpoint interval is configurable via audit.hmac_checkpoint_interval. The checkpoint itself is a single 64-byte write (32-byte key + 32-byte HMAC of the key for integrity) — negligible I/O overhead.

Crash recovery. If the system crashes mid-chain (e.g., between writing a record's data and computing its HMAC), the HMAC chain must be resumable. On boot, the audit subsystem performs the following recovery procedure for each CPU's persisted chain:

  1. Read the last complete record (one with a valid HMAC field) from the persisted log.
  2. Re-derive key[record_seq] from the boot_secret (which is either TPM-unsealed or re-entered by the administrator for offline verification), and verify that record's HMAC. The boot_secret is the durable root; in-memory keys that were erased during normal operation can always be re-derived from it.
  3. If verification succeeds, the chain is intact up to that record. Any data following the last valid HMAC is treated as an incomplete record: it is preserved in the log but tagged with a special AUDIT_CRASH_TRUNCATED sentinel (hmac field set to all 0xFF bytes) and a flag indicating the record is unverified.
  4. The new boot starts a fresh HMAC chain (new boot_secret, new key[0] per CPU). The first record of the new chain includes a back-reference to the last verified record of the previous boot (boot_id, cpu_id, sequence number) so that verifiers can link chains across reboots.

Records with the AUDIT_CRASH_TRUNCATED sentinel are excluded from HMAC chain verification but remain in the log for forensic analysis. Log consumers and the umka-audit-verify tool recognize the sentinel and report "N crash-truncated records found" rather than treating them as tampering.

Verification. Any consumer of the audit log can verify each per-CPU chain independently by re-deriving the key sequence from boot_secret and replaying the HMAC computation over that CPU's stored records. A mismatch at position N in CPU C's chain proves that record N (or a predecessor on that CPU) was modified or deleted after writing. The per-CPU design means a single CPU's chain can be verified without needing records from other CPUs. The umka-audit-verify tool accepts the boot_secret (or, on TPM systems, unseals it automatically) and performs the full chain replay offline.

Threat model refinement. The HMAC chain prevents retroactive tampering by unprivileged code or compromised Tier 2 drivers — they cannot read the HMAC key from Core memory. A Tier 0/Core compromise (full kernel compromise) CAN read the key and rewrite history. This is an intentional limitation: UmkaOS's audit system provides a strong integrity guarantee against all threats SHORT OF a full kernel compromise. A full kernel compromise is out of scope for audit integrity (if the kernel is compromised, the attacker can modify anything). For regulated environments requiring hardware-backed audit integrity even against kernel compromise, use the TPM-backed audit root described below.

TPM-backed audit root (when TPM 2.0 present). When a TPM 2.0 is available, the audit subsystem uses hardware-backed key management that elevates the integrity guarantee beyond what software-only HMAC chains can provide:

  1. Hardware entropy: At boot, the initial HMAC key key[0] is generated by TPM2_GetRandom() (not by the kernel's CSPRNG) — hardware entropy source. This ensures the key is not derivable from any software-observable state.

  2. PCR-sealed key storage: key[0] is immediately sealed to a TPM PCR policy: TPM2_Create(parent=srk, sensitive=key[0], policy=PCR_7_AND_PCR_10). The plaintext is overwritten (zeroed) after sealing. PCR 7 covers the Secure Boot chain (firmware, bootloader, kernel image); PCR 10 covers the IMA measurement log (all executed code and loaded drivers). If an attacker modifies the kernel or early boot components, the audit key becomes unavailable — the system cannot produce valid audit HMACs, making the compromise visible to remote attestation.

  3. Session-bound unseal: The kernel only holds the sealed blob. To compute HMACs, it calls TPM2_Unseal() which returns the key only if PCRs 7 and 10 match (verified boot + IMA measurement log intact). The unseal call happens once at boot and the key is held in a TPM-protected session (TPM2_StartAuthSession with tpmKey parameter) bound to the audit drain thread. The session is invalidated on any PCR change, preventing a runtime attacker from continuing to use the key after compromising the boot chain.

  4. Hardware monotonic counter: The TPM's hardware monotonic counter (TPM2_NV_Increment on a counter index) is incremented on each audit epoch boundary (every 1000 records), providing a tamper-evident record count that survives reboot. An attacker who replays an old audit log cannot produce a monotonic counter value consistent with the current epoch. The counter value is included in the epoch's batch seal record.

  5. Export with TPM quote: When userspace calls ioctl(AUDIT_EXPORT), the kernel includes the TPM quote (TPM2_Quote over PCRs 7, 10, and the audit counter index) alongside the audit records. An external verifier can check: PCR values -> boot chain integrity -> audit chain validity. The quote is signed by the TPM's attestation key (AK), which chains to the TPM's endorsement key (EK) and the manufacturer's certificate. This provides a hardware-rooted proof that the audit records were produced by a verified kernel on verified hardware.

Append-only enforcement without TPM. In environments without TPM (containers, VMs, embedded systems), the audit records are cryptographically chained (each record includes the HMAC of the previous record). While an attacker with kernel access can recompute the chain, they cannot produce records with timestamps that predate records already exported to an external syslog server. Configure auditd to stream records to a remote syslog in near-real-time (< 1s buffer) via the NETLINK_AUDIT socket (Section 20.2.9.5) for effective append-only semantics in practice. The combination of HMAC chains (detecting tampering between export intervals) and near-real-time export (minimizing the window for undetected tampering) provides a practical integrity guarantee for non-TPM environments.

20.2.9.5 Linux Compatibility

UmkaOS exports audit records in formats that standard Linux audit tools understand, so existing security infrastructure works without modification.

auditctl / ausearch / aureport. UmkaOS translates its audit records to Linux audit format (type=SYSCALL msg=audit(...), type=AVC, type=USER_AUTH, etc.) and delivers them to userspace via the NETLINK_AUDIT socket (see below). The kernel never writes to /var/log/audit/audit.log directly — that is the responsibility of the userspace audit daemon (auditd or go-audit), which receives records from the netlink socket and handles persistence. This avoids circular dependencies (audit subsystem depending on VFS, block I/O, etc.) and maintains kernel/userspace separation. The auditctl command sets filter rules, which are internally compiled to eBPF audit policy programs.

audit netlink socket. UmkaOS implements the NETLINK_AUDIT protocol so that auditd or go-audit can receive events in real time. The translation layer maps UmkaOS capability events to the closest Linux audit message types (e.g., capability denial maps to type=AVC).
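A sketch of such a translation for a capability denial. The `type=AVC msg=audit(SECONDS.MILLIS:SERIAL):` header shape is the standard Linux audit record format; the key=value payload shown here is an illustrative choice, not the exact UmkaOS field set.

```rust
// Hypothetical translation helper: render an UmkaOS capability denial
// as a Linux-audit-format AVC record line.
fn format_cap_denial_avc(
    ts_sec: u64,
    ts_ms: u32,
    serial: u64,
    pid: u32,
    comm: &str,
    capability: &str,
) -> String {
    format!(
        "type=AVC msg=audit({}.{:03}:{}): umka_cap_denial pid={} comm=\"{}\" capability={}",
        ts_sec, ts_ms, serial, pid, comm, capability
    )
}

fn main() {
    let line = format_cap_denial_avc(1_700_000_000, 123, 42, 4001, "nginx", "CAP_READ");
    assert_eq!(
        line,
        "type=AVC msg=audit(1700000000.123:42): umka_cap_denial pid=4001 comm=\"nginx\" capability=CAP_READ"
    );
}
```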

journald integration. Audit records are also forwarded to the systemd journal with structured fields (_AUDIT_TYPE=, _AUDIT_ID=, OBJECT_PID=, etc.), making them queryable via journalctl:

journalctl _AUDIT_TYPE=AVC --since "5 minutes ago"
journalctl _AUDIT_TYPE=USER_AUTH _HOSTNAME=prod-web-01

syslog forwarding. For centralized log collection (Splunk, Elasticsearch, Graylog), audit records can be forwarded over syslog (RFC 5424) with configurable facility and severity mapping via /etc/umka/audit.conf.


20.3 Audit Subsystem

UmkaOS provides a Linux-compatible kernel audit subsystem for compliance logging (PCI-DSS, HIPAA, SOC2, Common Criteria). The kernel-side implementation communicates with userspace auditd via NETLINK_AUDIT, emits records in the standard key=value format, and implements the same rule language. Unmodified auditd, auditctl, ausearch, and aureport binaries work without changes.

Inspired by: Linux audit subsystem (Al Viro, 2003+). IP status: Clean -- the audit netlink protocol and record format are public UAPI, documented in audit.h and the Linux Audit Documentation Project.

20.3.1 Overview

The Linux audit subsystem provides kernel-side logging of security-relevant events: syscall invocations, file accesses, authentication events, and LSM access decisions. It is the foundation for enterprise compliance: PCI-DSS requires logging of all access to cardholder data, HIPAA requires audit trails for protected health information, and SOC2 requires demonstrable access controls.

The audit subsystem has three consumers that UmkaOS must support:

  1. auditd -- the primary userspace daemon. Receives events via NETLINK_AUDIT (protocol 9), writes to /var/log/audit/audit.log. Managed by auditctl.
  2. Container security tools -- Falco, Sysdig, and Tracee use the audit subsystem (or its events via eBPF) for runtime container security monitoring.
  3. SELinux/AppArmor -- LSM denials generate AUDIT_AVC records through the audit framework (Section 9.8). AppArmor denials use AUDIT_AVC (1400) with an apparmor= key prefix, not a separate message type.

UmkaOS provides full audit compatibility: same netlink protocol, same log format, same rule language, same /proc/[pid]/loginuid interface.

20.3.2 Audit State

// umka-core/src/audit/state.rs

/// Global audit state -- initialized at boot, modified via netlink commands.
///
/// There is exactly one `AuditState` instance in the kernel.  It is allocated
/// statically and accessed through `audit_state()`.  Fields that auditd changes
/// at runtime use atomics; the rule list and backlog use appropriate locks.
pub struct AuditState {
    /// Audit enabled flag.
    ///   0 = disabled (no events generated, rules ignored).
    ///   1 = enabled (events generated per rules).
    ///   2 = immutable (enabled, and cannot be disabled until reboot).
    /// Transition 1->2 is one-way: once set to 2, any AUDIT_SET with
    /// enabled != 2 returns EPERM.  This prevents attackers who gain root
    /// from silently disabling audit after compromise.
    pub enabled: AtomicU32,

    /// PID of the auditd process that receives events via netlink unicast.
    /// 0 = no auditd connected.  Set by AUDIT_SET (auditd sends its PID
    /// after opening the netlink socket).
    pub auditd_pid: AtomicU32,

    /// Maximum number of events that can be queued in the backlog before
    /// the kernel takes action per `failure_mode`.  Default: 8192.
    /// Configurable via `auditctl -b`.
    pub backlog_limit: AtomicU32,

    /// Behavior when the backlog is full or an unrecoverable audit error occurs.
    ///   0 = silent (drop events, increment lost counter).
    ///   1 = printk (drop events, log warning to kernel log).
    ///   2 = panic (halt the system -- required by some CC profiles).
    /// Default: 1.  Configurable via `auditctl -f`.
    pub failure_mode: AtomicU32,

    /// Monotonically increasing event sequence number.  u64 is required:
    /// at 1M events/sec a u32 would wrap in ~71 minutes, while a u64
    /// exhausts only after ~584,942 years.
    pub sequence: AtomicU64,

    /// Rate limit: maximum events per second.  0 = unlimited (default).
    /// Implemented as a token bucket refilled once per second.
    /// Configurable via `auditctl -r`.
    pub rate_limit: AtomicU32,

    /// Current rate-limit token count (decremented per event, refilled by timer).
    pub rate_tokens: AtomicU32,

    /// Number of events lost due to backlog full or rate limiting.
    /// Exposed via AUDIT_GET status response.
    pub lost: AtomicU64,

    /// Per-CPU event backlog rings. Each CPU has its own SpscRing — no
    /// contention on enqueue. The `kauditd` consumer merges records from
    /// all per-CPU rings by sequence number to maintain causal ordering.
    ///
    /// Uses `PerCpu<SpscRing>` (not embedded in `CpuLocalBlock` — audit
    /// ring is warm-path and its size exceeds CpuLocalBlock L1 budget).
    ///
    /// Per-CPU ring capacity = `backlog_limit / num_cpus` (rounded up
    /// to next power of two, minimum 256). The `backlog_limit` remains a
    /// global budget for Linux compatibility: `AUDIT_GET` status reports
    /// `backlog = sum(per_cpu_ring.len())`.
    ///
    /// When a CPU's ring is full, behavior depends on `failure_mode`:
    ///   0 = drop event, increment `lost` counter.
    ///   1 = drop event, increment `lost`, printk warning.
    ///   2 = panic.
    ///
    /// CPU hotplug: when a CPU goes offline, `kauditd` drains its ring
    /// before the CPU's PerCpu data is deactivated.
    pub per_cpu_backlogs: PerCpu<SpscRing<AuditRecord, PER_CPU_AUDIT_RING_CAPACITY>>,

    /// Syscall filter rules.  Updated via AUDIT_ADD_RULE / AUDIT_DEL_RULE.
    /// Published under RCU: rule evaluation on syscall exit reads the current
    /// snapshot without locking; rule changes (rare, admin-initiated) clone,
    /// modify, and publish a new list via RCU swap.
    pub rules: RcuCell<ArrayVec<AuditRule, MAX_AUDIT_RULES>>,

    /// Feature flags (AUDIT_SET_FEATURE / AUDIT_GET_FEATURE).
    /// Bit 0: backlog_wait_time (Linux 3.14+).
    /// Bit 1: loginuid_immutable (Linux 3.14+).
    pub features: AtomicU32,

    /// Maximum time in milliseconds to wait for backlog space before
    /// dropping an event (when backlog_wait_time feature is enabled).
    /// Default: 60000 (60 seconds).  0 = do not wait (drop immediately).
    pub backlog_wait_time_ms: AtomicU32,
}
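The per-CPU rings trade a global ordering guarantee for contention-free enqueue, so the `kauditd` consumer must restore causal order by merging on sequence numbers. A minimal userspace sketch of that merge, substituting `VecDeque` for the drained contents of each `SpscRing` (`merge_by_sequence` is an illustrative name, not a function from the spec):

```rust
use std::collections::VecDeque;

/// Stand-in for one per-CPU ring's drained records, reduced to their
/// sequence numbers (ascending within each ring by construction).
type Ring = VecDeque<u64>;

/// Merge records from all per-CPU rings in global sequence order, the
/// way the `kauditd` consumer restores causal ordering.
fn merge_by_sequence(mut rings: Vec<Ring>) -> Vec<u64> {
    let mut out = Vec::new();
    loop {
        // Pick the ring whose head record has the smallest sequence number.
        let next = rings
            .iter()
            .enumerate()
            .filter_map(|(i, r)| r.front().map(|&seq| (seq, i)))
            .min();
        match next {
            Some((_, i)) => out.push(rings[i].pop_front().unwrap()),
            None => return out, // all rings empty
        }
    }
}
```

A binary heap over the ring heads would make each step O(log n) instead of the linear scan shown, but the scan is adequate for realistic CPU counts.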

20.3.3 Audit Events

// umka-core/src/audit/event.rs

/// A single audit event record, ready for delivery to auditd via NETLINK_AUDIT.
///
/// The `data` field is a fixed-size inline buffer, NOT a `Vec<u8>`, to avoid
/// heap allocation on the syscall exit hot path.
///
/// Note: The Linux-compatible audit path uses `AuditRecord` for NETLINK_AUDIT
/// delivery. The UmkaOS tamper-evident HMAC audit chain uses `TamperAuditRecord`
/// (see [Section 20.2](#stable-tracepoint-abi)). These are parallel audit paths: Linux-format
/// events are generated for auditd compatibility; tamper-evident records are
/// independently generated for kernel integrity verification.
///
/// **Hot-path cost of dual audit**: When both paths are active, each auditable
/// syscall generates TWO records (one AuditRecord + one TamperAuditRecord).
/// The combined cost is ~200-400 ns per syscall (AuditRecord: ~100-200 ns for
/// field serialization + ring push; TamperAuditRecord: ~100-200 ns for HMAC
/// chain computation). When either path is disabled (no auditd connected OR
/// tamper-evident audit disabled), its cost is zero (static key eliminates
/// the branch). The common case — auditd connected but tamper-evident disabled
/// — costs only the AuditRecord path.
pub struct AuditRecord {
    /// Sequence number assigned from `AuditState::sequence`.
    pub seq: u64,
    /// Timestamp (seconds + milliseconds since epoch).
    pub timestamp_sec: u64,
    pub timestamp_ms: u32,
    /// Message type (e.g., AUDIT_SYSCALL = 1300, AUDIT_PATH = 1302).
    pub msg_type: AuditMessageType,
    /// Formatted record payload (key=value text).
    /// Fixed-size inline buffer. `data_len` indicates the valid portion.
    /// Linux audit records are typically 128-512 bytes; the 1024-byte buffer
    /// accommodates the largest realistic single-record payload.
    pub data: [u8; AUDIT_RECORD_MAX],
    /// Number of valid bytes in `data`.
    pub data_len: u16,
}

/// Maximum size of a single audit record's formatted payload.
pub const AUDIT_RECORD_MAX: usize = 1024;

/// Default backlog ring capacity. Configurable via `auditctl -b`.
pub const AUDIT_BACKLOG_DEFAULT: usize = 8192;

/// Maximum number of audit filter rules. Linux uses a dynamically-allocated
/// list; UmkaOS uses a bounded ArrayVec under RCU protection. 1024 rules
/// covers enterprise audit configurations (typical deployments use 50-200).
pub const MAX_AUDIT_RULES: usize = 1024;
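The per-CPU ring sizing rule stated earlier (`backlog_limit / num_cpus`, rounded up to the next power of two, minimum 256) is mechanical; a sketch with an illustrative helper name:

```rust
/// Per-CPU ring capacity: the global backlog budget split across CPUs,
/// rounded up to the next power of two, with a floor of 256 entries.
fn per_cpu_ring_capacity(backlog_limit: usize, num_cpus: usize) -> usize {
    let share = (backlog_limit + num_cpus - 1) / num_cpus; // ceiling division
    share.next_power_of_two().max(256)
}
```

With the default `backlog_limit` of 8192 on an 8-CPU machine each ring holds 1024 entries; on a 3-CPU machine the 2731-entry share rounds up to 4096.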

/// Audit message types.  Values match Linux UAPI `linux/audit.h`.
#[repr(u32)]
pub enum AuditMessageType {
    // --- Netlink command messages (user -> kernel) ---
    Get             = 1000,
    Set             = 1001,
    ListRules       = 1013,
    AddRule         = 1011,
    DelRule         = 1012,
    User            = 1005,
    Login           = 1006,
    SignalInfo      = 1010,
    SetFeature      = 1018,
    GetFeature      = 1019,

    // --- Kernel event messages (kernel -> user) ---
    Syscall         = 1300,
    Path            = 1302,
    Ipc             = 1303,
    Socketcall      = 1304,
    Sockaddr        = 1306,
    Cwd             = 1307,
    Execve          = 1309,
    IntegrityData   = 1800,
    Proctitle       = 1327,

    // --- LSM-originated events ---
    Avc             = 1400,
    UserAvc         = 1107,
    // AppArmor denials are emitted as AUDIT_AVC (1400) records with
    // `apparmor=` key-value prefix, matching Linux behavior. No separate
    // AUDIT_APPARMOR type exists in Linux UAPI.
    /// End-of-event sentinel.  Not a real record -- signals to auditd that
    /// all records sharing a given sequence number have been delivered.
    Eoe             = 1320,
}

20.3.4 Audit Rules

// umka-core/src/audit/rules.rs

/// Maximum field conditions per rule (matches Linux AUDIT_MAX_FIELDS).
pub const AUDIT_MAX_FIELDS: usize = 64;

/// An audit filter rule -- determines which syscalls/events generate records.
///
/// Rules are installed by `auditctl -a` and removed by `auditctl -d`.
/// The kernel evaluates rules at syscall exit against the collected audit
/// context.  A rule matches when ALL field conditions are satisfied (AND).
pub struct AuditRule {
    /// Which filter list this rule belongs to.
    pub list: AuditFilterList,
    /// Action to take when the rule matches.
    pub action: AuditAction,
    /// Field match conditions (AND'd together).  Up to AUDIT_MAX_FIELDS.
    pub fields: ArrayVec<AuditField, AUDIT_MAX_FIELDS>,
    /// Syscall mask: bit N set means syscall N is covered by this rule.
    /// 2048 bits covers all syscall numbers on all architectures
    /// (x86-64 uses up to ~450, but architectures with multiplexed
    /// syscalls or UmkaOS-native extensions may use higher numbers;
    /// 2048 bits provides headroom for the full UmkaOS native range).
    /// If all bits are set, the rule applies to all syscalls.
    pub syscall_mask: [u64; 32],
}
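The `syscall_mask` check is one shift and one AND; a runnable sketch of the bitmap helpers (names illustrative):

```rust
/// 2048-bit syscall mask: bit N set means syscall N is covered.
fn mask_covers(mask: &[u64; 32], syscall_nr: u32) -> bool {
    let word = (syscall_nr / 64) as usize;
    word < mask.len() && (mask[word] >> (syscall_nr % 64)) & 1 == 1
}

/// Mark syscall N as covered by this rule.
fn mask_set(mask: &mut [u64; 32], syscall_nr: u32) {
    mask[(syscall_nr / 64) as usize] |= 1u64 << (syscall_nr % 64);
}
```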

/// Filter list identifiers. Values from Linux include/uapi/linux/audit.h.
/// Range 0x8000+ reserved for UmkaOS extensions.
#[repr(u32)]
pub enum AuditFilterList {
    /// Filter user-originated messages.
    User    = 0,
    /// Filter at task creation (fork/clone).
    Task    = 1,
    // Entry = 2, // deprecated
    // Watch = 3, // deprecated
    /// Filter at syscall exit (most common -- matches on return value, paths, etc.).
    Exit    = 4,
    /// Exclude matching events from the log.
    Exclude = 5,
    /// Filesystem filter (Linux 5.1+): match events by filesystem type.
    Fs      = 6,
    /// io_uring exit filter (Linux 6.0+).
    UringExit = 7,
}

/// Rule action.
#[repr(u32)]
pub enum AuditAction {
    /// Always generate an audit record when this rule matches.
    Always = 1,
    /// Never generate an audit record (suppress).
    Never  = 0,
}

/// A single field match condition within a rule.
pub struct AuditField {
    /// Field type identifier.
    pub field_type: AuditFieldType,
    /// Comparison operator.
    pub op: AuditOp,
    /// Numeric value to compare against.
    pub value: u64,
    /// For string-valued fields (AUDIT_WATCH, AUDIT_DIR, AUDIT_FILTERKEY):
    /// the match string.  None for numeric fields.
    /// `KernelString` is the kernel's variable-length string type (see
    /// [Section 12.1](12-kabi.md#kabi-overview) for definition). Heap-allocated, bounded by
    /// `AUDIT_MAX_FIELD_STRING_LEN` (4096 bytes).
    pub str_value: Option<KernelString>,
}

/// Field type identifiers (values match Linux `audit.h`).
/// Values from Linux include/uapi/linux/audit.h. Range 0x8000+ reserved
/// for UmkaOS extensions.
#[repr(u32)]
pub enum AuditFieldType {
    Pid         = 0,
    Uid         = 1,
    Euid        = 2,
    Suid        = 3,
    Fsuid       = 4,
    Gid         = 5,
    Egid        = 6,
    Sgid        = 7,
    Fsgid       = 8,
    Loginuid    = 9,
    Pers        = 10,
    Arch        = 11,
    Msgtype     = 12,
    /// SELinux subject user.
    SubjUser    = 13,
    /// SELinux subject role.
    SubjRole    = 14,
    /// SELinux subject type.
    SubjType    = 15,
    /// SELinux subject sensitivity.
    SubjSen     = 16,
    /// SELinux subject clearance.
    SubjClr     = 17,
    /// Parent PID.
    Ppid        = 18,
    /// SELinux object user.
    ObjUser     = 19,
    /// SELinux object role.
    ObjRole     = 20,
    /// SELinux object type.
    ObjType     = 21,
    /// SELinux object level low.
    ObjLevLow   = 22,
    /// SELinux object level high.
    ObjLevHigh  = 23,
    /// Whether loginuid has been set (Linux 3.13+).
    LoginuidSet = 24,
    /// Session ID.
    SessionId   = 25,
    /// Filesystem type (ext4, xfs, etc.).
    FsType      = 26,
    /// Device major number.
    Devmajor    = 100,
    /// Device minor number.
    Devminor    = 101,
    /// Inode number.
    Inode       = 102,
    /// Exit code of the syscall.
    Exit        = 103,
    /// Success/failure of the syscall.
    Success     = 104,
    /// Watch a specific path.
    Watch       = 105,
    /// Process permissions (read/write/exec/attr).
    Perm        = 106,
    /// Watch a directory.
    Dir         = 107,
    /// File type (regular, directory, socket, etc.).
    Filetype    = 108,
    /// Object UID (file owner).
    ObjUid      = 109,
    /// Object GID (file group).
    ObjGid      = 110,
    /// Field comparison (compare two audit fields, e.g., uid == euid).
    FieldCompare = 111,
    /// Executable path filter.
    Exe         = 112,
    /// Socket address family filter.
    SaddrFam    = 113,
    /// User-defined filter key (for grouping rules).
    Filterkey   = 210,
}

/// Comparison operators for field matching.
/// Values from Linux include/uapi/linux/audit.h. Range 0x8000_0000+ reserved
/// for UmkaOS extensions.
#[repr(u32)]
pub enum AuditOp {
    /// Bitwise AND is non-zero.
    BitMask     = 0x0800_0000,
    /// Less than.
    Lt          = 0x1000_0000,
    /// Greater than.
    Gt          = 0x2000_0000,
    /// Not equal.
    Ne          = 0x3000_0000,
    /// Equal.
    Eq          = 0x4000_0000,
    /// Specific bit is set.
    BitTest     = 0x4800_0000,
    /// Less than or equal.
    Le          = 0x5000_0000,
    /// Greater than or equal.
    Ge          = 0x6000_0000,
}
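The operator semantics can be pinned down with a small evaluator. The sketch below carries a local copy of the enum values for self-containment and mirrors the semantics of Linux's audit_comparator() for the bit operators (BitMask: intersection non-empty; BitTest: all requested bits set); the helper name is illustrative:

```rust
/// Local copy of the comparison operators (values match Linux `audit.h`).
#[repr(u32)]
enum AuditOp {
    BitMask = 0x0800_0000,
    Lt      = 0x1000_0000,
    Gt      = 0x2000_0000,
    Ne      = 0x3000_0000,
    Eq      = 0x4000_0000,
    BitTest = 0x4800_0000,
    Le      = 0x5000_0000,
    Ge      = 0x6000_0000,
}

/// Evaluate one numeric field condition: does `left op right` hold?
fn audit_compare(left: u64, op: AuditOp, right: u64) -> bool {
    match op {
        AuditOp::Eq      => left == right,
        AuditOp::Ne      => left != right,
        AuditOp::Lt      => left < right,
        AuditOp::Le      => left <= right,
        AuditOp::Gt      => left > right,
        AuditOp::Ge      => left >= right,
        AuditOp::BitMask => left & right != 0,      // any requested bit set
        AuditOp::BitTest => left & right == right,  // all requested bits set
    }
}
```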

20.3.5 Netlink Control Protocol

Audit control uses NETLINK_AUDIT (protocol number 9). The netlink message format matches Linux exactly -- auditd opens socket(AF_NETLINK, SOCK_RAW, NETLINK_AUDIT) and exchanges messages using standard netlink headers (Section 16.17).

20.3.5.1 Command Messages (User to Kernel)

| Command | Value | Description |
|---------|-------|-------------|
| AUDIT_GET | 1000 | Query current audit status (enabled, backlog, lost count) |
| AUDIT_SET | 1001 | Set audit parameters (enable, backlog_limit, failure_mode, pid) |
| AUDIT_LIST_RULES | 1013 | List all installed filter rules |
| AUDIT_ADD_RULE | 1011 | Add a filter rule |
| AUDIT_DEL_RULE | 1012 | Delete a filter rule |
| AUDIT_USER | 1005 | User-originated event (auditd forwards to kernel for logging) |
| AUDIT_LOGIN | 1006 | Report user login (creates login UID binding) |
| AUDIT_SIGNAL_INFO | 1010 | Query which process sent a signal to auditd |
| AUDIT_SET_FEATURE | 1018 | Set audit feature flags |
| AUDIT_GET_FEATURE | 1019 | Query audit feature flags |

20.3.5.2 Event Messages (Kernel to User)

| Type | Value | Content |
|------|-------|---------|
| AUDIT_SYSCALL | 1300 | Syscall entry/exit: arch, syscall nr, args[0..3], return value, pid, uid, gid, euid, egid, auid, ses |
| AUDIT_PATH | 1302 | File path accessed during syscall: item index, name, inode, dev, mode, ouid, ogid |
| AUDIT_IPC | 1303 | IPC object access: object type, id, uid, gid, mode, permissions |
| AUDIT_SOCKADDR | 1306 | Socket address for connect/bind/accept: saddr (hex-encoded sockaddr) |
| AUDIT_CWD | 1307 | Current working directory: cwd path |
| AUDIT_EXECVE | 1309 | execve arguments: argc, a0, a1, ... (individually hex-encoded if non-printable) |
| AUDIT_PROCTITLE | 1327 | Process title (from /proc/pid/cmdline): proctitle (hex-encoded) |
| AUDIT_AVC | 1400 | SELinux AVC denial: avc: denied { operation } for scontext tcontext tclass. AppArmor denials also use AUDIT_AVC with apparmor= key prefix (no separate type). |
| AUDIT_EOE | 1320 | End-of-event sentinel: signals all records for a sequence number are complete |

20.3.6 Syscall Audit Path

The audit hot path is the per-syscall check. When audit rules are installed, the kernel sets TIF_SYSCALL_AUDIT on tasks matching the rules' task-level criteria. The syscall path proceeds in two phases:

Phase 1 -- Syscall Entry: If TIF_SYSCALL_AUDIT is set on the current task, the syscall entry path allocates an AuditContext (from a per-CPU slab cache) and captures the entry state:

/// Per-syscall audit context.  Allocated at syscall entry, freed after
/// all records are delivered to auditd.
/// Kernel-internal, not KABI -- bool is safe here.
pub struct AuditContext {
    /// Architecture identifier (AUDIT_ARCH_X86_64, AUDIT_ARCH_AARCH64, etc.).
    pub arch: u32,
    /// Syscall number.
    pub syscall_nr: u32,
    /// First four syscall arguments (sufficient for rule matching).
    pub args: [u64; 4],
    /// Syscall return value (filled at exit).
    pub return_value: i64,
    /// Whether the syscall succeeded (filled at exit).
    pub success: bool,
    /// Paths collected during the syscall (from `getname()` calls in VFS).
    /// Each VFS path lookup that occurs during the syscall appends here.
    pub paths: ArrayVec<AuditPath, 8>,
    /// Socket address captured during connect/bind/accept.
    pub sockaddr: Option<AuditSockaddr>,
    /// IPC object info captured during IPC operations.
    pub ipc: Option<AuditIpcInfo>,
    /// Current working directory (captured lazily on first relative path).
    pub cwd: Option<KernelString>,
    /// User-defined filter key from the matching rule (for ausearch -k).
    pub filter_key: Option<KernelString>,
}

/// A file path captured during a syscall.
pub struct AuditPath {
    /// Path name (as passed to the syscall or resolved by VFS).
    pub name: KernelString,
    /// Inode number (0 if not yet resolved).
    pub ino: u64,
    /// Device ID (major << 20 | minor).
    pub dev: u32,
    /// File mode (type + permissions).
    pub mode: u32,
    /// Owner UID of the inode.
    pub ouid: u32,
    /// Owner GID of the inode.
    pub ogid: u32,
}

Phase 2 -- Syscall Exit: After the syscall completes, the exit path evaluates the audit context against AUDIT_FILTER_EXIT rules:

  1. Check syscall_mask -- O(1) bitmap lookup confirms this syscall is covered.
  2. Evaluate field conditions -- each AuditField in the matching rule is checked against the audit context (pid, uid, return value, path, etc.).
  3. If a rule matches with action = Always, format the event records, in order:
     • AUDIT_SYSCALL (always first): contains arch, syscall, args, return, pid, uid, etc.
     • AUDIT_PATH (one per path in AuditContext::paths).
     • AUDIT_CWD (if any paths are relative).
     • AUDIT_EXECVE (if the syscall was execve/execveat -- args are individually encoded).
     • AUDIT_SOCKADDR (if a socket address was captured).
     • AUDIT_IPC (if an IPC object was accessed).
     • AUDIT_PROCTITLE (process command line for context).
     • AUDIT_EOE (end-of-event marker, always last).
     All records in one event share the same sequence number from AuditState::sequence.
  4. Enqueue the formatted records to the current CPU's ring in AuditState::per_cpu_backlogs.
  5. Wake the kauditd kernel thread to drain the per-CPU rings to auditd's netlink socket.
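Record formatting happens entirely in the AuditRecord inline buffer -- no heap allocation on the exit path. A userspace sketch of heap-free formatting for the `audit(sec.ms:seq):` prefix shared by all records in an event (the writer type and the exact field selection are illustrative):

```rust
use core::fmt::{self, Write};

/// Writer over a fixed-size buffer, mirroring the inline
/// `data: [u8; AUDIT_RECORD_MAX]` field of `AuditRecord`.
struct FixedWriter<'a> {
    buf: &'a mut [u8],
    len: usize,
}

impl Write for FixedWriter<'_> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let bytes = s.as_bytes();
        if self.len + bytes.len() > self.buf.len() {
            return Err(fmt::Error); // would overflow the inline buffer
        }
        self.buf[self.len..self.len + bytes.len()].copy_from_slice(bytes);
        self.len += bytes.len();
        Ok(())
    }
}

/// Format the start of an AUDIT_SYSCALL record. Returns the number of
/// bytes written (the `data_len` value).
fn format_syscall_record(
    buf: &mut [u8],
    sec: u64, ms: u32, seq: u64,
    arch: u32, nr: u32, ret: i64,
) -> Result<usize, fmt::Error> {
    let mut w = FixedWriter { buf, len: 0 };
    write!(w, "audit({sec}.{ms:03}:{seq}): arch={arch:x} syscall={nr} exit={ret}")?;
    Ok(w.len)
}
```

Returning `fmt::Error` on overflow lets the caller drop or truncate the record instead of allocating.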

Architecture identifiers for the AUDIT_ARCH field (used in rule matching and event records, values match Linux audit.h):

| Architecture | AUDIT_ARCH value | Constant |
|--------------|------------------|----------|
| x86-64 | 0xc000003e | AUDIT_ARCH_X86_64 |
| AArch64 | 0xc00000b7 | AUDIT_ARCH_AARCH64 |
| ARMv7 | 0x40000028 | AUDIT_ARCH_ARM |
| RISC-V 64 | 0xc00000f3 | AUDIT_ARCH_RISCV64 |
| PPC32 | 0x00000014 | AUDIT_ARCH_PPC |
| PPC64LE | 0xc0000015 | AUDIT_ARCH_PPC64LE |
| s390x | 0x80000016 | AUDIT_ARCH_S390X |
| LoongArch64 | 0xc0000102 | AUDIT_ARCH_LOONGARCH64 |

20.3.7 Login UID and Session Tracking

The audit subsystem tracks the original login identity of every process, distinct from the effective UID which may change via setuid/sudo:

/// Per-task audit identity, stored in the task struct.
/// Set once at login time, propagated across fork/exec unchanged.
pub struct AuditTaskInfo {
    /// Login UID (auid): the UID of the user who originally logged in.
    /// Set via write to `/proc/[pid]/loginuid` (by PAM at login time).
    /// Value `u32::MAX` (4294967295) means "unset" (no login session).
    pub loginuid: AtomicU32,

    /// Session ID: monotonically increasing per-login-session counter.
    /// Assigned when loginuid is first set.  Shared by all processes
    /// in the same login session (propagated across fork/exec).
    pub sessionid: AtomicU32,
}

Immutability rules:

  • Once loginuid is set to a valid value (not u32::MAX), it cannot be changed except by a process with CAP_AUDIT_CONTROL.
  • The audit_loginuid_immutable feature flag (set via AUDIT_SET_FEATURE) controls container behavior: when set, even CAP_AUDIT_CONTROL in a non-init user namespace cannot change loginuid. This prevents containers from forging audit identity.
  • sessionid is never writable from userspace -- it is assigned by the kernel when loginuid is set and propagated automatically.

procfs interface:

  • /proc/[pid]/loginuid -- read/write, contains decimal auid (Section 8.1).
  • /proc/[pid]/sessionid -- read-only, contains decimal session ID.
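The set-once rule for loginuid reduces to a compare-exchange against the unset sentinel; a minimal sketch (function name illustrative, the capability check reduced to a boolean):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Sentinel for "no login session" (4294967295).
const AUDIT_UID_UNSET: u32 = u32::MAX;

/// Set-once loginuid update. Without CAP_AUDIT_CONTROL, only the initial
/// unset -> value transition is allowed; with it, any rewrite succeeds.
fn set_loginuid(
    loginuid: &AtomicU32,
    new: u32,
    has_cap_audit_control: bool,
) -> Result<(), ()> {
    if has_cap_audit_control {
        loginuid.store(new, Ordering::Relaxed);
        return Ok(());
    }
    // Succeeds only if the current value is still the unset sentinel.
    loginuid
        .compare_exchange(AUDIT_UID_UNSET, new, Ordering::Relaxed, Ordering::Relaxed)
        .map(|_| ())
        .map_err(|_| ())
}
```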

20.3.8 LSM Audit Integration

The audit subsystem is the delivery mechanism for LSM access denials. When an LSM hook denies an operation (Section 9.8), the LSM framework calls into audit to emit a module-specific record:

/// Emit an LSM audit record.  Called by the LSM framework when a security
/// module denies an operation and auditing is enabled.
///
/// This function is on a warm path (only called on denial, not on every
/// hook invocation).  It allocates an AuditRecord, formats the denial
/// message in the LSM-specific format, and enqueues it to the backlog.
pub fn audit_log_lsm_denial(
    module_name: &str,
    msg_type: AuditMessageType,
    subject: &TaskCredential,
    operation: &str,
    target: &str,
    reason: &str,
) {
    // SELinux denials → AUDIT_AVC (1400):
    //   type=AVC msg=audit(TIMESTAMP:SEQ): avc: denied { read } for
    //   pid=PID comm="CMD" scontext=SCTX tcontext=TCTX tclass=CLASS
    //
    // AppArmor denials → also AUDIT_AVC (1400), with apparmor= key prefix:
    //   type=AVC msg=audit(TIMESTAMP:SEQ): apparmor="DENIED"
    //   operation="open" profile="PROFILE" name="PATH" ...
    // Note: Linux has no separate AUDIT_APPARMOR message type. AppArmor uses
    // AUDIT_AVC with the `apparmor=` key to distinguish from SELinux denials.
}

FMA (Section 20.1) can subscribe to audit events as a telemetry source: repeated security denials from a single process or container may indicate a compromised workload, triggering FMA diagnosis rules.

20.3.9 Performance Considerations

Zero overhead when disabled: When AuditState::enabled == 0 and no rules are installed, no task has TIF_SYSCALL_AUDIT set. The syscall entry/exit path checks this per-task flag (a single bit test on the thread flags word) and skips all audit logic. Cost: approximately 1 ns per syscall (branch predicted not-taken).

Rule matching cost: The syscall_mask bitmap provides an O(1) pre-filter. If the current syscall's bit is not set in any rule's mask, no field matching occurs. When field matching is needed, conditions are evaluated sequentially (AND short-circuit: first non-matching field skips the rule). Typical deployments have 10-50 rules; the linear scan over rules is acceptable for a warm path (only executed when TIF_SYSCALL_AUDIT is set).

Backlog pressure: When the backlog reaches backlog_limit:

  • If backlog_wait_time_ms > 0: the producing task sleeps (interruptible) for up to backlog_wait_time_ms waiting for kauditd to drain space.
  • If the wait times out or backlog_wait_time_ms == 0: the outcome depends on failure_mode -- silent drop (0), drop with a rate-limited printk warning (1), or panic (2). Dropped events increment AuditState::lost.

Rate limiting: When rate_limit > 0, a token bucket with capacity equal to rate_limit is refilled once per second. Each event consumes one token. When tokens are exhausted, events are dropped until the next refill.
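This token bucket needs only the two atomics already present in AuditState; a sketch with illustrative free functions:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Per-event token consumption. Returns true if the event may be logged.
fn rate_limit_allow(tokens: &AtomicU32, rate_limit: u32) -> bool {
    if rate_limit == 0 {
        return true; // 0 = unlimited
    }
    // Consume one token if any remain; the CAS loop inside fetch_update
    // avoids underflowing past zero under concurrent producers.
    tokens
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |t| t.checked_sub(1))
        .is_ok()
}

/// Called once per second by the refill timer: restore full capacity.
fn rate_limit_refill(tokens: &AtomicU32, rate_limit: u32) {
    tokens.store(rate_limit, Ordering::Relaxed);
}
```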

20.3.10 Namespace Scoping

Audit events include the namespace context of the originating process (Section 17.1). Container-aware audit deployments use this to filter events by container identity:

  • The AUDIT_SYSCALL record includes the process's PID namespace level and container ID (derived from the cgroup path, matching the Docker/Kubernetes convention of extracting the container ID from /proc/[pid]/cgroup).
  • audit_loginuid_immutable prevents containers from forging login identity.
  • Future: full audit namespace support (Linux audit namespace patches, not yet mainlined) would allow per-container audit daemons. UmkaOS tracks the upstream proposal and will implement when the interface stabilizes.
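The cgroup-derived container ID mentioned in the first bullet can be extracted with simple string handling. This sketch assumes the common conventions (a bare 64-hex-character trailing path component, or systemd's docker-<id>.scope wrapping); both the helper name and the exact wrappings handled are assumptions for illustration:

```rust
/// Extract a container ID from a cgroup path, following the common
/// Docker/Kubernetes convention: the trailing path component is a
/// 64-character hex ID, possibly wrapped as "docker-<id>.scope" under
/// the systemd cgroup driver.
fn container_id_from_cgroup(path: &str) -> Option<&str> {
    let last = path.rsplit('/').next()?;
    // Unwrap the systemd cgroup driver form, if present.
    let candidate = last
        .strip_prefix("docker-")
        .and_then(|s| s.strip_suffix(".scope"))
        .unwrap_or(last);
    let is_id = candidate.len() == 64
        && candidate.bytes().all(|b| b.is_ascii_hexdigit());
    is_id.then_some(candidate)
}
```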

20.3.11 Cross-References

  • Section 9.8 -- LSM hooks that generate AUDIT_AVC records (both SELinux and AppArmor use AUDIT_AVC)
  • Section 20.2 -- complementary observability: tracepoints serve performance analysis, audit serves compliance logging
  • Section 20.1 -- FMA subscribes to audit events for security anomaly detection
  • Section 16.17 -- NETLINK_AUDIT transport protocol
  • Section 8.1 -- TIF_SYSCALL_AUDIT flag, /proc/[pid]/loginuid interface
  • Section 17.1 -- audit namespace scoping for containers
  • Section 9.9 -- CAP_AUDIT_CONTROL and CAP_AUDIT_WRITE capabilities

20.4 Debugging and Process Inspection

Inspired by: Solaris/illumos mdb, Linux ptrace, seL4 capability-gated debugging. IP status: Clean — ptrace is a standard POSIX interface; capability-gating access control is a design policy, not a patentable mechanism.

20.4.1 Capability-Gated ptrace

Linux problem: ptrace is a powerful but coarse-grained tool. A single PTRACE_ATTACH call gives the debugger complete control over the target process: read and write arbitrary memory, read and write registers, inject signals, single-step instructions, intercept syscalls. Access control is limited to UID checks and the optional Yama LSM (/proc/sys/kernel/yama/ptrace_scope). In container environments, ptrace across namespace boundaries is blocked by user namespace checks, but within a namespace any process with the same UID (or CAP_SYS_PTRACE) can attach to any other. There is no way to grant partial debug access — it is all or nothing.

UmkaOS design: Each ptrace operation requires a specific combination of capabilities. The debugger must hold explicit capability tokens scoped to the target process and the operations it intends to perform.

| Operation | Required Capability |
|-----------|---------------------|
| PTRACE_ATTACH / PTRACE_SEIZE | CAP_DEBUG on target process |
| PTRACE_CONT / PTRACE_DETACH | CAP_DEBUG on target process |
| PTRACE_KILL | CAP_DEBUG + CAP_KILL (or same UID) on target process |
| PTRACE_INTERRUPT | CAP_DEBUG on target process |
| PTRACE_PEEKDATA / PTRACE_POKEDATA | CAP_DEBUG + READ or WRITE on target address space |
| PTRACE_PEEKTEXT / PTRACE_POKETEXT | CAP_DEBUG + READ or WRITE on target address space (aliases for PEEKDATA/POKEDATA — UmkaOS uses unified address space) |
| PTRACE_GETSIGINFO / PTRACE_SETSIGINFO | CAP_DEBUG + READ or WRITE on target address space |
| PTRACE_GETREGS / PTRACE_SETREGS | CAP_DEBUG + READ or WRITE on register state |
| PTRACE_GETFPREGS / PTRACE_SETFPREGS | CAP_DEBUG + READ or WRITE on register state |
| PTRACE_SINGLESTEP | CAP_DEBUG + EXECUTE control |
| PTRACE_SYSCALL | CAP_DEBUG + SYSCALL_TRACE |

// umka-core/src/debug/ptrace.rs

/// ptrace request type, corresponding to Linux's PTRACE_* constants.
/// Passed to the `ptrace(2)` syscall as the `request` argument.
#[repr(u32)]
pub enum PtraceRequest {
    /// Read a word from tracee's memory at `addr`. Returns the word value.
    PeekData   = 2,
    /// Read a word from tracee's text segment at `addr`.
    PeekText   = 1,
    /// Write a word to tracee's memory at `addr`.
    PokeData   = 5,
    /// Write a word to tracee's text segment at `addr`.
    PokeText   = 4,
    /// Get the tracee's general-purpose registers.
    GetRegs    = 12,
    /// Set the tracee's general-purpose registers.
    SetRegs    = 13,
    /// Get the tracee's floating-point registers.
    GetFpRegs  = 14,
    /// Set the tracee's floating-point registers.
    SetFpRegs  = 15,
    /// Attach to the tracee, stopping it with SIGSTOP.
    Attach     = 16,
    /// Detach from the tracee, optionally delivering a signal.
    Detach     = 17,
    /// Single-step the tracee: deliver SIGTRAP after the next instruction.
    SingleStep = 9,
    /// Continue the tracee, optionally delivering a signal.
    Cont       = 7,
    /// Send SIGKILL to the tracee.
    Kill       = 8,
    /// Resume tracee, stopping at next syscall entry/exit.
    Syscall    = 24,
    /// Get the siginfo_t for the tracee's current stop signal.
    GetSigInfo = 0x4202,
    /// Set the siginfo_t for the tracee's current stop signal.
    SetSigInfo = 0x4203,
    /// Attach without stopping (PTRACE_SEIZE, Linux 3.4+).
    Seize      = 0x4206,
    /// Interrupt a PTRACE_SEIZE'd tracee (PTRACE_INTERRUPT, Linux 3.4+).
    Interrupt  = 0x4207,
    // Resume after PTRACE_SEIZE (a PTRACE_CONT variant valid only for
    // seized tracees -- PTRACE_LISTEN, 0x4208, in Linux UAPI). Kept as a
    // comment rather than a variant until it is represented separately
    // for type safety.
}

/// Validate that the calling process holds sufficient capabilities
/// for the requested ptrace operation on the target.
///
/// Capability lookups are RCU-protected
/// ([Section 9.1](09-security.md#capability-based-foundation--capability-token-model)): the cap_table
/// read side is lock-free and safe from interrupt context.
fn check_ptrace_cap(
    caller: &Process,
    target: &Process,
    request: PtraceRequest,
) -> Result<(), CapError> {
    // Caller must hold CAP_DEBUG scoped to the target's object ID.
    let debug_cap = caller.cap_table.lookup(
        target.object_id(),
        PermissionBits::DEBUG,
    )?;

    // Additional permission checks based on the operation.
    match request {
        PtraceRequest::PeekData | PtraceRequest::PeekText
        | PtraceRequest::GetSigInfo => {
            caller.cap_table.check(debug_cap, PermissionBits::READ)?;
        }
        PtraceRequest::PokeData | PtraceRequest::PokeText
        | PtraceRequest::SetSigInfo => {
            caller.cap_table.check(debug_cap, PermissionBits::WRITE)?;
        }
        PtraceRequest::GetRegs | PtraceRequest::GetFpRegs => {
            caller.cap_table.check(debug_cap, PermissionBits::READ)?;
        }
        PtraceRequest::SetRegs | PtraceRequest::SetFpRegs => {
            caller.cap_table.check(debug_cap, PermissionBits::WRITE)?;
        }
        PtraceRequest::SingleStep => {
            caller.cap_table.check(debug_cap, PermissionBits::EXECUTE)?;
        }
        PtraceRequest::Syscall => {
            caller.cap_table.check(debug_cap, PermissionBits::SYSCALL_TRACE)?;
        }
        PtraceRequest::Kill => {
            caller.cap_table.check(debug_cap, PermissionBits::KILL)?;
        }
        PtraceRequest::Attach | PtraceRequest::Seize
        | PtraceRequest::Cont | PtraceRequest::Detach
        | PtraceRequest::Interrupt => {
            // CAP_DEBUG alone is sufficient for these operations.
        }
    }
    Ok(())
}

Namespace isolation: The CAP_DEBUG capability is scoped to the target's namespace. A debugger in namespace A cannot debug a process in namespace B unless it holds a CAP_DEBUG token that was explicitly delegated across the namespace boundary (using the standard capability delegation mechanism from Section 9.1). Cross-namespace debugging requires both a CAP_DEBUG on the target and a CAP_NS_TRAVERSE on every intermediate namespace. This makes container breakout via ptrace structurally impossible — there is no ambient authority to override.

seccomp interaction: When a debugger attaches to a seccomp-sandboxed process, the sandbox remains in effect. The debugger can observe syscalls (with PTRACE_SYSCALL) but cannot inject syscalls that the target's seccomp filter would deny. This prevents a class of attacks where a debugger is used to bypass seccomp restrictions.

Domain isolation interaction: As described in Section 11.2, ptrace reads/writes to domain-protected memory go through the kernel-mediated PKRU path. The debugger never gains direct access to the target's isolation domain. Capability checks happen before the kernel performs the PKRU switch.

20.4.2 Ptrace Lifecycle

Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules. &self methods use interior mutability for mutation. Atomic fields use .store()/.load(). All #[repr(C)] structs have const_assert! size verification.

This section specifies the runtime protocol for ptrace relationships after the initial attach (covered in the Capability-Gated ptrace section above). It defines the PtraceState struct, event delivery, the detach protocol, and the exit interaction — what happens when the tracer or tracee exits.

Cross-references: - Task and Process structs: Section 8.1 - do_exit() teardown sequence: Section 8.2 - Signal handling and group stop: Section 8.5 - Capability model: Section 9.1

20.4.2.1 PtraceState

The PtraceState struct is heap-allocated and stored in Task.ptrace: Option<Box<PtraceState>>. It is None for tasks that are not being ptraced and Some(...) while a ptrace relationship is active. The struct is allocated on PTRACE_ATTACH / PTRACE_SEIZE / PTRACE_TRACEME and freed on PTRACE_DETACH or tracer exit.

// umka-core/src/debug/ptrace.rs

/// Per-tracee ptrace state. Allocated when a task becomes a tracee
/// (PTRACE_ATTACH, PTRACE_SEIZE, or PTRACE_TRACEME) and freed on detach.
///
/// **Ownership model**: The `PtraceState` is owned by the tracee's `Task`
/// (via `Box`). The tracer references the tracee through `Task.ptraced_children`
/// (an intrusive list of all tasks this tracer is tracing). The tracee
/// references the tracer through `tracer_pid` + a `Weak<Task>` for fast
/// access (upgraded to `Arc<Task>` only when needed; the `Weak` gracefully
/// handles tracer death).
///
/// **Locking**: Fields are accessed under `sighand.lock` (the tracee's signal
/// handler lock), matching Linux's protocol where ptrace state transitions
/// require `sighand->siglock`. The atomic fields (`options`, `message`) are
/// exceptions — they use atomic operations for lock-free reads from the
/// tracer's `waitpid()` / `PTRACE_GETEVENTMSG` paths.
pub struct PtraceState {
    /// The tracer task. `Weak` reference — if the tracer exits, `upgrade()`
    /// returns `None`, which triggers the auto-detach protocol in
    /// `exit_ptrace()`. The `Weak` is never dangling: it either upgrades to
    /// a live `Arc<Task>` or returns `None`.
    pub tracer: Weak<Task>,

    /// PID of the tracer process. Stored separately from the `Weak<Task>`
    /// because PID lookups are needed for `/proc/[pid]/status` (`TracerPid`
    /// field) even when the tracer's `Task` is not pinned.
    pub tracer_pid: ProcessId,

    /// Ptrace flags. Encodes both the attach mode and per-event trace options.
    /// - Bit 0: `PT_PTRACED` (0x01) — task is being ptraced.
    /// - Bit 16: `PT_SEIZED` (0x10000) — attached via `PTRACE_SEIZE` (not `ATTACH`).
    /// - Bits 1-7: per-event flags set via `PTRACE_SETOPTIONS`.
    /// - Bit 20: `PT_EXITKILL` — send SIGKILL to tracee on tracer exit.
    /// - Bit 21: `PT_SUSPEND_SECCOMP` — suspend seccomp (requires CAP_SYS_ADMIN).
    ///
    /// AtomicU32 for lock-free reads from `/proc/[pid]/status` and
    /// `ptrace_event_enabled()`. Writes are protected by `sighand.lock`.
    pub flags: AtomicU32,

    /// Ptrace options set via `PTRACE_SETOPTIONS` (0x4200). These control
    /// which events generate ptrace stops. Stored as the raw `PTRACE_O_*`
    /// bitmask for direct comparison with event checks.
    ///
    /// | Bit | Option | Event |
    /// |-----|--------|-------|
    /// | 0 | `PTRACE_O_TRACESYSGOOD` (1) | Set bit 7 of signal number on syscall-stop |
    /// | 1 | `PTRACE_O_TRACEFORK` (2) | `PTRACE_EVENT_FORK` on `fork()` |
    /// | 2 | `PTRACE_O_TRACEVFORK` (4) | `PTRACE_EVENT_VFORK` on `vfork()` |
    /// | 3 | `PTRACE_O_TRACECLONE` (8) | `PTRACE_EVENT_CLONE` on `clone()` |
    /// | 4 | `PTRACE_O_TRACEEXEC` (16) | `PTRACE_EVENT_EXEC` on `execve()` |
    /// | 5 | `PTRACE_O_TRACEVFORKDONE` (32) | `PTRACE_EVENT_VFORK_DONE` after vfork child runs |
    /// | 6 | `PTRACE_O_TRACEEXIT` (64) | `PTRACE_EVENT_EXIT` on `do_exit()` |
    /// | 7 | `PTRACE_O_TRACESECCOMP` (128) | `PTRACE_EVENT_SECCOMP` on seccomp filter match |
    /// | 20 | `PTRACE_O_EXITKILL` (1 << 20) | Send SIGKILL to tracee when tracer exits |
    /// | 21 | `PTRACE_O_SUSPEND_SECCOMP` (1 << 21) | Suspend tracee's seccomp filters |
    ///
    /// AtomicU32 for lock-free reads from `ptrace_event()`. Writes are
    /// protected by `sighand.lock`.
    pub options: AtomicU32,

    /// The tracee's original parent PID, saved at attach time. When the tracee
    /// is attached, `Process.parent` is temporarily set to the tracer's PID
    /// (so that `waitpid()` on the tracer reports tracee stops). On detach or
    /// tracer death, `Process.parent` is restored to `saved_parent`.
    ///
    /// This matches Linux's `real_parent` / `parent` distinction:
    /// - `real_parent` = biological parent (UmkaOS: `saved_parent`)
    /// - `parent` = current parent, which is the tracer during ptrace
    pub saved_parent: ProcessId,

    /// Pending ptrace event type. Set by `ptrace_event()`, consumed by
    /// `waitpid()` on the tracer side. The event type is encoded in the
    /// wait status as `(event << 8) | SIGTRAP` — the tracer extracts it
    /// via `WPTRACEEVENT(status)` (a Linux-compatible macro: `status >> 16 & 0xff`).
    ///
    /// Protected by `sighand.lock` (set by the tracee, read by the tracer
    /// under lock during `wait4()` processing). Uses `Cell` for interior
    /// mutability: `ptrace_event()` writes through `&PtraceState` (shared
    /// reference obtained from `task.ptrace.as_ref()`).
    pub pending_event: Cell<Option<PtraceEventType>>,

    /// Ptrace message value. Set by `ptrace_event()` alongside the event,
    /// read by the tracer via `PTRACE_GETEVENTMSG` (0x4201). The meaning
    /// depends on the event type:
    ///
    /// | Event | Message value |
    /// |-------|--------------|
    /// | `PTRACE_EVENT_FORK` | Child PID |
    /// | `PTRACE_EVENT_VFORK` | Child PID |
    /// | `PTRACE_EVENT_CLONE` | Child TID |
    /// | `PTRACE_EVENT_EXEC` | Old thread-group leader TID |
    /// | `PTRACE_EVENT_VFORK_DONE` | Child PID |
    /// | `PTRACE_EVENT_EXIT` | Exit status (wait-encoded) |
    /// | `PTRACE_EVENT_SECCOMP` | Seccomp `SECCOMP_RET_DATA` value |
    /// | `PTRACE_EVENT_STOP` | 0 |
    ///
    /// AtomicU64 for lock-free reads from `PTRACE_GETEVENTMSG`.
    pub message: AtomicU64,

    /// Intrusive list link for the tracer's `ptraced_children` list.
    /// All tasks being traced by the same tracer are linked through this node.
    /// Used by `exit_ptrace()` to iterate all tracees when the tracer exits.
    pub tracer_list_link: IntrusiveListLink,

    /// Saved signal information for the current ptrace stop. Set before
    /// entering `TASK_TRACED`; the tracer reads it via `PTRACE_GETSIGINFO`
    /// and can modify it via `PTRACE_SETSIGINFO` (to suppress or alter the
    /// signal before resuming the tracee).
    pub last_siginfo: Option<SigInfo>,

    /// Credentials of the tracer at attach time. Captured during
    /// `PTRACE_ATTACH` / `PTRACE_SEIZE` to support credential checks
    /// for operations that occur after the tracer's credentials may have
    /// changed (e.g., the tracer calls `setuid()` after attaching).
    /// Matches Linux's `ptracer_cred` on `task_struct`.
    pub ptracer_cred: Arc<TaskCredential>,
}

Tracer-side state: The tracer Task maintains a list of all tasks it is currently tracing:

// In Task struct ([Section 8.1](08-process.md#process-and-task-management--task-model)):

/// List of all tasks this task is tracing (as a ptrace tracer).
/// Each tracee's `PtraceState.tracer_list_link` is linked into this list.
/// Walked by `exit_ptrace()` when this task exits. Empty for non-tracer tasks.
///
/// Protected by `process_tree_write_lock` (the global process tree lock) — matching
/// Linux's protocol where `task->ptraced` list is modified under
/// `write_lock(process_tree_write_lock)`.
///
/// Wrapped in `SpinLock` for interior mutability: `ptrace_init_child()` pushes
/// onto this list through `Arc<Task>` (shared reference). The
/// `process_tree_write_lock` provides logical synchronization; the SpinLock
/// satisfies Rust's borrow checker.
pub ptraced_children: SpinLock<IntrusiveList<PtraceState, tracer_list_link>>,

20.4.2.2 Ptrace Event Types

/// Ptrace event type, matching Linux's `PTRACE_EVENT_*` constants from
/// `include/uapi/linux/ptrace.h`. The numeric values are ABI — they appear
/// in wait status words observed by userspace debuggers.
///
/// The event is encoded in the wait status as: `(event << 8) | SIGTRAP`.
/// Userspace extracts it with: `(status >> 16) & 0xff`.
#[repr(u32)]
pub enum PtraceEventType {
    /// Child created via `fork()`. Message: child PID.
    /// Only delivered if `PTRACE_O_TRACEFORK` is set.
    Fork = 1,
    /// Child created via `vfork()`. Message: child PID.
    /// Only delivered if `PTRACE_O_TRACEVFORK` is set.
    Vfork = 2,
    /// Child created via `clone()`. Message: child TID.
    /// Only delivered if `PTRACE_O_TRACECLONE` is set.
    Clone = 3,
    /// `execve()` completed. Message: old thread-group leader TID.
    /// Only delivered if `PTRACE_O_TRACEEXEC` is set. Without this option,
    /// a legacy `SIGTRAP` is sent instead (for `PT_PTRACED`-only tracees).
    Exec = 4,
    /// vfork child has released the parent's address space (called `exec()`
    /// or `exit()`). Message: child PID.
    /// Only delivered if `PTRACE_O_TRACEVFORKDONE` is set.
    VforkDone = 5,
    /// Task is about to exit. Message: exit status (wait-encoded).
    /// Only delivered if `PTRACE_O_TRACEEXIT` is set.
    /// This is the tracer's last chance to inspect the dying tracee.
    Exit = 6,
    /// Seccomp filter returned `SECCOMP_RET_TRACE`. Message: `SECCOMP_RET_DATA`.
    /// Only delivered if `PTRACE_O_TRACESECCOMP` is set.
    Seccomp = 7,
    /// Group-stop event for `PTRACE_SEIZE`'d tracees. Message: 0.
    /// Delivered when a seized tracee enters group stop (SIGSTOP, SIGTSTP, etc.).
    /// Value 128 is not in the `(event << 8)` encoding — it uses a different
    /// wait status format: `(signal << 8) | 0x7f` with `PTRACE_EVENT_STOP`
    /// indicated by the `JOBCTL_TRAP_STOP` path.
    Stop = 128,
}
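The wait-status framing above can be exercised standalone. A minimal sketch — the helpers `encode_event_stop`, `wait_event`, `is_stopped`, and `stop_sig` are illustrative, not kernel API; the bit layout matches the encoding documented on the enum:

```rust
const SIGTRAP: i32 = 5;

/// Build the status word waitpid() reports for a ptrace event stop:
/// low byte 0x7f marks "stopped", the next byte carries SIGTRAP, and
/// the event type sits in bits 16-23.
fn encode_event_stop(event: u32) -> i32 {
    let exit_code = ((event as i32) << 8) | SIGTRAP; // value stored in task.exit_code
    (exit_code << 8) | 0x7f                          // waitpid() stopped-status framing
}

/// Extract the event back out — the `(status >> 16) & 0xff` expression
/// userspace debuggers use.
fn wait_event(status: i32) -> u32 {
    ((status >> 16) & 0xff) as u32
}

/// WIFSTOPPED / WSTOPSIG equivalents, for completeness.
fn is_stopped(status: i32) -> bool { status & 0xff == 0x7f }
fn stop_sig(status: i32) -> i32 { (status >> 8) & 0xff }
```

For PTRACE_EVENT_EXEC (4), the tracer observes a status where is_stopped() holds, stop_sig() is SIGTRAP, and wait_event() recovers 4.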

20.4.2.3 Additional PtraceRequest Variants

The following PtraceRequest variants are required for lifecycle management but were not listed in the capability-gated section above. They use the same capability check (CAP_DEBUG on the target process):

/// Additional PtraceRequest variants for lifecycle management.
/// These extend the PtraceRequest enum defined in the Capability-Gated ptrace section.
impl PtraceRequest {
    /// Set ptrace options (event subscriptions).
    /// `data` is a bitmask of `PTRACE_O_*` flags.
    /// Linux value: 0x4200.
    pub const SetOptions: u32 = 0x4200;
    /// Read the ptrace event message (child PID, exit code, etc.).
    /// Linux value: 0x4201.
    pub const GetEventMsg: u32 = 0x4201;
    /// Listen to a `PTRACE_SEIZE`'d tracee without stopping it.
    /// Used during group stop: the tracee remains stopped but the tracer
    /// is notified of subsequent events.
    /// Linux value: 0x4208.
    pub const Listen: u32 = 0x4208;
}
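A follow-forks tracer typically sets several options in one PTRACE_SETOPTIONS call. A sketch of the combined bitmask such a tool might pass — option values from the table in 20.4.2.1, and the helper name is illustrative:

```rust
const PTRACE_O_TRACESYSGOOD: u32 = 1;
const PTRACE_O_TRACEFORK: u32 = 1 << 1;
const PTRACE_O_TRACEVFORK: u32 = 1 << 2;
const PTRACE_O_TRACECLONE: u32 = 1 << 3;
const PTRACE_O_TRACEEXEC: u32 = 1 << 4;
const PTRACE_O_EXITKILL: u32 = 1 << 20;

/// Option word an strace-like "follow forks, kill on detach" tracer
/// would hand to PTRACE_SETOPTIONS.
fn follow_fork_options() -> u32 {
    PTRACE_O_TRACESYSGOOD | PTRACE_O_TRACEFORK | PTRACE_O_TRACEVFORK
        | PTRACE_O_TRACECLONE | PTRACE_O_TRACEEXEC | PTRACE_O_EXITKILL
}
```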

20.4.2.4 Ptrace Event Delivery

When a kernel code path reaches a ptrace-relevant point (fork, exec, exit, seccomp), it calls ptrace_event() to notify the tracer. This function is the single entry point for all ptrace event generation.

/// Check whether a specific ptrace event is enabled for the given task.
///
/// Returns `true` if the task is being ptraced AND the corresponding
/// `PTRACE_O_*` option bit is set in the tracee's `PtraceState.options`.
///
/// This is a hot-path check (called on every fork/exec/clone path).
/// The atomic load is lock-free; the branch is predicted not-taken
/// for non-ptraced tasks (the common case).
#[inline]
fn ptrace_event_enabled(task: &Task, event: PtraceEventType) -> bool {
    let ptrace = match &task.ptrace {
        Some(p) => p,
        None => return false,
    };
    // Map event type to option bit. PTRACE_O_TRACE* flags are (1 << event).
    // Exception: PTRACE_EVENT_STOP (128) has no option gate — it is always
    // delivered for PTRACE_SEIZE'd tracees during group stop.
    let event_val = event as u32;
    if event_val == PtraceEventType::Stop as u32 {
        return ptrace.flags.load(Acquire) & PT_SEIZED != 0;
    }
    let option_bit = 1u32 << event_val;
    ptrace.options.load(Acquire) & option_bit != 0
}
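The single-shift gate works because Linux chose the option values so that PTRACE_O_TRACE&lt;EVT&gt; equals 1 &lt;&lt; PTRACE_EVENT_&lt;EVT&gt;. That identity can be asserted standalone, using the constants from the options table above:

```rust
const PTRACE_O_TRACEFORK: u32 = 2;
const PTRACE_O_TRACEEXEC: u32 = 16;
const PTRACE_O_TRACEEXIT: u32 = 64;
const PTRACE_O_TRACESECCOMP: u32 = 128;

/// Option bit gating a given event number (Fork = 1 ... Seccomp = 7),
/// mirroring the shift in ptrace_event_enabled().
fn option_bit_for(event: u32) -> u32 {
    1u32 << event
}
```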

/// Deliver a ptrace event to the tracer.
///
/// If the task is being ptraced and the event is enabled (via `PTRACE_SETOPTIONS`),
/// this function:
/// 1. Stores the event type and message in `PtraceState`.
/// 2. Transitions the task to `TASK_TRACED`.
/// 3. Wakes the tracer's `waitpid()` wait queue.
/// 4. Yields the CPU — the tracee blocks in `TASK_TRACED` until the tracer
///    resumes it (via `PTRACE_CONT`, `PTRACE_DETACH`, etc.).
///
/// If the task is not ptraced or the event is not enabled, this is a no-op.
///
/// **Special case for `PTRACE_EVENT_EXEC`**: If the tracee was attached via
/// legacy `PTRACE_ATTACH` (not `PTRACE_SEIZE`) and `PTRACE_O_TRACEEXEC` is NOT
/// set, a plain `SIGTRAP` is sent instead. This preserves legacy debugger
/// compatibility (GDB < 7.0).
///
/// # Call sites
///
/// | Event | Call site | Message |
/// |-------|-----------|---------|
/// | `Fork` | `do_fork()`, after child PID allocated | Child PID |
/// | `Vfork` | `do_fork()`, after child PID allocated | Child PID |
/// | `Clone` | `do_fork()`, after child TID allocated | Child TID |
/// | `Exec` | `setup_new_exec()` step 5i, called from `do_execveat_common()` | Old leader TID |
/// | `VforkDone` | `do_fork()`, after vfork child resumes parent | Child PID |
/// | `Exit` | `do_exit()`, Step 1b | Exit status |
/// | `Seccomp` | `__seccomp_filter()`, on `RET_TRACE` | `RET_DATA` |
/// | `Stop` | `do_signal_stop()`, on group stop entry | 0 |
fn ptrace_event(task: &Task, event: PtraceEventType, message: u64) {
    if !ptrace_event_enabled(task, event) {
        // Special case: legacy exec notification.
        if event as u32 == PtraceEventType::Exec as u32 {
            if let Some(ref ptrace) = task.ptrace {
                let flags = ptrace.flags.load(Acquire);
                if flags & PT_PTRACED != 0 && flags & PT_SEIZED == 0 {
                    // Legacy PTRACE_ATTACH without PTRACE_O_TRACEEXEC:
                    // send a plain SIGTRAP instead of a ptrace event.
                    send_sig(SIGTRAP, task, /* priv */ false);
                }
            }
        }
        return;
    }

    // Acquire sighand lock for state transition.
    let _guard = task.process.sighand.lock.lock();

    let ptrace = task.ptrace.as_ref().unwrap(); // Safe: ptrace_event_enabled returned true.

    // Store event and message.
    ptrace.pending_event.set(Some(event));
    ptrace.message.store(message, Release);

    // Encode the wait status: (event << 8) | SIGTRAP.
    // The tracer's waitpid() will see this as the exit status.
    let exit_code = ((event as i32) << 8) | SIGTRAP as i32;
    task.exit_code.store(exit_code, Release);

    // Transition to TASK_TRACED. The Release store acts as the barrier:
    // all preceding stores (event, message, exit_code) are visible to the
    // tracer before it observes the state transition (its waitpid() path
    // loads the state with Acquire).
    task.state.store(TaskState::TRACED.bits(), Release);
    task.jobctl.fetch_or(JOBCTL_TRACED, Release);

    // Clear any pending trap flags to prevent double-stop.
    task.jobctl.fetch_and(
        !(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY | JOBCTL_TRAPPING),
        Release,
    );

    drop(_guard);

    // Notify the tracer. The tracer is the current "parent" for ptrace
    // purposes, so wake its wait_chldexit queue.
    if ptrace.tracer.upgrade().is_some() {
        // CLD_TRAPPED tells the tracer this is a ptrace stop, not a job
        // control stop or exit.
        do_notify_parent_cldstop(task, /* for_tracer */ true, CLD_TRAPPED);
    }

    // Yield CPU. The task remains in TASK_TRACED until the tracer
    // resumes it (PTRACE_CONT, PTRACE_DETACH, PTRACE_SINGLESTEP, etc.)
    // or the tracer exits (triggering auto-detach).
    schedule();
}

20.4.2.5 Ptrace Exit Protocol

When a task exits, ptrace relationships must be cleaned up. There are three scenarios:

  1. The dying task is a tracee — notify the tracer one last time (PTRACE_EVENT_EXIT).
  2. The dying task is a tracer — detach from all tracees, restore their parents.
  3. Both — the dying task is simultaneously a tracee and a tracer (rare but valid: e.g., strace is itself being debugged by another debugger). Both scenarios execute.
20.4.2.5.1 Step 1b: exit_ptrace() in do_exit()

exit_ptrace() is called from do_exit() as Step 1b — immediately after PF_EXITING is set (Step 1) and before timer cancellation. This diverges from Linux's ordering in two ways:

  1. PF_EXITING timing: Linux calls ptrace_event(PTRACE_EVENT_EXIT, code) in do_exit() after synchronize_group_exit() but BEFORE exit_signals() sets PF_EXITING. UmkaOS sets PF_EXITING first (Step 1) for signal inhibition, then does ptrace notification (Step 1b). This means the tracee has PF_EXITING set during the PTRACE_EVENT_EXIT stop, which may be visible to the tracer via /proc/[pid]/status.
  2. Tracer cleanup timing: Linux calls exit_ptrace() (tracer cleanup) late, in exit_notify() → forget_original_parent(), near the end of do_exit(). UmkaOS consolidates both tracee notification and tracer cleanup into Step 1b for clarity — the tracee notification blocks (so nothing below runs until the tracer releases it), and the tracer cleanup only modifies other tasks' parent pointers (no dependency on the dying task's later teardown steps).

See Section 8.2 for the full do_exit() sequence.

/// Called from do_exit() as Step 1b, after PF_EXITING is set.
///
/// Handles both scenarios:
/// - If this task is a tracee: deliver PTRACE_EVENT_EXIT to the tracer.
/// - If this task is a tracer: detach from all tracees, restore parents.
///
/// # Locking
///
/// Scenario 1 (tracee): acquires `sighand.lock` via `ptrace_event()`.
/// Scenario 2 (tracer): acquires `process_tree_write_lock` (write) to modify the
/// process tree (reparenting tracees).
fn exit_ptrace(task: &Task) {
    // ---- Scenario 1: This task is a tracee ----
    //
    // If PTRACE_O_TRACEEXIT is set, the tracer wants a final notification.
    // The tracee enters TASK_TRACED, giving the tracer one last chance to
    // inspect registers, memory, and exit code before the task is destroyed.
    //
    // After the tracer resumes us (PTRACE_CONT or PTRACE_DETACH), we
    // continue with the rest of do_exit().
    if task.ptrace.is_some() {
        ptrace_event(task, PtraceEventType::Exit, task.exit_code.load(Acquire) as u64);
        // Note: ptrace_event() only stops if PTRACE_O_TRACEEXIT is enabled.
        // If not enabled, it's a no-op and we proceed immediately.
    }

    // ---- Scenario 2: This task is a tracer ----
    //
    // Detach from every task this task is tracing. Each tracee gets its
    // original parent restored, and stopped tracees are woken.
    if task.ptraced_children.lock().is_empty() {
        return;
    }

    // Write-lock the tasklist to safely modify parent pointers.
    let _tasklist = process_tree_write_lock.write();

    // Iterate all tracees. We drain the list because the tracer is dying —
    // there is no need to preserve the list.
    while let Some(ptrace_state) = task.ptraced_children.lock().pop_front() {
        // SAFETY: The IntrusiveListLink is embedded in PtraceState, which is
        // owned by the tracee's Task. We hold process_tree_write_lock, so the tracee
        // cannot concurrently detach itself.
        let tracee = ptrace_state_to_task(ptrace_state);

        // If PTRACE_O_EXITKILL is set, kill the tracee.
        // This is used by sandboxes (e.g., Chromium) to ensure the sandboxed
        // process cannot outlive its supervisor.
        if ptrace_state.flags.load(Acquire) & PT_EXITKILL != 0 {
            send_sig(SIGKILL, tracee, /* priv */ true);
        }

        // Perform the actual detach (shared with PTRACE_DETACH).
        ptrace_detach_locked(task, tracee);
    }

    drop(_tasklist);
}
20.4.2.5.2 ptrace_detach_locked() — Core Detach Mechanism

This function is the shared implementation for both explicit PTRACE_DETACH and tracer-exit auto-detach. It must be called with process_tree_write_lock held for write (to safely modify the process tree).

/// Detach a tracee from its tracer. Called with `process_tree_write_lock` held for write.
///
/// Performs:
/// 1. Unlinks the tracee from the tracer's `ptraced_children` list.
/// 2. Restores the tracee's parent to `saved_parent`.
/// 3. Clears the tracee's `PtraceState`.
/// 4. Wakes a stopped tracee (TASK_TRACED → RUNNING).
/// 5. Handles zombie tracees (reparent notification to real parent).
///
/// # Preconditions
/// - `process_tree_write_lock` is held for write.
/// - `tracee.ptrace` is `Some(...)` and references `tracer`.
///
/// # Postconditions
/// - `tracee.ptrace` is `None`.
/// - `tracee.process.parent` is restored to `saved_parent`.
/// - If tracee was TASK_TRACED, it is now RUNNING (or back in group stop).
fn ptrace_detach_locked(tracer: &Task, tracee: &Task) {
    let ptrace_state = tracee.ptrace.as_ref().unwrap();
    let saved_parent = ptrace_state.saved_parent;

    // Unlink from the tracer's list (if not already unlinked by the caller).
    if ptrace_state.tracer_list_link.is_linked() {
        // Already popped by exit_ptrace's drain loop, or needs explicit unlink
        // if called from the PTRACE_DETACH syscall path.
        ptrace_state.tracer_list_link.unlink();
    }

    // Restore the tracee's parent.
    tracee.process.parent.store(saved_parent.as_u64(), Release);

    // Clear ptrace job control flags.
    tracee.jobctl.fetch_and(
        !(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY | JOBCTL_TRAPPING
          | JOBCTL_TRACED | JOBCTL_LISTENING | JOBCTL_TRAP_FREEZE
          | JOBCTL_PTRACE_FROZEN),
        Release,
    );

    // Handle task state transition.
    let state = tracee.state.load(Acquire);
    if state == TaskState::TRACED.bits() {
        // The tracee was stopped in a ptrace-stop. We need to either:
        // (a) Wake it to resume execution, or
        // (b) Transition it to TASK_STOPPED if a group stop is active.
        //
        // Check if a group stop is in progress for the tracee's thread group.
        if tracee.jobctl.load(Acquire) & JOBCTL_STOP_PENDING != 0 {
            // Group stop is active: transition to TASK_STOPPED (not RUNNING).
            // The tracee participates in the group stop under its real parent.
            // JOBCTL_STOP_PENDING is already set — that is what we checked.
            tracee.state.store(TaskState::STOPPED.bits(), Release);
        } else {
            // No group stop: wake the tracee. It resumes execution from
            // wherever it was stopped (ptrace_event/ptrace_stop call site).
            tracee.state.store(TaskState::RUNNING.bits(), Release);
            try_to_wake_up(tracee, TASK_TRACED, WakeFlags::empty());
        }
    } else if state == TaskState::ZOMBIE.bits() {
        // The tracee is already a zombie. During ptrace, the zombie is
        // "held" by the tracer (the tracer is the current parent, so
        // waitpid on the tracer would reap it). Now that we're detaching,
        // the zombie must be reparented to its real parent.
        //
        // Notify the real parent so it can reap the zombie.
        if saved_parent != tracer.process.pid {
            let real_parent = pid_lookup(saved_parent);
            if let Some(parent_task) = real_parent {
                do_notify_parent(tracee, parent_task);
                wake_up_interruptible(&parent_task.process.wait_chldexit);
            }
        }
        // If the real parent ignores SIGCHLD, auto-reap the zombie.
        if let Some(parent_task) = pid_lookup(saved_parent) {
            if parent_ignores_sigchld(parent_task) || parent_has_sa_nocldwait(parent_task) {
                tracee.state.store(TaskState::DEAD.bits(), Release);
                release_task(tracee);
            }
        }
    }

    // Free the PtraceState. After this, tracee.ptrace is None.
    // This drops the Weak<Task> reference to the tracer and the
    // Arc<TaskCredential> for the ptracer credentials.
    // SpinLock provides interior mutability through &Task.
    *tracee.ptrace.lock() = None;
}

20.4.2.6 Explicit PTRACE_DETACH

When the tracer calls ptrace(PTRACE_DETACH, tracee_pid, 0, signal):

/// Handle the PTRACE_DETACH request from the tracer.
///
/// # Arguments
/// - `tracer`: The calling task (must be the tracee's current tracer).
/// - `tracee`: The tracee to detach from.
/// - `signal`: Signal to deliver to the tracee on resume (0 = none).
///   Must be a valid signal number (1..=64) or 0.
///
/// # Errors
/// - `ESRCH`: tracee is not being traced by this tracer.
/// - `EINVAL`: invalid signal number.
/// - `EIO`: tracee is not in a ptrace-stop (required for PTRACE_DETACH
///   on legacy PTRACE_ATTACH; not required for PTRACE_SEIZE'd tracees).
fn ptrace_detach(tracer: &Task, tracee: &Task, signal: u32) -> Result<(), SyscallError> {
    // Validate signal.
    if signal > 64 {
        return Err(SyscallError::new(EINVAL));
    }

    // Verify the tracer relationship.
    let ptrace_state = tracee.ptrace.as_ref()
        .ok_or(SyscallError::new(ESRCH))?;
    if ptrace_state.tracer_pid != tracer.process.pid {
        return Err(SyscallError::new(ESRCH));
    }

    // Architecture-specific cleanup (e.g., clear single-step flag).
    arch::current::debug::ptrace_disable(tracee);

    // Acquire process_tree_write_lock for write to safely modify the process tree.
    let _tasklist = process_tree_write_lock.write();

    // Set the tracee's exit_code to the signal. When the tracee resumes
    // from TASK_TRACED, the signal delivery path picks up this value.
    if signal > 0 {
        tracee.exit_code.store(signal as i32, Release);
    }

    // Perform the actual detach.
    ptrace_detach_locked(tracer, tracee);

    drop(_tasklist);
    Ok(())
}

20.4.2.7 Tracer Death Semantics

When the tracer exits before its tracees, all tracees are automatically detached. This is handled by Scenario 2 of exit_ptrace(). The specific semantics:

| Tracee state at tracer death | Action |
|------------------------------|--------|
| TASK_TRACED (ptrace-stopped) | Woken; resumes execution (or enters group stop if one is active) |
| TASK_TRACED + PT_EXITKILL | Sent SIGKILL before detach; woken to process the signal |
| TASK_RUNNING (between stops) | Parent restored; no immediate effect on execution |
| TASK_STOPPED (group stop) | Parent restored; remains stopped (SIGCONT from real parent resumes) |
| TASK_ZOMBIE | Reparented to saved_parent; real parent notified via SIGCHLD |
| TASK_INTERRUPTIBLE / TASK_UNINTERRUPTIBLE | Parent restored; no immediate wake (task was not ptrace-stopped) |

Invariant: After exit_ptrace() completes, no task anywhere in the system has a PtraceState.tracer that references the dying task. All tracees have been fully detached and reparented.

20.4.2.8 Interaction with Group Stop

When a ptraced task enters group stop (SIGSTOP / SIGTSTP / SIGTTIN / SIGTTOU), the interaction between ptrace and job control follows this protocol:

  1. The tracee enters TASK_TRACED (not TASK_STOPPED) with JOBCTL_TRAP_STOP set.
  2. The tracer receives a PTRACE_EVENT_STOP notification (for PTRACE_SEIZE'd tracees) or a signal-delivery-stop with the stop signal number (for PTRACE_ATTACH'd tracees).
  3. The tracer can:
     - PTRACE_CONT — resume the tracee (it exits the group stop).
     - PTRACE_LISTEN — acknowledge the stop but let the tracee remain stopped (it transitions to TASK_STOPPED and participates in the group stop normally).
     - PTRACE_DETACH — detach; the tracee transitions to TASK_STOPPED and participates in the group stop under its real parent.
  4. If the tracer dies while the tracee is in this state, exit_ptrace() transitions the tracee to TASK_STOPPED (matching the "group stop active" branch in ptrace_detach_locked()).

20.4.2.9 Interaction with exec()

When a ptraced multi-threaded process calls execve():

  1. All threads except the execing thread are killed (standard exec() semantics).
  2. The execing thread's TID changes to the thread-group leader's TID (if different).
  3. If PTRACE_O_TRACEEXEC is set: PTRACE_EVENT_EXEC is generated. The message is the old thread-group leader's TID (so the tracer can detect the TID change).
  4. If PTRACE_O_TRACEEXEC is NOT set and the tracee was attached via legacy PTRACE_ATTACH: a plain SIGTRAP is sent (legacy notification).
  5. The ptrace relationship survives exec(). The tracer continues to trace the process after exec() completes. The tracee's PtraceState is preserved across exec.

20.4.2.10 Ptrace Auto-Attach on Fork/Clone

When a ptraced task calls fork(), vfork(), or clone(), the child may be automatically attached to the tracer depending on the PTRACE_O_* options:

/// Called from do_fork() after the child task is fully initialized but
/// before it is made runnable (wake_up_new_task).
///
/// If the parent is ptraced and the appropriate PTRACE_O_TRACE* option
/// is set, the child is automatically attached to the same tracer.
fn ptrace_init_child(child: &Task, parent: &Task, clone_flags: u64) {
    let parent_ptrace = match &parent.ptrace {
        Some(p) => p,
        None => return,
    };

    // Determine which option flag to check based on clone type.
    let event = if clone_flags & CLONE_VFORK != 0 {
        PtraceEventType::Vfork
    } else if clone_flags & CLONE_THREAD != 0 {
        // CLONE_THREAD uses Clone event, not Fork.
        PtraceEventType::Clone
    } else {
        PtraceEventType::Fork
    };

    if !ptrace_event_enabled(parent, event) {
        return;
    }

    // Auto-attach the child to the parent's tracer.
    // The child inherits the parent's ptrace flags (PT_PTRACED, PT_SEIZED)
    // and options. The child's saved_parent is its real parent (the forking task).
    let tracer = match parent_ptrace.tracer.upgrade() {
        Some(t) => t,
        None => return, // Tracer died — race condition, no-op.
    };

    let child_ptrace = Box::new(PtraceState {
        tracer: Arc::downgrade(&tracer),
        tracer_pid: tracer.process.pid,
        flags: AtomicU32::new(parent_ptrace.flags.load(Acquire)),
        options: AtomicU32::new(parent_ptrace.options.load(Acquire)),
        saved_parent: parent.process.pid, // Real parent is the forking task.
        pending_event: Cell::new(None),
        message: AtomicU64::new(0),
        tracer_list_link: IntrusiveListLink::new(),
        last_siginfo: None,
        ptracer_cred: Arc::clone(&parent_ptrace.ptracer_cred),
    });

    // Link into the tracer's ptraced_children list.
    // Requires process_tree_write_lock (held by do_fork).
    // SpinLock provides interior mutability through Arc<Task>.
    tracer.ptraced_children.lock().push_back(&child_ptrace.tracer_list_link);

    // Update the child's parent to the tracer (for waitpid visibility).
    child.process.parent.store(tracer.process.pid.as_u64(), Release);

    // SpinLock provides interior mutability through &Task.
    *child.ptrace.lock() = Some(child_ptrace);
}
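For reference, Linux's exact event-selection rule (kernel/fork.c) keys on the child's exit signal. A standalone sketch of that rule, using the standard uapi flag values — `CloneTraceEvent` and `clone_trace_event` are illustrative names:

```rust
const CLONE_VFORK: u64 = 0x0000_4000; // uapi clone flag
const CSIGNAL: u64 = 0x0000_00ff;     // mask for the child's exit signal
const SIGCHLD: u64 = 17;

#[derive(Debug, PartialEq)]
enum CloneTraceEvent { Fork, Vfork, Clone }

/// Which ptrace event (and thus which PTRACE_O_TRACE* gate) a new child
/// falls under, per Linux's selection rule.
fn clone_trace_event(clone_flags: u64) -> CloneTraceEvent {
    if clone_flags & CLONE_VFORK != 0 {
        CloneTraceEvent::Vfork
    } else if clone_flags & CSIGNAL != SIGCHLD {
        CloneTraceEvent::Clone // threads and non-SIGCHLD clone() callers
    } else {
        CloneTraceEvent::Fork  // plain fork(): exit signal is SIGCHLD
    }
}
```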

20.4.2.11 Ptrace Flags

/// Ptrace flag constants matching Linux `include/linux/ptrace.h`.
/// These are stored in `PtraceState.flags` (NOT in the user-visible
/// `PTRACE_O_*` options).

/// Task is being ptraced (basic flag, always set).
pub const PT_PTRACED: u32    = 0x0000_0001;
/// Task was attached via PTRACE_SEIZE (not PTRACE_ATTACH).
/// PTRACE_SEIZE'd tracees use different stop semantics:
/// - Group stops generate PTRACE_EVENT_STOP (not signal-delivery-stop).
/// - PTRACE_INTERRUPT can interrupt running tracees.
/// - PTRACE_LISTEN is available for group-stop management.
pub const PT_SEIZED: u32     = 0x0001_0000;
/// Send SIGKILL to tracee when tracer exits (PTRACE_O_EXITKILL).
pub const PT_EXITKILL: u32   = 0x0010_0000;
/// Suspend tracee's seccomp filters (PTRACE_O_SUSPEND_SECCOMP).
/// Requires CAP_SYS_ADMIN on the tracer.
pub const PT_SUSPEND_SECCOMP: u32 = 0x0020_0000;

/// Per-event trace flags, derived from PTRACE_O_TRACE* options.
/// These are set in PtraceState.flags when the tracer calls PTRACE_SETOPTIONS.
/// The mapping is: PT_EVENT_FLAG(event) = (1 << (event + PT_EVENT_FLAG_SHIFT)).
const PT_EVENT_FLAG_SHIFT: u32 = 1; // Bit 1 = SYSGOOD, bit 2 = FORK, etc.

/// Bitmask of all valid user-supplied `PTRACE_O_*` option bits: bits 0-7
/// (TRACESYSGOOD through TRACESECCOMP) plus bits 20-21 (EXITKILL,
/// SUSPEND_SECCOMP). `PTRACE_SETOPTIONS` rejects anything outside this
/// mask with `EINVAL`.
pub const PT_OPTION_MASK: u32 = 0x000000FF | PT_EXITKILL | PT_SUSPEND_SECCOMP;

/// Freeze trap — task should freeze at the next ptrace checkpoint.
/// Used by the cgroup freezer's ptrace-aware freeze protocol.
/// Linux: bit 23. Matches `include/linux/sched/jobctl.h`.
pub const JOBCTL_TRAP_FREEZE: u32 = 1 << 23;
/// Ptrace-frozen state — task is frozen via ptrace (distinct from
/// cgroup-frozen). Linux: bit 24.
pub const JOBCTL_PTRACE_FROZEN: u32 = 1 << 24;
/// `JOBCTL_TRACED` — indicates the task is currently in a ptrace-stop
/// (inside `ptrace_stop()`). Set when entering TASK_TRACED, cleared on
/// resume. Used by the scheduler and signal delivery to identify
/// ptrace-stopped tasks. Linux: bit 27.
pub const JOBCTL_TRACED: u32 = 1 << 27;
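PTRACE_SETOPTIONS validates the user-supplied word against the option mask before storing it. A standalone sketch of that check — `validate_ptrace_options` is an illustrative helper, not the kernel entry point:

```rust
const PT_EXITKILL: u32 = 0x0010_0000;
const PT_SUSPEND_SECCOMP: u32 = 0x0020_0000;
const PT_OPTION_MASK: u32 = 0x0000_00FF | PT_EXITKILL | PT_SUSPEND_SECCOMP;

/// The EINVAL path of PTRACE_SETOPTIONS: reject any bit outside the
/// supported PTRACE_O_* set.
fn validate_ptrace_options(opts: u32) -> Result<u32, i32> {
    const EINVAL: i32 = 22;
    if opts & !PT_OPTION_MASK != 0 { Err(EINVAL) } else { Ok(opts) }
}
```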

20.4.3 Hardware Debug Registers

Each architecture provides hardware breakpoint and watchpoint registers. UmkaOS exposes these through the ptrace interface, with debug register state saved and restored as part of ArchContext on every context switch.

| Architecture | HW Breakpoints | HW Watchpoints | Mechanism |
|--------------|----------------|----------------|-----------|
| x86-64 | 4 total (DR0-DR3, shared with watchpoints) | shared with breakpoints | DR7 configures each of DR0-DR3 as breakpoint or watchpoint (4 total in any combination); DR6 reports status |
| AArch64 | 2-16 (implementation defined) | 2-16 (implementation defined) | DBGBCR/DBGBVR, DBGWCR/DBGWVR |
| ARMv7 | 2-16 (implementation defined) | 2-16 (implementation defined) | DBGBCR/DBGBVR via cp14 |
| RISC-V | via trigger module | via trigger module | tselect, tdata1-tdata3 CSRs |
| PPC32 | 1 (IAC1) | 1-2 (DAC1, DAC2) | IAC/DAC SPRs, DBCR0/DBCR1 control |
| PPC64LE | 1 (CIABR) | 1 (DAWR0), 2 on POWER10 (DAWR0/1) | CIABR, DAWR0/DAWRX0 SPRs |

// umka-core/src/arch/x86_64/context.rs (excerpt)

/// x86-64 debug register state, saved/restored on context switch.
// kernel-internal, not KABI — per-thread debug register save area.
#[repr(C)]
pub struct DebugRegState {
    /// Address breakpoint registers.
    pub dr0: u64,
    pub dr1: u64,
    pub dr2: u64,
    pub dr3: u64,
    /// Debug status register (read on debug exception, cleared after).
    pub dr6: u64,
    /// Debug control register (enables breakpoints, sets conditions).
    pub dr7: u64,
}
// umka-core/src/arch/aarch64/context.rs (excerpt)

// Per ARM DDI 0487 (ARM Architecture Reference Manual)
// ID_AA64DFR0_EL1.BRPs field + 1 gives actual count; 16 is the architectural maximum.
// (DBGDIDR is the AArch32 equivalent; AArch64 uses ID_AA64DFR0_EL1.)
const MAX_HW_BREAKPOINTS: usize = 16;  // AArch64: up to 16 (ID_AA64DFR0_EL1.BRPs+1)
const MAX_HW_WATCHPOINTS: usize = 16;  // AArch64: up to 16 (ID_AA64DFR0_EL1.WRPs+1)
// x86-64: DR0-DR3 = 4 breakpoints, DR0-DR3 = 4 watchpoints (same regs, different config)
// Runtime: query actual count via ID_AA64DFR0_EL1 on AArch64, assume 4 on x86-64

/// AArch64 debug register state. The number of breakpoint/watchpoint
/// register pairs is discovered at boot via ID_AA64DFR0_EL1.
pub struct DebugRegState {
    /// Breakpoint control/value register pairs.
    pub bcr: [u32; MAX_HW_BREAKPOINTS],
    pub bvr: [u64; MAX_HW_BREAKPOINTS],
    /// Watchpoint control/value register pairs.
    pub wcr: [u32; MAX_HW_WATCHPOINTS],
    pub wvr: [u64; MAX_HW_WATCHPOINTS],
    /// Actual number of pairs available on this CPU.
    pub num_brps: u8,
    pub num_wrps: u8,
}

Debug register state is part of ArchContext (Section 8.1) and is saved/restored on every context switch. When a hardware breakpoint fires, the CPU raises a debug exception: #DB on x86-64, a Breakpoint exception (EC=0x30/0x31) on AArch64, a trigger-module breakpoint exception on RISC-V. Hardware watchpoints on AArch64 raise a Watchpoint exception (EC=0x34/0x35). The exception handler checks whether the faulting thread is being ptraced. If so, the event is delivered to the debugger as a SIGTRAP with si_code set to TRAP_HWBKPT. If the thread is not being debugged, the signal is delivered directly to the process (default action: terminate with core dump).
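To make the x86-64 side of this concrete, here is a hedged sketch of encoding the DR7 enable and condition bits for one slot. The function name is hypothetical; the bit layout follows the Intel SDM, Vol. 3 (bit 2n is the local-enable Ln for slot n; the condition nibble for slot n sits at bit 16+4n, with R/Wn in the low two bits — 00 execute, 01 write, 11 read/write — and LENn in the high two — 00 = 1 byte, 01 = 2, 11 = 4, 10 = 8):

```rust
/// Hypothetical helper: compute the DR7 bits that arm breakpoint slot
/// `slot` (0..=3) with condition `rw` and length `len`, per the Intel
/// SDM Vol. 3 DR7 layout. Not UmkaOS API — an illustrative sketch.
pub fn dr7_arm(slot: u64, rw: u64, len: u64) -> u64 {
    assert!(slot < 4 && rw <= 0b11 && len <= 0b11);
    let enable = 1u64 << (2 * slot);                 // local enable Ln
    let cond = (rw | (len << 2)) << (16 + 4 * slot); // R/Wn + LENn nibble
    enable | cond
}
```

The value would be OR-ed into the saved dr7 field of DebugRegState above; clearing a slot masks the same bits back out.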

20.4.4 Core Dump Generation

On receipt of a fatal signal (SIGSEGV, SIGABRT, SIGFPE, SIGBUS, SIGILL, SIGSYS), the kernel generates an ELF core dump before terminating the process.

Contents of a core dump:

  • Register state (general-purpose, floating-point, vector, debug registers)
  • Memory mappings (VMA list with permissions, file backing, offsets)
  • Writable memory segments (stack, heap, anonymous mappings)
  • Signal information (siginfo_t for the fatal signal)
  • Auxiliary vector (AT_* entries)
  • Thread list with per-thread register state (for multi-threaded processes)

Capability gating: Core dump generation requires write access to the dump destination. The process's capability set must include WRITE on the target path (a filesystem location) or WRITE on the pipe to a handler program (configured via /proc/sys/kernel/core_pattern, same as Linux). If the process does not hold the required capability, no dump is written and the kernel logs a diagnostic message.

Core dump filter: A per-process bitmask controls which VMA types are included, compatible with Linux's /proc/pid/coredump_filter:

Bit VMA Type Default
0 Anonymous private on
1 Anonymous shared on
2 File-backed private off
3 File-backed shared off
4 ELF headers on
5 Private huge pages on
6 Shared huge pages off
7 Private DAX pages off
8 Shared DAX pages off
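The filter check itself reduces to a bit test. A minimal sketch, assuming hypothetical type and function names (bit 4, ELF headers, is not a VMA kind and would be handled separately in the dump writer):

```rust
/// Illustrative VMA properties relevant to the coredump_filter bits.
#[derive(Clone, Copy)]
pub struct VmaProps {
    pub file_backed: bool,
    pub shared: bool, // MAP_SHARED vs MAP_PRIVATE
    pub huge: bool,
    pub dax: bool,
}

/// Default filter: bits 0, 1, 4, 5 set (anonymous private/shared,
/// ELF headers, private huge pages), matching the table's defaults.
pub const DEFAULT_COREDUMP_FILTER: u32 = 0b0011_0011;

/// Map a VMA to its filter bit, following the table above.
fn filter_bit(v: VmaProps) -> u32 {
    match (v.dax, v.huge, v.file_backed, v.shared) {
        (true, _, _, false) => 7, // private DAX pages
        (true, _, _, true) => 8,  // shared DAX pages
        (_, true, _, false) => 5, // private huge pages
        (_, true, _, true) => 6,  // shared huge pages
        (_, _, true, false) => 2, // file-backed private
        (_, _, true, true) => 3,  // file-backed shared
        (_, _, false, false) => 0, // anonymous private
        (_, _, false, true) => 1,  // anonymous shared
    }
}

/// True if this VMA's contents should be written to the core dump.
pub fn should_dump(filter: u32, v: VmaProps) -> bool {
    filter & (1 << filter_bit(v)) != 0
}
```

With the default filter, an anonymous private mapping (heap) is dumped while a file-backed private mapping (the text segment of a shared library) is skipped, since its contents can be recovered from the file on disk.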

Compressed core dumps: When umka.coredump_compress=zstd is set (boot parameter or runtime sysctl), core dumps are compressed with zstd at level 3 before writing. For large processes (multi-GB heaps), this reduces dump size by 5-10x and shortens I/O time, making core dumps practical in production. The resulting file is a standard zstd frame that tools can decompress before loading into GDB.
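Because a compressed dump is a standard zstd frame, tooling can distinguish it from a plain ELF core by sniffing the first four bytes. A minimal sketch (the function name is illustrative; the magic values are the published zstd frame magic from RFC 8878 and the ELF identification bytes):

```rust
/// zstd frame magic number 0xFD2FB528 (RFC 8878), little-endian on disk.
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];
/// ELF identification: 0x7F 'E' 'L' 'F'.
const ELF_MAGIC: [u8; 4] = [0x7F, b'E', b'L', b'F'];

/// Classify a core dump file by its leading bytes.
pub fn core_dump_kind(header: &[u8]) -> &'static str {
    if header.starts_with(&ZSTD_MAGIC) {
        "zstd-compressed core"
    } else if header.starts_with(&ELF_MAGIC) {
        "plain ELF core"
    } else {
        "unknown"
    }
}
```

A wrapper script around GDB could use this check to decompress transparently before attaching.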

20.4.5 Kernel Debugging and Crash Dumps

Kernel panic handler: On kernel panic, the handler (Section 3.14, Tier 0 code) captures a comprehensive snapshot of system state:

  1. Register state of all CPUs — The panicking CPU sends an IPI (or NMI on x86-64 if IPIs are not functioning) to all other CPUs. Each CPU saves its register state to a per-CPU crash save area and halts.
  2. Kernel stack — The faulting CPU's kernel stack is captured, with ORC-based unwinding (Section 20.8) to produce a symbolic backtrace.
  3. Kernel log buffer — The most recent 64 KB of the kernel ring buffer (printk output) is included.
  4. Capability table state — A summary of the capability table (number of entries, recent grants/revocations) for post-mortem security analysis.
  5. Driver registry state — Status of all registered drivers, including tier, device bindings, and crash counts.

The panic handler writes all of this into the reserved crash region as an ELF core dump (see Section 11.9 for crash recovery and the NVMe polled write-out path).

kdump equivalent: For systems that require maximum crash dump reliability, UmkaOS supports reserving a crash kernel memory region at boot (umka.crashkernel=256M). On panic, kexec loads the crash kernel, which boots into a minimal environment with a single purpose: write the dump to persistent storage and reboot. The crash kernel is stripped to the minimum — serial driver (Tier 0), block driver (Tier 0, polled mode), ELF writer, and nothing else. No scheduler, no interrupts, no capability system.

GDB remote stub: Development builds can enable a built-in GDB remote protocol stub (umka.gdb=serial or umka.gdb=net). This provides full kernel debugging over a serial port or UDP connection:

  • Set breakpoints in kernel code (software breakpoints via int3 / brk / ebreak)
  • Single-step through kernel execution paths
  • Read and write kernel memory and registers
  • Inspect per-CPU state, thread lists, and capability tables
  • Attach to a running kernel or halt at boot (umka.gdb_wait=1)

The GDB stub is compiled out of release builds (#[cfg(feature = "gdb-stub")]). It is never present in production kernels — this is a development-only facility.

Required GDB RSP (Remote Serial Protocol) packets. The stub implements the following packet set, which provides full-featured kernel debugging compatible with unmodified GDB clients (including IDE integrations such as CLion, VS Code, and Eclipse CDT):

Packet Purpose Notes
? Halt reason Returns S05 (SIGTRAP) or S09 (SIGKILL)
g / G Read/write all registers Architecture-specific register set (x86-64: 24 regs, AArch64: 34 regs, RISC-V: 33 regs)
p n / P n=r Read/write single register Register number n is arch-specific; GDB target descriptions declare the mapping
m addr,len / M addr,len:data Read/write memory Kernel virtual addresses; physical memory access via qRcmd monitor commands
c [addr] / s [addr] Continue / single-step Single-step uses hardware debug facilities: x86-64 EFLAGS.TF, AArch64 MDSCR_EL1.SS, RISC-V dcsr.step
z0,addr,len / Z0,addr,len Software breakpoint insert/remove Replaces instruction with int3 (x86-64) / brk #0 (AArch64) / ebreak (RISC-V); original bytes saved in a per-CPU breakpoint table
z1,addr,len / Z1,addr,len Hardware breakpoint insert/remove Uses debug registers: x86-64 DR0-DR3 with DR7 condition bits, AArch64 BVR0-BVR15/BCR0-BCR15, RISC-V trigger module tdata1/tdata2
z2,addr,len / Z2,addr,len Write watchpoint insert/remove x86-64: DR0-DR3 with DR7 W condition; AArch64: WVR/WCR; RISC-V: trigger module with store match
z3,addr,len / Z3,addr,len Read watchpoint insert/remove x86-64: DR0-DR3 with DR7 R condition; AArch64: WVR/WCR with load match; RISC-V: trigger module with load match
z4,addr,len / Z4,addr,len Access watchpoint (read+write) insert/remove x86-64: DR0-DR3 with DR7 RW condition; AArch64/RISC-V: combined load+store match
qSupported Feature negotiation Reports: PacketSize=4096, vContSupported+, swbreak+, hwbreak+, qXfer:features:read+
vCont Extended continue/step vCont;c continues all CPUs; vCont;s:N single-steps CPU N while others remain halted
H g tid / H c tid Set thread for subsequent register/continue ops tid = CPU number + 1 (GDB thread IDs are 1-based); H g 0 selects the CPU that hit the breakpoint
qfThreadInfo / qsThreadInfo Enumerate threads (CPUs) Returns one "thread" per online CPU; qfThreadInfo returns the first batch, qsThreadInfo returns subsequent batches, terminated by l
qC Current thread Returns the CPU that triggered the debug event (breakpoint, watchpoint, or single-step trap)
qRcmd,hex Remote monitor command Supported commands: reset (warm reboot), panic (trigger kernel panic for crash dump), dump <addr> <len> (hex dump of physical memory), cpuinfo (print per-CPU state summary)
qXfer:features:read:target.xml Target description Provides GDB with the architecture-specific register layout so that register names, sizes, and types are correctly displayed

Multi-CPU handling. When one CPU hits a breakpoint or watchpoint, the GDB stub must halt all other CPUs to present a consistent view of kernel state:

  1. The trapping CPU disables its local interrupts and enters the GDB stub event loop.
  2. The stub sends an NMI (x86-64) or FIQ (AArch64) or IPI (RISC-V, ARMv7) to all other online CPUs with a HALT_FOR_DEBUG flag in the IPI payload.
  3. Each receiving CPU saves its full register state into a per-CPU GdbCpuState struct and spins in a wait loop polling an atomic resume flag.
  4. The GDB stub presents each CPU as a separate GDB thread (thread ID = CPU number + 1). H g N selects CPU N for register reads/writes; H c -1 continues all CPUs when the user issues a continue command.
  5. On continue (c or vCont;c), the stub sets the resume flag for all CPUs. Each CPU restores its register state and returns to its interrupted context. The trapping CPU resumes last, after re-enabling its breakpoint.
// umka-core/src/debug/gdb_stub.rs

/// Per-CPU register state saved when halted for GDB debugging.
/// Each online CPU saves its state here when it receives the
/// `HALT_FOR_DEBUG` IPI, and restores from here on resume.
pub struct GdbCpuState {
    /// Full register file (architecture-specific).
    /// `SavedRegs` = `arch::current::context::SavedRegs`.
    /// Each architecture defines `SavedRegs` as a `#[repr(C)]` struct whose
    /// field order matches the GDB RSP `g`/`G` packet register ordering below.
    /// See [Per-architecture SavedRegs definitions](#per-architecture-savedregs-definitions)
    /// in this section for the authoritative struct layouts (x86-64, AArch64, RISC-V 64).
    pub regs: SavedRegs,

    /// The CPU's current halt reason, or `None` if it was halted by IPI
    /// rather than by hitting a breakpoint/watchpoint itself.
    pub halt_reason: Option<GdbHaltReason>,

    /// Atomic flag polled by the halted CPU. The GDB stub sets this to
    /// `true` when the user issues a continue/step command.
    pub resume: AtomicBool,

    /// If single-stepping, this is `true` and the CPU will re-enter the
    /// stub after executing one instruction.
    pub single_step: bool,
}

/// Reason a CPU entered the GDB stub.
pub enum GdbHaltReason {
    /// Software breakpoint (`int3` / `brk` / `ebreak`).
    SwBreakpoint { addr: u64 },
    /// Hardware breakpoint (debug register match on instruction fetch).
    HwBreakpoint { addr: u64 },
    /// Write watchpoint (debug register match on store).
    WriteWatchpoint { addr: u64 },
    /// Read watchpoint (debug register match on load).
    ReadWatchpoint { addr: u64 },
    /// Access watchpoint (debug register match on load or store).
    AccessWatchpoint { addr: u64 },
    /// Single-step trap (TF flag / SS bit / dcsr.step).
    SingleStep,
    /// Halted by IPI from another CPU's debug event.
    HaltedByIpi,
}

GDB RSP register ordering for g/G packets. The SavedRegs struct on each architecture is laid out so that a single memcpy produces the exact byte sequence expected by GDB's g (read all registers) and G (write all registers) packets. The qXfer:features:read:target.xml response describes this layout to GDB so that register names display correctly in the debugger UI.

Architecture Register order Count Size (bytes)
x86-64 rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, r8, r9, r10, r11, r12, r13, r14, r15, rip, eflags, cs, ss, ds, es, fs, gs 24 192 (24 x 8)
AArch64 x0, x1, x2, ..., x30, sp, pc, cpsr 34 272 (34 x 8)
RISC-V 64 x0 (zero), x1 (ra), x2 (sp), ..., x31 (t6), pc 33 264 (33 x 8)

Note: eflags, cs, ss, ds, es, fs, and gs on x86-64 are transmitted as 64-bit values (zero-extended from their native 32-bit or 16-bit widths) to maintain uniform 8-byte-per-register packing in the g/G packet. cpsr on AArch64 is similarly zero-extended to 64 bits.

Per-architecture SavedRegs definitions. Each architecture defines SavedRegs in umka_core::arch::<arch>::context. The struct is #[repr(C)] with fields ordered to match the GDB RSP g/G packet layout above, so a single memcpy produces the correct wire bytes. These are the authoritative definitions; the GDB stub and context-switch code share the same type.

// umka_core::arch::x86_64::context

/// x86-64 register file: 24 registers, 192 bytes total.
/// Field order matches the GDB RSP g/G packet layout.
/// Crosses userspace boundary via PTRACE_GETREGS / GDB stub.
#[repr(C)]
pub struct SavedRegs {
    pub rax: u64, pub rbx: u64, pub rcx: u64, pub rdx: u64,
    pub rsi: u64, pub rdi: u64, pub rbp: u64, pub rsp: u64,
    pub r8:  u64, pub r9:  u64, pub r10: u64, pub r11: u64,
    pub r12: u64, pub r13: u64, pub r14: u64, pub r15: u64,
    pub rip: u64,
    /// RFLAGS, zero-extended to 64 bits.
    pub eflags: u64,
    /// Segment registers, each zero-extended from 16 bits to 64 bits.
    pub cs: u64, pub ss: u64, pub ds: u64, pub es: u64,
    pub fs: u64, pub gs: u64,
}
// x86-64 SavedRegs: 24 × u64(8) = 192 bytes.
const _: () = assert!(core::mem::size_of::<SavedRegs>() == 192);

// umka_core::arch::aarch64::context

/// AArch64 register file: 34 registers, 272 bytes total.
/// Crosses userspace boundary via PTRACE_GETREGS / GDB stub.
#[repr(C)]
pub struct SavedRegs {
    /// General-purpose registers x0..x30.
    pub x: [u64; 31],
    /// Stack pointer (SP_EL0 or SP_ELx depending on exception level).
    pub sp: u64,
    /// Program counter.
    pub pc: u64,
    /// Current Program Status Register, zero-extended to 64 bits.
    pub cpsr: u64,
}
// AArch64 SavedRegs: 34 × u64(8) = 272 bytes.
const _: () = assert!(core::mem::size_of::<SavedRegs>() == 272);

// umka_core::arch::riscv64::context

/// RISC-V 64 register file: 33 registers, 264 bytes total.
/// Crosses userspace boundary via PTRACE_GETREGS / GDB stub.
#[repr(C)]
pub struct SavedRegs {
    /// Integer registers x0 (zero) through x31 (t6).
    /// x0 is always 0 but included for index alignment with
    /// the GDB register numbering scheme.
    pub x: [u64; 32],
    /// Program counter.
    pub pc: u64,
}
// RISC-V SavedRegs: 33 × u64(8) = 264 bytes.
const _: () = assert!(core::mem::size_of::<SavedRegs>() == 264);

Transport layer. The GDB stub communicates over either a serial port (umka.gdb=serial, default baud 115200, 8N1) or a UDP socket (umka.gdb=net,addr=10.0.0.1,port=1234). The serial transport uses the standard GDB RSP framing: $packet-data#checksum, with +/- acknowledgment. The UDP transport uses the same framing but without acknowledgment (UDP is already checksummed; retransmission is handled at the GDB client level). The stub processes one packet at a time in a polling loop — no interrupts or DMA are used for transport I/O while the kernel is halted, ensuring deterministic behavior.
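The RSP framing described above is simple enough to sketch directly: the checksum is the modulo-256 sum of the payload bytes, transmitted as two lowercase hex digits. A hedged, self-contained sketch (helper names are illustrative, not the stub's actual API):

```rust
/// Frame a payload as a GDB RSP packet: `$payload#checksum`, where the
/// checksum is the modulo-256 sum of payload bytes in two hex digits.
pub fn rsp_frame(payload: &str) -> String {
    let sum = payload.bytes().fold(0u8, |acc, b| acc.wrapping_add(b));
    format!("${}#{:02x}", payload, sum)
}

/// Verify an incoming frame and return its payload if the checksum
/// matches; `None` on malformed framing or checksum mismatch.
pub fn rsp_parse(frame: &str) -> Option<&str> {
    let inner = frame.strip_prefix('$')?;
    let (payload, csum) = inner.rsplit_once('#')?;
    let expected = u8::from_str_radix(csum, 16).ok()?;
    let actual = payload.bytes().fold(0u8, |acc, b| acc.wrapping_add(b));
    (actual == expected).then_some(payload)
}
```

On the serial transport, a checksum mismatch would be answered with `-` to request retransmission; the UDP transport skips the acknowledgment as described above. (Note: payloads containing `$`, `#`, or `}` require RSP escaping, omitted here for brevity.)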

Driver crash debugging: When a Tier 1 driver crashes, the fault handler (Section 20.1) captures the crash context before initiating recovery. The captured state includes:

  • Driver thread register state (all register classes)
  • Isolation domain state (PKRU value, domain assignment)
  • Driver-private memory snapshot (pages in the driver's isolation domain, up to a configurable limit, default 4 MB)
  • Recent ring buffer entries from the driver's communication channels
  • IOMMU mapping state for the driver's devices

This context is written to:

/sys/kernel/umka/drivers/{name}/crash_dump

The file persists until the next driver load or until explicitly cleared. Tools like umka-crashdump or GDB (with UmkaOS-aware scripts) can parse the dump for root cause analysis without reproducing the crash.

20.4.6 /proc/pid Interface

UmkaOS provides compatibility with the Linux /proc/pid interface that debuggers, profilers, and monitoring tools depend on. Each entry is capability-checked individually.

Path Content Capability Required
/proc/pid/maps Memory mappings (address, perms, offset, device, inode, path) CAP_DEBUG or same-process
/proc/pid/mem Process memory (seek + read/write) CAP_DEBUG + READ/WRITE
/proc/pid/status Task state, memory usage, capability summary None (public fields) or CAP_DEBUG (private fields)
/proc/pid/stack Kernel stack trace CAP_DEBUG + KERNEL_READ
/proc/pid/syscall Current syscall number and arguments CAP_DEBUG
/proc/pid/wchan Wait channel (function name where task is sleeping) None
/proc/pid/coredump_filter Core dump VMA filter bitmask Same-process or CAP_DEBUG

Per-access capability checking: Unlike Linux, where /proc/pid/mem is checked at open() time and then freely readable, UmkaOS checks capabilities on every read() and write() call. This eliminates TOCTOU vulnerabilities where a capability is revoked between open and access — a revoked CAP_DEBUG takes effect immediately, even on already-open file descriptors.

// umka-sysapi/src/procfs/mem.rs

/// Read handler for /proc/pid/mem.
/// Capability is checked on every read, not just on open.
fn proc_pid_mem_read(
    file: &ProcFile,
    buf: &mut [u8],
    offset: u64,
) -> Result<usize, IoError> {
    let caller = current_process();
    let target = file.target_process();

    // Re-check capability on every access (not cached from open).
    if caller.pid() != target.pid() {
        caller.cap_table.lookup(
            target.object_id(),
            PermissionBits::DEBUG | PermissionBits::READ,
        ).map_err(|_| IoError::PermissionDenied)?;
    }

    // Perform the read through the kernel's memory access path.
    // For domain-protected memory, this goes through the PKRU
    // mediation path (Section 11.2).
    target.address_space().read_remote(offset, buf)
}

Public vs. private fields in /proc/pid/status: Fields like State, Pid, PPid, Uid, Gid, and VmSize are considered public and readable without CAP_DEBUG (same as Linux). Fields like VmPeak, VmData, VmStk, CapInh, CapPrm, CapEff, Seccomp, and voluntary_ctxt_switches are private and require CAP_DEBUG on the target. This prevents information leakage that could aid side-channel or timing attacks while preserving compatibility with tools like ps and top that only read public fields.
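The public/private split can be expressed as a simple gating check. A minimal sketch, assuming a hypothetical function name and using the field lists from the text (the full private list is longer than shown here):

```rust
/// Fields readable by anyone, per the text above.
const PUBLIC_FIELDS: &[&str] = &["State", "Pid", "PPid", "Uid", "Gid", "VmSize"];

/// Hypothetical visibility check for a /proc/pid/status field:
/// private fields require CAP_DEBUG on the target process.
pub fn status_field_visible(field: &str, has_cap_debug: bool) -> bool {
    has_cap_debug || PUBLIC_FIELDS.contains(&field)
}
```

When a reader lacks CAP_DEBUG, private lines would simply be omitted from the generated file rather than returning an error, so tools like ps and top keep working unmodified.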


20.5 Unified Object Namespace

Inspired by: Windows NT Object Manager, Plan 9 namespace concepts. IP status: Clean — basic OS design concept from Multics (1960s), any NT patents expired (filed 1989-1993, expired 2009-2013).

20.5.1 Problem

Linux organizes kernel resources through multiple unrelated mechanisms:

  • Files/sockets/pipes → file descriptors (integer indices into per-process table)
  • Processes → PIDs (global integer namespace)
  • Signals → signal numbers (per-process bitmask)
  • IPC → sysv IPC keys, POSIX named semaphores, futex addresses
  • Devices → /dev nodes (major/minor numbers)
  • Kernel tunables → /proc/sys (sysctl)
  • Device tree → /sys (sysfs)
  • Timer resources → timerfd, POSIX timers (separate handle space)
  • Event notification → eventfd, epoll (yet another handle space)

There is no unified way to enumerate "all kernel resources held by process X" or "all kernel resources related to device Y." Each subsystem has its own introspection mechanism (or none at all).

UmkaOS already has a capability system (Section 9.1) where every resource is accessed through capability tokens. The object namespace makes this explicit and queryable — a hierarchical tree where every kernel object has a canonical path and uniform access control.

20.5.2 Design: Kernel-Internal Object Tree

// umka-core/src/namespace/mod.rs (kernel-internal)

/// Every kernel resource is an Object.
pub struct Object {
    /// Unique object ID. The (slot, generation) pair is unique: slots are
    /// reused via a freelist, but the generation counter increments on each
    /// reuse, preventing stale-handle collisions.
    pub id: ObjectId,

    /// Object type (from existing cap/mod.rs ObjectType).
    pub object_type: ObjectType,

    /// Reference count. Bounded by concurrent reference count (not a
    /// monotonic ID); u32 matches Linux refcount_t.
    pub refcount: AtomicU32,

    /// Capability security descriptor.  `SecurityDescriptor` is defined in
    /// [Section 9.1](09-security.md#capability-based-foundation) (after
    /// `ObjectType`). It carries `owner: CredId`, `group: CredId`,
    /// `access_mask: u32`, and `label_id: u32` — the minimum per-object
    /// security metadata for capability validation and MAC enforcement.
    pub security: SecurityDescriptor,

    /// Type-specific data (tagged union).
    pub data: ObjectData,
}

/// Type-specific payload for an `Object`.  Each variant carries the
/// minimum data needed for namespace queries without requiring a full
/// subsystem lookup (e.g., listing devices shows bus type and tier
/// without querying the device registry).
pub enum ObjectData {
    /// Device object — mirrors essential device registry fields.
    Device(DeviceObjectData),
    /// Process object — mirrors essential task fields.
    Process(ProcessObjectData),
    /// Socket object — mirrors essential socket fields.
    Socket(SocketObjectData),
    /// File object — mirrors essential inode fields.
    File(FileObjectData),
    /// Generic object — subsystem-defined payload for object types not
    /// covered by the above variants.
    Generic {
        /// Subsystem type name (e.g., "timer", "semaphore", "shm").
        type_name: ArrayString<32>,
        /// Opaque subsystem-specific data (max 64 bytes).
        extra: [u8; 64],
    },
}

/// Device-specific namespace data.
pub struct DeviceObjectData {
    /// Bus type (PCI, USB, Platform, etc.).
    pub bus: BusType,
    /// Name of the driver currently bound (empty if unbound).
    pub driver_name: ArrayString<32>,
    /// Isolation tier (0 = in-kernel, 1 = domain-isolated, 2 = ring-3).
    pub tier: u8,
}

/// Process-specific namespace data.
pub struct ProcessObjectData {
    /// Process ID.
    pub pid: u32,
    /// Command name (first 16 bytes of argv[0]).
    pub comm: ArrayString<16>,
    /// Real user ID.
    pub uid: u32,
}

/// Socket-specific namespace data.
pub struct SocketObjectData {
    /// Address family (AF_INET = 2, AF_INET6 = 10, AF_UNIX = 1, ...).
    pub family: u16,
    /// Socket type (SOCK_STREAM = 1, SOCK_DGRAM = 2, ...).
    pub sock_type: u16,
    /// Protocol number (IPPROTO_TCP = 6, IPPROTO_UDP = 17, ...).
    pub protocol: u16,
}

/// File-specific namespace data.
pub struct FileObjectData {
    /// Inode number.
    pub ino: u64,
    /// Device number (major:minor encoded).
    pub dev: u64,
    /// File mode (permissions + type bits).
    pub mode: u32,
}

/// The namespace tree.
pub struct ObjectNamespace {
    /// Root directory of the namespace.
    root: ObjectDirectory,

    /// Sparse index for O(1) lookup by ObjectId slot index.
    ///
    /// ObjectId slots are reused (with generation increments), but the
    /// slot index space is sparse — a dense array sized to max slot would
    /// waste memory (e.g., a long-running server creates millions of
    /// short-lived process objects). XArray's radix tree uses memory
    /// proportional to the number of *live* objects, not to the maximum
    /// slot index ever allocated, while providing O(1) lookup and native
    /// RCU-compatible reads.
    ///
    /// The key is the ObjectId's numeric value. Entries are inserted on
    /// registration and removed when the object's refcount reaches zero.
    // Note: `Vec` in these definitions is a kernel-internal equivalent
    // (bounded array, slab-allocated), not `std::vec::Vec`. The `std`-style
    // type name is used here for readability.
    //
    // XArray: O(1) lookup by integer ObjectId with native RCU-compatible reads.
    // Preferred over HashMap for integer-keyed hot-path lookups.
    //
    // SAFETY INVARIANT: The XArray entry is a non-owning raw pointer. The
    // entry MUST be removed via `deregister()` BEFORE the Object's memory
    // is freed. The raw pointer does NOT participate in reference counting.
    // Violation causes a dangling pointer in the namespace index.
    //
    // Lifecycle (all callers MUST follow this order):
    //   register(): XArray insert → object accessible by ObjectId.
    //   deregister(): XArray remove → object no longer findable.
    //   drop(): runs AFTER deregister(); memory freed.
    //
    // The type system does NOT enforce this ordering — it is a convention
    // maintained by code review. All paths that drop an Object (normal
    // close, error cleanup, process exit) MUST call deregister() first.
    //
    // Defense-in-depth: in debug builds, `deregister()` asserts
    // `Arc::strong_count() > 1` to verify the caller still holds a
    // reference after removal. A strong_count of 1 at deregister time
    // means the next `Arc::drop` will free memory with no remaining
    // XArray entry — correct. A strong_count of 0 is impossible (caller
    // has a reference). A strong_count > 1 at deregister time means
    // other references exist — also correct (they will outlive the index
    // entry). The assertion catches the case where deregister is called
    // from the Drop impl itself (strong_count == 1 and about to go to 0).
    index: XArray<*mut Object>,
}

/// A directory in the namespace — contains named references to objects.
pub struct ObjectDirectory {
    /// This directory's object identity.
    object: Object,

    /// Named entries (sorted for binary search). A sorted `Vec` provides
    /// cache-friendly binary search and compact memory layout. Most directories
    /// contain fewer than 100 entries, making `Vec` with binary search optimal.
    /// The largest potential directory (`/Processes`) is accessed by PID via an
    /// `XArray` (integer-keyed O(1) lookup), not iterated through this Vec.
    /// No dynamic switching to `BTreeMap` is needed.
    /// **Bound**: `MAX_DIRECTORY_ENTRIES` (4096). Insertions beyond this limit
    /// return `-ENOSPC`. The bound prevents runaway namespace pollution from
    /// buggy or malicious subsystems. 4096 entries covers the largest practical
    /// directories (e.g., `/Devices` on a system with thousands of devices).
    entries: Vec<(ArrayString<64>, ObjectEntry)>,
}

pub enum ObjectEntry {
    /// Direct reference to an object.
    Object(ObjectId),
    /// Subdirectory.
    ///
    /// `Box` is required here: `ObjectEntry` is stored inside `ObjectDirectory`,
    /// which is stored inside `ObjectEntry::Directory`. Without indirection, the
    /// type would be infinitely sized and Rust would reject it at compile time.
    Directory(Box<ObjectDirectory>),
    /// Symbolic link to another path in the namespace.
    Symlink(ArrayString<256>),
}

20.5.3 Namespace Layout

/                                       (root)
+-- Devices                             (device registry mirror)
|   +-- pci0000:00
|   |   +-- 0000:00:1f.2               (SATA controller)
|   |   +-- 0000:03:00.0               (NVMe)
|   |   +-- 0000:04:00.0               (NIC)
|   +-- usb1
|   |   +-- usb1-1                      (hub)
|   |       +-- usb1-1.1               (keyboard)
|
+-- Drivers                             (loaded driver instances)
|   +-- umka-nvme                       (driver object)
|   +-- umka-e1000                      (driver object)
|   +-- umka-xhci                       (driver object)
|
+-- Processes                           (process objects)
|   +-- 1                               (init/systemd)
|   |   +-- Threads
|   |   |   +-- 1                       (main thread)
|   |   |   +-- 2                       (worker thread)
|   |   +-- Handles                     (fd table: capabilities)
|   |   |   +-- 0                       (stdin - pipe)
|   |   |   +-- 1                       (stdout - tty)
|   |   |   +-- 3                       (socket)
|   |   +-- Memory                      (VMA tree)
|   |   +-- Capabilities                (capability set)
|   +-- 42                              (some user process)
|
+-- Memory                              (physical memory regions)
|   +-- Node0                           (NUMA node 0)
|   +-- Node1                           (NUMA node 1)
|
+-- Network                             (network stack objects)
|   +-- Interfaces
|   |   +-- eth0                        (NIC)
|   |   +-- lo                          (loopback)
|   +-- Sockets                         (open sockets)
|
+-- IPC                                 (IPC endpoints)
|   +-- Pipes
|   +-- SharedMemory
|   +-- Semaphores
|
+-- Security                            (security policy objects)
|   +-- Capabilities                    (capability type registry)
|   +-- LSM                             (security module state)
|
+-- Health                              (FMA — Section 20.1)
|   +-- ByDevice
|   |   +-- 0000:03:00.0               (NVMe health)
|   +-- RetiredPages
|   +-- DiagnosisRules
|
+-- Scheduler                           (scheduler state)
|   +-- RunQueues
|   +-- CbsServers                      (Section 7.3)
|
+-- Tracing                             (Section 20.2)
    +-- StableTracepoints
    +-- AggregationMaps

20.5.4 What The Namespace Provides

Uniform enumeration: "Show me everything related to device 0000:03:00.0" is a namespace traversal starting at /Devices/pci0000:00/0000:03:00.0, following links to its driver instance (/Drivers/umka-nvme), its health data (/Health/ByDevice/0000:03:00.0), and any processes with open handles to it.

Uniform security: Every object in the namespace has a SecurityDescriptor that ties into the capability system. Access checks are uniform regardless of object type.

Uniform lifecycle: Objects are reference-counted. When refcount hits zero, the type-specific destructor runs. No per-subsystem cleanup code — the namespace manages object lifetime uniformly.

Cross-reference discovery: "What processes have handles to this device?" is a query across /Processes/*/Handles/*, filtering by target ObjectId. Without the namespace, this requires per-subsystem ad-hoc code.

20.5.5 Registration Strategy: Eager vs Lazy

Not all kernel objects are registered with equal urgency. High-frequency objects would add unacceptable overhead if registered on every creation:

Eagerly registered (created infrequently, high value for introspection):

  • Devices, drivers, processes, NUMA nodes, cgroups, IPC endpoints, security policies.
  • Registered at creation time, deregistered at destruction.

Lazily registered (created/destroyed at high frequency):

  • File descriptors, sockets, VMAs (virtual memory areas), anonymous pages.
  • The namespace entry is created on first query, not on creation. The namespace maintains a per-process "generation counter"; when a query finds a stale generation, it re-syncs from the kernel's authoritative data structures (fd table, VMA tree).
  • This means /proc/<pid>/umka/objects may have brief inconsistencies (a just-closed fd might still appear), but the hot path (open/close/read/write) has zero overhead.
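The generation-counter protocol keeps the hot path to a single atomic increment. A hedged sketch with illustrative type names (std atomics stand in for the kernel's primitives):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Authoritative per-process fd table (simplified: fd -> object id).
pub struct FdTable {
    generation: AtomicU64,
    fds: Vec<u64>,
}

impl FdTable {
    pub fn new() -> Self {
        FdTable { generation: AtomicU64::new(0), fds: Vec::new() }
    }

    /// Hot path (open/close): mutate the table and bump the generation.
    /// No namespace bookkeeping happens here.
    pub fn open(&mut self, obj: u64) {
        self.fds.push(obj);
        self.generation.fetch_add(1, Ordering::Release);
    }
}

/// Lazily-materialized namespace view of a process's objects.
pub struct NamespaceView {
    synced_generation: u64,
    cached: Vec<u64>,
}

impl NamespaceView {
    pub fn new() -> Self {
        // u64::MAX guarantees the first query observes a stale generation.
        NamespaceView { synced_generation: u64::MAX, cached: Vec::new() }
    }

    /// Query path: re-sync from the authoritative table only when the
    /// cached generation is stale.
    pub fn objects(&mut self, table: &FdTable) -> &[u64] {
        let g = table.generation.load(Ordering::Acquire);
        if self.synced_generation != g {
            self.cached = table.fds.clone();
            self.synced_generation = g;
        }
        &self.cached
    }
}
```

A close racing with a query can still leave a just-closed fd visible for one query, which is exactly the brief inconsistency the text accepts in exchange for a zero-overhead hot path.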

Memory budget per eagerly-registered object:

Component Per Object Notes
Object struct (fixed fields) ~64 bytes ID, type, refcount, security
Namespace entry (name + link) ~80 bytes ArrayString<64> + pointer
XArray index entry ~16 bytes Radix tree node slot
Total per object ~160 bytes

Typical system: ~2000 eagerly registered objects (devices + drivers + processes) = ~320 KB baseline. File descriptors are lazy, so they don't add to this baseline.

20.5.6 Linux Interface Exposure — Standard Mechanisms

The namespace is kernel-internal. Linux applications never see it. But UmkaOS-specific tools can access it through standard Linux interfaces as additive extensions:

Via procfs (new entries under /proc/umka/, additive):

/proc/umka/objects/
    summary             # Total object count by type
    by_type/
        Device          # List of all device objects
        Process         # List of all process objects
        FileDescriptor  # List of all open FDs system-wide
        Socket          # List of all sockets
        Capability      # List of all capability grants
    by_id/
        <object_id>     # Full details of a specific object

/proc/umka/namespace/
    tree                # Full namespace tree (text dump, similar to NT WinObj)
    resolve/<path>      # Resolve a namespace path to an object

/proc/<pid>/umka/
    capabilities        # Full capability set for this process
    objects             # All objects this process holds handles to
    namespace_view      # This process's view of the namespace

Via sysfs (additive attributes on existing device nodes):

/sys/devices/.../umka_object_id     # Object ID in the namespace
/sys/devices/.../umka_refcount      # Current reference count
/sys/devices/.../umka_capabilities  # Capabilities granted for this device

Via a pseudo-filesystem (umkafs, mountable):

mount -t umkafs none /mnt/umka

# Now the object namespace is browsable as a filesystem:
ls /mnt/umka/Devices/pci0000:00/
cat /mnt/umka/Processes/42/Capabilities
cat /mnt/umka/Health/ByDevice/0000:03:00.0/status

This is the same pattern as Linux's debugfs, tracefs, configfs — a pseudo-filesystem for kernel introspection. No new syscalls. Standard open/read/readdir. Any Linux tool that can read files can introspect the namespace.

20.5.7 umkafs Detail

// umka-sysapi/src/umkafs/mod.rs

/// umkafs: pseudo-filesystem exposing the object namespace.
/// Mounted via: mount -t umkafs none /mountpoint
///
/// Read-only by default. Write access (for admin operations like
/// forcing a driver reload or revoking a capability) requires
/// CAP_SYS_ADMIN (mapped to UmkaOS's admin capability set).
pub struct UmkaFs {
    /// Reference to the kernel's object namespace.
    namespace: &'static ObjectNamespace,

    /// Mount options.
    options: UmkaFsMountOptions,
}

pub struct UmkaFsMountOptions {
    /// Show only objects matching this type filter.
    pub type_filter: Option<ObjectType>,

    /// Show only objects accessible to this UID.
    pub uid_filter: Option<u32>,

    /// Maximum depth of directory listing (avoid huge listings).
    pub max_depth: u32,

    /// Enable write operations (default: false).
    pub writable: bool,
}

umkafs file format for object details:

$ cat /mnt/umka/Devices/pci0000:00/0000:03:00.0
type: Device
id: 4217
refcount: 3
state: Active
driver: umka-nvme
tier: 1
bus: pci
vendor: 0x144d
device: 0xa808
class: 0x010802
numa_node: 0
health: ok
power: D0Active
capabilities_granted: 2
  cap[0]: DMA_ACCESS (perms: READ|WRITE)
  cap[1]: INTERRUPT (perms: MANAGE_IRQ)
handles_held_by:
  process 1 (systemd): fd 7
  process 834 (postgres): fd 12
  process 835 (postgres): fd 13

20.5.8 Admin Operations via umkafs (write)

With writable mount option and admin capabilities:

# Force a driver reload (triggers crash recovery path on a healthy driver)
echo "reload" > /mnt/umka/Drivers/umka-nvme/control

# Revoke a specific capability
echo "revoke 4217:0" > /mnt/umka/Security/Capabilities/control

# Disable a device (sets to Error state in registry)
echo "disable" > /mnt/umka/Devices/pci0000:00/0000:04:00.0/control

# Change FMA diagnosis rule threshold
echo "threshold 200" > /mnt/umka/Health/DiagnosisRules/dimm_degradation/control

# Change driver tier (write 0/1/2)
echo "2" > /mnt/umka/Drivers/umka-e1000/tier

These are all standard file write operations. Any shell script, Ansible playbook, or management tool can use them. No custom CLI tools are required.

20.5.9 How Subsystems Register Objects

Each kernel subsystem registers its objects with the namespace during initialization:

// Example: Device registry registers each device node.
// In umka-core/src/registry/mod.rs:

fn register_device_node(&mut self, node: &DeviceNode) {
    // Note: `format!()` here represents `ArrayString::from_fmt()` (stack-allocated,
    // fixed-capacity formatting). The kernel does not use heap-allocated `String`.
    let path = format!("/Devices/{}", node.sysfs_path());
    self.namespace.register(path, Object::from_device(node));
}

// Example: Process creation registers process object.
// In umka-core/src/sched/process.rs:

fn create_process(&mut self, ...) -> Process {
    let proc = Process::new(...);
    // Note: `format!()` here represents `ArrayString::from_fmt()` (stack-allocated,
    // fixed-capacity formatting). The kernel does not use heap-allocated `String`.
    let path = format!("/Processes/{}", proc.pid);
    self.namespace.register(path, Object::from_process(&proc));
    proc
}

// Objects are automatically deregistered when they are destroyed
// (refcount -> 0 triggers namespace removal).

20.5.10 Device Naming and Registration

The /Devices/ hierarchy is kernel-arbitrated: drivers do not choose their own names. The kernel's device model assigns a canonical name at device enumeration time, before any driver binds. This prevents naming collisions and ensures the namespace layout is stable and predictable across reboots.

20.5.10.1 Canonical Naming Convention

Each device type uses a deterministic naming scheme based on its bus topology:

| Device type | Name format | Example |
|---|---|---|
| PCI device | pci-<domain>:<bus>:<slot>.<func> | pci-0000:02:00.0 |
| USB device | usb-<bus>:<port> | usb-1:3 |
| Platform device | platform-<name>-<index> | platform-serial-0 |
| Virtio device | virtio-<type>-<id> | virtio-blk-0 |
| ACPI device | acpi-<hid>-<uid> | acpi-PNP0501-0 |
| NVMe namespace | nvme-<ctrl>n<ns> | nvme-0n1 |

These names are derived purely from bus-reported topology data (domain, bus, slot, function, port, UID). They are stable across reboots as long as the hardware topology does not change, and they are globally unique within a running system because bus topology values are assigned by hardware and are physically distinct.
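
Derivation for the PCI and NVMe cases can be sketched as pure formatting over the bus-reported topology values (function names are hypothetical; the real device model presumably formats into an `ArrayString` rather than a heap `String`):

```rust
/// Canonical PCI name from bus topology: pci-<domain>:<bus>:<slot>.<func>.
/// Deterministic: the same hardware topology always yields the same name.
fn pci_canonical_name(domain: u16, bus: u8, slot: u8, func: u8) -> String {
    format!("pci-{:04x}:{:02x}:{:02x}.{}", domain, bus, slot, func)
}

/// Canonical NVMe namespace name: nvme-<ctrl>n<ns>.
fn nvme_canonical_name(ctrl: u32, ns: u32) -> String {
    format!("nvme-{}n{}", ctrl, ns)
}
```

Because the inputs are hardware-assigned and physically distinct, two devices can only collide if the bus enumerator itself hands out the same address twice (see the collision-prevention discussion below).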

20.5.10.2 Kernel-Arbitrated Registration

Drivers register with the device model via umkafs_register_device(handle), where handle is the device's DeviceHandle obtained from the KABI (Section 11.4). The kernel computes the canonical name from the device's bus topology and returns it to the driver. The driver does not supply a name.

/// Register a device with the umkafs namespace.
///
/// The kernel computes the canonical `/Devices/` name from the device's
/// bus topology (PCI BDF, USB bus:port, ACPI HID:UID, etc.) and creates
/// the namespace entry. The canonical name is returned to the caller so
/// the driver can log it and reference it in diagnostics.
///
/// Returns `Err(UmkafsError::Exists)` if a device with the same canonical
/// name is already registered (indicates a bus enumeration bug — see
/// "Collision Prevention" in unified-object-namespace).
///
/// # Safety
///
/// `handle` must be a valid `DeviceHandle` obtained from the KABI during
/// driver initialization. It must not have been passed to
/// `umkafs_register_device` previously.
pub unsafe fn umkafs_register_device(
    handle: DeviceHandle,
) -> Result<ArrayString<64>, UmkafsError>;

/// Remove a device's umkafs namespace entry.
///
/// Called during driver teardown. Automatically removes all symlinks
/// pointing to this device's canonical path.
///
/// # Safety
///
/// `handle` must be a valid registered `DeviceHandle`. After this call,
/// the handle must not be used with any umkafs API.
pub unsafe fn umkafs_deregister_device(handle: DeviceHandle);

20.5.10.3 Collision Prevention

The kernel MUST reject umkafs_register_device() calls that would produce a duplicate canonical name. Because canonical names are derived deterministically from hardware bus topology, a duplicate can only arise if the device model assigned the same bus address to two distinct devices simultaneously — which indicates a bus enumeration bug in the kernel, not a driver bug.

On rejection:

  • umkafs_register_device() returns Err(UmkafsError::Exists).
  • The kernel emits a KERN_WARNING via printk identifying both the existing and attempted registration, including the canonical name and the bus topology of each conflicting device.
  • The event is also recorded as a HealthEventClass::Generic health event (Section 20.1) on the conflicting device, with HealthSeverity::Warning, so it appears in the FMA ring and can trigger diagnosis rules.

The rejected device is not registered. The driver that called umkafs_register_device for the duplicate receives Err and must fail its initialization, causing the device model to place the device in Error state.
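
The first-registration-wins contract can be sketched with a stub registrar; every type here is a hypothetical stand-in for the KABI handle and umkafs internals:

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum UmkafsError { Exists }

/// Stub mirroring the umkafs_register_device contract: the first
/// registration of a canonical name succeeds; a duplicate returns
/// Err(Exists), and the calling driver must fail its initialization.
struct Registrar { registered: HashSet<String> }

impl Registrar {
    fn register_device(&mut self, canonical: &str) -> Result<String, UmkafsError> {
        if !self.registered.insert(canonical.to_string()) {
            // Duplicate canonical name: bus enumeration bug.
            // The real kernel also emits a KERN_WARNING and an FMA event here.
            return Err(UmkafsError::Exists);
        }
        Ok(canonical.to_string())
    }
}
```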

In addition to the canonical path (e.g., /Devices/pci-0000:02:00.0), the device model publishes human-readable symlinks under class-specific alias directories:

/Devices/by-class/
    block/
        sda  ->  ../../pci-0000:02:00.0        (first block device)
        sdb  ->  ../../pci-0000:03:00.0        (second block device)
    net/
        eth0  ->  ../../pci-0000:04:00.0       (first Ethernet NIC)
    nvme/
        nvme0  ->  ../../pci-0000:05:00.0      (first NVMe controller)

Aliases are assigned in device discovery order. Unlike canonical names, aliases may collide if two drivers independently attempt to claim the same alias (for example, if two NVMe drivers both try to register the nvme0 alias). Collision handling for aliases differs from canonical names:

  • The first registration of a given alias wins.
  • Subsequent registrations that would produce a duplicate alias are assigned a deduplicated name: <alias>~<n> where n is the lowest positive integer that produces a unique name (e.g., sda~1, sda~2).
  • A KERN_WARNING is emitted identifying both the winning and losing alias, along with the canonical names of both devices. Alias collisions are expected to be uncommon (they require two devices of the same class discovered in ambiguous order) and are not treated as errors.

Aliases are reconstructed at each boot in discovery order and are not guaranteed stable across reboots (unlike canonical names). Persistent stable naming for userspace (e.g., /dev/disk/by-id/) is the responsibility of udev rules operating on canonical names and device attributes, not of the umkafs alias directory.
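
The `<alias>~<n>` dedup rule above can be sketched as a small helper (the function is hypothetical and set-based for brevity):

```rust
use std::collections::HashSet;

/// First registration of an alias wins; later duplicates get `<alias>~<n>`
/// with the lowest positive n that is unique (sda -> sda~1 -> sda~2, ...).
fn assign_alias(requested: &str, taken: &mut HashSet<String>) -> String {
    if taken.insert(requested.to_string()) {
        return requested.to_string(); // first claim wins
    }
    let mut n = 1u32;
    loop {
        let candidate = format!("{}~{}", requested, n);
        if taken.insert(candidate.clone()) {
            return candidate; // lowest free suffix
        }
        n += 1;
    }
}
```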

20.5.11 Relationship to Existing Interfaces

The namespace does NOT replace /proc, /sys, or /dev. Those remain as the Linux-compatible interfaces that existing tools depend on. The namespace is an additional unified view:

| Interface | Audience | Purpose | Status |
|---|---|---|---|
| /proc | Linux tools (ps, top, etc.) | Process info, kernel stats | Required for compat |
| /sys | Linux tools (udev, lspci, etc.) | Device tree, attributes | Required for compat |
| /dev | Linux tools (everything) | Device access | Required for compat |
| /proc/umka/* | UmkaOS-aware admin tools | Object namespace queries | New, additive |
| umkafs | UmkaOS-aware admin tools | Full namespace browse/control | New, additive, optional |

The namespace is the kernel-internal source of truth. /proc, /sys, and /dev are generated from it (just as /sys is generated from the device registry, and /proc from process state). The difference is that the namespace provides a unified view where cross-subsystem queries are natural.

20.5.12 Unified Management CLI (umkactl)

Multiple kernel subsystems expose sysfs/procfs interfaces that administrators interact with through different tools (umka-mltool, veritysetup, sysctl, direct sysfs writes). umkactl provides a single-entry-point CLI for UmkaOS system administration.

Design principle: umkactl is a userspace tool that reads from the Unified Object Namespace (Section 20.5) and writes to sysfs/procfs. It is a convenience layer, not a privileged daemon — every operation it performs is also possible via direct sysfs writes or existing tools. This ensures the kernel never depends on umkactl and that scripting/automation can bypass it entirely.

Subcommands:

| Subcommand | Description | Kernel sections |
|---|---|---|
| umkactl device list\|info\|health | Device registry queries | Section 11.4 |
| umkactl driver load\|unload\|tier | Driver management | Section 12.1, Section 11.3 |
| umkactl policy list\|swap\|compare | Policy module management | Section 19.9 |
| umkactl intent set\|status\|explain | Intent management | Section 7.10 |
| umkactl fma rules\|events\|status | Fault management queries | Section 20.1 |
| umkactl evolve status\|apply\|rollback | Live evolution management | Section 13.18 |
| umkactl cluster join\|leave\|status\|nodes | Cluster orchestration | Section 5.1 |
| umkactl power budget\|status | Power budgeting | Section 7.7 |

Argument specification — detailed arguments for each subcommand:

| Subcommand | Arguments | Output |
|---|---|---|
| driver load <name> | --tier <0\|1\|2> | JSON status: {"driver": "<name>", "tier": N, "state": "loaded"} |
| driver unload <name> | --force (skip quiescence wait) | JSON status: {"driver": "<name>", "state": "unloaded"} |
| device health <path> | --format <json\|table> (default: json) | Health data: score, event count, escalation state |
| device list | --bus <pci\|usb\|platform> (filter by bus type) | Device tree: array of {canonical_name, bus, driver, tier, state} |
| trace start | --event <name> --filter <expr> (eBPF filter expression) | Trace ID: {"trace_id": N} |
| trace stop <id> | (none) | Summary: {"trace_id": N, "events_captured": N, "duration_ms": N} |
| fma rules list | (none) | Rule table: array of {rule_id, priority, condition, action, cooldown} |
| fma inject <device> <event> | --severity <info\|warning\|critical\|fatal> | Injection result: {"device": "...", "event": "...", "accepted": true} |

All subcommands accept the global flags: --human (formatted table output), --watch (streaming updates), --cluster (cluster-wide), --node=N (specific node). Exit code: 0 on success, 1 on error, 2 on invalid arguments.

Output format: JSON by default (machine-parseable for automation pipelines). --human flag enables formatted tables for interactive use. --watch streams updates in real-time (poll-based on sysfs inotify).

Cluster mode: When run on a cluster node, subcommands accept a --cluster flag to operate on the cluster fabric via distributed IPC (Section 5.1). umkactl cluster status shows all nodes. umkactl --node=N device list queries a specific remote node. Cluster operations are strictly read-only unless explicitly confirmed with --yes (to prevent accidental cross-node mutations).

Implementation phases:

  • Phase 3: Basic device, driver, and policy commands
  • Phase 4: FMA, intent, and evolution management
  • Phase 5: Cluster commands and cross-node operations


20.6 EDAC — Error Detection and Correction Framework

EDAC (Error Detection and Correction) is the kernel framework for reporting CPU and memory hardware errors to userspace. On production servers, EDAC is the primary early-warning system for failing DIMMs: correctable errors (CEs) reliably precede uncorrectable errors (UEs), giving the system operator time to replace hardware before data loss occurs.

20.6.1 Architecture

EDAC decouples hardware-specific memory controller drivers (e.g., amd64_edac for AMD processors, Intel iMC drivers for Intel platforms) from the common error reporting infrastructure. The framework:

  • Collects error counts from hardware via polling or MCE (Machine Check Exception) notifications.
  • Aggregates errors per DIMM, per channel, per memory controller.
  • Exposes a sysfs hierarchy at /sys/bus/edac/ for Linux-compatible tooling (edac-util, rasdaemon).
  • Integrates with FMA (Section 20.1) for correlated fault detection and proactive response.

The EDAC framework lives in umka-core/src/edac/. Hardware-specific drivers (e.g., umka-core/src/edac/amd64.rs) register McController instances at boot via edac_mc_add_mc() after the memory controller is detected on the bus.

20.6.2 Core Data Structures

// umka-core/src/edac/mod.rs

/// A memory controller — one per CPU socket in NUMA systems.
/// Registered by hardware-specific drivers at boot.
pub struct McController {
    /// Index across all registered controllers (mc0, mc1, ...).
    pub mc_idx:      u32,
    /// DRAM type capabilities this controller supports (DDR4, DDR5, LPDDR5, ...).
    pub mtype_cap:   MemTypeCapability,
    /// Active EDAC mode (what error detection hardware enforces).
    pub edac_mode:   EdacMode,
    /// Error detection capabilities this controller advertises.
    pub edac_cap:    EdacCapability,
    /// Device tree node for this controller.
    /// Arc: EDAC outlives device on server platforms (no hot-remove).
    pub dev:         Arc<DevNode>,
    /// Chip-select rows (ranks) managed by this controller.
    /// Bounded by hardware: max ~32 csrows per MC (AMD Genoa: 12).
    pub csrows:      Vec<Arc<CsRow>>,
    /// Hardware-specific operations table.
    pub ops:         &'static McOps,
    /// Driver name, e.g. "amd64_edac". Null-terminated, fixed length.
    pub ctl_name:    [u8; 32],
    /// Controller label, e.g. "MC#0". Null-terminated, fixed length.
    pub mc_name:     [u8; 32],
    /// Hardware scrub mode (none, HW background scrub, software scrub).
    pub scrub_mode:  ScrubMode,
    /// Total correctable errors observed since boot (or last reset).
    pub ce_count:    AtomicU64,
    /// Total uncorrectable errors observed since boot (or last reset).
    pub ue_count:    AtomicU64,
    /// Polling interval in milliseconds. 0 = interrupt-driven only.
    pub poll_msec:   u32,
}

/// Detected EDAC mode (error detection/correction capability in use).
#[repr(u8)]
pub enum EdacMode {
    /// No error detection.
    None        = 0,
    /// Error detection only (no correction).
    EcOnly      = 1,
    /// Single-bit error correction, double-bit detection (SECDED).
    Secded      = 2,
    /// S4ECD4ED: 4-bit symbol correction, 4-bit detection.
    S4ecd4ed    = 3,
    /// S8ECD8ED: 8-bit symbol correction, 8-bit detection.
    S8ecd8ed    = 4,
    /// S16ECD16ED: 16-bit symbol correction, 16-bit detection.
    S16ecd16ed  = 5,
}

/// One chip-select row — maps to one or two physical DIMMs.
pub struct CsRow {
    pub csrow_idx: u32,
    /// First physical page frame in this rank's address range.
    pub first_page: u64,
    /// Last physical page frame in this rank's address range (inclusive).
    pub last_page:  u64,
    /// Page address mask used for address decoding.
    pub page_mask:  u64,
    /// Smallest addressable unit in bytes (ECC granule).
    pub grain:      u32,
    /// DRAM package type.
    pub dtype:      DimmType,
    /// Per-channel state within this chip-select row.
    /// Bounded by hardware: max 8 channels per rank.
    pub channels:   Vec<CsRowChannel>,
    pub ce_count:   AtomicU64,
    pub ue_count:   AtomicU64,
}

/// One DRAM channel within a chip-select row.
pub struct CsRowChannel {
    pub chan_idx: u32,
    pub ce_count: AtomicU64,
    /// DIMM label. Human-readable slot identifier, settable from userspace.
    /// Example: "CPU_SrcID#0_MC#0_Chan#0_DIMM#0"
    pub label:    [u8; 32],
    /// DIMM information, if a DIMM is populated in this slot.
    pub dimm:     Option<Arc<DimmInfo>>,
}

/// Physical DIMM information.
pub struct DimmInfo {
    pub dimm_idx:  u32,
    pub dtype:     DimmType,
    pub mtype:     MemType,
    pub edac_mode: EdacMode,
    /// Size in pages.
    pub nr_pages:  u64,
    /// ECC granule in bytes.
    pub grain:     u32,
    pub ce_count:  AtomicU64,
    pub ue_count:  AtomicU64,
    /// Human-readable label (set by userspace via sysfs write, e.g. from DMI slot data).
    pub label:     [u8; 64],
}

/// Hardware-specific operations provided by each memory controller driver.
///
/// All methods take `&McController` (shared reference), not `&mut McController`,
/// because the driver list is RCU-protected and access paths yield only `&T`.
/// Mutable state (error counters, polling timestamps) uses interior mutability
/// (`AtomicU64`, `AtomicU32`) within `McController` and `CsRow`.
pub struct McOps {
    /// Poll hardware for new CE/UE counts.
    /// Called every `poll_msec` ms from a kernel timer, or on MCE notification.
    /// Must be async-safe (called from both process and interrupt contexts).
    /// Uses interior mutability (`AtomicU64` counters) for error recording.
    pub check_error: fn(mci: &McController),

    /// Inject a synthetic error for hardware validation (e.g., post-RAS testing).
    /// Optional — not all platforms support error injection.
    pub inject_error: Option<fn(mci: &McController, inject: &EdacInject) -> Result<(), EdacError>>,

    /// Reset CE/UE counters back to zero.
    /// Requires `CAP_SYS_ADMIN`. Called from sysfs write to `reset_counters`.
    /// Counters are `AtomicU64` — reset is `store(0, Relaxed)` through `&self`.
    pub reset_counters: Option<fn(mci: &McController)>,
}

20.6.3 Error Reporting

When McOps::check_error() detects new errors, or when the MCE handler decodes a memory error, it calls the EDAC reporting functions:

// umka-core/src/edac/report.rs

/// Report a correctable error (CE).
///
/// Called from both the polling path (preemptible process context) and the
/// MCE handler (NMI context on x86-64). The implementation must be NMI-safe:
/// no sleeping, no heap allocation, no unbounded loops.
///
/// # Arguments
/// - `mci`:     The memory controller that detected the error.
/// - `csrow`:   Chip-select row index where the error was located.
/// - `channel`: Channel index within the chip-select row.
/// - `page`:    Physical page frame number where the error occurred (0 = unknown).
/// - `offset`:  Byte offset within the page (0 = unknown).
/// - `grain`:   ECC granule size in bytes.
/// - `syndrome`: ECC syndrome reported by hardware (opaque; hardware-specific decode).
/// - `msg`:     Additional context string (static, NMI-safe).
pub fn edac_mc_handle_ce(
    mci:      &McController,
    csrow:    u32,
    channel:  u32,
    page:     u64,
    offset:   u32,
    grain:    u32,
    syndrome: u32,
    msg:      &'static str,
) {
    // 1. Increment counters atomically (Relaxed ordering: counters are
    //    monotonically increasing; total ordering not required here).
    mci.ce_count.fetch_add(1, Ordering::Relaxed);
    let row = &mci.csrows[csrow as usize];
    row.ce_count.fetch_add(1, Ordering::Relaxed);
    row.channels[channel as usize].ce_count.fetch_add(1, Ordering::Relaxed);

    // 2. Write to kernel log (pr_warn equivalent). NMI-safe printk path.
    klog!(Warn, "EDAC MC{}: CE page 0x{:x} offset 0x{:x} grain {} syndrome 0x{:x} {}",
          mci.mc_idx, page, offset, grain, syndrome, msg);

    // 3. Emit FMA event (Section 20.1). Written to the per-device FMA ring.
    //    FMA event emission is lock-free and NMI-safe (ring buffer + atomic tail).
    fma_emit(FaultEvent::MemoryCe {
        mci_idx:  mci.mc_idx,
        csrow,
        channel,
        page,
        offset:   offset as u64,
        syndrome,
    });

    // 4. Threshold check: if CE rate exceeds CE_ADVISORY_PER_DAY on this DIMM,
    //    escalate to FMA advisory. Rate is computed by FMA from the event timestamps
    //    in the FMA ring; EDAC itself only records the raw count.
    //    The threshold is configurable via sysfs (see "sysfs Interface", Section 20.6.4).
}

/// Whether to panic on uncorrectable memory errors. Pre-resolved from the
/// `kernel.panic_on_uncorrected_error` sysctl for NMI-safe access.
/// The `KernelParam` setter writes to this static on sysctl update.
static PANIC_ON_UE: AtomicBool = AtomicBool::new(true);

/// Report an uncorrectable error (UE). The affected memory page is compromised.
///
/// A UE means a multi-bit error that ECC could not correct. The contents of the
/// affected memory are unreliable. If that memory held kernel code or critical
/// data structures, a panic is the safest response.
pub fn edac_mc_handle_ue(
    mci:     &McController,
    csrow:   u32,
    channel: u32,
    page:    u64,
    msg:     &'static str,
) {
    mci.ue_count.fetch_add(1, Ordering::Relaxed);
    mci.csrows[csrow as usize].ue_count.fetch_add(1, Ordering::Relaxed);

    klog!(Crit, "EDAC MC{}: UE page 0x{:x} csrow {} channel {} {}",
          mci.mc_idx, page, csrow, channel, msg);

    fma_emit(FaultEvent::MemoryUe {
        mci_idx: mci.mc_idx,
        csrow,
        channel,
        page,
        fatal: true,
        offset: u64::MAX,   // Sub-page offset not reported by EDAC UE handler.
    });

    // Unconditional panic if the UE affects a kernel page-table page.
    // Corrupt page tables mean the MMU will translate addresses incorrectly,
    // leading to silent data corruption or arbitrary code execution. The
    // kernel cannot safely continue regardless of panic_on_uncorrected_error.
    if page_table_check(page) {
        kernel_panic(
            "EDAC: uncorrectable error on kernel page-table page 0x{:x} — \
             corrupt page tables are unrecoverable",
            page
        );
        // unreachable
    }

    // Panic if configured (kernel.panic_on_uncorrected_error).
    // Default is true on production builds; false on developer builds.
    // NMI-safe: reads a pre-resolved static AtomicBool directly, not a
    // string-based sysctl lookup. The sysctl write path
    // (`param_write("kernel.panic_on_uncorrected_error", ...)`) stores the
    // new value into PANIC_ON_UE via the KernelParam setter.
    if PANIC_ON_UE.load(Ordering::Relaxed) {
        kernel_panic("EDAC: uncorrectable memory error");
        // unreachable
    }

    // If not panicking, defer FMA health report to process context.
    // edac_mc_handle_ue() may be called from NMI context (MCE handler on x86-64),
    // where sleeping is forbidden. Queue the request to EDAC_OFFLINE_QUEUE for the
    // edac_offline_worker kthread to call fma_report_health(). EDAC does NOT call
    // offline_page() directly — FMA owns the page retirement decision.
    if page != 0 {
        EDAC_OFFLINE_QUEUE.enqueue(OfflineRequest { pfn: page });
    }
}

CE threshold and advisory — The per-DIMM CE rate is computed by the FMA engine from timestamps in the FMA event ring. When the rate exceeds the configured threshold (default: 100 CE/day per DIMM), the FMA engine triggers FmaEscalation::Drain (Section 20.1) and emits a maintenance advisory. This is a lower-severity trending alert distinct from the CorrectableHighRate variant (100 CE/hour), which triggers immediate DIMM-level concern and proactive page-offlining in the EDAC poller (see EdacErrorType::CorrectableHighRate below). The FMA threshold is configurable per-controller at /sys/bus/edac/devices/mc0/ce_count_limit.
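
The trending check FMA performs can be sketched as a trailing-window count over the ring's event timestamps. The helper name and window constant below are assumptions; the real engine reads the per-device FMA ring rather than a slice:

```rust
const SECS_PER_DAY: u64 = 86_400;

/// Count CE events whose timestamp falls inside the trailing one-day
/// window and compare against the advisory threshold (default 100/day).
/// EDAC only records raw counts; this rate computation belongs to FMA.
fn ce_rate_exceeded(timestamps_sec: &[u64], now_sec: u64, limit_per_day: usize) -> bool {
    let window_start = now_sec.saturating_sub(SECS_PER_DAY);
    let recent = timestamps_sec
        .iter()
        .filter(|&&t| t >= window_start && t <= now_sec)
        .count();
    recent > limit_per_day
}
```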

MCE integration on x86-64 — The MCE NMI handler mce_handler() (Section 2.23) performs severity triage and logs the event to the MCE ring buffer. The EDAC poller (running in process context on a periodic timer) drains MCE events from the ring buffer and calls edac_mc_handle_ce or edac_mc_handle_ue to translate them into EDAC events. This two-stage flow (NMI → ring buffer → poller) avoids sleeping or allocating in NMI context. AMD platforms register an edac_mce_amd decode handler that translates AMD MCA bank bits (MCA_STATUS, MCA_ADDR) into EDAC events. Intel platforms use the iMC driver's MCE decode path analogously.

Kernel page-table page detectionpage_table_check(pfn) returns true if the physical page is currently used as a kernel page-table page (PGD, P4D, PUD, PMD, or PTE page). The check is O(1): the struct Page for each physical frame carries a PageFlags field; bit PG_TABLE is set when the page is allocated by the page-table allocator (alloc_pt_page()) and cleared when the page is freed. The UE handler reads this flag with a plain load — no locks, NMI-safe. User-space page-table pages are also flagged; a UE on a user page-table page triggers SIGBUS to the owning process (via the existing MCE-to-signal path) rather than a kernel panic, because the kernel's own translations are not affected.
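
The O(1) check described above reduces to a single atomic flag load. A minimal sketch, where the bit position and the `Page` layout are assumptions:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical bit position for PG_TABLE within the page flags word.
const PG_TABLE: u64 = 1 << 9;

/// Minimal stand-in for the per-frame `struct Page`.
struct Page { flags: AtomicU64 }

/// NMI-safe: one plain atomic load, no locks, no allocation.
fn page_table_check(page: &Page) -> bool {
    page.flags.load(Ordering::Relaxed) & PG_TABLE != 0
}

/// Set by alloc_pt_page(), cleared when the page is freed (per the text above).
fn mark_page_table(page: &Page, is_pt: bool) {
    if is_pt {
        page.flags.fetch_or(PG_TABLE, Ordering::Relaxed);
    } else {
        page.flags.fetch_and(!PG_TABLE, Ordering::Relaxed);
    }
}
```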

FMA-mediated page retirement — EDAC does NOT call offline_page() directly, neither from the UE handler nor from the polling path. All page retirement decisions are routed through FMA via fma_report_health(). This ensures FMA is the single source of truth for page retirement: FMA tracks all retired pages, prevents double-retirement, and applies consistent escalation policy (e.g., retiring too many pages on a single DIMM triggers a DIMM-level advisory). The edac_mc_handle_ue() function emits a FaultEvent::MemoryUe to the FMA ring and, when called from NMI context, also queues the event to EDAC_OFFLINE_QUEUE: BoundedMpmcRing<PhysAddr, 64> for deferred FMA processing by the edac_offline_worker workqueue thread (sleeping is forbidden in NMI context).

20.6.4 sysfs Interface

EDAC registers its sysfs hierarchy under /sys/bus/edac/devices/. All writes require CAP_SYS_ADMIN.

/sys/bus/edac/devices/
  mc0/                          # Memory controller 0
    ce_count                    # (r)  Total CEs since boot or last reset
    ue_count                    # (r)  Total UEs since boot or last reset
    reset_counters              # (w)  Write "1" to reset all CE/UE counters
    ce_count_limit              # (rw) CE/day advisory threshold (default: 100)
    mc_name                     # (r)  Controller name, e.g. "F19h_MC0"
    edac_mode                   # (r)  Active EDAC mode, e.g. "SECDED"
    size_mb                     # (r)  Total memory under this controller (MB)
    scrub_mode                  # (rw) Active scrub mode; "none" / "hw_scrub"
    csrow0/                     # Chip-select row 0
      ce_count                  # (r)  CE count for this rank
      ue_count                  # (r)  UE count for this rank
      size_mb                   # (r)  Rank size in MB
      edac_mode                 # (r)  EDAC mode for this rank
      dev_type                  # (r)  DIMM package: RDIMM / LRDIMM / UDIMM / NVDIMM
      mem_type                  # (r)  DRAM standard: DDR4 / DDR5 / LPDDR5
      ch0_ce_count              # (r)  CE count for channel 0
      ch0_dimm_label            # (rw) DIMM slot label (set by userspace from DMI data)
      ch1_ce_count              # (r)  CE count for channel 1
      ch1_dimm_label            # (rw) DIMM slot label for channel 1

20.6.5 umkafs Integration

EDAC data is also mirrored in the umkafs unified namespace (Section 20.5), enabling cross-subsystem queries alongside device health, scheduler state, and cluster topology:

/Memory/EDAC/
  mc0/
    ce_count
    ue_count
    ce_count_limit
    csrow0/
      ce_count
      ue_count
      ch0_ce_count
      ch0_dimm_label
      ch1_ce_count
      ch1_dimm_label

The umkafs entries are thin wrappers over the same McController/CsRow data structures — reads are lock-free atomic loads; writes are forwarded to the same sysfs write handlers (same permission checks apply).

20.6.6 Integration with FMA

Each edac_mc_handle_ce/ue call emits a FaultEvent into the FMA ring (Section 20.1). The FMA diagnosis engine correlates these events:

| Pattern | FMA Diagnosis | Response |
|---|---|---|
| Rising CE rate on one DIMM | FmaEscalation::Drain → DeviceState::Degraded | Maintenance advisory; retire pages if possible |
| UE event | FmaEscalation::Offline → DeviceState::Error | FMA calls memory_hotplug::offline_page (deduplicated); alert operator |
| Multiple DIMMs in same rank failing | Possible MC fault | Escalate to platform-level alert |
| CE burst correlated with thermal event | Thermal-induced instability | Correlate with FMA thermal events |

The FMA ring can be consumed by the umka-ml-anomaly service (Section 23.1) for time-series analysis of CE rates. The service fits an exponential moving average over the CE event timestamps and alerts if the rate exceeds three standard deviations above the per-DIMM baseline.
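
The detector the text describes can be sketched as an exponentially weighted mean/variance tracker over per-interval CE counts. The update rule and alpha here are assumptions (the real umka-ml-anomaly service runs in userspace over ring timestamps):

```rust
/// EMA-based anomaly detector: alert when a sample deviates more than
/// three standard deviations above the tracked per-DIMM baseline.
struct EmaDetector {
    mean: f64,
    var: f64,
    alpha: f64, // smoothing factor, e.g. 0.1
}

impl EmaDetector {
    fn new(initial_mean: f64, initial_var: f64, alpha: f64) -> Self {
        Self { mean: initial_mean, var: initial_var, alpha }
    }

    /// Feed one sample (CE count for this interval). Returns true if the
    /// sample exceeds mean + 3*sigma, tested BEFORE the baseline absorbs it.
    fn update(&mut self, x: f64) -> bool {
        let dev = x - self.mean;
        let alert = dev > 3.0 * self.var.sqrt();
        // Standard EWMA mean/variance update.
        self.mean += self.alpha * dev;
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * dev * dev);
        alert
    }
}
```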

Implementation phases:

  • Phase 1: Framework registration (edac_mc_add_mc, edac_mc_handle_ce/ue), sysfs hierarchy
  • Phase 2: AMD (amd64_edac) and Intel iMC drivers
  • Phase 4: umkafs integration, FMA correlation rules, umka-ml-anomaly CE-rate monitor

20.6.7 Polling Mechanism

Hardware memory controllers expose ECC error counts through memory-mapped registers or model-specific registers (MSRs). UmkaOS reads these via a dedicated polling controller that runs on a kworker thread at a configurable interval.

20.6.7.1 Controller and Driver Interface

// umka-core/src/edac/poller.rs

/// Maximum simultaneously registered EDAC controllers.
/// Even 8-socket NUMA systems have at most 16 memory controllers
/// (2 IMCs per socket). Bounded at compile time — no heap allocation.
pub const MAX_EDAC_CONTROLLERS: usize = 16;

/// Maximum errors reported per poll cycle per controller.
/// Hardware ECC error registers are bounded (typically 1-4 per MCA bank).
/// 16 entries covers the worst case of all banks reporting simultaneously.
pub const MAX_ERRORS_PER_POLL: usize = 16;

/// EDAC polling controller — one singleton per system, initialized at boot.
///
/// Owns the kworker thread and the registered driver list. Drivers register
/// at bus-scan time (PCI enumeration for Intel iMC / AMD UMC controllers)
/// and deregister on hot-remove.
pub struct EdacPoller {
    /// Polling interval in milliseconds. Default: 1000 ms.
    /// Range: 100 ms (aggressive, ~0.1% overhead) to 60000 ms (conservative).
    /// Readable and writable via `/ukfs/kernel/edac/poll_interval_ms`.
    /// A write takes effect on the next wakeup; no restart is required.
    pub poll_interval_ms: AtomicU32,
    /// Registered EDAC drivers. One entry per memory controller instance.
    /// RCU-protected: the poller thread reads under an RCU guard (lock-free,
    /// no contention on steady-state polls). Register/unregister (cold path,
    /// bus-scan only) clones the `ArrayVec`, modifies, and RCU-publishes.
    /// Arc<dyn>: supports MC hot-add/remove. Cold path (1s polling), no perf impact.
    pub drivers: RcuCell<ArrayVec<Arc<dyn EdacDriver>, MAX_EDAC_CONTROLLERS>>,
    /// Serialises driver register/unregister (cold path only).
    pub driver_write_lock: Mutex<()>,
    /// Handle to the dedicated kworker thread (`edac_poller/0`).
    /// Created at EDAC framework init; runs until system shutdown.
    pub poller: KworkerHandle,
    /// Monotonically increasing correctable error count since boot.
    /// Incremented from the poller thread; readable via
    /// `/ukfs/kernel/edac/ce_count` without locking.
    pub total_ce_count: AtomicU64,
    /// Monotonically increasing uncorrectable error count since boot.
    pub total_ue_count: AtomicU64,
    /// Sliding window of UE timestamps (seconds since boot), used to detect
    /// UE bursts that trigger a panic. Small fixed-size circular buffer.
    /// Both push and pop are performed by the same poller thread (no
    /// concurrent access). Uses explicit head/len for O(1) front/pop_front.
    pub ue_window: UeWindow,
}

/// Small fixed-size circular buffer for UE burst timestamps.
/// Capacity: `UE_BURST_WINDOW` entries (8). Uses manual head/len
/// index for O(1) front() and pop_front() operations that ArrayVec
/// does not support.
pub struct UeWindow {
    pub buf: [u64; UE_BURST_WINDOW],
    pub head: usize,
    pub len: usize,
}

/// Maximum UE events in the burst window before a panic is triggered.
/// A burst of 3 UEs within 60 seconds indicates multi-bit failure spreading
/// beyond what page-offlining can contain; continuing risks silently corrupt data.
pub const UE_PANIC_THRESHOLD: usize = 3;
/// Width of the UE burst detection window in seconds.
pub const UE_BURST_WINDOW_SECS: u64 = 60;
/// Ring capacity for the UE burst window (must be >= UE_PANIC_THRESHOLD).
pub const UE_BURST_WINDOW: usize = 8;
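The doc comments above promise O(1) front() and pop_front(); a minimal sketch of those ring operations follows, with the struct and capacity repeated so the example is self-contained. `push` on a full window overwrites the oldest entry, which is safe here because only timestamps inside the burst window matter.

```rust
/// Sketch of the O(1) ring operations on UeWindow (definitions repeated
/// from above for self-containment).
pub const UE_BURST_WINDOW: usize = 8;

pub struct UeWindow {
    pub buf: [u64; UE_BURST_WINDOW],
    pub head: usize,
    pub len: usize,
}

impl UeWindow {
    pub const fn new() -> Self {
        Self { buf: [0; UE_BURST_WINDOW], head: 0, len: 0 }
    }

    /// Append a timestamp at the tail; overwrite the oldest if full.
    pub fn push(&mut self, ts: u64) {
        let tail = (self.head + self.len) % UE_BURST_WINDOW;
        self.buf[tail] = ts;
        if self.len < UE_BURST_WINDOW {
            self.len += 1;
        } else {
            // Full: the tail just overwrote the old head; advance head.
            self.head = (self.head + 1) % UE_BURST_WINDOW;
        }
    }

    /// Oldest timestamp, or None if the window is empty.
    pub fn front(&self) -> Option<u64> {
        if self.len == 0 { None } else { Some(self.buf[self.head]) }
    }

    /// Drop the oldest timestamp.
    pub fn pop_front(&mut self) {
        if self.len > 0 {
            self.head = (self.head + 1) % UE_BURST_WINDOW;
            self.len -= 1;
        }
    }

    pub fn len(&self) -> usize {
        self.len
    }
}
```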

/// Per-memory-controller EDAC driver interface.
///
/// Implemented by hardware-specific drivers (e.g., `amd64_edac`, `intel_imc_edac`).
/// All methods are called from the poller kthread (preemptible process context)
/// unless noted otherwise. Implementations may sleep and acquire non-IRQ-safe locks.
///
/// Methods take `&self` (not `&mut self`) because the driver list is RCU-protected
/// and shared references are all that RCU read guards provide. Drivers use interior
/// mutability (`Mutex<HwRegs>`, `AtomicU64` counters) for hardware register access
/// and internal state — this is correct regardless of the collection design, since
/// hardware MMIO reads require synchronisation even with exclusive access.
pub trait EdacDriver: Send + Sync {
    /// Human-readable controller name for logging and sysfs, e.g., `"AMD_UMC_0"`,
    /// `"Intel_IMC_0"`. Must be unique across all registered drivers.
    fn name(&self) -> &str;

    /// Poll hardware ECC registers once and return all errors detected since the
    /// previous call to `clear_counts()`.
    ///
    /// The implementation reads the relevant status/count registers from hardware
    /// (e.g., AMD `MCG_STATUS`, `MCA_STATUS_UMC`; Intel `MC{n}_STATUS` MSRs or
    /// iMC MMIO error registers) and translates them to `EdacError` values.
    ///
    /// Returns an empty `ArrayVec` if no new errors have occurred since the last
    /// poll. A non-empty return does not imply a fatal condition — correctable
    /// errors are expected on aging hardware. The `MAX_ERRORS_PER_POLL` bound
    /// matches hardware: ECC registers report a bounded number of errors per
    /// MCA bank per poll cycle.
    fn poll(&self) -> ArrayVec<EdacError, MAX_ERRORS_PER_POLL>;

    /// Clear the hardware error count registers so that the next `poll()` call
    /// reports only newly detected errors, not cumulative counts.
    ///
    /// Called immediately after `poll()` by the poller thread. On platforms where
    /// clearing is not possible (read-only sticky registers), the driver must
    /// track the previous value internally and return the delta in `poll()`.
    fn clear_counts(&self);
}

/// A single ECC error event detected by polling a hardware memory controller.
pub struct EdacError {
    /// Error classification.
    pub kind: EdacErrorKind,
    /// Physical memory address where the error occurred, if the hardware
    /// provides address decoding. `None` if the controller does not record
    /// the fault address (e.g., LPDDR5 in some mobile configurations).
    pub phys_addr: Option<PhysAddr>,
    /// DIMM physical location decoded from the hardware address (channel,
    /// slot, rank, bank, row, column). Fields not decoded by the hardware
    /// are set to `u32::MAX`.
    pub location: EdacLocation,
    /// ECC syndrome bits as reported by hardware. Opaque; interpretation
    /// is hardware-specific and provided for diagnostic/logging purposes only.
    pub syndrome: u64,
    /// Count of how many times this error (same `phys_addr`) has been seen
    /// in the current polling epoch. Most hardware reports 1 per poll cycle;
    /// some aggregate counts across multiple addresses.
    pub count: u32,
}

/// Physical DIMM location decoded by the hardware memory controller driver.
/// Fields that the hardware does not provide are `u32::MAX`.
pub struct EdacLocation {
    pub channel: u32,
    pub slot:    u32,
    pub rank:    u32,
    pub bank:    u32,
    pub row:     u32,
    pub column:  u32,
}

/// Classification of a detected ECC error.
pub enum EdacErrorKind {
    /// Correctable Error (CE): single-bit error that the ECC hardware corrected
    /// transparently. The read data delivered to the CPU was correct. CEs do not
    /// cause data loss. A rising CE rate on a single DIMM location is the
    /// canonical predictor of an imminent uncorrectable error.
    Correctable,
    /// Uncorrectable Error (UE): multi-bit error that ECC could not correct.
    /// The data at the affected physical address is unreliable. If that address
    /// holds kernel code, a page table, or a live task's memory, a panic is the
    /// safest response to prevent silent corruption from propagating.
    Uncorrectable,
    /// Correctable Error at a high rate: more than 100 CEs per hour on the same
    /// DIMM location. The error is still hardware-corrected, but the rate indicates
    /// the DIMM is degrading. Page-offlining is recommended before a UE occurs.
    /// Emitted as a separate variant (not `Correctable`) so the poller can take
    /// proactive action without waiting for the FMA CE-rate threshold check.
    CorrectableHighRate,
}
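An illustrative sketch of how a driver might promote `Correctable` to `CorrectableHighRate` against the 100-CEs-per-hour threshold named in the variant docs. The per-location tracker and its names (`CeRateTracker`, `record`) are hypothetical; the real drivers presumably keep equivalent state per decoded DIMM location.

```rust
/// Hypothetical per-DIMM-location CE rate tracker: counts CEs in a
/// rolling hourly window and classifies each event.
const HIGH_RATE_THRESHOLD: u32 = 100; // CEs per hour, per the enum docs
const RATE_WINDOW_SECS: u64 = 3600;

#[derive(Debug, PartialEq)]
pub enum CeClass {
    Correctable,
    CorrectableHighRate,
}

pub struct CeRateTracker {
    window_start_secs: u64,
    ce_in_window: u32,
}

impl CeRateTracker {
    pub const fn new() -> Self {
        Self { window_start_secs: 0, ce_in_window: 0 }
    }

    /// Record `count` CEs observed at `now_secs` (seconds since boot);
    /// classify the event against the hourly threshold.
    pub fn record(&mut self, now_secs: u64, count: u32) -> CeClass {
        if now_secs - self.window_start_secs >= RATE_WINDOW_SECS {
            // Start a new hourly window.
            self.window_start_secs = now_secs;
            self.ce_in_window = 0;
        }
        self.ce_in_window += count;
        if self.ce_in_window > HIGH_RATE_THRESHOLD {
            CeClass::CorrectableHighRate
        } else {
            CeClass::Correctable
        }
    }
}
```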

20.6.7.2 Polling Algorithm

The poller kthread (edac_poller/0) runs the following loop:

EDAC poller loop (runs every poll_interval_ms milliseconds):

for each driver in EdacPoller.drivers (RCU read guard held, lock-free):
    errors = driver.poll()
    driver.clear_counts()

    for each error in errors:
        match error.kind:

        Correctable | CorrectableHighRate:
            EdacPoller.total_ce_count += error.count
            driver's McController.ue_count unchanged
            csrow.ce_count += error.count
            emit FaultEvent::MemoryCe { ... } to FMA ring

            if error.kind == CorrectableHighRate:
                emit FaultEvent with HealthSeverity::Warning
                // EDAC does NOT call offline_page() directly — FMA owns the
                // page retirement decision to prevent double-retirement and
                // ensure consistent escalation. FMA's diagnosis engine
                // determines whether to retire the page, accounting for
                // prior retirements and overall DIMM health.
                fma_report_health(HealthEvent {
                    timestamp_ns: ktime_get_ns(),
                    class: HealthEventClass::Memory,
                    code: 0x0003,  // CorrectableHighRate (distinct from MemoryCe 0x0001)
                    severity: HealthSeverity::Warning,
                    data: encode_edac_location(error),
                    data_len: 32,
                })

        Uncorrectable:
            EdacPoller.total_ue_count += error.count
            csrow.ue_count += error.count
            emit FaultEvent::MemoryUe { ... } with HealthSeverity::Critical

            // EDAC reports UE events to FMA via fma_report_health(). FMA's
            // diagnosis engine determines whether to retire the page,
            // accounting for prior retirements and overall DIMM health.
            // EDAC does NOT call offline_page() directly — FMA owns the
            // page retirement decision to prevent double-retirement and
            // ensure consistent escalation.
            if error.phys_addr is Some(addr):
                fma_report_health(HealthEvent {
                    timestamp_ns: ktime_get_ns(),
                    class: HealthEventClass::Memory,
                    code: 0x0002,  // MemoryUe
                    severity: HealthSeverity::Critical,
                    data: encode_edac_location(error),
                    data_len: 32,
                })

            // Record UE timestamp in burst window.
            ue_window.push(now_secs())

            // Evict entries older than UE_BURST_WINDOW_SECS.
            while ue_window is non-empty and ue_window.front() + UE_BURST_WINDOW_SECS < now_secs():
                ue_window.pop_front()

            // Burst check: too many UEs in the sliding window → panic.
            if ue_window.len() >= UE_PANIC_THRESHOLD:
                kernel_panic(
                    "EDAC: {} uncorrectable memory errors in {}s — halting",
                    UE_PANIC_THRESHOLD, UE_BURST_WINDOW_SECS
                )
                // unreachable

Page offlining. EDAC reports errors to FMA via fma_report_health(). FMA's diagnosis engine (Section 20.1) determines whether to retire the page by calling memory_hotplug::offline_page() (Section 4.1), which removes the page from the buddy allocator's free pool. The page is never allocated again. FMA tracks all retired pages to prevent double-retirement and applies consistent escalation policy. Existing mappings of the page are not forcibly invalidated — the process or kernel subsystem that holds the mapping may read corrupt data, but further allocations will not land on the bad page. A UE in a kernel mapping triggers a panic before FMA processes the event, since the kernel cannot safely continue with a bad kernel page.

UE event deduplication: UE events are deduplicated by physical address: if offline_requested[phys_addr >> PAGE_SHIFT] is already set, the duplicate UE is logged but no additional FMA event is emitted. This prevents error storms from flooding the FMA ring with duplicate retirement requests for the same page.
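A minimal sketch of that dedup check, assuming a fixed physical-memory span. The test-and-set returns true only for the first UE at a page, so duplicates can be logged without re-emitting an FMA event. The heap-backed `Vec` is for the example only; in-kernel this bitmap would be statically sized.

```rust
/// Sketch of the offline_requested[phys_addr >> PAGE_SHIFT] dedup bitmap.
/// The 128 GiB bound is a hypothetical example value.
pub const PAGE_SHIFT: u64 = 12;
const MAX_PFNS: usize = (128usize << 30) >> PAGE_SHIFT; // 1 bit per 4 KiB page

pub struct OfflineRequested {
    bits: Vec<u64>, // in-kernel: a statically allocated bitmap
}

impl OfflineRequested {
    pub fn new() -> Self {
        Self { bits: vec![0u64; MAX_PFNS / 64] }
    }

    /// Test-and-set for the page containing `phys_addr`.
    /// Returns true if this is the first retirement request for the page.
    pub fn request(&mut self, phys_addr: u64) -> bool {
        let pfn = (phys_addr >> PAGE_SHIFT) as usize;
        let word = pfn / 64;
        let mask = 1u64 << (pfn % 64);
        let first = self.bits[word] & mask == 0;
        self.bits[word] |= mask;
        first
    }
}
```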

Panic threshold rationale. Three UEs in 60 seconds indicates a fault condition beyond individual page-offlining: either the memory controller itself is failing, or a hardware fault is affecting a wide address range. Continuing risks silently corrupt data in pages that have not yet been detected as bad. A panic with a clear EDAC message is the correct response; the operator can analyze the FMA event log to determine the scope of the failure.

20.6.7.3 Userspace Visibility

Error counters and configuration are exposed through two parallel namespaces:

sysfs (Linux-compatible, for edac-util, rasdaemon):

/sys/bus/edac/devices/mc{n}/
    ce_count                   (r)  Total CE count since boot (or last reset)
    ue_count                   (r)  Total UE count since boot (or last reset)
    reset_counters             (w)  Write "1" to clear all counters (CAP_SYS_ADMIN)
    scrub_mode                 (rw) "none" or "hw_scrub"
    csrow{m}/ch{c}_ce_count    (r)  Per-channel CE count for rank m, channel c
    csrow{m}/ue_count          (r)  Per-rank UE count

umkafs (cross-subsystem queries, Prometheus scraping):

/ukfs/kernel/edac/
    poll_interval_ms           (rw) Polling interval; range [100, 60000]; default 1000
    ce_count                   (r)  System-wide total CE count since boot
    ue_count                   (r)  System-wide total UE count since boot
    dimm/{location}/ce_count   (r)  Per-DIMM CE count (location = "mc0_csrow0_ch0")
    dimm/{location}/ue_count   (r)  Per-DIMM UE count

POLLPRI notification. A UE event sets the POLLPRI flag on the memory controller's character device /dev/edac/mc{n}. Userspace monitoring daemons (e.g., mcelog, edac-util) may poll()/epoll() on this fd to receive prompt notification without busy-waiting. CE events do not set POLLPRI; they are only reported through the sysfs/umkafs counters and the FMA event ring.


20.7 pstore — Panic Log Persistence

pstore (persistent storage) preserves kernel panic logs, oops messages, and console output across reboots by writing them to non-volatile storage during the panic path. On production datacenter hosts, pstore is the primary mechanism for post-mortem analysis of kernel panics when a full kdump capture is impractical (insufficient crashkernel reservation, early-boot panics before kdump is armed, or EFI-only hosts).

20.7.1 Architecture

pstore decouples the persistence backend (EFI variables, ramoops DRAM region, BERT ACPI table) from the event producers (panic handler, oops handler, console, MCE log). The framework:

  • Calls all registered backends on panic/oops via the KmsgDumper interface.
  • Mounts pstorefs at /sys/fs/pstore/ after boot (done by systemd-pstore.service or manually by the administrator).
  • Each saved record appears as a virtual file: dmesg-efi-12345, console-efi-12345, mce-efi-12346, etc.
  • Files are removed from pstorefs (and erased from the backend) via unlink() or automatically by systemd-pstore.service, which copies them to /var/lib/systemd/pstore/ first.

The pstore framework lives in umka-core/src/pstore/. Backend drivers live in umka-core/src/pstore/efi.rs, umka-core/src/pstore/ramoops.rs, and (read-only) umka-core/src/pstore/bert.rs.

20.7.2 Backend Interface

// umka-core/src/pstore/mod.rs

/// Registered pstore backend. One instance per storage medium.
/// Backends are registered at boot via `pstore_register()`.
pub struct PstoreInfo {
    /// Backend name, e.g. "efi-pstore", "ramoops". Null-terminated.
    pub name:       [u8; 32],
    /// Which record types this backend handles.
    pub flags:      PstoreFlags,
    /// Maximum panic reason severity this backend accepts.
    pub max_reason: KmsgDumpReason,

    /// Write one pstore record during panic/oops.
    ///
    /// # Safety
    ///
    /// Called from NMI/panic context. Must not sleep, allocate heap memory,
    /// or acquire non-spinlock mutexes. The implementation must complete in
    /// bounded time (no retry loops on EFI runtime errors).
    ///
    /// Returns the backend-assigned record ID on success, used later by `erase`.
    pub write: fn(info: &PstoreInfo, record: &mut PstoreRecord) -> Result<u64, PstoreError>,

    /// Read the next available record (enumeration).
    ///
    /// Called repeatedly at pstorefs mount time until it returns `None`.
    /// Must be callable from process context (may sleep for slow backends).
    pub read: fn(info: &PstoreInfo, iter: &mut PstoreReadIter) -> Option<PstoreRecord>,

    /// Erase record `id` from the backend.
    ///
    /// Called when userspace unlinks the corresponding pstorefs file.
    /// Must be callable from process context.
    pub erase: fn(info: &PstoreInfo, id: u64) -> Result<(), PstoreError>,
}

/// A single pstore record (one logical unit of saved state).
/// Kernel-internal Tier 0 struct, not KABI -- bool is safe.
/// (Compare with RamoopsHeader which uses u8 for the on-disk wire format.)
pub struct PstoreRecord {
    /// Record type tag.
    pub type_:      PstoreType,
    /// Backend-assigned ID (set by `write`, used by `erase`).
    pub id:         u64,
    /// Monotonically increasing panic counter (survives reboot via EFI variable).
    /// **Longevity**: u32 wraps after ~4.29 billion panics. At 1 panic/day
    /// (worst case), wrap occurs after ~11.7 million years. Constrained to
    /// u32 by the Linux pstore ABI (pstorefs filenames encode the count as
    /// a decimal u32). Safe for 50-year uptime.
    pub count:      u32,
    /// Reason the kernel was dumping (panic, oops, MCE, ...).
    pub reason:     KmsgDumpReason,
    /// Part number for multi-part records (0-based). Large logs are split into
    /// backend-sized chunks and stored as parts 0, 1, 2, ...
    pub part:       u32,
    /// True if `buf` contains LZ4-compressed data.
    pub compressed: bool,
    /// Number of valid bytes in `buf`.
    pub size:       usize,
    /// Pre-allocated, statically-sized buffer (NMI-safe — no heap).
    /// Size determined by the backend at registration time.
    ///
    /// # Safety invariant
    ///
    /// Each pstore backend owns a distinct, statically-allocated buffer
    /// (e.g., `EfiPstoreBackend::compress_buf: [u8; 65536]`). The
    /// `backend.record_buf_mut()` call returns a `&'static mut [u8]`
    /// reference to that backend's own buffer. Since each backend is a
    /// global singleton with 'static lifetime and the panic dump path is
    /// serialized (only one CPU runs the panic handler), there is no
    /// aliasing: each loop iteration gets a `&'static mut` from a
    /// different backend. The `PstoreRecord` is dropped before the next
    /// backend reuses its buffer.
    pub buf:        &'static mut [u8],
}

/// Panic dump reason, in order of increasing severity.
#[repr(u32)]
pub enum KmsgDumpReason {
    Undef       = 0,
    Panic       = 1,
    Oops        = 2,
    Emerg       = 3,
    Shutdown    = 4,  // Consolidated from Restart/Halt/Poweroff in Linux 5.14+
    // Values 5+ are UmkaOS extensions not present in Linux's enum kmsg_dump_reason.
    SoftRestart = 5,
    Mce         = 6,
}

bitflags! {
    /// Which record types a backend supports.
    pub struct PstoreFlags: u32 {
        const DMESG   = 0x0001;
        const CONSOLE = 0x0002;
        const FTRACE  = 0x0004;
        const MCE     = 0x0008;
        const PMSG    = 0x0010;
    }
}

20.7.3 EFI Backend (efi_pstore)

The EFI backend stores records as EFI non-volatile NVRAM variables, which survive across hard resets and power cycles on UEFI platforms.

Variable naming — Each record part becomes one EFI variable:

VendorGuid: {9f2f919e-b88e-4e47-...}
VariableName: "dump-type0-<part>-<count>-<timestamp>"
where part = part number, count = panic counter, timestamp = seconds since epoch from the EFI GetTime() runtime service.

Size limit — EFI variable size is platform-specific. The backend queries QueryVariableInfo() at registration time and stores the result in max_var_size. Typical values: 8–64 KB per variable. Large dmesg logs are split into ceil(uncompressed_size / max_chunk_size) parts.

Compression — Each chunk is LZ4-compressed before writing. The compression buffer is statically allocated at registration time (no heap allocation in panic path). If LZ4 cannot compress a chunk below max_var_size, the chunk is stored uncompressed and PstoreRecord::compressed is set to false.
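The split-and-fallback decision above can be sketched as follows. The LZ4 call is stubbed out (`try_lz4` is a hypothetical closure, not the backend's real compressor); the point is the control flow: the part count is ceil(len / max_chunk), and a chunk that does not compress below the variable size limit is stored raw with the compressed flag cleared.

```rust
/// Number of EFI variable parts needed for an uncompressed log.
pub fn chunk_count(uncompressed_len: usize, max_chunk: usize) -> usize {
    (uncompressed_len + max_chunk - 1) / max_chunk // ceil division
}

/// Returns (bytes_to_store, compressed_flag) for one chunk.
/// `try_lz4` stands in for the real compressor: None means the output
/// would not fit within `max_var_size`.
pub fn encode_chunk(
    chunk: &[u8],
    max_var_size: usize,
    try_lz4: impl Fn(&[u8], usize) -> Option<Vec<u8>>,
) -> (Vec<u8>, bool) {
    match try_lz4(chunk, max_var_size) {
        Some(compressed) => (compressed, true),
        // Store uncompressed; PstoreRecord::compressed stays false.
        None => (chunk.to_vec(), false),
    }
}
```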

NMI-safety on x86-64 — EFI runtime services (SetVariable) are not unconditionally NMI-safe on all firmware implementations. The EFI backend disables interrupts (cli) and acquires an NMI-safe spinlock (no scheduler interaction) before entering EFI runtime. This matches the approach used by Linux's efi_call_rts() path.

// umka-core/src/pstore/efi.rs

pub struct EfiPstoreBackend {
    pub info:         PstoreInfo,
    /// Maximum bytes per EFI variable (from QueryVariableInfo).
    pub max_var_size: usize,
    /// Statically allocated compression output buffer (no heap allocation
    /// in the panic path).
    pub compress_buf: [u8; 65536],
    /// Monotonically increasing panic count, stored in one dedicated EFI variable.
    /// **Longevity**: u32 constrained by Linux pstore EFI variable naming
    /// convention. Wraps after ~4.29 billion panics (~11.7M years at 1/day).
    pub panic_count:  AtomicU32,
    /// Lock serializing EFI runtime service calls (NMI-safe spinlock).
    pub efi_lock:     NmiSpinlock,
}

impl EfiPstoreBackend {
    /// Boot-time initialization: read panic_count from EFI, register with pstore.
    pub fn init() -> Result<(), PstoreError> {
        let backend = Self::new()?;
        pstore_register(backend.info)
    }
}

Boot-time enumeration — At pstorefs mount, EfiPstoreBackend::read() calls GetNextVariableName() in a loop collecting all variables matching the GUID. It reassembles multi-part records (sorted by part number), decompresses if compressed = true, and yields PstoreRecord instances for each complete log.
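The reassembly step reduces to sorting by part number and concatenating. A sketch, assuming the parts (already decompressed) have been grouped by their shared count/timestamp key:

```rust
/// Reassemble one multi-part pstore record at pstorefs mount time.
/// `parts` are (part_number, data) pairs; EFI variable enumeration
/// returns them in arbitrary order.
pub fn reassemble(mut parts: Vec<(u32, Vec<u8>)>) -> Vec<u8> {
    parts.sort_by_key(|&(part, _)| part);
    let mut out = Vec::new();
    for (_, data) in parts {
        out.extend_from_slice(&data);
    }
    out
}
```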

Erase pathEfiPstoreBackend::erase() calls SetVariable() with DataSize = 0, which deletes the EFI variable. If systemd-pstore.service is running, it erases all records after copying them to /var/lib/systemd/pstore/, preventing EFI NVRAM exhaustion across many panics.

20.7.4 Ramoops Backend

For systems without EFI — embedded boards, some RISC-V platforms, BIOS-based x86-64 hosts — ramoops writes to a reserved DRAM region that survives soft resets (warm reboots) but not power loss.

Configuration — via device-tree node or ACPI SSDT:

ramoops {
    compatible = "ramoops";
    memory-region = <&pstore_reserved>;  /* Reserved DRAM, carved out at early boot */
    record-size    = <0x20000>;          /* 128 KB per dmesg record slot */
    console-size   = <0x40000>;          /* 256 KB console ring */
    ftrace-size    = <0>;                /* No ftrace backend in this config */
    max-reason     = <1>;                /* PANIC only */
    ecc            = <16>;               /* 16-byte ECC for ramoops header */
};

On-disk layout — The reserved region is divided into fixed-size slots. Each slot begins with a RamoopsHeader:

/// Header preceding each ramoops record slot (stored in DRAM).
///
/// **Alignment note**: `repr(C, packed)` means fields are not naturally aligned.
/// On strict-alignment architectures (ARMv7, some RISC-V implementations),
/// accessing `time: u64` at offset 8 from a potentially unaligned base
/// requires `ptr::read_unaligned()`. All accesses to packed struct fields
/// must use unaligned reads/writes. This is especially critical in NMI/panic
/// context where alignment faults cannot be recovered.
#[repr(C, packed)]
pub struct RamoopsHeader {
    /// Magic number: 0x43415441 ("CATA") — validates slot is written.
    pub magic:       u32,
    /// Panic count at time of write.
    pub count:       u32,
    /// Wall-clock timestamp (seconds since epoch) at time of write.
    pub time:        u64,
    /// True if the data following this header is LZ4-compressed.
    pub compressed:  u8,
    /// ECC block size in bytes (for header integrity check). 0 = no ECC.
    pub ecc_size:    u8,
    /// Size of this header structure (for forward compatibility).
    pub header_size: u32,
    /// Number of valid data bytes following the header.
    pub data_size:   u32,
}
const_assert!(core::mem::size_of::<RamoopsHeader>() == 26);
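The alignment note above translates into one concrete rule: parse the header with `read_unaligned`, never by dereferencing a cast pointer. A self-contained sketch (the struct is repeated from above; `parse_header` is illustrative):

```rust
use core::ptr;

/// Repeated from above for self-containment.
#[repr(C, packed)]
#[derive(Clone, Copy)]
pub struct RamoopsHeader {
    pub magic: u32,
    pub count: u32,
    pub time: u64,
    pub compressed: u8,
    pub ecc_size: u8,
    pub header_size: u32,
    pub data_size: u32,
}

pub const RAMOOPS_MAGIC: u32 = 0x43415441; // "CATA"

/// Parse a header from the start of a slot; None if the slot is too
/// short or was never written (bad magic).
pub fn parse_header(slot: &[u8]) -> Option<RamoopsHeader> {
    if slot.len() < core::mem::size_of::<RamoopsHeader>() {
        return None;
    }
    // SAFETY: length checked above. read_unaligned tolerates any base
    // alignment, which matters on ARMv7/RISC-V where an alignment fault
    // in the panic path would be unrecoverable.
    let hdr = unsafe { ptr::read_unaligned(slot.as_ptr() as *const RamoopsHeader) };
    // Copy the field out of the packed struct before comparing; taking a
    // reference to a packed field is an error.
    let magic = hdr.magic;
    if magic != RAMOOPS_MAGIC {
        return None; // empty or corrupted slot
    }
    Some(hdr)
}
```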

On boot, the ramoops driver maps the reserved region, reads each slot, validates the magic and (if configured) the ECC check over the header, and presents valid slots as pstorefs files. Slots with invalid magic are treated as empty.

20.7.4.1 BERT Backend (bert_pstore)

The BERT (Boot Error Record Table) backend provides read-only access to firmware-reported hardware errors captured before the OS booted. On UEFI platforms with ACPI BERT support, the firmware writes CPER (Common Platform Error Record) entries to a reserved physical memory region. The pstore BERT backend maps this region, parses the records, and presents them as pstorefs files so that systemd-pstore.service and rasdaemon can archive them alongside panic logs.

// umka-core/src/pstore/bert.rs

/// BERT pstore backend — read-only.
///
/// Parses the ACPI BERT table to locate firmware-written error records
/// from the previous boot or power cycle.  These records describe
/// hardware errors (memory, PCIe, processor) that occurred before the
/// OS gained control — for example, memory training failures, PCIe
/// link errors during POST, or machine-check exceptions handled by SMM.
pub struct BertPstoreBackend {
    /// Physical base address of the BERT error data region
    /// (from ACPI BERT table `BootErrorRegion` field).
    pub bert_region_phys: u64,
    /// Length of the BERT error data region in bytes
    /// (from ACPI BERT table `BootErrorRegionLength` field).
    pub bert_region_len: u32,
    /// Parsed error records from the BERT region.
    pub error_records: ArrayVec<BertErrorRecord, 64>,
}

/// A single CPER error record parsed from the BERT region.
pub struct BertErrorRecord {
    /// Section type GUID identifying the error source (e.g.,
    /// processor generic, memory, PCIe, firmware).
    /// UEFI spec Appendix N defines the standard GUIDs.
    pub section_type: [u8; 16],
    /// Error severity: 0 = recoverable, 1 = fatal, 2 = corrected,
    /// 3 = informational (matches CPER severity encoding).
    pub severity: u32,
    /// Byte offset from the start of the BERT region to this
    /// record's section data.
    pub data_offset: u32,
    /// Length of the section data in bytes.
    pub data_len: u32,
}

Initialization sequence (runs during early ACPI table enumeration, before the root filesystem is mounted):

  1. Locate BERT table: Scan the ACPI RSDT/XSDT for the "BERT" signature. If absent, BERT backend is not registered (many non-server platforms omit BERT).
  2. Validate header: Check that BootErrorRegionLength > 0 and that the physical address range does not overlap with the kernel text or other reserved regions.
  3. Map physical region: ioremap(bert_region_phys, bert_region_len) to obtain a kernel virtual address. The region is mapped as write-back cacheable (the firmware already wrote the data; we only read it).
  4. Iterate CPER records: Walk the mapped region using the CPER Generic Error Data Entry format (UEFI Specification, Section N.2.5). Each entry contains a section type GUID, severity, and a variable-length data payload. Entries are parsed until the cumulative length reaches bert_region_len or a zero-length entry is encountered.
  5. Populate error_records: Each valid CPER entry is stored as a BertErrorRecord. At most 64 records are kept (capacity of the ArrayVec); additional records are counted but not stored.
  6. Register with pstore: Call pstore_register() with PstoreFlags::DMESG and write: None (read-only backend — the firmware owns the BERT region; the OS never writes to it).
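The bounded walk of step 4 can be sketched as below. Note the layout here is deliberately simplified and hypothetical — the real CPER Generic Error Data Entry header carries more fields than the three the text names; this sketch keeps only [16-byte section GUID][u32 severity][u32 data_len][payload] to show the termination and bounds logic.

```rust
use core::convert::TryInto;

/// Mirrors BertErrorRecord from the text (simplified-entry parse).
pub struct ParsedRecord {
    pub section_type: [u8; 16],
    pub severity: u32,
    pub data_offset: u32,
    pub data_len: u32,
}

/// Walk a BERT-style region of simplified entries. Stops at a
/// zero-length entry, a truncated entry, the end of the region, or
/// `max_records` (matching the ArrayVec capacity of 64 above).
pub fn walk_bert(region: &[u8], max_records: usize) -> Vec<ParsedRecord> {
    const HDR: usize = 24; // 16 + 4 + 4 in the simplified layout
    let mut out = Vec::new();
    let mut off = 0usize;
    while off + HDR <= region.len() && out.len() < max_records {
        let mut guid = [0u8; 16];
        guid.copy_from_slice(&region[off..off + 16]);
        let severity = u32::from_le_bytes(region[off + 16..off + 20].try_into().unwrap());
        let data_len = u32::from_le_bytes(region[off + 20..off + 24].try_into().unwrap());
        if data_len == 0 {
            break; // zero-length entry terminates the walk
        }
        let data_start = off + HDR;
        if data_start + data_len as usize > region.len() {
            break; // truncated entry: stop rather than read past the region
        }
        out.push(ParsedRecord {
            section_type: guid,
            severity,
            data_offset: data_start as u32,
            data_len,
        });
        off = data_start + data_len as usize;
    }
    out
}
```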

pstorefs presentation: BERT records appear as dmesg-bert-<n> files in /sys/fs/pstore/. The file contents are a human-readable rendering of the CPER section data (severity, source GUID, and hex dump of the section payload). systemd-pstore.service archives them alongside panic logs.

FMA integration: After BERT initialization, each record with severity == 1 (fatal) or severity == 0 (recoverable) is forwarded to the FMA telemetry path (Section 20.1) as a HealthEventClass::Memory or HealthEventClass::Pcie event (selected by matching the section type GUID against known CPER section types). This ensures firmware-detected errors from the previous boot feed into the FMA diagnosis engine for correlation with runtime errors.

20.7.5 pstorefs

pstorefs is a RAM-backed virtual filesystem mounted at /sys/fs/pstore/. It is populated at mount time by iterating all registered backends' read() functions. Each record is represented as a read-only file:

Filename Contents
dmesg-efi-<id> Kernel log tail from the last panic (part 0)
dmesg-efi-<id>-<part> Continuation parts of the same log
console-efi-<id> Console output captured during panic
mce-efi-<id> MCE information decoded at panic time
dmesg-ramoops-<id> Kernel log from ramoops backend

Reading — Files are read-only. If the stored record is LZ4-compressed, the pstorefs file layer decompresses on first read into a per-file page cache. Reads thereafter are served from the page cache.

Deletionunlink() on a pstorefs file calls PstoreInfo::erase() on the owning backend, removing the record from NVRAM/DRAM. This is the mechanism by which systemd-pstore.service reclaims NVRAM space.

systemd-pstore.service interaction:

  1. Service starts on every boot.
  2. Mounts /sys/fs/pstore/ if not already mounted.
  3. Copies each file to /var/lib/systemd/pstore/<hostname>/<boot-id>/.
  4. Calls unlink() on each file to trigger backend erase.
  5. If /var/lib/systemd/pstore/ is full (configurable threshold), rotates the oldest panic logs out before copying new ones.

20.7.6 Panic Handler Integration

The kernel panic path calls kmsg_dump(KmsgDumpReason::Panic) which iterates all registered KmsgDumper callbacks in priority order. pstore registers at priority KMSG_DUMP_PSTORE. The pstore dump callback:

// umka-core/src/pstore/dump.rs

/// KmsgDumper callback — called from panic/oops context.
///
/// # Constraints
///
/// - NMI context on x86-64 (MCE-originated panics): no sleeping,
///   no heap allocation, no non-NMI-safe locks.
/// - Process context otherwise (software panics, oops).
/// - Must complete in bounded time; EFI backends have a firmware timeout
///   of ~10 seconds per variable write.
pub fn pstore_kmsg_dump(dumper: &mut KmsgDumper, reason: KmsgDumpReason) {
    // 1. Check if reason >= min(backend.max_reason) for any registered backend.
    //    If no backend wants this reason, return immediately.
    if !pstore_has_backend_for(reason) {
        return;
    }

    // 2. Compute which backends will accept this record.
    let backends = pstore_backends_for_flags(PstoreFlags::DMESG);

    // 3. Read from the kernel message ring buffer tail using the kmsg_dump iterator.
    //    Iterates newest-first; we collect up to the backend's chunk capacity.
    let mut iter = dumper.iter_lines();
    let mut part = 0u32;

    while let Some(chunk) = collect_chunk(&mut iter, MAX_CHUNK_BYTES) {
        // 4. Compress chunk (LZ4). Uses statically allocated compression workspace.
        let (buf, compressed) = lz4_compress_static(chunk);

        // 5. Write to each registered backend.
        for backend in &backends {
            let mut record = PstoreRecord {
                type_:      PstoreType::Dmesg,
                id:         0,   // assigned by backend write()
                count:      PANIC_COUNT.load(Ordering::Relaxed),
                reason,
                part,
                compressed,
                size:       buf.len(),
                buf:        backend.record_buf_mut(),
            };
            record.buf[..buf.len()].copy_from_slice(buf);
            record.size = buf.len();
            let _ = (backend.info.write)(&backend.info, &mut record);
            // Errors are silently dropped: we are in panic context and cannot
            // meaningfully recover. The panic continues regardless.
        }

        part += 1;
    }
}

20.7.7 umkafs Integration

pstore records from the most recent panic are also reflected in the umkafs unified namespace (Section 20.5):

/ukfs/kernel/paniclog/
  last_panic_time       # Wall-clock timestamp of most recent panic (epoch seconds)
  last_panic_reason     # Panic string (first line of dmesg-* record)
  panic_count           # Monotonically increasing counter across all panics
  records/              # Symlinks into /sys/fs/pstore/
    dmesg-efi-12345 -> /sys/fs/pstore/dmesg-efi-12345
    console-efi-12346 -> /sys/fs/pstore/console-efi-12346

last_panic_time and last_panic_reason are populated at pstorefs mount time from the most recent record (highest count). They are read-only; clearing requires unlink() of the underlying pstorefs file (which requires CAP_SYS_ADMIN).

Implementation phases:

- Phase 1: pstore framework, EFI backend, pstorefs
- Phase 2: ramoops backend (for non-EFI architectures: ARMv7, RISC-V, PPC32)
- Phase 3: umkafs integration, systemd-pstore.service compatibility
- Phase 4: BERT backend (read-only ACPI firmware-reported errors)


20.8 Performance Monitoring Unit (perf_event_open)

The perf_event_open subsystem exposes CPU hardware performance counters, kernel software events, and tracepoints to userspace through a uniform file-descriptor interface. It is the foundation for perf stat, perf record, bpftrace, BCC tools, and JVM JIT profiling.

UmkaOS implements the full Linux perf_event_open interface — identical syscall number, identical perf_event_attr wire format, identical ioctl codes, identical /proc/sys knobs — while replacing Linux's internal implementation with a design that avoids the global perf_event_mutex, uses per-CPU contexts as the primary scheduling unit, and offloads NMI-context work to per-CPU kernel threads.

20.8.1 Syscall Interface

perf_event_open(attr, pid, cpu, group_fd, flags) → fd | errno

Syscall number: 298 (x86-64). All eight UmkaOS architectures use the same ABI value as their Linux counterpart (see Section 20.8 for the full table).

Parameters:

Parameter Type Meaning
attr *const PerfEventAttr Event configuration (see Section 20.8)
pid i32 0 = current task; -1 = all tasks on cpu; >0 = specific task (requires PTRACE_MODE_READ or CAP_SYS_PTRACE)
cpu i32 ≥0 = pin to specific CPU; -1 = follow task across CPUs (requires pid != -1)
group_fd i32 -1 = standalone event; or fd of group leader (events measured as an atomic group sharing the same set of hardware counters)
flags u64 See flag table below

flags bitmask:

Flag Value Effect
PERF_FLAG_FD_NO_GROUP 0x01 Ignore group_fd even if non-negative
PERF_FLAG_FD_OUTPUT 0x02 Redirect output to group_fd's ring buffer
PERF_FLAG_PID_CGROUP 0x04 pid is a cgroup fd, not a task pid
PERF_FLAG_FD_CLOEXEC 0x08 Set O_CLOEXEC on the returned fd

Return value and errors:

Errno Condition
EINVAL Invalid attr fields, unknown event type, or conflicting flags
EPERM Insufficient privilege for the requested event (see Section 20.8)
EMFILE Per-process fd limit exhausted
ENODEV Event type not supported on this CPU (e.g., Intel PT on AMD)
EACCES perf_event_paranoid blocks the request (see Section 20.8)
ENOENT Referenced tracepoint event ID does not exist
ESRCH Target pid does not exist
ENOSPC Per-cgroup perf_event.max limit exhausted (Section 17.2)
EOPNOTSUPP Requested sample type or branch filter not supported by this PMU

The returned file descriptor supports read(), mmap(), ioctl(), poll()/epoll(), and close(). It does not support write().

20.8.2 perf_event_attr Wire Format

The perf_event_attr struct is 136 bytes (PERF_ATTR_SIZE_VER8) on all architectures and matches the Linux ABI. Linux has grown this struct through ten versions: VER0=64, VER1=72, VER2=80, VER3=96, VER4=104, VER5=112, VER6=120, VER7=128, VER8=136 (adds config3), VER9=144 (adds config4). UmkaOS tracks VER8. The kernel accepts any size field value: if size is larger than the kernel's known struct size, the trailing bytes are silently ignored (forward compatibility); if it is smaller, the missing fields are treated as zero (backward compatibility). For example, when userspace passes size=144 (VER9), the kernel reads 136 bytes and silently ignores the trailing 8 bytes (config4).
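The copy-in rule can be sketched as follows. This is an illustrative fragment, not the UmkaOS source: copy_attr_compat, KERNEL_ATTR_SIZE, and the errno handling are hypothetical names for the "min of the two sizes, zero-fill the rest" logic described above.

```rust
const KERNEL_ATTR_SIZE: usize = 136; // PERF_ATTR_SIZE_VER8

/// Illustrative sketch of the size-compatibility rule: copy
/// min(user_size, KERNEL_ATTR_SIZE) bytes and zero-fill the rest, so both
/// older (smaller) and newer (larger) userspace structs are accepted.
fn copy_attr_compat(user_bytes: &[u8]) -> Result<[u8; KERNEL_ATTR_SIZE], i32> {
    const EINVAL: i32 = 22;
    // At minimum the event_type and size fields (two u32s) must be present.
    if user_bytes.len() < 8 {
        return Err(EINVAL);
    }
    let mut kernel_attr = [0u8; KERNEL_ATTR_SIZE]; // absent fields read as zero
    let n = user_bytes.len().min(KERNEL_ATTR_SIZE); // trailing bytes ignored
    kernel_attr[..n].copy_from_slice(&user_bytes[..n]);
    Ok(kernel_attr)
}
```

A VER9 caller passing 144 bytes gets its trailing 8 bytes dropped; a VER0 caller passing 64 bytes sees every post-VER0 field behave as zero.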

/// Performance event attributes — wire-format compatible with Linux perf_event_attr.
/// Size: 136 bytes (PERF_ATTR_SIZE_VER8). Tracks Linux VER8 (config3 added).
///
/// # Safety
///
/// This struct is shared with userspace via the `perf_event_open` syscall.
/// All fields must be read with copy-in (via `copy_from_user`), never via raw pointer.
#[repr(C)]
pub struct PerfEventAttr {
    /// Event type (PERF_TYPE_*). Selects which PMU driver handles this event.
    pub event_type: u32,
    /// sizeof(perf_event_attr) as seen by the calling binary. Kernel uses this
    /// to determine which fields are present. See backward/forward compat above.
    pub size: u32,
    /// Event-specific configuration word. Meaning depends on event_type:
    /// - HARDWARE: PERF_COUNT_HW_* constant
    /// - SOFTWARE: PERF_COUNT_SW_* constant
    /// - TRACEPOINT: tracepoint ID from umkafs/tracefs
    /// - HW_CACHE: encoded cache level/op/result (see "Event Types" below)
    /// - RAW: raw MSR event code, programmed directly into PMU
    pub config: u64,
    /// If FREQ flag set: sampling frequency in Hz (kernel adjusts period dynamically).
    /// If FREQ flag clear: count between samples (fixed period).
    pub sample_period_or_freq: u64,
    /// PERF_SAMPLE_* bitmask. Determines what data is recorded per sample.
    pub sample_type: u64,
    /// PERF_FORMAT_* bitmask. Determines layout of data returned by read().
    pub read_format: u64,
    /// Packed bitfield of boolean event modifiers (disabled, inherit, pinned, …).
    pub flags: PerfEventFlags,
    /// If PERF_WATERMARK: ring buffer watermark in bytes before wakeup.
    /// Otherwise: number of samples before wakeup.
    pub wakeup_events_or_watermark: u32,
    /// Breakpoint type (PERF_TYPE_BREAKPOINT only): HW_BREAKPOINT_X/R/W/RW.
    pub bp_type: u32,
    /// For BREAKPOINT: breakpoint address. For RAW/HW_CACHE: extended config1.
    pub bp_addr_or_config1: u64,
    /// For BREAKPOINT: breakpoint length (1/2/4/8). For RAW/HW_CACHE: config2.
    pub bp_len_or_config2: u64,
    /// PERF_SAMPLE_BRANCH_* bitmask for branch stack sampling (LBR on x86-64).
    pub branch_sample_type: u64,
    /// User-space register mask for PERF_SAMPLE_REGS_USER.
    pub sample_regs_user: u64,
    /// Stack dump size in bytes for PERF_SAMPLE_STACK_USER (max 65528).
    pub sample_stack_user: u32,
    /// Clock source ID for PERF_SAMPLE_TIME. -1 = default (CLOCK_MONOTONIC_RAW).
    pub clockid: i32,
    /// Interrupt-context register mask for PERF_SAMPLE_REGS_INTR.
    pub sample_regs_intr: u64,
    /// AUX ring buffer watermark in bytes (for Intel PT / ARM SPE).
    pub aux_watermark: u32,
    /// Maximum callchain depth for PERF_SAMPLE_CALLCHAIN.
    pub sample_max_stack: u16,
    pub _reserved_2: u16,
    /// AUX sample size limit (Intel PT / ARM SPE per-sample cap).
    pub aux_sample_size: u32,
    pub _reserved_3: u32,
    /// Signal data for PERF_SAMPLE_SIGTRAP (sent to task on overflow).
    pub sig_data: u64,
    /// Extended event config word (Intel: config3 = extended PEBS options).
    pub config3: u64,
}
// VER8 = 136 bytes. VER9 (144 bytes, adds config4) is tracked by forward-compat
// logic: when userspace passes size=144, the kernel reads 136 bytes and silently
// ignores the trailing 8 bytes. UmkaOS can adopt VER9 by adding config4 and
// bumping the const_assert to 144.
const_assert!(size_of::<PerfEventAttr>() == 136);

bitflags! {
    /// Boolean modifiers packed into a u64 bitfield within perf_event_attr.
    /// Bit positions match Linux exactly.
    pub struct PerfEventFlags: u64 {
        /// Event starts disabled; requires PERF_EVENT_IOC_ENABLE.
        const DISABLED          = 1 << 0;
        /// Children created via fork() inherit this event.
        const INHERIT           = 1 << 1;
        /// Pin event to PMU slot; return EBUSY if unavailable.
        const PINNED            = 1 << 2;
        /// Exclusive use of PMU; no other events on this CPU while active.
        const EXCLUSIVE         = 1 << 3;
        /// Do not count events in user space (ring 3).
        const EXCLUDE_USER      = 1 << 4;
        /// Do not count events in kernel space (ring 0).
        const EXCLUDE_KERNEL    = 1 << 5;
        /// Exclude hypervisor events.
        const EXCLUDE_HV        = 1 << 6;
        /// Exclude idle task.
        const EXCLUDE_IDLE      = 1 << 7;
        /// Emit PERF_RECORD_MMAP events for executable memory mappings.
        const MMAP              = 1 << 8;
        /// Include comm events (task name changes).
        const COMM              = 1 << 9;
        /// Use sampling frequency (sample_period_or_freq = Hz), not fixed period.
        const FREQ              = 1 << 10;
        /// Inherit counter value across fork (not just enable/disable).
        const INHERIT_STAT      = 1 << 11;
        /// Enable event on exec.
        const ENABLE_ON_EXEC    = 1 << 12;
        /// Emit `PERF_RECORD_FORK` / `PERF_RECORD_EXIT` on fork/exit.
        const TASK              = 1 << 13;
        /// Use ring buffer watermark (wakeup_events_or_watermark = bytes).
        const WATERMARK         = 1 << 14;
        /// Mask for the `precise_ip` 2-bit field (bits 15-16). This is NOT
        /// two independent flags — it is a single 2-bit integer field encoding
        /// the requested IP precision level:
        ///   0 = SAMPLE_IP can have arbitrary skid
        ///   1 = SAMPLE_IP must have constant skid
        ///   2 = SAMPLE_IP requested to have 0 skid (best-effort)
        ///   3 = SAMPLE_IP must have 0 skid (fail if unavailable)
        /// Extract with: `(flags.bits() & PRECISE_IP_MASK) >> 15`
        /// Set with: `flags | PerfEventFlags::from_bits_truncate(level << 15)`
        /// On x86-64, levels 1-3 require PEBS; on AArch64, SPE provides level 2+.
        const PRECISE_IP_MASK   = 0x3 << 15;
        /// Also record data (non-executable) mmap events. Executable mappings are
        /// always included when MMAP (bit 8) is set; this flag adds non-executable
        /// mappings (heap, stack, data segments) to the PERF_RECORD_MMAP stream.
        const MMAP_DATA         = 1 << 17;
        /// Include sample_id in all non-SAMPLE record types.
        const SAMPLE_ID_ALL     = 1 << 18;
        /// Exclude events from non-host (guest) context.
        const EXCLUDE_HOST      = 1 << 19;
        /// Exclude events from host context (count only in guest).
        const EXCLUDE_GUEST     = 1 << 20;
        /// Exclude kernel callchains.
        const EXCLUDE_CALLCHAIN_KERNEL = 1 << 21;
        /// Exclude user callchains.
        const EXCLUDE_CALLCHAIN_USER   = 1 << 22;
        /// Emit extended PERF_RECORD_MMAP2 records (with inode and protection data).
        const MMAP2             = 1 << 23;
        /// Emit PERF_RECORD_COMM with exec flag on exec.
        const COMM_EXEC         = 1 << 24;
        /// Use the clock specified in the `clockid` field for timestamps.
        const USE_CLOCKID       = 1 << 25;
        /// Include context-switch records in the ring buffer.
        const CONTEXT_SWITCH    = 1 << 26;
        /// Write ring buffer from the end (tail first, for overwrite mode).
        const WRITE_BACKWARD    = 1 << 27;
        /// Emit PERF_RECORD_NAMESPACES records when a task's namespace
        /// membership changes (fork, setns, unshare).
        const NAMESPACES        = 1 << 28;
        /// Emit PERF_RECORD_KSYMBOL for new kernel symbols (eBPF JIT).
        const KSYMBOL           = 1 << 29;
        /// Emit PERF_RECORD_BPF_EVENT for BPF program load/unload.
        const BPF_EVENT         = 1 << 30;
        /// Enable AUX output interleaved in the main ring buffer.
        const AUX_OUTPUT        = 1 << 31;
        /// Emit PERF_RECORD_CGROUP for cgroup switches.
        const CGROUP            = 1 << 32;
        /// Include text poke events (JIT code patching, ftrace trampolines).
        const TEXT_POKE         = 1 << 33;
        /// Use build ID (not path) in PERF_RECORD_MMAP2 events. Reduces
        /// event size and avoids path resolution overhead.
        const BUILD_ID          = 1 << 34;
        /// Children inherit this event only if `CLONE_THREAD` is used.
        const INHERIT_THREAD    = 1 << 35;
        /// Event is removed from the task on exec.
        const REMOVE_ON_EXEC    = 1 << 36;
        /// Send synchronous SIGTRAP to the task on sample overflow.
        const SIGTRAP           = 1 << 37;
    }
}
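The PRECISE_IP_MASK extraction recipe above can be wrapped in two small helpers. These are hypothetical names, not from the UmkaOS tree; they operate on the raw u64 flags word to sidestep bitflags' single-bit orientation.

```rust
/// Helpers for the 2-bit `precise_ip` field (bits 15-16 of the flags word),
/// following the extract/set recipe in the PRECISE_IP_MASK documentation.
const PRECISE_IP_SHIFT: u32 = 15;
const PRECISE_IP_MASK: u64 = 0x3 << PRECISE_IP_SHIFT;

/// Extract the requested IP precision level (0-3).
fn precise_ip_level(flags: u64) -> u8 {
    ((flags & PRECISE_IP_MASK) >> PRECISE_IP_SHIFT) as u8
}

/// Return `flags` with the precision level replaced, leaving all other
/// bits untouched.
fn with_precise_ip(flags: u64, level: u8) -> u64 {
    debug_assert!(level <= 3);
    (flags & !PRECISE_IP_MASK) | (((level & 0x3) as u64) << PRECISE_IP_SHIFT)
}
```

Treating the two bits as one integer field matters: setting level 2 after level 3 must clear bit 16, which naive OR-ing of flag constants would not do.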

20.8.3 Event Types (PERF_TYPE_*)

The event_type field selects the PMU driver and determines how config is interpreted.

PERF_TYPE_HARDWARE (0) — CPU hardware counters. The config field selects a generic hardware event; the PMU driver maps each to the appropriate hardware-specific event code for the current microarchitecture.

config Constant Description
0 PERF_COUNT_HW_CPU_CYCLES CPU clock cycles
1 PERF_COUNT_HW_INSTRUCTIONS Instructions retired
2 PERF_COUNT_HW_CACHE_REFERENCES Last-level cache references
3 PERF_COUNT_HW_CACHE_MISSES Last-level cache misses
4 PERF_COUNT_HW_BRANCH_INSTRUCTIONS Branch instructions retired
5 PERF_COUNT_HW_BRANCH_MISSES Mispredicted branches
6 PERF_COUNT_HW_BUS_CYCLES Bus cycles
7 PERF_COUNT_HW_STALLED_CYCLES_FRONTEND Cycles stalled in the frontend
8 PERF_COUNT_HW_STALLED_CYCLES_BACKEND Cycles stalled in the backend
9 PERF_COUNT_HW_REF_CPU_CYCLES Reference CPU cycles (unthrottled)

If a generic event is unavailable on the host PMU, perf_event_open returns EOPNOTSUPP.

PERF_TYPE_SOFTWARE (1) — Kernel software counters. No hardware counter slots are consumed; counting is performed in software at the corresponding kernel callsites via per-CPU atomic increments.

config Constant Description
0 PERF_COUNT_SW_CPU_CLOCK CPU wall-clock time (CLOCK_MONOTONIC_RAW)
1 PERF_COUNT_SW_TASK_CLOCK Task CPU time (only while task runs)
2 PERF_COUNT_SW_PAGE_FAULTS Page faults (minor + major)
3 PERF_COUNT_SW_CONTEXT_SWITCHES Context switches
4 PERF_COUNT_SW_CPU_MIGRATIONS Task migrations between CPUs
5 PERF_COUNT_SW_PAGE_FAULTS_MIN Minor page faults (no I/O)
6 PERF_COUNT_SW_PAGE_FAULTS_MAJ Major page faults (I/O required)
7 PERF_COUNT_SW_ALIGNMENT_FAULTS Alignment faults (fixup path)
8 PERF_COUNT_SW_EMULATION_FAULTS Emulated instruction faults
9 PERF_COUNT_SW_DUMMY No-op placeholder event
10 PERF_COUNT_SW_BPF_OUTPUT BPF program output via bpf_perf_event_output
11 PERF_COUNT_SW_CGROUP_SWITCHES Cgroup context switches (Linux 5.13+)

Relationship to always-on aggregation counters: PERF_TYPE_SOFTWARE events tap the same underlying per-CPU counters used by always-on kernel aggregation (e.g., page_fault_count in /proc/vmstat). The counters are incremented unconditionally at each callsite. Perf event sampling observes these counters — it does NOT add a separate increment. Enabling a software perf event therefore has zero additional cost at the increment site; the only cost is the periodic PMI or timer-based sample collection in the perf ring buffer.
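The observation model can be illustrated with a minimal sketch. PAGE_FAULT_COUNT and SwPerfEvent are illustrative names, not the actual UmkaOS identifiers; the point is that the callsite increment is unconditional and the perf event only snapshots and diffs.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Always-on aggregation counter, incremented at the callsite regardless
/// of whether any perf event is attached.
static PAGE_FAULT_COUNT: AtomicU64 = AtomicU64::new(0);

fn page_fault_callsite() {
    // Unconditional increment — identical cost with or without perf attached.
    PAGE_FAULT_COUNT.fetch_add(1, Ordering::Relaxed);
}

/// Software perf event: observes the shared counter, adds no increment.
struct SwPerfEvent {
    /// Snapshot of the counter taken when the event was enabled.
    enabled_at: u64,
}

impl SwPerfEvent {
    fn enable() -> Self {
        Self { enabled_at: PAGE_FAULT_COUNT.load(Ordering::Relaxed) }
    }
    /// Count since enable = current value minus the enable-time snapshot.
    fn read(&self) -> u64 {
        PAGE_FAULT_COUNT.load(Ordering::Relaxed) - self.enabled_at
    }
}
```

Faults that occurred before enable are excluded by the snapshot, and closing the event requires no teardown at the increment site.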

PERF_TYPE_TRACEPOINT (2) — Kernel tracepoints. The config field is the numeric tracepoint ID exposed through umkafs at /ukfs/kernel/tracing/events/<subsystem>/<name>/id. These IDs match Linux's tracefs (/sys/kernel/tracing/events/) for tool compatibility. See Section 20.2 for the tracepoint ABI.

TracepointPmu bridge: PERF_TYPE_TRACEPOINT events are handled by a dedicated TracepointPmu that implements PmuOps, bridging the perf event subsystem to the tracepoint infrastructure:

/// PMU implementation for PERF_TYPE_TRACEPOINT events.
/// Bridges perf_event sampling to the tracepoint callsite registry.
pub struct TracepointPmu;

impl PmuOps for TracepointPmu {
    /// Look up the TracepointCallsite by numeric ID (from perf_event_attr.config),
    /// then register a per-event probe callback on that callsite. The callback
    /// serializes the tracepoint arguments per the callsite's schema into
    /// PERF_SAMPLE_RAW data and writes a PERF_RECORD_SAMPLE record to the
    /// event's perf ring buffer.
    fn event_add(&self, event: &PerfEvent, flags: PerfEventFlags) -> i32;

    /// Unregister the per-event probe callback from the TracepointCallsite.
    /// After this call, the tracepoint no longer writes to this event's
    /// ring buffer.
    fn event_del(&self, event: &PerfEvent);

    /// Return the event count: the number of times the tracepoint has fired
    /// since the event was enabled (or last reset).
    fn read(&self, event: &mut PerfEvent);
}

The probe callback registered by event_add executes in the tracepoint callsite's context (typically process context or softirq, depending on the tracepoint). It serializes the callsite's typed arguments into the PERF_SAMPLE_RAW format (a length-prefixed byte array matching the tracepoint's format schema from /ukfs/kernel/tracing/events/<subsystem>/<name>/format) and appends a PERF_RECORD_SAMPLE record to the event's ring buffer.
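A minimal sketch of the length-prefixed PERF_SAMPLE_RAW payload construction. The exact padding rule here (keeping the 4-byte prefix plus payload 8-byte aligned) is an assumption of this sketch, and encode_sample_raw is an illustrative name.

```rust
/// Sketch: serialize tracepoint argument bytes as a PERF_SAMPLE_RAW payload —
/// a u32 length prefix followed by the data, zero-padded so the total
/// (prefix + payload) stays a multiple of 8 bytes.
fn encode_sample_raw(args: &[u8]) -> Vec<u8> {
    // Smallest padded payload length such that 4 + padded_len is 8-aligned.
    let padded_len = (args.len() + 4 + 7) / 8 * 8 - 4;
    let mut out = Vec::with_capacity(4 + padded_len);
    out.extend_from_slice(&(padded_len as u32).to_ne_bytes());
    out.extend_from_slice(args);
    out.resize(4 + padded_len, 0); // zero padding
    out
}
```

Consumers read the u32 prefix, take that many bytes, and decode them against the tracepoint's format schema.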

PERF_TYPE_HW_CACHE (3) — Cache-level events. The config field encodes three sub-fields packed into a single u64:

config = cache_id | (cache_op_id << 8) | (cache_result_id << 16)
Sub-field Values
cache_id 0=L1D, 1=L1I, 2=LLC, 3=ITLB, 4=DTLB, 5=BPU, 6=NODE
cache_op_id 0=READ, 1=WRITE, 2=PREFETCH
cache_result_id 0=ACCESS, 1=MISS

Not all combinations are available on all microarchitectures. Unavailable combinations return EOPNOTSUPP.
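The encoding can be captured in a one-line helper; the constants below are illustrative names for the sub-field values in the table above.

```rust
/// PERF_TYPE_HW_CACHE config encoding from the table above:
/// config = cache_id | (cache_op_id << 8) | (cache_result_id << 16).
fn hw_cache_config(cache_id: u64, op_id: u64, result_id: u64) -> u64 {
    cache_id | (op_id << 8) | (result_id << 16)
}

// Sub-field constants, named after the table rows (illustrative).
const L1D: u64 = 0;
const LLC: u64 = 2;
const DTLB: u64 = 4;
const OP_READ: u64 = 0;
const OP_WRITE: u64 = 1;
const RESULT_ACCESS: u64 = 0;
const RESULT_MISS: u64 = 1;
```

For example, hw_cache_config(LLC, OP_READ, RESULT_MISS) selects last-level-cache read misses, and hw_cache_config(DTLB, OP_WRITE, RESULT_MISS) selects DTLB write misses.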

PERF_TYPE_RAW (4) — Raw PMU event. The config field is the raw event-select value programmed directly into the PMU's event-select register (e.g., IA32_PERFEVTSELx on Intel x86-64, PMEVTYPER<n>_EL0 on AArch64). Format is PMU-specific and documented in the CPU vendor's software optimization manual. Requires perf_event_paranoid ≤ 1 or CAP_PERFMON.

PERF_TYPE_BREAKPOINT (5) — Hardware data/instruction breakpoint. The bp_addr_or_config1 field is the watch address, bp_len_or_config2 is the access width (1, 2, 4, or 8 bytes), and bp_type is the access type:

bp_type Value Meaning
HW_BREAKPOINT_EMPTY 0 Breakpoint disabled
HW_BREAKPOINT_R 1 Read watchpoint
HW_BREAKPOINT_W 2 Write watchpoint
HW_BREAKPOINT_RW 3 Read or write watchpoint
HW_BREAKPOINT_X 4 Instruction execution breakpoint

Breakpoints use hardware debug registers (DR0-DR3 on x86-64, DBGBCRn_EL1/DBGWCRn_EL1 on AArch64, DBGBCRn/DBGWCRn on ARMv7). Execution breakpoints fire before the instruction commits; data watchpoints fire after the faulting load/store.

Dynamic PMU types (≥ PERF_TYPE_MAX = 6) — registered by drivers at boot time. Examples: Intel PT (Processor Trace), ARM SPE (Statistical Profiling Extension), platform uncore PMUs (memory controllers, PCIe root complexes). Each dynamic PMU receives a type number allocated by pmu_register() at driver init time; userspace discovers it via /sys/bus/event_source/devices/<name>/type.

20.8.4 Sample Type Flags (PERF_SAMPLE_*)

The sample_type bitmask in perf_event_attr determines what data is recorded in each ring buffer sample. Each set bit appends a fixed or variable-length field to the PERF_RECORD_SAMPLE record, in the order listed below. The same sample_type value is stored in perf.data headers so that perf report and other tools know the record layout without kernel involvement.

Bit Value Field added to sample record
PERF_SAMPLE_IP 0x00001 Instruction pointer at time of sample
PERF_SAMPLE_TID 0x00002 pid (process ID) and tid (thread ID) as two u32 words
PERF_SAMPLE_TIME 0x00004 Timestamp in nanoseconds (CLOCK_MONOTONIC_RAW by default)
PERF_SAMPLE_ADDR 0x00008 Faulted virtual address (memory events, requires hardware support)
PERF_SAMPLE_READ 0x00010 Counter values at sample time (layout per read_format)
PERF_SAMPLE_CALLCHAIN 0x00020 Kernel and user callchain (depth limited by sample_max_stack)
PERF_SAMPLE_ID 0x00040 Unique event ID (for matching records across multiplexed events)
PERF_SAMPLE_CPU 0x00080 CPU number and a reserved padding word
PERF_SAMPLE_PERIOD 0x00100 Sample period at the time of the overflow
PERF_SAMPLE_STREAM_ID 0x00200 Group leader event ID
PERF_SAMPLE_RAW 0x00400 Raw tracepoint data bytes (for PERF_TYPE_TRACEPOINT)
PERF_SAMPLE_BRANCH_STACK 0x00800 Branch stack (LBR on Intel x86-64, BRBE on AArch64)
PERF_SAMPLE_REGS_USER 0x01000 User-space register file (mask from sample_regs_user)
PERF_SAMPLE_STACK_USER 0x02000 User stack dump (up to sample_stack_user bytes)
PERF_SAMPLE_WEIGHT 0x04000 Event weight / memory access latency in cycles
PERF_SAMPLE_DATA_SRC 0x08000 Memory data source (L1/L2/LLC/DRAM/remote hit/miss encoding)
PERF_SAMPLE_IDENTIFIER 0x10000 Duplicate event ID placed first in record (for fast demux)
PERF_SAMPLE_TRANSACTION 0x20000 TSX transaction flags and abort reason
PERF_SAMPLE_REGS_INTR 0x40000 Register file at interrupt time (mask from sample_regs_intr)
PERF_SAMPLE_PHYS_ADDR 0x80000 Physical address of PERF_SAMPLE_ADDR (requires CAP_SYS_ADMIN)
PERF_SAMPLE_AUX 0x100000 AUX area data appended inline in sample
PERF_SAMPLE_CGROUP 0x200000 Cgroup ID at sample time
PERF_SAMPLE_DATA_PAGE_SIZE 0x400000 Page size backing the sampled data address
PERF_SAMPLE_CODE_PAGE_SIZE 0x800000 Page size backing the sample IP
PERF_SAMPLE_WEIGHT_STRUCT 0x1000000 Extended weight struct (latency, type, variance fields)
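A sketch of how a record writer (or decoder) might compute the fixed-width portion of a PERF_RECORD_SAMPLE body from sample_type. Variable-length fields (CALLCHAIN, RAW, BRANCH_STACK, REGS_*, STACK_USER, READ, AUX, WEIGHT_STRUCT) are deliberately excluded because their size depends on other attr fields; fixed_sample_bytes is an illustrative name.

```rust
/// Bitmask of the sample_type bits from the table above that each contribute
/// exactly one u64 to the record (TID and CPU are two packed u32 words each).
const FIXED_U64_FIELDS: u64 = 0x00001 // IP
    | 0x00002   // TID (pid:u32 + tid:u32)
    | 0x00004   // TIME
    | 0x00008   // ADDR
    | 0x00040   // ID
    | 0x00080   // CPU (cpu:u32 + reserved:u32)
    | 0x00100   // PERIOD
    | 0x00200   // STREAM_ID
    | 0x04000   // WEIGHT
    | 0x08000   // DATA_SRC
    | 0x10000   // IDENTIFIER
    | 0x20000   // TRANSACTION
    | 0x80000   // PHYS_ADDR
    | 0x200000  // CGROUP
    | 0x400000  // DATA_PAGE_SIZE
    | 0x800000; // CODE_PAGE_SIZE

/// Bytes contributed by the fixed-width fields of a given sample_type.
fn fixed_sample_bytes(sample_type: u64) -> usize {
    (sample_type & FIXED_U64_FIELDS).count_ones() as usize * 8
}
```

Because the same sample_type travels in perf.data headers, a decoder can walk a record by consuming one u64 per set fixed-width bit, in table order, before parsing the variable-length tail.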

20.8.5 Internal Data Structures

The internal design uses three layers: per-event state (PerfEvent), per-CPU hardware context (PerfEventContext), and an optional task-level overlay (TaskPerfCtx). UmkaOS uses per-CPU contexts as the primary scheduling unit, matching PMU hardware reality. Task-pinned events attach an overlay on top of the CPU context and are swapped in/out on each context switch.

// umka-core/src/perf/event.rs

/// Per-event kernel object. One instance per open perf fd.
///
/// Reference-counted via `Arc<PerfEvent>`. Weak references are used for group
/// sibling links to avoid cycles: the group leader holds `Arc` references to
/// siblings, and siblings hold a `Weak` back to the leader.
pub struct PerfEvent {
    /// Copy of userspace-supplied attributes, validated at open time.
    pub attr: PerfEventAttr,
    /// Static reference to the PMU driver that owns this event.
    /// Immutable after open; never null.
    pub pmu: &'static dyn PmuOps,
    /// The per-CPU context that owns this event.
    pub ctx: Arc<PerfEventContext>,
    /// Group leader (None if this event is its own leader).
    pub group_leader: Option<Weak<PerfEvent>>,
    /// Sibling events in the same group (only populated for group leaders).
    /// Bounded by `PERF_MAX_GROUP_SIZE` (32) — enforced at group creation time.
    /// Uses `ArrayVec` to make the bound compile-time enforced and avoid heap
    /// allocation for group management.
    pub siblings: Mutex<ArrayVec<Weak<PerfEvent>, PERF_MAX_GROUP_SIZE>>,
    /// Ring buffer for sample data. None until userspace mmap()s the event fd.
    /// Shared with userspace via Arc; same physical pages, no data copy.
    pub ring_buffer: Option<Arc<PerfRingBuffer>>,
    /// Current scheduling state (PerfEventState).
    pub state: AtomicU32,
    /// Accumulated event count. Updated on context-out and explicit reads.
    pub count: AtomicU64,
    /// Total nanoseconds this event has been enabled (for multiplexing scale).
    pub time_enabled_ns: AtomicU64,
    /// Total nanoseconds this event has run on PMU hardware.
    pub time_running_ns: AtomicU64,
    /// Optional overflow callback. Invoked from per-CPU kthread (Stage 2).
    pub overflow_handler: Option<PerfOverflowFn>,
    /// CPU this event is pinned to, or -1 to follow the owning task.
    pub cpu: i32,
    /// Hardware counter slot index assigned by the PMU driver.
    /// -1 means the event is not currently scheduled onto hardware.
    pub hw_counter_idx: i32,
    /// Unique event ID assigned at open time (used with PERF_FORMAT_ID).
    pub id: u64,
    /// Timestamp of last schedule-in for this event (used by rotation logic).
    /// `AtomicU64` because `perf_rotate_context` writes through `Arc<PerfEvent>`
    /// (shared reference — requires interior mutability). Consistent with
    /// `count`, `time_enabled_ns`, and `time_running_ns` which also use AtomicU64.
    pub last_schedule_in_ns: AtomicU64,
}

/// Scheduling state of a PerfEvent.
#[repr(u32)]
pub enum PerfEventState {
    /// Explicitly disabled by userspace (attr.disabled or IOC_DISABLE).
    Off      = 0,
    /// Desired but hardware resource unavailable (slot conflict).
    Error    = 1,
    /// Enabled but not currently on PMU: preempted by multiplexing.
    Inactive = 2,
    /// Currently programmed into a hardware counter and counting.
    Active   = 3,
    /// Owning task exited; event is orphaned. The event's fd may still be
    /// open (held by another process via `pidfd_getfd` or `SCM_RIGHTS`),
    /// but no further samples will be collected. `read()` returns the final
    /// accumulated count. Transitions to this state are irreversible.
    Dead     = 4,
}

/// Overflow callback type.
///
/// Called from the per-CPU perf sampler kthread (not NMI context).
/// May allocate memory and acquire non-NMI-safe locks.
pub type PerfOverflowFn = fn(event: &Arc<PerfEvent>, regs: &SavedRegs, count: u64);
// umka-core/src/perf/context.rs

/// Maximum hardware performance counters simultaneously active on one CPU.
/// Intel: 4-8 general + 3 fixed = 11 max. AMD: 6 general = 6 max.
/// ARM PMUv3: 6 general + 1 cycle = 7 max. Use 16 as safe upper bound.
pub const PERF_MAX_ACTIVE: usize = 16;

/// Maximum number of sibling events in a perf event group.
/// Linux uses 64K as a soft limit; UmkaOS caps at 32 because real PMUs have ≤16
/// counters and excessive group sizes only cause multiplexing overhead.
pub const PERF_MAX_GROUP_SIZE: usize = 32;

/// Maximum number of perf events pinned to a single task.
/// Matches Linux's practical limit (tasks rarely have more than 16 hardware
/// counters + a few software events). Exceeding this returns ENOSPC.
pub const MAX_TASK_EVENTS: usize = 64;

/// Per-CPU PMU context.
///
/// One instance per CPU, allocated at boot time and stored in the CpuLocal area
/// (see Section 3.2). There is no global PMU lock. Each CPU context is protected
/// by its own lock. Cross-CPU operations (e.g., reading a task event from a
/// remote CPU) use IPI + per-CPU lock sequences.
///
/// # Hot Path
///
/// `active[0..active_count]` is accessed on **every context switch**
/// (`perf_schedule_in` / `perf_schedule_out`). This path MUST be lock-free
/// for system-wide events. Task-pinned events acquire a per-task SpinLock
/// (`TaskPerfCtx.task_events.lock()`) during schedule-in/out to swap the
/// task's event list into the CPU context. This is bounded to the migrating
/// task's event list — contention is limited to concurrent `perf_event_open`/
/// `close` on the same task, not to the global context switch rate.
/// SpinLock (not Mutex) because schedule-in/out runs with preemption disabled.
/// The previous design used `SpinlockIrq<Vec<Arc<PerfEvent>>>` which acquired a
/// lock on every context switch — unacceptable at context switch frequencies of
/// 100k–1M/sec/CPU. The new design uses a fixed-size array of raw event pointers
/// with an atomic count so the system-wide context switch path never acquires a
/// lock, never touches a reference count, and never touches the heap allocator.
pub struct PerfEventContext {
    /// CPU index this context belongs to.
    pub cpu: u32,

    /// Active event pointers: `active[0..active_count]` are valid non-null
    /// pointers to `PerfEvent` objects kept alive by `events_lock.active_events`.
    ///
    /// Uses `AtomicPtr` (not raw `*const PerfEvent`) because these pointers
    /// are written through `&self` (shared reference) during event add/remove
    /// while the context switch path reads them concurrently without holding
    /// `events_lock`. Without `AtomicPtr`, writes through `&self` to non-atomic
    /// pointers would be undefined behavior (violates Rust's aliasing model,
    /// requires `UnsafeCell`/atomic for interior mutability).
    ///
    /// Lock-free read during context switch: load `active_count` with
    /// `Ordering::Acquire`, then iterate `active[0..count]` loading each
    /// `AtomicPtr` with `Ordering::Acquire`.
    ///
    /// Written only during event add/remove (infrequent), under `events_lock`.
    /// Removal decrements `active_count` with `Ordering::Release` BEFORE
    /// clearing the pointer, ensuring lockless readers never see a stale index.
    pub active: [AtomicPtr<PerfEvent>; PERF_MAX_ACTIVE],
    /// Number of valid entries in `active[]`. Atomic so the context switch path
    /// can read it without holding `events_lock`.
    pub active_count: AtomicU32,

    /// Lock protecting modifications to `active[]` and the mutable event lists.
    /// NOT held during context switch — only during event add/remove/rotate.
    /// SpinLock (not Mutex) because `perf_rotate_context()` is called from
    /// `scheduler_tick()` with preemption disabled and runqueue lock held.
    /// A sleeping lock (Mutex) would panic if it contended in that context.
    pub events_lock: SpinLock<PerfEventMutable>,

    /// Number of general-purpose hardware counter slots on this CPU.
    pub hpc_slots: u8,
    /// Number of fixed-function counter slots (e.g., Intel fixed PMC0-2).
    pub fixed_slots: u8,
    /// Task-level overlay: swapped on context switch for pid-pinned events.
    /// None if no task-pinned events are active on this CPU.
    pub task_ctx: Option<Arc<TaskPerfCtx>>,
    /// Total nanoseconds this context has been enabled.
    pub time_enabled_ns: AtomicU64,
    /// Total nanoseconds events have been running on PMU hardware.
    pub time_running_ns: AtomicU64,
    /// Per-CPU kthread handle for async sample processing ("Sampling Overflow Handling" below).
    pub sampler_thread: KthreadHandle,
    /// SPSC ring from NMI handler to sampler kthread (Stage 1 → Stage 2).
    pub raw_sample_queue: SpscRing<RawSample, RAW_SAMPLE_QUEUE_DEPTH>,
    /// Bitmask of hardware counter slots currently in use.
    pub enabled_counter_mask: u64,
    /// Timestamp of last rotation pass on this context.
    pub last_rotate_ns: u64,
}

/// Mutable event state, modified only under `PerfEventContext::events_lock`.
pub struct PerfEventMutable {
    /// Authoritative list of active events (keeps `Arc<PerfEvent>` alive so that
    /// the raw pointers in `PerfEventContext::active[]` remain valid).
    pub active_events: ArrayVec<Arc<PerfEvent>, PERF_MAX_ACTIVE>,
    /// Events waiting for a hardware counter slot (multiplexing queue).
    /// Bounded: total events per context (active + pending) cannot exceed
    /// `PERF_MAX_ACTIVE` — event creation fails with `ENOSPC` beyond that limit.
    /// Uses ArrayVec (not VecDeque) to avoid heap allocation under the
    /// events_lock, which may be acquired on scheduler tick (hot path).
    pub pending_events: ArrayVec<Arc<PerfEvent>, PERF_MAX_ACTIVE>,
}

// SAFETY: `PerfEventContext::active[]` contains raw pointers to `PerfEvent`
// objects. Those objects are kept alive by `Arc<PerfEvent>` entries in
// `PerfEventMutable::active_events`. Raw pointers are set before `active_count`
// is incremented (Release store) and cleared only after `active_count` is
// decremented (Release store), so lockless readers always see consistent state.
unsafe impl Send for PerfEventContext {}
unsafe impl Sync for PerfEventContext {}

/// Task-level PMU overlay.
///
/// Created on the first `perf_event_open` with `pid != -1` for that task.
/// Stored in the `Task` struct and swapped into the owning CPU context on
/// schedule-in; removed on schedule-out.
pub struct TaskPerfCtx {
    /// Back-reference to the owning task (Weak to avoid cycle).
    pub task: Weak<Task>,
    /// Events tracking this specific task (pid-pinned).
    /// SpinLock (not Mutex) because perf_schedule_out/in runs on every
    /// context switch with preemption disabled and the runqueue lock held.
    /// A sleeping lock is illegal in this context. ArrayVec bounds the
    /// hot-path allocation per the collection usage policy.
    pub task_events: SpinLock<ArrayVec<Arc<PerfEvent>, MAX_TASK_EVENTS>>,
}

/// Capacity of the raw-sample SPSC queue (entries; must be a power of 2).
/// Sized to absorb bursts at the maximum PMI rate before the kthread drains it.
/// At 100 kHz max sample rate and 1 ms kthread wake latency: 100 entries minimum.
/// 512 provides comfortable headroom for multi-counter simultaneous overflow.
const RAW_SAMPLE_QUEUE_DEPTH: usize = 512;

/// Minimal sample record written by the NMI handler (Stage 1).
/// Transferred to the sampler kthread for full record construction (Stage 2).
///
/// RawSample size is architecture-dependent due to SavedRegs.
/// Worst case: x86-64 with 16 GPRs = 46 + 128 + 2(pad) = 176 bytes.
/// The SPSC queue (RAW_SAMPLE_QUEUE_DEPTH=512 entries) memory budget assumes
/// entries <= 256 bytes. Per-arch const_assert! in arch/*/mod.rs.
// kernel-internal, not KABI — NMI handler → kthread transfer, never exposed to userspace.
#[repr(C)]
pub struct RawSample {
    /// Instruction pointer at time of PMI.
    pub ip: u64,
    /// Stack pointer (for user stack unwind in kthread).
    pub sp: u64,
    /// Frame pointer (0 if frame pointer elimination is active).
    pub fp: u64,
    /// Nanosecond timestamp from the timekeeping fast path (no lock required).
    pub timestamp_ns: u64,
    /// CPU number.
    pub cpu: u32,
    /// PID of the interrupted task.
    pub pid: u32,
    /// TID of the interrupted task.
    pub tid: u32,
    /// Saved general-purpose registers at time of interrupt.
    pub regs: SavedRegs,
    /// Index into `active[]` identifying the overflowed counter.
    pub event_idx: u16,
}

Context switch fast path. perf_schedule_out and perf_schedule_in are called on every context switch for every CPU that has active perf events. The system-wide event path MUST be lock-free. The fixed-size active[] array and atomic active_count make this possible: the context switch path reads the count once with Ordering::Acquire, then iterates the raw pointer array — no lock acquisition, no reference count manipulation, no heap access. Task-pinned events use a per-task SpinLock (task_events.lock()) which is bounded to the migrating task's event list — contention is limited to concurrent perf_event_open/close on the same task, not to the global context switch rate.

The scheduler's context_switch() procedure (Section 7.3) invokes perf_schedule_out(prev_ctx) as step 2 (immediately after updating prev's scheduling class state) and perf_schedule_in(next_ctx) as step 7 (immediately after restoring next's general-purpose registers). These calls are unconditional on every context switch; when active_count == 0, each reduces to a single atomic load (~1ns).

// umka-core/src/perf/context.rs (context switch fast path)

/// Called on context switch out — removes the outgoing task's events from PMU hardware.
///
/// # Safety
///
/// The caller (scheduler) guarantees no concurrent modification of `active[]`
/// during this call: `events_lock` cannot be held across a `schedule()` call
/// because `schedule()` is not nestable with `events_lock`. The `active_count`
/// Acquire load pairs with the Release store in `perf_event_add`/`perf_event_del`,
/// ensuring the pointer writes are visible before the count is visible here.
pub fn perf_schedule_out(ctx: &mut PerfEventContext) {
    // Save and remove task-pinned events (if any).
    // The outgoing task's TaskPerfCtx is detached from the CPU context so its
    // events are not programmed into PMU hardware while the task is off-CPU.
    //
    // Note: system-wide events are lock-free on the context switch path (no lock
    // acquired, no reference count touched). ONLY task-pinned events acquire the
    // per-task SpinLock below. Tasks with no perf events skip this entirely.
    if let Some(ref task_ctx) = ctx.task_ctx {
        let task_events = task_ctx.task_events.lock(); // SpinLock, not Mutex
        for event in task_events.iter() {
            event.pmu.event_stop(event, PERF_EF_UPDATE);
            event.pmu.event_del(event, 0);
        }
    }
    // Detach the task overlay: `.take()` moves the Arc<TaskPerfCtx> out of
    // `ctx.task_ctx` and returns it. The scheduler stores it back into
    // `prev_task.perf_ctx` (via `prev_task.perf_ctx = taken_ctx`). On
    // schedule_in, `next_task.perf_ctx.take()` moves the incoming task's
    // TaskPerfCtx into the CPU context. This swap requires `&mut self` on
    // `PerfEventContext` — the scheduler calls with `&mut` because it owns
    // the per-CPU context (not shared). The `PmuOps` trait methods also take
    // `&mut PerfEventContext` for consistency.
    let _detached = ctx.task_ctx.take();
    // `_detached` is returned to the caller (scheduler) for storage in
    // prev_task.perf_ctx. System-wide (per-CPU) events remain programmed
    // across context switches — they are NOT stopped/restarted here.
}

/// Called on context switch in — programs the incoming task's events into PMU hardware.
///
/// # Safety
///
/// Same invariants as `perf_schedule_out`.
pub fn perf_schedule_in(ctx: &mut PerfEventContext) {
    // System-wide (per-CPU) events remain programmed — no action needed.
    // Only task-pinned events need to be installed from the incoming task.

    // Install task-pinned events from the incoming task's TaskPerfCtx.
    // The scheduler retrieves next_task.perf_ctx and attaches it to ctx.task_ctx.
    // Then program each task-pinned event into PMU hardware.
    if let Some(ref task_ctx) = ctx.task_ctx {
        let task_events = task_ctx.task_events.lock();
        for event in task_events.iter() {
            event.pmu.event_add(event, PERF_EF_START).ok();
        }
    }
}

Event add slow path. Adding and removing events is infrequent (user calls perf_event_open / closes the fd). These paths hold events_lock and update both the Arc-owning active_events list (which keeps the objects alive) and the raw pointer array read by the fast path:

// umka-core/src/perf/context.rs (event add/remove slow paths)

/// Add a new event to this CPU context.
///
/// Slow path: acquires `events_lock`. Must not be called from context switch
/// or NMI context.
pub fn perf_event_add(ctx: &PerfEventContext, event: Arc<PerfEvent>) -> Result<()> {
    let mut mutable = ctx.events_lock.lock();

    if mutable.active_events.len() >= PERF_MAX_ACTIVE {
        // No free HW slot: queue for multiplexing rotation.
        mutable.pending_events.push(event);
        return Ok(());
    }

    // `event_init` already ran once at `perf_event_open` time (see `PmuOps`:
    // it requires `&mut PerfEvent`, which an `Arc` cannot provide here).
    // Only program and start the hardware counter.
    event.pmu.event_add(&event, PERF_EF_START)?;

    let idx = mutable.active_events.len();
    // Store raw pointer BEFORE incrementing count (Release ordering on the
    // count store pairs with Acquire on the fast-path load).
    ctx.active[idx].store(Arc::as_ptr(&event) as *mut PerfEvent, Ordering::Release);
    mutable.active_events.push(event);

    // Release store: makes the pointer write visible to lockless readers
    // before they can observe the incremented count.
    ctx.active_count.fetch_add(1, Ordering::Release);
    Ok(())
}

/// Remove an event from this CPU context.
///
/// Slow path: acquires `events_lock`. Decrements `active_count` with Release
/// BEFORE removing the pointer so lockless readers at the old count can still
/// safely dereference `active[old_count - 1]`.
pub fn perf_event_del(ctx: &PerfEventContext, event: &Arc<PerfEvent>) -> Result<()> {
    let mut mutable = ctx.events_lock.lock();

    let pos = mutable.active_events
        .iter()
        .position(|e| Arc::ptr_eq(e, event))
        .ok_or(Error::NotFound)?;

    event.pmu.event_stop(event, PERF_EF_UPDATE);
    event.pmu.event_del(event, 0);

    // Swap-remove to fill the hole: move the last pointer into pos.
    let last = mutable.active_events.len() - 1;
    // Fill the hole BEFORE decrementing count. Lockless readers see the
    // old count and may read active[pos]; the new pointer must be valid
    // before the count decrement makes the old `last` slot unreachable.
    if pos != last {
        ctx.active[pos].store(
            ctx.active[last].load(Ordering::Acquire),
            Ordering::Release
        );
    }
    ctx.active_count.fetch_sub(1, Ordering::Release);
    // Clear the stale duplicate at active[last]. After the count decrement,
    // lockless readers iterate active[0..new_count] and will not reach `last`.
    // But a reader that loaded the OLD count before the decrement may still
    // iterate to `last`. Nulling prevents double-counting the moved event.
    // Reader iteration skips null entries.
    if pos != last {
        ctx.active[last].store(core::ptr::null_mut(), Ordering::Relaxed);
    }
    mutable.active_events.swap_remove(pos);

    // Promote a pending event if one is waiting.
    if !mutable.pending_events.is_empty() {
        let next = mutable.pending_events.swap_remove(0);
        let _ = perf_event_add_locked(ctx, &mut mutable, next);
    }
    Ok(())
}
// umka-core/src/perf/ring_buffer.rs

/// Shared ring buffer between kernel and userspace.
///
/// The same physical pages are mapped twice: into the kernel's direct-map for
/// writing, and into the userspace address space (via mmap) for reading.
/// No data is ever copied between the two mappings.
pub struct PerfRingBuffer {
    /// Pointer to the 4 KB mmap header page. Writable by kernel, mapped
    /// read-only into userspace as the first page of the mmap region.
    pub header: NonNull<PerfMmapPage>,
    /// Pointer to the data pages. Writable by kernel.
    /// `data_size` must be a power of 2.
    pub data: NonNull<[u8]>,
    /// Size of the data region in bytes (power of 2).
    pub data_size: usize,
    /// Kernel write position in bytes (wraps mod data_size).
    /// Written with Release ordering; userspace reads with Acquire.
    ///
    /// **Wrap safety**: u64 wraps at ~18.4 EB of data. At 10 GB/s sustained
    /// throughput, wrap occurs after ~58,000 years — safe for 50-year uptime.
    ///
    /// **Lost-sample detection (userspace protocol)**: In the default
    /// (non-overwrite) mode the kernel never overwrites unread data; when
    /// the ring is full it drops the sample, increments `lost_samples`,
    /// and emits a `Lost` record. In overwrite mode, userspace detects
    /// overwritten data by checking `data_head - data_tail > data_size`;
    /// it re-synchronizes by advancing `data_tail` to
    /// `data_head - data_size`. This matches the standard Linux
    /// `perf_event_mmap_page` ring protocol.
    pub data_head: AtomicU64,
    /// Optional AUX buffer for Intel PT / ARM SPE trace output.
    /// Tuple: (pointer to AUX pages, size in bytes).
    pub aux_data: Option<(NonNull<[u8]>, usize)>,
    /// Kernel AUX write position.
    pub aux_head: AtomicU64,
    /// Physical page descriptors for the data region. Kept alive until the
    /// last Arc reference to this PerfRingBuffer is dropped, which may be
    /// after the perf fd is closed if userspace still has an mmap open.
    /// Bounded by data_size / PAGE_SIZE; max limited by perf_event_mlock_kb
    /// sysctl (default 516 KB = 129 pages of 4 KB).
    /// debug_assert!(pages.len() <= MAX_PERF_RING_PAGES) enforced at construction.
    pub pages: Vec<Arc<PhysPage>>,
    /// Samples dropped because the ring buffer was full.
    pub lost_samples: AtomicU64,
}
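The overwrite-mode re-synchronization arithmetic described on `data_head` can be checked in isolation. The following is a standalone sketch (the `resync_tail` helper is illustrative, not part of the kernel source):

```rust
/// Given the kernel's write head, userspace's read tail, and the buffer
/// size (power of 2), return the tail to resume reading from and how many
/// bytes were lost to overwrite (overwrite mode only).
fn resync_tail(data_head: u64, data_tail: u64, data_size: u64) -> (u64, u64) {
    let unread = data_head.wrapping_sub(data_tail);
    if unread > data_size {
        // Kernel lapped the reader: the oldest intact byte is head - size.
        let new_tail = data_head - data_size;
        (new_tail, new_tail - data_tail) // (resume point, bytes lost)
    } else {
        (data_tail, 0) // still in sync
    }
}

fn main() {
    // Reader fell 5000 bytes behind on a 4096-byte ring: 904 bytes lost.
    assert_eq!(resync_tail(5000, 0, 4096), (904, 904));
    // Reader within one buffer of the head: nothing lost.
    assert_eq!(resync_tail(3000, 1000, 4096), (1000, 0));
    println!("ok");
}
```

Because both positions only ever increase (they wrap mod `data_size` at access time, not in the counters), the subtraction stays valid across the u64 lifetime noted above.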

20.8.6 perf_event_mmap_page Header

The first 4 KB page of the mmap() region is the PerfMmapPage header. It exposes kernel-maintained state to userspace without system calls: ring buffer positions, time-conversion coefficients, and an optional direct PMC read shortcut (rdpmc on x86-64).

The layout exactly matches Linux's perf_event_mmap_page:

/// First page of the perf mmap region — stable ABI shared with userspace.
///
/// Fields in the first 1 KB are stable across kernel versions.
/// Fields at and beyond offset 1 KB are architecture-specific PMC shortcuts;
/// userspace must check `capabilities` before using them.
// Userspace ABI — matches Linux `struct perf_event_mmap_page` exactly.
// Layout frozen by Linux perf ABI. Shared with userspace via mmap() and read
// directly by perf, bpftrace, BCC, JVM perf agents, and all PMU consumers.
// Do NOT modify field layout without verifying Linux ABI compatibility.
#[repr(C)]
pub struct PerfMmapPage {
    /// Kernel ABI version. Currently 0. Userspace checks on open.
    pub version:         u32,
    /// Compatibility version. Must be 0 for userspace to proceed.
    pub compat_version:  u32,
    /// Sequence lock word. Userspace must retry if this changes during read.
    pub lock:            u32,
    /// PMC index for `rdpmc` (1-based; 0 = not available).
    pub index:           u32,
    /// Offset added to `rdpmc` result to get the signed event count.
    pub offset:          i64,
    /// Total time this event has been enabled (nanoseconds).
    pub time_enabled:    u64,
    /// Total time this event has been running on PMU (nanoseconds).
    pub time_running:    u64,
    /// Capability flags (bit layout matches Linux):
    ///   bit 0 = cap_bit0 (always 0; deprecated)
    ///   bit 1 = cap_bit0_is_deprecated (always 1)
    ///   bit 2 = cap_user_rdpmc (index/offset are valid for rdpmc)
    ///   bit 3 = cap_user_time (time_* fields are valid)
    ///   bit 4 = cap_user_time_zero (time_zero is valid)
    pub capabilities:    u64,
    /// PMC bit width (for sign-extending the rdpmc result).
    pub pmc_width:       u16,
    /// Shift for TSC-to-nanosecond conversion: ns = (tsc * time_mult) >> time_shift.
    pub time_shift:      u16,
    /// Multiplier for TSC-to-nanosecond conversion.
    pub time_mult:       u32,
    /// Nanosecond offset added after TSC scaling.
    pub time_offset:     i64,
    /// Nanosecond base time at the TSC reference point for time_zero.
    pub time_zero:       u64,
    /// Size of this struct in bytes (currently 4096).
    pub size:            u32,
    pub _reserved_1:     u32,
    /// TSC snapshot used as the reference for time_zero.
    pub time_cycles:     u64,
    /// TSC mask for CPUs with < 64-bit TSC.
    pub time_mask:       u64,
    /// Padding to 1 KB boundary.
    pub _reserved_to_1kb: [u8; 928],

    // Ring buffer control — at offset 1024 (0x400):
    /// Write head in bytes. Updated by kernel with Release ordering.
    pub data_head:       AtomicU64,
    /// Read tail in bytes. Updated by userspace with Release ordering.
    pub data_tail:       AtomicU64,
    /// Byte offset from mmap start to first data byte (= 4096 = one page).
    pub data_offset:     u64,
    /// Data region size in bytes.
    pub data_size:       u64,

    // AUX ring buffer control (Intel PT / ARM SPE):
    /// AUX write head. Updated by kernel.
    pub aux_head:        AtomicU64,
    /// AUX read tail. Updated by userspace.
    pub aux_tail:        AtomicU64,
    /// Byte offset from mmap start to first AUX byte.
    pub aux_offset:      u64,
    /// AUX region size in bytes.
    pub aux_size:        u64,

    /// Architecture-specific PMC shortcut area. Linux reserves the remainder
    /// of the first page for per-arch PMC data (user_page->pmc_width, etc.).
    /// Zero-filled; contents defined per architecture (see `cap_user_rdpmc`).
    pub _reserved_pmc:   [u8; 3008],
}
// PerfMmapPage occupies exactly the first page of the mmap region.
const_assert!(size_of::<PerfMmapPage>() == 4096);
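The `time_mult`/`time_shift` fields implement the cycle-to-nanosecond conversion documented above (`ns = (tsc * time_mult) >> time_shift`). A standalone sketch of that arithmetic, using the two-part split Linux documents to avoid 64-bit overflow on large deltas (the coefficients below are illustrative, not real calibration data):

```rust
/// Convert a raw cycle delta to nanoseconds using the mmap-page
/// coefficients. Computed in two parts (quotient and remainder of the
/// shift) so that `cycles * time_mult` cannot overflow u64.
fn cycles_to_ns(cycles: u64, time_mult: u32, time_shift: u16) -> u64 {
    let quot = cycles >> time_shift;
    let rem = cycles & ((1u64 << time_shift) - 1);
    quot * time_mult as u64 + ((rem * time_mult as u64) >> time_shift)
}

fn main() {
    // Example calibration for a 2 GHz TSC (0.5 ns/cycle):
    // mult = (1e9 << shift) / 2e9; with shift = 10, mult = 512.
    let (mult, shift) = (512u32, 10u16);
    assert_eq!(cycles_to_ns(2_000_000_000, mult, shift), 1_000_000_000);
    assert_eq!(cycles_to_ns(4, mult, shift), 2);
    println!("ok");
}
```

With `cap_user_time_zero` set, an absolute timestamp is recovered as `time_zero + cycles_to_ns((tsc - time_cycles) & time_mask, ...)`.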

Counting without syscalls — For events where cap_user_rdpmc = 1, userspace can read the PMU counter directly using the rdpmc instruction (x86-64) or the equivalent on other architectures:

/* x86-64 userspace pseudocode — no syscall required */
uint64_t perf_rdpmc(struct perf_event_mmap_page *pc) {
    uint32_t seq, idx;
    int64_t  count;
    do {
        seq   = __atomic_load_n(&pc->lock, __ATOMIC_ACQUIRE);
        idx   = pc->index;
        count = pc->offset;
        if (idx)
            count += (int64_t)__builtin_ia32_rdpmc(idx - 1);
    } while (__atomic_load_n(&pc->lock, __ATOMIC_ACQUIRE) != seq);
    return (uint64_t)count;
}

This seqlock protocol is identical to Linux's. perf stat uses it to achieve near-zero overhead in counting mode.

20.8.7 PMU Driver Trait (PmuOps)

UmkaOS replaces Linux's struct pmu (a C struct of nullable function pointers with manual lifetime discipline) with the PmuOps trait. The trait enforces at compile time that every required method is implemented and that the driver is Send + Sync + 'static, making per-CPU access safe without runtime checks.

// umka-core/src/perf/pmu.rs

/// PMU hardware driver interface.
///
/// Implementors are registered at boot via `pmu_register()` and live for the
/// lifetime of the kernel. All methods are called from process context unless
/// explicitly marked as interrupt-disabled paths.
pub trait PmuOps: Send + Sync + 'static {
    /// PMU driver name, e.g. "intel-core", "arm-pmuv3", "amd-core".
    fn name(&self) -> &'static str;

    /// Dynamic PMU type number allocated by `pmu_register()`.
    /// For built-in types (HARDWARE, SOFTWARE, etc.) returns the fixed constant.
    fn pmu_type(&self) -> u32;

    /// Validate and initialize a newly created event for this PMU.
    ///
    /// Called once at `perf_event_open()` time, from process context.
    /// Must verify `event.attr.config` is a valid event code for this PMU,
    /// map generic `PERF_TYPE_HARDWARE` codes to hardware-specific event selects,
    /// allocate any per-event private state, and set `event.hw_counter_idx = -1`.
    fn event_init(&self, event: &mut PerfEvent) -> Result<(), PmuError>;

    /// Schedule event onto the PMU hardware.
    ///
    /// Called with interrupts disabled on the schedule-in path or when the
    /// multiplexer installs this event. Assigns a free hardware counter slot
    /// and programs the event-select and counter-value MSRs/CSRs/SPRs.
    ///
    /// `flags` bits:
    /// - `PERF_EF_START` (0x01): call `event_start` immediately after adding.
    /// - `PERF_EF_RELOAD` (0x02): event is returning from multiplexer; reload
    ///   the saved partial count (for correct scaling).
    fn event_add(&self, event: &PerfEvent, flags: u32) -> Result<(), PmuError>;

    /// Remove event from the PMU (without destroying it).
    ///
    /// Called with interrupts disabled on schedule-out or when the multiplexer
    /// rotates this event out. Must save the current hardware count to
    /// `event.count` before clearing the PMU registers.
    ///
    /// `flags` bits:
    /// - `PERF_EF_UPDATE` (0x04): accumulate current hardware count into
    ///   `event.count` before removing.
    fn event_del(&self, event: &PerfEvent, flags: u32);

    /// Start counting (unmask / enable the hardware counter).
    ///
    /// Called after `event_add` when `attr.disabled` is clear, or in response
    /// to `PERF_EVENT_IOC_ENABLE`. The counter was already programmed by
    /// `event_add`; this call only unmasks it.
    fn event_start(&self, event: &PerfEvent, flags: u32);

    /// Stop counting (mask / disable the hardware counter).
    ///
    /// Called in response to `PERF_EVENT_IOC_DISABLE` or before `event_del`.
    /// Does not remove the event from the PMU; `event_start` can resume it.
    fn event_stop(&self, event: &PerfEvent, flags: u32);

    /// Read the current hardware counter value and accumulate into `event.count`.
    ///
    /// Called in response to a `read()` syscall on the perf fd, or before
    /// `event_del` with `PERF_EF_UPDATE`. Must be idempotent (calling twice
    /// without an intervening `event_start` must not double-count).
    fn event_read(&self, event: &mut PerfEvent) -> u64;

    /// Context switch in: install the incoming task's events onto the PMU.
    ///
    /// Called by the context-switch path after the task switch completes.
    /// Swaps in `ctx.task_ctx` (if any) and calls `event_add` for each
    /// task-pinned event.
    fn event_schedule_in(&self, ctx: &mut PerfEventContext);

    /// Context switch out: remove the outgoing task's events from the PMU.
    ///
    /// Called before the task switch. Calls `event_del` (with `PERF_EF_UPDATE`)
    /// for each task-pinned event and removes the task overlay from `ctx.task_ctx`.
    fn event_schedule_out(&self, ctx: &mut PerfEventContext);

    /// Return static hardware capabilities of this PMU.
    fn capabilities(&self) -> PmuCapabilities;

    /// Handle a sampling overflow interrupt (NMI on x86-64, PMI on others).
    ///
    /// Called from the architecture interrupt handler. Must do the absolute
    /// minimum: push a `RawSample` to `queue` (non-blocking) and reprogram
    /// the counter. Full sample construction happens in the sampler kthread.
    ///
    /// Must not: allocate memory, take sleeping locks, or unwind the call stack.
    ///
    /// Default implementation: no-op (for non-sampling PMUs).
    fn event_overflow(
        &self,
        _event: &PerfEvent,
        _raw: &mut RawSample,
        _queue: &SpscRing<RawSample, RAW_SAMPLE_QUEUE_DEPTH>,
    ) {
    }
}

/// Static capability descriptor for a PMU.
pub struct PmuCapabilities {
    /// Number of general-purpose hardware counters.
    pub num_gp_counters: u8,
    /// Number of fixed-function counters (cycles, instructions, ref-cycles, etc.).
    pub num_fixed_counters: u8,
    /// Bit width of each counter register (typically 48 on modern CPUs).
    pub counter_width: u8,
    /// Supports overflow interrupt (sampling mode).
    pub supports_sampling: bool,
    /// Supports hardware callchain (e.g., Intel LBR, ARM BRBE).
    pub supports_callchain: bool,
    /// Supports cgroup-scoped filtering in hardware.
    pub supports_cgroup_events: bool,
    /// Supports precise instruction pointer (PEBS on Intel, SPE on AArch64).
    pub supports_precise_ip: bool,
    /// Supports branch stack sampling.
    pub supports_branch_stack: bool,
    /// Maximum branch stack depth (0 if not supported).
    pub max_branch_depth: u8,
}

/// Error returned by PMU driver operations.
#[derive(Debug)]
pub enum PmuError {
    /// The requested event code is not supported by this PMU.
    UnsupportedEvent,
    /// All hardware counter slots are occupied.
    NoSlotAvailable,
    /// Event conflicts with a concurrently scheduled exclusive event.
    ExclusiveConflict,
    /// PMU hardware returned an error status.
    HardwareError(u32),
}
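The subtlest contract above is `event_read` idempotence: calling it twice without an intervening `event_start` must not double-count. A reduced standalone model (a simplified stand-in, not the kernel's `PmuOps`) shows the save-and-rebase pattern a driver uses to satisfy it:

```rust
use std::cell::Cell;

/// Reduced model of one hardware counter slot: the driver accumulates
/// into a software total and rebases `prev_raw` on every read, so
/// repeated reads without an intervening start never double-count.
struct SlotModel {
    hw_raw: Cell<u64>,   // stands in for the raw PMC register
    prev_raw: Cell<u64>, // raw value observed at the last read
    count: Cell<u64>,    // accumulated `event.count`
}

impl SlotModel {
    fn new() -> Self {
        SlotModel { hw_raw: Cell::new(0), prev_raw: Cell::new(0), count: Cell::new(0) }
    }
    /// Simulate the hardware counting `n` events.
    fn tick(&self, n: u64) { self.hw_raw.set(self.hw_raw.get().wrapping_add(n)); }
    /// Model of `event_read`: accumulate only the delta since the previous read.
    fn read(&self) -> u64 {
        let raw = self.hw_raw.get();
        let delta = raw.wrapping_sub(self.prev_raw.get());
        self.prev_raw.set(raw);
        self.count.set(self.count.get() + delta);
        self.count.get()
    }
}

fn main() {
    let slot = SlotModel::new();
    slot.tick(100);
    assert_eq!(slot.read(), 100);
    assert_eq!(slot.read(), 100); // idempotent: no double-count
    slot.tick(50);
    assert_eq!(slot.read(), 150);
    println!("ok");
}
```

The same delta-rebase also serves `event_del` with `PERF_EF_UPDATE`: the driver performs one final read before clearing the PMU registers.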

20.8.8 Architecture PMU Implementations

Each architecture provides a PmuOps implementation registered at boot by the architecture hardware initialization code. PMU characteristics are discovered at runtime (number of counters from CPUID / PMU control registers), not compile-time constants.

Architecture Driver name GP counters Fixed counters Sampling Precise IP
x86-64 (Intel, arch perfmon v3+) intel-core 4–8 (CPUID.0AH) 3 (cycles, instrs, ref-cycles) Yes (PEBS) Yes (PEBS exact IP)
x86-64 (AMD Zen2+) amd-core 6 0 Yes (NMI on overflow) No
AArch64 (PMUv3) arm-pmuv3 ≥6 (PMCR_EL0.N) 1 (PMCCNTR_EL0) Yes (GIC PMI) Yes (with SPE)
ARMv7 arm-pmu ≥4 (PMCR.N) 1 (PMCCNTR) Yes (GIC PMI) No
RISC-V (SBI PMU ext.) riscv-pmu Firmware-defined 3 (CYCLE, TIME, INSTRET CSRs) Firmware-dependent No
PPC32 (Book E) ppc32-pmu 4 (PMC0-PMC3) 0 Yes (Performance Monitor interrupt) No
PPC64LE (POWER10) power-pmu 4 (PMC1–PMC4) 2 (PMC5 instrs, PMC6 cycles) Yes (Performance Monitor exception) No
s390x (z15+) cpumf-pmu 6 (runtime-discovered) 0 Yes (CPU Measurement Alert) No
LoongArch64 (3A5000+) loongarch-pmu 4 (CPUCFG.0x6) 0 Yes (HWI PMI) No

Intel Core PMU (intel-core). Counter count and event availability are discovered via CPUID leaf 0x0A (Architectural Performance Monitoring): - EAX[15:8]: number of general-purpose counters per logical processor. - EAX[23:16]: counter bit width. - EBX[6:0]: bitmask of available architectural events (0 = event available).
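The CPUID.0AH field extraction above is pure bit arithmetic and can be sketched standalone (the example EAX value is fabricated for illustration):

```rust
/// Decode counter geometry from CPUID leaf 0x0A register EAX
/// (architectural performance monitoring).
fn decode_cpuid_0a_eax(eax: u32) -> (u8, u8, u8) {
    let version = (eax & 0xFF) as u8;         // EAX[7:0]  perfmon version ID
    let num_gp  = ((eax >> 8) & 0xFF) as u8;  // EAX[15:8] GP counters per logical CPU
    let width   = ((eax >> 16) & 0xFF) as u8; // EAX[23:16] counter bit width
    (version, num_gp, width)
}

fn main() {
    // Hypothetical EAX encoding version 5, 8 GP counters, 48-bit counters.
    let eax = 0x0030_0805u32;
    assert_eq!(decode_cpuid_0a_eax(eax), (5, 8, 48));
    println!("ok");
}
```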

Events are programmed into IA32_PERFEVTSELx MSRs (event select, unit mask, USR/OS filters, edge detect, interrupt enable) and counters are read from IA32_PMCx MSRs. Fixed-function counters use IA32_FIXED_CTR_CTRL and IA32_FIXED_CTRx. PEBS (Precise Event-Based Sampling) is activated by setting the corresponding bit in IA32_PEBS_ENABLE; the CPU then writes complete PEBS records directly into a kernel-allocated Debug Store (DS) buffer, one per logical CPU.

AMD Core PMU (amd-core). Programs PERFEVTSEL0-PERFEVTSEL5 and reads PERFCTR0-PERFCTR5. On Zen2+, the extended event-select format encodes bits [11:8] of the event code into PERFEVTSEL[35:32]. Overflow generates a standard x86-64 NMI via the local APIC performance-counter LVT entry.
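The split event-code encoding can be shown as a small helper (a sketch of the bit layout only; the event code and unit mask below are arbitrary examples):

```rust
/// Assemble a Zen-family PERFEVTSEL value from a 12-bit event code and
/// a unit mask: event bits [7:0] go to PERFEVTSEL[7:0], event bits [11:8]
/// to PERFEVTSEL[35:32]; the unit mask occupies PERFEVTSEL[15:8].
/// EN (bit 22), USR (bit 16), and OS (bit 17) are set for a user+kernel
/// counting configuration.
fn amd_perfevtsel(event: u16, umask: u8) -> u64 {
    const USR: u64 = 1 << 16;
    const OS:  u64 = 1 << 17;
    const EN:  u64 = 1 << 22;
    let low  = (event & 0xFF) as u64;       // event[7:0]
    let high = ((event >> 8) & 0xF) as u64; // event[11:8]
    low | ((umask as u64) << 8) | USR | OS | EN | (high << 32)
}

fn main() {
    // Event 0x1C7 (wider than 8 bits, so it exercises the extended field).
    let v = amd_perfevtsel(0x1C7, 0x02);
    assert_eq!(v & 0xFF, 0xC7);        // event[7:0]
    assert_eq!((v >> 32) & 0xF, 0x1);  // event[11:8] -> bits [35:32]
    assert_eq!((v >> 8) & 0xFF, 0x02); // unit mask
    println!("ok");
}
```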

ARM PMUv3 (arm-pmuv3). The count of general-purpose counters (PMCR_EL0.N) is read at boot and may range from 0 to 31. Events are programmed into PMEVTYPER<n>_EL0 (event type, filter flags) and counted in PMEVCNTR<n>_EL0. The dedicated cycle counter uses PMCCNTR_EL0 and PMCCFILTR_EL0. Overflow interrupts arrive via a GIC PPI (per-processor interrupt) configured by firmware (typically IRQ 23 in the PPI range). If ID_AA64DFR0_EL1.PMSVer >= 1, the ARM Statistical Profiling Extension is available for precise-IP sampling.

RISC-V SBI PMU (riscv-pmu). Counter management goes through the SBI PMU extension (RISC-V SBI spec chapter 11). The kernel calls sbi_pmu_counter_get_info() at boot to enumerate available counters and their capabilities. sbi_pmu_counter_start() and sbi_pmu_counter_stop() program and release counters. Overflow notification depends on the SBI firmware implementation; UmkaOS requires at minimum SBI v0.3 with PMU extension ID 0x504D55.

POWER PMU (power-pmu). Events are programmed via the MMCR0, MMCR1, MMCRA, and MMCR2 SPRs. Counter values are read from PMC1-PMC6; PMC5 (instructions) and PMC6 (cycles) are fixed-function. Overflow sets the Performance Monitor Alert bit (MMCR0[PMAO]) and raises the dedicated POWER performance monitor exception. On POWER10, memory latency events are available through the combination of MMCRA load/store sampling mode and the PM_MEM_LATENCY_* event codes.

PPC32 Book E PMU (ppc32-pmu). PPC32 Book E processors (e500, e500mc) provide Performance Monitor Counters PMC0-PMC3, programmed via PMGC0 (global control) and PMLCa0-PMLCa3 / PMLCb0-PMLCb3 (per-counter event select and condition registers). Counter values are read from PMC0-PMC3 SPRs. Overflow generates a Performance Monitor interrupt (Book E interrupt vector 0x0260). The kernel reads PMGC0[FAC] (Freeze All Counters) to determine interrupt source and clears the overflow condition by reloading the counter start value. Event codes are processor-specific (Freescale/NXP e500v2 vs e500mc use different event tables); UmkaOS discovers the core type from PVR (Processor Version Register) at boot and selects the appropriate event map.

s390x CPU Measurement Facility (cpumf-pmu). The s390x CPU Measurement Facility (CPUMF) provides two modes: counting and sampling. Counting mode uses the ECCTR (Extract CPU Counter) and SCCTR (Set CPU Counter) instructions to read and program hardware counters. The number of available counters is discovered at boot via STCCTM (Store CPU Counter Multiple) and the CPUMF authorization control in CR0. Counter sets are organized into Basic (cycles, instructions), Crypto, Extended, and MT-diagnostic groups. Sampling mode (CPU Measurement Sampling Facility, CMSF) writes sample records into a kernel-allocated sampling buffer; the buffer-full condition generates a CPU Measurement Alert external interrupt (interrupt code 0x1407). UmkaOS enables CPUMF via LCCTL (Load CPU Counter Controls) and configures sampling via LSCTL (Load Sampling Controls). The facility requires z15 or later for full counter set availability.

LoongArch64 PMU (loongarch-pmu). LoongArch64 processors (Loongson 3A5000, 3A6000) provide CSR-based performance counters. Counter availability is discovered at boot via CPUCFG word 0x6 (Performance Counter feature bits). Counter control registers CSR.PERFCTRL[0-3] select events and enable counting; counter value registers CSR.PERFCNTR[0-3] hold the 64-bit counts. Typical implementations provide 4 programmable counters. Overflow generates a Hardware Interrupt (HWI) routed through the LoongArch EIOINTC interrupt controller. The kernel handles this interrupt by reading CSR.ESTAT to identify the PMI source, then dispatches to the perf overflow handler. Event codes are microarchitecture-specific; UmkaOS reads CPUCFG word 0x0 (Processor Identity) at boot to select the appropriate event table.

20.8.9 Sampling Overflow Handling

Sampling events trigger a PMI (Performance Monitoring Interrupt) — an NMI on x86-64 delivered via the APIC Performance Counter vector, and a regular interrupt on other architectures — when the hardware counter overflows from the programmed start value back through zero.

Two-stage design. Linux performs full callchain unwinding and stack copying directly in the NMI handler. This is unsafe on deeply nested kernel frames and risks stack overflow or incorrect unwinding. UmkaOS splits the work:

Stage 1 — NMI/interrupt handler (minimal work):

// umka-core/src/arch/x86_64/perf/overflow.rs

/// PMU overflow NMI handler (x86-64).
///
/// # Safety
///
/// Called from NMI context. Must not sleep, allocate heap memory, or acquire
/// any lock that is not NMI-safe. Must complete in bounded time.
pub unsafe fn perf_nmi_handler(regs: &SavedRegs) {
    // Read overflow status: IA32_PERF_GLOBAL_STATUS (MSR 0x38E).
    let overflow_mask = rdmsr(IA32_PERF_GLOBAL_STATUS);
    if overflow_mask == 0 {
        return; // Spurious NMI — not from the PMU.
    }

    // Acknowledge overflow bits before re-enabling counters.
    wrmsr(IA32_PERF_GLOBAL_OVF_CTRL, overflow_mask);

    let ctx = CpuLocal::get().perf_ctx;
    // Lock-free read: load count with Acquire, then iterate raw pointers.
    // NMI handlers must not acquire locks (risk of deadlock if NMI fires while
    // events_lock is held). The fixed-size active[] array with atomic count makes
    // this safe — see PerfEventContext doc for pointer lifetime invariants.
    let active_count = ctx.active_count.load(Ordering::Acquire) as usize;

    for idx in 0..active_count {
        // Null check: after swap-remove in perf_event_del, the last slot
        // is nulled. If NMI fires between active_count decrement and pointer
        // clear, we may see a null pointer at a valid index.
        let ptr = ctx.active[idx].load(Ordering::Acquire);
        if ptr.is_null() { continue; }
        // SAFETY: non-null pointer in active[0..active_count] is a valid
        // PerfEvent kept alive by events_lock.active_events (Arc).
        let event = unsafe { &*ptr };

        // Map through hw_counter_idx: overflow_mask bits correspond to
        // hardware PMU counter slots, NOT active[] array indices. After
        // swap-remove or multiplexing rotation, these diverge.
        let hw_idx = event.hw_counter_idx;
        if hw_idx < 0 { continue; }  // event not currently on hardware
        if overflow_mask & (1u64 << hw_idx as u32) == 0 {
            continue;
        }

        let raw = RawSample {
            ip:           regs.ip,
            sp:           regs.sp,
            fp:           regs.bp,
            timestamp_ns: timekeeping_fast_ns(),
            cpu:          cpu_id() as u32,
            pid:          current_task().pid,
            tid:          current_task().tid,
            regs:         *regs,
            event_idx:    idx as u16,
        };

        // Non-blocking push. If queue is full, increment lost_samples and drop.
        if ctx.raw_sample_queue.try_push(raw).is_err() {
            if let Some(rb) = &event.ring_buffer {
                rb.lost_samples.fetch_add(1, Ordering::Relaxed);
            }
        }

        // Reprogram counter to -(period) so next overflow fires after `period` events.
        let width = event.pmu.capabilities().counter_width;
        let period = event.attr.sample_period_or_freq;
        let start_val = (1u64 << width).wrapping_sub(period);
        wrmsr(IA32_PMC0 + hw_idx as u32, start_val);
    }

    // Re-enable global performance counter control.
    wrmsr(IA32_PERF_GLOBAL_CTRL, ctx.enabled_counter_mask);
}
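The counter reload at the end of Stage 1 arms the next overflow. Its arithmetic can be checked in isolation (a sketch; the 48-bit width is the typical value reported via `PmuCapabilities::counter_width`):

```rust
/// Start value for a `width`-bit up-counter so that it overflows after
/// exactly `period` increments: (2^width - period), truncated to `width`
/// bits. Assumes width < 64, which holds for all PMUs in the table above.
fn overflow_start_value(width: u8, period: u64) -> u64 {
    let modulus = 1u64 << width;
    modulus.wrapping_sub(period) & (modulus - 1)
}

fn main() {
    let width = 48u8;
    let period = 100_000u64;
    let start = overflow_start_value(width, period);
    // Counting `period` events from `start` wraps exactly at 2^48.
    assert_eq!((start + period) % (1u64 << width), 0);
    println!("ok");
}
```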

Stage 2 — per-CPU sampler kthread (full work):

// umka-core/src/perf/sampler.rs

/// Per-CPU perf sampler kthread.
///
/// Runs at `SCHED_FIFO` priority 1 (lowest RT class), pinned to its CPU.
/// Woken by a flag set in the scheduler tick when `raw_sample_queue` is
/// non-empty. Performs all work that is unsafe in NMI context: callchain
/// unwinding, user stack copying, ring buffer writes.
pub fn perf_sampler_thread(cpu: u32) -> ! {
    loop {
        kthread_park_wait(); // Blocks until woken.

        let ctx = CpuLocal::for_cpu(cpu).perf_ctx;

        while let Some(raw) = ctx.raw_sample_queue.try_pop() {
            // Resolve the event that overflowed.
            // `active_events` lives in the `events_lock`-protected mutable
            // state (see `PerfEventMutable`); resolve it through the lock.
            let event = match ctx.events_lock
                .lock()
                .active_events
                .get(raw.event_idx as usize)
                .map(Arc::clone)
            {
                Some(e) => e,
                None    => continue,
            };

            let rb = match event.ring_buffer.as_ref().map(Arc::clone) {
                Some(rb) => rb,
                None     => continue,
            };

            // Build the complete perf record (may sleep, may allocate).
            let record = build_perf_sample(&event, &raw);
            ring_buffer_write(&rb, &record);

            // Invoke optional overflow handler (e.g., attached eBPF program).
            if let Some(handler) = event.overflow_handler {
                handler(&event, &raw.regs, event.attr.sample_period_or_freq);
            }

            // Wake userspace readers if the wakeup threshold is reached.
            if ring_buffer_should_wakeup(&rb, &event) {
                rb.waitqueue.wake_all();
            }
        }
    }
}

/// All 22 perf record types. Matches Linux `enum perf_event_type` in
/// `include/uapi/linux/perf_event.h`. This enum is the dispatch key for the
/// ring buffer parser — `perf report`, `bpftrace`, and BCC depend on every
/// type being present with the correct discriminant value.
///
/// Verified against torvalds/linux master `include/uapi/linux/perf_event.h`.
#[repr(u32)]
pub enum PerfRecordType {
    /// Memory mapping record. Emitted for executable mappings when MMAP flag
    /// is set in `perf_event_attr.flags`.
    /// Data: `PerfRecordMmap` (pid, tid, addr, len, pgoff, filename[]).
    Mmap              = 1,
    /// Lost events record. Emitted when ring buffer overflows.
    /// Data: `PerfRecordLost` (id, lost_count).
    Lost              = 2,
    /// Task name change record. Emitted on `prctl(PR_SET_NAME)` or exec.
    /// Data: `PerfRecordComm` (pid, tid, comm[]).
    Comm              = 3,
    /// Task exit record. Emitted on `do_exit()`.
    /// Data: `PerfRecordForkExit` (pid, ppid, tid, ptid, time).
    Exit              = 4,
    /// Throttle record. Emitted when kernel throttles event sampling.
    /// Data: `PerfRecordThrottle` (time, id, stream_id).
    Throttle          = 5,
    /// Unthrottle record. Emitted when sampling resumes after throttling.
    /// Data: `PerfRecordThrottle` (time, id, stream_id).
    Unthrottle        = 6,
    /// Task fork record. Emitted on `do_fork()`.
    /// Data: `PerfRecordForkExit` (pid, ppid, tid, ptid, time).
    Fork              = 7,
    /// Read record. Emitted for group reads.
    /// Data: `PerfRecordRead` (pid, tid, read_format values).
    Read              = 8,
    /// Sample record. The primary sampled event record.
    /// Data: `PerfRecordSample` (variable fields per `sample_type` bitmask).
    Sample            = 9,
    /// Extended memory mapping record (supersedes MMAP for detailed info).
    /// Includes device/inode, protection flags, and optionally build ID.
    /// Data: `PerfRecordMmap2`.
    Mmap2             = 10,
    /// AUX buffer notification. Emitted when AUX data is available.
    /// Data: `PerfRecordAux` (aux_offset, aux_size, flags).
    Aux               = 11,
    /// Instruction tracing start marker. Emitted when instruction tracing
    /// (Intel PT, ARM CoreSight) begins for a task.
    /// Data: `PerfRecordItraceStart` (pid, tid).
    ItraceStart       = 12,
    /// Lost samples (distinct from Lost: these are individual sample events
    /// the kernel dropped, not ring buffer overflows).
    /// Data: `PerfRecordLostSamples` (lost_count).
    LostSamples       = 13,
    /// Context switch record (task-scope). Emitted on context switch when
    /// CONTEXT_SWITCH flag is set.
    /// Data: sample_id only; `misc` field bit 13 = SWITCH_OUT.
    Switch            = 14,
    /// Context switch record (CPU-wide). Includes the switching-in/out task.
    /// Data: `PerfRecordSwitchCpuWide` (next_prev_pid, next_prev_tid).
    SwitchCpuWide     = 15,
    /// Namespace entry record. Emitted when NAMESPACES flag is set.
    /// Data: `PerfRecordNamespaces` (pid, tid, nr_namespaces, link_info[]).
    Namespaces        = 16,
    /// Kernel symbol registration/unregistration (eBPF JIT, kprobes).
    /// Data: `PerfRecordKsymbol` (addr, len, ksym_type, flags, name[]).
    Ksymbol           = 17,
    /// BPF program load/unload event.
    /// Data: `PerfRecordBpfEvent` (bpf_event_type, flags, id, tag[8]).
    BpfEvent          = 18,
    /// Cgroup switch record. Emitted when CGROUP flag is set.
    /// Data: `PerfRecordCgroup` (id, path[]).
    Cgroup            = 19,
    /// Text poke record. Emitted when kernel code is modified at runtime
    /// (ftrace, kprobes, static keys).
    /// Data: `PerfRecordTextPoke` (addr, old_len, new_len, bytes[]).
    TextPoke          = 20,
    /// AUX output hardware ID. Associates AUX buffer data with the hardware
    /// event that produced it (architecture-specific disambiguation).
    /// Data: `PerfRecordAuxOutputHwId` (hw_id).
    AuxOutputHwId     = 21,
    /// Deferred callchain record. Callchain captured asynchronously when
    /// the full chain was not available at sample time.
    /// Data: `PerfRecordCallchainDeferred` (cookie, nr, ips[nr]).
    CallchainDeferred = 22,
}

/// Common perf event header prepended to every ring buffer record.
/// Matches Linux `struct perf_event_header` exactly (8 bytes).
#[repr(C)]
pub struct PerfEventHeader {
    /// Record type (`PerfRecordType` discriminant value).
    /// See `PerfRecordType` enum for all 22 types (1-22).
    pub type_: u32,
    /// Misc flags encoding the CPU execution mode at sample time.
    /// Bits 0-2 (`PERF_RECORD_MISC_CPUMODE_MASK = 0x7`):
    ///   0 = unknown, 1 = kernel, 2 = user, 3 = hypervisor,
    ///   4 = guest kernel, 5 = guest user.
    /// Bit 13 (`PERF_RECORD_MISC_SWITCH_OUT`): set on Switch/SwitchCpuWide
    ///   records when the record represents a switch-out (not switch-in).
    /// Bit 13 (`PERF_RECORD_MISC_MMAP_DATA`): set on Mmap/Mmap2 records for
    ///   non-executable (data) mappings.
    /// Bit 14 (`PERF_RECORD_MISC_SWITCH_OUT_PREEMPT`): task was preempted
    ///   in TASK_RUNNING state (only with SWITCH_OUT).
    /// Bit 14 (`PERF_RECORD_MISC_EXACT_IP`): the IP in this record is exact
    ///   (no skid). Set by hardware when precise_ip >= 2 is satisfied.
    /// Bit 15 (`PERF_RECORD_MISC_EXT_RESERVED`): reserved for future use.
    pub misc: u16,
    /// Total size of this record in bytes, including the header.
    /// The ring buffer consumer advances `data_tail` by this amount after
    /// reading the record. Records are always aligned to 8 bytes.
    pub size: u16,
}
const_assert!(size_of::<PerfEventHeader>() == 8);
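Because every record begins with this fixed 8-byte header and `size` covers the whole record, a consumer can walk the ring buffer without understanding any type-specific payload. A minimal consumer-side sketch (standalone, not part of the kernel source; little-endian layout assumed):

```rust
/// Walk a flat byte buffer of perf records: read each 8-byte header,
/// collect (type_, misc), and advance by `size`. Field offsets match the
/// PerfEventHeader layout above (type_: u32, misc: u16, size: u16).
pub fn walk_records(buf: &[u8]) -> Vec<(u32, u16)> {
    let mut out = Vec::new();
    let mut off = 0usize;
    while off + 8 <= buf.len() {
        let type_ = u32::from_le_bytes([buf[off], buf[off + 1], buf[off + 2], buf[off + 3]]);
        let misc  = u16::from_le_bytes([buf[off + 4], buf[off + 5]]);
        let size  = u16::from_le_bytes([buf[off + 6], buf[off + 7]]) as usize;
        if size < 8 || off + size > buf.len() {
            break; // malformed or truncated record
        }
        out.push((type_, misc));
        off += size; // header.size includes the header itself
    }
    out
}
```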

/// Misc flag constants for `PerfEventHeader.misc`.
pub const PERF_RECORD_MISC_CPUMODE_MASK:        u16 = 7;
pub const PERF_RECORD_MISC_KERNEL:              u16 = 1;
pub const PERF_RECORD_MISC_USER:                u16 = 2;
pub const PERF_RECORD_MISC_HYPERVISOR:          u16 = 3;
pub const PERF_RECORD_MISC_GUEST_KERNEL:        u16 = 4;
pub const PERF_RECORD_MISC_GUEST_USER:          u16 = 5;
pub const PERF_RECORD_MISC_MMAP_DATA:           u16 = 1 << 13;
pub const PERF_RECORD_MISC_SWITCH_OUT:          u16 = 1 << 13;
pub const PERF_RECORD_MISC_EXACT_IP:            u16 = 1 << 14;
pub const PERF_RECORD_MISC_SWITCH_OUT_PREEMPT:  u16 = 1 << 14;
pub const PERF_RECORD_MISC_EXT_RESERVED:        u16 = 1 << 15;

// --- Data structs for each PERF_RECORD type ---
// All structs below follow the PerfEventHeader in the ring buffer.
// Variable-length fields (filename[], comm[], path[], name[], bytes[])
// are written inline; the total record size is encoded in header.size.
// All records are padded to 8-byte alignment.
// When SAMPLE_ID_ALL is set in perf_event_attr, a sample_id trailer is
// appended after the type-specific data (fields selected by sample_type).

/// PERF_RECORD_MMAP (type 1): memory mapping for executable regions.
/// Emitted on `mmap()` for executable mappings; non-executable (data)
/// mappings are also reported, carrying `MMAP_DATA` in `misc`.
#[repr(C)]
pub struct PerfRecordMmap {
    pub header: PerfEventHeader,
    pub pid: u32,
    pub tid: u32,
    /// Start virtual address of the mapping.
    pub addr: u64,
    /// Length of the mapping in bytes.
    pub len: u64,
    /// File offset of the mapping.
    pub pgoff: u64,
    // Followed by: filename (NUL-terminated, padded to 8-byte alignment).
    // Then: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordMmap fixed header: 8 + 4 + 4 + 8 + 8 + 8 = 40 bytes.
// Userspace ABI struct — perf ring buffer record (variable-length filename follows).
const_assert!(size_of::<PerfRecordMmap>() == 40);

/// PERF_RECORD_LOST (type 2): ring buffer overflow notification.
#[repr(C)]
pub struct PerfRecordLost {
    pub header: PerfEventHeader,
    /// Event ID of the overflowing event.
    pub id: u64,
    /// Number of events lost.
    pub lost: u64,
    // Followed by: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordLost: 8 + 8 + 8 = 24 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordLost>() == 24);

/// PERF_RECORD_COMM (type 3): task name change.
#[repr(C)]
pub struct PerfRecordComm {
    pub header: PerfEventHeader,
    pub pid: u32,
    pub tid: u32,
    // Followed by: comm (NUL-terminated, padded to 8-byte alignment).
    // `misc` bit `PERF_RECORD_MISC_COMM_EXEC` (1 << 13) is set when
    // the name change is due to exec (not prctl).
    // Then: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordComm fixed header: 8 + 4 + 4 = 16 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordComm>() == 16);

/// PERF_RECORD_EXIT (type 4) and PERF_RECORD_FORK (type 7): shared layout.
#[repr(C)]
pub struct PerfRecordForkExit {
    pub header: PerfEventHeader,
    /// PID of the process.
    pub pid: u32,
    /// Parent PID (for fork: parent process; for exit: same as pid).
    pub ppid: u32,
    /// TID of the thread.
    pub tid: u32,
    /// Parent TID (for fork: creating thread; for exit: same as tid).
    pub ptid: u32,
    /// Timestamp in nanoseconds (clock source per `use_clockid`).
    pub time: u64,
    // Followed by: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordForkExit: 8 + 4*4 + 8 = 32 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordForkExit>() == 32);

/// PERF_RECORD_THROTTLE (type 5) and PERF_RECORD_UNTHROTTLE (type 6): shared layout.
#[repr(C)]
pub struct PerfRecordThrottle {
    pub header: PerfEventHeader,
    /// Timestamp when throttle/unthrottle occurred.
    pub time: u64,
    /// Event ID being throttled.
    pub id: u64,
    /// Stream ID.
    pub stream_id: u64,
    // Followed by: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordThrottle: 8 + 8*3 = 32 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordThrottle>() == 32);

/// PERF_RECORD_READ (type 8): group counter read.
#[repr(C)]
pub struct PerfRecordRead {
    pub header: PerfEventHeader,
    pub pid: u32,
    pub tid: u32,
    // Followed by: read_format values (layout depends on PERF_FORMAT_* flags).
    // Then: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordRead: 8 + 4 + 4 = 16 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordRead>() == 16);

/// PERF_RECORD_MMAP2 (type 10): extended memory mapping record.
/// Supersedes MMAP with device/inode and protection information.
#[repr(C)]
pub struct PerfRecordMmap2 {
    pub header: PerfEventHeader,
    pub pid: u32,
    pub tid: u32,
    /// Start virtual address of the mapping.
    pub addr: u64,
    /// Length of the mapping in bytes.
    pub len: u64,
    /// File offset of the mapping.
    pub pgoff: u64,
    /// Union: either (maj, min, ino, ino_generation) for file-backed mappings,
    /// or (build_id_size, reserved, build_id[20]) when
    /// `misc` bit PERF_RECORD_MISC_MMAP_BUILD_ID is set.
    /// Device major number (file-backed mapping).
    pub maj: u32,
    /// Device minor number (file-backed mapping).
    pub min: u32,
    /// Inode number (file-backed mapping).
    pub ino: u64,
    /// Inode generation (file-backed mapping).
    pub ino_generation: u64,
    /// Memory protection flags (PROT_READ, PROT_WRITE, PROT_EXEC).
    pub prot: u32,
    /// Mapping flags (MAP_SHARED, MAP_PRIVATE, etc.).
    pub flags: u32,
    // Followed by: filename (NUL-terminated, padded to 8-byte alignment).
    // Then: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordMmap2: 8 + 4 + 4 + 8*3 + 4*2 + 8*2 + 4*2 = 72 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordMmap2>() == 72);

/// PERF_RECORD_AUX (type 11): AUX buffer data notification.
#[repr(C)]
pub struct PerfRecordAux {
    pub header: PerfEventHeader,
    /// Offset into the AUX buffer where new data starts.
    pub aux_offset: u64,
    /// Size of new AUX data in bytes.
    pub aux_size: u64,
    /// Flags: PERF_AUX_FLAG_TRUNCATED (0x01), PERF_AUX_FLAG_OVERWRITE (0x02),
    /// PERF_AUX_FLAG_PARTIAL (0x04), PERF_AUX_FLAG_COLLISION (0x08).
    pub flags: u64,
    // Followed by: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordAux: 8 + 8*3 = 32 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordAux>() == 32);

/// PERF_RECORD_ITRACE_START (type 12): instruction tracing start.
#[repr(C)]
pub struct PerfRecordItraceStart {
    pub header: PerfEventHeader,
    /// PID of the traced task.
    pub pid: u32,
    /// TID of the traced task.
    pub tid: u32,
    // Followed by: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordItraceStart: 8 + 4 + 4 = 16 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordItraceStart>() == 16);

/// PERF_RECORD_LOST_SAMPLES (type 13): dropped sample notification.
/// Distinct from LOST (type 2): LOST_SAMPLES counts individual samples
/// discarded by the kernel, while LOST counts ring buffer overflow events.
#[repr(C)]
pub struct PerfRecordLostSamples {
    pub header: PerfEventHeader,
    /// Number of samples lost.
    pub lost: u64,
    // Followed by: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordLostSamples: 8 + 8 = 16 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordLostSamples>() == 16);

// PERF_RECORD_SWITCH (type 14): task-scope context switch.
// The `misc` field encodes switch direction:
// - `PERF_RECORD_MISC_SWITCH_OUT` (bit 13) set = switch out.
// - Clear = switch in.
// No type-specific fields beyond the sample_id trailer, so no struct is
// defined (plain `//` comments: a `///` doc comment here would attach to
// the following item).

/// PERF_RECORD_SWITCH_CPU_WIDE (type 15): CPU-wide context switch.
/// Includes identity of the task switching in/out.
#[repr(C)]
pub struct PerfRecordSwitchCpuWide {
    pub header: PerfEventHeader,
    /// PID of the task switching in (on switch-out) or out (on switch-in).
    pub next_prev_pid: u32,
    /// TID of the task switching in (on switch-out) or out (on switch-in).
    pub next_prev_tid: u32,
    // Followed by: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordSwitchCpuWide: 8 + 4 + 4 = 16 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordSwitchCpuWide>() == 16);

/// PERF_RECORD_NAMESPACES (type 16): namespace information.
#[repr(C)]
pub struct PerfRecordNamespaces {
    pub header: PerfEventHeader,
    pub pid: u32,
    pub tid: u32,
    /// Number of namespace entries in the `link_info` array.
    pub nr_namespaces: u64,
    // Followed by: nr_namespaces * PerfNsLinkInfo entries.
    // Then: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordNamespaces: 8 + 4 + 4 + 8 = 24 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordNamespaces>() == 24);

/// Namespace device/inode pair. Matches Linux `struct perf_ns_link_info`.
#[repr(C)]
pub struct PerfNsLinkInfo {
    /// Device number of the namespace proc entry.
    pub dev: u64,
    /// Inode number of the namespace proc entry.
    pub ino: u64,
}
const_assert!(size_of::<PerfNsLinkInfo>() == 16);

/// PERF_RECORD_KSYMBOL (type 17): kernel symbol registration/unregistration.
#[repr(C)]
pub struct PerfRecordKsymbol {
    pub header: PerfEventHeader,
    /// Symbol virtual address.
    pub addr: u64,
    /// Symbol size in bytes.
    pub len: u32,
    /// Symbol type: 0 = unknown, 1 = BPF (JIT compiled), 2 = OOL (out-of-line
    /// code: kprobe-replaced instructions, optimized kprobes, ftrace trampolines).
    pub ksym_type: u16,
    /// Flags: bit 0 = PERF_RECORD_KSYMBOL_FLAGS_UNREGISTER (symbol removed).
    pub flags: u16,
    // Followed by: name (NUL-terminated, padded to 8-byte alignment).
    // Then: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordKsymbol: 8 + 8 + 4 + 2 + 2 = 24 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordKsymbol>() == 24);

/// Symbol type constants for `PerfRecordKsymbol.ksym_type`.
pub const PERF_RECORD_KSYMBOL_TYPE_UNKNOWN: u16 = 0;
pub const PERF_RECORD_KSYMBOL_TYPE_BPF:     u16 = 1;
pub const PERF_RECORD_KSYMBOL_TYPE_OOL:     u16 = 2;

/// PERF_RECORD_BPF_EVENT (type 18): BPF program load/unload.
#[repr(C)]
pub struct PerfRecordBpfEvent {
    pub header: PerfEventHeader,
    /// Event sub-type: 0 = unknown, 1 = PROG_LOAD, 2 = PROG_UNLOAD.
    pub bpf_event_type: u16,
    /// Reserved flags.
    pub flags: u16,
    /// BPF program ID.
    pub id: u32,
    /// BPF program tag (SHA hash of instructions, 8 bytes).
    pub tag: [u8; 8],  // BPF_TAG_SIZE = 8
    // Followed by: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordBpfEvent: 8 + 2 + 2 + 4 + 8 = 24 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordBpfEvent>() == 24);

/// BPF event sub-type constants for `PerfRecordBpfEvent.bpf_event_type`.
pub const PERF_BPF_EVENT_UNKNOWN:     u16 = 0;
pub const PERF_BPF_EVENT_PROG_LOAD:   u16 = 1;
pub const PERF_BPF_EVENT_PROG_UNLOAD: u16 = 2;

/// PERF_RECORD_CGROUP (type 19): cgroup switch.
#[repr(C)]
pub struct PerfRecordCgroup {
    pub header: PerfEventHeader,
    /// Cgroup ID (from `cgroup_id()`, matches `/proc/[pid]/cgroup` inode).
    pub id: u64,
    // Followed by: path (NUL-terminated, padded to 8-byte alignment).
    // Then: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordCgroup: 8 + 8 = 16 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordCgroup>() == 16);

/// PERF_RECORD_TEXT_POKE (type 20): runtime kernel code modification.
/// Emitted when ftrace, kprobes, or static keys modify kernel text.
#[repr(C)]
pub struct PerfRecordTextPoke {
    pub header: PerfEventHeader,
    /// Virtual address of the modified code.
    pub addr: u64,
    /// Length of old code in bytes.
    pub old_len: u16,
    /// Length of new code in bytes.
    pub new_len: u16,
    // Followed by: old_len bytes of old code, then new_len bytes of new code.
    // Padded to 8-byte alignment.
    // Then: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordTextPoke fixed part: 8 (header) + 8 (addr) + 2 (old_len) + 2 (new_len)
// = 20 bytes. repr(C) pads the Rust struct to 24 for u64 alignment, but ring
// buffer serialization writes only the 20 fixed bytes (not the padded size_of),
// followed by old_len + new_len variable-length code bytes.
// header.size = (20 + old_len + new_len + 7) & !7 (rounded up to 8-byte alignment).
// Userspace ABI struct.
const_assert!(size_of::<PerfRecordTextPoke>() == 24);

/// PERF_RECORD_AUX_OUTPUT_HW_ID (type 21): AUX output hardware ID.
/// Associates AUX buffer data with the hardware event that produced it.
#[repr(C)]
pub struct PerfRecordAuxOutputHwId {
    pub header: PerfEventHeader,
    /// Architecture-specific hardware ID for disambiguating multiple
    /// AUX-producing events (e.g., Intel PT on different cores).
    pub hw_id: u64,
    // Followed by: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordAuxOutputHwId: 8 + 8 = 16 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordAuxOutputHwId>() == 16);

/// PERF_RECORD_CALLCHAIN_DEFERRED (type 22): deferred callchain.
/// Emitted when a full callchain was not available at sample time and
/// was captured asynchronously.
#[repr(C)]
pub struct PerfRecordCallchainDeferred {
    pub header: PerfEventHeader,
    /// Cookie linking this record to the original SAMPLE record.
    pub cookie: u64,
    /// Number of instruction pointers in the callchain.
    pub nr: u64,
    // Followed by: nr * u64 instruction pointers.
    // Then: sample_id trailer if SAMPLE_ID_ALL is set.
}
// PerfRecordCallchainDeferred: 8 + 8 + 8 = 24 bytes. Userspace ABI struct.
const_assert!(size_of::<PerfRecordCallchainDeferred>() == 24);

/// Special call chain context markers written into `callchain[]` to indicate
/// a mode switch in the call stack (e.g., user → kernel boundary).
/// These match Linux `PERF_CONTEXT_*` constants exactly.
pub const PERF_CONTEXT_HV:           u64 = u64::MAX - 31;
pub const PERF_CONTEXT_KERNEL:       u64 = u64::MAX - 127;
pub const PERF_CONTEXT_USER:         u64 = u64::MAX - 511;
pub const PERF_CONTEXT_GUEST:        u64 = u64::MAX - 2047;
pub const PERF_CONTEXT_GUEST_KERNEL: u64 = u64::MAX - 2175;
pub const PERF_CONTEXT_GUEST_USER:   u64 = u64::MAX - 2559;

/// Maximum call chain depth recorded in a `PerfRecordSample`.
/// Matches Linux `PERF_MAX_STACK_DEPTH`. The call chain is terminated by
/// `PERF_CONTEXT_*` markers when the unwinder crosses a privilege boundary.
pub const PERF_MAX_STACK_DEPTH: usize = 127;
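A consumer splits a recorded callchain into its kernel and user portions by scanning for these markers. A standalone sketch (not from the kernel source; guest/hypervisor portions are simply skipped here):

```rust
/// All PERF_CONTEXT_* markers occupy the top of the u64 range; anything at
/// or above this floor is treated as a marker rather than an address.
const MARKER_FLOOR: u64 = u64::MAX - 4095;

/// Split a callchain into (kernel IPs, user IPs) using the context markers.
fn split_callchain(chain: &[u64]) -> (Vec<u64>, Vec<u64>) {
    const KERNEL: u64 = u64::MAX - 127; // PERF_CONTEXT_KERNEL
    const USER: u64 = u64::MAX - 511;   // PERF_CONTEXT_USER
    let (mut kernel, mut user) = (Vec::new(), Vec::new());
    let mut target = 0u8; // 0 = no context yet, 1 = kernel, 2 = user
    for &ip in chain {
        if ip >= MARKER_FLOOR {
            target = match ip {
                KERNEL => 1,
                USER => 2,
                _ => 0, // guest/hypervisor portions ignored in this sketch
            };
        } else {
            match target {
                1 => kernel.push(ip),
                2 => user.push(ip),
                _ => {}
            }
        }
    }
    (kernel, user)
}
```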

/// A single perf sample record written to the perf mmap ring buffer.
///
/// This struct describes the logical field layout of a `PERF_RECORD_SAMPLE`
/// record. The byte stream that `encode()` produces is binary-compatible with
/// the Linux perf ABI, allowing unmodified userspace tools (`perf`, `bpftrace`,
/// BCC) to consume it.
///
/// # Variable-length layout
///
/// Not all fields are present in every record. The `perf_event_attr::sample_type`
/// bitmask controls which optional fields are included. Fields are written in the
/// order they appear in this struct, with absent fields skipped entirely (not
/// zeroed). The `header.size` field specifies the total byte length of the record
/// as written; the ring buffer consumer must use `header.size` — not `sizeof` this
/// struct — to advance past the record.
///
/// Fields after `callchain_nr` / `callchain` are only present if the corresponding
/// `PERF_SAMPLE_*` bit is set:
///
/// | Field(s)         | `PERF_SAMPLE_*` bit      | Value if absent |
/// |------------------|--------------------------|-----------------|
/// | `id`             | `PERF_SAMPLE_IDENTIFIER` | not written     |
/// | `ip`             | `PERF_SAMPLE_IP`         | not written     |
/// | `pid`, `tid`     | `PERF_SAMPLE_TID`        | not written     |
/// | `time_ns`        | `PERF_SAMPLE_TIME`       | not written     |
/// | `addr`           | `PERF_SAMPLE_ADDR`       | not written     |
/// | `cpu`, `_cpu_res`| `PERF_SAMPLE_CPU`        | not written     |
/// | `period`         | `PERF_SAMPLE_PERIOD`     | not written     |
/// | `callchain_nr` + `callchain[0..callchain_nr]` | `PERF_SAMPLE_CALLCHAIN` | not written |
/// | `raw_size` + `raw_data[0..raw_size]` | `PERF_SAMPLE_RAW` | not written |
///
/// The `encode()` method serializes only the fields selected by `sample_type`
/// into a flat `Vec<u8>`, which is then written to the ring buffer via
/// `ring_buffer_write()`.
///
/// # Linux ABI compatibility
///
/// Field ordering and types match the Linux kernel's `perf_sample_data` and the
/// on-disk `perf.data` format parsed by `perf report`. The `PERF_RECORD_SAMPLE`
/// type value is 9, matching `enum perf_event_type` in Linux.
///
/// Alignment: the record as a whole is padded to an 8-byte boundary so that the
/// next record's header starts 8-byte aligned. `header.size` includes this padding.
pub struct PerfRecordSample {
    /// Common perf event header (8 bytes). Always present.
    pub header: PerfEventHeader,

    /// Sample identifier (present if `PERF_SAMPLE_IDENTIFIER` set, 8 bytes).
    /// Matches the `perf_event`'s `id` field assigned at `perf_event_open` time.
    /// Placed first (before `ip`) when `PERF_SAMPLE_IDENTIFIER` is set, so that
    /// the consumer can identify the event without parsing the full record layout.
    pub id: u64,

    /// Instruction pointer at sample time (present if `PERF_SAMPLE_IP` set).
    pub ip: u64,

    /// PID and TID of the sampled thread (present if `PERF_SAMPLE_TID` set).
    pub pid: u32,
    pub tid: u32,

    /// Timestamp in nanoseconds (monotonic, from `timekeeping_fast_ns()`)
    /// (present if `PERF_SAMPLE_TIME` set).
    pub time_ns: u64,

    /// Memory address that triggered the sample, for memory-access events
    /// (present if `PERF_SAMPLE_ADDR` set). Zero if not applicable.
    pub addr: u64,

    /// CPU number on which the sample was taken (present if `PERF_SAMPLE_CPU` set).
    pub cpu: u32,
    /// Reserved padding to align `cpu` pair to 8 bytes. Always zero.
    pub _cpu_res: u32,

    /// Sampling period (number of events between consecutive samples)
    /// (present if `PERF_SAMPLE_PERIOD` set).
    pub period: u64,

    /// Number of call chain entries that follow (present if `PERF_SAMPLE_CALLCHAIN` set).
    /// A value of 0 means no call chain was captured (e.g., unwinding failed).
    pub callchain_nr: u64,

    /// Call chain instruction pointers (present if `PERF_SAMPLE_CALLCHAIN` set).
    /// Only `callchain[0..callchain_nr]` are written to the ring buffer.
    /// `PERF_CONTEXT_*` markers are interspersed to indicate mode transitions
    /// (e.g., `PERF_CONTEXT_USER` separates the kernel and user portions of the
    /// stack). Maximum depth is `PERF_MAX_STACK_DEPTH = 127` entries (inclusive
    /// of context markers); the unwinder stops when this limit is reached.
    pub callchain: [u64; PERF_MAX_STACK_DEPTH],

    /// Raw record length in bytes (present if `PERF_SAMPLE_RAW` set).
    /// The `raw_data` that follows is `raw_size` bytes long, then padded to
    /// align the next field to 8 bytes.
    pub raw_size: u32,
    // raw_data follows in the ring buffer: [u8; raw_size], padded to 8-byte alignment.
    // It is not a fixed array member here because its length is variable.
    // The `encode()` method appends raw_data bytes directly after raw_size.

    /// `sample_type` bitmask from `perf_event_attr` that was active when this
    /// sample was captured. Determines which of the above fields are valid.
    /// Not written to the ring buffer (internal use only for `encode()`).
    pub sample_type: u64,

    /// User-space register snapshot (present if `PERF_SAMPLE_REGS_USER` set).
    /// Contains the user-space GPRs at sample time, selected by `attr.sample_regs_user`
    /// bitmask. `None` if the flag is not set or the sample was taken in kernel context.
    pub regs_user: Option<SavedRegs>,

    /// User-space stack snapshot (present if `PERF_SAMPLE_STACK_USER` set).
    /// Up to `attr.sample_stack_user` bytes copied from the user stack pointer.
    /// `None` if the flag is not set.
    ///
    /// **Note:** `Vec<u8>` represents the logical field layout for this spec.
    /// At runtime, sample data is written directly into the perf mmap ring
    /// buffer shared with userspace — no per-sample heap allocation occurs.
    pub stack_user: Option<Vec<u8>>,

    /// Raw sample data bytes (present if `PERF_SAMPLE_RAW` set).
    /// Contains the raw tracepoint record or BPF-provided data, up to 65535 bytes.
    /// Not directly represented in the ring buffer as a fixed array — appended
    /// as `raw_size` bytes after the `raw_size` field during `encode()`.
    ///
    /// **Note:** Same as `stack_user` — `Vec<u8>` is the spec representation;
    /// runtime uses the perf mmap ring buffer with no heap allocation per sample.
    pub raw_data: Vec<u8>,
}

/// Construct a complete `PERF_RECORD_SAMPLE` from a `RawSample`.
fn build_perf_sample(event: &PerfEvent, raw: &RawSample) -> PerfRecordSample {
    let st = event.attr.sample_type;
    let mut rec = PerfRecordSample::new(st);

    if st & PERF_SAMPLE_IP        != 0 { rec.ip        = raw.ip; }
    if st & PERF_SAMPLE_TID       != 0 { rec.pid       = raw.pid; rec.tid = raw.tid; }
    if st & PERF_SAMPLE_TIME      != 0 { rec.time_ns   = raw.timestamp_ns; }
    if st & PERF_SAMPLE_CPU       != 0 { rec.cpu       = raw.cpu; }
    if st & PERF_SAMPLE_PERIOD    != 0 { rec.period    = event.attr.sample_period_or_freq; }
    // `id` is selected by either bit; IDENTIFIER additionally places it first
    // in the encoded record (see the `PerfRecordSample` layout table).
    if st & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER) != 0 { rec.id = event.id; }

    if st & PERF_SAMPLE_CALLCHAIN != 0 {
        rec.callchain = callchain_unwind(
            raw.ip, raw.sp, raw.fp,
            event.attr.sample_max_stack,
            event.attr.flags.contains(PerfEventFlags::EXCLUDE_CALLCHAIN_KERNEL),
            event.attr.flags.contains(PerfEventFlags::EXCLUDE_CALLCHAIN_USER),
        );
    }
    if st & PERF_SAMPLE_REGS_USER != 0 {
        rec.regs_user = Some(raw.regs.clone());
    }
    if st & PERF_SAMPLE_STACK_USER != 0 {
        // copy_from_user is safe in kthread context (process context, not NMI).
        rec.stack_user = copy_user_stack(raw.sp, event.attr.sample_stack_user as usize);
    }
    rec
}
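The selective serialization that `encode()` performs — write only the fields chosen by `sample_type`, then prefix the 8-byte header whose `size` covers the whole record — can be sketched standalone. This illustration appends to a `Vec<u8>` (the real method writes directly into the mmap ring buffer); the `PERF_SAMPLE_*` bit values match the Linux UAPI:

```rust
const PERF_SAMPLE_IP: u64   = 1 << 0;
const PERF_SAMPLE_TID: u64  = 1 << 1;
const PERF_SAMPLE_TIME: u64 = 1 << 2;

/// Serialize a minimal PERF_RECORD_SAMPLE containing only the fields selected
/// by `sample_type`, prefixed by the header (type 9 = Sample). Absent fields
/// are skipped entirely, not zeroed — exactly as the layout table specifies.
fn encode_sample(sample_type: u64, ip: u64, pid: u32, tid: u32, time_ns: u64) -> Vec<u8> {
    let mut body = Vec::new();
    if sample_type & PERF_SAMPLE_IP != 0 {
        body.extend_from_slice(&ip.to_le_bytes());
    }
    if sample_type & PERF_SAMPLE_TID != 0 {
        body.extend_from_slice(&pid.to_le_bytes());
        body.extend_from_slice(&tid.to_le_bytes());
    }
    if sample_type & PERF_SAMPLE_TIME != 0 {
        body.extend_from_slice(&time_ns.to_le_bytes());
    }
    let size = (8 + body.len()) as u16; // all fields above are 8-byte multiples
    let mut rec = Vec::with_capacity(size as usize);
    rec.extend_from_slice(&9u32.to_le_bytes()); // header.type_ = Sample
    rec.extend_from_slice(&0u16.to_le_bytes()); // header.misc
    rec.extend_from_slice(&size.to_le_bytes()); // header.size includes header
    rec.extend_from_slice(&body);
    rec
}
```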

The sampler kthread is woken by a check in scheduler_tick():

// umka-core/src/sched/tick.rs (relevant fragment)

pub fn scheduler_tick(cpu: u32) {
    let ctx = CpuLocal::for_cpu(cpu).perf_ctx;
    if !ctx.raw_sample_queue.is_empty() {
        ctx.sampler_thread.wake();
    }
    perf_rotate_context(ctx); // multiplexing rotation ("Event Multiplexing" below)
}

Sampler Thread Configuration:

  • Thread priority: SCHED_FIFO at priority 1 (lowest RT priority — yields to any real RT task while still preempting normal timesharing). One sampler thread per CPU (perf_sampler/{cpu_id}).

  • Wake source: Each CPU's PMU (Performance Monitoring Unit) overflow interrupt sets a per-CPU perf_sample_pending flag. The interrupt handler does NOT wake the sampler directly (interrupt context). The wakeup happens at the next scheduling point after the interrupt — via the scheduler's post_irq_work() hook calling sampler_thread_wakeup(), or via the scheduler_tick() check shown above, whichever runs first.

  • Wake throttle: The sampler thread enforces a minimum inter-wake interval of 1000/sample_rate_hz milliseconds. If the PMU fires faster than this (e.g., due to a very hot loop causing frequent PMU overflows), samples are coalesced: the sampler processes all pending sample records in a single wakeup instead of waking once per overflow. Maximum coalesced samples per wakeup: 256 (to bound sampler thread runtime).

  • CPU budget: The sampler thread has a CBS bandwidth limit of 5% CPU per core by default. Configurable via perf_sampler_cpu_budget_pct sysctl. If the budget is exceeded, the sampler yields and posts remaining samples in the next scheduling slot. Note: perf_sampler_cpu_budget_pct (CBS, default 5%) limits the sampler kthread. perf_cpu_time_max_percent (default 25%) throttles total PMU overhead including NMI time. The CBS budget is a subset of max_percent — if the CBS sampler uses 5%, max_percent allows 20% more for NMI handling and other PMU overhead.

  • Buffer overflow: If the per-CPU output ring buffer fills before userspace drains it (output ring size: 2048 samples × 128 bytes = 256 KiB per CPU), overflow samples are counted in perf_overflow_count (accessible via umkafs) and dropped. The sampler thread increases its wakeup frequency by 25% for 10 seconds after a drop to reduce future overflow probability. Note: this 2048-entry output ring (Stage 2, userspace-visible) is distinct from the 512-entry RAW_SAMPLE_QUEUE_DEPTH SPSC queue (Stage 1, NMI handler to sampler kthread).
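The wake-throttle arithmetic from the bullets above is simple enough to sketch directly (helper names are illustrative, not from the source):

```rust
/// Hard cap on samples drained per wakeup, bounding sampler thread runtime.
const MAX_COALESCED_PER_WAKEUP: usize = 256;

/// Minimum milliseconds between sampler wakeups for a given sample rate,
/// rounded up so the thread never wakes faster than the configured rate.
fn min_wake_interval_ms(sample_rate_hz: u64) -> u64 {
    (1000 + sample_rate_hz - 1) / sample_rate_hz
}

/// Number of queued samples one wakeup drains: everything pending, coalesced
/// into a single wakeup, capped at MAX_COALESCED_PER_WAKEUP.
fn samples_drained(pending: usize) -> usize {
    pending.min(MAX_COALESCED_PER_WAKEUP)
}
```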

20.8.9.1 Callchain Unwinder Specification

The callchain_unwind() function referenced in build_perf_sample above performs kernel and user stack unwinding to produce the callchain[] array embedded in PERF_RECORD_SAMPLE records. This section specifies the unwinding algorithm, validation strategy, and performance budget.

Primary algorithm — ORC (Oops Rewind Capability) unwinder. UmkaOS uses the same design as Linux's ORC unwinder. ORC generates a compact per-instruction unwind table during compilation, stored in two ELF sections:

  • .orc_unwind — array of OrcEntry structs, one per code range, specifying how to recover the previous frame's stack pointer and return address from the current SP and IP.
  • .orc_unwind_ip — sorted array of instruction pointer offsets corresponding 1:1 with .orc_unwind entries.

Each OrcEntry encodes the SP offset, the frame pointer register (if any), and the return address location relative to the computed previous SP. ORC lookup is an O(log N) binary search on the .orc_unwind_ip array, keyed by the current instruction pointer. ORC handles code compiled without frame pointers and correctly unwinds through interrupt frames, exception frames, and signal trampoline frames because the compiler generates explicit entries for prologue and epilogue transitions.

/// Compact ORC unwind entry. Stored in `.orc_unwind` section.
/// One entry per code range (typically per function, with extra entries
/// for prologue/epilogue/exceptional regions).
#[repr(C, packed)]
pub struct OrcEntry {
    /// Signed offset from the current SP to the previous frame's SP.
    /// A value of `ORC_REG_UNDEFINED` indicates the end of the chain.
    pub sp_offset: i16,
    /// Signed offset from the computed previous SP to the return address.
    pub ra_offset: i16,
    /// Register used as the base for SP recovery:
    /// `ORC_REG_SP` (current stack pointer), `ORC_REG_FP` (frame pointer),
    /// or `ORC_REG_UNDEFINED` (end of chain / not recoverable).
    pub sp_reg: u8,
    /// Register used as the base for FP recovery (if frame pointers are in use).
    pub fp_reg: u8,
    /// Flags: `ORC_TYPE_CALL` (normal call frame), `ORC_TYPE_REGS` (full
    /// interrupt/exception register save), `ORC_TYPE_REGS_PARTIAL` (partial
    /// register save, e.g., fast syscall entry).
    pub entry_type: u8,
    /// Padding for alignment.
    pub _pad: u8,
}
const_assert!(core::mem::size_of::<OrcEntry>() == 8);

/// Register encoding constants for `OrcEntry::sp_reg` and `fp_reg`.
pub const ORC_REG_UNDEFINED: u8 = 0;
pub const ORC_REG_SP: u8        = 1;
pub const ORC_REG_FP: u8        = 2;

/// Entry type constants for `OrcEntry::entry_type`.
pub const ORC_TYPE_CALL: u8         = 0;
pub const ORC_TYPE_REGS: u8         = 1;
pub const ORC_TYPE_REGS_PARTIAL: u8 = 2;
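The O(log N) lookup over `.orc_unwind_ip` can be sketched as a binary search over the sorted start-offset array. The slice-based table and the `orc_lookup_index` name are illustrative stand-ins for the real walk over the ELF section data:

```rust
/// Sketch of the ORC table lookup, assuming `ip_table` mirrors the sorted
/// `.orc_unwind_ip` array: entry i covers addresses in
/// [ip_table[i], ip_table[i+1]). Returns the index of the covering
/// `.orc_unwind` entry, or None if `ip` precedes the first entry
/// (an ORC miss, which triggers the frame-pointer fallback).
fn orc_lookup_index(ip_table: &[u64], ip: u64) -> Option<usize> {
    if ip_table.is_empty() || ip < ip_table[0] {
        return None;
    }
    // partition_point performs the O(log N) binary search: it returns
    // the number of entries whose start address is <= ip.
    let n_at_or_before = ip_table.partition_point(|&start| start <= ip);
    Some(n_at_or_before - 1)
}
```

The returned index selects the `OrcEntry` at the same position in `.orc_unwind`, reflecting the 1:1 correspondence between the two sections.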

Fallback — frame pointer unwinding. If the ORC table lookup fails for a given IP (e.g., code in a dynamically loaded eBPF JIT region, or a Tier 2 driver loaded without ORC tables), the unwinder falls back to frame pointer chain walking: follow rbp (x86-64) / x29 (AArch64) / s0 (RISC-V) / r11 (ARMv7) to the previous frame, read the return address from the conventional location adjacent to the frame pointer. Frame pointer unwinding requires that kernel code is compiled with -fno-omit-frame-pointer (enforced in the UmkaOS build system for all in-kernel code; Tier 1 driver code is similarly required to preserve frame pointers). The unwinder marks frames recovered via the fallback path with an internal FRAME_FP_FALLBACK flag (not exposed to userspace) for diagnostic purposes.

Per-Architecture Frame Layouts and Fallback Unwinding:

The performance profiler uses three unwinding strategies: 1. ORC (primary, in-kernel hot path): compact compile-time tables, O(log N) lookup per frame (Section 20.8.9.1). 2. Frame pointer walking (in-kernel fallback): O(1) per frame; requires -fno-omit-frame-pointer (the UmkaOS kernel is always compiled with frame pointers; Tier 2 drivers may not be). 3. DWARF CFI (.eh_frame / .debug_frame): accurate but slow (~1-5 μs per frame); used only for offline symbolization in userspace, never on the kernel sampling path.

Per-architecture frame pointer layout:

Architecture   Frame pointer register   Return address location               Stack layout
x86-64         RBP                      [RBP+8] (caller's RIP)                push RBP; mov RBP, RSP at function entry
AArch64        X29 (FP)                 [X29+8] (LR saved by callee)          stp x29, x30, [sp, #-16]!; mov x29, sp
ARMv7          R11                      [R11-4] (return address = saved LR)   push {fp, lr}; add fp, sp, #4
RISC-V 64      S0/FP (X8)               [S0-8] (saved RA)                     sd ra, -8(sp); addi s0, sp, frame_size
PPC64LE        R1 (SP)                  [R1+16] (LR save area, ABI-defined)   ABI-defined back-chain word at [R1+0]
PPC32          R1 (SP)                  [R1+4] (LR save area)                 Back-chain at [R1+0], LR at [R1+4]

Heuristic fallback: When both the ORC lookup and frame-pointer walking fail for a frame (no ORC entry covers the code address, and the frame pointer is zero or points outside the stack), the unwinder scans the stack for plausible return addresses, subject to these rules:

  (a) only word-aligned stack slots are examined (8-byte-aligned on 64-bit architectures, 4-byte-aligned on 32-bit);
  (b) the candidate value must fall within an executable .text segment mapped for the current kernel or a loaded driver;
  (c) the 1-5 bytes immediately preceding the candidate address must decode as a CALL / BL / jalr instruction (verified against each architecture's opcode byte pattern);
  (d) at most 256 stack words (2 KB on 64-bit) are scanned per frame level before the scanner advances to the next candidate or stops;
  (e) at most 32 heuristic frames total are recovered before the walk terminates.

Heuristically recovered frames are marked with the FRAME_HEURISTIC flag in the sample; the symbolizer skips CFI validation for them. (The ORC metadata consulted before this fallback is generated at compile time by scripts/orc_gen.)
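The filter implied by rules (a), (b), (d), and (e) can be sketched as a scan over aligned stack words. This is an illustrative sketch, not the umka-core API: the text ranges stand in for the kernel's loaded-.text registry, and the architecture-specific opcode check (rule (c)) is omitted.

```rust
/// Scan candidate stack words for plausible return addresses. The caller
/// reads the stack as a `&[u64]`, so only word-aligned slots are seen
/// (rule (a)); values are kept only if they land inside a mapped
/// executable .text range (rule (b)); at most `max_words` slots are
/// inspected (rule (d)) and at most `max_frames` candidates recovered
/// (rule (e)). Rule (c), the preceding-opcode check, is omitted here.
fn scan_return_addrs(
    stack_words: &[u64],
    text_ranges: &[(u64, u64)], // (start, end) of executable .text segments
    max_words: usize,           // rule (d): e.g. 256
    max_frames: usize,          // rule (e): e.g. 32
) -> Vec<u64> {
    stack_words
        .iter()
        .take(max_words)
        .copied()
        .filter(|w| text_ranges.iter().any(|&(s, e)| *w >= s && *w < e))
        .take(max_frames)
        .collect()
}
```

In the real unwinder, each surviving candidate would additionally pass the opcode check before being emitted with the FRAME_HEURISTIC flag.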

Sampling overhead: Frame pointer walk costs ~3-10 ns per frame (pointer dereference + bounds check). At most PERF_MAX_STACK_DEPTH (127) entries are unwound per sample; truncated stacks are marked TRUNCATED.

Stack validity checking. Before following any pointer during unwinding, the unwinder validates that the pointer lies within a known kernel stack region:

  1. The current task's kernel stack (task.stack_base .. task.stack_base + STACK_SIZE).
  2. The per-CPU IRQ stack (if the unwinder is walking through an interrupt frame).
  3. The per-CPU exception stack (e.g., x86-64 IST stacks for double fault, NMI, MCE).

If a candidate frame pointer or return address fails validation, the unwinder terminates the chain at that point and writes a PERF_CONTEXT_MAX sentinel (equivalent to [invalid frame]) into the callchain array. It does not fault or panic — an invalid frame simply truncates the trace. For user-mode portions of the callchain, addresses are validated against the task's user-mode VMA mappings; unmapped user addresses similarly terminate the user portion of the unwind.

Maximum depth. Up to 64 frames per kernel callchain and 64 per user callchain, capped at PERF_MAX_STACK_DEPTH = 127 total entries including PERF_CONTEXT_* markers (matching Linux). The callchain_unwind() function accepts a max_frames parameter (clamped to PERF_MAX_STACK_DEPTH) and returns the actual number of entries written.

Performance budget. Less than 1 μs for typical 10-frame kernel stacks. The ORC binary search is O(log N) over approximately 100K-500K entries for a typical kernel image, which completes in under 100 ns. The per-frame SP/RA recovery is a few memory loads per frame. The dominant cost for deep stacks (> 20 frames) or cold paths is L2/L3 cache misses when accessing the .orc_unwind table or stack pages that have been evicted. The sampler kthread (Stage 2) runs in process context, so these cache misses are tolerable; they do not affect NMI latency.

// umka-core/src/perf/unwind.rs

/// Unwind the kernel call stack starting from the given register state.
///
/// Returns the callchain array and the number of entries written, with
/// frames ordered from innermost (most recent call) to outermost (entry
/// point / syscall boundary). `PERF_CONTEXT_KERNEL` is prepended as the
/// first entry; `PERF_CONTEXT_USER` is inserted before any user-mode
/// frames unless `exclude_user` is set.
///
/// Stops at `max_frames`, at the first invalid frame, or at the bottom
/// of the call stack, whichever comes first.
///
/// # Algorithm selection
///
/// 1. Look up `ip` in the ORC table (`.orc_unwind_ip` binary search).
/// 2. If found, use the `OrcEntry` to compute the previous SP and RA.
/// 3. If not found, fall back to frame-pointer walking from `fp`.
/// 4. Validate every pointer against known stack regions before dereferencing.
pub fn callchain_unwind(
    ip: u64, sp: u64, fp: u64,
    max_frames: u32,
    exclude_kernel: bool, exclude_user: bool,
) -> ([u64; PERF_MAX_STACK_DEPTH], u64) {
    let mut chain = [0u64; PERF_MAX_STACK_DEPTH];
    let mut depth: u64 = 0;
    let (mut cur_ip, mut cur_sp, mut cur_fp) = (ip, sp, fp);

    // Clamp to the chain array capacity (PERF_MAX_STACK_DEPTH entries).
    while (depth as usize) < (max_frames as usize).min(PERF_MAX_STACK_DEPTH) {
        // Step 1: skip kernel or user frames per caller request.
        let is_kernel = cur_ip >= KERNEL_TEXT_START;
        if (is_kernel && exclude_kernel) || (!is_kernel && exclude_user) { break; }

        chain[depth as usize] = cur_ip;
        depth += 1;

        // Step 2: try ORC unwinding first (preferred — works without frame pointers).
        if let Some(entry) = orc_table_lookup(cur_ip) {
            let prev_sp = match entry.sp_reg {
                ORC_REG_SP => cur_sp.wrapping_add(entry.sp_offset as u64),
                ORC_REG_FP => cur_fp.wrapping_add(entry.sp_offset as u64),
                _ => break, // ORC_REG_UNDEFINED: end of ORC data
            };
            // The return address lives at prev_sp + ra_offset (see OrcEntry).
            let ra_addr = prev_sp.wrapping_add(entry.ra_offset as u64);
            // Validate the recovered SP and the RA slot before dereferencing.
            if !is_valid_stack_addr(prev_sp) || !is_valid_stack_addr(ra_addr) { break; }
            cur_ip = unsafe { *(ra_addr as *const u64) };
            // Clear the FP when ORC does not track it for this code range so
            // a later ORC miss fails safely instead of chasing a stale value.
            if entry.fp_reg == ORC_REG_UNDEFINED { cur_fp = 0; }
            cur_sp = prev_sp;
        } else {
            // Step 3: ORC miss — fall back to frame-pointer walking.
            if cur_fp == 0 || !is_valid_stack_addr(cur_fp) { break; }
            // Frame pointer chain: [fp] = prev_fp, [fp+8] = return addr.
            let prev_fp = unsafe { *(cur_fp as *const u64) };
            cur_ip = unsafe { *((cur_fp + 8) as *const u64) };
            cur_sp = cur_fp + 16;
            cur_fp = prev_fp;
        }
    }
    (chain, depth)
}

20.8.10 Ring Buffer Write Protocol

The ring buffer uses a single-producer (kernel) / single-consumer (userspace) lock-free protocol. The kernel is the sole writer; userspace is the sole reader.

Kernel write path:

// umka-core/src/perf/ring_buffer.rs

/// Write a complete perf record to the ring buffer.
///
/// # Ordering
///
/// Sample data is written to the circular buffer before `data_head` is updated.
/// `data_head` is stored with Release ordering. Userspace reads `data_head`
/// with Acquire ordering, ensuring it sees the complete sample data.
///
/// On ARMv7 and RISC-V (no hardware TSO), a `SeqCst` fence is emitted before
/// the `data_head` store to prevent the store from being reordered ahead of the
/// data writes by the CPU's out-of-order write buffer.
pub fn ring_buffer_write(rb: &PerfRingBuffer, record: &PerfRecordSample) {
    let bytes = record.encode();
    let len   = bytes.len() as u64;

    let head = rb.data_head.load(Ordering::Acquire);
    let tail = {
        // SAFETY: header pointer is valid and kernel-writable for the ring
        // buffer lifetime. Tail is only written by userspace (Release) and
        // read here (Acquire) — no race with the kernel write path.
        unsafe { (*rb.header.as_ptr()).data_tail.load(Ordering::Acquire) }
    };

    // Full condition: ring has no room for this record.
    if head.wrapping_sub(tail) + len > rb.data_size as u64 {
        rb.lost_samples.fetch_add(1, Ordering::Relaxed);
        emit_lost_record(rb, 1);
        return;
    }

    // Write bytes into the circular data buffer (wrapping at data_size).
    let data = unsafe {
        core::slice::from_raw_parts_mut(rb.data.as_ptr() as *mut u8, rb.data_size)
    };
    let offset = (head as usize) & (rb.data_size - 1);
    ring_copy(data, offset, &bytes);

    // Fence before head update on weakly-ordered architectures.
    #[cfg(any(target_arch = "arm", target_arch = "riscv64"))]
    core::sync::atomic::fence(Ordering::SeqCst);

    let new_head = head + len;
    rb.data_head.store(new_head, Ordering::Release);
    // Mirror into the mmap header page for userspace.
    unsafe {
        (*rb.header.as_ptr()).data_head.store(new_head, Ordering::Release);
    }
}

/// Copy `src` into `buf` starting at `offset`, wrapping around at `buf.len()`.
fn ring_copy(buf: &mut [u8], offset: usize, src: &[u8]) {
    let cap   = buf.len();
    let first = (cap - offset).min(src.len());
    buf[offset..offset + first].copy_from_slice(&src[..first]);
    if first < src.len() {
        buf[..src.len() - first].copy_from_slice(&src[first..]);
    }
}

/// Check if userspace readers should be woken: returns true when the ring
/// buffer fill level crosses the wakeup_events or wakeup_watermark threshold
/// configured in perf_event_attr.
fn ring_buffer_should_wakeup(rb: &PerfRingBuffer, event: &PerfEvent) -> bool {
    let head = rb.data_head.load(Ordering::Acquire);
    let tail = unsafe { (*rb.header.as_ptr()).data_tail.load(Ordering::Acquire) };
    let used = head.wrapping_sub(tail);
    if event.attr.flags.contains(PerfEventFlags::WATERMARK) {
        // WATERMARK flag set: wakeup_events_or_watermark is bytes threshold.
        used >= event.attr.wakeup_events_or_watermark as u64
    } else {
        // Count-based wakeup: approximate by comparing bytes used against
        // wakeup_events * average record size (header = 8 bytes minimum).
        let threshold = event.attr.wakeup_events_or_watermark as u64 * size_of::<PerfEventHeader>() as u64;
        used >= threshold
    }
}

/// Copy the user-mode stack starting at `sp` for up to `max_bytes`.
/// Reads via copy_from_user (safe in kthread/process context, not NMI).
/// Returns an empty Vec (which does not allocate) if the stack page is
/// unmapped or inaccessible; the success path allocates once via to_vec.
///
/// Uses a per-CPU scratch buffer (`PerCpuStackCopyBuf`) to avoid a
/// global mutable static. The per-CPU buffer is accessed with preemption
/// disabled (the sampler kthread runs pinned to one CPU), so no locking
/// is needed. The data is copied into a Vec before returning so the
/// per-CPU buffer can be reused immediately.
fn copy_user_stack(sp: usize, max_bytes: usize) -> Vec<u8> {
    let user_ptr = UserPtr::new(sp);
    let percpu_buf = CpuLocal::current().stack_copy_buf.as_mut();
    let buf = &mut percpu_buf[..max_bytes.min(PERF_MAX_STACK_COPY)];
    match user_ptr.copy_from_user(buf) {
        Ok(n) => buf[..n].to_vec(),
        Err(_fault) => Vec::new(),
    }
}

/// Maximum user stack bytes to copy per sample (8 KiB); larger
/// `sample_stack_user` requests are clamped to this. Allocated per-CPU
/// in `CpuLocalBlock.stack_copy_buf`.
const PERF_MAX_STACK_COPY: usize = 8192;

/// Write a PERF_RECORD_LOST record into the ring buffer to notify userspace
/// that `count` samples were dropped. The lost record itself is best-effort —
/// if the ring is completely full, it is silently skipped.
fn emit_lost_record(rb: &PerfRingBuffer, count: u64) {
    let lost = PerfRecordLost {
        header: PerfEventHeader {
            type_: PerfRecordType::Lost as u32,
            misc: 0,
            size: size_of::<PerfRecordLost>() as u16,
        },
        id: 0,
        lost: count,
    };
    // Best-effort: skip silently if ring is completely full.
    let _ = rb.try_write(as_bytes(&lost));
}

Userspace read protocol (no system call required):

/* Pseudocode — purely userspace, no kernel involvement */
void perf_drain_ring(struct perf_event_mmap_page *hdr, void *data) {
    uint64_t size = hdr->data_size;
    uint64_t head = __atomic_load_n(&hdr->data_head, __ATOMIC_ACQUIRE);
    uint64_t tail = hdr->data_tail;

    while (tail != head) {
        struct perf_event_header *rec =
            (void *)((char *)data + (tail & (size - 1)));
        handle_record(rec);
        tail += rec->size;
    }
    __atomic_store_n(&hdr->data_tail, tail, __ATOMIC_RELEASE);
}

mmap size constraints. The mmap() call on a perf fd must request exactly (1 + 2^n) pages: 1 header page followed by 2^n data pages. Requesting any other size returns EINVAL. Typical configurations used by tools:

mmap pages   Data region   Typical use
1 + 2        8 KB          Breakpoint events
1 + 8        32 KB         Light profiling
1 + 128      512 KB        Default perf record (fits the 516 KB mlock quota)
1 + 512      2 MB          High-frequency sampling
1 + 2048     8 MB          Intel PT / ARM SPE AUX buffer
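The "(1 + 2^n) pages" rule can be sketched as a validation helper. `PAGE_SIZE` and the error mapping are assumptions for illustration; the real mmap() path returns EINVAL where this sketch returns Err:

```rust
const PAGE_SIZE: usize = 4096;

/// Validate a requested perf mmap length against the "1 header page +
/// 2^n data pages" rule. Returns the data-region size in bytes, or
/// Err(()) where the real syscall path would return EINVAL.
fn validate_perf_mmap_len(len_bytes: usize) -> Result<usize, ()> {
    if len_bytes < 2 * PAGE_SIZE || len_bytes % PAGE_SIZE != 0 {
        return Err(()); // too small for header + data, or not page-granular
    }
    let data_pages = len_bytes / PAGE_SIZE - 1; // subtract the header page
    if !data_pages.is_power_of_two() {
        return Err(()); // must be exactly 1 + 2^n pages
    }
    Ok(data_pages * PAGE_SIZE)
}
```

The power-of-two data region is what makes the `head & (data_size - 1)` masking in ring_buffer_write() valid.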

PERF_RECORD_LOST emission. When the ring is full and a sample is dropped, UmkaOS emits a PERF_RECORD_LOST record (type 2) to the ring buffer if there is space for the small fixed-size lost record. This matches Linux behavior and allows perf report to report the number of dropped samples.

20.8.11 Event Multiplexing

When more events are requested than the PMU has hardware counter slots, UmkaOS rotates events in and out of the PMU on each scheduler tick. Accumulated counts are scaled by the time_enabled / time_running ratio to give approximate full-interval counts.

Rotation algorithm. UmkaOS uses deterministic round-robin rotation. Linux uses a semi-random scheduling order that can starve low-priority events; UmkaOS's deterministic order ensures every event gets equal time.

Helper functions used by rotation and event add/remove:

// umka-core/src/perf/context.rs

/// Returns the `PmuOps` trait object for the PMU that owns the events in a
/// context, read from the mutable guard's event lists. All events in a
/// given context share the same PMU (enforced at event creation).
///
/// This MUST be called with `events_lock` already held: the `mutable`
/// parameter IS the lock guard. The old `ctx_pmu()` function acquired
/// `events_lock` itself, which caused a guaranteed deadlock when called
/// from `perf_rotate_context()` (which already holds the lock).
#[inline]
fn ctx_pmu_from_guard(mutable: &PerfEventMutable) -> &'static dyn PmuOps {
    // All events in a context share the same PMU.
    // Precondition: the context has at least one event (active or pending).
    // Callers (perf_rotate_context, perf_event_add) ensure this by only
    // operating on contexts that have registered events.
    if let Some(event) = mutable.active_events.first() {
        event.pmu
    } else if let Some(event) = mutable.pending_events.first() {
        event.pmu
    } else {
        // SAFETY: unreachable — callers guarantee at least one event exists.
        // If this fires, the caller violated the precondition by calling
        // ctx_pmu_from_guard on an empty context.
        unreachable!("ctx_pmu_from_guard called on context with no events")
    }
}

/// Add `event` to `ctx` under `events_lock`, programming it into a hardware
/// counter slot if one is available.
///
/// Called from event creation and from pending-event promotion after removal.
/// `mutable` must be the locked `events_lock` guard for `ctx`.
///
/// Returns `Ok(())` if the event was scheduled onto hardware, or `Err(())` if
/// no hardware slot is available (the event remains in `pending_events`).
fn perf_event_add_locked(
    ctx: &PerfEventContext,
    mutable: &mut PerfEventMutable,
    event: Arc<PerfEvent>,
) -> Result<(), ()> {
    let total_slots = ctx.hpc_slots as usize + ctx.fixed_slots as usize;
    let count = ctx.active_count.load(Ordering::Acquire) as usize;

    if count < total_slots {
        // Hardware slot available: add to active set.
        event.pmu.event_add(&event, PERF_EF_START | PERF_EF_RELOAD);
        ctx.active[count].store(Arc::as_ptr(&event) as *mut PerfEvent, Ordering::Release);
        mutable.active_events.push(event);
        ctx.active_count.store((count + 1) as u32, Ordering::Release);
        Ok(())
    } else {
        // No slot: queue for multiplexing rotation.
        mutable.pending_events.push(event);
        Err(())
    }
}

The rotation runs in perf_rotate_context(), called from scheduler_tick():

// umka-core/src/perf/context.rs

/// Rotate events in a CPU context: swap out active set, swap in pending set.
///
/// Called from the scheduler tick with interrupts disabled on `ctx.cpu`.
/// Holds `events_lock` (SpinLock) for the duration — this is acceptable
/// because rotation is an infrequent slow path (once per tick, not once per
/// context switch). The context switch hot path (`perf_schedule_in`/
/// `perf_schedule_out`) is lock-free and will not contend here.
pub fn perf_rotate_context(ctx: &PerfEventContext) {
    let total_slots = ctx.hpc_slots as usize + ctx.fixed_slots as usize;
    let now_ns = timekeeping_fast_ns();

    let mut mutable = ctx.events_lock.lock();

    // Extract the PMU from the guard — NOT via ctx_pmu() which would
    // deadlock by re-acquiring events_lock.
    let pmu = ctx_pmu_from_guard(&mutable);

    // 1. Save counts, update time_running, remove active events from hardware.
    for event in mutable.active_events.iter() {
        pmu.event_del(event, PERF_EF_UPDATE);
        event.time_running_ns.fetch_add(
            now_ns.wrapping_sub(event.last_schedule_in_ns.load(Ordering::Relaxed)),
            Ordering::Relaxed,
        );
    }

    // 2. Move all active events to the back of the pending queue.
    //    Zero the active_count first (Release) so the NMI handler's lockless
    //    read sees an empty active set while we rebuild it.
    ctx.active_count.store(0, Ordering::Release);
    // Reborrow the guard so the two Vec fields can be borrowed disjointly
    // while draining (a second autoderef of the guard inside the loop
    // would be a conflicting mutable borrow).
    let m = &mut *mutable;
    for event in m.active_events.drain(..) {
        m.pending_events.push(event);
    }

    // 3. Pull the next batch from the front of the pending queue and
    //    rebuild the raw pointer array. remove(0) preserves FIFO order,
    //    keeping rotation deterministically round-robin (swap_remove(0)
    //    would scramble the queue order).
    let count = total_slots.min(mutable.pending_events.len());
    for i in 0..count {
        let event = mutable.pending_events.remove(0);
        event.last_schedule_in_ns.store(now_ns, Ordering::Relaxed);
        // Store pointer before incrementing count.
        ctx.active[i].store(Arc::as_ptr(&event) as *mut PerfEvent, Ordering::Release);
        mutable.active_events.push(event);
    }
    // Release store: makes all pointer writes visible before the new count.
    ctx.active_count.store(count as u32, Ordering::Release);

    // 4. Program the new active set into hardware.
    for event in mutable.active_events.iter() {
        pmu.event_add(event, PERF_EF_START | PERF_EF_RELOAD);
    }

    // 5. Update time_enabled for all events (active + pending).
    // last_rotate_ns is an AtomicU64 (ctx is behind a shared reference).
    let elapsed = now_ns.wrapping_sub(ctx.last_rotate_ns.load(Ordering::Relaxed));
    ctx.last_rotate_ns.store(now_ns, Ordering::Relaxed);
    for event in mutable.active_events.iter() {
        event.time_enabled_ns.fetch_add(elapsed, Ordering::Relaxed);
    }
    for event in mutable.pending_events.iter() {
        event.time_enabled_ns.fetch_add(elapsed, Ordering::Relaxed);
    }
}

Scaling formula. Userspace computes the full-interval estimate from the read() return value when PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING are set:

scaled_count = raw_count × (time_enabled_ns / time_running_ns)

The kernel returns these values in the read() response:

/// Layout returned by read() on a perf fd.
/// Which fields are present depends on attr.read_format.
#[repr(C)]
pub struct PerfReadFormat {
    /// Raw accumulated event count (always present).
    pub value:        u64,
    /// Total nanoseconds the event has been enabled.
    /// Present if PERF_FORMAT_TOTAL_TIME_ENABLED is set.
    pub time_enabled: u64,
    /// Total nanoseconds the event has been running on PMU hardware.
    /// Present if PERF_FORMAT_TOTAL_TIME_RUNNING is set.
    pub time_running: u64,
    /// Event unique ID.
    /// Present if PERF_FORMAT_ID is set.
    pub id:           u64,
    /// Samples lost since last read.
    /// Present if PERF_FORMAT_LOST is set (kernel 6.0+ extension).
    pub lost:         u64,
}
const_assert!(core::mem::size_of::<PerfReadFormat>() == 40);
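As a usage sketch, a userspace reader would apply the scaling formula above to the returned values like this (plain u64 parameters for clarity; the None case covers an event that was enabled but never scheduled onto hardware):

```rust
/// Compute the full-interval estimate for a multiplexed counter:
/// scaled = raw * time_enabled / time_running, using a 128-bit
/// intermediate so long-running counts cannot overflow the multiply.
fn scale_count(value: u64, time_enabled_ns: u64, time_running_ns: u64) -> Option<u64> {
    if time_running_ns == 0 {
        return None; // enabled but never ran on PMU hardware: no estimate
    }
    Some((value as u128 * time_enabled_ns as u128 / time_running_ns as u128) as u64)
}
```

An event that ran for half its enabled time thus reports roughly double its raw count, which is the approximation inherent to multiplexing.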

Pinned events. Events with PerfEventFlags::PINNED set are never rotated out. If a pinned event cannot be scheduled because no hardware slot is available, it transitions to PerfEventState::Error. Subsequent read() calls on that fd return EINVAL. Userspace may recover by closing the fd and reopening with fewer concurrent events.

Group atomicity. Events sharing the same group (same group_fd) are scheduled atomically. Either all siblings fit simultaneously into available hardware slots, or none are scheduled. This guarantees that ratios computed between group members (instructions per cycle, cache miss rate) are always measured over the same time interval with no inter-event skew.

20.8.12 ioctl Operations

All ioctl codes are identical to Linux (same numeric values, same semantics). The third argument to ioctl(fd, request, arg) is interpreted per the table below; for requests that support PERF_IOC_FLAG_GROUP, passing PERF_IOC_FLAG_GROUP in arg applies the operation to all events in the group simultaneously.

ioctl request                      Code         arg type                 Description
PERF_EVENT_IOC_ENABLE              0x2400       (none)                   Enable event (begin counting)
PERF_EVENT_IOC_DISABLE             0x2401       (none)                   Disable event (stop counting; count is preserved)
PERF_EVENT_IOC_REFRESH             0x2402       int                      Enable for arg overflow samples, then auto-disable
PERF_EVENT_IOC_RESET               0x2403       (none)                   Reset accumulated count to zero (does not affect enabled state)
PERF_EVENT_IOC_PERIOD              0x40082404   u64 *                    Update sample period; takes effect after the next overflow
PERF_EVENT_IOC_SET_OUTPUT          0x2405       int (fd)                 Redirect samples to another fd's ring buffer
PERF_EVENT_IOC_SET_FILTER          0x40082406   char *                   Set ftrace filter string (TRACEPOINT events only)
PERF_EVENT_IOC_ID                  0x80082407   u64 *                    Write the event's unique ID into *arg
PERF_EVENT_IOC_SET_BPF             0x40042408   int (bpf fd)             Attach BPF_PROG_TYPE_PERF_EVENT program to this event
PERF_EVENT_IOC_PAUSE_OUTPUT        0x40042409   int (bool)               Pause or resume ring buffer output without disabling counting
PERF_EVENT_IOC_QUERY_BPF           0xc008240a   perf_event_query_bpf *   Retrieve IDs of all attached BPF programs
PERF_EVENT_IOC_MODIFY_ATTRIBUTES   0x4008240b   perf_event_attr *        Modify writable attributes of an existing event in-place

PERF_EVENT_IOC_SET_BPF. Attaches an eBPF program of type BPF_PROG_TYPE_PERF_EVENT to the event. The program runs in the sampler kthread context (Stage 2, Section 20.8) on every overflow sample, before the sample is written to the ring buffer. The program may filter samples by returning 0 (drop), or may call bpf_perf_event_output() to write a custom record to a BPF_MAP_TYPE_PERF_EVENT_ARRAY. Requires CAP_BPF + CAP_PERFMON. See Section 19.5 for eBPF verifier and JIT details.

PERF_EVENT_IOC_MODIFY_ATTRIBUTES. Permitted modifications are limited to fields that can be changed safely on a live event: sample_period_or_freq, wakeup_events_or_watermark, clockid, and sig_data. Attempting to change event_type, config, or structural flags returns EINVAL.

20.8.13 /proc/sys Tunables and Linux Compatibility

UmkaOS exports the same /proc/sys/kernel/perf_* tunable knobs as Linux, with identical names, semantics, and default values.

/proc/sys/kernel/perf_event_paranoid

Controls which callers may open perf events. Checked at perf_event_open() time against the caller's credentials and capabilities.

Value         Restriction
-1            No restriction. All events available to all users.
0             Raw tracepoints and kernel events allowed to all users.
1             CPU-wide hardware events allowed for non-root processes; per-process events always allowed.
2 (default)   Only aggregate kernel-wide stats allowed for non-root; no per-process hardware events.
3             perf_event_open returns EACCES for all non-root callers.

CAP_SYS_ADMIN or CAP_PERFMON bypasses all paranoid restrictions. CAP_PERFMON (introduced in Linux 5.8) is the preferred least-privilege capability for performance monitoring; UmkaOS recognizes it with the same semantics.
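A minimal sketch of the gate implied by the table, with the capability bypass applied first. Parameter names and the boolean encoding are illustrative; the real check also distinguishes raw tracepoints from kernel events at level 0, which this sketch collapses:

```rust
/// Decide whether perf_event_open() may proceed under a given
/// perf_event_paranoid level. `privileged` covers root, CAP_SYS_ADMIN,
/// and CAP_PERFMON, all of which bypass the paranoid check entirely.
fn paranoid_allows(
    level: i32,
    privileged: bool,
    wants_cpu_wide: bool,       // attr targets a whole CPU, not one task
    wants_per_process_hw: bool, // per-process hardware counter event
) -> bool {
    if privileged {
        return true; // CAP_SYS_ADMIN / CAP_PERFMON bypass
    }
    match level {
        i32::MIN..=1 => true, // -1, 0, 1: requested event classes allowed
        2 => !wants_cpu_wide && !wants_per_process_hw, // aggregate stats only
        _ => false, // 3 and above: EACCES for all non-root callers
    }
}
```

A denial here maps to EACCES from perf_event_open(); the LSM hook below runs only after this gate passes.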

LSM integration: Before event creation, the kernel calls security_perf_event_open(attr, type) — an LSM hook that allows security modules (SELinux, AppArmor, BPF-LSM) to deny event creation based on the caller's security context and event attributes. This runs AFTER the perf_event_paranoid check succeeds and BEFORE PerfEvent::new(). The hook receives the perf_event_attr struct and the event type. Additional hooks: security_perf_event_alloc (at event allocation), security_perf_event_read and security_perf_event_write (at data access). These match Linux's LSM perf hooks.

Cgroup perf_event limit: After the LSM hook succeeds and before PerfEvent::new(), the kernel checks the calling task's cgroup perf_event controller (Section 17.2). If the controller is enabled (cgroup.perf_event.is_some()), the kernel atomically increments PerfEventController::nr_events and verifies it does not exceed max_events. If the limit is exceeded, the increment is rolled back and perf_event_open returns -ENOSPC. The check walks from the task's cgroup to the root, verifying each ancestor's limit — a child cgroup cannot bypass a parent's restriction. On close() of the perf event fd, nr_events is decremented at each level. This prevents untrusted containers from exhausting PMU resources and denying monitoring to other workloads or the host.
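The hierarchical charge with rollback can be sketched over a simplified parent chain. `CgroupPerfCtl` and `try_charge_event` are illustrative; the real controller uses atomic counters and the cgroup tree from Section 17.2:

```rust
/// One level of the perf_event controller: task's cgroup first, root last.
struct CgroupPerfCtl {
    nr_events: u64,
    max_events: u64,
}

/// Charge one event at every level from the task's cgroup to the root.
/// If any ancestor is at its limit, roll back the levels already charged
/// and fail (the syscall path maps this to -ENOSPC).
fn try_charge_event(chain: &mut [CgroupPerfCtl]) -> Result<(), ()> {
    for i in 0..chain.len() {
        if chain[i].nr_events >= chain[i].max_events {
            for level in &mut chain[..i] {
                level.nr_events -= 1; // undo charges taken so far
            }
            return Err(());
        }
        chain[i].nr_events += 1;
    }
    Ok(())
}
```

The rollback keeps every level's count consistent on failure, so a denied open leaves no residual charge; close() performs the mirror-image decrement at each level.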

/proc/sys/kernel/perf_event_max_sample_rate

Maximum sampling frequency in Hz. Default: 100000. Sampling frequency requests above this value (when PerfEventFlags::FREQ is set) are silently clamped. The kernel dynamically reduces this limit if cumulative PMI processing time exceeds perf_cpu_time_max_percent of available CPU time, and recovers it as load decreases. Reduction and recovery are reported to userspace via PERF_RECORD_THROTTLE and PERF_RECORD_UNTHROTTLE ring buffer records.

/proc/sys/kernel/perf_event_mlock_kb

Maximum kilobytes of ring buffer memory that may be mlock()-ed (pinned against swap) per user account. Default: 516 KB, sufficient for one default-sized perf record ring buffer (1 header page + 128 data pages). Each mmap() of a ring buffer deducts from the caller's per-user quota.

/proc/sys/kernel/perf_cpu_time_max_percent

Maximum percentage of CPU time that PMI processing may consume before the kernel throttles the sample rate. Default: 25. When exceeded, UmkaOS reduces the active sample period and emits PERF_RECORD_THROTTLE. Recovery (PERF_RECORD_UNTHROTTLE) occurs when the PMI load drops below half the threshold for at least one second.
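The throttle/recover hysteresis can be sketched as a small state machine. Struct and field names are illustrative, and the returned strings only name which ring buffer record the real kernel would emit at each transition:

```rust
/// Throttle when PMI load exceeds `max_pct`; recover once load stays
/// below half the threshold for a full second.
struct ThrottleState {
    throttled: bool,
    below_since_ns: Option<u64>, // when load first dropped below max_pct / 2
}

fn throttle_update(
    st: &mut ThrottleState,
    pmi_load_pct: u32,
    max_pct: u32, // perf_cpu_time_max_percent
    now_ns: u64,
) -> Option<&'static str> {
    if !st.throttled {
        if pmi_load_pct > max_pct {
            st.throttled = true;
            st.below_since_ns = None;
            return Some("PERF_RECORD_THROTTLE");
        }
    } else if pmi_load_pct < max_pct / 2 {
        // Track how long the load has stayed below half the threshold.
        let since = *st.below_since_ns.get_or_insert(now_ns);
        if now_ns.saturating_sub(since) >= 1_000_000_000 {
            st.throttled = false;
            st.below_since_ns = None;
            return Some("PERF_RECORD_UNTHROTTLE");
        }
    } else {
        st.below_since_ns = None; // load rose again: restart the 1 s window
    }
    None
}
```

The half-threshold recovery point prevents oscillation between throttled and unthrottled states when PMI load hovers near the limit.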

umkafs unified namespace. The same knobs are accessible via umkafs (Section 20.5):

/ukfs/kernel/perf/
  paranoid                  # rw; maps to perf_event_paranoid
  max_sample_rate           # rw; maps to perf_event_max_sample_rate
  mlock_kb                  # rw; maps to perf_event_mlock_kb
  cpu_time_max_percent      # rw; maps to perf_cpu_time_max_percent
  pmu/
    <name>/                 # one directory per registered PmuOps driver
      type                  # ro; dynamic PMU type number
      cpumask               # ro; CPUs where this PMU is available
      nr_addr_filters       # ro; number of address filters supported
      format/               # ro; event format descriptors (per-field files)
      events/               # ro; vendor-named canonical event aliases

Syscall number table. All values match the Linux ABI for the corresponding architecture:

Architecture   Syscall number
x86-64         298
AArch64        241
ARMv7          364
RISC-V 64      241
PPC32          319
PPC64LE        319
s390x          331
LoongArch64    241

Tool compatibility. The following widely-used tools work without modification on UmkaOS:

Tool                                           Mechanism                                    Notes
perf stat                                      perf_event_open + read() + userspace rdpmc   Full group event support including PERF_FORMAT_*
perf record                                    perf_event_open + mmap() + ring buffer       All PERF_RECORD_* types emitted
perf report                                    Reads perf.data                              No kernel interaction; purely userspace
perf script                                    Reads perf.data                              No kernel interaction
bpftrace                                       perf_event_open + PERF_EVENT_IOC_SET_BPF     Full compat; BPF_PROG_TYPE_PERF_EVENT
BCC (profile, offcputime, funclatency, etc.)   Same as bpftrace                             Full compat
pmu-tools (toplev, ocperf)                     PERF_TYPE_RAW event codes                    Arch-specific; works on Intel/AMD/ARM
JVM via perf-map-agent                         Reads /tmp/perf-PID.map for JIT symbols      Pure userspace file protocol; zero kernel changes
perf c2c (false sharing detection)             PERF_SAMPLE_DATA_SRC + PERF_SAMPLE_ADDR      Requires hardware data-source support (Intel/AMD)
Intel VTune Profiler (CLI)                     perf_event_open + raw Intel events           Intel x86-64 only

20.8.14 Namespace-Aware Perf Output

Perf events report PIDs and timestamps. In a containerized environment, the kernel's internal task IDs and monotonic clock differ from what the container expects. UmkaOS applies namespace translation at the point where perf records are written to the ring buffer, ensuring that tools like perf record inside a container see container-local PIDs and namespace-adjusted timestamps.

20.8.14.1 PID Translation in Perf Records

Every PERF_RECORD_SAMPLE (and other record types with PERF_SAMPLE_TID) includes pid and tid fields. These must reflect the PID/TID as seen by the reader of the perf event, not the kernel-internal task ID.

Translation rule: When writing a perf record to the ring buffer, the kernel translates the target task's PID through the reader's PID namespace:

let reader_ns = reader_task.nsproxy.load();
perf_record.pid = task_tgid_nr_ns(target_task, &reader_ns.pid_ns)
perf_record.tid = task_pid_nr_ns(target_task, &reader_ns.pid_ns)

Where task_pid_nr_ns(task, ns) looks up the task's PID in the specified PID namespace via the Idr<TaskId> PID map (Section 17.1). If the target task is not visible in the reader's PID namespace (e.g., a host task monitored from inside a container), the PID fields are set to -1 (unknown).

This matches Linux behavior: perf record inside a PID namespace sees container-local PIDs, while perf record on the host sees global PIDs.
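The lookup rule can be modeled in isolation. This is an illustrative sketch with hypothetical types (a flat map standing in for the kernel's `Idr<TaskId>` PID map), not the kernel's actual data structure:

```rust
use std::collections::HashMap;

/// Map from (namespace id, task id) -> PID as seen in that namespace.
/// Stand-in for the per-namespace PID map described in Section 17.1.
pub struct PidMap {
    pub by_ns: HashMap<(u32, u64), i32>,
}

impl PidMap {
    /// Models task_pid_nr_ns: the PID of `task` in namespace `ns`,
    /// or -1 if the task is not visible there (e.g., a host task
    /// viewed from inside a container).
    pub fn pid_in_ns(&self, ns: u32, task: u64) -> i32 {
        *self.by_ns.get(&(ns, task)).unwrap_or(&-1)
    }
}
```

A task visible in two namespaces yields a different PID depending on which namespace the perf reader belongs to; a task absent from the reader's namespace yields -1.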

20.8.14.2 Timestamp Namespace Adjustment

When PERF_SAMPLE_TIME is requested and the perf event's clock source is CLOCK_MONOTONIC or CLOCK_BOOTTIME, timestamps are adjusted by the reader's time namespace offset:

if clock_id == CLOCK_MONOTONIC:
    perf_record.timestamp = ktime_get_ns() + reader_timens.monotonic_offset
if clock_id == CLOCK_BOOTTIME:
    perf_record.timestamp = ktime_get_boottime_ns() + reader_timens.boottime_offset

Where reader_timens is the TimeNamespace of the task that opened the perf event (Section 17.1). If the reader is in the init time namespace, the offsets are zero and no adjustment occurs.

CLOCK_MONOTONIC_RAW (the default perf clock, clockid = -1) is NOT adjusted by time namespaces — it always reports hardware-monotonic time. This matches Linux: raw monotonic is immune to timens offsets.
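The full adjustment rule, including the raw-monotonic exemption, fits in one function. A minimal sketch with hypothetical type names:

```rust
/// Which clock the perf event was opened with (subset relevant here).
pub enum PerfClock {
    MonotonicRaw, // default perf clock (clockid = -1): never adjusted
    Monotonic,
    Boottime,
}

/// The reader's time-namespace offsets (both zero in the init time namespace).
pub struct TimeNsOffsets {
    pub monotonic_offset: i64,
    pub boottime_offset: i64,
}

/// Apply the reader's timens offset to a raw kernel timestamp.
pub fn adjust_timestamp(raw_ns: u64, clock: PerfClock, tns: &TimeNsOffsets) -> u64 {
    match clock {
        PerfClock::Monotonic => (raw_ns as i64 + tns.monotonic_offset) as u64,
        PerfClock::Boottime => (raw_ns as i64 + tns.boottime_offset) as u64,
        PerfClock::MonotonicRaw => raw_ns, // immune to timens offsets
    }
}
```

For an init-namespace reader both offsets are zero, so all three arms return the raw timestamp unchanged.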

20.8.14.3 Namespace-Scoped Privilege Checks

perf_event_open() privilege checks must be namespace-aware to prevent containers from monitoring host processes:

perf_event_open(attr, pid, cpu, group_fd, flags):
  ...
  // Privilege check: use ns_capable() scoped to the caller's user namespace.
  if attr.event_type requires kernel access:
      if !ns_capable(current_cred().user_ns, CAP_PERFMON)
         && !ns_capable(current_cred().user_ns, CAP_SYS_ADMIN):
          return EACCES

  // PID targeting: resolve pid through caller's PID namespace.
  if pid > 0:
      target = find_task_by_vpid(pid)  // resolves in caller's pid_ns
      if target is None:
          return ESRCH
      // Cross-namespace check: caller must be able to ptrace the target.
      if !ptrace_may_access(target, PTRACE_MODE_READ):
          return EACCES

  // CPU-wide monitoring: requires CAP_PERFMON in the root user namespace
  // (a container cannot monitor all tasks on a physical CPU).
  if pid == -1 && cpu >= 0:
      if !ns_capable(&init_user_ns, CAP_PERFMON):
          return EACCES
  ...

20.8.14.4 FMA Timestamp Namespace Adjustment

FMA health events (Section 20.1) include a timestamp_ns field. When FMA events are exposed to containerized monitoring tools via the umkafs FMA interface (/ukfs/health/events/), timestamps are adjusted by the reading process's time namespace offset (same adjustment as perf timestamps above). FMA events read from the init time namespace use raw kernel timestamps with no adjustment.

Self-observability: The PMU subsystem emits stable tracepoints (Section 20.2) for its own overhead measurement: umka_stable:perf_sampler_wakeup (per sampler kthread wakeup, fields: cpu, samples_processed, duration_ns), umka_stable:perf_overflow_drop (per dropped sample, fields: cpu, queue_depth), and umka_stable:perf_multiplex_rotate (per rotation, fields: cpu, events_rotated). These allow administrators to monitor the profiling subsystem's own CPU consumption and sample loss rate.

20.8.15 Perf Event Exit Cleanup

When a task exits, all perf events owned by that task must be stopped and their hardware PMU counters released. This cleanup runs in do_exit() BEFORE close_files() (Step 5 in Section 8.2) because perf events may hold references to the task's mm (for userspace sampling buffers via mmap()) and to file descriptors (the perf event fd itself). If close_files() ran first, it would attempt to close perf event fds, triggering perf_event_release() while the task's PMU context is still active — potentially racing with a concurrent context switch that programs the dying task's counters.

Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules. &self methods use interior mutability for mutation. Atomic fields use .store()/.load(). See CLAUDE.md Spec Pseudocode Quality Gates.

Ordering constraint: perf_event_exit_task() MUST execute after PF_EXITING is set (Step 1) and before both mm teardown (Step 4) and close_files() (Step 5). It is placed at Step 3d in the do_exit() sequence — per-thread cleanup that every exiting thread performs. The perf event fd entries still exist in the shared fd table and are closed later in Step 5 (last-thread only).

/// Called from do_exit() to clean up all perf events owned by the dying task.
///
/// Stops all active hardware PMU counters, marks events as dead, and releases
/// hardware counter reservations. Events whose fds are held by external
/// processes (via `pidfd_getfd` or `SCM_RIGHTS`) become orphaned — they
/// remain accessible for `read()` (returning the final count) but no longer
/// collect samples.
///
/// # Preconditions
/// - `task.flags` has `PF_EXITING` set (Step 1 of do_exit).
/// - The task IS still on the run queue and CAN be preempted. `do_exit()`
///   does NOT disable preemption. `PF_EXITING` prevents new perf events
///   from being attached to this task (perf_event_open checks PF_EXITING),
///   but the task may still be preempted during cleanup. The cleanup code
///   must be preemption-safe.
///
/// # Postconditions
/// - All hardware PMU counters previously reserved by this task are freed.
/// - `task.perf_events` is empty.
/// - Each event's state is `PerfEventState::Dead`.
fn perf_event_exit_task(task: &Task) {
    // Step 1: Detach the task's perf overlay from its CPU context.
    // After this, context_switch perf_schedule_in/out will not attempt
    // to program this task's events.
    if let Some(task_ctx) = task.perf_ctx.take() {
        let cpu_ctx = &CpuLocal::get().perf_ctx;
        // If the task_ctx is currently attached to this CPU's context,
        // detach it. The task is running do_exit on this CPU, so it is
        // the current task — its task_ctx is attached.
        let mut mutable = cpu_ctx.events_lock.lock();
        if mutable.task_ctx_matches(&task_ctx) {
            // Stop and remove all task-pinned events from PMU hardware.
            let task_events = task_ctx.task_events.lock();
            for event in task_events.iter() {
                if event.state.load(Acquire) == PerfEventState::Active as u32 {
                    // Read the final counter value before stopping.
                    event.pmu.event_stop(event, PERF_EF_UPDATE);
                }
                event.pmu.event_del(event, 0);
            }
            drop(task_events);
            mutable.detach_task_ctx();
        }
        drop(mutable);
    }

    // Step 2: Walk the task's perf_events list and transition all events
    // to Dead state. This covers both task-pinned events (handled above)
    // and events that target this task but are CPU-wide or in multiplexing
    // pending queues.
    //
    // Popping each entry off the intrusive list yields ownership of the
    // PerfEvent's intrusive link. The Arc<PerfEvent> itself may survive
    // if an external fd reference exists.
    while let Some(event) = task.perf_events.pop_front() {
        let prev_state = event.state.load(Acquire);

        // Stop the event if it is still active on any CPU's PMU hardware.
        // This handles system-wide events that were opened with pid=this_task
        // but are pinned to a specific CPU — those were not detached in Step 1.
        if prev_state == PerfEventState::Active as u32
            || prev_state == PerfEventState::Inactive as u32
        {
            // IPI the target CPU to stop the event if it is currently on
            // a remote CPU's PMU. For the local CPU, stop directly.
            if event.cpu >= 0 && event.cpu as u32 != smp_processor_id() {
                smp_call_function_single(event.cpu as u32, || {
                    let cpu_ctx = &CpuLocal::get().perf_ctx;
                    let mut mutable = cpu_ctx.events_lock.lock();
                    mutable.remove_event(&event);
                    event.pmu.event_stop(&event, PERF_EF_UPDATE);
                    event.pmu.event_del(&event, 0);
                });
            } else {
                // Local CPU (or event.cpu < 0): stop directly, no IPI needed.
                let cpu_ctx = &CpuLocal::get().perf_ctx;
                let mut mutable = cpu_ctx.events_lock.lock();
                mutable.remove_event(&event);
                event.pmu.event_stop(&event, PERF_EF_UPDATE);
                event.pmu.event_del(&event, 0);
            }
        }

        // Release the hardware counter reservation. The PMU driver's
        // per-CPU allocation tracking is updated to free the slot.
        if event.hw_counter_idx >= 0 {
            event.pmu.free_counter(event.hw_counter_idx);
        }

        // Mark the ring buffer as orphaned — no more samples will be
        // written. Existing data remains readable by userspace mmap.
        if let Some(ref ring_buf) = event.ring_buffer {
            ring_buf.mark_orphaned();
        }

        // Transition to Dead. This is irreversible. If the event's fd is
        // held by another process, read() will return the final count
        // (already captured by event_stop + PERF_EF_UPDATE above).
        event.state.store(PerfEventState::Dead as u32, Release);

        // The Arc<PerfEvent> is dropped here unless external fd references
        // (via pidfd_getfd / SCM_RIGHTS) keep it alive. When the last
        // reference drops, perf_event_release() frees the event struct.
    }
}

Inherited events: If the dying task has events with PerfEventFlags::INHERIT set, child tasks that were created via fork() after the event was opened have their own independent copies of the event (created by perf_event_init_task() during copy_process()). These child copies are NOT affected by the parent's perf_event_exit_task() — they are owned by the child task and will be cleaned up when the child exits. The parent's event accumulates the child's counts via the INHERIT_STAT mechanism only if PerfEventFlags::INHERIT_STAT is also set.

Multiplexing cleanup: Events in the pending_events queue (waiting for a hardware slot due to multiplexing) are also drained during exit. These events were never programmed into hardware, so no event_stop is needed — they transition directly from Inactive to Dead.

PMU implementation stages (note: these 5 stages all fall within roadmap Phase 4 from Section 24.2. They are NOT the kernel-wide phases from Chapter 24 — they describe the internal build-out ordering of the PMU subsystem within a single roadmap phase):

  • PMU Stage 1: Core infrastructure — PerfEvent, PerfEventContext, PmuOps trait, PERF_TYPE_SOFTWARE events, ring buffer mmap protocol, basic ioctl set (ENABLE/DISABLE/RESET/ID), perf_event_paranoid enforcement, Intel Core PMU driver.
  • PMU Stage 2: Sampling — NMI handler (x86-64), per-CPU sampler kthread, callchain unwinding (frame-pointer), PERF_TYPE_HARDWARE for Intel and AMD, PERF_RECORD_SAMPLE ring buffer records, PERF_RECORD_LOST.
  • PMU Stage 3: Tracepoint events (PERF_TYPE_TRACEPOINT), hardware cache events (PERF_TYPE_HW_CACHE), ARM PMUv3 driver, RISC-V SBI PMU driver, PPC64LE POWER PMU driver, LoongArch64 PMU driver (CSR-based performance counters: CPUCFG.0x6 capability query, CSR.PERFCTRL[0-3] counter control registers, CSR.PERFCNTR[0-3] counter value registers; up to 4 programmable counters on typical Loongson 3A5000/3A6000 implementations).
  • PMU Stage 4: Advanced features — Intel PEBS precise-IP, LBR branch stack (PERF_SAMPLE_BRANCH_STACK), Intel PT AUX buffer, ARM SPE, PERF_TYPE_BREAKPOINT, eBPF integration (PERF_EVENT_IOC_SET_BPF), PERF_RECORD_MMAP2.
  • PMU Stage 5: Multiplexing deterministic rotation, cgroup-scoped events (PERF_FLAG_PID_CGROUP), perf_event_attr.inherit across fork, uncore PMU drivers (memory controller, PCIe root complex), dynamic sample rate throttling.


20.9 Kernel Parameter Store (Typed Sysctl)

Replaces /proc/sys/ sysctl text files with a typed, schema-bearing parameter store exposed through umkafs. All existing /proc/sys/ paths continue to work unchanged as a compatibility shim.

20.9.1 Problem with /proc/sys/ Sysctl

Every Linux kernel tunable is a plain text file. This creates several operational problems:

  • No type information at the interface. Writing "abc" to a numeric sysctl returns EINVAL, but only at write time. There is no machine-readable way to discover that the file expects an integer rather than a string.
  • No range information. A monitoring agent cannot determine that net.core.somaxconn accepts values from 1 to 65535 without consulting out-of-tree documentation.
  • No enumeration schema. Tools that want to list all tunables with their types, defaults, and constraints must scrape kernel source or man pages — there is no structured registry.
  • No per-namespace scoping metadata. Some sysctls are per-network-namespace, others are host-global; this distinction is not queryable at the interface.

20.9.2 Design: Typed Parameter Descriptors

Each kernel tunable is described by a KernelParam struct registered at link time. The registration is zero-cost: the descriptor lives in a read-only ELF section (.umka_params), and the umkafs driver iterates that section to populate the namespace. No runtime registration call is needed.

// umka-core/src/params/mod.rs

/// A kernel tunable parameter descriptor. Registered at compile time via macro.
pub struct KernelParam {
    /// Canonical dot-separated name (e.g., "net.core.somaxconn").
    pub name: &'static str,
    /// Type and constraints.
    pub schema: ParamSchema,
    /// Human-readable one-line description.
    pub description: &'static str,
    /// Whether CAP_SYS_ADMIN is required to write this parameter.
    pub privileged: bool,
    /// Whether the parameter is scoped per-user-namespace (false = host-global only).
    pub per_namespace: bool,
    /// Read the current value, optionally scoped to a namespace.
    ///
    /// **Synchronization**: For scalar parameters (`U32`, `U64`, `I32`, `Bool`),
    /// the backing storage is an `AtomicU32`/`AtomicU64`/`AtomicBool`. The getter
    /// loads with `Ordering::Relaxed` (sufficient for scalar reads — no
    /// cross-field consistency is required). The setter stores with
    /// `Ordering::Release` so that a subsequent getter on another CPU sees the
    /// new value after any setup associated with the parameter change.
    ///
    /// For string and compound parameters (`Str`, `Enum`, `Flags`), the backing
    /// storage is protected by `RcuCell<T>`: the getter returns an RCU snapshot
    /// (lock-free read), and the setter replaces the cell via `rcu_assign_pointer`
    /// (writers serialized by a per-parameter `SpinLock`). Readers see either the
    /// old or new value, never a torn intermediate.
    ///
    /// For non-namespace parameters (`per_namespace == false`), the
    /// `NamespaceContext` argument is `&GLOBAL_NS_CTX` (the host-global
    /// sentinel). The getter ignores it and reads the global atomic.
    ///
    /// For namespace-scoped parameters (`per_namespace == true`, e.g.,
    /// `net.core.somaxconn`), the getter reads the per-namespace value
    /// from the `NetNamespace.params: XArray<ParamValue>` keyed by
    /// `self.name`. If no per-namespace override exists, falls back to
    /// the global default.
    pub getter: fn(ns: &NamespaceContext) -> ParamValue,

    /// Validate and apply a new value, optionally scoped to a namespace.
    ///
    /// For non-namespace parameters, `ns` is `&GLOBAL_NS_CTX`. For
    /// namespace-scoped parameters, `ns` identifies the target namespace.
    pub setter: fn(ns: &NamespaceContext, val: ParamValue) -> Result<(), ParamError>,
}

/// Type and constraints for a kernel parameter.
pub enum ParamSchema {
    U32  { min: u32, max: u32, default: u32 },
    U64  { min: u64, max: u64, default: u64 },
    I32  { min: i32, max: i32, default: i32 },
    Bool { default: bool },
    Str  { max_len: usize, default: &'static str },
    Enum { choices: &'static [&'static str], default: usize },
    /// Bitmask of named flags. Humans write `"flag1|flag2"`, kernel parses bitmask.
    Flags { known_flags: &'static [(&'static str, u64)], default: u64 },
}

/// A bounded-length string for kernel parameter values.
/// Stack-allocated with a compile-time maximum length to avoid heap
/// allocation on the sysctl read/write path (warm path).
///
/// Semantically equivalent to `ArrayString<MAX_PARAM_STR_LEN>` but with
/// explicit length tracking for C ABI compatibility (NUL-terminated
/// buffer + separate length field).
pub struct BoundedStr {
    /// NUL-terminated UTF-8 buffer. Only `len` bytes are valid (excluding NUL).
    buf: [u8; MAX_PARAM_STR_LEN + 1],
    /// Number of valid bytes (not counting the trailing NUL).
    len: usize,
}

/// Maximum length of a string parameter value.
/// Fixed at 256 bytes, which comfortably covers existing Linux sysctl string values.
pub const MAX_PARAM_STR_LEN: usize = 256;

/// A typed parameter value, as read from or written to the parameter store.
pub enum ParamValue {
    U32(u32),
    U64(u64),
    I32(i32),
    Bool(bool),
    Str(BoundedStr),
    Enum(usize),
    Flags(u64),
}

/// Error returned when a write fails schema validation.
pub enum ParamError {
    /// Value is outside the declared `min`/`max` range.
    /// Uses `i128` to represent both `u64` and `i32` ranges without truncation.
    /// (`i64` would silently wrap `u64::MAX` to `-1` in error messages.)
    OutOfRange { min: i128, max: i128 },
    /// String is not one of the declared enum choices.
    InvalidChoice { choices: &'static [&'static str] },
    /// Value tag does not match the parameter's declared type.
    TypeMismatch,
    /// Write rejected by the setter (e.g., value conflicts with hardware state).
    SetterRejected,
}
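A userspace-runnable sketch of `BoundedStr` construction, with the assumption (consistent with the `Str` arm of `param_write` below) that over-long values are rejected rather than truncated:

```rust
const MAX_PARAM_STR_LEN: usize = 256;

/// Stack-allocated bounded string, NUL-terminated for C ABI consumers.
pub struct BoundedStr {
    buf: [u8; MAX_PARAM_STR_LEN + 1],
    len: usize,
}

impl BoundedStr {
    /// Copy `s` into the bounded buffer, rejecting values longer than
    /// MAX_PARAM_STR_LEN. The trailing NUL byte is always present.
    pub fn from_str(s: &str) -> Result<Self, ()> {
        let bytes = s.as_bytes();
        if bytes.len() > MAX_PARAM_STR_LEN {
            return Err(());
        }
        let mut buf = [0u8; MAX_PARAM_STR_LEN + 1];
        buf[..bytes.len()].copy_from_slice(bytes);
        Ok(BoundedStr { buf, len: bytes.len() })
    }

    pub fn len(&self) -> usize {
        self.len
    }

    pub fn as_str(&self) -> &str {
        // Only `len` bytes are valid; they were copied from a &str,
        // so UTF-8 validity holds.
        std::str::from_utf8(&self.buf[..self.len]).unwrap()
    }

    /// The byte immediately after the payload, for NUL-termination checks.
    pub fn terminator(&self) -> u8 {
        self.buf[self.len]
    }
}
```

The explicit `len` field lets the kernel hand the buffer to C code that expects `(char *, size_t)` pairs without a strlen() pass.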

20.9.3 umkafs Layout

Each parameter appears as a directory under /ukfs/kernel/params/. The directory name is the canonical dot-separated parameter name. Within the directory, fixed-name metadata files provide the schema and current value:

/ukfs/kernel/params/<name>/
    value         rw   Current value. Text on read; text or binary on write.
                       **Atomicity**: A single read() that fits within one page (4 KiB)
                       returns a consistent snapshot — the getter is called once and
                       the result is formatted into the page buffer. Concurrent setter
                       calls may race, but the reader sees either the old or new value
                       (never a torn mix), because scalar getters use atomic loads and
                       compound getters use RCU snapshots (see KernelParam::getter doc).
    default       ro   Factory default value (text).
    schema        ro   JSON object: type, constraints, default.
    description   ro   One-line human-readable description.
    privileged    ro   "1" if CAP_SYS_ADMIN is required to write, "0" otherwise.
    namespace     ro   "per-ns" or "global".

The top-level directory also exposes a single aggregate file for bulk enumeration:

/ukfs/kernel/params/.schema.json   ro   JSON array of all parameter schemas.

Example schema file for net.core.somaxconn:

{"type":"u32","min":1,"max":65535,"default":128,"privileged":true,"namespace":"per-ns"}

20.9.4 Read and Write Protocol

Text reads (for shell scripts and human inspection):

cat /ukfs/kernel/params/net.core.somaxconn/value
128

Binary reads (for programmatic, type-safe access — avoids string parsing):

/* Binary wire format: 1-byte tag followed by the typed value. */
struct ParamValue {
    uint8_t  tag;      /* 1=U32, 2=U64, 3=I32, 4=Bool, 5=Str, 6=Enum, 7=Flags */
    union {
        uint32_t u32_val;
        uint64_t u64_val;
        int32_t  i32_val;
        uint8_t  bool_val;
        uint64_t enum_idx;
        uint64_t flags_val;
        char     str_val[256];
    };
};

int fd = open("/ukfs/kernel/params/net.core.somaxconn/value", O_RDONLY);
struct ParamValue pv;
read(fd, &pv, sizeof(pv));
/* pv.tag == 1 (U32); pv.u32_val == 128 */

Text writes (validated against the schema; EINVAL on type or range error):

echo 4096 > /ukfs/kernel/params/net.core.somaxconn/value

Binary writes (exact typed value; no text parsing):

struct ParamValue pv = { .tag = 1 /* U32 */, .u32_val = 4096 };
write(fd, &pv, sizeof(pv));

Both text and binary writes pass through the same schema validation path before the setter is called.
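A decoder for the scalar cases of this wire format can be sketched as follows. Two details are assumptions, not fixed by the spec text: natural C struct layout (the union starts at offset 8 because of `uint64_t` alignment) and little-endian byte order:

```rust
/// Decoded scalar parameter value (Str/Enum/Flags omitted in this sketch).
#[derive(Debug, PartialEq)]
pub enum WireValue {
    U32(u32),
    U64(u64),
    I32(i32),
    Bool(bool),
}

/// Decode the 1-byte tag + union payload, assuming the union begins at
/// offset 8 (natural alignment) and values are little-endian.
pub fn decode(buf: &[u8]) -> Option<WireValue> {
    let tag = *buf.first()?;
    let payload: [u8; 8] = buf.get(8..16)?.try_into().ok()?;
    let le8 = u64::from_le_bytes(payload);
    match tag {
        1 => Some(WireValue::U32(le8 as u32)),
        2 => Some(WireValue::U64(le8)),
        3 => Some(WireValue::I32(le8 as u32 as i32)),
        4 => Some(WireValue::Bool(le8 as u8 != 0)),
        _ => None,
    }
}
```

A reader that wants layout independence would instead serialize each field explicitly rather than relying on the C union image.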

20.9.5 Schema Validation on Write

// umka-core/src/params/write.rs

/// Validate `raw` bytes against the parameter schema and, if valid, call the setter.
///
/// # Errors
///
/// Returns `ParamError` if the value fails type or range validation, or if the
/// setter rejects the value.
pub fn param_write(param: &KernelParam, ns: &NamespaceContext, raw: &[u8]) -> Result<(), ParamError> {
    let value = parse_text_or_binary(raw)?;
    match (&param.schema, &value) {
        (ParamSchema::U32 { min, max, .. }, ParamValue::U32(v)) => {
            if v < min || v > max {
                return Err(ParamError::OutOfRange {
                    min: i128::from(*min),
                    max: i128::from(*max),
                });
            }
        }
        (ParamSchema::U64 { min, max, .. }, ParamValue::U64(v)) => {
            if v < min || v > max {
                return Err(ParamError::OutOfRange {
                    min: i128::from(*min),
                    max: i128::from(*max),
                });
            }
        }
        (ParamSchema::I32 { min, max, .. }, ParamValue::I32(v)) => {
            if v < min || v > max {
                return Err(ParamError::OutOfRange {
                    min: i128::from(*min),
                    max: i128::from(*max),
                });
            }
        }
        (ParamSchema::Enum { choices, .. }, ParamValue::Enum(idx)) => {
            if *idx >= choices.len() {
                return Err(ParamError::InvalidChoice { choices });
            }
        }
        (ParamSchema::Flags { known_flags, .. }, ParamValue::Flags(v)) => {
            let all_known: u64 = known_flags.iter().map(|(_, bit)| bit).fold(0, |a, b| a | b);
            if v & !all_known != 0 {
                return Err(ParamError::OutOfRange { min: 0, max: i128::from(all_known) });
            }
        }
        (ParamSchema::Bool { .. }, ParamValue::Bool(_)) => {
            // No range validation needed for booleans.
        }
        (ParamSchema::Str { max_len, .. }, ParamValue::Str(ref s)) => {
            if s.len() > *max_len {
                return Err(ParamError::OutOfRange { min: 0, max: *max_len as i128 });
            }
        }
        _ => return Err(ParamError::TypeMismatch),
    }
    (param.setter)(ns, value)
}

20.9.6 Registration Macro

Kernel subsystems register parameters using a declarative macro. The macro emits a KernelParam descriptor into the .umka_params ELF section; the linker script groups all descriptors into a contiguous array bounded by __umka_params_start and __umka_params_end. The umkafs driver walks that range at mount time to populate /ukfs/kernel/params/. There is no runtime list insertion and no dynamic allocation during registration.

// Example registration in umka-net/src/core/socket.rs

kernel_param! {
    name:          "net.core.somaxconn",
    schema:        ParamSchema::U32 { min: 1, max: 65535, default: 128 },
    description:   "Maximum backlog for listen() sockets.",
    privileged:    true,
    per_namespace: true,
    // Simplified: a full implementation consults the per-namespace
    // override in `ns` first (Section 20.9.8); this example reads and
    // writes the global default only.
    getter:        |_ns| ParamValue::U32(NET_SOMAXCONN.load(Ordering::Relaxed)),
    setter:        |_ns, v| {
        if let ParamValue::U32(n) = v {
            NET_SOMAXCONN.store(n, Ordering::Release);
            Ok(())
        } else {
            Err(ParamError::TypeMismatch)
        }
    },
}

20.9.7 eBPF Access

eBPF programs can read (but not write) kernel parameters via the bpf_param_read(name, value_out) helper. The eBPF verifier resolves the name string at JIT time, validates that the referenced parameter exists and is readable, and replaces the helper call with a direct memory load of the parameter's current value. There is no runtime string parsing in the hot path.

Write access is intentionally excluded from eBPF: parameter writes require CAP_SYS_ADMIN enforcement that the verifier cannot safely model for all program attachment points.

Allowed program types: bpf_param_read is available in BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_CGROUP_SYSCTL, BPF_PROG_TYPE_SCHED_CLS, and BPF_PROG_TYPE_CGROUP_SKB programs. It is not available in BPF_PROG_TYPE_XDP (no kernel parameter access in the XDP fast path) or BPF_PROG_TYPE_SOCKET_FILTER.

Caching: The verifier marks the result as idempotent within a single BPF program invocation — the JIT may cache the loaded value in a register across the program body. If the parameter changes between invocations, the next program run sees the new value.

Non-existent parameters: If the name string does not match any registered parameter, the verifier rejects the program at load time with ENOENT. This is a static check — there is no runtime "parameter not found" path.

20.9.8 Per-Namespace Parameter Scoping

Parameters with per_namespace: true maintain separate values for each user namespace (and, for net.* parameters, each network namespace). The scoping rules are:

Prefix Scope Namespace Type
net.* Per-network-namespace Network namespace (Section 17.1)
user.* Per-user-namespace User namespace
kernel.* Host-global None (init namespace only)
vm.* Host-global None
fs.* Host-global None
dev.* Host-global None

Inheritance: When a new namespace is created (clone(CLONE_NEWNET)), its parameters are initialized to the parent namespace's current values (copy-on-write semantics). Subsequent writes in the child namespace do not affect the parent.
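The inheritance semantics can be modeled with a small sketch (hypothetical types; an eager copy stands in for the kernel's copy-on-write optimization, which is behaviorally equivalent):

```rust
use std::collections::HashMap;

/// Per-namespace parameter values (stand-in for NetNamespace.params).
#[derive(Clone)]
pub struct NsParams {
    values: HashMap<&'static str, u32>,
}

impl NsParams {
    pub fn new() -> NsParams {
        NsParams { values: HashMap::new() }
    }

    /// clone(CLONE_NEWNET): the child starts from the parent's current
    /// values. (Eager copy here; COW defers the copy until first write.)
    pub fn inherit(parent: &NsParams) -> NsParams {
        parent.clone()
    }

    /// Writes affect only this namespace, never the parent.
    pub fn set(&mut self, name: &'static str, v: u32) {
        self.values.insert(name, v);
    }

    /// Reads fall back to the global default if the value was never set.
    pub fn get(&self, name: &str, global_default: u32) -> u32 {
        *self.values.get(name).unwrap_or(&global_default)
    }
}
```

A child namespace sees the parent's value at creation time; a later write in the child diverges without affecting the parent.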

Storage: Per-namespace values are stored in the namespace struct itself — each NetNamespace contains a params: XArray<ParamValue> keyed by parameter index (the offset of the KernelParam in the .umka_params section). The getter/setter function pointers receive a NamespaceContext argument that resolves to the calling task's active namespace:

/// Namespace context passed to namespace-scoped parameter getters/setters.
/// Resolves to the calling task's active namespace for the parameter's
/// namespace type (network, user, etc.).
///
/// Lifetime: valid only for the duration of the getter/setter call. The
/// context borrows the namespace via RCU read lock — callers must not
/// store the pointer beyond the function call.
// kernel-internal, not KABI — parameter store internal context, not exposed to userspace.
#[repr(C)]
pub struct NamespaceContext {
    /// Discriminant indicating which namespace type this context refers to.
    pub ns_type: NamespaceType,
    /// Opaque pointer to the actual namespace struct (e.g., `*const NetNamespace`
    /// for `net.*` parameters, `*const UserNamespace` for `user.*` parameters).
    /// Cast to the concrete type based on `ns_type`.
    pub ns_ptr: *const (),
}

/// Namespace type discriminant for `NamespaceContext`.
#[repr(u8)]
pub enum NamespaceType {
    /// Network namespace (`net.*` parameters).
    Net = 0,
    /// User namespace (`user.*` parameters).
    User = 1,
}

// NOTE: The `ParamGetterNs` / `ParamSetterNs` type aliases that were
// previously defined here are no longer needed. The unified
// `KernelParam.getter: fn(&NamespaceContext) -> ParamValue` and
// `KernelParam.setter: fn(&NamespaceContext, ParamValue) -> Result<...>`
// signatures handle both host-global and namespace-scoped parameters.
// For host-global parameters, callers pass `&GLOBAL_NS_CTX`.

Visibility: A process in a non-init namespace reading /proc/sys/net/... sees its own namespace's values. Writing requires CAP_SYS_ADMIN within that namespace (not the init namespace — namespaced capabilities are sufficient).

20.9.9 Inter-Parameter Validation

Some parameters have dependency relationships — changing one may invalidate another. The kernel enforces these constraints via optional validator callbacks:

/// Optional cross-parameter validator. Called after schema validation
/// passes but before the setter is invoked. Allows checking invariants
/// that span multiple parameters.
///
/// Returns `Ok(())` if the new value is consistent with the current
/// state of all related parameters. Returns `Err(ParamError::SetterRejected)`
/// with a diagnostic string if the invariant would be violated.
pub struct KernelParamGroup {
    /// Parameters in this group that must satisfy a joint invariant.
    pub members: &'static [&'static str],
    /// Validator called when any member is about to change.
    /// `changing` is the name of the parameter being modified.
    /// `new_value` is the proposed new value.
    /// The validator reads current values of other members via their getters.
    pub validate: fn(changing: &str, new_value: &ParamValue) -> Result<(), ParamError>,
}

Registered groups (exhaustive list of cross-parameter constraints):

Group Members Invariant
vm.dirty vm.dirty_ratio, vm.dirty_background_ratio dirty_background_ratio < dirty_ratio
vm.dirty_bytes vm.dirty_bytes, vm.dirty_background_bytes dirty_background_bytes < dirty_bytes (when bytes mode active)
net.tcp_mem net.ipv4.tcp_mem (3-tuple: low, pressure, high) low < pressure < high
net.tcp_rmem net.ipv4.tcp_rmem (3-tuple: min, default, max) min ≤ default ≤ max
net.tcp_wmem net.ipv4.tcp_wmem (3-tuple) min ≤ default ≤ max

Groups are registered via a kernel_param_group! macro that emits descriptors into a .umka_param_groups ELF section. The param_write() path checks group membership and runs the validator before calling the setter.
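A validator for the vm.dirty group might look like the following sketch (hypothetical helper; the real callback returns `ParamError::SetterRejected`, simplified here to a diagnostic string, and reads current member values via their getters rather than taking them as arguments):

```rust
/// Joint invariant for the vm.dirty group: dirty_background_ratio must
/// stay strictly below dirty_ratio, whichever member is being changed.
pub fn validate_vm_dirty(
    changing: &str,
    new_value: u32,
    current_dirty_ratio: u32,
    current_background_ratio: u32,
) -> Result<(), &'static str> {
    // Substitute the proposed value for the member being modified.
    let (dirty, background) = match changing {
        "vm.dirty_ratio" => (new_value, current_background_ratio),
        "vm.dirty_background_ratio" => (current_dirty_ratio, new_value),
        _ => return Err("not a member of the vm.dirty group"),
    };
    if background < dirty {
        Ok(())
    } else {
        Err("dirty_background_ratio must be < dirty_ratio")
    }
}
```

Note the invariant is checked against the *proposed* value of the changing member and the *current* values of the others, so the check is the same regardless of write order.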

20.9.10 Boot Parameter Registry

Boot parameters (umka.*) are parsed from the kernel command line at early boot and stored in a centralized registry. Unlike sysctl parameters (which are runtime-tunable), boot parameters are immutable after boot — they configure one-time initialization decisions.

/// Error returned when a boot parameter fails to parse from the kernel
/// command line. Distinct from `ParamError` (runtime sysctl validation)
/// because boot parameters are parsed once at early boot when the error
/// reporting path differs (serial console only, no sysfs).
pub enum BootParamError {
    /// The value string could not be parsed as the expected type.
    /// E.g., "abc" passed for a U32 parameter.
    ParseFailed,
    /// The parsed value is outside the declared `min`/`max` range.
    OutOfRange,
    /// The parsed value is not one of the declared enum choices.
    InvalidChoice,
    /// The parameter was specified but no value was provided
    /// (e.g., `umka.isolation` without `=auto`).
    MissingValue,
}

/// A boot parameter descriptor. Registered at compile time via macro.
pub struct BootParam {
    /// Canonical dotted name (e.g., "umka.isolation", "umka.pcie_aspm").
    pub name: &'static str,
    /// Expected value type.
    pub schema: ParamSchema,
    /// Human-readable description.
    pub description: &'static str,
    /// Parser: converts the command-line string value to typed storage.
    /// On `Err`, the kernel logs a warning to the early console and uses
    /// the default value defined by the parameter's `ParamSchema`.
    /// **Backing storage pattern**: The parse function writes the parsed value
    /// to a module-level `static` variable (typically `AtomicU32`/`AtomicBool`).
    /// The same static is read by the corresponding `KernelParam::getter`.
    /// This two-endpoint pattern (BootParam::parse writes, KernelParam::getter
    /// reads) ensures boot parameters are accessible via both the command line
    /// and the runtime sysctl/ukfs interfaces.
    pub parse: fn(&str) -> Result<(), BootParamError>,
}

Known boot parameters (centralized reference):

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| umka.isolation | Enum(auto,tier0,tier1,tier2) | auto | Default driver isolation tier |
| umka.pcie_aspm | Enum(default,performance,powersave,powersupersave) | default | PCIe ASPM policy |
| umka.dma_default_policy | Enum(iommu,identity,swiotlb) | iommu | DMA mapping strategy |
| umka.blacklist | Str (comma-separated) | "" | Driver blacklist (Section 11.4) |
| umka.block_device | Str (comma-separated BDFs) | "" | Device probe blocklist |
| umka.deferred_probe_timeout_ms | U32 | 30000 | Deferred probe timeout |
| umka.verify | Enum(strict,warn,off) | strict | Driver signature verification mode |
| umka.module_sig | Enum(enforce,warn,off) | enforce | Module signing enforcement |
| umka.ima | Enum(enforce,log,off) | enforce | IMA measurement policy |
| umka.panic | Enum(halt,reboot) | reboot | Action on kernel panic |
| umka.crashkernel | Str | "" | Crash kernel reservation (e.g., 256M) |
| umka.coredump_compress | Enum(none,lz4,zstd) | zstd | Core dump compression |
| umka.watchdog.nowayout | Bool | true | Watchdog nowayout mode |
| umka.thunderbolt_auth_timeout_s | U32 | 10 | Thunderbolt device auth timeout |
| umka.fault_inject | Str | "" | Fault injection (debug builds only) |
| umka.tier1_aarch64 | Enum(poe,pagetable,auto) | auto | AArch64 Tier 1 isolation method |
| umka.driver.&lt;name&gt;.&lt;key&gt; | varies | per-driver | Per-driver probe config (Section 11.4) |

umkafs exposure: Boot parameters are readable (not writable) at /ukfs/kernel/boot_params/<name>. The aggregate JSON is available at /ukfs/kernel/boot_params/.schema.json.

Registration: Same ELF-section mechanism as KernelParam, but in a separate .umka_boot_params section. The early boot parser iterates this section to match command-line tokens.

20.9.11 Bulk Enumeration

Orchestration tools and monitoring agents can enumerate all parameters without iterating individual directories:

ls /ukfs/kernel/params/
fs.file-max/
kernel.perf_event_paranoid/
net.core.somaxconn/
net.ipv4.tcp_rmem/
vm.swappiness/
...

# Machine-readable schema for all parameters in one read:
cat /ukfs/kernel/params/.schema.json
[
  {"name":"net.core.somaxconn","type":"u32","min":1,"max":65535,
   "default":128,"privileged":true,"namespace":"per-ns",
   "description":"Maximum backlog for listen() sockets."},
  {"name":"vm.swappiness","type":"u32","min":0,"max":200,
   "default":60,"privileged":true,"namespace":"global",
   "description":"Tendency of the kernel to swap anonymous memory."},
  ...
]

The .schema.json file is generated on-the-fly by the umkafs driver by iterating the .umka_params section; it is not stored anywhere on disk.

20.9.12 Linux Compatibility

/proc/sys/ text files continue to work identically. Each existing sysctl path is implemented as a read-write passthrough in the SysAPI layer that delegates to the corresponding KernelParam getter and setter. Shell scripts, Ansible playbooks, and container runtimes that write to /proc/sys/net/core/somaxconn or similar paths observe unchanged behavior.

The sysctl(8) tool works without modification. It reads and writes /proc/sys/ paths, which pass through the compat shim.

sysctl(2) syscall (deprecated since Linux 5.5, retained for binary compatibility) is handled by umka-sysapi. The syscall translates the legacy integer key array into a dot-separated parameter name, then dispatches to the same getter/setter as the umkafs and /proc/sys/ paths.

/ukfs/kernel/params/ is an UmkaOS addition. No Linux kernel or userspace tool uses this path; it is purely additive.

Implementation phases:

- Phase 1: Core infrastructure — KernelParam, ParamSchema, ParamValue, kernel_param! macro, .umka_params ELF section, umkafs driver population. Migrate all parameters currently exported via /proc/sys/ to typed descriptors.
- Phase 2: /proc/sys/ compat shim backed by KernelParam getters/setters. sysctl(2) syscall dispatch. Text and binary read/write protocol on the value file.
- Phase 3: .schema.json aggregate file. eBPF bpf_param_read helper and verifier support. Per-namespace parameter scoping enforcement.
- Phase 4: sysctl(8) extended output mode showing type and range annotations (opt-in flag; default output unchanged for compat).

Cross-references: Section 20.5, Section 19.1, Section 1.1.